Convert .pdf to text
12/7/2022

Following the theme of my last post, I’m going to use another PDF focused on Indonesia’s current energy situation: the Indonesia Energy Outlook 2019 Report published by the Secretariat General of the National Energy Council. If you open the link to the PDF you will find a long report with many pages and figures. For the purpose of this post, I am only going to focus on extracting the text from the Executive Summary on pages xii and xiii.

With the PDF and text identified, let’s move on to using Python to extract the Executive Summary.

Note: The following code explanation is designed for the Google Colab environment.

The library we will use to extract the PDF text is called PyPDF2. PyPDF2 can do much more than just extract text and if you are curious about its other capabilities, you can read about them here. This library isn’t pre-installed in the Google Colab environment, so we will have to install it before importing PyPDF2 into our code.

    !pip install PyPDF2
    import PyPDF2

Before we move to the next step, make sure you have loaded the PDF document into the file repository on the left of the Colab environment. We must open the PDF as a file object before we can start using PyPDF2 on it.

    pdf_file_obj = open("/content/content-indonesia-energy-outlook-2019-english-version.pdf", "rb")

Now we need to convert pdf_file_obj into a PyPDF2 object so that we can use the library to search through the Indonesia Energy Outlook to extract our text of interest.

    # Converting the object into a PDF Reader Object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
    # If you want to find out the number of pages in the PDF use this command
    print(pdf_reader.numPages)

We know from looking at the original PDF that we are interested in pages 12 and 13, where the Executive Summary resides. We can pull out an individual page using the following method.

    # How to create a page object
    page_obj = pdf_reader.getPage(12)

If you print the page_obj you will get something quite unreadable to the human eye. Now let’s pull all the text from pages 12 and 13 and combine them to get the executive summary.

    # Getting Executive Summary
    page_obj1 = pdf_reader.getPage(12)
    page_obj2 = pdf_reader.getPage(13)
    executive_summary = page_obj1.extractText() + page_obj2.extractText()

You’ll notice that the text has many instances of “\n” within it when you print it out. We can remove these with a simple one-liner.
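The one-liner for removing the “\n” characters is not shown in this post, so here is a minimal sketch of one way it might look. The helper names remove_newlines and extract_pages are my own, not part of PyPDF2. The second helper also assumes a newer PyPDF2 (3.x), where PdfFileReader, getPage, extractText and numPages were removed in favour of PdfReader, pages, extract_text and len(reader.pages); on older versions the calls used in this post still apply.

```python
from typing import Iterable


def remove_newlines(text: str) -> str:
    """Replace newlines (and any run of whitespace) with a single space."""
    # str.split() with no arguments splits on all whitespace, including "\n".
    return " ".join(text.split())


def extract_pages(pdf_path: str, page_indexes: Iterable[int]) -> str:
    """Hypothetical helper: concatenate the text of the given pages.

    Page indexes are zero-based. Assumes the PyPDF2 3.x names
    (PdfReader, .pages, .extract_text()).
    """
    import PyPDF2  # imported here so remove_newlines works without PyPDF2

    with open(pdf_path, "rb") as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
        return "".join(pdf_reader.pages[i].extract_text() for i in page_indexes)


if __name__ == "__main__":
    # Small self-contained demonstration of the cleanup step.
    sample = "Executive\nSummary \n2019"
    print(remove_newlines(sample))
```

With the variables from this post, the cleanup step would be executive_summary = remove_newlines(executive_summary). Joining on a space rather than deleting “\n” outright keeps words from running together where a line break was the only separator.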