By Heather Dinsdale, John Chow | 2022 Dec 09
This tutorial explains how to extract text from a PDF using Python and the Apryse SDK for machine learning.
In this case, “extract text from a PDF” doesn’t mean just a paragraph or two from a single document — it means extracting text from possibly thousands of PDFs, using automation and batch processing.
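Batch processing at that scale usually comes down to a small driver loop around whatever per-file extraction you choose. The sketch below is one possible shape for it and is not part of the Apryse SDK: `extract_batch` is a hypothetical name, and `extract_text` stands in for any function that takes a PDF path and returns its text.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_batch(root: Path, extract_text: Callable[[Path], str]) -> Dict[Path, str]:
    """Apply extract_text to every .pdf file under root, continuing past failures.

    extract_text is a caller-supplied function (hypothetical here) that
    returns the text of a single PDF.
    """
    results: Dict[Path, str] = {}
    for pdf_path in sorted(root.rglob("*.pdf")):
        try:
            results[pdf_path] = extract_text(pdf_path)
        except Exception as exc:  # one bad file should not stop the batch
            print(f"skipped {pdf_path}: {exc}")
    return results
```

Passing the extractor in as a parameter keeps the directory walk independent of the SDK, so the same loop can be exercised with a stub before pointing it at real documents.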
In this tutorial, you will:
The tutorial provides a code sample for a very basic text extraction using a Python script with the Apryse SDK. We’ll also cover methods you can use to extract all text or even specific text in a PDF. Finally, this tutorial will touch on other data, such as metadata and images, which you can extract from a PDF using Python.
Beyond basic text extraction, you can use the Apryse Intelligent Document Processing (IDP) add-on, which includes a Data Extraction capability, to perform layout-aware PDF text extraction in Python. Apryse IDP recognizes a document's layout along with its content elements, such as tabular data, form fields, and text, and extracts them to structured JSON and Excel right out of the box. As a result, it gives organizations scalable, accurate PDF data extraction while eliminating the costs associated with extensive templating, rules, and data entry.
For more information, check out the Python PDF library documentation.
To start using Python and the Apryse SDK, you need the following:
Follow these steps to get started:
You can also visit the Python Get Started page or the Python PDF Content Extraction Library.
Now that your Python environment is set up and you’ve downloaded the Apryse SDK, let’s extract some text.
Run the following code sample for a very basic text extraction using a Python script with the Apryse SDK:
doc = PDFDoc(filename)
page = doc.GetPage(1)
txt = TextExtractor()
txt.Begin(page)  # Read the page
word = Word()
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        # word.GetString()
        word = word.GetNextWord()
    line = line.GetNextLine()
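The loop above visits every word but leaves the `word.GetString()` call commented out, so nothing is collected. As a minimal sketch, the same traversal can gather the words into a list of lines. The function name `collect_lines` is hypothetical; it relies only on the methods that already appear in the sample and assumes `Begin(page)` has been called on the extractor.

```python
def collect_lines(extractor):
    """Return the page text as a list of lines, each a list of word strings.

    `extractor` is assumed to be a TextExtractor on which Begin(page) has
    already been called; only methods shown in the sample loop are used.
    """
    lines = []
    line = extractor.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        lines.append(words)
        line = line.GetNextLine()
    return lines
```

Joining each inner list with spaces, and the lines with newlines, gives a plain-text rendering of the page.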
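Next, decide what to do with the extracted text. You can save it to a text file or to a database. The following function walks the page content with an ElementReader and prints each text run along with its bounding box; replace the print calls to send the text wherever you need it.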
def dumpAllText(reader):
    element = reader.Next()
    while element != None:
        type = element.GetType()
        if type == Element.e_text_begin:
            print("Text Block Begin")
        elif type == Element.e_text_end:
            print("Text Block End")
        elif type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", " + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            textString = element.GetTextString()
            print(textString)
        elif type == Element.e_text_new_line:
            print("New Line")
        elif type == Element.e_form:
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
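If the destination is a database rather than the console, one option is Python's built-in `sqlite3` module. The sketch below is an illustration only: the `page_text` table, its columns, and the `save_text` helper are hypothetical names, not anything defined by the Apryse SDK.

```python
import sqlite3


def save_text(conn: sqlite3.Connection, source: str, page: int, text: str) -> None:
    """Store one page of extracted text, replacing any earlier row for that page."""
    # Table and column names are illustrative assumptions.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text ("
        "source TEXT, page INTEGER, body TEXT, PRIMARY KEY (source, page))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO page_text (source, page, body) VALUES (?, ?, ?)",
        (source, page, text),
    )
    conn.commit()
```

Keying rows on the source file and page number means a batch can be re-run without creating duplicate rows.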
You can also use a utility method to extract all text content from a specific region, such as a rectangle on a PDF page. This is useful if you’re extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user (page) coordinate system.
def ReadTextFromRect(page, pos, reader):
    reader.Begin(page)
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str
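To make the region idea concrete, here is a small, SDK-independent sketch of the underlying test: keep a word only if its bounding box lies inside the target rectangle. It is not the `RectTextSearch` helper called above, whose implementation is not shown here. `Box`, `inside`, and `words_in_rect` are hypothetical names, and boxes are assumed to be normalized so that x1 <= x2 and y1 <= y2 in page coordinates.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (assumed normalized)."""
    x1: float
    y1: float
    x2: float
    y2: float


def inside(inner: Box, outer: Box) -> bool:
    """True when inner lies entirely within outer (edges may touch)."""
    return (outer.x1 <= inner.x1 and inner.x2 <= outer.x2
            and outer.y1 <= inner.y1 and inner.y2 <= outer.y2)


def words_in_rect(words: Iterable[Tuple[str, Box]], rect: Box) -> List[str]:
    """Return the text of every word whose box is fully inside rect."""
    return [text for text, box in words if inside(box, rect)]
```

A word that straddles the rectangle's edge is excluded under this rule; switching to an overlap test would include it, and which is preferable depends on how tightly the region is drawn around the field.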
To see a code sample for full text extraction, go to Read a PDF File Sample and under TextExtract, click Python. You can also download more code samples.
In addition to simple text, you can also extract data from a PDF using Python, including:
Note: The Apryse Intelligent Data Extraction component add-on is required to perform the following task.
Apryse IDP performs layout-aware text extraction right out of the box for any structured or semi-structured data in PDF, while offering different conversion formats for processing options. It reliably recognizes tables, accurately extracts text and tabular data, detects and understands articles of text in a document, and detects various types of form fields.
To use IDP, follow these steps.
PDFNetC\Lib folder within the SDK folder.
LicenseKey.py to add the demo license key.
Samples\DataExtractionTest\PYTHON within the Python SDK folder to view a sample Python script using the Data Extraction Module.
1. First import the Apryse SDK and add-ons from above.
import site
site.addsitedir("../../../PDFNetC/Lib")  # Path to the Lib folder
import sys
from PDFNetPython import *
import platform
sys.path.append("../../LicenseKey/PYTHON")  # Path to the LicenseKey location
from LicenseKey import *
2. Initialize the Apryse SDK.
PDFNet.Initialize(LicenseKey)
PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")  # Path to the Lib folder
3. Call the Data Extraction Suite Python function of choice.
JSON doc structure:
DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_DocStructure)
Excel table data:
Detect form field JSON:
DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_Form)
Variants of the method can output to a string, which can then be processed however the user needs.
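Once the output is available as a JSON string, the standard `json` module is enough to start processing it. The sketch below is schema-agnostic on purpose, because the exact structure of the IDP output is not described here: `iter_key` is a hypothetical helper that yields every value stored under a given key at any nesting depth, and the `sample` string is made-up data, not real module output.

```python
import json
from typing import Any, Iterator


def iter_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any depth of parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from iter_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_key(item, key)


# Made-up JSON for illustration; the real output structure may differ.
sample = '{"pages": [{"elements": [{"text": "Hello"}, {"text": "world"}]}]}'
print(list(iter_key(json.loads(sample), "text")))
```

After inspecting a real output file, the generic walk can be replaced with direct lookups on the keys you actually need.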
In this tutorial, you extracted data for machine learning with Python and the Apryse SDK. You then used the scripts to decide where to send extracted data.
You can also visit the Python documentation to see what else you can do with PDFs using Python, including:
If you have any questions or features you would like to see next, do not hesitate to reach out to us directly.