By Heather Dinsdale, John Chow | 2022 Dec 09
Tags
python
extract
This tutorial explains how to extract text from a PDF using Python and the Apryse SDK for machine learning.
In this case, “extract text from a PDF” doesn’t mean just a paragraph or two from a single document — it means extracting text from possibly thousands of PDFs, using automation and batch processing.
This tutorial provides a code sample for a very basic text extraction using a Python script with the Apryse SDK. It also covers methods you can use to extract all text, or only specific text, in a PDF. Finally, it touches on other data, such as metadata and images, which you can extract from a PDF using Python.
In addition to basic text extraction, you can use the Apryse Intelligent Document Processing (IDP) add-on, which includes a Data Extraction capability, to perform layout-aware PDF text extraction in Python. Apryse IDP recognizes a document's layout and extracts its content elements, such as tabular data, form fields, and text, to structured JSON and Excel right out of the box. As a result, it gives organizations scalability and leading accuracy in PDF data extraction, and it eliminates costs associated with extensive templating, rules, and data entry.
For more information, check out the Python PDF library documentation.
Learn about the latest release of Apryse IDP.
To start using Python and the Apryse SDK, you need a working Python environment and a download of the Apryse SDK. For setup steps, visit the Python Get Started page or the Python PDF Content Extraction Library.
Now that your Python environment is set up and you’ve downloaded the Apryse SDK, let’s extract some text.
Run the following code sample for a very basic text extraction using a Python script with the Apryse SDK:
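doc = PDFDoc(filename)  # filename: path to the input PDF
page = doc.GetPage(1)  # pages are numbered from 1

txt = TextExtractor()
txt.Begin(page)  # Read the page

# Walk every line on the page, then every word in each line
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        print(word.GetString())  # the text of the current word
        word = word.GetNextWord()
    line = line.GetNextLine()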
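If you would rather work with whole lines than individual words, the same loop can collect them into strings. The sketch below is one way to do that. `page_lines` is a hypothetical helper name, not part of the SDK, and it assumes it receives a TextExtractor on which Begin(page) has already been called.

```python
def page_lines(extractor):
    """Return the text of each line on the page as a list of strings.

    Assumption: `extractor` is a TextExtractor that has already been
    pointed at a page with Begin(page). Only the calls used in the
    loop above (GetFirstLine, GetFirstWord, IsValid, GetString,
    GetNextWord, GetNextLine) are relied on.
    """
    lines = []
    line = extractor.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        # Join the words of one line with single spaces
        lines.append(" ".join(words))
        line = line.GetNextLine()
    return lines
```

Returning a list keeps the extraction step separate from whatever you do with the text afterwards, such as printing it, saving it, or feeding it to another script.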
Next, decide what to do with the extracted text. You can save it to a text file or to a database. The following function reads a page's elements one by one and prints each piece of text it finds; replace the print calls to send the text wherever you need it.
def dumpAllText(reader):
    # Read elements one at a time until the reader runs out
    element = reader.Next()
    while element is not None:
        elem_type = element.GetType()
        if elem_type == Element.e_text_begin:
            print("Text Block Begin")
        elif elem_type == Element.e_text_end:
            print("Text Block End")
        elif elem_type == Element.e_text:
            # Print the bounding box of the text run, then the text itself
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            textString = element.GetTextString()
            print(textString)
        elif elem_type == Element.e_text_new_line:
            print("New Line")
        elif elem_type == Element.e_form:
            # A form contains its own elements, so process it recursively
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
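The options mentioned earlier, saving the text to a file or to a database, could be sketched as follows. This uses only the Python standard library. The function names, the table name `extracted_text`, and its columns are assumptions made for illustration and are not part of the Apryse SDK.

```python
import sqlite3
from pathlib import Path


def save_as_text_file(text, out_path):
    """Write extracted text to a UTF-8 text file and return its path."""
    out_path = Path(out_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


def save_to_database(conn, source_name, page_number, text):
    """Store one page of extracted text in a SQLite table.

    The table layout (source, page, content) is a hypothetical choice.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_text ("
        "source TEXT, page INTEGER, content TEXT)"
    )
    conn.execute(
        "INSERT INTO extracted_text (source, page, content) VALUES (?, ?, ?)",
        (source_name, page_number, text),
    )
    conn.commit()
```

For example, `save_to_database(sqlite3.connect("extracted.db"), "invoice.pdf", 1, page_text)` would add one row per page, which makes it easy to query the text later by document name.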
You can even use a utility method to extract all text content from a specific region, like a rectangle on a PDF page. This is useful if you're extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user/page coordinate system.
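def ReadTextFromRect(page, pos, reader):
    # pos is the rectangle to search, in PDF user/page coordinates.
    # RectTextSearch is a helper that is not shown in this snippet.
    reader.Begin(page)
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str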
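Because the rectangle is given in PDF user space, it can help to convert from a measurement you read off the page. By default, PDF user space has its origin at the bottom-left corner of the page and uses units of 1/72 inch. The sketch below converts a region measured in inches from the top-left corner into those coordinates. `region_to_pdf_rect` is a hypothetical helper, and it assumes an unrotated page whose visible area starts at (0, 0).

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def region_to_pdf_rect(left_in, top_in, width_in, height_in, page_height_pt):
    """Convert a region measured in inches from the page's top-left corner
    into (x1, y1, x2, y2) in PDF user space (origin at bottom-left).

    Assumes no page rotation and a page box that starts at (0, 0).
    """
    x1 = left_in * POINTS_PER_INCH
    x2 = (left_in + width_in) * POINTS_PER_INCH
    y2 = page_height_pt - top_in * POINTS_PER_INCH  # top edge of the region
    y1 = y2 - height_in * POINTS_PER_INCH           # bottom edge of the region
    return (x1, y1, x2, y2)
```

For a US Letter page (792 points tall), `region_to_pdf_rect(1, 1, 2, 0.5, 792)` gives `(72.0, 684.0, 216.0, 720.0)`. Those four values could then be used to build the rectangle passed as `pos`.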
To see a code sample for full text extraction, go to Read a PDF File Sample and under TextExtract, click Python. You can also download more code samples.
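Since the goal stated at the start is to process possibly thousands of PDFs, a batch loop is the natural next step. The sketch below shows one way to structure it with the standard library. `extract_folder` and its `extract_text` parameter are hypothetical: `extract_text` stands for any function you write that takes a PDF path and returns its text, for example one built on the TextExtractor loop shown earlier.

```python
from pathlib import Path


def extract_folder(input_dir, output_dir, extract_text):
    """Run `extract_text` on every .pdf file in input_dir and save each result
    as a .txt file in output_dir.

    `extract_text` is a hypothetical callable (PDF path -> string).
    Returns a dict mapping each failed file name to its error message.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    failures = {}
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except Exception as exc:
            # Record the problem and keep going, so one bad file
            # does not stop the whole batch
            failures[pdf_path.name] = str(exc)
            continue
        (output_dir / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
    return failures
```

Collecting failures instead of raising on the first error is a design choice that suits large batches: you can review and retry the problem files after the run finishes.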
In addition to simple text, you can also extract other data from a PDF using Python, including metadata and images.
Note: The Apryse Intelligent Data Extraction component add-on is required to perform the following task.
Apryse IDP performs layout-aware text extraction right out of the box for any structured or semi-structured data in PDF, while offering different conversion formats for processing options. It reliably recognizes tables, accurately extracts text and tabular data, detects and understands articles of text in a document, and detects various types of form fields.
To use IDP, follow these steps.
Place the Data Extraction Module in the PDFNetC\Lib folder within the SDK folder.
Edit LicenseKey.py to add the demo license key.
Go to Samples\DataExtractionTest\PYTHON within the Python SDK folder to view a sample Python script using the Data Extraction Module.
Use RunTest.bat or RunTest.sh to run DataExtractionTest.py.
Find the output in Samples\TestFiles\Output.

1. First import the Apryse SDK and add-ons from above.
import site
site.addsitedir("../../../PDFNetC/Lib") #Path to Lib folder
import sys
from PDFNetPython import *
import platform
sys.path.append("../../LicenseKey/PYTHON") #path to the licensekey location
from LicenseKey import *
2. Initialize the Apryse SDK.
PDFNet.Initialize(LicenseKey)
PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/") #Path to the Lib Folder
3. Call the Data Extraction Suite Python function of choice.
JSON doc structure:
DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_DocStructure)
Excel table data:
DataExtractionModule.ExtractToXSLX(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_XLSX)
Detect form field JSON:
DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_Form)
Variants of the method can output to a string, which can then be processed however the user needs.
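Once the output is a JSON string, Python's json module can parse it for further processing. The sketch below is a generic way to pull out every value stored under a given key, wherever it is nested. `collect_values` is a hypothetical helper, and the sample JSON and the key name "text" are made up for illustration; check the actual output of the module for its real structure.

```python
import json


def collect_values(node, key):
    """Recursively gather every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep searching inside the value in case it is nested
            found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up JSON for illustration only; real module output has its own schema.
sample_json = (
    '{"pages": [{"elements": ['
    '{"type": "paragraph", "text": "Hello"}, '
    '{"type": "table", "rows": [{"text": "A1"}]}'
    ']}]}'
)
texts = collect_values(json.loads(sample_json), "text")
```

A recursive search like this avoids hard-coding the nesting of the output, which is useful while you are still exploring what the JSON contains.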
In this tutorial, you extracted data for machine learning with Python and the Apryse SDK. You then used the scripts to decide where to send extracted data.
To learn more about text extraction, visit Extracting Text from a PDF on Cross-Platform (Core) and check out our WebViewer showcase to try out the PDF text extractor. The demo uses JavaScript, but the results are like what you’d see using Python.
You can also visit the Python documentation to see what else you can do with PDFs using Python.
If you have any questions or features you would like to see next, do not hesitate to reach out to us directly.
Heather Dinsdale
John Chow
Product Manager