How to Extract Text from a PDF Using Python

By Heather Dinsdale, John Chow | 2022 Dec 09

4 min

Prerequisites

To start using Python and the Apryse SDK, you need the following:

A Python environment (Apryse supports Python 3)
A free Apryse account, so you can:
* Get a trial license key
* Download the Apryse SDK
* Download samples and sample code
The Apryse text extraction demo (optional)
The Apryse Intelligent Document Processing (IDP) add-on (optional, for layout-aware text extraction)

Step 1: Get Started

Follow these steps to get started:

Go to the Download Center to create an Apryse account or sign in to an existing one.
Choose your operating system—Windows, Linux, or macOS.
Click Reveal to get a trial key.
In the Download section, select Python as the language.
Download the package for Python version 2 or 3.
(Optional) In the Get Started section, download the Python 2 or 3 Guide to get the Precompiled Python & PDF library integration. The guide will help you run Apryse samples and integrate a free trial of the Apryse SDK into Python applications. Your free trial includes unlimited trial usage and support from solution engineers.

You can then download and run the Apryse SDK and samples.

You can also visit the Python Get Started page or the Python PDF Content Extraction Library.

Step 2: Extract Text from a PDF Using Python

Now that your Python environment is set up and you’ve downloaded the Apryse SDK, let’s extract some text.

Run the following code sample for a very basic text extraction using a Python script with the Apryse SDK:

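# Assumes the SDK has been imported and initialized (see Step 3)
# and that filename holds the path to a PDF file.
doc = PDFDoc(filename)
page = doc.GetPage(1)  # Pages are numbered starting at 1
txt = TextExtractor()
txt.Begin(page)  # Read the page
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        print(word.GetString())  # Text of the current word
        word = word.GetNextWord()
    line = line.GetNextLine()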
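
The nested loops above can be wrapped in a generator so that callers receive one word at a time. This is a minimal sketch: `iter_words` is a hypothetical helper name, not part of the Apryse SDK, and it relies only on the methods already used in the sample (`GetFirstLine`, `GetNextLine`, `GetFirstWord`, `GetNextWord`, `IsValid`, `GetString`).

```python
def iter_words(extractor):
    """Yield the text of each word, line by line.

    Hypothetical helper (not an SDK function). It expects an extractor
    on which Begin(page) has already been called, and it only uses the
    methods that appear in the sample above.
    """
    line = extractor.GetFirstLine()
    while line.IsValid():
        word = line.GetFirstWord()
        while word.IsValid():
            yield word.GetString()
            word = word.GetNextWord()
        line = line.GetNextLine()
```

With the `txt` object from the sample, `" ".join(iter_words(txt))` would collect the page's words into a single string.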

Where to Send Extracted Text

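Next, decide what to do with the extracted text. You can save it to a text file or to a database. The following function walks the content elements supplied by a reader and prints each text run together with its bounding box; replace the print calls with your own file or database writes.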

def dumpAllText(reader):
    # reader must already be positioned on a page or form (Begin or FormBegin called)
    element = reader.Next()
    while element is not None:
        elem_type = element.GetType()  # Avoid shadowing the built-in type()
        if elem_type == Element.e_text_begin:
            print("Text Block Begin")
        elif elem_type == Element.e_text_end:
            print("Text Block End")
        elif elem_type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            textString = element.GetTextString()
            print(textString)
        elif elem_type == Element.e_text_new_line:
            print("New Line")
        elif elem_type == Element.e_form:
            # Recurse into nested forms, which can contain their own text
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
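
As one way to act on the choice described above, the sketch below writes a list of extracted strings to a text file and to a SQLite table using only the Python standard library. The function name `save_text`, the table name `extracted_text`, and the column layout are assumptions made for illustration; none of them come from the Apryse SDK.

```python
import sqlite3


def save_text(lines, txt_path, db_path):
    """Write extracted strings to a UTF-8 text file and a SQLite table.

    All names here (save_text, extracted_text, line_no, content) are
    hypothetical choices for this sketch.
    """
    lines = list(lines)  # Accept any iterable, including generators

    # One extracted string per line in the text file.
    with open(txt_path, "w", encoding="utf-8") as fh:
        for text in lines:
            fh.write(text + "\n")

    # Store the same strings in a database, keeping their original order.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS extracted_text "
            "(line_no INTEGER, content TEXT)"
        )
        conn.executemany(
            "INSERT INTO extracted_text (line_no, content) VALUES (?, ?)",
            list(enumerate(lines, start=1)),
        )
        conn.commit()
    finally:
        conn.close()
    return len(lines)
```

In `dumpAllText`, the strings could be appended to a list in place of `print(textString)` and then passed to this function.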

Extract Text from a Specific Region of a PDF Page

You can also use a utility method to extract all text content from a specific region, such as a rectangle on a PDF page. This is useful if you’re extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user (page) coordinate system.

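def ReadTextFromRect(page, pos, reader):
    # pos is the target rectangle in page coordinates.
    # RectTextSearch is a helper that is not shown in this snippet;
    # it must be defined alongside this function.
    reader.Begin(page)
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str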
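
To make the region test concrete, here is a plain-Python sketch of the overlap check that region-based extraction relies on: a text run is kept when its bounding box overlaps the target rectangle. It assumes rectangles are given as `(x1, y1, x2, y2)` tuples in PDF user space, where by default the origin is the page's bottom-left corner and one unit is 1/72 inch. The names `rects_intersect` and `filter_by_region` are hypothetical, and this illustrates the idea only; it is not the SDK's implementation.

```python
def _normalize(rect):
    """Return (left, bottom, right, top) regardless of corner order."""
    x1, y1, x2, y2 = rect
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def rects_intersect(a, b):
    """Return True when rectangles a and b share any area.

    Rectangles that only touch along an edge do not count as overlapping.
    """
    a_left, a_bottom, a_right, a_top = _normalize(a)
    b_left, b_bottom, b_right, b_top = _normalize(b)
    return (a_left < b_right and b_left < a_right
            and a_bottom < b_top and b_bottom < a_top)


def filter_by_region(runs, region):
    """Keep the text of runs whose bounding box overlaps region.

    runs is an iterable of (bbox, text) pairs (a hypothetical input shape).
    """
    return [text for bbox, text in runs if rects_intersect(bbox, region)]
```

For documents that share a layout, the same `region` tuple can be reused across files, which is what makes this approach convenient for invoices or forms.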

Full Text Extraction

To see a code sample for full text extraction, go to Read a PDF File Sample and under TextExtract, click Python. You can also download more code samples.

Extracting Other Data from a PDF

In addition to simple text, you can also extract data from a PDF using Python, including:

Digital signatures
Intuitive page content based on a concept of graphical elements
Structured Unicode text, including style and positioning information, from any PDF using the text recognition engine (pdftron.PDF.TextExtractor)
Metadata, embedded fonts, ICC color profiles, U3D streams, and embedded files
Images

Step 3: Run Layout-Aware Data Extraction Tests in Python with the Apryse SDK

Note: The Apryse Intelligent Data Extraction component add-on is required to perform the following task.

Apryse IDP performs layout-aware text extraction out of the box for structured or semi-structured data in PDFs, and it offers several output formats for further processing. It recognizes tables, extracts text and tabular data, detects articles of text in a document, and detects various types of form fields.

To use IDP, follow these steps.

Install Python for your platform (Windows 32-bit and 64-bit, Linux 64-bit, and ARM64 are supported).
Identify the installed Python version (versions 3.5 to 3.11 are supported; 3.11 is the latest).
Download the Apryse SDK for Python 3 for the current platform.
Unzip the Python SDK.
Download the Apryse Intelligent Data Extraction component add-on for the SDK.
Unzip contents to the PDFNetC\Lib folder within the SDK folder.
Modify LicenseKey.py to add the demo license key.
Open Samples\DataExtractionTest\PYTHON within the Python SDK folder to view a sample Python script using the Data Extraction Module.
Run RunTest.bat or RunTest.sh to run DataExtractionTest.py.
View output in Samples\TestFiles\Output.
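
The version check in the steps above can be automated before running the sample. The sketch below compares the running interpreter against the 3.5 to 3.11 range quoted in this article. The function name `is_supported_python` is hypothetical, and the range is taken from the text above, so it may be out of date for newer SDK releases.

```python
import sys

# Range quoted in the steps above; treat it as an assumption that may be outdated.
MIN_VERSION = (3, 5)
MAX_VERSION = (3, 11)


def is_supported_python(version=None):
    """Return True when (major, minor) falls within the quoted range."""
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    status = "within" if is_supported_python() else "outside"
    print("Python " + current + " is " + status + " the quoted range")
```

Running this first gives an early, readable message, which is easier to diagnose than an import failure from the SDK module.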

Script Up a Python Example Using the Data Extraction Suite

1. First import the Apryse SDK and add-ons from above.

import site
site.addsitedir("../../../PDFNetC/Lib") #Path to Lib folder
import sys
from PDFNetPython import *

import platform

sys.path.append("../../LicenseKey/PYTHON") #path to the licensekey location
from LicenseKey import *

2. Initialize the Apryse SDK.

PDFNet.Initialize(LicenseKey)
PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/") #Path to the Lib Folder

3. Call the Data Extraction Suite Python function of choice.
JSON doc structure:

DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_DocStructure)

Excel table data:

DataExtractionModule.ExtractToXLSX(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_XLSX)

Detect form field JSON:

DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_Form)

Variants of the method can output to a string, which can then be processed however the user needs.
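
When the output arrives as a string, the standard `json` module can turn it into Python objects. The sketch below is deliberately generic: `summarize_json` is a hypothetical helper, and the sample string is a placeholder, not real IDP output, because the JSON schema is not described in this article.

```python
import json


def summarize_json(json_text):
    """Parse a JSON string and report its top-level shape."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Placeholder input for illustration only; real output will look different.
sample = '{"pages": [], "note": "placeholder"}'
print(summarize_json(sample))
```

Inspecting the top-level keys first is a low-risk way to learn the structure of the real output before writing code that depends on specific fields.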

Conclusion

In this tutorial, you extracted text from a PDF with Python and the Apryse SDK, looked at where to send the extracted text, and ran layout-aware data extraction with the IDP add-on.

To learn more about text extraction, visit Extracting Text from a PDF on Cross-Platform (Core) and check out our WebViewer showcase to try out the PDF text extractor. The demo uses JavaScript, but the results are similar to what you’d see using Python.

You can also visit the Python documentation to see what else you can do with PDFs using Python, including: