How to Extract Text from a PDF Using Python

By Heather Dinsdale, John Chow | 2022 Dec 09

This tutorial explains how to extract text from a PDF using Python and the Apryse SDK for machine learning.

In this case, “extract text from a PDF” doesn’t mean just a paragraph or two from a single document — it means extracting text from possibly thousands of PDFs, using automation and batch processing.

In this tutorial, you will:

  • Use the Apryse SDK to run the bulk text extraction from your PDFs, automating the process.
  • Use Python scripts to specify what information to extract, from where, and where to send the extracted data.
  • Run layout-aware data extraction tests in Python with the Apryse SDK.
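As a rough illustration of what batch processing can look like, here is a minimal sketch of a driver that finds every PDF under a folder and hands each one to an extraction function. The names `find_pdfs`, `extract_batch`, and `extract_one` are hypothetical and not part of the Apryse SDK; `extract_one` stands in for whatever per-file extraction routine you write with the SDK.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def find_pdfs(root: str) -> List[Path]:
    """Return every .pdf file under root (case-insensitive), sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_batch(
    root: str, extract_one: Callable[[str], str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run extract_one on each PDF and return (texts, errors) keyed by file path.

    extract_one is a hypothetical callable: it takes a file path and
    returns the extracted text for that file.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for pdf in find_pdfs(root):
        try:
            texts[str(pdf)] = extract_one(str(pdf))
        except Exception as exc:
            # Record the failure and keep going so one bad file
            # does not stop the whole batch.
            errors[str(pdf)] = str(exc)
    return texts, errors
```

Collecting errors separately, rather than raising on the first failure, is one reasonable choice when a run covers thousands of files; you can review the failed files afterward.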

The tutorial provides a code sample for a very basic text extraction using a Python script with the Apryse SDK. We’ll also cover methods you can use to extract all text or even specific text in a PDF. Finally, this tutorial will touch on other data, such as metadata and images, which you can extract from a PDF using Python. 

In addition to basic text extraction, you can use the Apryse Intelligent Document Processing (IDP) add-on, which includes a Data Extraction capability, to perform layout-aware PDF text extraction in Python. Apryse IDP recognizes document layout along with content elements, such as tabular data, form fields, and text, and extracts them to structured JSON and Excel right out of the box. As a result, it gives organizations scalability and leading accuracy in PDF data extraction, and it eliminates costs associated with extensive templating, rules, and data entry.

For more information, check out the Python PDF library documentation.

Learn about the latest release of Apryse IDP. 

Prerequisites

To start using Python and the Apryse SDK, you need the following:

  • A Python environment (Apryse supports both Python 2 and Python 3)
  • A free Apryse account, so you can:
    * Get a trial license key
    * Download the Apryse SDK
    * Download samples and sample code
  • The Apryse text extraction demo (optional) 
  • The Apryse Intelligent Document Processing (IDP) add-on (optional, for layout-aware text extraction)

Step 1: Get Started

Follow these steps to get started:

  1. Go to the Download Center and create an Apryse account or sign in with an existing one.
  2. Choose your operating system—Windows, Linux, or macOS.
  3. Click Reveal to get a trial key. 
  4. In the Download section, select Python as the language. 
  5. Download the SDK for Python 2 or Python 3.
  6. (Optional) In the Get Started section, download the Python 2 or 3 Guide to get the Precompiled Python & PDF library integration. The guide will help you run Apryse samples and integrate a free trial of the Apryse SDK into Python applications. Your free trial includes unlimited trial usage and support from solution engineers.

    You can then download and run the Apryse SDK and samples.

You can also visit the Python Get Started page or the Python PDF Content Extraction Library.

Step 2: Extract Text from a PDF Using Python

Now that your Python environment is set up and you’ve downloaded the Apryse SDK, let’s extract some text.

Run the following code sample for a very basic text extraction using a Python script with the Apryse SDK: 

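from PDFNetPython import PDFDoc, TextExtractor  # assumes the SDK Lib folder is on the Python path

# filename is the path to your input PDF
doc = PDFDoc(filename)
page = doc.GetPage(1)  # pages are numbered starting at 1
txt = TextExtractor()
txt.Begin(page)  # Read the page
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        print(word.GetString())  # the text of the current word
        word = word.GetNextWord()
    line = line.GetNextLine()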
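If you want the text as data rather than a stream of words, the same loop can collect the words instead. The following is a minimal sketch; `collect_lines` and `lines_to_text` are hypothetical helper names, and the code assumes it receives a TextExtractor on which `Begin(page)` has already been called. It uses only the calls shown in the sample above.

```python
def collect_lines(extractor):
    """Return the page text as a list of lines, each a list of word strings.

    `extractor` is assumed to be a TextExtractor on which Begin(page)
    has already been called.
    """
    lines = []
    line = extractor.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        lines.append(words)
        line = line.GetNextLine()
    return lines


def lines_to_text(lines):
    """Join words with single spaces and lines with newlines."""
    return "\n".join(" ".join(words) for words in lines)
```

Keeping the words grouped by line makes it easier to post-process the result later, for example to search a particular line or to rebuild the text with different separators.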

Where to Send Extracted Text

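Next, decide what to do with the extracted text. You can save it to a text file or store it in a database. The following function walks the elements of a page with an element reader and prints each piece of text along with its bounding box; replace the print calls with writes to your chosen destination.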

from PDFNetPython import Element  # assumes the SDK Lib folder is on the Python path

def dumpAllText(reader):
    # Walk every element the reader yields and print the text it contains.
    element = reader.Next()
    while element is not None:
        elem_type = element.GetType()
        if elem_type == Element.e_text_begin:
            print("Text Block Begin")
        elif elem_type == Element.e_text_end:
            print("Text Block End")
        elif elem_type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            print(element.GetTextString())
        elif elem_type == Element.e_text_new_line:
            print("New Line")
        elif elem_type == Element.e_form:
            # Recurse into the form, which can hold its own text elements.
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
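For the database option, here is a minimal sketch that stores extracted text in SQLite using Python's standard library. The function name `save_page_text`, the table name `page_text`, and its columns are assumptions chosen for illustration; they are not part of the Apryse SDK.

```python
import sqlite3


def save_page_text(db_path, records):
    """Insert (source_file, page_number, text) rows and return how many were written.

    The table layout here is an illustrative assumption, not an SDK convention.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS page_text ("
                "source_file TEXT NOT NULL, "
                "page_number INTEGER NOT NULL, "
                "text TEXT NOT NULL)"
            )
            cur = conn.executemany(
                "INSERT INTO page_text (source_file, page_number, text) "
                "VALUES (?, ?, ?)",
                records,
            )
            return cur.rowcount
    finally:
        conn.close()
```

You might call it as `save_page_text("extracted.db", [("invoice.pdf", 1, page_text)])`, where `page_text` is the string you built from the extracted words. Using parameter placeholders rather than string formatting keeps quotes and other special characters in the extracted text from breaking the SQL.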

Extract Text from a Specific Region of a PDF Page

You can also use a utility method to extract all text content from a specific region, such as a rectangle on a PDF page. This is useful if you're extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user (page) coordinate system.

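def ReadTextFromRect(page, pos, reader):
    # pos is the search rectangle, in page coordinates.
    # RectTextSearch is a helper that is not defined in this snippet;
    # see the full text extraction sample referenced below.
    reader.Begin(page)
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str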
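Page coordinates can be confusing if you measure a region from the top of the page, as most viewers and image tools do. By default, PDF user space has its origin at the bottom-left corner of the page, with y increasing upward and units of 1/72 inch. The sketch below shows the conversion and a simple overlap test using plain tuples; both function names are hypothetical, and the code assumes an unrotated page whose origin has not been shifted.

```python
def rect_from_top_left(page_height, left, top, width, height):
    """Convert a box measured from the top-left corner into PDF (x1, y1, x2, y2).

    All values are in points. Assumes default PDF user space: origin at
    the bottom-left of the page, y increasing upward.
    """
    x1 = left
    x2 = left + width
    y2 = page_height - top   # top edge of the box in PDF coordinates
    y1 = y2 - height         # bottom edge of the box in PDF coordinates
    return (x1, y1, x2, y2)


def rects_intersect(a, b):
    """Return True if two (x1, y1, x2, y2) rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# A 200 x 100 point box placed one inch from the top and left of a US Letter page.
region = rect_from_top_left(792, 72, 72, 200, 100)
print(region)  # (72, 620, 272, 720)
```

The four numbers can then be used to build the rectangle you pass as `pos`.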

Full Text Extraction

To see a code sample for full text extraction, go to Read a PDF File Sample and under TextExtract, click Python. You can also download more code samples.

Extracting Other Data from a PDF

In addition to simple text, you can extract other data from a PDF using Python, including:

  • Digital signatures
  • Intuitive page content based on a concept of graphical elements
  • Structured Unicode text, including style and positioning information, from any PDF using the text recognition engine (pdftron.PDF.TextExtractor) 
  • Metadata, embedded fonts, ICC color profiles, U3D streams, and embedded files
  • Image extraction 

Step 3: Run Layout-Aware Data Extraction Tests in Python with the Apryse SDK

Note: The Apryse Intelligent Data Extraction component add-on is required to perform the following task.

Apryse IDP performs layout-aware text extraction right out of the box for any structured or semi-structured data in PDF, while offering different conversion formats for processing options. It reliably recognizes tables, accurately extracts text and tabular data, detects and understands articles of text in a document, and detects various types of form fields.

To use IDP, follow these steps.

  1. Install Python for your platform (supported platforms: Windows 32-bit and 64-bit, Linux 64-bit and ARM64).
  2. Identify the Python version installed (versions 3.5 to 3.11 are supported, with 3.11 the latest).
  3. Download the Apryse SDK for Python 3 for the current platform.
  4. Unzip the Python SDK.
  5. Download the Apryse Intelligent Data Extraction component add-on for the SDK.
  6. Unzip contents to the PDFNetC\Lib folder within the SDK folder.
  7. Modify LicenseKey.py to add the demo license key.
  8. Open Samples\DataExtractionTest\PYTHON within the Python SDK folder to view a sample Python script using the Data Extraction Module.
  9. Run RunTest.bat or RunTest.sh to run DataExtractionTest.py.
  10. View output in Samples\TestFiles\Output.
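Before working through the steps above, you may find it handy to confirm that your interpreter falls inside the supported range. This is a minimal sketch using only the standard library; the function name `is_supported` is hypothetical, and the version bounds are taken from step 2.

```python
import sys

# Bounds taken from the supported range listed in step 2.
MIN_SUPPORTED = (3, 5)
MAX_SUPPORTED = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range.

    With no argument, checks the interpreter that is running this code.
    """
    if version is None:
        version = sys.version_info[:2]
    return MIN_SUPPORTED <= tuple(version[:2]) <= MAX_SUPPORTED


if not is_supported():
    print("Python %d.%d is outside the range listed in this tutorial." % sys.version_info[:2])
```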

Script up a Python Example using the Data Extraction Suite

1. First import the Apryse SDK and add-ons from above.

import site
site.addsitedir("../../../PDFNetC/Lib") #Path to Lib folder
import sys
from PDFNetPython import *

import platform

sys.path.append("../../LicenseKey/PYTHON") #path to the licensekey location
from LicenseKey import *

2. Initialize the Apryse SDK.

PDFNet.Initialize(LicenseKey)

PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/") #Path to the Lib Folder

3. Call the Data Extraction Suite Python function of choice.

JSON doc structure:

DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_DocStructure)

Excel table data:

DataExtractionModule.ExtractToXLSX(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_XLSX)

Detect form field JSON:

DataExtractionModule.ExtractData(PATH_TO_INPUTFILE, PATH_TO_OUTPUT_JSON, DataExtractionModule.e_Form)

Variants of the method can output to a string, which can then be processed however the user needs.
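Once you have the JSON as a string, ordinary Python tooling can take over. The sketch below makes no assumptions about the structure of the JSON that the module produces: it parses the string and collects every string value it finds, which can be a quick way to inspect the output. The function names `iter_strings` and `strings_from_json` are hypothetical.

```python
import json


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans, and null are skipped.


def strings_from_json(json_text):
    """Parse a JSON string and return all string values in document order."""
    return list(iter_strings(json.loads(json_text)))
```

After you have looked at real output and know which keys matter for your documents, you would likely replace this generic walk with code that reads those keys directly.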

Conclusion

In this tutorial, you extracted data for machine learning with Python and the Apryse SDK. You then used the scripts to decide where to send extracted data. 

To learn more about text extraction, visit Extracting Text from a PDF on Cross-Platform (Core) and check out our WebViewer showcase to try out the PDF text extractor. The demo uses JavaScript, but the results are similar to what you'd see using Python.

You can also visit the Python documentation to see what else you can do with PDFs using Python, including: 

  • Splitting or merging documents page by page
  • Cropping pages
  • Merging multiple pages into a single page
  • Extracting text from PDF
  • Extracting layout-aware text
  • Rotating PDF pages
  • Merging PDFs
  • Splitting PDFs
  • Adding watermark to PDF pages
  • Encrypting and decrypting PDF files and more!

If you have any questions or features you would like to see next, do not hesitate to reach out to us directly. 

Heather Dinsdale

John Chow

Product Manager
