A Guide to PDF Data Extraction Using Apryse SDK and Python

By Josh Coffey | 2023 Jul 20

4 min

Automated PDF Data Extraction Use Cases

Copied to clipboard

Accurately recognizing and extracting content from PDFs is vital in today’s digital document workflows. Whether your requirements involve extracting text and form field data, analyzing financial results, or generating reports, the ability to extract data from PDFs plays a crucial role.

The Challenges of Accurate PDF Data Extraction

Copied to clipboard

PDF is one of the most prevalent formats for the distribution of modern business documents. However, accessing the data contained within them can be challenging. The PDF format’s original purpose was to serve as an output format, ensuring documents would be consistently displayed across various computers and operating systems. Unless a PDF was created as a structured document with descriptive tagging (e.g., the PDF/A and PDF/UA standards) to allow content to be easily readable, the content is locked away and inaccessible using normal methods.

Thankfully, using the Apryse SDK Intelligent Document Processing (IDP) add-on’s Data Extraction capabilities, it is possible to programmatically recognize and extract PDF content in structured JSON and Excel formats, making it easy to analyze or reuse elsewhere.

Leveraging the Apryse IDP Add-on for Accurate PDF Data Extraction

Copied to clipboard

In this tutorial, we will show how to extract table data from PDF and export it to tabular formatted JSON or Excel XLSX format and convert PDF into structured JSON that describes the entire PDF. We’ll also show how you can process PDFs using an AI-based algorithm to detect form fields and output a JSON file that describes their location and type.

Prerequisites

Copied to clipboard

This guide assumes the developer has a Python 3 environment preconfigured with PIP already installed. Screenshots in this guide will be from Windows running Python 3 in WSL (Windows Subsystem for Linux) but have been tested in Windows, Linux, and WSL on Python 3. If you don’t have Python 3, visit Python’s getting started guide at https://www.python.org/about/gettingstarted/.

Apryse SDK Trial Key

Copied to clipboard

If you don’t already have an Apryse account, go to https://dev.apryse.com and register a new account with Apryse. This allows Apryse to grant you a demo license key which will be used with the Apryse SDK to enable demo functionality.

Log into https://dev.apryse.com with your registered account. For this guide, we’ll be developing a Python application, so select your development OS. This guide was tested in Windows, Linux, and WSL.

Below the Platform selection is a blurred field with your unique developer trial key. Click Reveal to show the key. Copy and paste this into a text file, as we will need it later for use in your code to enable usage of the Apryse SDK.

Download Apryse Data Extraction Module

Copied to clipboard

Scroll down the page to “Step 4: Get Started”. Select Python as the language and expand the “Modules” section. This lists optional binary packages for additional Apryse SDK functionality. We will need the “Data Extraction Module”. Click the download button to download the Data Extraction Module archive.

Direct links are available below:

Windows: https://pdftron.s3.amazonaws.com/downloads/DataExtractionModuleWindows.zip
Linux / WSL: https://pdftron.s3.amazonaws.com/downloads/DataExtractionModuleLinux.tar.gz

Install Python Apryse SDK

Copied to clipboard

Next, we need to install PDFNetPython3, the package name for the Python Apryse SDK. Note that this package is hosted on Apryse’s S3 private repository and is no longer maintained in the Python Package Index (PyPI).

You can install the Python Apryse SDK with PIP:

pip install apryse-sdk --extra-index-url=https://pypi.apryse.com

Download Python Apryse SDK Samples

Copied to clipboard

Now that we have an Apryse SDK Trial Key we can download and run the Apryse Python Samples. Download the samples from our website at: https://docs.apryse.com/downloads/PDFNetWrappers/PDFNetPython3.zip

On Linux/WSL you can also do this on your terminal with wget or curl.

wget https://www.pdftron.com/downloads/PDFNetWrappers/PDFNetPython3.zip

Once the download is complete, extract the zip file somewhere convenient. If you are using Linux or WSL, most distributions include the unzip utility if you wish to do this step from your terminal.

unzip PDFNetPython3.zip

Before we can run any of the sample code, we will first need to add our Apryse SDK Trial Key. Edit the file Samples/LicenseKey/PYTHON/LicenseKey.py and only modify the following line:

LicenseKey = "YOUR_PDFTRON_LICENSE_KEY"

Replace YOUR_PDFTTRON_LICENSE_KEY with your Apryse SDK Trial Key we created earlier.

Install Data Extraction Module

Copied to clipboard

In order to use the Data Extraction Module, we need to let our application know where to find it. Additional resource paths, such as our Data Extraction Module, can be added to our application using the following method call:

PDFNet.AddResourceSearchPath("path/to/lib")

The sample code expects these libraries to be available in the same folder as the sample code at PDFNetC/Lib/.

For Windows

Create a new folder inside of PDFNetPython3 called PDFNetC. Unzip DataExtractionModuleWindows.zip to the newly created PDFNetC directory.

For Linux / WSL

Create a new folder in PDFNetPython3 called PDFNetC. Extract the .tar.gz file into the newly created directory.

Your resulting directory structure should be:

mkdir PDFNetPython3/PDFNetC

tar -xf DataExtractionModuleLinux.tar.gz -C ./PDFNetPython3/PDFNetC/

Run the Data Extraction Sample

Copied to clipboard

We should be able to run the sample code now. Navigate to PDFNetPython3/Samples/DataExtractionTest/PYTHON and run the sample data extraction code by running the DataExtractionTest.py script.

python3 DataExtractionTest

We can see the results of this by looking at the Samples\TestFiles\Output directory. For each JSON and Excel file, there is a corresponding PDF file in the TestFiles directory used as the input.

Next, let’s look at how much code was required for the samples to work. You may be surprised at how easy it is to extract data from a PDF document using the Apryse SDK!

Sample Python Application for PDF Data Extraction

Copied to clipboard

All of the code we executed is contained in PDFNetPython3/Samples/DataExtractionTest/PYTHON/DataExtractionTest.py. Let’s open it up and take a look at the code required for each of the sample extractions that ran.

The Data Extraction Module has three main APIs, which have been divided into three separate sections within the sample code:

Tabular Data Extraction

Copied to clipboard

The first type of extraction we’re going to look at is tabular data extraction, which will convert some sample PDFs containing tables into both tabular formatted JSON as well as Excel XLSX files. The conversions performed by the tabular functions will convert all content in the PDF into an Excel or tabular JSON file, so non-tabular data, such as paragraphs, will be included. If this data is not required in the output, it will have to be manually removed post-conversion.

# Extract tabular data as a JSON file 
outputFile = outputPath + "table.json" 
DataExtractionModule.ExtractData(inputPath + "table.pdf", outputFile, DataExtractionModule.e_Tabular) 
 
# Extract tabular data as a JSON string 
outputFile = outputPath + "financial.json" 
json = DataExtractionModule.ExtractData(inputPath + "financial.pdf", DataExtractionModule.e_Tabular) 
WriteTextToFile(outputFile, json) 
 
# Extract tabular data as an XLSX file 
outputFile = outputPath + "table.xlsx" 
DataExtractionModule.ExtractToXLSX(inputPath + "table.pdf", outputFile)

Document Structure Conversion

Copied to clipboard

The second section will convert some sample PDFs into a document structure JSON which describes the PDF in its entirety. This JSON will contain a JSON element for every item in the PDF, whether it’s text, images, graphics, or tables. Each element will have position data as well as text formatting so that the JSON is an accurate 1:1 reconstruction of the PDF.

# Extract document structure as a JSON file 
outputFile = outputPath + "paragraphs_and_tables.json" 
DataExtractionModule.ExtractData(inputPath + "paragraphs_and_tables.pdf", outputFile, DataExtractionModule.e_DocStructure) 
 
# Extract document structure as a JSON string 
outputFile = outputPath + "tagged.json" 
json = DataExtractionModule.ExtractData(inputPath + "tagged.pdf", DataExtractionModule.e_DocStructure) 
WriteTextToFile(outputFile, json)

Form Fields Extraction

Copied to clipboard

The third sample block will process a PDF with an AI-based algorithm and produce a JSON document describing the location and type of detected form fields. This AI will detect forms from not only PDF native forms but also flat non-interactive PDFs containing forms for printing and additionally from scanned image-based documents.

# Extract form fields as a JSON file 
outputFile = outputPath + "formfields-scanned.json" 
DataExtractionModule.ExtractData(inputPath + "formfields-scanned.pdf", outputFile, DataExtractionModule.e_Form) 
 
# Extract form fields as a JSON string 
outputFile = outputPath + "formfields.json" 
json = DataExtractionModule.ExtractData(inputPath + "formfields.pdf", DataExtractionModule.e_Form) 
WriteTextToFile(outputFile, json)

Conclusion

Copied to clipboard

The sample project will show that you can extract data from PDFs using the Apryse SDK and the Data Extraction Module in just a few lines of code. Visit our Intelligent Data Extraction guide for more details on our cross-platform API, or for more general help with Python development and the Apryse SDK, visit our get started guides for Python. You can also chat with us on Discord if you have any questions.

When you’re ready to add IDP and intelligent data extraction to your existing Apryse Server SDK license, contact our sales team.