How to Accurately Extract PDF Data Using Apryse SDK and Node.js

By Josh Coffey | 2023 Jul 20

PDF Data Extraction Use Cases

Automating data extraction from PDF documents is increasingly necessary in document workflows. Tasks such as pulling out text and form field data, analyzing financial results, and generating reports all depend on accurately recognizing and extracting content from PDFs.

Overcoming PDF Data Extraction Challenges

Even though PDF is one of the most popular formats for business documents, accessing the data in a PDF can be difficult. Since the PDF format was conceived as an output format for displaying documents consistently across many different computers and operating systems, there is no guarantee that content will be accessible. If you’re lucky, the PDF might be tagged, as PDF/UA and the accessible conformance levels of PDF/A require, so that content is structured, meaningfully labeled, and easily readable. More likely, though, you’ll have a PDF with no structure at all, with text stored as characters located somewhere on a page.

However, you can overcome these challenges by using the Apryse SDK Intelligent Document Processing (IDP) add-on. With its powerful Data Extraction capabilities, it enables the automatic recognition and accurate extraction of PDF content as structured JSON or Excel data.

Achieving Accurate Data Extraction From PDF with the Apryse IDP Add-on

This tutorial will guide you through the process of extracting table data from PDFs and exporting it to tabular formatted JSON or Excel XLSX format. Additionally, you will learn how to convert a PDF into structured JSON, providing a comprehensive description of its contents. We'll also demonstrate how to utilize an AI-based algorithm to identify form fields within PDFs and generate a corresponding JSON file containing information about their location and type.

Prerequisites

This guide assumes you have a Node.js environment preconfigured. Screenshots in this guide are from Windows running Node.js in WSL (Windows Subsystem for Linux), but the steps have been tested on Windows, Linux, and WSL with Node 18. If you don't have Node.js, you can download an installer from https://nodejs.org/

Apryse SDK Trial Key

If you don't already have an Apryse account, go to https://dev.apryse.com and register a new account with Apryse. This allows Apryse to grant you a demo license key which will be used with the Apryse SDK to enable demo functionality.

Developer Portal

Log into https://dev.apryse.com with your registered account. For this guide, we’ll be developing a Node.js application, so select your development OS. This guide was tested in Windows, Linux, and WSL.

Below the Platform selection is a blurred field with your unique developer trial key. Click Reveal to show the key. Copy and paste this into a text file, as we will need it later for use in your code to enable usage of the Apryse SDK.

Download Center Platform and Trial Key

Download Apryse Data Extraction Module

Scroll down the page to "Step 4: Get Started". Select JavaScript as the language and expand the "Modules" section, which lists optional binary packages for additional Apryse SDK functionality. We will need the "Data Extraction Module". Click the download button to download the Data Extraction Module archive. Direct links are available below:

Windows: https://pdftron.s3.amazonaws.com/downloads/DataExtractionModuleWindows.zip
Linux / WSL: https://pdftron.s3.amazonaws.com/downloads/DataExtractionModuleLinux.tar.gz

Install Node.js Samples

Now that we have an Apryse SDK Trial Key, we can download and run the Apryse Node Samples. All we need to do is install the @pdftron/pdfnet-node-samples package. Details are on npm at https://www.npmjs.com/package/@pdftron/pdfnet-node-samples

npm install @pdftron/pdfnet-node-samples

Once the download is complete we can find the code samples in the newly created directory ./node_modules/@pdftron/pdfnet-node-samples/.

Before we can run any of the sample code, we will first need to add our Apryse SDK Trial Key. Edit the file samples/LicenseKey/LicenseKey.js and only modify the following line:

const LicenseKey = 'YOUR_PDFTRON_LICENSE_KEY';

Replace YOUR_PDFTRON_LICENSE_KEY with the Apryse SDK Trial Key you copied earlier.
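If you would rather not paste the key into a file under node_modules, one option is to read it from the environment instead. The following is a minimal sketch, not part of the Apryse samples: the variable name APRYSE_LICENSE_KEY and the helper resolveLicenseKey are hypothetical names chosen for illustration. It also rejects the unreplaced placeholder so that a forgotten edit fails early with a clear message.

```javascript
// The placeholder string shipped in the sample's LicenseKey.js.
const PLACEHOLDER = 'YOUR_PDFTRON_LICENSE_KEY';

// Read the trial key from an environment variable.
// APRYSE_LICENSE_KEY is a hypothetical name used only in this sketch.
function resolveLicenseKey(env = process.env) {
  const key = (env.APRYSE_LICENSE_KEY || '').trim();
  if (key === '' || key === PLACEHOLDER) {
    throw new Error('Set APRYSE_LICENSE_KEY to your Apryse trial key.');
  }
  return key;
}
```

With a helper like this, the line in LicenseKey.js could plausibly become const LicenseKey = resolveLicenseKey(); and the key would be supplied when launching Node.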

Install Data Extraction Module

In order to use the Data Extraction Module, we need to let our application know where to find it. Additional resource paths, such as our Data Extraction Module, can be added to our application using the following method call:

await PDFNet.addResourceSearchPath('/path/to/lib');

The sample code expects these libraries to be installed at ./node_modules/@pdftron/pdfnet-node-samples/lib.

For Windows:

Extract DataExtractionModuleWindows.zip to the above directory.

For Linux / WSL:

Extract the .tar.gz file to the above location using the command:

tar -xf DataExtractionModuleLinux.tar.gz -C ./node_modules/@pdftron/pdfnet-node-samples/

NOTE: if you are using a case-sensitive file system, you may need to rename the folder from Lib to lib (all lowercase) for the included sample code to find it.
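As an alternative to renaming the folder, your own code can look for either spelling before registering the resource path. This is a small sketch using only Node's built-in fs and path modules. The helper name findModuleDir is hypothetical, and it only checks that a folder named lib or Lib exists, not that the module inside it is complete.

```javascript
const fs = require('fs');
const path = require('path');

// Look for the extracted module folder directly under `baseDir`,
// accepting any capitalization of "lib" (for example "lib" or "Lib").
// Returns the resolved path, or null if no such folder exists.
function findModuleDir(baseDir) {
  let entries;
  try {
    entries = fs.readdirSync(baseDir, { withFileTypes: true });
  } catch (err) {
    return null; // baseDir itself is missing or unreadable
  }
  const match = entries.find(
    (entry) => entry.isDirectory() && entry.name.toLowerCase() === 'lib'
  );
  return match ? path.resolve(baseDir, match.name) : null;
}
```

The returned path can then be passed to PDFNet.addResourceSearchPath, and a null result is a good place to print a reminder to extract the module archive.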

Run the Data Extraction Sample

We should be able to run the sample code now. Navigate to ./node_modules/@pdftron/pdfnet-node-samples/samples/DataExtractionTest and run the sample data extraction tests with:

node DataExtractionTest

Terminal showing output of running data extraction tests

We can see the results by looking at the ./node_modules/@pdftron/pdfnet-node-samples/samples/TestFiles/Output directory. For each JSON and Excel file, there is a corresponding PDF file in the TestFiles directory that was used as the input.

Financial data successfully extracted from a PDF to an Excel file

Next, let’s look at how much code was required for the samples to work. You may be surprised at how easy it is to extract data from a PDF document using the Apryse SDK!

Sample Node.js Application for PDF Data Extraction

All of the code we executed is contained in ./node_modules/@pdftron/pdfnet-node-samples/samples/DataExtractionTest/DataExtractionTest.js. Let’s open it up and take a look at the code required for each of the sample extractions that ran.

The Data Extraction Module has three main APIs, each covered by its own section of the sample code: tabular data extraction, document structure conversion, and form field detection.

Extracting Tabular Data

The tabular data tests convert some sample PDFs which contain tables into both tabular formatted JSON as well as Excel XLSX files. Conversions performed by the tabular functions will convert all content in the PDF into an Excel or tabular JSON file, so non-tabular data, such as paragraphs, will also be included. If you don’t require this data in the output, you will have to manually remove it post-conversion.

// Write tabular data as JSON file
let outputFile = outputPath + 'table.json';
await PDFNet.DataExtractionModule.extractData(inputPath + 'table.pdf', outputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular);

// Extract tabular data as a JSON string
const json = await PDFNet.DataExtractionModule.extractDataAsString(inputPath + 'financial.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular);

// Extract tabular data as an XLSX file
outputFile = outputPath + 'table.xlsx';
await PDFNet.DataExtractionModule.extractToXLSX(inputPath + 'table.pdf', outputFile);
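Outside the sample project, the same call needs a little setup around it: registering the module path and confirming the module can be found. Here is a hedged sketch of how that might look. The function name extractTablesToJson is hypothetical, and the PDFNet object is passed in as a parameter so the function can be tried with a stub. The commented usage assumes the SDK is installed as @pdftron/pdfnet-node and uses isModuleAvailable, runWithCleanup, and shutdown the way the bundled sample does.

```javascript
// Register the module folder, verify the tabular engine is available,
// then write the tables from `inputFile` to `outputFile` as JSON.
// `pdfnet` is the PDFNet object from the SDK (or a stub with the same shape).
async function extractTablesToJson(pdfnet, modulePath, inputFile, outputFile) {
  const dem = pdfnet.DataExtractionModule;
  const engine = dem.DataExtractionEngine.e_Tabular;

  await pdfnet.addResourceSearchPath(modulePath);
  if (!(await dem.isModuleAvailable(engine))) {
    throw new Error('Data Extraction Module not found under ' + modulePath);
  }

  await dem.extractData(inputFile, outputFile, engine);
  return outputFile;
}

// Possible usage with the real SDK (assumed package name, not run here):
//   const { PDFNet } = require('@pdftron/pdfnet-node');
//   PDFNet.runWithCleanup(
//     () => extractTablesToJson(PDFNet, './lib', 'table.pdf', 'table.json'),
//     licenseKey
//   ).finally(() => PDFNet.shutdown());
```

Failing fast when the module is missing tends to give a clearer error than letting the extraction call itself fail.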

Document Structure Conversion

The second section converts some sample PDFs into a document structure JSON which describes the PDF in its entirety. This JSON will contain a JSON element for every item in the PDF, for example, text, images, graphics, and tables. Each element will have its position data as well as text formatting so that the JSON will be an accurate 1:1 reconstruction of the PDF.

// Extract document structure as a JSON file
let outputFile = outputPath + 'paragraphs_and_tables.json';
await PDFNet.DataExtractionModule.extractData(inputPath + 'paragraphs_and_tables.pdf', outputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);

// Extract document structure as a JSON string
const json = await PDFNet.DataExtractionModule.extractDataAsString(inputPath + 'tagged.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);
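Once you have the JSON string, a quick way to get familiar with its shape is to count which property names it contains. The sketch below deliberately makes no assumptions about the Apryse schema, which this article does not spell out: it simply walks any parsed JSON value. The helper names walkJson and countKeys are hypothetical, and the sample string is invented for illustration rather than real extraction output.

```javascript
// Call `visit` on every object nested anywhere inside a parsed JSON value.
function walkJson(value, visit) {
  if (Array.isArray(value)) {
    for (const item of value) walkJson(item, visit);
  } else if (value !== null && typeof value === 'object') {
    visit(value);
    for (const child of Object.values(value)) walkJson(child, visit);
  }
}

// Count how often each property name appears in a JSON string.
function countKeys(jsonString) {
  const counts = new Map();
  walkJson(JSON.parse(jsonString), (node) => {
    for (const key of Object.keys(node)) {
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  });
  return Object.fromEntries(counts);
}

// Hypothetical input, NOT the real Apryse output format.
const sample = '{"pages":[{"elements":[{"type":"text"},{"type":"table"}]}]}';
console.log(countKeys(sample)); // { pages: 1, elements: 1, type: 2 }
```

Passing the string returned by extractDataAsString to countKeys gives a rough inventory of the fields to look at before writing any schema-specific code.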

Extracting Form Fields

The third sample block processes a PDF with an AI-based algorithm to produce a JSON document that describes the location and type of detected form fields. It detects form fields not only in native PDF forms but also in flat, non-interactive PDFs intended for printing, and even in scanned, image-based documents.

// Extract form fields as a JSON file
let outputFile = outputPath + 'formfields-scanned.json';
await PDFNet.DataExtractionModule.extractData(inputPath + 'formfields-scanned.pdf', outputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_Form);

// Extract form fields as a JSON string
const json = await PDFNet.DataExtractionModule.extractDataAsString(inputPath + 'formfields.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_Form);

Conclusion

The sample project demonstrates that you only need a few lines of code to extract data from PDFs using the Apryse SDK and the Data Extraction Module. Visit our Intelligent Data Extraction guide for more details on our cross-platform API, or, for more general help with Node.js development and the Apryse SDK, visit our developer guides for Node.js. If you have any further questions, you can visit our Discord to chat with us.

When you’re ready to add IDP and intelligent data extraction to your existing Apryse Server SDK license, contact our sales team.