How to Extract Data From PDF Using Apryse SDK and C#

By John Chow | 2023 Jun 07

5 min

Use Cases for PDF Data Extraction

Copied to clipboard

There are many use cases for automated PDF data extraction in modern document workflows. Whether you need to extract text and form field data, perform analysis of financial results, or generate reports, accurate content recognition and extraction are essential.

Why Accurate Extraction Matters – Yet Hard to Achieve

Copied to clipboard

PDF is one of the most widely used formats for business documents, yet accessing the data, they contain presents a number of challenges. This is because PDF was originally designed purely as an output format so that documents could be displayed identically regardless of your computer or operating system. Unless the PDF was created specifically as a structured, tagged document (such as PDF/A and PDF/UA) to allow easily accessible content, then you’re likely to run into problems.

Fortunately, there are solutions. Using the Apryse SDK Intelligent Document Processing (IDP) add-on’s Data Extraction capabilities, you can automate accurate recognition and extraction of content in PDF documents as structured JSON and Excel data.

Using the Apryse IDP Add-on for Accurate Data Extraction from PDF

Copied to clipboard

This tutorial will cover table data extraction from PDF to tabular formatted JSON or Excel XLSX format or conversion of PDF into structured JSON that describes the PDF in its entirety. Finally, we’ll show how to process a PDF with an AI-based algorithm to detect form fields and produce JSON describing their location and type.

Prerequisites

Copied to clipboard

This guide assumes the developer has a Windows .NET developer environment preconfigured. If not, Visual Studio Community edition can be downloaded and installed. In addition, the .NET Core 2.0 Runtime must be installed, which can be downloaded at https://dotnet.microsoft.com/en-us/download/dotnet/thank-you/runtime-2.1.30-windows-x64-installer.

Setup for Development

Copied to clipboard

1. Go to https://dev.apryse.com and register a new account with Apryse. This allows Apryse to grant you a demo license key which will be used with theApryse SDK to enable demo functionality.

2. Log into https://dev.apryse.com with your registered account. For this guide, we’ll be developing on Windows with C#, so select Windows.

3. Below the Platform selection is a blurred field with your unique developer trial key. Click Reveal to show the key. Copy and paste this into a text file, as we will need it later for use in your code to enable usage of the Apryse SDK.

4. Scroll down to Step 3, and select C# for a programming language. This will filter the SDK downloads to all C# compatible SDK variants. For this guide, we will download the .NET Core 64-bit SDK. This will download PDFNetC64.zip, which is available at https://pdftron.s3.amazonaws.com/downloads/PDFNetC64.zip.

5. Scroll down the page to “Step 4: Get Started”. Select C# for the language and expand the “Modules” section. This lists optional binary packages for additional Apryse SDK functionality. We will need the “Data Extraction Module”. Click the download button to download DataExtractionModuleWindows.zip, which is available at https://pdftron.s3.amazonaws.com/downloads/DataExtractionModuleWindows.zip.

6. Unzip PDFNetC64.zip to a location of your choosing. For this guide, we will just unzip to the root of C:\. This will result in C:\PDFNet64\which contains the base Apryse SDK folders for Windows .NET development.

7. Open the DataExtractionModuleWindows.zip, where you will find a “Lib” folder. Unzip the .zip file into C:\PDFNetC64 so that the DataExtractionModuleWindows.zip Lib folder gets merged with C:\PDFNetC64\Lib. If done correctly there should now be aC:\PDFNetC64\Lib\Windows folder with some additional folders and binaries. Now the environment is set up for developing with the Apryse SDK and the Data Extraction APIs of the IDP add-on.

Building a Sample .NET Application for PDF Data Extraction

Copied to clipboard

Now that the Apryse SDK is set up on your Windows development PC, we can start building a sample application. To do so requires we add the Demo License key copied from dev.apryse.com

1. Navigate to the LicenseKey sample project, which in this guide is available at C:\PDFNetC64\Samples\LicenseKey\CS\ Open the LicenseKey.cs file for editing. This file will contain the LicenseKey utilized by the SDK at run time.

2. Within the LicenseKey.cs file, you’ll see a line of code declaring a private variable:

private static string key = "YOUR_PDFTRON_LICENSE_KEY";

Replace YOUR_PDFTRON_LICENSE_KEY with the demo license key copied from dev.apryse.com. Now when running sample projects from the Apryse SDK, they will properly initialize the SDK but also have some demo limitations, such as limiting page numbers for batch operations.

3. Open the DataExtractionTest sample project in your dev environment so we can test the Data Extraction in .NET. This will be available at C:\PDFNetC64\Samples\DataExtractionTest\CS\DataExtractionTest.csproj
Opening the project in Visual Studio will generate a bin and obj folder automatically.

4. With the project now open in Visual Studio, we can now see some sample C# code that runs all aspects of the Data Extraction APIs of the IDP Addon. Open DataExtractionTest.cs to view the sample code of the project.

5. The Data Extraction Module has three main APIs, which have been divided into three sample functions within the sample code:

TestTabularData()
TestDocumentStructure()
TestFormFields()

TestTabularData() will convert some sample PDFs containing tables into both tabular formatted JSON as well as Excel XLSX files. The conversions performed by the tabular functions will convert all content in the PDF into an Excel or tabular JSON file, so non-tabular data, such as paragraphs, will be included. If this data is not required in the output, it will have to be manually removed post-conversion.

// Extract tabular data as a JSON file 
DataExtractionModule.ExtractData(input_path + "table.pdf", output_path + "table.json", DataExtractionModule.DataExtractionEngine.e_tabular); 

// Extract tabular data as a JSON string 
string json = DataExtractionModule.ExtractData(input_path + "financial.pdf", DataExtractionModule.DataExtractionEngine.e_tabular); 
System.IO.File.WriteAllText(output_path + "financial.json", json); 

// Extract tabular data as an XLSX file 
DataExtractionModule.ExtractToXLSX(input_path + "table.pdf", output_path + "table.xlsx"); 

// Extract tabular data as an XLSX stream (also known as filter) 
MemoryFilter output_xlsx_stream = new MemoryFilter(0, false); 
DataExtractionModule.ExtractToXLSX(input_path + "financial.pdf", output_xlsx_stream); 
output_xlsx_stream.SetAsInputFilter(); 
output_xlsx_stream.WriteToFile(output_path + "financial.xlsx", false);

TestDocumentStructure() will convert some sample PDFs into a document structure JSON which describes the PDF in its entirety. This JSON will contain a JSON element for every item in the PDF, whether it’s text, images, graphics, or tables. Each element will have position data as well as text formatting so that the JSON is an accurate 1:1 reconstruction of the PDF.

// Extract document structure as a JSON file 
DataExtractionModule.ExtractData(input_path + "paragraphs_and_tables.pdf", output_path + "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure); 

// Extract document structure as a JSON string 
string json = DataExtractionModule.ExtractData(input_path + "tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure); 
System.IO.File.WriteAllText(output_path + "tagged.json", json);

TestFormFields() will process a PDF with an AI-based algorithm and produce a JSON document describing the location and type of detected form fields. This AI will detect forms from not only PDF native forms but also flat non-interactive PDFs containing forms for printing, and additionally from scanned image-based documents.

// Extract form fields as a JSON file 
DataExtractionModule.ExtractData(input_path + "formfields-scanned.pdf", output_path + "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form); 

// Extract form fields as a JSON string 
string json = DataExtractionModule.ExtractData(input_path + "formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form); 
System.IO.File.WriteAllText(output_path + "formfields.json", json);

Sample documents utilized in the code are available for viewing in C:\PDFNetC64\Samples\TestFiles

6. To run the sample code, there is a RunTest.bat batch file at C:\PDFNetC64\Samples\DataExtractionTest\CS\RunTest.bat. This will execute the command line application built by DataExtractionTest.cs. If successful, no errors should be visible in the terminal.

7. To view the output of the sample code functions, open the folder at C:\PDFNetC64\Samples\TestFiles\Output. Here you will find JSON documents and Excel files, which are the result of the previously mentioned test functions.

Conclusion

Copied to clipboard

The sample project will show that very few lines of code are required to extract data from PDFs using the Apryse SDK and the Data Extraction Module. Visit our Intelligent Data Extraction guide for more details on our cross-platform API, or for more general help with .NET development and the Apryse SDK, visit our developer guides for .NET.

When you’re ready to add IDP and intelligent data extraction to your existing Apryse Server SDK license, contact our sales team.