Apryse Answers: Diving into PDF Data Extraction

Isaac Maw

Technical Content Creator

Published June 17, 2026

Updated June 17, 2026

5 min

Summary: What is the best package to read text content from a PDF? Apryse Answers Episode 4 explores PDF data extraction by solving real developer questions from Reddit. Learn how Apryse Smart Data Extraction goes beyond OCR to preserve document structure, extract tables, identify key-value pairs, classify documents, and process scanned PDFs. Discover how to build intelligent extraction workflows using Apryse Server SDK, convert unstructured PDFs into AI-ready JSON, and improve automation, search, compliance, and document processing accuracy.

Smart Data Extraction

PDF documents are designed to put document style and content together in a single self-contained file, keeping formatting and appearance consistent across experiences. However, PDF documents aren’t machine-friendly, especially when they originate as scans or images of text.

OCR is one tool that helps detect text, but it doesn’t preserve document structure or formatting. When a PDF with columns, headings, footers, and tables is processed by OCR, the output is a wall of text, straight across the page. This jumbles content from different table columns, discards important structure such as headings, and results in more work to make the output usable.
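To make the column problem concrete, here is a minimal Python sketch. The `Word` tuple, the coordinates, and the fixed `column_split_x` are illustrative assumptions, not Apryse output and not how Smart Data Extraction detects layout. It only shows why reading straight across a two-column page scrambles sentences.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # line position, in points from the top of the page


def naive_reading_order(words: list[Word]) -> str:
    """Read straight across the page: sort by line, then left to right."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered)


def column_aware_order(words: list[Word], column_split_x: float) -> str:
    """Read the left column top to bottom, then the right column."""
    left = sorted((w for w in words if w.x < column_split_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= column_split_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# Hypothetical two-column page: each column holds one two-line sentence.
page = [
    Word("Invoices", 50, 100), Word("are", 120, 100),
    Word("due", 50, 115), Word("monthly.", 90, 115),
    Word("Refunds", 320, 100), Word("take", 390, 100),
    Word("ten", 320, 115), Word("days.", 350, 115),
]

print(naive_reading_order(page))
print(column_aware_order(page, column_split_x=300))
```

The first call prints "Invoices are Refunds take due monthly. ten days." and the second prints "Invoices are due monthly. Refunds take ten days." Real documents rarely have a single known column boundary, which is why layout-aware extraction is needed.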

Apryse Answers is our video series where Developer Relations Manager April Schuppel finds real developer questions on Reddit and answers them using Apryse SDKs. In Episode 4, April tackles questions about PDF data extraction.

Find the code from the video on GitHub.

What is Smart Data Extraction?

Apryse Smart Data Extraction is a set of SDK tools that solve the challenges of PDF data extraction. Smart Data Extraction goes beyond OCR to capture valuable data from a range of document structures, including tables, headings, footers, and unstructured text such as paragraphs.

Try the Demo in our Showcase

Smart Data Extraction includes the following primary modes of intelligent extraction:

Tabular Data Extraction
- Extract tables from PDFs—even with merged cells or multi-row headers—and export to JSON or Excel for reporting, analysis, or AI.
Document Structure Recognition
- Parse the full logical structure: headers, footers, lists, images, styling, and paragraphs. Ideal for screen reading, content routing, transformation, or compliance workflows.
Form Field Identification
- Detect visual fields in flat PDFs and generate fillable interactive forms or structured JSON for onboarding or form reuse.
Key-Value Extraction
- Identify key-value relationships in documents with no explicit form layout. Extract data from invoices, resumes, and informal layouts without setting up templates or rules.
- Includes exclusive training that supports key-value extraction from title blocks on CAD and other technical drawings.
Document Classification
- Assign predefined categories to document pages based on their content and structure.

Let’s take a look at how April answered three questions from Reddit, then dive into the documentation for building Extraction workflows.

Question 1: What is the best package to read text content from a PDF in JS?

This user needs to programmatically read text from a PDF. This could be required to help route documents at intake for processing or storage, or to use document data to power other automation, such as AI summaries, document generation, or trigger events.

Apryse SDK can go beyond OCR to improve outcomes and reduce errors for these use cases. For example, Apryse Document Classification detects document types, and Key-Value Extraction pulls essential data, such as names and numbers, from unstructured text.
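As a rough illustration of routing at intake, the sketch below picks a destination for a page from classification results. The `(label, confidence)` pairs, the label names, and the queue names are hypothetical and do not reflect the actual JSON produced by Document Classification; the 0.7 default mirrors the confidence threshold used in the sample code for this post.

```python
def route_page(predictions, routes, min_confidence=0.7, fallback="manual_review"):
    """Pick a destination queue for one page from (label, confidence) pairs."""
    # Drop predictions that are below the confidence threshold.
    confident = [(label, conf) for label, conf in predictions if conf >= min_confidence]
    if not confident:
        return fallback
    # Use the most confident remaining label; unknown labels go to the fallback.
    best_label, _ = max(confident, key=lambda pair: pair[1])
    return routes.get(best_label, fallback)


# Hypothetical label-to-queue mapping.
routes = {"invoice": "accounts_payable", "email": "correspondence"}

print(route_page([("invoice", 0.93), ("email", 0.04)], routes))
print(route_page([("invoice", 0.55), ("email", 0.40)], routes))
```

The first page is routed to `accounts_payable`. The second has no prediction above the threshold, so it falls back to `manual_review` rather than being filed on a weak guess.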

Question 2: Can you extract text from a PDF scan?

I had a patron who had a document that she needed to edit. She did not have the original Word file. Another colleague thought we could scan her doc to PDF and then extract the text from that PDF to insert into a new Word doc that she could edit. But I cannot find if this is a thing that can be done. Anyone know?

Scanning a document results in an image of text, which requires OCR to convert into searchable text. Smart Data Extraction can power downstream processing of scanned documents, and Apryse PDF to DOCX conversion could also serve this user by producing a Word file to edit.

Question 3: Is there a tool which extracts the text from a PDF, but keeps formatting?

For my work, I need to extract the text from PDFs quite a lot and also keep the formatting. I used to do it manually, but recently found pdftotext by xpdf, which speeds the process up. However, this only creates a .txt file with plain text and no formatting (only bold, italics, underlined, and regular would be enough).

Is there a tool which extracts the text from a PDF and keeps formatting?

Losing the formatting of a document is a major reason why developers seek out more capable data extraction solutions beyond OCR.
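The user only needs bold, italic, underlined, and regular text. Once an extractor reports text together with its styling, carrying that into an output format is straightforward. The sketch below assumes a hypothetical `Run` record per styled span (this is not the Apryse JSON schema) and renders the runs as simple HTML.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Run:
    """A span of text with uniform styling (hypothetical structure)."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False


def runs_to_html(runs):
    """Render styled runs as HTML, escaping the text of each run."""
    parts = []
    for run in runs:
        piece = escape(run.text)
        # Wrap from the innermost tag outwards: <b><i><u>text</u></i></b>.
        if run.underline:
            piece = f"<u>{piece}</u>"
        if run.italic:
            piece = f"<i>{piece}</i>"
        if run.bold:
            piece = f"<b>{piece}</b>"
        parts.append(piece)
    return "".join(parts)


runs = [Run("Total: "), Run("$5 & up", bold=True), Run(" (estimated)", italic=True)]
print(runs_to_html(runs))
```

This prints `Total: <b>$5 &amp; up</b><i> (estimated)</i>`. Plain-text tools such as pdftotext never produce the style flags in the first place, which is the gap a structure-aware extractor fills.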

How to Build Smart Data Extraction

Get Your Trial Key

To try any Apryse SDK in your environment, start with your free trial key.

Get Started with Server SDK

The Smart Data Extraction module is part of the Apryse Server SDK. To install the Server SDK, follow the instructions in our documentation based on your specific framework and language requirements.

The Server SDK supports a number of frameworks, runtimes, and languages on all major platforms for delivering applications from a single codebase.

Install the Smart Data Extraction Module

When using Python on Windows or Linux, you can install the package via pip with this command:

pip install --extra-index-url=https://pypi.apryse.com apryse-data-extraction

You can find other installation methods, such as npm for Node.js or installing the package directly, in the documentation guide.

For error handling purposes, it is generally advisable to test whether the module is available via the IsModuleAvailable function. Since the Data Extraction suite consists of multiple modules, an extra parameter is used to clarify the component to test.

if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Tabular):
   pass # Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_DocStructure):
   pass # Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Form):
   pass # Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_GenericKeyValue):
   pass # Unable to run Data Extraction: PDFTron SDK AIGenericKeyValue module not available.
if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_DocClassification):
   pass # Unable to run Data Extraction: PDFTron SDK AIDocClassification module not available.
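The repeated checks above can be folded into one helper that reports every missing module at once. This is a sketch, not part of the SDK: `missing_engines` is a hypothetical helper, and the stub below stands in for `DataExtractionModule.IsModuleAvailable` so the example runs without the SDK installed.

```python
def missing_engines(is_available, engines):
    """Return the names of engines whose module is not available.

    is_available: callable taking an engine constant and returning a bool,
                  for example DataExtractionModule.IsModuleAvailable.
    engines:      mapping of display name to engine constant.
    """
    return [name for name, engine in engines.items() if not is_available(engine)]


# Stand-in for the SDK call; the integer engine ids are placeholders,
# not the real values of DataExtractionModule.e_Tabular and friends.
installed = {1, 2}


def fake_is_available(engine):
    return engine in installed


engines = {"Tabular": 1, "DocStructure": 2, "Form": 3}
missing = missing_engines(fake_is_available, engines)
if missing:
    print("Data Extraction modules not available: " + ", ".join(missing))
```

In a real project, you would pass `DataExtractionModule.IsModuleAvailable` as the first argument and map each name to the matching constant, such as `DataExtractionModule.e_Tabular`.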

Using the Extraction Engines

The simplest way to use Document Classification, Document Structure Recognition, and the other tools in Smart Data Extraction is to specify the name of the input PDF file and the name of the output JSON file, then select the required engine:

DataExtractionModule.ExtractData("Invoice.pdf", "Invoice_Classified.json", DataExtractionModule.e_DocClassification)

The engines are:

DataExtractionModule.e_DocStructure
DataExtractionModule.e_DocClassification
DataExtractionModule.e_Tabular
DataExtractionModule.e_Form
DataExtractionModule.e_GenericKeyValue

This code outputs document data in a JSON file, but additional configuration is possible for each module. Check out the options, such as outputting a JSON string instead, in the documentation.
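Before writing code against the output, it helps to look at its shape. The sketch below summarizes the top level of a JSON string using only Python's standard library. The placeholder string is made up for the demo and is not the real output schema, which is described in the documentation.

```python
import json


def describe_json(text):
    """Summarize the top level of a JSON document without assuming a schema."""
    data = json.loads(text)
    if isinstance(data, dict):
        # Map each top-level key to the Python type of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    return type(data).__name__


# Placeholder JSON for illustration only; real extraction output will differ.
sample = '{"pages": [], "version": 1}'
print(describe_json(sample))
```

The same function works on a string returned by the SDK or on the contents of a saved file such as `Invoice_Classified.json` read with `open(...).read()`.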

Preprocessing for Data Extraction

Before extraction begins, documents often need to be cleaned, normalized, or digitized. Apryse supports a full preprocessing toolkit—so your inputs are structured, accurate, and AI-ready.

These capabilities are modular and can be used independently or together, depending on your workflow:

OCR (Optical Character Recognition)
Converts scanned or image-based PDFs into machine-readable text.
Deskewing & Despeckling
Cleans up crooked or noisy scans—improving OCR, table parsing, and layout accuracy.
Layer Flattening
Normalizes multi-layer PDFs for consistent rendering and analysis.
Rotation & Cleanup
Re-orients pages and removes visual clutter like stamps or overlays.
Redaction
Removes sensitive or unwanted content—ideal before sending data to AI or external systems.
PDF Conversion
Convert documents to HTML, Word, Excel, or JSON for labeling, annotation, or system integration.

These preprocessing tools improve downstream performance across:

SLM training pipelines
RAG and semantic search
Compliance automation and classification workflows

No hallucinations. No unstructured text blobs. Just labeled, model-ready JSON.
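Because the preprocessing capabilities are modular, a workflow can treat each one as an optional step applied in order. The sketch below shows that composition pattern only: the step functions and the dictionary standing in for a document are placeholders, not Apryse API calls.

```python
from functools import reduce


def build_pipeline(*steps):
    """Compose preprocessing steps into one function applied left to right."""
    def run(document):
        return reduce(lambda doc, step: step(doc), steps, document)
    return run


# Placeholder steps: each takes a document record and returns an updated copy.
def deskew(doc):
    return {**doc, "skew_degrees": 0.0}


def ocr(doc):
    return {**doc, "has_text_layer": True}


def redact(doc):
    return {**doc, "redacted": True}


scan = {"name": "intake.pdf", "skew_degrees": 2.5, "has_text_layer": False}

# Use only the steps a given workflow needs.
prepare_scan = build_pipeline(deskew, ocr)
prepare_for_ai = build_pipeline(deskew, ocr, redact)

print(prepare_scan(scan))
print(prepare_for_ai(scan))
```

Each step returns a new record instead of mutating its input, so the same scan can be passed through different pipelines without side effects.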

Sample Code: Extract Tabular Data, Form Fields and Document Structure from a PDF

This sample code shows how to use the Apryse Data Extraction module to extract tabular data, document structure, and form fields from PDF documents. The sample is also available in other languages.

Next Steps

If you have any questions about your Smart Data Extraction trial and how to get started with licensing for your use case, contact sales. Check out the GitHub repository for Apryse Answers to find the specific project demonstrated in the video!

Apryse Answers continues next week with questions about annotations and document collaboration. See you there!

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2025 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

import site
site.addsitedir("../../../PDFNetC/Lib")
import sys
from PDFNetPython import *

import platform

sys.path.append("../../LicenseKey/PYTHON")
from LicenseKey import *

#---------------------------------------------------------------------------------------
# The Data Extraction suite is an optional PDFNet add-on collection that can be used to
# extract various types of data from PDF documents.
#
# The Apryse SDK Data Extraction suite can be downloaded from
# https://docs.apryse.com/core/guides/info/modules#data-extraction-module
#
# Please contact us if you have any questions.
#---------------------------------------------------------------------------------------

# Relative path to the folder containing the test files.
inputPath = "../../TestFiles/"
outputPath = "../../TestFiles/Output/"

def WriteTextToFile(outputFile, text):
    # Write the contents of text to the disk
    f = open(outputFile, "w")
    try:
        f.write(text)
    finally:
        f.close()

def main():
    # The first step in every application using PDFNet is to initialize the
    # library. The library is usually initialized only once, but calling
    # Initialize() multiple times is also fine.
    PDFNet.Initialize(LicenseKey)
    PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")

    #-----------------------------------------------------------------------------------
    # The following sample illustrates how to extract tables from PDF documents.
    #-----------------------------------------------------------------------------------

    # Test if the add-on is installed
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Tabular):
        print("")
        print("Unable to run Data Extraction: Apryse SDK Tabular Data module not available.")
        print("-----------------------------------------------------------------------------")
        print("The Data Extraction suite is an optional add-on, available for download")
        print("at https://docs.apryse.com/core/guides/info/modules#data-extraction-module . If you have already")
        print("downloaded this module, ensure that the SDK is able to find the required files")
        print("using the PDFNet.AddResourceSearchPath() function.")
        print("")
    else:
        try:
            # Extract tabular data as a JSON file
            print("Extract tabular data as a JSON file")

            outputFile = outputPath + "table.json"
            DataExtractionModule.ExtractData(inputPath + "table.pdf", outputFile, DataExtractionModule.e_Tabular)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Extract tabular data as a JSON string
            print("Extract tabular data as a JSON string")

            outputFile = outputPath + "financial.json"
            json = DataExtractionModule.ExtractData(inputPath + "financial.pdf", DataExtractionModule.e_Tabular)
            WriteTextToFile(outputFile, json)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Extract tabular data as an XLSX file
            print("Extract tabular data as an XLSX file")

            outputFile = outputPath + "table.xlsx"
            DataExtractionModule.ExtractToXLSX(inputPath + "table.pdf", outputFile)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Extract tabular data as an XLSX stream (also known as filter)
            print("Extract tabular data as an XLSX stream")

            outputFile = outputPath + "financial.xlsx"
            options = DataExtractionOptions()
            options.SetPages("1")  # page 1
            outputXlsxStream = MemoryFilter(0, False)
            DataExtractionModule.ExtractToXLSX(inputPath + "financial.pdf", outputXlsxStream, options)
            outputXlsxStream.SetAsInputFilter()
            outputXlsxStream.WriteToFile(outputFile, False)

            print("Result saved in " + outputFile)
        except Exception as e:
            print("Unable to extract tabular data, error: " + str(e))

    #-----------------------------------------------------------------------------------
    # The following sample illustrates how to extract document structure from PDF documents.
    #-----------------------------------------------------------------------------------

    # Test if the add-on is installed
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_DocStructure):
        print("")
        print("Unable to run Data Extraction: PDFTron SDK Structured Output module not available.")
        print("-----------------------------------------------------------------------------")
        print("The Data Extraction suite is an optional add-on, available for download")
        print("at https://docs.apryse.com/core/guides/info/modules#data-extraction-module . If you have already")
        print("downloaded this module, ensure that the SDK is able to find the required files")
        print("using the PDFNet.AddResourceSearchPath() function.")
        print("")
    else:
        try:
            # Extract document structure as a JSON file
            print("Extract document structure as a JSON file")

            outputFile = outputPath + "paragraphs_and_tables.json"
            DataExtractionModule.ExtractData(inputPath + "paragraphs_and_tables.pdf", outputFile, DataExtractionModule.e_DocStructure)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Extract document structure as a JSON string
            print("Extract document structure as a JSON string")

            outputFile = outputPath + "tagged.json"
            json = DataExtractionModule.ExtractData(inputPath + "tagged.pdf", DataExtractionModule.e_DocStructure)
            WriteTextToFile(outputFile, json)

            print("Result saved in " + outputFile)
        except Exception as e:
            print("Unable to extract document structure data, error: " + str(e))

    #-----------------------------------------------------------------------------------
    # The following sample illustrates how to extract form fields from PDF documents.
    #-----------------------------------------------------------------------------------

    # Test if the add-on is installed
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Form):
        print("")
        print("Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.")
        print("-----------------------------------------------------------------------------")
        print("The Data Extraction suite is an optional add-on, available for download")
        print("at https://docs.apryse.com/core/guides/info/modules#data-extraction-module . If you have already")
        print("downloaded this module, ensure that the SDK is able to find the required files")
        print("using the PDFNet.AddResourceSearchPath() function.")
        print("")
    else:
        try:
            # Extract form fields as a JSON file
            print("Extract form fields as a JSON file")

            outputFile = outputPath + "formfields-scanned.json"
            DataExtractionModule.ExtractData(inputPath + "formfields-scanned.pdf", outputFile, DataExtractionModule.e_Form)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Extract form fields as a JSON string
            print("Extract form fields as a JSON string")

            outputFile = outputPath + "formfields.json"
            json = DataExtractionModule.ExtractData(inputPath + "formfields.pdf", DataExtractionModule.e_Form)
            WriteTextToFile(outputFile, json)

            print("Result saved in " + outputFile)

            #-----------------------------------------------------------------------------------
            # Detect and add form fields to a PDF document.
            # PDF document already has form fields, and this sample will update to new found fields.
            print("Extract form fields as a pdf file, update to new")

            doc = PDFDoc(inputPath + "formfields-scanned-withfields.pdf")
            DataExtractionModule.DetectAndAddFormFieldsToPDF(doc)

            outputFile = outputPath + "formfields-scanned-fields-new.pdf"
            doc.Save(outputFile, SDFDoc.e_linearized)
            doc.Close()

            print("Result saved in " + outputFile)

            #-----------------------------------------------------------------------------------
            # Detect and add form fields to a PDF document.
            # PDF document already has form fields, and this sample will keep the original fields.
            print("Extract form fields as a pdf file, keep original")

            doc = PDFDoc(inputPath + "formfields-scanned-withfields.pdf")
            options = DataExtractionOptions()
            options.SetOverlappingFormFieldBehavior("KeepOld")
            DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)

            outputFile = outputPath + "formfields-scanned-fields-old.pdf"
            doc.Save(outputFile, SDFDoc.e_linearized)
            doc.Close()

            print("Result saved in " + outputFile)
        except Exception as e:
            print("Unable to extract form fields data, error: " + str(e))

    #---------------------------------------------------------------------------------------
    # The following sample illustrates how to extract key-value pairs from PDF documents.
    #---------------------------------------------------------------------------------------

    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_GenericKeyValue):
        print()
        print("Unable to run Data Extraction: Apryse SDK AIPageObjectExtractor module not available.")
        print("---------------------------------------------------------------")
        print("The Data Extraction suite is an optional add-on, available for download")
        print("at https://docs.apryse.com/core/guides/info/modules#data-extraction-module . If you have already downloaded this")
        print("module, ensure that the SDK is able to find the required files")
        print("using the PDFNet.AddResourceSearchPath() function.")
        print()
    else:
        try:
            print("Extract key-value pairs from a PDF")

            # Simple example: Extract Keys & Values as a JSON file
            DataExtractionModule.ExtractData(inputPath + "newsletter.pdf", outputPath + "newsletter_key_val.json", DataExtractionModule.e_GenericKeyValue)
            print("Result saved in " + outputPath + "newsletter_key_val.json")

            # Example with customized options:
            # Extract Keys & Values from pages 2-4, excluding ads
            options = DataExtractionOptions()
            options.SetPages("2-4")

            p2_exclusion_zones = RectCollection()
            # Exclude the add-on on page 2
            # These coordinates are in PDF user space, with the origin at the bottom left corner of the page
            # Coordinates rotate with the page, if it has rotation applied.
            p2_exclusion_zones.AddRect(Rect(166, 47, 562, 222))
            options.AddExclusionZonesForPage(p2_exclusion_zones, 2)

            p4_inclusion_zones = RectCollection()
            p4_exclusion_zones = RectCollection()
            # Only include the article text for page 4, exclude ads and headings
            p4_inclusion_zones.AddRect(Rect(30, 432, 562, 684))
            p4_exclusion_zones.AddRect(Rect(30, 657, 295, 684))
            options.AddInclusionZonesForPage(p4_inclusion_zones, 4)
            options.AddExclusionZonesForPage(p4_exclusion_zones, 4)

            print("Extract Key-Value pairs from specific pages and zones as a JSON file")
            DataExtractionModule.ExtractData(inputPath + "newsletter.pdf", outputPath + "newsletter_key_val_with_zones.json", DataExtractionModule.e_GenericKeyValue, options)
            print("Result saved in " + outputPath + "newsletter_key_val_with_zones.json")
        except Exception as e:
            print("Unable to extract key-value data, error: " + str(e))

    #-----------------------------------------------------------------------------------
    # The following sample illustrates how to extract document classes from PDF documents.
    #-----------------------------------------------------------------------------------

    # Test if the add-on is installed
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_DocClassification):
        print("")
        print("Unable to run Data Extraction: PDFTron SDK AIPageObjectExtractor module not available.")
        print("-----------------------------------------------------------------------------")
        print("The Data Extraction suite is an optional add-on, available for download")
        print("at https://docs.apryse.com/documentation/core/info/modules/. If you have already")
        print("downloaded this module, ensure that the SDK is able to find the required files")
        print("using the PDFNet.AddResourceSearchPath() function.")
        print("")
    else:
        try:
            # Simple example: classify pages as a JSON file
            print("Classify pages as a JSON file")

            outputFile = outputPath + "Invoice_Classified.json"
            DataExtractionModule.ExtractData(inputPath + "Invoice.pdf", outputFile, DataExtractionModule.e_DocClassification)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Classify pages as a JSON string
            print("Classify pages as a JSON string")

            outputFile = outputPath + "Scientific_Publication_Classified.json"
            json = DataExtractionModule.ExtractData(inputPath + "Scientific_Publication.pdf", DataExtractionModule.e_DocClassification)
            WriteTextToFile(outputFile, json)

            print("Result saved in " + outputFile)

            #------------------------------------------------------
            # Example with customized options:
            print("Classify pages with customized options")

            options = DataExtractionOptions()
            # Classes that don't meet the minimum confidence threshold of 70% will not be listed in the output JSON
            options.SetMinimumConfidenceThreshold(0.7)
            outputFile = outputPath + "Email_Classified.json"
            DataExtractionModule.ExtractData(inputPath + "Email.pdf", outputFile, DataExtractionModule.e_DocClassification, options)

            print("Result saved in " + outputFile)
        except Exception as e:
            print("Unable to extract document structure data, error: " + str(e))

    PDFNet.Terminate()
    print("Done.")

if __name__ == '__main__':
    main()
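The key-value sample notes that zone coordinates are in PDF user space, with the origin at the bottom left corner of the page. Many image and annotation tools measure from the top left instead, so a small conversion helps when defining zones. The helper below is a hypothetical utility, not part of the SDK, and the example assumes an unrotated page that is 792 points tall (US Letter); the actual size of the sample newsletter is not stated.

```python
def top_left_to_pdf_rect(left, top, width, height, page_height):
    """Convert a box measured from the top-left corner of the page into
    (x1, y1, x2, y2) in PDF user space, where the origin is bottom-left.

    All values are in points. Assumes the page has no rotation applied.
    """
    x1 = left
    x2 = left + width
    y2 = page_height - top             # upper edge of the box
    y1 = page_height - (top + height)  # lower edge of the box
    return (x1, y1, x2, y2)


# A box 396 x 175 points, placed 166 points from the left and 570 from the top.
zone = top_left_to_pdf_rect(left=166, top=570, width=396, height=175, page_height=792)
print(zone)
```

Under those assumptions the result is `(166, 47, 562, 222)`, the same four numbers passed to `Rect(...)` for the page 2 exclusion zone in the sample.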