Don’t “Dig for Data” — Automate Accurate Data Extraction from PDF with Apryse IDP

By Valerie Yates, John Chow | 2023 Feb 17

5 min

What Does Apryse IDP with Intelligent PDF Data Extraction Look Like?

PDF data extraction is challenging because PDFs are designed to present information to humans, in a natural reading order, rather than to machines. Under the hood, a PDF file is composed of Cos objects and isn’t WYSIWYG. A reader application parses the file, and once the objects are extracted, the output may not be in reading order or even resemble the PDF its author intended.

As a result, conventional extractors require extensive upfront work: templating, data entry, or training documents to infer logical structures from PDF and to get meaningful data out. Additionally, semi-structured PDF content is highly variable, meaning it’s time-consuming and costly for developer staff to create customized templates to prepare the extractor for every possible document type. These factors make it difficult for organizations to automate their PDF data processing at scale.

But what if PDF data extraction worked right out of the box instead, without training the model on every type of document used across your organization, without creating rules, and without checking for errors post-conversion? Apryse IDP does just that for any structured or semi-structured data in PDF, and it offers several output formats for downstream processing. It reliably recognizes tables, accurately extracts text and tabular data, and detects and understands articles of text in a document.

The results should speak for themselves. So, visit the documentation to learn more about Apryse IDP and what it can do in your environment. There's no trial key required to get started.

Secure, Fast Setup

Getting set up is straightforward. New Intelligent Data Extraction features are part of the IDP add-on to the Apryse Server SDK, meaning you can use your language of choice to embed the API into your application. Developers get complete control over extracted data and the workflow itself. Apryse IDP provides greater reliability, performance, and cost-effective scalability compared to external extraction services and on-demand processing.

How Intelligent Data Extraction Just Works

Apryse’s Intelligent Data Extraction capability looks like this:

No rules or templates — Avoids extensive upfront work associated with templating and drives accuracy and cost-effective scalability.
Handles multi-modal content — Text, tables, and forms, including fields in informal forms, such as scanned PDFs or forms specified in text without interactive form data in the file.
Layout awareness — Preserves natural reading order, logical relationships between elements, and contiguous blocks of text in JSON, for error-free extraction.
Conversion to different file formats — JSON and Excel allow flexible data consumption.
Table and cell recognition — Extracts tabular data into Excel, enabling easy analysis or further processing. Provides coordinates to tables in the JSON.

Let’s take a closer look.

Table Detection and Tabular Data Extraction

Table Recognition detects table boundaries, rows, and columns, and handles challenging features such as spanning cells, which trip up some extraction tools.

A PDF Table in an SEC 10-Q Report

Image of table data converted to JSON, then opened in Excel

Our Excel output capturing a 10-Q table

The Table Data Extractor works within the recognizer workflow but is a separate function. It extracts tabular data into an Excel spreadsheet file, with one table per sheet, and to a JSON companion file.

You can use the extractor to pull out just one table element — from one PDF or many PDFs at once.
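To make the idea of pulling a single table out of the JSON companion file concrete, here is a minimal Python sketch. It does not call the Apryse SDK, and the JSON layout it reads (`tables`, `rect`, `cells`, `row`, `col`, `colSpan`, `rowSpan`) is a hypothetical schema invented for illustration; check the documentation for the actual output structure. The sketch picks one table, expands spanning cells into a dense grid, and writes the grid as CSV.

```python
import csv
import io
import json

# Hypothetical JSON companion content, invented for this example only.
# The real output schema may use different keys and nesting.
SAMPLE = """
{
  "tables": [
    {
      "page": 3,
      "rect": [72.0, 410.5, 540.0, 655.0],
      "cells": [
        {"row": 0, "col": 0, "colSpan": 2, "text": "Three months ended"},
        {"row": 1, "col": 0, "text": "2022"},
        {"row": 1, "col": 1, "text": "2021"},
        {"row": 2, "col": 0, "text": "1,204"},
        {"row": 2, "col": 1, "text": "1,187"}
      ]
    }
  ]
}
"""


def table_to_grid(table):
    """Expand a list of cells (with optional spans) into a dense grid."""
    cells = table["cells"]
    n_rows = max(c["row"] + c.get("rowSpan", 1) for c in cells)
    n_cols = max(c["col"] + c.get("colSpan", 1) for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Repeat a spanning cell's text in every grid position it covers.
        for r in range(c["row"], c["row"] + c.get("rowSpan", 1)):
            for k in range(c["col"], c["col"] + c.get("colSpan", 1)):
                grid[r][k] = c["text"]
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text, quoting values that contain commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


doc = json.loads(SAMPLE)
first_table = doc["tables"][0]  # select just one table element
grid = table_to_grid(first_table)
print(grid_to_csv(grid))
```

Repeating the spanning cell's text across the columns it covers is one design choice among several; leaving the extra positions empty would keep the output closer to how a merged cell looks in a spreadsheet.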

Structure Recognition — PDF to JSON

Structure Recognition refers to awareness of how the content elements in a PDF are positioned on a page and in relation to each other, rather than simply what these elements are (such as text and tables). As part of parsing the PDF, the Intelligent Data Extraction component captures the formatting and layout (structure) of content elements, and how they appear on screen, in JSON.

The component recognizes headers, footers, paragraphs, and reading order and puts table and image coordinates into the JSON file. Along with the JSON instructions on their placement, this makes it easy to repurpose content and reflow the document, for example, if you want to republish content in another application, such as a mobile viewer.

Formatted text in our 10-Q

Image of the document structure in JSON
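As a rough illustration of how structured JSON output could drive reflow for another viewer, the Python sketch below sorts elements by reading order, drops headers and footers, and emits a flat list of lines. The schema (`pages`, `elements`, `type`, `order`, `text`, `rect`) is an assumption made up for this example and is not the documented Apryse output format; the function name `reflow` is likewise hypothetical.

```python
import json

# Hypothetical structure output, invented for this example only.
SAMPLE = """
{
  "pages": [
    {
      "number": 1,
      "elements": [
        {"type": "header", "order": 0, "text": "FORM 10-Q"},
        {"type": "paragraph", "order": 2, "text": "Revenue increased compared with the prior quarter."},
        {"type": "paragraph", "order": 1, "text": "Item 2. Management's Discussion and Analysis"},
        {"type": "table", "order": 3, "rect": [72, 300, 540, 480]},
        {"type": "footer", "order": 4, "text": "Page 1"}
      ]
    }
  ]
}
"""


def reflow(doc, skip=("header", "footer")):
    """Return body content in reading order, with placeholders for tables."""
    lines = []
    for page in doc["pages"]:
        # Sort by the (assumed) reading-order index rather than file order.
        for el in sorted(page["elements"], key=lambda e: e["order"]):
            if el["type"] in skip:
                continue
            if el["type"] == "table":
                x1, y1, x2, y2 = el["rect"]
                lines.append(
                    f"[table on page {page['number']} at ({x1}, {y1}, {x2}, {y2})]"
                )
            else:
                lines.append(el["text"])
    return lines


reflowed = reflow(json.loads(SAMPLE))
for line in reflowed:
    print(line)
```

A mobile viewer could then render each line as its own block and use the table coordinates to crop or link back to the original page region.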

What’s Next with Apryse Intelligent Document Processing?

The new Apryse Intelligent Document Processing solution, including extraction, delivers efficient, highly accurate PDF content extraction without the need for extensive upfront training or templates.

Try out the new IDP features in your environment (no trial key required) and let us know how the solution performed. Visit the documentation for samples and feature details. Also, visit our page on JSON for details on the output structure and tags.

As always, your feedback is invaluable, helping us continually tune and improve performance and accuracy. If you have any issues with your free trial, don’t hesitate to send us your questions via trial support.

When you’re ready to start with the Apryse Server SDK and IDP, or to add IDP to your existing Apryse Server SDK license, contact Sales.