By Valerie Yates, John Chow | 2023 Feb 17
New Apryse Intelligent Document Processing, featuring Data Extraction capability, unlocks information trapped in any PDF. Automatically convert content elements and structure to JSON or XLSX with leading accuracy and at scale.
We’re excited to announce the arrival of Apryse’s Intelligent Document Processing (IDP), a significant step forward in enabling your IDP processes and efficiently extracting contents from stored documents into your database at any scale.
This article, part one of a three-part series, introduces the new IDP solution and breaks down its all-new data extraction capability.
Apryse IDP includes powerful PDF data extraction that recognizes any document layout and extracts its content elements, such as tabular data and text, to structured JSON and Excel right out of the box. As a result, organizations get scalability and leading accuracy in PDF data extraction while eliminating the costs associated with extensive templating, rules, and data entry.
In part two, we look at examples of how Apryse IDP allows businesses to enjoy new levels of operational efficiency when extracting, processing, or analyzing content from PDFs.
Update - 2023-02-28: Part 3 of this release series explores the new form field detection feature, powered by Apryse AI and also part of the IDP add-on.
Learn more about the intelligent data extraction capability.
PDF data extraction presents challenges because PDFs are designed to convey information, such as natural reading order, to humans, not machines. Under the hood, a PDF file is composed of low-level Cos objects, and its internal structure does not mirror what appears on the page. A reader application parses the PDF file, and once the objects are extracted, the output may not be in reading order or even resemble the original human-intended PDF.
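To make the reading-order problem concrete, here is a minimal Python sketch. It is not how Apryse IDP works internally; it is a naive single-column heuristic that groups text fragments into lines by baseline and then sorts them left to right. The Fragment type, the reading_order function, and the sample values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text as it might come out of a PDF content stream (hypothetical type)."""
    text: str
    x: float  # left edge in PDF user space
    y: float  # baseline in PDF user space (origin bottom-left, y grows upward)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then order top-to-bottom, left-to-right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    ]


# Made-up fragments listed in "stream order", which differs from visual order.
stream_order = [
    Fragment("Total", 72, 700),
    Fragment("Quarterly", 72, 720),
    Fragment("$1,200", 300, 700.5),
    Fragment("Report", 140, 720),
]

print(reading_order(stream_order))
```

A heuristic like this breaks down quickly on multi-column pages, tables, and sidebars, which is one reason conventional extractors lean on templates and rules.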
As a result, conventional extractors require extensive upfront work, such as templating, data entry, or training documents, to infer logical structure from a PDF and get meaningful data out. Additionally, semi-structured PDF content is highly variable, so it is time-consuming and costly for developers to create customized templates that prepare the extractor for every possible document type. These factors make it difficult for organizations to automate their PDF data processing at scale.
But what if PDF data extraction worked right out of the box instead, without training the model on every type of document used across your organization, without creating rules, and without having to check for errors post-conversion? Apryse IDP does just that for any structured or semi-structured data in a PDF, and it offers a choice of output formats to suit your processing needs. It reliably recognizes tables, accurately extracts text and tabular data, and detects and understands articles of text in a document.
Getting set up is straightforward. New Intelligent Data Extraction features are part of the IDP add-on to the Apryse Server SDK, meaning you can use your language of choice to embed the API into your application. Developers get complete control over extracted data and the workflow itself. Apryse IDP provides greater reliability, performance, and cost-effective scalability compared to external extraction services and on-demand processing.
Apryse’s Intelligent Data Extraction capability looks like this:
Let’s take a closer look.
Table Recognition detects table boundaries, rows, and columns, as well as challenging features such as spanning cells that trip up some extraction tools.
A PDF Table in an SEC 10-Q Report
Our Excel output capturing a 10-Q table
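Spanning cells are worth a closer look, because a merged header only becomes usable data once it is mapped back onto a regular grid. The Python sketch below shows one way to do that. The cell dictionaries, their keys (row, col, row_span, col_span, text), and the expand_spans function are hypothetical and are not the Apryse output schema; the values are made up for illustration.

```python
def expand_spans(cells, n_rows, n_cols):
    """Copy each cell's text into every grid position that the cell covers."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        row_end = cell["row"] + cell.get("row_span", 1)
        col_end = cell["col"] + cell.get("col_span", 1)
        for r in range(cell["row"], row_end):
            for c in range(cell["col"], col_end):
                grid[r][c] = cell["text"]
    return grid


# Made-up table with a two-row label cell and a two-column header cell.
cells = [
    {"row": 0, "col": 0, "text": "Item", "row_span": 2},
    {"row": 0, "col": 1, "text": "Three months ended", "col_span": 2},
    {"row": 1, "col": 1, "text": "2022"},
    {"row": 1, "col": 2, "text": "2021"},
    {"row": 2, "col": 0, "text": "Revenue"},
    {"row": 2, "col": 1, "text": "1,200"},
    {"row": 2, "col": 2, "text": "1,050"},
]

grid = expand_spans(cells, n_rows=3, n_cols=3)
for row in grid:
    print(row)
```

Repeating the spanning text in every covered position is a design choice that suits database loading; a spreadsheet export would more likely keep the cell merged.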
The Table Data Extractor works within the recognizer workflow but is a separate function. It extracts tabular data into an Excel spreadsheet file, with one table per sheet, and to a JSON companion file.
You can use the extractor to pull out just one table element — from one PDF or many PDFs at once.
Structure Recognition refers to awareness of how the content elements in a PDF are positioned on a page and in relation to each other, rather than simply what these elements are (such as text and tables). As part of parsing the PDF, the Intelligent Data Extraction component reconstructs the formatting and layout (structure) of content elements and what they look like on screen into JSON.
The component recognizes headers, footers, paragraphs, and reading order and puts table and image coordinates into the JSON file. Along with the JSON instructions on their placement, this makes it easy to repurpose content and reflow the document, for example, if you want to republish content in another application, such as a mobile viewer.
Formatted text in our 10-Q
Image of the document structure in JSON
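As a rough illustration of how structure output could drive reflow, the Python sketch below drops headers and footers, follows the reading order, and emits placeholders for non-text elements. The JSON shape used here (an elements list with type, order, text, and rect keys) is an assumption made for this example and is not the actual Apryse schema; refer to the JSON documentation for the real structure and tags.

```python
import json

# Hypothetical structure output; element order in the file is deliberately shuffled.
sample = json.loads("""
{
  "elements": [
    {"type": "footer", "order": 3, "text": "Page 1"},
    {"type": "paragraph", "order": 1, "text": "Revenue increased during the quarter."},
    {"type": "header", "order": 0, "text": "FORM 10-Q"},
    {"type": "table", "order": 2, "rect": [72, 400, 540, 600]}
  ]
}
""")


def reflow(doc):
    """Return body content in reading order, skipping page headers and footers."""
    lines = []
    for el in sorted(doc["elements"], key=lambda e: e["order"]):
        if el["type"] in ("header", "footer"):
            continue  # page furniture is not useful in a reflowed view
        if el["type"] == "paragraph":
            lines.append(el["text"])
        else:
            # Tables and images become placeholders sized from their coordinates.
            x1, y1, x2, y2 = el["rect"]
            lines.append(f"[{el['type']} {x2 - x1:g}x{y2 - y1:g} pt]")
    return lines


print(reflow(sample))
```

A mobile viewer could take a list like this and lay out each entry at the width of the screen, swapping in the real table or image for each placeholder.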
The new Apryse Intelligent Document Processing solution, including data extraction, enables efficient PDF content extraction with leading accuracy and without the need for extensive upfront training or templates.
Try out the new IDP features in your environment (no trial key required) and let us know how the solution performed. Visit the documentation for samples and feature details. Also, visit our page on JSON for details on the output structure and tags.
As always, your feedback is invaluable, helping us continually tune and improve performance and accuracy. If you have any issues or questions during your free trial, don’t hesitate to reach out via trial support.
When you’re ready to start with the Apryse Server SDK and IDP, or to add IDP to your existing Apryse Server SDK license, contact Sales.