Putting Apryse’s New IDP and PDF Content Extractor Engine to Work

By Valerie Yates, John Chow | 2023 Feb 17

6 min

Why is JSONifying PDF So Useful?

Apryse intelligent data extraction identifies the logical reading order of a given PDF and outputs a JSON file that mirrors that order. Critically, the JSON captures the context of the data. In other words, rather than a raw data export, it preserves the structure of the content elements and the relationships between them.

The JSON file identifies text blocks along with their dimensions and regions. For example, you can quickly identify relevant paragraphs by running a programmatic word count on the text in the JSON. You can also find the coordinates of a table in the file for easy retrieval and verification.
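To make the word-count and table-coordinate ideas concrete, here is a minimal Python sketch. The JSON layout it reads (`pages`, `elements`, `type`, `rect`, `text`) is a simplified, hypothetical schema invented for illustration, not the documented Apryse output format, so adjust the field names to match what your extraction actually produces.

```python
import json

# Hypothetical, simplified extraction output. Field names are assumptions
# made for this sketch; the real output schema may differ.
SAMPLE_JSON = """
{
  "pages": [
    {
      "pageNumber": 1,
      "elements": [
        {"type": "paragraph", "rect": [72, 700, 540, 720],
         "text": "Quarterly Report"},
        {"type": "paragraph", "rect": [72, 600, 540, 690],
         "text": "Revenue grew steadily across all regions during the third quarter of the year."},
        {"type": "table", "rect": [72, 400, 540, 580],
         "rows": [["Region", "Revenue"], ["North", "120"]]}
      ]
    }
  ]
}
"""


def iter_elements(doc, element_type):
    """Yield (page_number, element) for every element of the given type."""
    for page in doc["pages"]:
        for element in page["elements"]:
            if element["type"] == element_type:
                yield page["pageNumber"], element


def long_paragraphs(doc, min_words):
    """Return paragraphs with more than min_words whitespace-separated words."""
    return [
        (page_no, element)
        for page_no, element in iter_elements(doc, "paragraph")
        if len(element["text"].split()) > min_words
    ]


def table_rects(doc):
    """Return (page_number, rect) for every table, for lookup or verification."""
    return [(page_no, el["rect"]) for page_no, el in iter_elements(doc, "table")]


doc = json.loads(SAMPLE_JSON)
for page_no, element in long_paragraphs(doc, min_words=5):
    print("paragraph on page", page_no, "at", element["rect"])
for page_no, rect in table_rects(doc):
    print("table on page", page_no, "at", rect)
```

Because the structure is already in the JSON, both lookups are plain filters over typed elements; no layout analysis is needed on your side.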

A well-structured JSON is your ally for automating document processes, as it lets you rebuild PDF content into a dataset — a format that programs easily understand, manipulate, and leverage at scale. Thus, data previously trapped in thousands of scanned or native PDFs becomes available to your organization in many ways. For example, you can feed that JSON data into a dashboard, a report, an analytics tool, and more.

To recap, Apryse’s new IDP with intelligent data extraction enables the following:

Text extraction: Intelligent data extraction looks for textual content, sentences, and paragraphs. For example, it can find all passages longer than a specified number of words. The component then pulls out those regions from the JSON file.
Tabular extraction: The tool identifies table cells, columns, rows, and spanning cells. You can pull tables out from the JSON document or extract to Excel if you want only tabular data in spreadsheets.
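
As a rough illustration of the tabular side, the sketch below expands a table with a spanning cell into a rectangular grid and writes it out as CSV. The cell fields (`row`, `col`, `rowSpan`, `colSpan`, `text`) are hypothetical names chosen for this example rather than the actual output schema, and repeating a spanning cell's text in every position it covers is just one possible convention.

```python
import csv
import io

# Hypothetical table element: each cell carries its grid position and an
# optional span. These field names are assumptions made for this sketch.
table = {
    "type": "table",
    "cells": [
        {"row": 0, "col": 0, "colSpan": 2, "text": "Q3 Revenue"},
        {"row": 1, "col": 0, "text": "North"},
        {"row": 1, "col": 1, "text": "120"},
        {"row": 2, "col": 0, "text": "South"},
        {"row": 2, "col": 1, "text": "95"},
    ],
}


def table_to_grid(table):
    """Expand cells into a rectangular grid, copying spanning cells
    into every position they cover."""
    cells = table["cells"]
    n_rows = max(c["row"] + c.get("rowSpan", 1) for c in cells)
    n_cols = max(c["col"] + c.get("colSpan", 1) for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c["row"], c["row"] + c.get("rowSpan", 1)):
            for k in range(c["col"], c["col"] + c.get("colSpan", 1)):
                grid[r][k] = c["text"]
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text that any spreadsheet tool can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


grid = table_to_grid(table)
print(grid_to_csv(grid))
```

If you only need spreadsheets, the direct Excel export mentioned above avoids this step entirely; a grid like this is mainly useful when tables feed further processing.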

As a bonus, the structured data is available to your organization without the need for upfront templating or training, constant maintenance of templates, or manual repair of extracted data.

Now let’s look at a couple of specific use cases of these extraction features and JSON in a few industries.

1. Automated Data Labelling — Say Goodbye to Costly Manual Tagging for AI

Sentiment analysis studies the subjective information in an expression — the opinion, emotion, or attitude toward something, such as a topic, person, object, or entity. With the help of sentiment analysis, data analysts can get information about how businesses and companies are perceived by consumers.

Expressions can be tagged as positive, neutral, or negative. For example:

😊 “I like your new product.” Positive.

😐 “I investigated a trial license.” Neutral.

🙁 “I don’t understand the point of the new product.” Negative.

Eliminating the Weak Link in Labelling

To create a training dataset for NLP (natural language processing) or sentiment analysis, you must first select relevant chunks of text and tag them with labels. Machine learning models should be trained only on high-quality, reliable data. But a heavy human component in the labelling process is a weak link that consumes most of an AI team's project time. As a result, teams spend considerable resources developing custom solutions to process text. And many teams still tag text manually.

Here’s a real-world example: One company currently uses college students to manually clip out articles and scan them for media monitoring, outsourcing the labelling process.

Using Apryse Server SDK and IDP, however, this company can automate detection of news articles in scanned PDFs, label them, and then store extracted text. This approach significantly reduces errors and eliminates manual processing.

Out of the box, IDP reconstructs article structure into JSON. With document structure such as headlines, bylines, images, captions, advertisements, and article paragraphs available programmatically, you can quickly segment content and have your AI run sentiment analysis or other NLP tasks on only the relevant text.
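
Here is one way that segmentation step might look in Python. The element types (`headline`, `byline`, `paragraph`, `advertisement`, `caption`) and the flat element list are assumptions made for illustration, not the actual IDP schema, and `toy_sentiment` is a deliberately naive keyword stand-in for whatever model your team really uses.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    headline: str
    byline: str = ""
    paragraphs: list = field(default_factory=list)

    @property
    def body(self):
        return " ".join(self.paragraphs)


def segment_articles(elements):
    """Group a flat, reading-ordered element list into articles.
    Each headline starts a new article; ads, images, and captions are skipped."""
    articles = []
    current = None
    for el in elements:
        kind = el["type"]
        if kind == "headline":
            current = Article(headline=el["text"])
            articles.append(current)
        elif current is None:
            continue  # content before the first headline has no article yet
        elif kind == "byline":
            current.byline = el["text"]
        elif kind == "paragraph":
            current.paragraphs.append(el["text"])
    return articles


# Toy keyword lexicon: a placeholder for a real sentiment model.
POSITIVE = {"like", "love", "great"}
NEGATIVE = {"dislike", "poor", "confusing"}


def toy_sentiment(text):
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical elements, in reading order, as they might come from one page.
elements = [
    {"type": "headline", "text": "New Gadget Launches"},
    {"type": "byline", "text": "By A. Reporter"},
    {"type": "paragraph", "text": "Early reviewers love the great battery life."},
    {"type": "advertisement", "text": "Buy one, get one free!"},
    {"type": "caption", "text": "The gadget on display."},
    {"type": "headline", "text": "Transit Plan Delayed"},
    {"type": "paragraph", "text": "Commuters called the new schedule confusing."},
]

articles = segment_articles(elements)
labels = [toy_sentiment(article.body) for article in articles]
for article, label in zip(articles, labels):
    print(article.headline, "->", label)
```

The point of the sketch is the filtering: advertisement and caption text never reaches the classifier, which is the kind of noise manual clipping is otherwise used to remove.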

2. Automated Intake — Turn Documents into Your Data Lake

Apryse's IDP with intelligent data extraction can also be applied to any form of intake process or existing repository to turn your documents into data at scale. Think of PDF documents in your CRM storage, and especially those that are only images of documents, such as scanned or photographed contracts with hand-drawn signatures.

Let's look at some examples: Say your accounting or financial organization has vast numbers of invoices (various layout styles), scanned receipts (images saved to PDF), or even native, editable PDF documents — all of which need data extraction, processing, and classification. Or, say you’re processing medical intake forms, collecting patient history, past surgeries, symptoms, and so on.

Manual data entry for these forms cannot scale, because each form can take up to several minutes. It's also error-prone. As a result, it becomes hard to keep up with the influx, let alone take on new business or patients.

However, despite the costs, many organizations continue with manual intake because it's familiar — and automated alternatives that are reliable and accurate aren't common.

What about templating? Creating and maintaining a comprehensive library of templates and rules isn't cost-effective, because developers are scarce and costly resources. Templates are also seldom bulletproof, because intake forms come in all shapes and sizes and customers often change them.

Your staff — or expensive consultants — find themselves constantly catching exceptions, changing settings, and adjusting templates instead of focusing on the business.

Again, this is where Apryse IDP’s intelligent data extraction comes in: it leaps past manual processes and template-driven extraction. Instead, it automatically detects and extracts specific PDF data, one PDF at a time or in batch mode.
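
A batch run can be as simple as a loop over a folder. The Python sketch below assumes a caller-supplied `extract` function that takes a PDF path and returns JSON-serializable data; that function, along with `batch_extract` and `fake_extract`, is a hypothetical name for this example and stands in for the real SDK call, which is not shown here.

```python
import json
import tempfile
from pathlib import Path


def batch_extract(input_dir, output_dir, extract):
    """Run `extract` on every PDF in input_dir and write one JSON file each.
    Failures are collected instead of stopping the whole batch."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    succeeded, failed = [], []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        try:
            result = extract(pdf_path)
        except Exception as exc:
            failed.append((pdf_path.name, str(exc)))
            continue
        out_path = output_dir / (pdf_path.stem + ".json")
        out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
        succeeded.append(out_path.name)
    return succeeded, failed


def fake_extract(pdf_path):
    """Stand-in extractor for the demo; replace with the real extraction call."""
    if pdf_path.stat().st_size == 0:
        raise ValueError("empty file")
    return {"source": pdf_path.name, "pages": []}


# Demo on a temporary folder holding one usable file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    (inbox / "invoice.pdf").write_bytes(b"%PDF-1.7")
    (inbox / "receipt.pdf").write_bytes(b"")
    succeeded, failed = batch_extract(inbox, Path(tmp) / "out", fake_extract)
    print("succeeded:", succeeded)
    print("failed:", failed)
```

Collecting failures rather than raising keeps one bad scan from blocking the rest of the intake queue, and the failed list gives staff a short exception report to review.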

This automation is a boon to any accounting, insurance, or healthcare organization. It improves extraction accuracy, opens new data sources for business insights, and significantly reduces the workload so staff can focus on business-critical activities.

What’s Next with Apryse IDP and PDF Data Extraction?

The new Apryse IDP enables efficient and accurate PDF content extraction without the need for extensive upfront training or templates.

In this article, we have only scratched the surface with two examples. Other uses abound, such as content republishing, where you want to reconstruct PDFs or parts of them somewhere else. For example:

Turning restaurant menu PDFs into an app-friendly layout for mobile devices.
Digitizing archived printed media into web or mobile app content.
Digitizing forms into an app experience and/or auto-filling forms from a database.

We’d love to see what you create using Apryse IDP. If you have any issues or questions during your free trial, don’t hesitate to drop us a line or leave us a note. Use our free trial support to talk to an engineer.

When you’re ready to add IDP and intelligent data extraction to your existing Apryse Server SDK license, contact Sales.