Exporting OCR Data to JSON with Apryse

By Garry Klooesterman | 2025 Jan 30

4 min

Introduction

Optical Character Recognition (OCR) technology has transformed how we handle printed documents. However, converting images of text into a machine-readable form often yields unstructured data that is difficult to analyze. A powerful solution is to structure the OCR output in a format like JSON (JavaScript Object Notation). OCR to JSON extraction makes text searchable, accessible, and easy to integrate with various applications, enabling efficient data analysis, automated data entry, archiving, and more. Ultimately, OCR to JSON unlocks the value of previously inaccessible information.

In this blog, we’ll explore JSON, exporting data to JSON using an OCR SDK, why the Apryse OCR SDK should be your top choice, and the process to export OCR to JSON.

JSON

JSON is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. Built on a subset of JavaScript, and drawing on conventions familiar from languages like C, C++, C#, Java, JavaScript, Python, and others, JSON offers a flexible way to exchange data. Its straightforward structure requires no specialized knowledge or tools for analysis and interpretation.
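To illustrate how little tooling JSON needs, here is a minimal sketch that round-trips a small record using JavaScript's built-in JSON.stringify and JSON.parse. The record and its field names are made up for this example and are not the Apryse output schema.

```javascript
// A hypothetical record; the field names are illustrative only.
const record = { text: "Invoice", page: 1, tags: ["header", "bold"] };

// Serialize to a JSON string (2-space indentation for readability).
const serialized = JSON.stringify(record, null, 2);

// Parse the string back into a plain JavaScript object.
const restored = JSON.parse(serialized);

console.log(serialized);
console.log(restored.tags.length); // number of tags in the restored object
```

No third-party library is involved, which is a large part of why JSON works well as an exchange format between systems.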

The Apryse Intelligent Document Processing (IDP) module uses JSON as its data structure and applies Artificial Intelligence (AI) to analyze and understand the content of a PDF document.

Apryse OCR SDK

The Apryse OCR SDK is a powerful, enterprise-ready solution built to transform your document workflows.

Key Features

Handles various document types, including tables, multiple languages, and different text orientations.
Automatically enhances image quality, leading to more accurate text recognition.
Efficiently handles high-volume workflows.
A server-based solution, keeping your sensitive material protected.
Easily integrated into existing systems and server environments.
Can be fine-tuned for optimal performance by adjusting processing parameters to meet the requirements of any project.

From reduced costs and increased efficiency to improved accessibility and customized data output, the Apryse OCR SDK offers a comprehensive suite of benefits for your business.

OCR Engines

Apryse offers three OCR engines to add advanced text extraction capabilities to your applications.

LEADTOOLS OCR Engine
Tesseract OCR Module
IRIS OCR Module

For this blog, we’ll use the default OCR module, which contains the LEADTOOLS OCR Engine and the Tesseract OCR Module.

Export OCR to JSON

JSON’s structured format makes it ideal for organized data storage and seamless exchange between applications and systems. When you export OCR to JSON, you not only preserve the extracted text but also include valuable metadata: information about the text’s appearance within the document, such as font size, orientation, and more. This rich data set simplifies storage, access, and future use. Furthermore, the extraction process itself can be implemented with minimal code.

Step 1: Download and install the OCR module.

Step 2: Now let’s use OCR to extract the data to a JSON file.

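const { PDFNet } = require('@pdftron/pdfnet-node');

async function main() {
  // Set up an empty destination document
  const doc = await PDFNet.PDFDoc.create();
  const image_path = "path/to/image";
  // OCR options (defaults here; adjust language, DPI, etc. as needed)
  const opts = new PDFNet.OCRModule.OCROptions();
  // Extract OCR results as a JSON string
  const json = await PDFNet.OCRModule.getOCRJsonFromImage(doc, image_path, opts);
}
PDFNet.runWithCleanup(main);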
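The snippet above leaves the result in the json variable rather than on disk. Assuming that value is a JSON string, a minimal sketch for saving it with Node's built-in fs module could look like the following. The helper name saveOcrJson, the output location, and the sample string are hypothetical stand-ins, not part of the Apryse SDK.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: checks that the OCR result is well-formed JSON,
// then writes it to outPath, pretty-printed.
function saveOcrJson(json, outPath) {
  const parsed = JSON.parse(json); // throws if the string is not valid JSON
  fs.writeFileSync(outPath, JSON.stringify(parsed, null, 2), "utf8");
  return outPath;
}

// Stand-in for the string returned by the OCR call (not real OCR output).
const sampleJson = '{"Page":[]}';
const outFile = saveOcrJson(sampleJson, path.join(os.tmpdir(), "ocr-output.json"));
console.log(`Saved OCR JSON to ${outFile}`);
```

Parsing before writing is a design choice: it surfaces a malformed result immediately instead of leaving a broken file for a later step to discover.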

Output Attributes

Now that we have our JSON file with the extracted OCR data, we can look at the various attributes of the data. The output consists of nested arrays for pages, paragraphs, lines, and words. Pages have the following metadata:

Figure 1: Page metadata in JSON

Each word has the following metadata:

Figure 2: Word metadata in JSON

Sample JSON output

Here we can see what the JSON output from OCR extraction could look like.

Figure 3: Sample JSON output
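Because the output is nested (pages, then paragraphs, lines, and words), a small traversal is usually the first step in working with it. The sketch below assumes the arrays are keyed Page, Para, Line, and Word, and that each word carries a text field. These key names are assumptions made for illustration, so check them against your own output before relying on them.

```javascript
// Assumed shape (verify against real output):
// { Page: [ { Para: [ { Line: [ { Word: [ { text, ... } ] } ] } ] } ] }
function extractLines(ocr) {
  const lines = [];
  for (const page of ocr.Page ?? []) {
    for (const para of page.Para ?? []) {
      for (const line of para.Line ?? []) {
        // Join the words of one line with single spaces.
        const words = (line.Word ?? []).map((w) => w.text);
        lines.push(words.join(" "));
      }
    }
  }
  return lines;
}

// Hand-written sample in the assumed shape (not real OCR output).
const sample = {
  Page: [
    {
      num: 1,
      Para: [
        {
          Line: [
            { Word: [{ text: "Hello" }, { text: "world" }] },
            { Word: [{ text: "Second" }, { text: "line" }] },
          ],
        },
      ],
    },
  ],
};

console.log(extractLines(sample));
```

Missing arrays are treated as empty, so a page without paragraphs is skipped rather than causing an error.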

So, what do I do with the data?

Great question! Now that our data has been extracted, organized, and stored in a JSON file, we can explore its potential uses, such as:

Storing the data for later use.
Using the data in another application, database, or system.
Applying corrected or processed data back onto the original input document.
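As one example of the second use, the sketch below flattens the nested output into one row per word and renders the rows as CSV, a shape most databases and spreadsheets can import. As before, the Page, Para, Line, and Word keys and the text, num, and font-size fields are assumptions for illustration, and toRows and toCsv are hypothetical helper names.

```javascript
// Flatten nested OCR output into one row per word (assumed key names).
function toRows(ocr) {
  const rows = [];
  (ocr.Page ?? []).forEach((page, pageIndex) => {
    for (const para of page.Para ?? []) {
      for (const line of para.Line ?? []) {
        for (const word of line.Word ?? []) {
          rows.push({
            page: page.num ?? pageIndex + 1, // fall back to array position
            text: word.text,
            fontSize: word["font-size"] ?? null,
          });
        }
      }
    }
  });
  return rows;
}

// Render rows as CSV, quoting the text so commas and quotes survive.
function toCsv(rows) {
  const quote = (s) => `"${String(s).replace(/"/g, '""')}"`;
  const header = "page,text,fontSize";
  const body = rows.map((r) => [r.page, quote(r.text), r.fontSize ?? ""].join(","));
  return [header, ...body].join("\n");
}

// Hand-written sample in the assumed shape (not real OCR output).
const ocrSample = {
  Page: [
    {
      num: 1,
      Para: [{ Line: [{ Word: [{ text: "Total,", "font-size": 12 }, { text: "42" }] }] }],
    },
  ],
};

const rows = toRows(ocrSample);
console.log(toCsv(rows));
```

Keeping the flattening step separate from the CSV step means the same rows can instead be passed to a database client or another application.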

Conclusion

In conclusion, by using the Apryse OCR SDK to extract OCR data to JSON, businesses can unlock valuable information trapped within static formats. The process makes data searchable and accessible and simplifies integration with various applications and workflows. With Apryse’s OCR to JSON functionality, you can maximize efficiency, improve accessibility, and realize the full potential of your information.

Check out a demo of Apryse OCR SDK in action. Contact our sales team for any questions.

Need help setting up? Join our Discord community for support and discussions.