Handwritten Form Processing: Automating Data Capture from Unstructured Documents

Published May 06, 2026

Updated May 08, 2026

Read time: 4 min



Isaac Maw

Technical Content Creator

Summary: Understanding what's required for a successful ICR pipeline is essential for achieving repeatable, accurate results. With the Apryse SDK, you can build a pipeline that delivers accurate handwritten form extraction.

OCR is a valuable tool for converting text trapped in non-machine-readable formats, such as scanned images or image-based PDFs, into machine-readable text. Converting handwritten text requires different technology. Intelligent character recognition (ICR) uses AI trained on handwritten letter shapes to interpret real handwriting. Rather than matching characters against fixed templates, ICR uses neural networks and machine learning to analyze individual writing styles, adapt over time, and extract meaning from even the most unstructured inputs. This closes a critical gap in end-to-end automation workflows by converting previously inaccessible handwritten content into structured, searchable data.


However, processing handwritten forms in practice requires more than a good ICR model. For accuracy in production, the model is only the tip of the iceberg: roughly 80% of a successful implementation lies in the image preprocessing, field detection, post-processing, and validation that improve results, prevent errors, and make detection more reliable.

Handwritten Form Processing Pipeline Breakdown

Let’s take a look at the pipeline for a handwritten form, from scan to output.

1. Ingestion & Normalization

First, documents must be put into the best format and condition for ICR. This includes converting to an image format that can be deskewed, rotated, and despeckled to reduce noise, then upscaled to a sufficient resolution. For example, recognition quality diminishes sharply once glyph bounding boxes shrink below 20×20 pixels, and if a document is upside down, models can't match or interpret glyphs correctly (confusing u and n, for example).
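To make the resolution guideline concrete, here is a small pure-Python sketch (my own illustration, not an Apryse API) that computes the minimum scan DPI needed to keep glyphs above the 20-pixel floor:

```python
import math

def min_scan_dpi(glyph_height_pt: float, min_pixels: int = 20) -> int:
    """Smallest scan resolution (DPI) that renders a glyph of the given
    point size at least `min_pixels` tall (1 pt = 1/72 inch)."""
    return math.ceil(min_pixels * 72 / glyph_height_pt)

# A 10 pt handwritten entry needs at least 144 DPI to stay above 20x20 px
print(min_scan_dpi(10))  # 144
```

In other words, a typical 300 DPI scan leaves plenty of headroom, while a 72 DPI fax of small handwriting is already below the usable floor.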

2. Preprocessing

Next, forms are preprocessed to prepare the document image for recognition. For example, the image is binarized to eliminate shadows, then undergoes noise removal, contrast adjustment, and line removal for printed grids.
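Binarization itself can be as simple as a global threshold; production systems typically use adaptive methods such as Otsu's, but this pure-Python sketch shows the idea:

```python
def binarize(gray, threshold=128):
    """Global-threshold binarization: map a grayscale image (rows of
    0-255 values) to pure black (0) ink and white (255) background."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

# A light shadow (200) becomes background; dark ink (40) survives as black
scan = [[255, 200, 40],
        [230, 35, 250]]
print(binarize(scan))  # [[255, 255, 0], [255, 0, 255]]
```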

3. Field Detection

With the document preprocessed, template-based or ML-based field detection can be applied to separate user responses from form content. For example, Apryse offers Template Extraction for structured forms such as ACORD forms, and Smart Data Extraction for more variable inputs.

4. Recognition

This step is where the recognition model runs: OCR for printed text, or ICR for handwritten forms. OCR matches individual glyphs against a reference to identify characters, while ICR uses AI to map a much more varied set of glyph shapes to letters.

5. Post-processing

Following recognition, some post-processing is usually beneficial, such as running a spell checker or comparing results against whitelists and blacklists. To do this, first extract the text and its corresponding metadata as JSON, then re-apply the processed results to the input document.
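As a sketch of whitelist-based post-processing (pure Python, not the Apryse API), a fuzzy match can snap a misread value to the closest allowed entry:

```python
import difflib

def snap_to_whitelist(value, whitelist, cutoff=0.8):
    """Replace a recognized value with its closest whitelist entry when
    one is similar enough; otherwise flag the value for review."""
    match = difflib.get_close_matches(value, whitelist, n=1, cutoff=cutoff)
    return (match[0], True) if match else (value, False)

states = ["Alabama", "Alaska", "Arizona", "Arkansas"]
print(snap_to_whitelist("Arizcna", states))  # ('Arizona', True)
```

Values that fail the similarity cutoff keep their raw text and a `False` flag, which makes them natural candidates for human review.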

6. Structured output

Finally, the system provides structured JSON output suitable for downstream use. Apryse ICR JSON output consists of nested arrays:

  • Array of pages
  • Array of paragraphs
  • Array of lines
  • Array of words

The JSON output also includes metadata and attributes such as page number, coordinates, and character orientation.
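Assuming an output shaped like the nested arrays above (the key names here are illustrative, not the exact Apryse schema), flattening it into per-word records for downstream use looks like this:

```python
import json

# Hypothetical ICR output following the nested structure described above
icr_json = """{
  "pages": [{
    "num": 1,
    "paragraphs": [{
      "lines": [{
        "words": [
          {"text": "John", "x": 120, "y": 340},
          {"text": "Smith", "x": 180, "y": 340}
        ]
      }]
    }]
  }]
}"""

def flatten_words(doc):
    """Walk pages -> paragraphs -> lines -> words and emit flat records."""
    for page in doc["pages"]:
        for para in page["paragraphs"]:
            for line in para["lines"]:
                for word in line["words"]:
                    yield {"page": page["num"], **word}

rows = list(flatten_words(json.loads(icr_json)))
print([r["text"] for r in rows])  # ['John', 'Smith']
```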

Why Cloud APIs Hit a Ceiling

Cloud APIs from hyperscalers like Google, Amazon, and Microsoft are excellent for stages 4 and 5. For best results, however, developers must implement the preprocessing steps on their own. In addition, these API services bill per page, process documents in a third-party cloud, and don't support on-premises deployment for privacy-sensitive (GDPR, HIPAA) workloads.

With per-page billing, API services can quickly exceed SDK license costs as usage increases, and the investment in building a preprocessing stack around a specific API service adds vendor lock-in risk.
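The break-even arithmetic is easy to sketch; the figures below are hypothetical, not real pricing:

```python
def breakeven_pages(api_cents_per_page: int, sdk_annual_license_cents: int) -> int:
    """Annual page volume at which a flat SDK license becomes cheaper than
    per-page API billing. Prices in integer cents to avoid float rounding."""
    return -(-sdk_annual_license_cents // api_cents_per_page)  # ceiling division

# At a hypothetical 2 cents/page, a $15,000/yr license breaks even at 750,000 pages
print(breakeven_pages(2, 1_500_000))  # 750000
```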

The Two Capture Architectures: Template-Based vs. Template-Free

Developers can choose the most effective extraction solution for the use case. For structured forms, where the system handles a single form type and user input sits at the same page coordinates in every document, template-based extraction is the right choice.

Using Apryse Template Extraction, developers define a template by selecting field coordinates on a sample form. The engine then pulls values from those coordinates on every matching document. This works well for insurance ACORD forms, tax forms, and other structured forms. However, it breaks down on variable layouts or on forms that change frequently.
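A minimal sketch of the template idea (field names and coordinates invented for illustration, not the Apryse API): a template maps each field to a fixed box, and extraction collects whatever recognized words fall inside it:

```python
# Hypothetical template: field names mapped to fixed page boxes (x1, y1, x2, y2)
TEMPLATE = {
    "policy_number": (50, 100, 200, 120),
    "insured_name":  (50, 140, 300, 160),
}

def extract_fields(words, template):
    """words: list of (text, x, y) tuples recognized on the page."""
    out = {field: [] for field in template}
    for text, x, y in words:
        for field, (x1, y1, x2, y2) in template.items():
            if x1 <= x <= x2 and y1 <= y <= y2:
                out[field].append(text)
    return {field: " ".join(ws) for field, ws in out.items()}

words = [("AC-123", 60, 110), ("Jane", 55, 150), ("Doe", 120, 150)]
print(extract_fields(words, TEMPLATE))  # {'policy_number': 'AC-123', 'insured_name': 'Jane Doe'}
```

The fragility is visible in the code: shift a field by a few dozen pixels, or redesign the form, and the boxes no longer capture the right words.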

For these, template-free extraction uses machine learning to intelligently detect fields on any document and extract data as key-value pairs. While Smart Data Extraction is more flexible, it requires more validation after extraction.

Document Classification can be used to sort a stream of incoming documents and identify documents against a large library of document types. This can help route documents to the correct extraction model.
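As a toy illustration of routing (real classification matches documents against a template library rather than keywords; the document types and keywords below are invented), a keyword score can direct each document to an extraction path:

```python
# Toy keyword router (illustrative document types and keywords)
ROUTES = {
    "acord": ["acord", "insured", "policy"],
    "tax":   ["irs", "1040", "taxable"],
}

def route(text, routes=ROUTES, default="template_free"):
    """Score each document type by keyword hits; fall back to
    template-free extraction when nothing matches."""
    words = text.lower().split()
    scores = {doc_type: sum(kw in words for kw in kws)
              for doc_type, kws in routes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route("ACORD 25 certificate insured Jane Doe"))  # acord
```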

Building for Handwriting Extraction

Handwriting is messy. Handwritten glyphs of the 26 letters of the alphabet vary widely, values often escape the boxes in the form design, and pen color may vary too. These variables mean preprocessing, including field detection, binarization, and contrast adjustment, matters more for achieving good ICR results. Post-processing also matters more: a dictionary lookup can catch a misread patient name.

Human-in-the-loop review is an essential stage in any AI-assisted process, and ICR is no different, especially as inputs like strikethroughs and poor penmanship can impact results with even the best models.
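A simple confidence gate is one way to decide which fields go to a reviewer; the threshold and field structure here are illustrative, not an Apryse interface:

```python
def needs_review(fields, threshold=0.85):
    """Flag fields whose recognition confidence falls below the threshold
    so they can be routed to a human reviewer."""
    return [name for name, (value, conf) in fields.items() if conf < threshold]

result = {"first_name": ("Jane", 0.97),
          "diagnosis":  ("hypertensior", 0.62)}
print(needs_review(result))  # ['diagnosis']
```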

Building Smart Data Extraction with Apryse

Process a scanned document

Make a searchable PDF by adding invisible text to an image-based PDF, such as a scanned document, using Handwriting ICR:

# Assumes PDFNet has been initialized and the Handwriting ICR module's
# resource path has been registered
doc = PDFDoc(input_pdf_path)

# Run ICR on the PDF with the default options
HandwritingICRModule.ProcessPDF(doc)

# Save the now-searchable document
doc.Save(output_pdf_path, SDFDoc.e_linearized)

We also have a full code sample to add searchable/selectable text to an image-based PDF, like a scanned document, which shows how to use the Apryse Handwriting ICR module on scanned documents in multiple programming languages. The Handwriting ICR module can make searchable PDFs and extract scanned text for further indexing. Samples are available in Python, C# (.Net), C++, Go, Java, Node.js (JavaScript), PHP, Ruby, VB, and Obj-C.

Extract handwritten text as JSON

If you want to apply raw ICR output to the input document, you can call HandwritingICRModule.ProcessPDF. However, some post-processing is usually beneficial, e.g., running a spell checker or comparing results against whitelists and blacklists. To do this, first extract the text and its corresponding metadata as JSON, then re-apply the processed results to the input document.

Template Extraction for structured forms

Visit the documentation to get started with Apryse Template Extraction. The workflow includes:

  • Create Template File: Use the Template Designer to create templates.
  • Extract Templated Data: After classifying a file to its associated template, the Template Extraction SDK extracts data from the input file based on the template, including support for different types of deformation in the data layout between the template and the input file.
  • Classify File Against Templates: The Template Extraction SDK can classify an input file against a group of templates.
  • Create Classification Cache: To speed up classification of an input file against many templates, the SDK can create cache data that, when present, is used to identify the correct template quickly.
  • Use the Classify Catalogue to Speed Up Classification: If you have generated a cache to optimize classification speed, enable it by calling `EnableCacheLookup(true)` on the `TemplateClassifier.Builder`.

Next Steps

If you’re ready to get started with Apryse Smart Data Extraction for ICR and Template Extraction for structured forms, you can start your free trial to test it out. Don’t hesitate to contact sales with any questions.

FAQ

Q: Can Smart Data Extraction handle cursive?

A: While Apryse ICR performs well on cursive writing in structured fields, freeform prose can be more challenging. You can reach out to sales for help with ICR implementation and use cases.

Q: How do I evaluate against my current tool?

A: One method of comparing OCR and ICR tools is to run the same 500 sample documents through both, then compare the straight-through-processing (STP) rate plus reviewer effort. This method goes beyond raw accuracy to help determine the total cost of the tool.
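That comparison can be scripted; the sketch below computes the STP rate from per-field confidence scores (illustrative data, assuming a document passes only when every field clears the threshold):

```python
def stp_rate(confidences, threshold=0.9):
    """Straight-through-processing rate: share of documents whose every
    field clears the confidence threshold (i.e., no human touch needed)."""
    passed = sum(all(c >= threshold for c in doc) for doc in confidences)
    return passed / len(confidences)

# Tool A passes 2 of 3 docs untouched; tool B passes only 1 of 3
tool_a = [[0.95, 0.92], [0.97, 0.91], [0.80, 0.99]]
tool_b = [[0.95, 0.85], [0.97, 0.91], [0.80, 0.99]]
print(stp_rate(tool_a), stp_rate(tool_b))
```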

Ready to get started?

Sign up for a free trial to begin implementing the Apryse SDK in your application!