How to Identify OCR Needs in a Folder of PDFs

Updated January 26, 2026

Read time: 5 min

Apryse

In this blog, we'll look at how searchable PDFs are created using OCR and how to identify, in bulk, the files that need this process.

Receiving a mix of raster and searchable PDFs can be a pain, especially if clients send you these files daily. The Apryse OCR and PDF SDK makes it easy for developers to check whether a file can be converted to a searchable PDF using OCR.

PDFs contain a few different object types, including:

  • Text
  • Rectangle
  • Image

This blog post shows how to check the object types in stored PDFs, then calculate whether each PDF needs to be converted to a searchable PDF using the OCR SDK. Here's how it works:

  1. At the start of your application, start up the OCR engine in case any files turn out to need OCR.
  2. Once the engine is started, loop through each PDF in a specified folder.
  3. For each file found, load it as a PDFDocument and parse the pages using ParsePages, setting PDFParseOptions to look only at the objects.
  4. Once all the objects have been examined, compare the number of non-text objects to the total number of objects found. If more than 10% of the objects are non-text objects, it's fair to assume that the majority of the PDF is not searchable and that you will need to OCR the document.

Here’s the code sample:
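
As a language-neutral illustration of steps 2 through 4, here is a minimal Python sketch. It does not call the Apryse SDK: `classify_objects` is a hypothetical stand-in for loading a file as a PDFDocument and reading its parsed objects, and the names `ObjectKind`, `needs_ocr`, and `find_pdfs_needing_ocr` are assumptions made for this example. Treating a file with no objects as "no OCR needed" is also an assumption, not something the steps above specify.

```python
from collections import Counter
from enum import Enum
from pathlib import Path
from typing import Callable, Iterable, List


class ObjectKind(Enum):
    """The object types listed earlier (illustrative, not SDK types)."""
    TEXT = "text"
    RECTANGLE = "rectangle"
    IMAGE = "image"


# The 10% cutoff from step 4.
NON_TEXT_THRESHOLD = 0.10


def needs_ocr(kinds: Iterable[ObjectKind],
              threshold: float = NON_TEXT_THRESHOLD) -> bool:
    """Return True if more than `threshold` of the objects are non-text."""
    counts = Counter(kinds)
    total = sum(counts.values())
    if total == 0:
        # Assumption: a file with no objects has nothing to OCR.
        return False
    non_text = total - counts[ObjectKind.TEXT]
    return non_text / total > threshold


def find_pdfs_needing_ocr(
    folder: Path,
    classify_objects: Callable[[Path], Iterable[ObjectKind]],
) -> List[Path]:
    """Check every *.pdf in `folder` and return the ones flagged for OCR.

    `classify_objects` is a hypothetical callback that returns the kind of
    every object found in one PDF; in a real application it would wrap the
    SDK's page parsing.
    """
    flagged = []
    # Note: glob("*.pdf") is case-sensitive on most non-Windows systems.
    for pdf_path in sorted(folder.glob("*.pdf")):
        if needs_ocr(classify_objects(pdf_path)):
            flagged.append(pdf_path)
    return flagged
```

In a real application, the files returned by `find_pdfs_needing_ocr` would then be passed to the OCR engine started in step 1. Keeping the threshold check separate from the parsing makes it easy to tune the 10% cutoff against your own documents.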

Now you have a starting point to check if OCR is needed for a folder of PDF files. Simple as that!

Next Steps

To learn more about Apryse OCR, visit our documentation. If you have any questions or are ready to get started, contact sales or check out the Server SDK trial.