How to Check if a Folder of PDFs Needs OCR

By Apryse | 2025 Jun 25

Receiving a mix of raster and searchable PDFs can be a pain, especially if clients send you these files daily. The Apryse OCR and PDF SDK makes it easy for developers to check whether a file needs OCR run on it.

A PDF page contains a few different object types, including:

  • Text
  • Rectangle
  • Image

For this blog post, I'll show you how to check all object types in stored PDFs, then determine whether each PDF needs to be converted to a searchable PDF using the OCR SDK. At the start of your application, initialize the SDK and point it at the OCR module in case you do have to OCR any files. Once that is done, loop through each PDF in a specified folder. For each file found, load it as a PDFDoc and walk the content of every page with an ElementReader, counting text and non-text objects as you go.

Once all the objects have been counted, compare the number of non-text objects to the total number of objects found. If more than 10% of the objects are non-text objects, it's fair to assume that a significant portion of the PDF is not searchable, and you will want to OCR the document.

Here’s the code sample:

using System;
using System.IO;
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // PDFTronLicense is assumed to be a class in your own project that
            // holds your license key and the path to the OCR module.
            PDFNet.Initialize(PDFTronLicense.License);
            PDFNet.AddResourceSearchPath(PDFTronLicense.ModulePath);

            string inputDir = @"path_to_input_dir";
            string outputDir = @"path_to_output_dir";

            // Make sure the output folder exists before saving into it.
            Directory.CreateDirectory(outputDir);

            foreach (string filePath in Directory.GetFiles(inputDir, "*.pdf"))
            {
                using (PDFDoc doc = new PDFDoc(filePath))
                using (ElementReader reader = new ElementReader())
                {
                    doc.InitSecurityHandler();

                    int textCount = 0;
                    int nonTextCount = 0;

                    // Walk every element on every page and tally text vs. non-text.
                    for (int i = 1; i <= doc.GetPageCount(); i++)
                    {
                        pdftron.PDF.Page page = doc.GetPage(i);
                        reader.Begin(page);

                        Element element;
                        while ((element = reader.Next()) != null)
                        {
                            switch (element.GetType())
                            {
                                case Element.Type.e_text:
                                    textCount++;
                                    break;
                                default:
                                    nonTextCount++;
                                    break;
                            }
                        }
                        reader.End();
                    }

                    // Guard against documents with no page elements at all,
                    // which would otherwise produce NaN here.
                    int totalElements = textCount + nonTextCount;
                    double nonTextPercentage = totalElements == 0
                        ? 0
                        : nonTextCount * 100.0 / totalElements;

                    if (nonTextPercentage > 10)
                    {
                        // Passing null uses the default OCR options.
                        OCRModule.ProcessPDF(doc, null);
                        string outputFilePath = Path.Combine(outputDir, Path.GetFileName(filePath));
                        doc.Save(outputFilePath, SDFDoc.SaveOptions.e_linearized);
                        Console.WriteLine($"OCR performed on {filePath} and saved to {outputFilePath}");
                    }
                    else
                    {
                        Console.WriteLine($"No OCR needed for {filePath}");
                    }
                }
            }
        }
    }
}
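
If you want to tune or unit-test the 10% rule on its own, one option is to pull the ratio check out of the loop into a small pure function. The sketch below is only an illustration: the OcrHeuristic class, the NeedsOcr method, and the thresholdPercent parameter are names I made up for this example and are not part of the Apryse SDK. It also assumes a document with no elements at all should be treated as not needing OCR.

```csharp
using System;

public static class OcrHeuristic
{
    // Hypothetical helper (not an Apryse API): returns true when non-text
    // elements make up more than thresholdPercent of all counted elements.
    public static bool NeedsOcr(int textCount, int nonTextCount, double thresholdPercent = 10.0)
    {
        if (textCount < 0 || nonTextCount < 0)
        {
            throw new ArgumentException("Element counts must be non-negative.");
        }

        int total = textCount + nonTextCount;

        // Assumption: an empty document has nothing to recognize, so skip OCR.
        if (total == 0)
        {
            return false;
        }

        // Multiply before dividing so the int counts are promoted to double.
        double nonTextPercentage = nonTextCount * 100.0 / total;
        return nonTextPercentage > thresholdPercent;
    }
}
```

With this in place, the check in the loop could become a call such as OcrHeuristic.NeedsOcr(textCount, nonTextCount). For example, 80 text and 20 non-text elements give 20%, which is over the threshold, while 90 text and 10 non-text elements give exactly 10%, which is not.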

Now you have a starting point to check if OCR is needed for a folder of PDF files. To learn more about Apryse OCR, visit our documentation. 
