How to Convert PDF to Text in C# and Java

By Apryse | 2025 Apr 03

PDF files are flexible and portable, but they are not always searchable, so extracting text from PDFs is a very common request. The Apryse OCR Engine makes this straightforward: Apryse’s AI-enhanced engine accepts any PDF, searchable or not, and extracts its text, using OCR where necessary. After extraction, Apryse SDK can save that information to a text file, a searchable PDF file, or any of our other 150+ supported document formats.

Below are two outlines on how to get started reading text from PDFs in C# and Java.

C# - Get Text From PDF

The following is an outline for a C# console app that reads every PDF in a folder, runs OCR on any file that has no extractable text, and prints the text to the console.

using System;
using System.IO;
using System.Text;
using pdftron;
using pdftron.PDF;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            PDFNet.Initialize("your_license_key");
            // Tell the SDK where to find the OCR module
            PDFNet.AddResourceSearchPath(@"path_to_your_ocr_module");

            // Directory containing the PDF files
            string directoryPath = "samples";

            // Get all PDF files in the directory
            string[] pdfFiles = Directory.GetFiles(directoryPath, "*.pdf");

            foreach (string pdfFile in pdfFiles)
            {
                Console.WriteLine($"Processing file: {pdfFile}");

                using (PDFDoc doc = new PDFDoc(pdfFile))
                {
                    doc.InitSecurityHandler();

                    // Extract text from the document
                    string allText = ExtractTextFromDocument(doc);

                    // If no text is found, perform OCR and extract again
                    if (string.IsNullOrWhiteSpace(allText))
                    {
                        Console.WriteLine("No text found, performing OCR...");
                        PerformOCR(doc);
                        allText = ExtractTextFromDocument(doc);
                    }

                    Console.WriteLine(allText);
                }
            }
        }

        // Concatenates the text of every page, one page per line break
        static string ExtractTextFromDocument(PDFDoc doc)
        {
            StringBuilder allText = new StringBuilder();

            using (TextExtractor txt = new TextExtractor())
            {
                // Pages are numbered starting at 1
                int pageCount = doc.GetPageCount();
                for (int i = 1; i <= pageCount; i++)
                {
                    txt.Begin(doc.GetPage(i));
                    allText.Append(txt.GetAsText()).Append("\n");
                }
            }

            return allText.ToString();
        }

        // Adds recognized text to the document in place, using English as the OCR language
        static void PerformOCR(PDFDoc doc)
        {
            OCROptions ocrOptions = new OCROptions();
            ocrOptions.AddLang("eng");
            OCRModule.ProcessPDF(doc, ocrOptions);
        }
    }
}

Java - Get Text From PDF

The Java outline follows the same steps as the C# version and prints the extracted text to the console. The Apryse engine is also capable of storing extracted text in one of over 150 supported file formats.

import com.pdftron.common.PDFNetException;
import com.pdftron.pdf.*;
import java.io.File;

public class App {
  public static void main(String[] args) {
    try {
      PDFNet.initialize("your_license_key");
      // Tell the SDK where to find the OCR module
      PDFNet.addResourceSearchPath("path_to_your_ocr_module");

      // Directory containing the PDF files (forward slashes work on every platform)
      String directoryPath = "src/samples";

      // Get all PDF files in the directory
      File dir = new File(directoryPath);
      File[] pdfFiles = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"));

      // listFiles returns null when the directory does not exist
      if (pdfFiles != null) {
        for (File pdfFile : pdfFiles) {
          System.out.println("Processing file: " + pdfFile.getAbsolutePath());

          PDFDoc doc = new PDFDoc(pdfFile.getAbsolutePath());
          try {
            doc.initSecurityHandler();

            // Extract text from the document
            String allText = extractTextFromDocument(doc);

            // If no text is found, perform OCR and extract again
            if (allText.trim().isEmpty()) {
              System.out.println("No text found, performing OCR...");
              performOCR(doc);
              allText = extractTextFromDocument(doc);
            }

            System.out.println(allText);
          } finally {
            // Release the document even if extraction or OCR fails
            doc.close();
          }
        }
      }
    } catch (PDFNetException e) {
      e.printStackTrace();
    }
  }

  // Concatenates the text of every page, one page per line break
  static String extractTextFromDocument(PDFDoc doc) throws PDFNetException {
    TextExtractor txt = new TextExtractor();
    StringBuilder allText = new StringBuilder();

    // Pages are numbered starting at 1
    for (int i = 1; i <= doc.getPageCount(); i++) {
      txt.begin(doc.getPage(i));
      allText.append(txt.getAsText()).append("\n");
    }

    return allText.toString();
  }

  // Adds recognized text to the document in place, using English as the OCR language
  static void performOCR(PDFDoc doc) throws PDFNetException {
    OCROptions ocrOptions = new OCROptions();
    ocrOptions.addLang("eng");
    OCRModule.processPDF(doc, ocrOptions);
  }
}
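Both outlines print the extracted text to the console. If you would rather keep it, one simple option is to write the string to a .txt file next to the source PDF. The sketch below uses only the Java standard library and does not call the Apryse SDK. The class name TextFileWriter, the helpers textPathFor and saveText, and the "same name, .txt extension" convention are assumptions made for illustration, not part of the Apryse API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: not part of the Apryse SDK.
public class TextFileWriter {

    // Maps "dir/report.pdf" to "dir/report.txt" (assumed naming convention).
    static Path textPathFor(Path pdfPath) {
        String name = pdfPath.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension to strip
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return pdfPath.resolveSibling(base + ".txt");
    }

    // Writes the text as UTF-8, replacing any existing file, and returns the output path.
    static Path saveText(Path pdfPath, String allText) throws IOException {
        Path out = textPathFor(pdfPath);
        Files.write(out, allText.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Demo in a temporary folder; the PDF itself does not need to exist
        Path dir = Files.createTempDirectory("pdf-text-demo");
        Path out = saveText(dir.resolve("example.pdf"), "Text extracted from the PDF\n");
        System.out.println("Saved text to " + out);
    }
}
```

In the Java outline above, the call would sit where allText is printed, for example TextFileWriter.saveText(pdfFile.toPath(), allText), with IOException handled alongside PDFNetException.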

 

Apryse documentation has a step-by-step guide to converting files with the document converter in Java and C#.

Get Your Trial

Get your trial of Apryse SDK for free. It’s fully functional and even comes with unlimited chat and email support.

Next Steps

Stay tuned for more conversion examples showing how the Apryse OCR engine fits into workflows that convert PDF files into other document formats or images and back again. Need help in the meantime? Contact our sales team or reach out to us on Discord.
