By Apryse | 2025 Apr 03
Tags
java
C#
pdf conversion
While PDF files are flexible and portable, they are unfortunately not always searchable, and extracting text from PDFs is a very common request. Luckily, the Apryse OCR Engine makes this a breeze. Apryse’s AI-enhanced engine can accept any PDF (searchable or not) and extract the text from it, using OCR where necessary. After extraction, the Apryse SDK can save that text to a text file, a searchable PDF file, or any of our other 150+ supported document formats.
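The "OCR where necessary" decision can be illustrated without the SDK. The sketch below is a minimal, SDK-free illustration of that control flow only: `OcrFallback`, `needsOcr`, and `textOrOcr` are hypothetical names, and the two suppliers are stand-ins for the real extraction and OCR calls. It assumes Java 11 or later for `String.isBlank`.

```java
import java.util.function.Supplier;

/** Sketch of the "extract first, OCR only when needed" decision (hypothetical helper). */
final class OcrFallback {

    /** True when direct extraction produced nothing but whitespace (or nothing at all). */
    static boolean needsOcr(String extractedText) {
        return extractedText == null || extractedText.isBlank();
    }

    /**
     * Returns the directly extracted text, or falls back to the OCR path
     * when the direct result is blank. Both suppliers are stand-ins for SDK calls.
     */
    static String textOrOcr(Supplier<String> extract, Supplier<String> ocrThenExtract) {
        String text = extract.get();
        return needsOcr(text) ? ocrThenExtract.get() : text;
    }

    public static void main(String[] args) {
        // A searchable PDF: the OCR supplier is never invoked.
        System.out.println(textOrOcr(() -> "Invoice 42", () -> "unused"));
        // A scanned PDF: direct extraction yields only whitespace, so the OCR path runs.
        System.out.println(textOrOcr(() -> " \n", () -> "Scanned text"));
    }
}
```

Passing the OCR step as a supplier means the more expensive path only runs for documents that actually need it.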
Below are two outlines on how to get started reading text from PDFs in C# and Java.
The following is an outline for a C# console app that reads every PDF in a directory, runs OCR on any file that has no extractable text, and prints the text to the console.
using System;
using System.IO;
using System.Text;
using pdftron;
using pdftron.PDF;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            PDFNet.Initialize("your_license_key");
            // Tell the SDK where to find the OCR module
            PDFNet.AddResourceSearchPath(@"path_to_your_ocr_module");

            // Directory containing the PDF files
            string directoryPath = "samples";
            // Get all PDF files in the directory
            string[] pdfFiles = Directory.GetFiles(directoryPath, "*.pdf");

            foreach (string pdfFile in pdfFiles)
            {
                Console.WriteLine($"Processing file: {pdfFile}");
                using (PDFDoc doc = new PDFDoc(pdfFile))
                {
                    doc.InitSecurityHandler();

                    // Extract text from the document
                    string allText = ExtractTextFromDocument(doc);

                    // If no text is found, perform OCR and extract again
                    if (string.IsNullOrWhiteSpace(allText))
                    {
                        Console.WriteLine("No text found, performing OCR...");
                        PerformOCR(doc);
                        allText = ExtractTextFromDocument(doc);
                    }

                    Console.WriteLine(allText);
                }
            }
        }

        // Concatenates the text of every page, one page per line break
        static string ExtractTextFromDocument(PDFDoc doc)
        {
            TextExtractor txt = new TextExtractor();
            StringBuilder allText = new StringBuilder();
            int pageCount = doc.GetPageCount();
            // Page numbers are 1-based
            for (int i = 1; i <= pageCount; i++)
            {
                txt.Begin(doc.GetPage(i));
                allText.Append(txt.GetAsText()).Append('\n');
            }
            return allText.ToString();
        }

        // Runs OCR in place so the document gains a searchable text layer
        static void PerformOCR(PDFDoc doc)
        {
            OCROptions ocrOptions = new OCROptions();
            ocrOptions.AddLang("eng");
            OCRModule.ProcessPDF(doc, ocrOptions);
        }
    }
}
The Apryse engine is capable of storing extracted text in any of over 150 supported file formats. Here is the same example implemented in Java; like the C# version, it prints the extracted text to the console.
import com.pdftron.common.PDFNetException;
import com.pdftron.pdf.*;

import java.io.File;

public class App {
    public static void main(String[] args) {
        try {
            PDFNet.initialize("your_license_key");
            // Tell the SDK where to find the OCR module
            PDFNet.addResourceSearchPath("path_to_your_ocr_module");

            // Directory containing the PDF files (forward slash works on all platforms)
            String directoryPath = "src/samples";

            // Get all PDF files in the directory
            File dir = new File(directoryPath);
            File[] pdfFiles = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"));

            // listFiles returns null if the directory does not exist or cannot be read
            if (pdfFiles != null) {
                for (File pdfFile : pdfFiles) {
                    System.out.println("Processing file: " + pdfFile.getAbsolutePath());
                    // try-with-resources closes the document even if extraction throws
                    try (PDFDoc doc = new PDFDoc(pdfFile.getAbsolutePath())) {
                        doc.initSecurityHandler();

                        // Extract text from the document
                        String allText = extractTextFromDocument(doc);

                        // If no text is found, perform OCR and extract again
                        if (allText.trim().isEmpty()) {
                            System.out.println("No text found, performing OCR...");
                            performOCR(doc);
                            allText = extractTextFromDocument(doc);
                        }

                        System.out.println(allText);
                    }
                }
            }
        } catch (PDFNetException e) {
            e.printStackTrace();
        }
    }

    // Concatenates the text of every page, one page per line break
    static String extractTextFromDocument(PDFDoc doc) throws PDFNetException {
        TextExtractor txt = new TextExtractor();
        StringBuilder allText = new StringBuilder();
        // Page numbers are 1-based
        for (int i = 1; i <= doc.getPageCount(); i++) {
            txt.begin(doc.getPage(i));
            allText.append(txt.getAsText()).append("\n");
        }
        return allText.toString();
    }

    // Runs OCR in place so the document gains a searchable text layer
    static void performOCR(PDFDoc doc) throws PDFNetException {
        OCROptions ocrOptions = new OCROptions();
        ocrOptions.addLang("eng");
        OCRModule.processPDF(doc, ocrOptions);
    }
}
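Both programs above print the extracted text to the console. To keep the text instead, it could be written to a plain text file next to each PDF. The following is a minimal sketch using only the Java standard library (Java 11 or later); `TextFileWriter`, `textPathFor`, and `saveText` are hypothetical helper names for illustration and are not part of the Apryse SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that stores extracted text beside the source PDF. */
final class TextFileWriter {

    /** Maps e.g. "scan.PDF" to "scan.txt" in the same directory. */
    static Path textPathFor(Path pdfPath) {
        String name = pdfPath.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot)
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return pdfPath.resolveSibling(base + ".txt");
    }

    /** Writes the text as UTF-8, replacing any existing file, and returns the output path. */
    static Path saveText(Path pdfPath, String text) {
        Path out = textPathFor(pdfPath);
        try {
            return Files.writeString(out, text, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Rethrow unchecked so callers inside loops or lambdas stay simple
            throw new UncheckedIOException(e);
        }
    }
}
```

In the Java example above, a call such as `TextFileWriter.saveText(pdfFile.toPath(), allText)` could stand in for the final `System.out.println(allText)`.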
The Apryse documentation has a step-by-step guide to converting files with the document converter in Java and C#.
Get your trial of Apryse SDK for free. It’s fully functional, and even comes with unlimited chat and email support.
Stay tuned for more conversion examples to see how the Apryse OCR engine will easily fit into any workflow converting PDF files into other document files or images and back again. Need help in the meantime? Contact our sales team or reach out to us on Discord.