A Developer’s Guide to PDF Redaction Using the Apryse Java SDK

By Ian Morris | 2023 Sep 22

Explore PDF redaction in Java with the Apryse SDK. Learn how redaction works, how to redact text, and how customized and batch redaction can help, in this developer's guide.

Introduction

In modern digital document workflows, many industries need to securely remove personally identifiable information (PII) or other confidential content from documents. This process is known as redaction: a form of editing in which information is permanently removed from a document, rather than simply covered up or otherwise obscured.

We’ve written a few articles about redaction recently, explaining what it is, why the permanent removal of sensitive and private information is important, and what happens if it goes wrong. We’ve also written tutorials on how to automate secure (or “true”) redaction of such data from documents using our Apryse WebViewer with JavaScript and frameworks such as React.

In this article, however, we’re going to focus on how developers can use the Apryse SDK to programmatically redact content from PDF documents. The code samples we’ll reference here are for Java, though if you prefer to code in JavaScript, C#, C++, Python, Ruby, etc., don’t worry! Our comprehensive documentation has you covered with samples for these and other languages.

How Redaction Works in Apryse SDK

The redaction process in the Apryse SDK (formerly known as PDFTron) uses the pdftron.PDF.Redactor class, and consists of two steps:

1. Content Identification

To specify the content that should be redacted, or the regions of a document where that content is located, you use redact annotations. These are a specific type of markup added to the PDF document. The content for redaction can be identified either interactively (e.g. using pdftron.PDF.PDFViewCtrl) or programmatically (e.g. using pdftron.PDF.TextSearch or pdftron.PDF.TextExtractor). You can view, move, and redefine these annotations to ensure they completely cover the information to be removed before the next step is performed.

2. Content Removal

Once the content to be redacted has been defined, you instruct the Apryse SDK to apply the redact regions using pdftron.PDF.Redactor.Redact() (Redactor.redact() in the Java API), which completely removes the content in the area specified by the redact annotations. A number of options control the style of the redaction overlay (including color, text, font, border, and transparency), so you can customize how the redactions look. For example, instead of a simple black rectangle, you might prefer descriptive labels that show the type of content that was redacted.

An example of redaction overlays which have been customized to show contextual information.
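Because step 2 cannot be reversed, it can be worth checking in code that each region from step 1 really does cover the content it targets. The following is a minimal sketch in plain Java, not Apryse SDK code: the RedactionCoverage class, its Box type, and the sample coordinates are all hypothetical, and it assumes you have already obtained bounding boxes for the text you want removed (for example from a text search).

```java
import java.util.List;

// Hypothetical helper, not part of the Apryse SDK. Boxes are plain rectangles
// in page units, with (x1, y1) the lower-left and (x2, y2) the upper-right corner.
final class RedactionCoverage {

    static final class Box {
        final double x1, y1, x2, y2;

        Box(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }

        // True when this box fully contains the other box.
        boolean covers(Box other) {
            return x1 <= other.x1 && y1 <= other.y1
                && x2 >= other.x2 && y2 >= other.y2;
        }
    }

    // True when at least one redaction region fully covers the target box.
    static boolean isCovered(Box target, List<Box> regions) {
        for (Box region : regions) {
            if (region.covers(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Box> regions = List.of(new Box(100, 100, 550, 600));
        Box insideText = new Box(120, 300, 400, 320);      // lies within the region
        Box overhangingText = new Box(500, 300, 580, 320); // sticks out on the right

        System.out.println(isCovered(insideText, regions));      // true
        System.out.println(isCovered(overhangingText, regions)); // false
    }
}
```

A target that fails this check is a cue to move or enlarge the region before redacting, since any content left outside a region is not removed.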

Prerequisites

This guide assumes you have a preconfigured Java development environment with the Apryse Java SDK already installed. If not, follow the steps described in the Java Get started guide.

For other languages/environments, follow the Cross-Platform API guide to download and get an unlimited free trial of our PDF library. While the code snippets in this article are for Java, you can find equivalent snippets and full code samples for other languages in our documentation. The concepts described in this article apply to all versions of the Apryse SDK.

How to Redact Text in PDFs in Java

The code snippet below is taken from our full Java redaction sample, and demonstrates adding multiple redaction objects to an array named vec. If we look at the first redaction object, we can see the specified page number (1st argument: 1), and the coordinates of the redaction rectangle (2nd argument: Rect(100, 100, 550, 600)). The 3rd argument (false) is a Boolean indicating that content inside the redaction area will be removed. In addition, a text label (“Top Secret”) for the redaction overlay is specified.

Redactor.Redaction[] vec = new Redactor.Redaction[7]; 
vec[0] = new Redactor.Redaction(1, new Rect(100, 100, 550, 600), false, "Top Secret"); 
vec[1] = new Redactor.Redaction(2, new Rect(30, 30, 450, 450), true, "Negative Redaction"); 
vec[2] = new Redactor.Redaction(2, new Rect(0, 0, 100, 100), false, "Positive"); 
vec[3] = new Redactor.Redaction(2, new Rect(100, 100, 200, 200), false, "Positive"); 
vec[4] = new Redactor.Redaction(2, new Rect(300, 300, 400, 400), false, ""); 
vec[5] = new Redactor.Redaction(2, new Rect(500, 500, 600, 600), false, ""); 
vec[6] = new Redactor.Redaction(3, new Rect(0, 0, 700, 20), false, ""); 
 
Redactor.Appearance app = new Redactor.Appearance(); 
app.redactionOverlay = true; 
app.border = false; 
app.showRedactedContentRegions = true; 
 
redact(input_path + "newsletter.pdf", output_path + "redacted.pdf", vec, app); 

We then specify appearance settings for the redactions:

  • redactionOverlay is set to true, indicating that redacted areas will have an overlay.
  • border is set to false, meaning there won't be a border around redacted areas.
  • showRedactedContentRegions is set to true, indicating that the redacted content regions will be visible in the document.

Finally, the call to redact (a helper method defined in the full sample) performs the redactions on the specified input file and writes the result to the output file.
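The 3rd constructor argument is easy to misread, so here is a small conceptual model of it in plain Java. It is not SDK code: RedactionFlagModel, Item, and Region are hypothetical names, and page content is simplified to single points. It assumes that false removes content inside the rectangle (as described above) and that true does the reverse and removes content outside it, which is how we read the "Negative Redaction" label on vec[1].

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only: none of these types belong to the Apryse SDK.
final class RedactionFlagModel {

    // A piece of page content, simplified to a label and a single point.
    static final class Item {
        final String label;
        final double x;
        final double y;

        Item(String label, double x, double y) {
            this.label = label;
            this.x = x;
            this.y = y;
        }
    }

    // A rectangular region plus the Boolean passed as the 3rd argument.
    static final class Region {
        final double x1, y1, x2, y2;
        final boolean negative;

        Region(double x1, double y1, double x2, double y2, boolean negative) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
            this.negative = negative;
        }

        boolean contains(Item item) {
            return item.x >= x1 && item.x <= x2 && item.y >= y1 && item.y <= y2;
        }

        // Assumption: false removes what is inside, true removes what is outside.
        boolean removes(Item item) {
            return negative != contains(item);
        }
    }

    // Labels of the items that remain after applying a single region.
    static List<String> survivors(List<Item> items, Region region) {
        List<String> kept = new ArrayList<>();
        for (Item item : items) {
            if (!region.removes(item)) {
                kept.add(item.label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Item> page = List.of(new Item("headline", 100, 400), new Item("footer", 500, 20));

        System.out.println(survivors(page, new Region(30, 30, 450, 450, false))); // [footer]
        System.out.println(survivors(page, new Region(30, 30, 450, 450, true)));  // [headline]
    }
}
```

The real Redactor works on the page's actual text, images, and paths, and also removes the covered portion of content that only partly overlaps a region, so treat this purely as a way to reason about the flag.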

Can I Undo Redactions in a Document?

Once the redact annotations have been applied, it is not possible to undo the redaction. The pdftron.PDF.Redactor class makes sure that if a portion of an image, text, or vector graphics is contained in a redaction region, that portion of the image, text, or path data is destroyed and is not simply hidden with clipping or image masks. When you redact a PDF, the actual physical structure of the PDF is modified to prevent the malicious retrieval of redacted information. So you can be sure that if you redact a document in this way, it cannot be unredacted.

Using the Apryse SDK’s API you can also review and remove metadata and other content that can exist in a PDF document, such as XML Forms Architecture (XFA) and Extensible Metadata Platform (XMP) content. 

How to Create a Template for Batch Redactions

You might need to perform identical redactions on multiple documents, for example to remove a Social Security Number (SSN) from tax returns. If the document format is predictable and the information to be redacted is always located in the same place, then you can create a redaction template to be applied to all the required documents. 

First, define the redaction regions in your document as in the previous code snippet. Then, you can extract the redaction annotations to the Forms Data Format (FDF), and then export that as an XML-based XFDF file, as in the following code snippet:

// Open the document that contains the redaction annotations.
PDFDoc doc = new PDFDoc(filename); 

// Extract annotations to FDF. 
// Optionally use e_both to extract both forms and annotations 
FDFDoc doc_annots = doc.fdfExtract(PDFDoc.e_annots_only); 

// Export annotations from FDF to XFDF. 
doc_annots.saveAsXFDF(output_xfdf_filename); 
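Before reusing the exported file as a template, you may want to check which regions it actually contains. The sketch below uses only the JDK's built-in XML parser, not the Apryse SDK. It assumes that redact annotations are written as redact elements carrying page and rect attributes, and that page indices in XFDF are zero-based; verify both against a file exported from your own document. XfdfTemplateInspector and the sample XML are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the Apryse SDK.
final class XfdfTemplateInspector {

    // Returns one "page=<n> rect=<coords>" entry per <redact> element found.
    static List<String> listRedactRegions(String xfdf) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since templates may come from outside.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xfdf.getBytes(StandardCharsets.UTF_8)));

            // "*" matches the element in any namespace, including the XFDF default one.
            NodeList nodes = dom.getElementsByTagNameNS("*", "redact");
            List<String> regions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element redact = (Element) nodes.item(i);
                regions.add("page=" + redact.getAttribute("page")
                    + " rect=" + redact.getAttribute("rect"));
            }
            return regions;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XFDF", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, assumed to resemble an exported template.
        String sample =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\">"
            + "<annots>"
            + "<redact page=\"0\" rect=\"100,100,550,600\"/>"
            + "<redact page=\"1\" rect=\"0,0,100,100\"/>"
            + "</annots>"
            + "</xfdf>";

        for (String region : listRedactRegions(sample)) {
            System.out.println(region);
        }
    }
}
```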

You now have an XFDF file containing the redact region coordinates you defined, which can be applied to all subsequent documents. To do this, import the file and merge the FDF data into the PDF document as below:

// Import annotations from XFDF to FDF 
FDFDoc fdf_doc = FDFDoc.createFromXFDF(xfdf_filename); 

// Merge FDF data into PDF doc 
PDFDoc doc = new PDFDoc(filename); 
doc.fdfMerge(fdf_doc); 

See the full FDFTest code sample for more information on the FDF merge/extraction functionality.
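To apply a template to a whole folder, the per-document steps can be wrapped in a simple loop. The following is a minimal sketch of such a driver in plain Java. BatchRedactionDriver and its redactOne callback are hypothetical: the callback is the place where the import, merge, redaction, and save calls for one document would go, and here it is replaced by a stand-in that only prints the file names so the sketch runs without the SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.BiConsumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical batch driver, not part of the Apryse SDK.
final class BatchRedactionDriver {

    // Calls redactOne(input, output) for every .pdf file in inputDir, in name
    // order, and returns the output paths that were handed to the callback.
    static List<Path> run(Path inputDir, Path outputDir, BiConsumer<Path, Path> redactOne) {
        try (Stream<Path> files = Files.list(inputDir)) {
            Files.createDirectories(outputDir);
            List<Path> inputs = files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                .sorted()
                .collect(Collectors.toList());

            List<Path> outputs = new ArrayList<>();
            for (Path input : inputs) {
                Path output = outputDir.resolve(input.getFileName());
                redactOne.accept(input, output);
                outputs.add(output);
            }
            return outputs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary folders with placeholder files, just to exercise the loop.
        Path inputDir = Files.createTempDirectory("to-redact");
        Path outputDir = Files.createTempDirectory("redacted");
        Files.createFile(inputDir.resolve("return-2021.pdf"));
        Files.createFile(inputDir.resolve("return-2022.pdf"));
        Files.createFile(inputDir.resolve("notes.txt")); // skipped: not a PDF

        // Stand-in callback: a real one would merge the template and redact.
        run(inputDir, outputDir, (input, output) ->
            System.out.println(input.getFileName() + " -> " + output));
    }
}
```

Keeping the SDK work behind a callback makes the loop easy to test on its own, and lets you decide per document how failures are handled (for example, logging and continuing rather than aborting the whole batch).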

Want a Quick Demonstration of Redaction?

On the Apryse Showcase site, you can try a demonstration of how the Apryse SDK securely redacts content, as implemented in the WebViewer SDK. Our WebViewer is a JavaScript implementation that brings many of the capabilities of the original SDK to web applications, including secure document redaction.

As well as delivering high-quality rendering, conversion, and document manipulation capabilities in an intuitive user interface, WebViewer offers security benefits when redacting documents. Since the data processing is done completely client-side, users can securely access and redact documents without needing server connections or external dependencies. See our recent blog to learn more about securely redacting sensitive information with WebViewer. 

Conclusion

You can take advantage of the Apryse SDK’s versatility to bring flexibility and efficiency to redaction workflows, enhancing your users' experience and document processing productivity.

There is a wealth of documentation for in-depth insights into the capabilities of the library, enabling you to customize the redaction process according to your specific requirements.

Don’t forget, you can also reach out to us on Discord if you have any issues.

Ian Morris

Technical Writer
