By Ian Morris | 2023 Sep 22
5 min
Tags
java
PDF SDK
apryse sdk
Explore the world of PDF redaction in Java with the Apryse SDK. Learn how redaction works, master the art of text redaction, and discover the benefits of customized and batch redaction in this comprehensive developer's guide.
For modern digital document workflows, being able to securely remove personally identifiable information (PII) or other confidential content from documents is essential in many industries. This process is known as redaction: a form of editing where information is permanently removed from a document, rather than simply covered up or otherwise obscured.
We’ve written a few articles about redaction recently, explaining what it is, why the permanent removal of sensitive and private information is important, and what happens if it goes wrong. We’ve also written tutorials on how to automate secure (or “true”) redaction of such data from documents using our Apryse WebViewer with languages such as JavaScript and the React framework.
In this article, however, we’re going to focus on how developers can use the Apryse SDK to programmatically redact content from PDF documents. The code samples we’ll reference here are for Java, though if you prefer to code in JavaScript, C#, C++, Python, Ruby, etc., don’t worry! Our comprehensive documentation has got you covered with samples for these and other languages.
The redaction process in the Apryse (formerly known as PDFTron) SDK uses the pdftron.PDF.Redactor class, and consists of two steps:
To specify the content that should be redacted, or the regions where content to be removed is located in a document, you use redact annotations. These are specific types of markup which are added to the PDF document. The content for redaction can be identified either interactively (e.g. using pdftron.PDF.PDFViewCtrl) or programmatically (e.g. using pdftron.PDF.TextSearch or pdftron.PDF.TextExtractor). You can see, move and redefine these annotations to ensure they completely cover the information to be removed before the next step is performed.
Once the content to be redacted has been defined, you instruct the Apryse SDK to apply the redact regions using pdftron.PDF.Redactor.Redact(), which will completely remove the content in the area specified by the redact annotations. There are a number of options to control the style of the redaction overlay (including color, text, font, border, transparency, etc.), so you can customize how the redactions will look. For example, instead of a simple black rectangle you might prefer descriptive labels to show the type of content that was redacted.
An example of redaction overlays which have been customized to show contextual information.
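For the first step, identifying content programmatically often comes down to pattern matching on text that has already been extracted from a page. The following is a minimal sketch in plain Java, not Apryse SDK code: it assumes the page text is already available as a string (for example, obtained through pdftron.PDF.TextExtractor), and the class name SensitiveTextFinder and its SSN pattern are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Apryse SDK.
class SensitiveTextFinder {
    // Assumed pattern for SSN-like strings such as 123-45-6789.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    /** Returns {start, end} character offsets of every SSN-like match in the text. */
    static List<int[]> findSsnSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String pageText = "Name: Jane Doe, SSN: 123-45-6789, Phone: 555-0100";
        // Print each candidate string that a redact region would need to cover.
        for (int[] span : findSsnSpans(pageText)) {
            System.out.println(pageText.substring(span[0], span[1]));
        }
    }
}
```

The offsets only say where the sensitive text sits in the extracted string. Mapping them to page coordinates for a redact region is left to the SDK's text search and extraction classes.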
This guide assumes you have a preconfigured Java development environment with the Apryse Java SDK already installed. If not, follow the steps described in the Java Get started guide.
For other languages/environments, follow the Cross-Platform API guide to download and get an unlimited free trial of our PDF library. While the code snippets in this article are for Java, you can find equivalent snippets and full code samples for other languages in our documentation. The concepts described in this article, however, are applicable to all versions of the Apryse SDK.
The code snippet below is taken from our full Java redaction sample, and demonstrates adding multiple redaction objects to an array named vec. If we look at the first redaction object, we can see the specified page number (1st argument: 1), and the coordinates of the redaction rectangle (2nd argument: Rect(100, 100, 550, 600)). The 3rd argument (false) is a Boolean indicating that content inside the redaction area will be removed. In addition, a text label (“Top Secret”) for the redaction overlay is specified.
// Each entry: page number, redaction rectangle, Boolean flag, overlay text
Redactor.Redaction[] vec = new Redactor.Redaction[7];
vec[0] = new Redactor.Redaction(1, new Rect(100, 100, 550, 600), false, "Top Secret");
vec[1] = new Redactor.Redaction(2, new Rect(30, 30, 450, 450), true, "Negative Redaction");
vec[2] = new Redactor.Redaction(2, new Rect(0, 0, 100, 100), false, "Positive");
vec[3] = new Redactor.Redaction(2, new Rect(100, 100, 200, 200), false, "Positive");
vec[4] = new Redactor.Redaction(2, new Rect(300, 300, 400, 400), false, "");
vec[5] = new Redactor.Redaction(2, new Rect(500, 500, 600, 600), false, "");
vec[6] = new Redactor.Redaction(3, new Rect(0, 0, 700, 20), false, "");
// Appearance settings for the redaction overlays
Redactor.Appearance app = new Redactor.Appearance();
app.redactionOverlay = true;
app.border = false;
app.showRedactedContentRegions = true;
// Apply the redactions to the input file and write the output file
redact(input_path + "newsletter.pdf", output_path + "redacted.pdf", vec, app);
We then specify appearance settings for the redactions:
redactionOverlay is set to true, indicating that redacted areas will have an overlay.
border is set to false, meaning there won't be a border around redacted areas.
showRedactedContentRegions is set to true, indicating that the redacted content regions will be visible in the document.
Finally, we perform the redactions on the specified input file, and write to the output file.
Once the redact annotations have been applied, it is not possible to undo the redaction. The pdftron.PDF.Redactor class makes sure that if a portion of an image, text, or vector graphics is contained in a redaction region, that portion of the image or path data is destroyed and is not simply hidden with clipping or image masks. When you redact a PDF, the actual physical structure of the PDF is modified to prevent the malicious retrieval of redacted information. So, you can be sure that if you redact a document in this way, it cannot be unredacted.
Using the Apryse SDK’s API you can also review and remove metadata and other content that can exist in a PDF document, such as XML Forms Architecture (XFA) and Extensible Metadata Platform (XMP) content.
You might need to perform identical redactions on multiple documents, for example to remove a Social Security Number (SSN) from tax returns. If the document format is predictable and the information to be redacted is always located in the same place, then you can create a redaction template to be applied to all the required documents.
First, define the redaction regions in your document as in the previous code snippet. Then, you can extract the redaction annotations to the Forms Data Format (FDF), and then export that as an XML-based XFDF file, as in the following code snippet:
PDFDoc doc = new PDFDoc(filename);
// Extract annotations to FDF.
// Optionally use e_both to extract both forms and annotations
FDFDoc doc_annots = doc.fdfExtract(PDFDoc.e_annots_only);
// Export annotations from FDF to XFDF.
doc_annots.saveAsXFDF(output_xfdf_filename);
You now have an XFDF file containing the redact region coordinates you defined which can be applied to all subsequent documents. To do this, import the file and merge the FDF data into the PDF document as below:
// Import annotations from XFDF to FDF
FDFDoc fdf_doc = FDFDoc.createFromXFDF(xfdf_filename);
// Merge FDF data into PDF doc
PDFDoc doc = new PDFDoc(filename);
doc.fdfMerge(fdf_doc);
See the full FDFTest code sample for more information on the FDF merge/extraction functionality.
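Because XFDF is XML, a template can be inspected with standard Java XML tooling before it is merged into other documents. The sketch below is not Apryse SDK code, and it rests on assumptions: the XfdfTemplateInspector class is hypothetical, and the sample string assumes that redact annotations are exported as redact elements carrying page and rect attributes. Check a file exported from your own document to confirm the actual element and attribute names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the Apryse SDK.
class XfdfTemplateInspector {
    /** Returns one "page=<n> rect=<coords>" line per redact element found. */
    static List<String> listRedactRegions(String xfdf) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xfdf.getBytes(StandardCharsets.UTF_8)));
            // Assumed element name for redact annotations in the exported file.
            NodeList nodes = dom.getElementsByTagName("redact");
            List<String> regions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                regions.add("page=" + e.getAttribute("page") + " rect=" + e.getAttribute("rect"));
            }
            return regions;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XFDF content", ex);
        }
    }

    public static void main(String[] args) {
        // Illustrative content only; a real template comes from saveAsXFDF.
        String sample = "<xfdf xmlns=\"http://ns.adobe.com/xfdf/\"><annots>"
                + "<redact page=\"0\" rect=\"100,100,550,600\"/>"
                + "</annots></xfdf>";
        listRedactRegions(sample).forEach(System.out::println);
    }
}
```

A check like this can confirm that a template contains the regions you expect before it is applied to a batch of documents.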
On the Apryse Showcase site you can see a demonstration of how the Apryse SDK securely redacts content, as implemented in the WebViewer SDK. Our WebViewer is a JavaScript implementation that brings many of the capabilities of the original SDK to web applications, including secure document redaction.
As well as delivering high-quality rendering, conversion, and document manipulation capabilities in an intuitive user interface, there are many security benefits to using the WebViewer for redacting documents. Since the data processing is done completely client-side, users can securely access and redact documents without needing server connections or external dependencies. See our recent blog to learn more about securely redacting sensitive information with WebViewer.
Developers can take advantage of the Apryse SDK’s versatility to bring flexibility and efficiency to redaction workflows, enhancing their users' experience and document processing productivity.
There is a wealth of documentation for in-depth insights into the capabilities of the library, enabling you to customize the redaction process according to your specific requirements.
Don’t forget, you can also reach out to us on Discord if you have any issues.
Ian Morris
Technical Writer