Explaining Redaction, and the Different Ways to Redact

By Ian Morris | 2023 Jul 20

In an era where data privacy and security are paramount, the protection of sensitive information in documents is more important than ever. The industry-standard way to achieve this is through redaction, a process that ensures confidential information is hidden or permanently removed from documents.

In this article, we will explore what redaction is, the ways it can be achieved, and why “true” or secure redaction of PDF documents is important. We’ll also show you how the Apryse WebViewer SDK offers comprehensive solutions for secure document redaction in addition to its industry-leading rendering, conversion, and document manipulation capabilities.

What is Redaction?

Redaction is the process of obscuring or removing sensitive information from documents to prevent unauthorized access or disclosure. It involves selectively marking or deleting specific content, such as personal identification numbers, financial data, or confidential paragraphs, while preserving the integrity and structure of the document.

In the past, redaction was performed manually, using tools such as black markers or grease paint to physically obscure information in documents. This was a simple, if tedious, process, since paper is a format with no metadata that needs to be erased. The invention of the photocopier in the 20th century helped, though, since a single redacted document could be reproduced as many times as needed.

A redacted official document from the mid-twentieth century.

The advent of PDF in 1993 made it easy to share documents without using paper, and the first PDF redaction plugin for Acrobat was released in 1997. In 2006, the PDF 1.7 specification introduced redaction annotations, making redaction a “standard” feature of PDF.

Why Do You Need Redaction?

PDF redaction is commonly used in various industries that deal with sensitive or confidential information or are subject to data protection and privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union or the Health Insurance Portability and Accountability Act (HIPAA) in the United States. Secure redaction of information helps organizations comply with these regulations and avoid potential penalties.

Some of the main industries that rely on PDF redaction include:

Legal: Law firms, courts, and legal professionals often handle documents containing sensitive information, such as personal data, financial records, or privileged communications. Redaction ensures that confidential details are removed before sharing or submitting documents.

Government: Government agencies at various levels deal with classified or confidential information that must be protected. Redaction is employed to safeguard sensitive data in PDF documents related to national security, intelligence, law enforcement, or public administration, for example, if such information is determined to be exempt from a Freedom of Information Act (FOIA) request.

Healthcare: The healthcare industry deals with sensitive patient information protected by privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Medical records, research studies, insurance claims, and other healthcare-related documents often require redaction to prevent unauthorized access to personal data.

Finance and Banking: Financial institutions, including banks, investment firms, and accounting firms, handle confidential financial statements, customer records, transaction documents, and other sensitive information. PDF redaction is essential to ensure compliance with regulations such as the Gramm-Leach-Bliley Act (GLBA) and safeguard client privacy.

Human Resources: HR departments regularly deal with private employee information, such as Social Security numbers, salary details, performance reviews, and disciplinary actions. PDF redaction helps HR professionals protect sensitive employee data when sharing documents or responding to legal inquiries.

Education: Educational institutions, including universities, colleges, and schools, handle student records, academic transcripts, financial aid documents, and other confidential information. PDF redaction assists in safeguarding student privacy and complying with relevant data protection laws, such as the Family Educational Rights and Privacy Act (FERPA).

Research and Development: Industries engaged in research and development, such as pharmaceuticals, biotechnology, and engineering, rely on PDF redaction to protect intellectual property, patent applications, research findings, and other sensitive information before sharing them internally or with external collaborators.

Insurance: Insurance companies and agencies process policy documents, claim forms, medical records, and other sensitive data. PDF redaction helps protect personally identifiable information (PII) and ensure compliance with industry regulations.

It's important to note that while these industries commonly utilize PDF redaction, the need for redaction can arise in any sector where sensitive information needs to be protected from unauthorized access or disclosure.

Working in Angular? Check out our dedicated guide to redaction in Angular with WebViewer.

How PDF Redaction Works

When redacting PDF documents, it is essential to completely remove confidential information from the document, not just obscure or mask it. A common mistake is to open the document and draw a black rectangle over the text to be redacted. However, this only places the rectangle on top of the original content. The rectangle can easily be removed to reveal the “redacted” text underneath.

For secure, or “true”, redaction, you need software that actually removes the redacted content while leaving the rest of the document untouched. As mentioned earlier, version 1.7 of the PDF specification introduced redaction annotations, which enable secure redaction using a two-step (or optionally three-step) process.

In the first step, the sensitive content is identified, and the redactions are placed over it. In the second step, the redactions are verified and then applied. This completely removes the redacted text from the PDF document's content stream. As an optional third step, if the document contains descriptive metadata such as bookmarks or links, you may need to sanitize that metadata if it contains sensitive information.
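The difference between masking and the mark-then-apply process can be illustrated with a toy model. The sketch below is not a PDF implementation and does not use the Apryse API: `Page`, `TextRun`, `Rect`, and every function name are hypothetical, and a page is reduced to a list of text runs with bounding boxes. Real redaction tools work on the PDF content stream and can remove individual glyphs rather than whole runs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: Rect) -> bool:
        # Two rectangles overlap only when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass
class TextRun:
    text: str
    box: Rect


@dataclass
class Page:
    runs: list[TextRun]
    overlays: list[Rect] = field(default_factory=list)  # drawn on top, visual only
    marks: list[Rect] = field(default_factory=list)     # pending redaction marks

    def extract_text(self) -> str:
        # Text extraction (copy and paste, search) ignores anything drawn on top.
        return " ".join(run.text for run in self.runs)


def draw_black_box(page: Page, area: Rect) -> None:
    """Insecure masking: hides the text visually but leaves it in the page."""
    page.overlays.append(area)


def mark_for_redaction(page: Page, area: Rect) -> None:
    """Step 1: record where content should be removed. Nothing is removed yet."""
    page.marks.append(area)


def apply_redactions(page: Page) -> None:
    """Step 2: delete every run touching a mark, then leave a black box behind."""
    page.runs = [run for run in page.runs
                 if not any(run.box.intersects(mark) for mark in page.marks)]
    page.overlays.extend(page.marks)
    page.marks = []


def sample_page() -> Page:
    # Made-up content for illustration.
    return Page(runs=[
        TextRun("Account:", Rect(0, 0, 50, 10)),
        TextRun("12345678", Rect(55, 0, 110, 10)),
    ])
```

In this model, calling `draw_black_box` leaves `extract_text()` unchanged, which is the failure mode described above. Text only disappears once `apply_redactions` has run, so a document saved after marking but before applying still contains the sensitive content.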

Using the Apryse WebViewer to select content for redaction.

As we explained in two recent blogs, the Apryse JavaScript WebViewer SDK allows developers to easily automate the secure redaction of PDFs and over 30 different document types using high-quality conversion, including MS Office files (doc, docx, xlsx, pptx) and even images. However, the WebViewer’s intuitive browser-based interface also allows non-developers to truly redact sensitive and confidential data.

Check out our Showcase to try out the Apryse SDK’s secure redaction capabilities for yourself!

When Redaction Goes Wrong

Although the PDF format allows for the “true” redaction of sensitive data, there have been several high-profile cases in recent years where classified or confidential data was retrieved from redacted PDF documents. One of the most famous came in 2019, when lawyers for former Trump campaign manager Paul Manafort filed an official response to allegations from Special Counsel Robert Mueller’s team, whose investigation into Russian interference with the 2016 election later produced what is commonly known as the Mueller Report. Prosecutors said Manafort had lied to them when giving evidence. In their heavily redacted response, his lawyers argued that their client had “provided complete and truthful information to the best of his ability.”

However, journalists quickly discovered that the supposedly redacted text was merely covered by thick black bars: copying it into a new document made the unredacted text visible. What likely happened was that someone used Word’s highlighter tool to draw over the text before exporting as a PDF, or used software to add redaction annotations but forgot to “apply” them when saving the PDF. Whatever occurred, politically and legally sensitive information was released publicly, which was probably the last thing Manafort’s legal team wanted to happen.

Another widely reported case of a “redaction failure” followed the publication of a disputed COVID-19 vaccine contract between the European Commission and the international biopharmaceutical company AstraZeneca. While the sensitive content in the main text of the contract had been redacted properly, much of the redacted information was still present in the document’s bookmarks. As can be seen from the screenshot below, this information included the €870 million estimated Cost of Goods and other information which should not have been disclosed.

The AstraZeneca contract with redacted text partially restored from the Bookmarks data.

There are many other examples of such redaction failures, though accidentally releasing confidential data is not the only way things can go wrong with redacted documents. One such example again related to the Mueller Report, although this time the fault was with the US Department of Justice (DoJ) rather than an external legal team. As detailed in a series of articles by the PDF Association, rather than being made available as a “native” PDF with searchable text, what the DoJ initially released was simply a set of low-quality scanned images embedded in a PDF.

This was particularly disappointing: not only did it make the dissemination of the report's contents difficult, but it also violated the ADA/Section 508 regulations. Section 508 of the Rehabilitation Act requires all Federal agencies to ensure digital documents are accessible to persons with disabilities, and the DoJ has a clear policy stating that public documents must be Section 508 compliant and able to be easily processed by accessibility software such as screen readers. For PDF documents, being accessible means conforming to “Level A” of the PDF Archiving standard (PDF/A), or, ideally, the PDF standard for Universal Accessibility (PDF/UA). These standards require documents to have text that can be easily searched and that is tagged with a hierarchical structure tree, so that elements such as reading order, figures, and tables are explicitly identified through metadata.

The PDF Association’s analysis concluded that while the redactions had been done by professional software rather than manually, as a PDF document, the report was in a sorry state overall. With non-searchable text, no tags, a broken document structure, and many other failings, the redacted PDF of the Mueller Report was far from a modern digital document suitable for the archiving and accessibility requirements of a government agency. A version of the document using optical character recognition (OCR) to add a searchable text layer and accessibility tags was eventually released, but many issues still remained due to the inadequate quality of the source document.

Now that you know what not to do when redacting documents, let’s see how it should be done.

Introducing the Apryse WebViewer SDK

The Apryse WebViewer SDK is a robust software development kit that empowers organizations to seamlessly integrate powerful redaction functionalities into their applications and workflows. Using our powerful and user-friendly WebViewer, the redaction process can be streamlined and automated to ensure the security of your data.

Here are the key features and benefits it offers for secure redaction of sensitive content:

1. Automated Text and Image Redaction

The Apryse WebViewer SDK enables automated redaction of PDF and over thirty different document types, thanks to its high-quality conversion capabilities. Whether you need to remove data from PDFs, MS Office files, images, or other formats, WebViewer has you covered. It supports precise identification and redaction of sensitive content, ensuring comprehensive protection of confidential information.

2. Regex-Based Redaction

WebViewer also supports regular expression-based redaction, allowing users to define custom patterns for identifying and redacting specific types of content. This feature is particularly useful when dealing with complex data formats or recurring patterns.
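To make the idea of pattern-based detection concrete, here is a minimal sketch using Python's standard `re` module. It is not WebViewer code: the pattern names, the two patterns, and both helper functions are assumptions chosen for illustration, and it works on plain text rather than on a PDF. Production patterns would need tuning to the data in question.

```python
import re

# Illustrative patterns only; real projects need patterns tuned to their data.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_sensitive_spans(text):
    """Return (label, start, end) for every pattern match, ordered by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


def redact_text(text, placeholder="[REDACTED]"):
    """Replace each matched span with a placeholder.

    Works right to left so earlier offsets stay valid.
    Assumes matches from different patterns do not overlap.
    """
    for _, start, end in reversed(find_sensitive_spans(text)):
        text = text[:start] + placeholder + text[end:]
    return text
```

In a document workflow, the spans found this way would typically become redaction marks for a reviewer to verify before they are applied, rather than being replaced immediately as `redact_text` does here.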

3. Bulk Redaction

With the Apryse WebViewer SDK, users can efficiently redact multiple occurrences of sensitive information across a document or an entire document repository. This feature saves time and effort, especially when dealing with large volumes of documents.

4. Visual Redaction Verification

The SDK offers a visual verification feature that enables users to preview and validate redacted content before finalizing the document. This ensures accuracy and minimizes the risk of inadvertently disclosing sensitive information.

5. Metadata Redaction

Apart from visible content, the Apryse WebViewer SDK also redacts document metadata, such as author names, timestamps, or revision history. This comprehensive approach ensures that all potentially sensitive information is protected.
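A simple post-redaction check can help catch leftovers in non-visible parts of a document, such as the bookmarks in the AstraZeneca example. The sketch below assumes the metadata fields and bookmark titles have already been extracted into a dictionary by some other tool. The function name, the field names, and the sample values are all hypothetical, and this is not how WebViewer itself performs metadata redaction.

```python
def find_leaks(redacted_terms, hidden_fields):
    """Return (field_name, term) pairs where a redacted term still appears.

    hidden_fields maps a field name (for example "Author" or "bookmark[1]")
    to its text. Matching is case-insensitive.
    """
    leaks = []
    for name, value in hidden_fields.items():
        lowered = value.lower()
        for term in redacted_terms:
            if term.lower() in lowered:
                leaks.append((name, term))
    return leaks


# Made-up sample data for illustration.
fields = {
    "Title": "Supply Agreement",
    "Author": "J. Smith",
    "bookmark[0]": "1. Definitions",
    "bookmark[1]": "Estimated cost: EUR 870 million",
}
leaks = find_leaks(["870 million", "J. Smith"], fields)
```

Any non-empty result means that a term removed from the page content still survives elsewhere in the file and that the document should not be released yet.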

6. Collaborative Redaction

The SDK supports collaborative redaction, enabling multiple users to work together on redacting documents simultaneously. Decision-makers can assign redaction tasks, track progress, and maintain an audit trail, facilitating teamwork and accountability.

7. Document Watermarking

To further enhance document security, the Apryse WebViewer SDK provides document watermarking capabilities. Your employees can add watermarks, such as "Confidential" or "Do Not Distribute," to redacted documents to reinforce their restricted nature.

8. Video/Audio Redaction

We have comprehensively covered secure redaction of data within documents, but what if you have videos that contain sensitive information? By embedding our add-on Video SDK, you can use WebViewer to annotate and redact video frame-by-frame without needing to use separate video editing software. Simply adjust the redaction length in the video timeline, and select whether you want to redact video only, audio only, or both.

Conclusion

As organizations grapple with the challenges of protecting sensitive information, implementing robust redaction processes becomes crucial. The Apryse WebViewer SDK offers an all-in-one solution for secure and efficient document redaction. By automating the process, providing regex-based redaction, facilitating collaborative workflows, and offering comprehensive verification features, Apryse empowers organizations to safeguard confidential information and comply with privacy regulations.

Embrace the Apryse WebViewer SDK’s power to enhance document security and foster a culture of data protection in your organization. Check out the Apryse Showcase to try out WebViewer’s powerful and user-friendly document redaction capabilities for yourself, along with its best-in-class rendering, conversion, and document manipulation.

Ian Morris

Technical Writer