Auto-tagging for PDF/UA – Improving Accessibility for all using the Apryse SDK

By Roger Dunham | 2024 Dec 05

Summary: PDF SDK auto-tagging helps make documents PDF/UA compliant by automatically adding semantic structure to PDFs, such as headings, paragraphs, tables, and alternative text for images. This structure ensures accessibility for screen readers and assistive technologies, making the content navigable and usable for individuals with disabilities. It is necessary to meet legal accessibility standards (like WCAG) and improve inclusivity in digital documents.

Introduction

PDFs are everywhere and used for a huge range of purposes. They are a great way of storing information so that it is easy to read on either a computer monitor or a mobile device. But what happens if you have limited vision? In that case, understanding a largely visual layout is all but impossible.

This is where the PDF/UA standard aims to help - ensuring that PDF documents are fully usable by people with disabilities, including those who rely on assistive technologies like screen readers.

The standard (formally ISO 14289) specifies rules and structures (such as tagged text, image descriptions, and reading order) to make the content of compliant PDFs accessible – in contrast with a regular PDF, which may contain visual elements that are not easily interpreted.

Figure 1 - Part of a PDF - some parts of the text are headings, some links and some regular text. If you were partially sighted this would be much harder to interpret. PDF/UA attempts to address this.

PDF/UA not only promotes inclusivity, but it is also good for business - improving compliance with accessibility laws and standards, such as the Americans with Disabilities Act (ADA) and the European Accessibility Act.

PDF/UA and the Web Content Accessibility Guidelines (WCAG) both relate to accessibility: PDF/UA is specific to the PDF format, while WCAG addresses web content more broadly.

In this article we will look at how you can use the Apryse SDK to automatically ‘tag’ a PDF as a step towards making it PDF/UA compliant. We will also see how the Apryse SDK does this entirely within software under your control, rather than sending the file over the internet to be processed on a server belonging to a third party.

What are Tags?

Within a PDF, tags are elements that provide information about what the various parts of the PDF are, and how they relate to one another. This is important for accessibility where, for example, a screen reader needs to know how to interpret the various parts of the document.

For example, they can be used to indicate that something is a title rather than regular text or part of a list.

Figure 2 - A tagged PDF, with list items selected (shown in Xodo PDF Studio).
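To make the idea of tags more concrete, here is a minimal sketch of a structure tree in Python. It is a simplified, hypothetical model and not an Apryse SDK type: the `Tag` class and the `outline` helper are names invented for illustration, and the sample content is made up. The role names (Document, H1, P, L, LI) are standard PDF structure types, although a real list item carries more nested structure than is shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One node in a simplified structure tree (hypothetical model, not an SDK type)."""
    role: str                                   # e.g. "Document", "H1", "P", "L", "LI"
    text: str = ""                              # text carried by this node, if any
    children: list["Tag"] = field(default_factory=list)


def outline(tag: Tag, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, visiting children in reading order."""
    label = f"{tag.role}: {tag.text}" if tag.text else tag.role
    lines = ["  " * depth + label]
    for child in tag.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up sample content: a heading, a paragraph and a two-item list.
doc = Tag("Document", children=[
    Tag("H1", "Quarterly report"),
    Tag("P", "Sales rose this quarter."),
    Tag("L", children=[
        Tag("LI", "North region"),
        Tag("LI", "South region"),
    ]),
])

print("\n".join(outline(doc)))
```

The point of the tree is that a screen reader can announce "heading level 1" or "list, two items" from the roles alone, without having to guess from font sizes or positions on the page.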

The Role of Tags in PDF/UA compliance

Tagging alone is not enough for PDF/UA compliance. There are other technical requirements that also need to be met. One example is that all images, charts and logos should have text (“alternative text” or “alt-text”) that describes their content. That’s hardly a new idea though - as early as 1993 the idea was included in the specification for HTML 1.2.

While tools exist to help to describe what an image contains, they are not perfect, so some manual processing will be needed to create meaningful descriptions of images. That manual processing also gives the opportunity to verify that the colors used for the text and background have sufficient contrast to be legible.
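The contrast check mentioned above can be done numerically. The sketch below uses the contrast-ratio formula from WCAG 2.x, assuming colors are given as 8-bit sRGB triples. The function names are our own, and the 4.5:1 threshold is the WCAG level AA value for normal-size text, which may or may not be the bar that applies to your documents.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color: 0.0 for black, about 1.0 for white."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # assumed threshold: WCAG level AA for normal-size text

black, white = (0, 0, 0), (255, 255, 255)
print(f"{contrast_ratio(black, white):.2f}")  # prints 21.00

light_grey = (200, 200, 200)
print(contrast_ratio(light_grey, white) >= AA_NORMAL_TEXT)  # prints False
```

The order of the two arguments does not matter, because the lighter color is always placed in the numerator.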

Adding Tags when creating a PDF

If you are creating a new document using Microsoft Word, then you can use the built-in Accessibility Assistant to make an accessible DOCX file (other word processing tools, such as LibreOffice, have a similar option).

Figure 3 - Word contains an Accessibility Assistant which helps to add tags to a document when it is being created.

You can then export the DOCX file using the Create PDF/XPS document options. This will allow you to specify that you want to include the Document Structure tags.

Figure 4 - Exporting the DOCX file to a PDF including tags.

The exported PDF will then contain tags, and because you have control of the original document this is probably the best way to create an accessible PDF.

But what if you have an existing document that was not tagged when it was created?

Adding Tags to an existing, untagged PDF

If you are given the task of making an existing PDF accessible, then it’s possible to manually add tags to the PDF. Several PDF editors, for example Adobe Acrobat and Xodo PDF Studio, support that functionality.

Figure 5 – It is possible to manually add tags to an existing PDF, but for a large document that is a lot of tedious work

Adding tags by hand, though, would be extremely slow and tedious for all but the simplest of documents. Thankfully the Apryse SDK offers a way to do this automatically.

It does so by leveraging the DataExtraction module, one of a number of modules that provide dedicated functionality to the SDK.

Once you have downloaded the SDK and the DataExtraction module, and obtained a trial license key, the actual processing of a PDF to automatically include tags is trivially easy.

The following code is based on the PDF/UA sample which is available in many languages including C#, C++, Go, Java and Node.js.

// Read input_file, add tags automatically, and write the tagged result to output_file.
PDFUAConformance pdf_ua = new PDFUAConformance();
pdf_ua.AutoConvert(input_file, output_file);

Under the hood, the SDK calculates the reading order of the document, works out what each part of the document is (a table of contents, a list item, a paragraph and so on) and adds tags to the PDF, all with no input needed from you.

The result is a PDF that is fully tagged.

Figure 6 - The auto-tagged PDF.

It’s still necessary to manually verify that the tagging is correct, and to add alt-text to images, but a huge amount of the tedious work has been done for you. What a time saving!

In this example the original document was not tagged at all. Sometimes though you will get a document that is already tagged, but not in a way that is PDF/UA compliant, or indeed the tags might be entirely wrong. That’s not a problem – the Apryse SDK will retag your document for you.
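Because the conversion is a single call, it is straightforward to apply it to a whole folder of documents. The following is a rough sketch of that idea rather than part of the SDK sample: `tag_folder` is a hypothetical helper, and `convert` stands in for whatever function performs the auto-tagging in your environment (for example, a thin wrapper around the auto-convert call shown earlier). Nothing here calls the SDK directly.

```python
from pathlib import Path
from typing import Callable


def tag_folder(
    src: Path,
    dst: Path,
    convert: Callable[[str, str], None],
) -> list[tuple[str, str]]:
    """Run convert(input_path, output_path) for every *.pdf file in src.

    convert is a placeholder for the real auto-tagging call (an assumption,
    supplied by the caller). Returns (file name, error message) pairs for
    the files that could not be converted.
    """
    dst.mkdir(parents=True, exist_ok=True)
    failures: list[tuple[str, str]] = []
    # Note: the "*.pdf" pattern is case-sensitive on most Linux file systems.
    for pdf in sorted(src.glob("*.pdf")):
        try:
            convert(str(pdf), str(dst / pdf.name))
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures.append((pdf.name, str(exc)))
    return failures
```

Collecting failures instead of stopping at the first one fits the workflow described here: the output still needs a manual check afterwards, so a list of files that need attention is more useful than an aborted run.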

Verifying that a document is PDF/UA compliant

Once you have the auto-tagged PDF, you will want to verify if it is PDF/UA compliant.

One tool that you can use to do this is VeraPDF. Before we look at that, it is important to note that while it can test for certain things, for example that all images have alt-text, it can’t verify that the alt-text is meaningful, so there may still be some manual fix-ups needed.
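The gap between "alt-text is present" and "alt-text is meaningful" can be illustrated with a small sketch. The function below is not how VeraPDF or the Apryse SDK works: it is a hypothetical triage helper that marks a value as missing (which a validator can detect) or as suspect (a crude guess, based on our own assumed list of placeholder words and file-name patterns). Everything it labels "ok" still needs a human to read it.

```python
import re
from typing import Optional

# Assumed examples of low-value alt-text; extend these to suit your documents.
PLACEHOLDERS = {"image", "picture", "photo", "figure", "graphic", "logo"}
FILENAME = re.compile(r"^\S+\.(png|jpe?g|gif|bmp|tiff?|svg)$", re.IGNORECASE)


def review_alt_text(alt: Optional[str]) -> str:
    """Classify one alt-text value as 'missing', 'suspect' or 'ok'."""
    if alt is None or not alt.strip():
        return "missing"   # machine-checkable: there is no description at all
    cleaned = alt.strip()
    if cleaned.lower() in PLACEHOLDERS or FILENAME.match(cleaned):
        return "suspect"   # present, but unlikely to help a screen reader user
    return "ok"            # present and not obviously poor; still review by hand


for value in [None, "image", "IMG_0042.jpg", "Bar chart of sales by region"]:
    print(repr(value), "->", review_alt_text(value))
```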

If we verify the PDF that we created using PDFUAConformance.AutoConvert, VeraPDF informs us that it is compliant.

Figure 7 - Having automatically tagged our document, it passes PDF/UA validation.

If you want, there is also a detailed report that gives more information.

Figure 8 – Auto-tagging using the Apryse SDK has made a PDF/UA compliant document.

That was easy!

The disappointing result when using Acrobat to auto-tag a PDF

Adobe Acrobat Pro also offers the ability to automatically tag PDFs.

In fact, it can do so in two ways, either entirely on your local machine, or by uploading your PDF to a server (which then uses the same functionality that is available from the Adobe PDF Services API). Which option is chosen depends on the user’s settings.

Let’s look at the same file converted by both options.

Tagging the PDF locally

Automatically adding tags to the PDF with Acrobat Pro takes just a few seconds when the processing is performed on the local machine. The result is a tagged PDF, which you can inspect using a suitable viewer (whether Acrobat or Xodo PDF Studio).

Figure 9 - Acrobat can automatically tag the PDF, performing the processing on the local machine.

Superficially the result looks good, but there are some problems. If you look at the tag for the title, you can see it is considered an H1-level heading. On the other hand, the next heading – the word “Introduction” – is H3.

Having an H3 heading directly beneath an H1 level one is rather confusing – what happened to level H2? Not only is it confusing, but it is also not compliant with the PDF/UA specification.

Don’t just take my word for it - VeraPDF agrees. It also finds other issues with the file.

Figure 10 - VeraPDF found a number of issues with the automatically tagged file, including that the heading levels were not correct.

Acrobat also contains a PDF/UA verifier, and that reports the same problems.

Figure 11 - Acrobat’s built-in PDF/UA verifier says that the file that Acrobat created was not compliant.
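The heading-level problem described in this section is simple to check mechanically. The sketch below assumes the rule is that numbered headings must start at H1 and must not skip a level when going deeper (going back up by any amount is fine). The function name is our own, and the input is just the list of heading levels in reading order, so `[1, 3]` is a simplified stand-in for the title (H1) followed by “Introduction” (H3).

```python
def skipped_heading_levels(levels: list[int]) -> list[tuple[int, int, int]]:
    """Find headings that jump more than one level deeper than the one before.

    levels holds the heading levels in reading order, e.g. [1, 2, 2, 3].
    Returns (index, previous_level, level) for each offending heading.
    A previous_level of 0 means the document did not start with H1.
    """
    problems = []
    previous = 0  # treat "before the first heading" as level 0
    for index, level in enumerate(levels):
        if level > previous + 1:
            problems.append((index, previous, level))
        previous = level
    return problems


acrobat_local = [1, 3]  # simplified stand-in: title is H1, "Introduction" is H3
print(skipped_heading_levels(acrobat_local))  # prints [(1, 1, 3)]
```

A sequence such as `[1, 2, 3, 2, 3]` produces no findings, since every step down goes exactly one level deeper.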

Tagging the PDF using the Adobe PDF Services API 

Let’s look at the other option that Adobe Acrobat Pro offers - uploading the PDF to a server and having it auto-tagged there. You would expect this to give a better result (otherwise why did Adobe make it available?).

There are of course security issues with uploading a file to a server for processing. While Adobe does not keep a copy of the file on its server, the very process of having the file leave the security of your machines and travel over the internet is a risk to data security – if the file were intercepted, it could have serious consequences.

Figure 12 - Acrobat’s accessibility tagging may upload your sensitive documents to a server.

Risks aside, the cloud-based auto-tagging does successfully create tags, although the structure differs from the one that was generated locally.

One example of this is that the title is now correctly detected as a title.

Figure 13 - The tags created by Acrobat using cloud-based auto-tagging are different from those that were generated locally.

Once again though, the PDF fails PDF/UA compliance, albeit for different reasons.

Figure 14 - The file created by Acrobat's cloud-based auto-tagging is not PDF/UA compliant.

While the problems with the Adobe conversion can probably be resolved, the solution offered by the Apryse SDK is clearly superior. Not only is the PDF created as a compliant file, but the processing is also performed on the local machine, meaning that the document never leaves hardware over which you have control, and it only takes a single function call.

Conclusion

Adding tags to a PDF to support PDF/UA compliance can be hugely laborious. The Apryse SDK provides a fast and secure way of interpreting the PDF and generating appropriate tags. While some manual steps will still be needed (to create alt-text, for example), the tedious work has been removed, allowing you to concentrate on what is important.

Using the Apryse SDK has other benefits too – providing access to a wealth of other functionality, including annotations, redaction, and conversion to and from a wide range of document formats.

There is lots of documentation for the SDK to help you to get started quickly, but if you have any questions, please feel free to reach out to us on Discord.
