PDF to HTML in Exact Mode: Why the Little Things Matter

By Ryan Barr, Roger Dunham | 2024 Jul 26

Summary: The PDF-to-HTML conversion SDK enables the seamless transformation of PDFs into HTML, supporting both reflow and fixed position (exact mode) to ensure high-fidelity rendering and accessibility. It efficiently handles batch processing and customization to meet diverse conversion needs.

Introduction

The PDF file format is great: it gives you shareable documents that look consistent across a wide range of devices (with PDF/A available where conformance to archiving standards is required).

So why is there any need to convert from PDF to HTML?

Historically, HTML had the benefits of:

  1. Being viewable on any platform without any need for a plugin.
  2. Built-in text selection and searching.
  3. Ease of integration into existing web applications.
  4. Efficient indexing by search engines.

However, with the development of Apryse WebViewer, for example, the first three of these benefits can now be achieved using PDFs directly within the browser. Furthermore, Google has been able to index PDFs for years, so the fourth issue is also not a concern.

Some users, however, still need to be able to convert PDFs into HTML – often to support legacy software. If that includes you, then read on.

The Apryse SDK offers two main options when converting a PDF into HTML:

Fixed Position (Exact) Mode: Where the resulting HTML looks exactly the same as the original PDF, with the same font sizes, styles, and so on.

Reflow mode: Where the content of the PDF is extracted but may be rearranged on the page to make it easier to read. For example, with column-based text, rather than reading from top to bottom of one column and then doing the same in the next, the reader can simply read the information as a paragraph. Some customers have used this process as an intermediate step for extracting text from PDFs, though there may be easier ways to do that now, such as IDP.

In this article, we will only look at Fixed Position (or Exact-Mode) conversion and see how the results produced by Apryse compare with those of competitors.

How Fixed Position PDF to HTML Conversion Works

The first step in the process is what we call flattening: merging all non-text elements of a PDF page into a single background image that accurately reproduces the appearance of even the most complex PDF files. While this may sound simple, it is actually quite complex once you take the entire PDF specification into account.

The other major part of this task is deciding which text cannot be displayed correctly unless it is also merged into the background image. This typically happens when the color of the text is blended with non-uniform or gradient coloring in the background. Since the PDF standard supports sixteen different blend modes for merging colors, you can imagine that this is a complex issue.
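
To make the blending problem concrete, here is a minimal sketch of a single blend mode, Multiply, which the PDF specification defines per channel as backdrop × source. The helper names and the sample colors are assumptions chosen for illustration; this is not how the Apryse SDK is implemented.

```python
def multiply_blend(backdrop, source):
    """Blend two RGB colors (channels in 0.0-1.0) with the Multiply formula."""
    return tuple(round(b * s, 3) for b, s in zip(backdrop, source))


def horizontal_gradient(start, end, steps):
    """Return `steps` RGB colors interpolated linearly from start to end."""
    if steps < 2:
        raise ValueError("a gradient needs at least two steps")
    return [
        tuple(s + (e - s) * i / (steps - 1) for s, e in zip(start, end))
        for i in range(steps)
    ]


# Hypothetical example: orange text drawn over a white-to-blue gradient.
text_color = (1.0, 0.5, 0.0)
backdrop = horizontal_gradient((1.0, 1.0, 1.0), (0.0, 0.0, 1.0), steps=5)

# The visible color of the text is different at every sample point.
blended = [multiply_blend(pixel, text_color) for pixel in backdrop]
for pixel in blended:
    print(pixel)
```

Because the blended result changes from one sample to the next, no single text color in HTML reproduces it, which is why such text may need to be merged into the background image.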

A further part of the process is working out which text is partially hidden behind another object and dealing with it appropriately, deciding whether it should stay as text or be merged into the background image. It’s not easy, and it is easy to get wrong.
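
The keep-or-merge decision can be pictured as a simple geometric test. The sketch below is only an illustration under stated assumptions: it treats the text and the covering object as axis-aligned rectangles, and the 10% threshold, the `Rect` type, and the function names are all hypothetical rather than taken from the Apryse SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    bottom: float
    right: float
    top: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.top - self.bottom)


def overlap_fraction(text: Rect, cover: Rect) -> float:
    """Fraction of the text box that lies underneath the covering object."""
    width = min(text.right, cover.right) - max(text.left, cover.left)
    height = min(text.top, cover.top) - max(text.bottom, cover.bottom)
    if width <= 0 or height <= 0 or text.area() == 0:
        return 0.0
    return (width * height) / text.area()


def classify_text(text: Rect, cover: Rect, slight: float = 0.10) -> str:
    """Keep barely covered text as text; merge anything more hidden."""
    if overlap_fraction(text, cover) <= slight:  # assumed threshold
        return "keep_as_text"
    return "merge_into_background"


word = Rect(left=0, bottom=0, right=10, top=2)
print(classify_text(word, Rect(9.5, 0, 20, 5)))  # only the edge is covered
print(classify_text(word, Rect(8, 0, 20, 5)))    # a fifth of the word is covered
```

A real converter has to cope with rotated text, irregular shapes, and transparency, so a rectangle test like this is only a starting point.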

Apryse offers two tools that support Exact-mode conversion: Xodo PDF Studio and the Apryse SDK. The former is a desktop tool, the latter is a library that you can use within your own code.

What about Xodo.com?

Apryse offers an online PDF-to-HTML converter via Xodo. However, that creates reflowed, rather than fixed position, HTML and is therefore not a like-for-like tool. As such, it has been excluded from this test.

Both Apryse tools do a great job of this conversion, and we will see in a minute how they compare with a range of free online PDF to HTML conversion tools.

Before we do that, let's look at a few things that can go wrong in the conversion. To help us, we will use a sample PDF that contains lots of vectors and rotated text. In this case, the various rivers and boundaries are all created using vectors, but the text consists of characters from fonts and may be angled or sinuous rather than just horizontal and straight lines.

Figure 1 - The original PDF viewed on xodo.com.

Because the PDF is vector-based, you can zoom in and the content stays crisp.

Figure 2 - Part of the same PDF, zoomed in to 4800%. The text and lines are still 'crisp'.

There are three common issues that can happen when converting this PDF into HTML.

Text is converted into an image

Ten years ago, it was common for all text to be converted into images. Today, the horizontal text is retained as text by all of the tools tested, but the rotated text is sometimes converted into an image.

Figure 3 - The conversion with sodaPDF turns the angled text into part of the background image, with a consequent loss of quality. Note that there is also an offset, so the word Danube is now over the river, which was not the case in the original file.

This is a problem since reading blurry text is slow and tiring. Furthermore, converting text to an image means that the ability to search for the text is lost.

Words having incorrectly aligned or spaced letters

Some tools create words where the individual characters are not located correctly. This results in words that are difficult to read. You can see this in the word ‘Danube’ in the example below.

Figure 4 - pdf.io has converted the angled text as text, but the location of the characters does not match the original PDF. Look at the word 'Danube'.

Other tools, like Xodo PDF Studio, do a great job with the text being correctly located and spaced, even when it is curved.

Figure 5 - The output from Xodo PDF Studio. The letters in 'Danube' are correctly located.
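
To illustrate why character placement is easy to get wrong, here is a minimal sketch that places glyphs along a straight, rotated baseline: each glyph starts where the previous one ended, measured along the baseline direction. The function name, the advance widths, and the 30° angle are assumptions for illustration. Curved text like the river labels would additionally need a separate angle per glyph, which this sketch does not attempt.

```python
import math


def glyph_origins(start, angle_degrees, advances):
    """Return the origin of each glyph along a straight, rotated baseline.

    `advances` holds each glyph's advance width in the same units as `start`.
    """
    angle = math.radians(angle_degrees)
    dx, dy = math.cos(angle), math.sin(angle)
    origins = []
    distance = 0.0
    for advance in advances:
        origins.append(
            (round(start[0] + distance * dx, 3), round(start[1] + distance * dy, 3))
        )
        distance += advance
    return origins


# Hypothetical advance widths for the six letters of "Danube" on a 30 degree baseline.
for letter, origin in zip("Danube", glyph_origins((100.0, 200.0), 30, [7, 5, 6, 6, 6, 5])):
    print(letter, origin)
```

If a converter applies the advances along the horizontal axis instead of along the rotated baseline, or rounds each position independently, the letters drift apart in the way seen in the poorly converted examples.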

Occluded (hidden) text being incorrectly rendered

At the top right of the sample file, there is a title box that is hiding part of at least one word.

Figure 6 - The original PDF contains text, marked with the arrow, that is mostly hidden (in this case by the title).

Many converters don’t correctly allow for this, so that the word is visible even though it shouldn’t be.

Figure 7 - The output from FreeConvert (and many other online tools) does not correctly obscure the word, so that the generated HTML does not look like the PDF.

While, in this case, the result just looks a little wrong, it is easy to imagine how it could cause confusion – was the word in the title, or behind it? And what if the word had been hidden in the PDF as a weak form of redaction (people really do that), and now your customer gets to see what was intended to be hidden…?

Read more about Apryse's Redaction capabilities.

Apryse does its best to avoid these situations. Where text is only slightly occluded the issue may be ignored, but in other cases ‘flattening’ the text into the background image is appropriate. In either case, the text remains in the HTML output as transparent text elements so that text search and selection can still work.
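
The idea of transparent text over a background image can be sketched in a few lines. The markup below is an assumption made for illustration (the actual HTML emitted by Apryse may be structured differently), and `overlay_page` is a hypothetical helper.

```python
from html import escape


def overlay_page(background_src, width, height, words):
    """Build one fixed-position page: a background image plus invisible text.

    `words` is a list of (text, left_px, top_px, font_size_px) tuples.
    """
    spans = []
    for text, left, top, font_size in words:
        spans.append(
            f'<span style="position:absolute;left:{left}px;top:{top}px;'
            f'font-size:{font_size}px;color:transparent">{escape(text)}</span>'
        )
    return (
        f'<div style="position:relative;width:{width}px;height:{height}px;'
        f'background:url({escape(background_src)}) no-repeat">'
        + "".join(spans)
        + "</div>"
    )


# The word is already painted (or hidden) in the background image;
# the transparent span only exists so that it can be found and selected.
page = overlay_page("page1.png", 800, 600, [("Arad", 640, 48, 12)])
print(page)
```

Since the span uses a transparent color rather than being removed, the browser's find-in-page and text selection still see the word even though nothing extra is drawn.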

This means that in this example the output from Apryse is extremely similar to the PDF.

Figure 8 - The output from Xodo PDF Studio. The word is correctly hidden but can be searched for if required.

Furthermore, while the word ‘Arad’ is hidden, it can still be searched for within the browser, which is extremely cool.

Comparing different tools

Six different online converters, from the wide range available, were compared with the output from Xodo PDF Studio and the Apryse SDK.

The comparison looked at:

  • Generated file format (this can be a single file, or a zip file or folder that contains a collection of separate files)
  • File size (for zip files this was the size of the unzipped package)
  • Whether angled text was kept as text
  • Whether angled text had correct spacing and alignment
  • Whether occluded text was correctly hidden
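
For the file size criterion, a small helper like the one below can measure either a single HTML file or an unzipped output folder in the same way. It is a sketch with a hypothetical function name, not the script used for the comparison.

```python
from pathlib import Path


def output_size_bytes(path):
    """Total size of a converter's output: one file, or every file in a folder."""
    target = Path(path)
    if target.is_file():
        return target.stat().st_size
    # Walk the whole tree so that nested image and font folders are counted.
    return sum(f.stat().st_size for f in target.rglob("*") if f.is_file())


# Hypothetical usage: output_size_bytes("converted/map.html")
#                     output_size_bytes("converted/map_unzipped/")
```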

In each case, the conversion was performed using default options since, let’s face it, no one ever reads instructions, and a good vendor will have chosen default values that give a good result.

The Apryse SDK is a library rather than a pre-built tool, so a selection of typical default options was used.

Results

The results are shown below:

[Image: comparison results table]

There are lots of PDF to HTML converters out there. However, the output quality of many of them is not perfect. If you want exact conversion, then “not-quite” isn’t really good enough.

Are there other disadvantages to using online tools?

Online alternatives might be quick and easy, but do you really know what they do with your files? Are they safely deleted, or could they be leaked to the internet, as has recently happened for thousands of personal documents? Do you want to risk someone outside your company being able to view them?

With both Xodo PDF Studio and Apryse SDK, the conversion is performed on your own hardware, so the risk of data being intercepted or retained by a third party is completely removed.

Next Steps

Xodo PDF Studio offers an affordable desktop solution that you can use today to convert files (and perform many other operations on files).

Alternatively, you can use the Apryse SDK to build a solution dedicated to your specific needs. You have easy access to a wide range of options, giving you control over how the conversion works. If you run into any problems, you can contact us via our Discord channel.

As a closing thought, if you are just embarking on the process of converting from PDF to exact-mode HTML you may want to ask whether using an in-browser PDF Viewer such as WebViewer might be a better solution. Alternatively, if you are converting from PDF to reflowed HTML, then you may want to check out other conversion options such as converting directly from PDF to Office format.
