PDF vs HTML: Choose the Better Format for Document Viewing

By Roger Dunham | 2024 Jun 05

Summary: Explore the difference between viewing PDFs directly or converting them to HTML first. We'll discuss the advantages of each format and provide guidance on how to convert PDFs to HTML using Apryse tools. We also highlight how choosing WebViewer can make content reflow easy and seamless, ensuring a better viewing experience.

Introduction

Here’s the problem – you have a PDF, and you want to view its contents in a web browser. What’s the best way to do it?

In this article, we will look at the benefits of viewing the PDF directly, rather than converting it into HTML and then viewing the resulting HTML.

Before we do though, let’s look at the benefits of each format.

Why View Documents in HTML?

HTML is the language of the internet, and there are good reasons for that:

  • HTML is primarily about content. Well-structured HTML can reflow to match the browser's size and shape, making it easy to read on a wide range of devices.
  • HTML is very versatile and can support text, images, video, and sound.
  • HTML is easy to style using CSS (Cascading Style Sheets).
  • HTML can have interactivity added to it using JavaScript.
  • HTML is SEO friendly. Search engines can easily crawl and index the content, improving a website’s search engine results.
  • HTML potentially supports accessibility. Semantic elements, if present, make web content available to people with disabilities.

Why View Documents in PDF?

PDF is a ubiquitous file format – used for invoices, legal contracts, financial results, and pay slips, to name just a few. The number of PDFs in use is also vast. Ten years ago, it was estimated that there were 2.5 trillion PDFs in the world, and the number now is likely to be many times higher. It is truly a file format that is here for the long term.

But why is the PDF format so successful?

  • PDFs are about appearance. The way the content is designed is paramount.
  • PDF supports complex formatting that looks the same across a wide range of devices and platforms.
  • PDF files can be viewed on almost any type of device using freely available PDF readers like Adobe Acrobat Reader, Xodo, and Firefox.
  • PDF supports file security. It can either completely restrict access, or limit access to specific functionality, like the ability to print or extract pages.
  • PDF supports digital signatures, making it suitable for legal contracts. There is confidence that the document has not been altered since signing, and that the document was actually signed by the person who claims to have done so.

PDFs also have some great additional content features:

  • They can contain layers, allowing the user to choose to see just some parts of the data. This is great for things like CAD drawings where the volume of the entire data can be overwhelming.
  • They support PDF Portfolios (or Packages), allowing the PDF to be used as a container for attaching other files. This is great for bundling legal documents together.

Note that not all PDF readers support these features. Portfolios, for example, are supported in Adobe Acrobat and Apryse WebViewer but not in Chrome or many other viewers.

Sample Documents

The success of converting from PDF to HTML depends largely on the document type. Here, we will look at three distinctly different types of documents.

A Sales and Purchase Agreement

This is a simple monochrome text-based document, with numbered lists and various heading styles. It is typical of many kinds of legal documents and reports.

Figure 1 – A typical legal document

A Sales Brochure

Sales brochures often have graphics, images, and text that may be full width or split into columns. The text in sales brochures and similar documents is also likely to be in multiple colors and fonts, and the overall layout has usually been carefully designed.

Figure 2 – A typical sales brochure with graphics and multiple columns

A CAD Drawing

It is possible to convert a CAD (computer-aided design) drawing into a PDF. You can try that out at the Apryse Showcase.

In such cases there is often a lot of data, typically split into multiple layers – one for each aspect of the design (e.g., Water, HVAC, etc.).

This means that when someone looks at the PDF, they can see a subset of the data by enabling or disabling layers, rather than being overwhelmed with all the detail.

Figure 3 – A CAD drawing where each color relates to a different aspect of the design

With this amount of data and complexity, it is often necessary to zoom in closely to see all the details – perhaps needing to zoom in to 40 times the original document size.

Figure 4 – A small part of the PDF. Being able to zoom in to 4,000% allows you to see the details. Not all PDF viewers support that level of zoom.

Let’s see how these documents appear when viewed directly as PDF.

Viewing a PDF Directly

All three of our PDFs can be viewed directly as PDFs. We know that, since the screenshots above were all taken using xodo.com – which uses the Apryse WebViewer to render PDFs directly within the web browser (Chrome in my case).

In fact, the screenshots also illustrate one of the great things about PDF viewers. You can choose whether to see just a single page, or two pages side by side, or the entire document. You also have access to thumbnails, layers, bookmarks, and so on – all within a single intuitive application.

What are the Problems with Viewing PDFs Directly?

One of the main issues is that the layout of the PDF is specified within the PDF. This means if the document display doesn’t fit into the device, it may not be possible to see all the content without scrolling horizontally.

Figure 5 – On a narrow device it may not be possible to see all of the document at the same time, requiring the user to scroll horizontally.

The problem is even worse if there are multiple columns, since it may then be necessary to scroll both vertically (to read all of one column) and horizontally (to read the next column). That can make it hard to get an overview of the document.

Figure 6 – Xodo on an emulated device. It is not easy to see the entire document at once – you need to scroll both vertically and horizontally to read the content.

Similarly, if the document is very large, like our CAD drawing, it may be difficult to read one particular part unless the document is zoomed in. In that case, it is necessary to scroll both vertically and horizontally to read other parts of the document.
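The scrolling problem described above comes down to simple arithmetic. The sketch below is only an illustration: it assumes a viewer that maps PDF points (1/72 inch) to CSS pixels (1/96 inch) at 100% zoom, and the function names, the US Letter page width, and the 390-pixel viewport are all made up for the example.

```javascript
// PDF page dimensions are measured in points (1/72 inch); CSS pixels are 1/96 inch.
function pointsToCssPixels(points) {
  return points * 96 / 72;
}

// Width of the rendered page on screen at a given zoom (1 = 100%).
// Assumption: the viewer shows 1 point as 96/72 CSS pixels at 100% zoom.
function renderedPageWidthPx(pageWidthPt, zoom) {
  return pointsToCssPixels(pageWidthPt) * zoom;
}

// True when the page is wider than the viewport, so the reader has to scroll sideways.
function needsHorizontalScroll(pageWidthPt, zoom, viewportWidthPx) {
  return renderedPageWidthPx(pageWidthPt, zoom) > viewportWidthPx;
}

// Zoom factor at which the full page width exactly fills the viewport.
function fitWidthZoom(pageWidthPt, viewportWidthPx) {
  return viewportWidthPx / pointsToCssPixels(pageWidthPt);
}

// Example values (assumptions): a US Letter page (8.5 inches = 612 points)
// viewed on a phone with a 390-pixel-wide viewport.
const LETTER_WIDTH_PT = 612;
const PHONE_VIEWPORT_PX = 390;

console.log(needsHorizontalScroll(LETTER_WIDTH_PT, 1, PHONE_VIEWPORT_PX)); // true
console.log(fitWidthZoom(LETTER_WIDTH_PT, PHONE_VIEWPORT_PX).toFixed(2)); // 0.48
```

In other words, with a fixed layout the reader either scrolls sideways at full size or shrinks the page to roughly half size to fit the screen.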

Converting the PDF to HTML

There are lots of ways to convert from PDF to HTML – you can do it directly within Adobe Acrobat, Xodo.com, PDF Studio Pro, and many other tools. You can also do it using the Apryse SDK Structured Output module, which is available for Windows, Linux, and macOS.

For now, let’s look at the Structured Output method, since that gives us various configuration options which help illustrate what can be done.

Structured Output is easy to call from a variety of languages including C++, Java, Go, and C#.

// Optional: an options object that can be used to customize the conversion.
var htmlOutputOptions = new pdftron.PDF.Convert.HTMLOutputOptions();

// Convert the PDF at inputPath into HTML written to outputFile.
pdftron.PDF.Convert.ToHtml(inputPath, outputFile, htmlOutputOptions);

The htmlOutputOptions object is optional, but if present, it allows us to modify the conversion behavior. We will see that in a few moments.

If we use the default settings, the generated HTML appears to be extremely similar to the original files.

Figure 7 – The three PDFs converted into HTML and displayed in a browser

However, if you zoom in on the CAD drawing, you start to see some real limitations of the method. Not only are layers no longer supported, but the lines of the CAD drawing are now wide and blurry, so there is no way to zoom in to the same level of detail. This is because those lines were originally zero width, so they rendered as 1 pixel wide, no matter how much you zoomed in on the PDF.

Figure 8 – Just a small part of the CAD drawing. It is no longer possible to zoom into the detail.
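The difference in line quality can be sketched numerically. In a PDF, a line width of zero means "the thinnest line the device can draw", so it stays one pixel wide at any zoom. The model below is a simplification with hypothetical function names. It also assumes that the converted output stores the line at a fixed pixel width, which browser zoom then magnifies like everything else; that is a guess at why the lines look wide, not a description of how any particular converter works.

```javascript
// Stroke width, in device pixels, of a line drawn by a PDF renderer.
// A width of 0 is a hairline: it is clamped to 1 pixel at every zoom level.
// (Simplification: real renderers may treat thin non-zero widths differently.)
function pdfStrokeWidthPx(lineWidthPt, zoom) {
  const scaledPx = (lineWidthPt * 96 / 72) * zoom;
  return Math.max(1, scaledPx);
}

// Stroke width of a line that was baked into the converted output at a fixed
// pixel width (assumption) and is then magnified by the browser's zoom.
function scaledStrokeWidthPx(widthAt100PercentPx, zoom) {
  return widthAt100PercentPx * zoom;
}

// Compare the two at 100%, 1,000% and 4,000% zoom.
for (const zoom of [1, 10, 40]) {
  console.log(zoom, pdfStrokeWidthPx(0, zoom), scaledStrokeWidthPx(1, zoom));
}
```

At 40 times zoom the PDF hairline is still one pixel wide, while the baked-in line has grown to 40 pixels and hides the detail around it.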

Furthermore, we have lost the ability to see two pages of the same document side by side, or to see thumbnails, or layers. We still have the same problems we had with displaying the document on a small device – you can’t see everything at once.

At its best, we can recreate an HTML-based web page that looks as good as the PDF. But it might be less functional, it has all the same problems, and we have to do extra work to get there. This seems like the worst of all worlds.

Before we abandon conversion to HTML entirely, let’s look at one of the conversion options that may be useful: the ability to reflow text.

Reflowing content requires the software to detect the logical reading flow of the text and create HTML that supports that. This means that if the page no longer has a fixed width, the text should wrap around, making everything readable.
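To make the idea concrete, here is a toy sketch of the two steps involved: putting text fragments into logical reading order, then wrapping the text to whatever width is available. It is not how Structured Output works internally. The fragment shape (`column`, `y`, `text`) and both function names are invented for the illustration, and the column of each fragment is assumed to be already known, which is the hard part in practice.

```javascript
// Step 1: order fragments by logical reading flow, i.e. finish one column
// (top to bottom) before moving on to the next column.
function readingOrder(fragments) {
  return [...fragments].sort((a, b) => a.column - b.column || a.y - b.y);
}

// Step 2: greedy word wrap to the available width, measured in characters
// here for simplicity. A word longer than the width gets a line of its own.
function reflow(text, maxCharsPerLine) {
  const words = text.split(/\s+/).filter(Boolean);
  const lines = [];
  let current = '';
  for (const word of words) {
    const candidate = current ? current + ' ' + word : word;
    if (current === '' || candidate.length <= maxCharsPerLine) {
      current = candidate;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current) lines.push(current);
  return lines;
}

// A hypothetical two-column page, listed in the order the fragments appear
// when scanning the page from top to bottom.
const fragments = [
  { column: 0, y: 0, text: 'Alpha beta' },
  { column: 1, y: 0, text: 'delta epsilon' },
  { column: 0, y: 20, text: 'gamma' },
];

const text = readingOrder(fragments).map((f) => f.text).join(' ');
console.log(reflow(text, 12)); // narrow screen: several short lines
console.log(reflow(text, 80)); // wide screen: everything fits on one line
```

The same ordered text is wrapped differently for each width, which is what a fixed-layout page cannot do.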

Enabling reflow with Structured Output is easy – you just need to specify the option ContentReflowSetting.e_reflow_full.

var htmlOutputOptions = new pdftron.PDF.Convert.HTMLOutputOptions();

// Ask the converter to reflow the content rather than keep the fixed page layout.
htmlOutputOptions.SetContentReflowSetting(pdftron.PDF.Convert.HTMLOutputOptions.ContentReflowSetting.e_reflow_full);

pdftron.PDF.Convert.ToHtml(inputPath, outputFile, htmlOutputOptions);

If we look at our three sample documents, we get quite different results.

For our legal document, the result is great. The text reflows beautifully and we can read it, even on a narrow device.

Figure 9 – The reflowed sales and purchase agreement, shown on an emulated device. The content is all readable. This is a great result.

We also get good results for the sales brochure. The text reflows well, and the vertical columns are well handled. However, there was some white text on a blue background, and that is now being shown as white text on a white background.

There are ways to solve that, but it is just extra hassle.

Figure 10 – The sales brochure. Content has reflowed making it easy to read. The conversion to HTML has also kept the color of the text that was in the quote at the bottom of the page – it is now white on a white background.

Another problem is that graphics are hard to reflow. They tend to get broken up, and their context is lost.

Figure 11 – The problem with graphics: they are difficult to reflow.

Not being able to handle graphics is a disaster when trying to convert a CAD drawing into HTML. How on earth can you try to reflow that?

What tends to happen is that some text is reflowed, but the vast majority of the actual CAD drawing remains as a monolithic block, so it is still not very useful.

Figure 12 – Reflowing a CAD drawing still does not make it useful.

Even where reflow works well, converting from PDF to HTML still means a loss of functionality.

In addition to losing support for page views and layers, converting from PDF to HTML also means that you lose the ability to control which users have access to the content (at least from within the file itself), as well as the built-in digital signature support.

A Third Way: Apryse WebViewer and Reader Mode

Apryse WebViewer comes to the rescue with “Reader mode.” This option allows you to toggle between seeing the PDF in “fixed position” and seamlessly reflowing the text. Reader mode is not enabled by default, but it’s easy to enable it.

instance.UI.enableElements(['readerPageTransitionButton']); 

Once that is done an extra item appears on the ViewControls.

Figure 13 – The Reader option in ViewControls when it has been enabled
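For context, `instance` in the one-line snippet above is the object that WebViewer provides once it has loaded. A minimal sketch of where that call might live is shown below. The helper name `enableReaderMode` is made up, the `path`, `initialDoc`, and element id in the comment are placeholders, and the stub object exists only so that the sketch can be run outside a browser.

```javascript
// Enables the Reader mode toggle on a WebViewer instance.
function enableReaderMode(instance) {
  instance.UI.enableElements(['readerPageTransitionButton']);
}

// In a real page this would run once WebViewer has initialized, for example
// (placeholder paths and element id, adjust to your own project):
//
//   WebViewer({ path: '/lib/webviewer', initialDoc: '/files/brochure.pdf' },
//             document.getElementById('viewer'))
//     .then(enableReaderMode);

// Stand-in for a WebViewer instance: it only records what was enabled.
const enabledElements = [];
const stubInstance = {
  UI: { enableElements: (ids) => enabledElements.push(...ids) },
};

enableReaderMode(stubInstance);
console.log(enabledElements);
```

Keeping the call in a small function like this makes it easy to apply the same setup to every viewer on a site.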

When you click on this option, the PDF is automatically reflowed, making it easy to read. Graphics and images aren’t shown, because as we saw earlier, it is difficult to reflow graphics. So this still won’t work for CAD drawings.

Figure 14 – The sales brochure in Reader mode. The text in the columns has been reflowed and what was white text at the bottom of the page is also now visible.

Reader mode has a few other benefits: You can quickly toggle between “Reader” and “Normal” mode, you can show thumbnails, and you can still add annotations to the PDF. (Note that the range of annotation types available in Reader mode is reduced – they need to apply to the text, not the page location.)

Figure 15 – Adding an annotation to text in Reader mode

And those annotations will still be present when you swap to Normal mode.

Figure 16 – The annotation is still available when you switch back to Normal mode.

That’s some pretty awesome functionality. You get all the benefits of reflowed HTML and fixed layout.

Conclusion

Virtually everything you might want to do with the generated HTML can already be done with the PDF, provided you have a suitable viewer. You can search for text within a PDF, and search engines can index your website whether it uses PDF or HTML.

It is certainly possible to convert from PDF to HTML and then display the HTML in a browser. There are many tools either from Apryse or other vendors that do so, but there really is no need to do that. At best, the quality is the same, but with extra effort.

There are some great benefits of not converting to HTML. If you are using Apryse WebViewer, you have support for layers and portfolios, access to awesome zoom levels, the ability to compare two documents side by side, and annotation functionality. This enables multiple people to collaborate on a single document at the same time.

Plus, there is the support for digital signatures. And, with just a little work, the ability to have AI summarize your PDFs for you – try out the AskPDF feature on xodo.com.

One niche case for converting from PDF to HTML is to provide an intermediate format that can then be parsed for data analysis. If that is your thing, you should check out IDP – Intelligent Document Processing – which may be able to offer the same results faster and more simply.

When you are ready to get started with Apryse, try out the online showcase, then get yourself a trial license and dive in. If you run into any problems, reach out to us on Discord and our Solution Engineers will be happy to help.
