How to Integrate a PDF Viewer into HTML5 Apps

By James Borthwick | 2013 Aug 08

9 min

The Need for an HTML5 PDF Viewer

One area of the user experience where HTML5 apps have historically been weak is their ability to display a PDF within the app. For a long time, “viewing” a PDF on the web meant downloading it and opening it in a different program. Next came browser PDF plugins, which would take over the browser screen in order to display the PDF. This was a small improvement, but still not integrated and certainly not a good user experience.

So, if the goal is to add an embedded PDF viewer using HTML5 into a web app, how can that be done? There are a number of approaches, each with pros and cons. Keep reading to see what techniques exist, and which might be best for your app.

Techniques to Embed a PDF Viewer in HTML5 Apps

1. Rasterization to Images

This is the simplest way to get a “PDF” onto the web: convert each page of the PDF to an image with a command-line tool and serve the images. The result is compatible with all browsers on all operating systems. However, there are some issues:

No vector content limits quality at high resolutions
Storage- and bandwidth-heavy bitmap data
Does not support PDF capabilities such as forms or a standard method of annotations
Scalability problems: rasterization is computationally expensive, and the resulting images must be stored for every page
Requires extra work to implement text selection and indexability
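
To make the storage and bandwidth concern concrete, here is a rough back-of-the-envelope sketch. The page size (US Letter) and the 4 bytes per pixel are assumptions chosen for illustration, and the figures are for uncompressed bitmaps; real image formats compress, but the growth with resolution is the same.

```javascript
// Estimate the uncompressed bitmap size of one rasterized page.
// Assumptions (not from the article): US Letter page (8.5 x 11 in), 4 bytes per pixel (RGBA).
function rasterPageBytes(dpi, widthIn = 8.5, heightIn = 11, bytesPerPixel = 4) {
  const widthPx = Math.round(widthIn * dpi);
  const heightPx = Math.round(heightIn * dpi);
  return widthPx * heightPx * bytesPerPixel;
}

// Print the per-page cost at a few common resolutions.
for (const dpi of [72, 150, 300]) {
  const mib = rasterPageBytes(dpi) / (1024 * 1024);
  console.log(`${dpi} dpi: ${mib.toFixed(1)} MiB per page before compression`);
}
```

Doubling the resolution quadruples the pixel count, which is why supporting deep zoom with pre-rendered images gets expensive quickly.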

While converting to images may be a good solution for some applications, it is unlikely to be an optimal one. So what can we do?

2. HTML DOM

The idea here is to use the browser’s native text rendering and layer it on top of an image that contains all of the non-text data. (This technique is implemented by Apryse in pdftron.PDF.Convert.ToHtml().) While it sounds like an incremental change from full rasterization, there are some significant advantages:

Text quality is often preserved. People are especially sensitive to the quality of text, so preserving the vector nature of the glyphs is a big improvement.
Lets the user rely on the browser’s standard text selection and copying, and the text can also be read by search engine robots.

So while this is a step up from full rasterization, problems remain:

Quality is sacrificed for all non-text data, which is still rasterized.
Accurate text positioning is possible; however, it requires a separate element for every letter. Doing this reduces page load speed and the ability to search/index/select text. So one must accept this limitation, or instead accept somewhat inaccurate text positioning.
Degrades to full PDF rasterization when text is semi-transparent, partially occluded or covered by transparent objects, pattern-filled objects, etc.
It is easy for users to save DOM content locally, which is a concern if serving copyrighted content.
Storage requirements could be significant.
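
The positioning trade-off can be sketched as follows. This is only an illustration, not how any particular converter works: the input shape (`text`, `y`, and per-glyph `xs`) and the use of absolutely positioned `span` elements are assumptions made for the example.

```javascript
// Hypothetical input: a text run with the exact x position (in px) of every glyph,
// as a converter might extract it from the PDF.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// One element per letter: exact placement, but the text is fragmented across the DOM.
function perLetterHtml(run) {
  return [...run.text].map((ch, i) =>
    `<span style="position:absolute;left:${run.xs[i]}px;top:${run.y}px">${escapeHtml(ch)}</span>`
  ).join('');
}

// One element per run: a single node, and the browser spaces the glyphs itself.
function perRunHtml(run) {
  return `<span style="position:absolute;left:${run.xs[0]}px;top:${run.y}px">${escapeHtml(run.text)}</span>`;
}

const run = { text: 'PDF', y: 40, xs: [10, 17.2, 24.9] };
console.log(perLetterHtml(run)); // three elements for three letters
console.log(perRunHtml(run));    // one element; glyph spacing is left to the browser
```

The per-letter form multiplies the DOM node count by the number of characters on the page, while the per-run form keeps words intact for search and selection at the cost of exact glyph placement.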

3. SVG

The W3C recognized the need to bring high-quality vector graphics to the web, and proposed SVG (scalable vector graphics). At first, this technology seemed very promising: it would deliver the vector data and precise positioning we want, with fonts, gradients, masks and more. A “PDF killer,” some predicted. Apryse took action and developed the first PDF to SVG converter in 2001. However, widespread adoption of SVG and the supplanting of PDF never came to pass. Why not? Here are a few reasons:

SVG is not fully compatible with the PDF graphics model (e.g. transparency/blend mode), making it impossible to faithfully reproduce PDF content using SVG.
A bloated spec designed to also compete with Flash, incorporating scripting and animation, placed a heavy burden on those wishing to implement the spec completely.
It is missing support for efficient monochrome compression, which is important for many scanned business documents.
Worst of all, most implementations were incomplete and buggy. Until IE9, Microsoft did not support SVG at all, and even now there is no support for SVG fonts. In other browsers (Chrome, Firefox) there are many glitches related to text positioning.

SVG had some built-in technical limitations, but its biggest problem was (and still is) a lack of complete and correct implementations within browsers. Ultimately it has found success in certain niches, but it has not experienced widespread adoption for general use cases.

4. HTML5 Canvas

So where does that leave us? Not surprisingly, we are going to take a close look at “HTML5,” specifically the canvas. Does this technology finally deliver the ability to view a PDF inline? Will it succeed where others have come up short?

The HTML5 Canvas gives us 2D drawing capabilities similar to a system-level library such as GDI or Direct2D on Windows, or Quartz on OS X and iOS. This means that shapes, curves, text, and opacities can be represented mathematically, and rendered by the canvas at any resolution. So the big question is: can we “translate” the mathematical representation of content in a PDF into a series of JavaScript commands that draw it to the HTML5 Canvas? Let’s take a look.
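
As a minimal sketch of what such a translation looks like, the function below maps a handful of PDF path operators (`m`, `l`, `c`, `re`, `h`, `f`, `S`) onto the canvas 2D context methods with the same meaning. The pre-parsed `ops` format and the `recordingContext` stand-in are hypothetical, introduced only so the example runs outside a browser. The sketch ignores graphics state, clipping, fill rules other than nonzero winding, and the fact that PDF's y axis points up while the canvas's points down.

```javascript
// Translate a pre-parsed list of PDF path operators into canvas 2D calls.
// `ops` is a hypothetical format: [{ op: 'm', args: [x, y] }, ...].
function drawPdfPath(ctx, ops) {
  ctx.beginPath();
  for (const { op, args = [] } of ops) {
    switch (op) {
      case 'm': ctx.moveTo(...args); break;         // begin a new subpath
      case 'l': ctx.lineTo(...args); break;         // straight line segment
      case 'c': ctx.bezierCurveTo(...args); break;  // cubic Bezier: x1 y1 x2 y2 x3 y3
      case 're': ctx.rect(...args); break;          // rectangle: x y width height
      case 'h': ctx.closePath(); break;             // close the current subpath
      // PDF painting operators also end the path, so start a fresh one afterwards.
      case 'f': ctx.fill(); ctx.beginPath(); break;
      case 'S': ctx.stroke(); ctx.beginPath(); break;
      default: throw new Error(`Unsupported operator: ${op}`);
    }
  }
}

// Stand-in for a real context that records calls; in a browser you would pass
// canvas.getContext('2d') instead.
function recordingContext() {
  const calls = [];
  const ctx = { calls };
  for (const name of ['beginPath', 'moveTo', 'lineTo', 'bezierCurveTo',
                      'rect', 'closePath', 'fill', 'stroke']) {
    ctx[name] = (...args) => calls.push([name, ...args]);
  }
  return ctx;
}

// A filled triangle, as it might appear in a PDF content stream: 0 0 m 10 0 l 10 10 l h f
const ctx = recordingContext();
drawPdfPath(ctx, [
  { op: 'm', args: [0, 0] },
  { op: 'l', args: [10, 0] },
  { op: 'l', args: [10, 10] },
  { op: 'h' },
  { op: 'f' },
]);
console.log(ctx.calls.map(([name]) => name).join(' -> '));
```

Simple paths map almost one to one; the difficulty discussed in the rest of this article lies in the PDF features that have no direct canvas counterpart.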

PDF → JS Code → HTML5 Canvas: pdf.js

The “holy grail” would be to use JavaScript to directly read a PDF and draw it onto an HTML5 canvas. This would offer a number of benefits:

Vector graphics
Render the PDF directly rather than using an intermediate format (such as images or SVG)
Would not suffer from limitations of the previously outlined techniques
Consistent behaviour across browsers

Building such a system would seem a significant task, but it has in fact been attempted by the Mozilla Foundation in pdf.js. pdf.js is an impressive technical achievement, but close examination leads one to conclude that it unfortunately suffers from many usability and quality issues (read our complete guide to evaluating PDF.js). This is not a reflection of pdf.js per se, but rather a technical limitation that would be inherent in any product that attempted to use JavaScript/HTML5 to render a PDF. Some of the problems we encountered:

1. Accuracy

From the get-go, pdf.js faced issues on the rendering side. For example, standard HTML5 Canvas does not support paths with dashes, the even-odd fill rule, or PDF blend modes. Since Mozilla developers were in control of their own browser, they were able to bandage Firefox with custom extensions (prefixed with moz-… ). Unfortunately, these extensions are not part of the HTML5 standard and are not supported by all browsers, including the dominant mobile browsers. Also, even with all of the custom moz extensions, pdf.js can’t deal with some transparency groups, overprint, some soft masks, non-RGB color spaces, etc. Perhaps one day all browsers will add every extension required to accurately render a PDF; however, the project clearly showed some limitations of implementing a complex graphics system in JS (read our updated guide on PDF.js rendering accuracy).

pdf.js Rendering (left) & Correct Rendering (right)

2. Performance

JavaScript is much slower than native code. Despite using GPU accelerated canvas rendering, viewing PDFs in pdf.js is slower than native viewers/plug-ins that do not use hardware acceleration. Native viewers will always be able to stay one step ahead of JavaScript viewers in terms of performance.

3. Reliability

On mobile browsers, HTML5 PDF viewers do not respond well when they run out of memory: the browser simply exits, i.e. crashes. Because PDF documents can be large and use complex resources, it is not difficult to exceed the limit. (The same issues exist on the desktop, but thanks to large amounts of RAM and virtual memory, they are less critical.) For more information, see our recently published PDF.js reliability benchmark, where we opened 1,663 PDF files in PDF.js.

4. Usability

Because pdf.js uses PDF documents ‘as is,’ it is likely that the documents have not been “linearized,” that is, saved in a format that is streamable over the web. This means that the entire document must be downloaded (and stored in memory) before it can be rendered, leaving the user waiting. Although this issue is not specific to a JavaScript viewer, it is a drawback to using PDF documents that have not been processed for online viewing.
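
A quick way to see whether a given file was saved for web streaming is to look for the linearization dictionary, which the PDF specification places near the very start of a linearized file. The function below is a heuristic sketch, not a validator: it assumes that finding a `/Linearized` key within the first 1024 bytes is good enough, and the name `looksLinearized` is made up for this example.

```javascript
// Heuristic check: does the start of the file contain a linearization dictionary?
// `bytes` is a Uint8Array holding (at least) the beginning of the PDF file.
function looksLinearized(bytes) {
  // Decode the first 1024 bytes one byte per character (Latin-1 style).
  const head = String.fromCharCode(...bytes.subarray(0, 1024));
  return head.includes('%PDF-') && /\/Linearized\b/.test(head);
}

// Two tiny hand-written fragments standing in for real file headers.
const enc = new TextEncoder();
const linearizedHead = enc.encode('%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n');
const plainHead = enc.encode('%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n');

console.log(looksLinearized(linearizedHead));
console.log(looksLinearized(plainHead));
```

Since only the first kilobyte is needed, a server-side check like this can run before deciding whether a document needs to be re-saved for online viewing.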

A Solution: PDF → PDFNet → JS Code → HTML5: WebViewer

What can be done to resolve these shortcomings? The source of the problems is that PDF documents can simply be too big and complicated to be competently handled by a pure JavaScript/HTML5 Canvas solution. So, perhaps with some pre-processing, a PDF can be normalized to a format that can be properly handled by a pure JavaScript/HTML5 Canvas viewer. What needs to be done?

Optimize the file for fast random access loading. This means that any page could be fetched and displayed regardless of which other pages in the document have already been downloaded.
Downsample high resolution images so that they do not consume large amounts of memory, which is a real problem on mobile devices.
Reduce the complexity of a document for accurate and efficient display on mobile devices. This means analyzing a PDF page element-by-element, looking for simplifications and alternate means of representing content that is known to be compatible with HTML5 Canvas. This may also mean rasterizing content that cannot in any way be accurately rendered by an HTML5 Canvas.
Normalize all images to a form that can be natively decoded by a browser
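
As one small illustration of the downsampling step, the sketch below picks target dimensions that keep an image under a pixel budget while preserving its aspect ratio. The budget of two million pixels and the function name are assumptions made for the example; this is not a description of how WebViewer's pre-processing actually chooses sizes.

```javascript
// Compute target dimensions so that an image stays within a pixel budget.
// maxPixels is an assumed, illustrative budget, not a value from the article.
function downsampleSize(width, height, maxPixels = 2000000) {
  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { width, height, scale: 1 }; // already small enough, leave untouched
  }
  // Scale both sides by the same factor so the aspect ratio is preserved.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
    scale,
  };
}

console.log(downsampleSize(1000, 1000)); // under budget: unchanged
console.log(downsampleSize(4000, 3000)); // a 12-megapixel scan: reduced to fit
```

Rounding down guarantees the result never exceeds the budget, which matters when the budget reflects a hard memory limit on a mobile device.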

So, how well does this work? After 3+ years of implementing these optimizations for WebViewer, we are able to say that it works very well. Once the PDF has been optimized for web viewing, all of pdf.js’s shortcomings melt away, and viewing is:

fast
reliable
high-quality
cross-browser
mobile-friendly

These optimized documents have also served as a good basis for implementing PDF features, such as interactive forms and annotations.

Conclusion

Displaying a PDF in a PDF viewer using HTML5 is by no means trivial. What is clear is that for accurate and reliable viewing in a web browser, the PDF needs to be “normalized” to a web-friendly representation. Some normalization methods, such as converting to images, do work, but with limitations. Sophisticated normalization, such as what is done for WebViewer, offers an experience that approaches that of a native PDF viewer.

April 2015 Update

What a difference 18 months makes. Most of the article above holds; however, new technology and an innovative approach have allowed us to provide reliable and correct in-browser PDF rendering without the need to pre-process. (And no, not by using pdf.js; its problems remain.) Check out the newly released WebViewer 2.0, and our post on PDFNetJS.