A Simple Example of Converting PDF to HTML

By Ryan Barr | 2014 May 25

We have received a lot of interest in our new PDF to HTML/EPUB conversion since it was released in PDFNet 6.0. That interest has brought questions about customizing the output, so today I’ll provide a quick demo of converting a PDF to HTML using PDFNet.

In another post, I will go into some of the details particular to PDF to EPUB conversion, but everything in today’s post applies to both HTML and EPUB output.

Furthermore, while PDFNet is available in C/C++, Java, Objective-C, Python, Ruby, PHP, VB and C#, I decided to do this demo in C# because of its popularity. The PDFNet API is consistent enough that you should be able to translate the code to another language easily.

Setup

First, download PDFNet from our download page.

For this demo I downloaded PDFNet for Windows Desktop .NET 4+, but you can just as easily download any of our desktop versions (including Linux and Mac).

After unzipping the download, navigate to the Samples folder and open one of the Visual Studio solutions. I chose Samples_2013.sln.

Once in Visual Studio, right-click the ConvertTestCS2013 project and select Set as Startup Project.

For this demo, we will simulate the following requirements:

  • Convert only odd-numbered pages
  • Target iOS devices
  • High image quality (DPI)
  • Use PNG instead of JPG
  • No HTML hyperlinks to URLs outside of the document

Since we are targeting iOS, a quick look at Apple’s official Safari iOS resource limits shows we should stay within a 3 megapixel (MP) limit per image. (For more info, read the Safari Web Content Guide.) We will also crank up the DPI so the output looks as good as possible on a Retina display.

Code

Here then is the code to accomplish the above.

using (PDFDoc doc = new PDFDoc(inputPath + "newsletter.pdf"))
{
    doc.InitSecurityHandler();

    // remove all even pages (in memory only; nothing is saved back to the source file)
    if (doc.GetPageCount() > 1)
    {
        PageIterator itr = doc.GetPageIterator();
        itr.Next(); // skip first page
        while (itr.HasNext())
        {
            doc.PageRemove(itr); // remove even pages
            itr.Next();
        }
    }

    pdftron.PDF.Convert.HTMLOutputOptions options = new pdftron.PDF.Convert.HTMLOutputOptions();
    options.SetInternalLinks(true);          // keep links within the document
    options.SetExternalLinks(false);         // drop links to outside URLs
    options.SetPreferJPG(false);             // use PNG instead of JPG
    options.SetDPI(300);                     // high image quality
    options.SetMaximumImagePixels(3000000);  // 3 MP limit for iOS
    options.SetSimplifyText(true);           // merge text runs
    options.SetScale(2.0);                   // scale the HTML output
    pdftron.PDF.Convert.ToHtml(doc, outputPath + "newsletter_odd_pages", options);
}

What does all the code above mean?

After initializing the library and opening the document, we first modify the document in memory by removing the even-numbered pages. As long as we do not call PDFDoc.Save(), these changes do not affect the original source file.
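To make the page filter concrete without needing PDFNet at all, here is a minimal sketch that uses a plain list of page labels as a stand-in for the document. The OddPages class and KeepOdd method are hypothetical names for this illustration and are not part of the PDFNet API. The sketch walks the pages from back to front, which is one simple way to make sure that removing a page never shifts a page that still has to be visited.

```csharp
using System;
using System.Collections.Generic;

static class OddPages
{
    // Hypothetical helper: returns a copy of the page list with every
    // even-numbered page (1-based) removed. The input is left untouched,
    // much like the source PDF when PDFDoc.Save() is never called.
    public static List<string> KeepOdd(IEnumerable<string> pages)
    {
        var result = new List<string>(pages);

        // Walk backwards so a removal only affects positions already visited.
        for (int pageNumber = result.Count; pageNumber >= 1; pageNumber--)
        {
            if (pageNumber % 2 == 0)
            {
                result.RemoveAt(pageNumber - 1); // list index is zero-based
            }
        }
        return result;
    }

    static void Main()
    {
        var kept = KeepOdd(new[] { "p1", "p2", "p3", "p4", "p5" });
        Console.WriteLine(string.Join(", ", kept)); // p1, p3, p5
    }
}
```

A five-page document ends up with pages 1, 3 and 5, which is what the converted HTML should contain.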

Tip: There are many more code examples showing how to use PDFNet, available in the downloaded samples and on our forum.

PDF to HTML Options

Now onto the PDF to HTML code.

options.SetInternalLinks(true);
options.SetExternalLinks(false);

The first line enables internal links, so any links within the PDF, such as table of contents entries, are carried over to the HTML. The second line disables any links that would take the reader outside of the document, such as links to another website.

options.SetPreferJPG(false);
options.SetDPI(300);
options.SetMaximumImagePixels(3000000);

Next, we turn off the JPG preference so that images are written as PNG, increase the image DPI to 300, and set a 3 MP limit so as not to overload iOS devices. As a result, PNGs will be generated at 300 DPI, except where that would put the image over 3 MP. In that case, the image will be down-sampled to the highest DPI that keeps it under 3 MP.
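To get a feel for how the DPI setting and the pixel limit interact, here is a small worked example of the arithmetic. It is only a sketch of the rule as described above, assuming the image size is known in inches and that the DPI is rounded down to a whole number. ImageBudget and EffectiveDpi are hypothetical names for this illustration, and PDFNet's internal calculation may differ in its details.

```csharp
using System;

static class ImageBudget
{
    // Hypothetical helper: the highest whole-number DPI, no greater than
    // requestedDpi, at which an image of the given physical size stays
    // within maxPixels.
    public static int EffectiveDpi(double widthInches, double heightInches,
                                   int requestedDpi, long maxPixels)
    {
        double requestedPixels =
            (widthInches * requestedDpi) * (heightInches * requestedDpi);
        if (requestedPixels <= maxPixels)
        {
            return requestedDpi; // already within budget, no down-sampling
        }

        // pixels = width * height * dpi^2, so solve for dpi at the limit.
        double cappedDpi = Math.Sqrt(maxPixels / (widthInches * heightInches));
        return (int)Math.Floor(cappedDpi);
    }

    static void Main()
    {
        // A full US Letter image (8.5 x 11 in) at 300 DPI would be
        // 2550 x 3300 pixels, roughly 8.4 MP, so it has to be capped.
        Console.WriteLine(EffectiveDpi(8.5, 11.0, 300, 3000000)); // 179

        // A 2 x 3 in image at 300 DPI is 600 x 900 pixels (0.54 MP).
        Console.WriteLine(EffectiveDpi(2.0, 3.0, 300, 3000000)); // 300
    }
}
```

In other words, small figures keep the full 300 DPI, while a full-page image would land at roughly 179 DPI under these assumptions.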

options.SetSimplifyText(true);

Here, we enable text optimization. This attempts to merge text runs in the PDF file, to reduce HTML DOM complexity, and reduce HTML file size. This can result in text placement not matching exactly what was in the PDF, but to the human eye it is typically not noticeable, even when viewing the output side by side with the original. On the other hand, it will reduce download, layout, and rendering times.
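For readers wondering what merging text runs looks like, here is a toy illustration. It is not PDFNet's algorithm: TextRun and RunMerger are hypothetical types made up for this sketch, and the rule used (join neighbouring runs that share a font and a baseline) is an assumption chosen to keep the example short. It also shows where the trade-off comes from, because a merged run no longer carries the individual position of each fragment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a positioned piece of text in a PDF page.
sealed class TextRun
{
    public string Text;
    public string Font;
    public double Baseline;

    public TextRun(string text, string font, double baseline)
    {
        Text = text;
        Font = font;
        Baseline = baseline;
    }
}

static class RunMerger
{
    // Joins neighbouring runs that share a font and a baseline, so each
    // line of text can be emitted as one HTML element instead of several.
    public static List<TextRun> Merge(IEnumerable<TextRun> runs)
    {
        var merged = new List<TextRun>();
        foreach (var run in runs)
        {
            TextRun last = merged.Count > 0 ? merged[merged.Count - 1] : null;
            if (last != null && last.Font == run.Font && last.Baseline == run.Baseline)
            {
                last.Text += run.Text; // extend the previous merged run
            }
            else
            {
                // Copy the run so the caller's objects are never modified.
                merged.Add(new TextRun(run.Text, run.Font, run.Baseline));
            }
        }
        return merged;
    }

    static void Main()
    {
        var runs = new List<TextRun>
        {
            new TextRun("Hel", "Arial", 700),
            new TextRun("lo ", "Arial", 700),
            new TextRun("World", "Arial", 700),
            new TextRun("Next line", "Arial", 680),
        };

        foreach (var run in Merge(runs))
        {
            Console.WriteLine(run.Text);
        }
    }
}
```

Four runs become two, "Hello World" and "Next line", which means fewer elements for the browser to lay out.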

options.SetScale(2.0);

Finally, we scale the HTML output so that it is easier to read in the browser, without having to rely on the browser to zoom.

DocPub CLI

For those who prefer command line tools, here is the equivalent conversion using our DocPub command line conversion tool.

docpub.exe -f html --internal_links --prefer_jpg false --dpi 300 --max_image_pixels 3000000 --simplify_text --scale 2.0 input.pdf

To get started, download DocPub for Windows, macOS and Linux.

Conclusion

I hope that you find this information useful, and that you give PDF to HTML conversion a test drive soon. Let us know if you have any feedback; it is greatly appreciated! If you have downloaded our free trial and run into any issues, we offer free trial support. If you have any other questions, you can contact us.
