A Simple Example of Converting PDF to HTML

By Ryan Barr | 2014 May 25

Setup

First, download PDFNet from our download page.

For this demo I downloaded PDFNet for Windows Desktop .NET 4+, but you can just as easily download any of our desktop versions (including Linux and Mac).

After unzipping the download, navigate to the Samples folder and open one of the Visual Studio solutions. I chose Samples_2013.sln.

Once in Visual Studio, right-click the ConvertTestCS2013 project and select Set as Startup Project.

For this demo, we will simulate the following requirements:

Convert only odd-numbered pages
Target iOS devices
High image quality (DPI)
Use PNG instead of JPG
No HTML hyperlinks to URLs outside of the document

Since we are targeting iOS, a quick look at Apple’s official Safari iOS resource limits shows that we want a 3 megapixel (MP) limit on images. (For more info, read the Safari Web Content Guide.) We will also crank up the DPI so the output looks as good as possible on a Retina display.

Code

Here then is the code to accomplish the above.

// Assumes PDFNet.Initialize() has already been called and that inputPath
// and outputPath are defined, as they are in the ConvertTest sample.
using (PDFDoc doc = new PDFDoc(inputPath + "newsletter.pdf"))
{
    doc.InitSecurityHandler();
    // remove all even pages
    if(doc.GetPageCount() > 1)
    {
        PageIterator itr = doc.GetPageIterator();
        itr.Next(); // skip first page
        while (itr.HasNext())
        {
            doc.PageRemove(itr); // remove even pages
            itr.Next();
        }
    }
    pdftron.PDF.Convert.HTMLOutputOptions options = new pdftron.PDF.Convert.HTMLOutputOptions();
    options.SetInternalLinks(true);          // keep links within the document
    options.SetExternalLinks(false);         // drop links that leave the document
    options.SetPreferJPG(false);             // output PNG instead of JPG
    options.SetDPI(300);                     // high image quality
    options.SetMaximumImagePixels(3000000);  // 3 MP limit for iOS
    options.SetSimplifyText(true);           // merge text runs
    options.SetScale(2.0);                   // scale the HTML output
    pdftron.PDF.Convert.ToHtml(doc, outputPath + "newsletter_odd_pages", options);
}

What does all the code above mean?

After initializing the library and opening the document, we first modify the document in memory by removing the even-numbered pages. As long as we do not call PDFDoc.Save(), these changes do not affect the original source file.

Tip: There are many more code examples showing how to use PDFNet, available in the downloaded samples and on our forum.
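To make the page-removal step concrete, here is a minimal, library-free sketch of which pages survive. It models the document as a plain list of page numbers rather than a PDFDoc, and the class and method names (OddPages, Surviving) are made up for this illustration. It walks backwards from the last page, so removing a page never shifts the pages still to be checked; this is a different traversal from the iterator loop in the article's code.

```csharp
using System;
using System.Collections.Generic;

static class OddPages
{
    // Returns the original 1-based page numbers that remain after every
    // even-numbered page is removed from a document with pageCount pages.
    public static List<int> Surviving(int pageCount)
    {
        var pages = new List<int>();
        for (int i = 1; i <= pageCount; i++)
        {
            pages.Add(i);
        }

        // Walk backwards so a removal never changes the index of a page
        // that has not been examined yet.
        for (int i = pageCount; i >= 1; i--)
        {
            if (i % 2 == 0)
            {
                pages.RemoveAt(i - 1);
            }
        }
        return pages;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Surviving(7))); // prints "1, 3, 5, 7"
    }
}
```

A seven-page newsletter therefore ends up as a four-page HTML document containing original pages 1, 3, 5 and 7.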

PDF to HTML Options

Now onto the PDF to HTML code.

options.SetInternalLinks(true);
options.SetExternalLinks(false);

The first line enables internal links, so that links within the PDF, such as a table of contents, are carried over to the HTML. The second line disables any links that would take the reader outside of the document, such as links to another website.

options.SetPreferJPG(false);
options.SetDPI(300);
options.SetMaximumImagePixels(3000000);

Next, we turn on PNG image output and increase the image DPI to 300, but set a 3 MP limit so as not to overload an iOS device. As a result, PNGs will be generated at 300 DPI, except where that would put the image over 3 MP. In that case, the image will be down-sampled to the highest DPI that keeps it under 3 MP.
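The interaction between the DPI setting and the pixel limit can be worked through with a little arithmetic. The sketch below is only an illustration: DpiLimit and EffectiveDpi are hypothetical names, the image is described by its physical size in inches, and the rounding (down to a whole DPI) is an assumption, since the exact rule PDFNet applies is not described here.

```csharp
using System;

static class DpiLimit
{
    // Highest whole-number DPI, capped at requestedDpi, at which an image
    // measuring widthIn x heightIn inches stays at or below maxPixels.
    public static int EffectiveDpi(double widthIn, double heightIn,
                                   int requestedDpi, long maxPixels)
    {
        double pixelsAtRequested =
            (widthIn * requestedDpi) * (heightIn * requestedDpi);
        if (pixelsAtRequested <= maxPixels)
        {
            return requestedDpi; // the limit is not reached
        }

        // pixels = (widthIn * dpi) * (heightIn * dpi), so solve for dpi
        // and round down (assumed rounding, for illustration only).
        return (int)Math.Floor(Math.Sqrt(maxPixels / (widthIn * heightIn)));
    }

    static void Main()
    {
        // 4 x 3 inch image: 1200 x 900 = 1.08 MP, so 300 DPI is kept.
        Console.WriteLine(EffectiveDpi(4.0, 3.0, 300, 3000000));  // prints 300
        // 8.5 x 11 inch image: 2550 x 3300 = 8.4 MP, so the DPI is reduced.
        Console.WriteLine(EffectiveDpi(8.5, 11.0, 300, 3000000)); // prints 179
    }
}
```

Under these assumptions, small images keep the full 300 DPI, while a full-page letter-size image would come out at roughly 179 DPI.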

options.SetSimplifyText(true);

Here, we enable text optimization. This attempts to merge text runs in the PDF file to reduce HTML DOM complexity and HTML file size. Text placement may then not match the PDF exactly, but the difference is typically not noticeable to the human eye, even when viewing the output side by side with the original. In return, download, layout, and rendering times are reduced.

options.SetScale(2.0);

Finally, we scale the HTML output so that it is easier to read in the browser, without having to rely on the browser to zoom.

DocPub CLI

For those who prefer command-line tools, here is how you would get the same output using our DocPub command-line conversion tool.

docpub.exe -f html --internal_links --prefer_jpg false --dpi 300 --max_image_pixels 3000000 --simplify_text --scale 2.0 input.pdf
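If you want to run that command over several files from C#, one option is to launch DocPub as a child process. The following is a rough sketch rather than an official integration: DocPubRunner and its methods are hypothetical names, it assumes docpub.exe can be found on the PATH, and it reuses exactly the flags shown above without interpreting the exit code.

```csharp
using System;
using System.Diagnostics;

static class DocPubRunner
{
    // Builds the same argument string as the command line above,
    // quoting the input path so that paths with spaces survive.
    public static string BuildArguments(string inputPdf)
    {
        return "-f html --internal_links --prefer_jpg false --dpi 300 " +
               "--max_image_pixels 3000000 --simplify_text --scale 2.0 " +
               "\"" + inputPdf + "\"";
    }

    // Runs DocPub on one file and returns the process exit code.
    public static int Convert(string docpubPath, string inputPdf)
    {
        var info = new ProcessStartInfo(docpubPath, BuildArguments(inputPdf))
        {
            UseShellExecute = false
        };
        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    static void Main(string[] args)
    {
        // Assumption: docpub.exe is on the PATH; pass PDF paths as arguments.
        foreach (string pdf in args)
        {
            Console.WriteLine(pdf + " -> exit code " + Convert("docpub.exe", pdf));
        }
    }
}
```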

To get started, download DocPub for Windows, macOS, or Linux.

Conclusion

I hope that you find this information useful, and that you give PDF to HTML conversion a test drive soon. Let us know if you have any feedback; it is greatly appreciated! If you have downloaded our free trial and run into any issues, we offer free trial support. If you have any other questions, you can contact us.
