By Adam Pez | 2022 Jun 15
An organization's documents go through many lifecycle stages: from simple creation to review, collaboration, and revision, and then to storage for long-term re-use. If you're building a digital workflow or commercial application, you'll probably want to equip your users with the most efficient formats to get the job done at each stage you support.
There are many reasons why someone would need to convert from PDF to editable Office formats like DOCX/Word. But the process can be challenging if they don't know where to start.
This blog gives you a quick comparison of PDF and Office use cases, then introduces an easy way to serve users their PDFs in editable formats like Word – by leveraging an accurate PDF-to-Office conversion SDK that supports the entire document lifecycle.
Watch the following video for more info on adding a PDF-to-DOCX/Word API in a Node.js environment. Or just skip to the end of this post to find steps with your platform and language of choice.
The Word DOCX format makes documents easy to edit. However, different versions of Office display documents in different ways, and even the same version of Office can give variable results depending on, for example, whether the fonts specified in the document are available or need to be substituted.
As a result, you can end up with a document that looks quite a bit different from what the author intended. A carefully crafted resume looks great when you write it, but can look disappointing and unprofessional to the recipient. Similarly, a contract may look different to different parties, complicating negotiations.
In contrast, the Portable Document Format (PDF) is an incredible invention that solves the problem of preserving the author’s original intended design when viewed across different devices. It allows documents to be shared between users with a high expectation that the content will look the same to everyone, author and reader alike. PDFs are also great if your workflow requires rich collaboration capabilities on top, such as signing, annotations, or form filling.
You might pause here and ask: if PDF is so wonderful as a “fixed” representation of the original document – then why bother editing it at all?
There are reasons:
If you spend some time searching, you can find components to let you edit a PDF directly. For example, we offer high-quality
Beyond the pain of additional software licensing costs, the disadvantage of a PDF editor is that, while simple editing is possible, complex editing is very demanding.
This is because, when editing PDFs directly, most changes do not reflow automatically; even small changes can have an unexpected impact on the user's ability to get work done. Say you make a change to a single paragraph, moving it one line down, for example. Now you may have to adjust any following paragraphs on the same page. And what happens if your changes push content onto the next page or next column over? Users will need to reflow content manually – and it will be almost impossible or very time consuming for them to recreate the original intended spacing and other formatting.
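To make the reflow problem concrete, here is a minimal Python sketch. It uses the standard library's `textwrap` as a stand-in for a word processor's line breaking. The sample sentence and the 40-character column width are made up for illustration. The point is only that one small insertion near the top of a paragraph shifts the line breaks after it, which is the work a fixed-layout PDF leaves to the user.

```python
import textwrap

WIDTH = 40  # hypothetical column width, in characters

original = (
    "The Supplier shall deliver the goods to the Buyer within thirty days "
    "of the order date. Late delivery entitles the Buyer to a refund."
)
# A small edit near the start of the paragraph.
edited = original.replace("the goods", "the goods and all related documents")


def reflow(text, width=WIDTH):
    """Break text into lines no longer than `width`, as automatic reflow would."""
    return textwrap.wrap(text, width=width)


before = reflow(original)
after = reflow(edited)

# Compare the two layouts line by line to see how far the edit ripples.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(f"lines before: {len(before)}, lines after: {len(after)}")
print(f"indexes of lines whose content changed: {changed}")
```

In a format that reflows, this recalculation happens automatically on every keystroke. In a fixed layout, each changed line is something the user has to move by hand.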
Let’s take a closer look at a couple of cases where users will wish to convert a PDF into editable Word.
The following examples are based on the PDF that can be found at the following address.
There is nothing special about this contract. I could have created an example contract, but I prefer to use a "real" one that someone else made, to prove that the technology works on real-world documents.
Let's imagine that we need to make two changes to the contract.
In Clause 3 we need to remove the list item (iv) as follows:
We could just remove the section in a PDF editor. Some editors are clever enough to know that this is a numbered list, and adjust the numbers, but many tools are not so good, and delete the text but don’t correct the numbers.
List Item (iv) is edited out but the list numbering now needs changes.
To get your list to look as it should, you would then have to edit each line item after the one removed. Apryse’s PDF-to-DOCX conversion, however, produces a Word document that is easy to edit. Just two clicks and the problem is solved.
Two clicks later, the old list item (iv) is gone and Word dynamically renumbers the list.
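The difference between the two outcomes comes down to whether the labels are stored as text or derived from position. The short Python sketch below illustrates that idea only. It is not how Word or any PDF editor is implemented, and the clause text is a placeholder rather than the contract's actual wording.

```python
ROMAN = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]


def render(items):
    """Label each item from its position, the way a structured list does."""
    return [f"({ROMAN[n]}) {text}" for n, text in enumerate(items)]


# Placeholder clause text standing in for the list in Clause 3.
clause_3 = [
    "first obligation", "second obligation", "third obligation",
    "fourth obligation", "fifth obligation", "sixth obligation",
]

# Fixed layout: the labels are already baked into each line of text.
fixed_lines = render(clause_3)
del fixed_lines[3]   # remove item (iv); the later labels are now stale

# Structured list: labels are recomputed from position after the edit.
del clause_3[3]
structured_lines = render(clause_3)

print(fixed_lines)       # numbering jumps from (iii) to (v)
print(structured_lines)  # numbering stays continuous
```

Recovering the list as a real list during conversion, rather than as lines of text that happen to begin with "(iv)", is what allows the second behavior.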
Now let’s look at a second problem in the contract. We need to add a whole new section “Oversight” between sections 8.1 and 8.2. This will mean that 8.2 and all of the later items will need to be renumbered.
A new section needs to be added between sections 8.1 and 8.2
Trying to do this by editing the PDF in Acrobat is extremely difficult and in any event will take a significant amount of time.
On the other hand, editing the contract in Word is easy. Enter a few blank lines after section 8.1, copy a couple of lines from the following section to act as a template, then enter the words you require – and you’re done.
Notice how the following section “Transparency” (above) has been renumbered from 8.2 to 8.3, as have all the later sections, even those several pages later. Word is great at doing that, and Apryse has allowed you to get to the stage where Word can perform its magic in just a few seconds.
The examples in this blog just looked at how Apryse supports accurate list item detection. But there are many more cool things that our embeddable PDF-to-DOCX API can recover – such as headers and footers, tables and annotations. The same conversion module (Structured Output) also works with leading accuracy for PDF to Excel and PDF to PowerPoint.
You can set up the PDF-to-Office SDK module (Structured Output) on any macOS, Linux, or Windows server or desktop environment using your language of choice.
Short answer: Yes!
The technology behind our PDF-to-Office module is the industry benchmark, used by many leading document processing brands and blue-chip companies in their products and enterprise software, and the module is developed and maintained by Apryse.
As a result, you will be able to reconstitute a Word document that looks very similar to the original PDF, with the same number of columns on each page, the same number of lines in each column, the same number of words in each line, and so on, preserving the look and feel of the original copy.
Try our Office SDK today on your PDFs to experience the results. Visit our download center to set up your free SDK trial with your preferred platform and language. Then download the Structured Output Module and visit the documentation.
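Once the SDK and the Structured Output Module are installed, a conversion might look roughly like the Python sketch below. Treat it as an assumption-laden outline rather than official sample code. The `apryse_sdk` package name and the `PDFNet.Init`, `PDFNet.AddResourceSearchPath`, `StructuredOutputModule.IsModuleAvailable`, `Convert.ToWord`, and `PDFNet.Terminate` calls follow the SDK's Python samples as best I recall, so check them against the documentation for your version. The helper names `docx_path_for` and `convert_pdf_to_docx` are made up for this example.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the output path: same folder and name, with a .docx suffix."""
    return str(Path(pdf_path).with_suffix(".docx"))


def convert_pdf_to_docx(pdf_path, license_key, module_dir):
    """Convert one PDF to DOCX and return the path of the file written.

    Assumes the Apryse Python package and the Structured Output module are
    installed; verify the call names below against the SDK documentation.
    """
    # Imported here so this file still loads when the SDK is not installed.
    from apryse_sdk import PDFNet, Convert, StructuredOutputModule

    PDFNet.Init(license_key)
    try:
        # Point the SDK at the folder where the module was unpacked.
        PDFNet.AddResourceSearchPath(module_dir)
        if not StructuredOutputModule.IsModuleAvailable():
            raise RuntimeError(f"Structured Output module not found in {module_dir}")

        out_path = docx_path_for(pdf_path)
        Convert.ToWord(pdf_path, out_path)
        return out_path
    finally:
        # Release SDK resources whether or not the conversion succeeded.
        PDFNet.Terminate()
```

Calling `convert_pdf_to_docx("contract.pdf", your_key, module_folder)` would then be expected to write `contract.docx` next to the input file. The license key and module folder are placeholders for your own values.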
If you have any questions, suggestions, or just want to chat about your requirements – drop us a line.