Transforming PDFs to Office Documents with Apryse and Ruby

By Roger Dunham | 2024 Jan 11

10 min

Introduction

When you need to transfer information between people, the Portable Document Format (PDF) excels for both presentation and archival purposes. It offers an accurate view of your intended document that looks the same to the reader, irrespective of what operating system or hardware they are using. (Sometimes there may be issues with font availability when viewing the document, but Apryse has a solution for that.)

However, situations may arise where you need to transform PDFs into editable and well-structured Office documents. Perhaps a change is needed but the original Word document no longer exists or cannot be found, or the PDF was created from a hard-coded report generator and a minor change like an updated logo is required. Or maybe you just want to use the PDF content as the basis for a new document.

While PDF documents can often be edited directly, non-trivial changes are another matter. Adding more text than will fit in the available space, or changing the items within a numbered list, can be difficult to do and very time-consuming to get looking correct.

Fortunately, the Apryse SDK offers a mechanism for converting PDFs into Office documents. This SDK is available for multiple programming languages including C#, C++, Python, Go, Ruby, and JavaScript. In this article we will:

See why converting a PDF into Office is useful
Try out Apryse sample code for reconstructing a document from PDF
Look at how this functionality can be used within a Ruby application

Why Use Apryse to Convert PDF to Office?

With the advanced document processing features Apryse offers, you can not only convert Office documents to PDF, but also reconstruct those documents back from PDF, maintaining their formatting and structure, using the Structured Output add-on module.

What’s more, with no extra effort or coding, the Structured Output module can reconstruct a Word document from a scanned PDF, provided that the scan quality is good enough. The technology is even clever enough to include Word features such as lists and tables of contents.

For example, if there are rows with numbers in front of them, then these can be interpreted as a numbered list.

Figure 1 – Part of a Word document reconstructed from a scanned PDF. The elements of a numbered list are shown.

This is fantastic from an editing point of view, since if a new list item is added, or an existing one is deleted or moved, then Word deals with the renumbering automatically.

Figure 2 – The same document after the removal of the original item 3.0. Note how the other list items have automatically updated.

Imagine just how much time that will save.

Better still, for many scanned documents the OCR language is automatically detected.

Sample Project for Converting a PDF into an Office Document

To provide Ruby support for the Apryse SDK, a Ruby wrapper is available for Linux and macOS.

In this article, we are going to walk through the steps needed to create the Ruby wrapper, and get the Structured Output module and a license. If you already have those in place, then you may wish to jump ahead to Running the Sample When Everything is Correctly Configured.

While the macOS version is available pre-built, it is necessary to build the Linux wrapper yourself. It is not difficult, and there are a blog post and a video that explain how to do so.

In addition to the actual SDK, the wrapper contains a wealth of examples that illustrate Apryse functionality, not just for converting PDFs into Office documents, but also for viewing, editing, and manipulating documents, and viewing CAD drawings, as well as many other features.

Let’s look at the sample for reconstructing an Office document from a PDF. If you are interested in going the other way, then have a look at the blog post Converting an Office Document into a PDF.

The sample was written using Ubuntu 22.04 running on WSL.

Getting Started

In the setup for this article, the Apryse Ruby wrapper for Linux has been created in a folder called /wrappers-10-6/PDFNetWrappers.

If you follow the instructions for generating the wrapper, then this folder will already contain a folder called PDFNetC (which contains the actual Apryse SDK), plus a Samples folder, and folders that are placeholders for wrappers in Go, Python, and PHP. Most of these can be safely removed, leaving just the PDFNetC and Samples folders.

Figure 3 – The content of the Wrappers folder at the start of this article. Superfluous folders have been removed.

Gotcha! Within the PDFNetC folder there is also a Samples folder. You can combine the Samples folders there, but then you will need to modify the path to the PDFNetRuby library before the code will run, so for now we will leave things as they are.

Open the PDF2OfficeTest folder. This contains subfolders with sample code for Go, PHP, Python, and Ruby. As this article is intended to get you started, we will ignore the superfluous folders and just navigate to the Ruby folder.

Figure 4 – The content of the PDF2OfficeTest sample folder for Ruby

Getting the Structured Output Module

While many of the samples can be run immediately using just the SDK, back-converting PDF to Office requires the add-on module “Structured Output.” This is one of several optional modules that provide additional functionality – others include support for computer-aided design (CAD), data extraction, and optical character recognition (OCR).

If you don’t have the module available when you run the sample, you will get a message that informs you of the problem.

Figure 5 – The error message if you try to run the sample without the Structured Output module being available

There are several ways to install the Structured Output module, so if you have a preferred method feel free to use that rather than the following instructions.

Create a new folder called “temp” in the PDFNetWrappers folder and navigate into it.

Download the Structured Output module using:

wget https://www.pdftron.com/downloads/StructuredOutputLinux.tar.gz

Figure 6 – Typical output when downloading the Structured Output module

Extract the archive using: tar xvzf StructuredOutputLinux.tar.gz

Figure 7 – The folder structure after extracting the archive

Move the files from the extracted Lib/Linux folder into the PDFNetC/Lib folder using:

mv Lib/Linux/* ../PDFNetC/Lib

You can check that this has all worked by navigating to that folder and listing the files.

StructuredOutput and fonts2.pdf should be present along with a folder called tessdata.

Figure 8 – The contents of the Lib folder after correctly setting up the Structured Output module
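If you would rather check this from a script than by eye, the following is a minimal sketch. It assumes that the three names listed above are exactly what should be present, and that tessdata is a folder while the other two are ordinary files. The helper name missing_structured_output_entries is hypothetical and is not part of the Apryse SDK.

```ruby
# Names taken from the article: two entries and one folder that should exist
# in PDFNetC/Lib once the Structured Output module has been installed.
REQUIRED_ENTRIES = ["StructuredOutput", "fonts2.pdf"].freeze
REQUIRED_FOLDERS = ["tessdata"].freeze

# Hypothetical helper (not an SDK call): returns the names of any expected
# entries that are missing from lib_dir. An empty array means all were found.
def missing_structured_output_entries(lib_dir)
  missing_entries = REQUIRED_ENTRIES.reject { |name| File.exist?(File.join(lib_dir, name)) }
  missing_folders = REQUIRED_FOLDERS.reject { |name| File.directory?(File.join(lib_dir, name)) }
  missing_entries + missing_folders
end

# Example call, assuming the script is run from the PDFNetWrappers folder:
#   missing_structured_output_entries("PDFNetC/Lib")
```

An empty result suggests the move worked; anything else names what still needs to be copied across.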

Great, that’s all of the files downloaded correctly. You can now delete the temp folder that you created as it is no longer needed.

Before we go further let’s try running the sample code.

Enter ./RunTest.sh

An error will occur unless you have already entered the license key. The solution to the problem is included in the message.

Figure 9 – The error that will occur if you do not specify a license key

Unlike many of the samples, the Structured Output sample requires either a commercial or trial license key.

Getting an Apryse SDK Trial Key

If you don't already have an Apryse account, go to https://dev.apryse.com and register a new account.

This allows Apryse to grant you a demo license key which will be used with the Apryse SDK to enable demo functionality. If you haven’t already done so, you can download your Apryse Trial key here.

This license key needs to be copied into the file Samples/Licensing/RUBY/LicenseKey.rb.

Figure 10 – The location to copy your license key

Running the Sample When Everything is Correctly Configured

Once you have the library, Structured Output Module, and license, run the sample by calling ./RunTest.sh.

Figure 11 – The output of successful PDF2Office conversions

This time, the code will run and perform six PDF to Office conversions. These conversions are happening entirely within Linux, with no dependency on Microsoft Office.

If you get an error Exit Code 0XFE, this means that the license is incorrect. This can happen if mismatched versions of the Ruby Wrapper and Structured Output module are used.

Reviewing the Output of the Conversions

Before we look at how the conversions work, let’s look at the results. You don’t need Office installed on the machine where the conversion occurs, so feel free to copy the files to a location where you can open them using Office.

Figure 12 – The output folder after the conversion completes, shown via a Windows Explorer window for convenience

For these sample conversions, the same source PDF was converted into Excel, PowerPoint, and Word.

Figure 13 – The original PDF containing a page of text, and an invoice with a table

Not only has an editable Word document been recreated, but the fonts, line breaks, and paragraph breaks in the new document match those in the PDF wherever possible, which is exactly the way that you want it.

Figure 14 – The reconstructed DOCX document, shown within Word

In the same way, the PowerPoint presentation faithfully represents the original file, with each page in the PDF converted into a separate slide.

Figure 15 – The file reconstructed as a PowerPoint presentation

It is easy to imagine how a PDF that was created from a Word document should look when it is reconstructed back into a Word document. What we have done here, though, is take a Word document, convert it to PDF, then convert it back into a PowerPoint presentation. As such, some of the functionality may be a little different, since PowerPoint does not support everything that Word does. Nonetheless, in this example, the result is extremely good.

But how should the PDF look when converted into a spreadsheet? That is a very different format from a Word document. Let’s open the converted file and see.

The first thing that you will note is that it contains two sheets. However, neither sheet contains the text from the first page of the PDF.

Figure 16 – The first sheet reconstructed from the PDF

Figure 17 – The second sheet reconstructed from the PDF

What has happened is that the Structured Output module has correctly identified that there was no tabular data on the first page of the PDF, while finding two separate tables on the second page.

Figure 18 – A detailed view of the second page of the PDF, indicating the two tables that are present

The Structured Output module was developed to produce usable data rather than simply dumping the text into a spreadsheet so that it looks like the PDF. A layout-faithful dump might initially seem correct, but it is not actually very useful.

As such, the default behavior when converting from PDF to Excel is to discard non-table data and to place each identified table onto a separate sheet. That means, in this case, the whole of the first page is discarded. These options can easily be overridden if you need something different, but experience has shown that the default options give the best results for the largest number of users.
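To make that default concrete, here is a small illustrative model in plain Ruby. It does not call the Apryse SDK and says nothing about how the Structured Output module works internally. The block structure, the sheet names, and the sample data are all invented for the illustration.

```ruby
# Illustrative model only: a "page" is an array of blocks, each tagged
# :text or :table. This structure is hypothetical, not an SDK type.
def sheets_from_pages(pages)
  # Keep only the tables, in document order, discarding everything else.
  tables = pages.flat_map { |blocks| blocks.select { |block| block[:type] == :table } }

  # Place each table on its own sheet (sheet names are made up here).
  tables.each_with_index.map do |table, index|
    { name: "Table #{index + 1}", rows: table[:rows] }
  end
end

# Loosely modelled on the sample PDF: a page of text, then a page holding
# two tables. The cell values are invented.
pages = [
  [{ type: :text, content: "A page of paragraphs" }],
  [
    { type: :text,  content: "Invoice heading" },
    { type: :table, rows: [["Item", "Price"], ["Widget", "10.00"]] },
    { type: :table, rows: [["Subtotal", "10.00"]] }
  ]
]

sheets = sheets_from_pages(pages)
puts sheets.map { |sheet| sheet[:name] }.inspect  # prints ["Table 1", "Table 2"]
```

The first page contributes nothing because it holds no table, which matches what the two sheets in the converted workbook show.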

Note: This is not the only way within Apryse to extract tabular data from PDFs.

Interested in extracting data with the Intelligent Document Processing (IDP) add-on? Check out our recent blog for more information about that module.

The effect of using options when converting files can be seen in the three remaining files produced by the sample code. These files contain the output of just a single page, and were created by specifying a page range in the conversion options.

How the Code Works

The sample code has been created to illustrate how easy it is to get started, while giving a hint as to what other options are available.

At its simplest, conversion (after library initialization) takes a single call:

Convert.ToWord($inputPath + "paragraphs_and_tables.pdf", $outputFile)

That’s right, you can convert a PDF to Word with just a single line of code!

This can be extended to include options by using a WordOutputOptions object. In the following example the page range is being specified, but many other options are supported to fine tune the conversion.

$wordOutputOptions = Convert::WordOutputOptions.new()
# Convert only the first page
$wordOutputOptions.SetPages(1, 1)
Convert.ToWord($inputPath + "paragraphs_and_tables.pdf", $outputFile, $wordOutputOptions)
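If the page range comes from user input, it may be worth validating it before handing it to the options object. The following is a sketch of one way to do that. The helper set_page_range is hypothetical. It assumes that pages are numbered from 1 and that the range is inclusive, which is what the "first page" comment in the sample suggests.

```ruby
# Hypothetical guard around the SetPages call shown in the sample.
# `options` is any object that responds to SetPages(first, last), such as
# the WordOutputOptions object from the sample code.
# Assumption: pages are numbered from 1 and the range is inclusive.
def set_page_range(options, first_page, last_page)
  unless first_page.is_a?(Integer) && last_page.is_a?(Integer)
    raise ArgumentError, "page numbers must be integers"
  end
  raise ArgumentError, "pages are numbered from 1" if first_page < 1
  raise ArgumentError, "last page precedes first page" if last_page < first_page

  options.SetPages(first_page, last_page)
  options # returned so that the call can be used inline
end

# Example call with the sample's objects (needs the Apryse wrapper loaded):
#   set_page_range(Convert::WordOutputOptions.new(), 1, 1)
```

Because the helper only relies on SetPages, the same guard should work for any options object that exposes that method.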

Conversion to Excel or PowerPoint is performed in a similar way, either with or without options, simply by using the ToExcel or ToPowerPoint methods.

Convert.ToExcel($inputPath + "paragraphs_and_tables.pdf", $outputFile)

It really is that easy.
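Since the three methods take the same arguments in their simplest form, they can be wrapped in one small helper that picks the method and the output extension from a target name. This is only a sketch: convert_pdf and CONVERSION_TARGETS are hypothetical names, and the converter is passed in as a parameter so that the helper itself has no dependency on the wrapper. In a real script you would pass the Convert class after initializing the library.

```ruby
# Maps a target name to the output extension and the Convert method used in
# the article. The method names come from the sample; the table itself is a
# hypothetical convenience.
CONVERSION_TARGETS = {
  word:       { extension: ".docx", method: :ToWord },
  excel:      { extension: ".xlsx", method: :ToExcel },
  powerpoint: { extension: ".pptx", method: :ToPowerPoint }
}.freeze

# Hypothetical helper. `converter` is any object that responds to ToWord,
# ToExcel and ToPowerPoint with (input_path, output_path) arguments.
# Returns the path of the output file that was requested.
def convert_pdf(converter, input_pdf, output_dir, target)
  spec = CONVERSION_TARGETS.fetch(target) do
    raise ArgumentError, "unknown target: #{target.inspect}"
  end
  output_file = File.join(output_dir, File.basename(input_pdf, ".pdf") + spec[:extension])
  converter.public_send(spec[:method], input_pdf, output_file)
  output_file
end

# Example call with the sample's variables (needs the Apryse wrapper loaded):
#   convert_pdf(Convert, $inputPath + "paragraphs_and_tables.pdf", $outputPath, :powerpoint)
```

Passing the converter in also makes the helper easy to exercise with a stand-in object on a machine where the SDK is not installed.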

Learn how to generate PDFs using Ruby on Rails.

Conclusion

With Apryse’s SDK, developers have an efficient and straightforward method for reconstructing Office documents from PDFs.

Whether you’re creating a document recovery tool, an application for content extraction, or any solution that necessitates the reverse conversion of documents, Apryse gives you the essential tools to achieve this task.

The Ruby wrapper for the SDK allows you to work in a familiar language within either Linux or macOS.

Dive in and try the library’s capabilities, including the ability to customize the reconstruction process to meet your own specific requirements.

If you encounter any issues, remember that you can also reach out to us on Discord for assistance and support.