Font Substitution When Converting from PDF to Office

By Roger Dunham | 2024 May 23

Introduction

The Apryse SDK is a superb document processing API. It supports annotation, page manipulation, redaction, digital signatures and handling of CAD files, as well as many other features. It also allows you to convert from Office to PDF without the need for MS Office (or anything similar) to be installed, and supports converting from PDF back into Office.

In this article we will look at the causes of, and solutions for, a common problem when converting from PDF to Office – that the fonts in the generated document are not the same as those in the original.

Font changes can also occur when converting from Office to PDF, and even when viewing PDFs on machines where specific fonts are not installed. In both those cases, Apryse has a solution.

Read about understanding font substitution when converting to PDF, and how Apryse handles missing fonts in WebViewer.

Why Fonts Matter

By default, when a PDF is converted into an Office document, it can only use the fonts available on the machine where the conversion occurs. This means that if the PDF uses fonts that are not available, then the generated document will contain a font that differs from the original PDF.

The differences might be trivial, but they can be much more serious.

The Apryse SDK tries to replace a missing font with something that has the same character widths, but on a machine with limited fonts that may not be possible.

When the original and replacement fonts have different character widths, the number of characters that fit on a line may change. That alters where line breaks occur, which in turn changes the text on a given page as some of it flows onto the next page.

This has the potential to cause chaos – if the document refers to text on a specific line (or page), font substitution could result in different text being at that position. In that case, someone viewing the original PDF will see one thing, and someone seeing the reconstructed Word document will potentially see something else.
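To make the reflow effect concrete, here is a minimal sketch of the arithmetic. It assumes a fixed average character width for each font and ignores word wrapping, kerning and hyphenation. The function names and all of the numbers are hypothetical, chosen only for illustration.

```python
import math


def chars_per_line(line_width_pt: float, avg_char_width_pt: float) -> int:
    """Whole characters that fit on one line at the given average width."""
    return int(line_width_pt // avg_char_width_pt)


def pages_needed(total_chars: int, line_width_pt: float,
                 avg_char_width_pt: float, lines_per_page: int) -> int:
    """Pages needed for a run of text, ignoring word wrapping."""
    per_line = chars_per_line(line_width_pt, avg_char_width_pt)
    lines = math.ceil(total_chars / per_line)
    return math.ceil(lines / lines_per_page)


# Hypothetical values: 3000 characters, a 450 pt text width, 36 lines per page.
original = pages_needed(3000, 450.0, 5.0, 36)    # original font, 5.0 pt average
substitute = pages_needed(3000, 450.0, 5.5, 36)  # wider substitute, 5.5 pt average
print(original, substitute)  # prints "1 2"
```

In this example a 10% increase in average character width is enough to push the text from one page onto a second.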

It’s more complex than this. In fact, when converting from PDF to Office to transfer information, there are two places where substitution can occur:

When the PDF is converted into an Office document
When the Office document is displayed to the user

This article is primarily about converting the PDF into an Office document, but since the way the converted document is displayed can vary, it is worth understanding what can happen there.

What Can Go Wrong when an Office Document is Displayed?

Let’s consider a file that was created on a machine where a font called Jeepers is installed.

On that machine the DOCX file looks exactly as we expect.

Figure 1 – A Word document shown on a machine where the font Jeepers is installed

However, if that DOCX file is emailed to someone who opens it on a machine where Jeepers (or any other required font) is not available, Word will quietly use a font that is available.

Figure 2 – The exact same document shown on a machine where the font Jeepers is not installed. Word has quietly replaced it with Calibri.

Does this matter? Well, it might. Presumably the document author chose the specific font for a reason, and now the document looks different – potentially in an unknown way – from how the author intended.

The same issue can occur when displaying PowerPoint presentations. While it's fairly easy to identify what happened in Word, it is much harder to identify within PowerPoint.

Ensuring the End-User Sees the Intended Font

There are five ways to prevent the user seeing a substituted font when they view the Word (or PowerPoint) document:

The document creator should only use common fonts which are installed everywhere.
Ensure the document recipient installs any required uncommon fonts on their machine.
Embed the font into the Word document.
Cheat and get the user to view the DOCX using WebViewer.
Cheat and share the document as a PDF, and get the user to view it using WebViewer. Here it can still be edited, if necessary.

The Problem with Using Only Common Fonts

Even common fonts are not necessarily that common. The default fonts installed on a Windows machine differ from those on a macOS or Linux machine.

There is also a reason that hundreds of extra fonts exist – people want to customize the look and feel of their documents. Insisting that users only have access to a few fonts will not be popular.

The Problem with Installing Fonts Everywhere

While this might be a practical solution within a company where there is close control over hardware, installing “corporate fonts” to ensure that in-house documents have the same look on all machines is simply not feasible on a wider scale. Installing thousands of fonts takes time and uses memory. Furthermore, since many fonts have commercial licenses that require them to be paid for, this could also get very expensive.

The Problem with Embedding Fonts

Word allows fonts to be embedded – either the entire font, or just those characters that are used. Embedding the entire font can significantly increase the size of the DOCX file, which might be a problem. Embedding only the characters that are actually used results in less of an increase in file size – but can cause issues if you edit the document. If you try to use a character that’s not in the original document and not embedded, the glyph for that character won’t be available for the specified font.

In any event, not all fonts can be embedded due to licensing restrictions – so this is a partial solution at best.
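The subsetting trade-off described above can be modeled in a few lines. This is only a conceptual sketch: it treats an embedded subset as the set of non-whitespace characters used in the document, which is a simplification of how real font subsetting works, and the function names are hypothetical.

```python
from typing import List, Set


def embedded_subset(document_text: str) -> Set[str]:
    """Characters whose glyphs would be kept when only used characters are embedded."""
    return {ch for ch in document_text if not ch.isspace()}


def glyphs_missing_after_edit(subset: Set[str], edited_text: str) -> List[str]:
    """Characters in the edited text that have no glyph in the embedded subset."""
    needed = {ch for ch in edited_text if not ch.isspace()}
    return sorted(needed - subset)


subset = embedded_subset("Quarterly report")
print(glyphs_missing_after_edit(subset, "Quarterly report 2024"))  # ['0', '2', '4']
```

Here, adding a year to the title needs three digits that were never embedded, so they could not be drawn in the embedded font.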

Using WebViewer for Viewing and Editing DOCX and PDFs

WebViewer runs in the browser and has no external dependencies. Document viewing and editing occurs entirely within the browser, so your potentially confidential documents are not sent to some off-site server for processing.

WebViewer does, by default, use locally installed fonts, so there can be issues if a font is not available locally. However, Apryse has a great solution for this. Read about the Self-serve Web Fonts system.

Font Substitution when Converting from PDF to Office Using Apryse Structured Output

This is a particular issue when converting files on a machine that has few fonts installed. That is likely on some Windows Server installations, as well as many Linux distros.

As an illustration of the problem, let’s look at a PDF generated from a Word document on a machine where the fonts Jeepers and Tinos are installed.

On the original machine, the PDF looks just like the Word document it was created from.

Figure 3 – The PDF created from the DOCX on a machine where Jeepers and Tinos are installed. The PDF is very similar to the Word document, as expected.

If we converted the PDF back into a Word document on the original machine (where the fonts are installed), the reconstructed Word document would look very similar to the PDF (and the original Word document).

However, converting the PDF back into a Word document on a different machine, where those two fonts are not installed, is a problem. When Apryse finds a font that is not present, it will substitute something that is available. In this case Jeepers has been substituted with Bookman Old Style, and Tinos with Times New Roman.

We can see this by selecting the first line and checking which font Word reports for it.

Figure 4 – The result of converting the PDF to Office on a machine where the fonts Jeepers and Tinos are not installed. In both cases, those fonts have been swapped for an available font.

This is not just a display issue. If the font Jeepers is subsequently installed, the Word document will still not use it. As far as the document is concerned, the correct font is Bookman Old Style.

Installing the font before converting the file would result in it being used. This is a potential solution, but we have already seen that installing many fonts may be impractical.
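Before converting on a server, it can help to check which of the fonts a PDF is expected to use are actually installed. The sketch below assumes a Linux machine with fontconfig's fc-list command available. The helper names are hypothetical, the list of required fonts must be supplied by you, and none of this is part of the Apryse SDK.

```python
import subprocess
from typing import Iterable, List, Set


def parse_fc_list(output: str) -> Set[str]:
    """Parse `fc-list : family` output into a set of lower-cased family names."""
    families: Set[str] = set()
    for line in output.splitlines():
        # A single line can list several comma-separated names for one font.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name.lower())
    return families


def installed_families() -> Set[str]:
    """Ask fontconfig for the installed font families (assumes fc-list is on PATH)."""
    result = subprocess.run(["fc-list", ":", "family"],
                            capture_output=True, text=True, check=True)
    return parse_fc_list(result.stdout)


def missing_fonts(required: Iterable[str], installed: Set[str]) -> List[str]:
    """Required fonts that are not installed, compared case-insensitively."""
    return sorted(name for name in required if name.lower() not in installed)
```

For example, `missing_fonts(["Jeepers", "Tinos"], installed_families())` would list whichever of those two fonts is likely to be substituted on that machine.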

How Does Apryse Structured Output Solve this Problem?

Structured Output supports the use of a fonts database file.

If such a file exists in the same folder as StructuredOutput.exe, and it is called fonts2.pdf, the font database will be used to provide the fonts for PDF to Office conversion.

Figure 5 – The PDF converted with a fonts2.pdf font database. Note that the font name is correct (even if it renders incorrectly).

Now when the DOCX file is opened, the names of the fonts will be correct and no substitution will have occurred. That’s exactly what we wanted.

If you open the file on a machine where Jeepers and Tinos are missing, Word will still silently substitute those fonts for display, as we saw earlier. When the file is opened on a machine where the fonts are present, the result will be correct.

When trying out Structured Output as a proof of concept, this can look like a serious problem. In reality it isn’t. While you may wish to convert files on a machine that has few fonts installed, like a Linux server, you rarely want to view the generated files on such a machine. On most users’ machines, the DOCX file will render correctly, as they likely do have the fonts installed.

It is worth noting that, if the fonts2.pdf file exists, Structured Output uses ONLY the fonts within it. The file is not a fallback that is consulted when a font is missing locally, and locally installed fonts are not used as a fallback either. As such, a font can be installed locally but will not be used if it is not in the fonts2.pdf file.
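The "database only" rule can be summarized as a small model. This is a sketch of the behavior as described in this article, not Structured Output's actual implementation. The function names are hypothetical, and `database_fonts=None` stands for "no fonts2.pdf file is present".

```python
from typing import Optional, Set


def usable_fonts(local_fonts: Set[str],
                 database_fonts: Optional[Set[str]]) -> Set[str]:
    """Fonts the conversion may use under the rule described above."""
    if database_fonts is not None:
        # A font database exists: only its fonts count, with no local fallback.
        return set(database_fonts)
    # No font database: only locally installed fonts are available.
    return set(local_fonts)


def is_usable(font: str, local_fonts: Set[str],
              database_fonts: Optional[Set[str]]) -> bool:
    """True if the named font could be used for the conversion."""
    return font in usable_fonts(local_fonts, database_fonts)


local = {"Jeepers", "Calibri"}
print(is_usable("Jeepers", local, None))       # True: no database, local font is used
print(is_usable("Jeepers", local, {"Tinos"}))  # False: installed, but not in the database
```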

If you wish, you can use this to your advantage. For example, if you use different versions of the fonts database file for different users, you can control which fonts are available for each user. That would not be possible if fonts had to be installed locally.

Alternatively, you could ensure that all converted documents use a corporate font, irrespective of what fonts were used in the PDF. For example, in the following image, the only font in the database was Courier New Italic – and as a result, all of the text has been converted into that font.

Figure 6 – The test PDF converted to DOCX using a fonts2.pdf font database that contained just the Courier New Italic font

Creating a Font Database

Apryse has built a tool that creates a font database from all, or a subset of, the fonts that are locally available.

Read more about generating a fonts database.

Note that the tool creates a file called fonts.pdf. To use the database, copy that file into the same folder as StructuredOutput.exe and rename it to fonts2.pdf.
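The copy-and-rename step can be scripted. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and both paths are assumptions that you would need to replace with the real locations on your machine.

```python
import shutil
from pathlib import Path


def install_font_database(generated_db: Path, structured_output_dir: Path) -> Path:
    """Copy the generated fonts.pdf next to StructuredOutput.exe as fonts2.pdf."""
    if not generated_db.is_file():
        raise FileNotFoundError(f"Font database not found: {generated_db}")
    if not structured_output_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {structured_output_dir}")
    target = structured_output_dir / "fonts2.pdf"
    # Copy rather than move, so the generated fonts.pdf is kept as well.
    shutil.copyfile(generated_db, target)
    return target
```

Calling it with the path to the generated fonts.pdf and the folder that holds StructuredOutput.exe returns the path of the new fonts2.pdf. An existing fonts2.pdf in that folder would be overwritten.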

Embedding Fonts in the Reconstructed Office Document

We saw earlier how fonts can potentially be embedded within an Office document. The Structured Output team considered implementing this for Word, but it soon became clear that for a real-world document containing many pages with many fonts, the size of the DOCX files would become impractical if multiple full fonts were embedded. Plus, sub-setting fonts would not be helpful if the document needed to be edited.

However, PowerPoint presentations are generally only viewed (or have just one or two words edited). As such, embedding fonts is a potential solution. If you are interested in this option, then please contact support via Discord.

Conclusion

Font substitution is a complex subject, and one that we get many questions about. The Apryse team offers good solutions whether you are viewing documents, converting from Office to PDF, or from PDF to Office.

The Apryse SDK doesn’t stop there – it offers a wealth of superb document processing technology.

When you are ready to get started with PDF to Office, see the documentation for the SDK to begin quickly.

The process of reconstructing PDFs to Office is complicated – we’ve been at it for more than 25 years and we are not finished yet. If you would like to ask a question about it, reach out to us on Discord.