By Roger Dunham | 2023 Dec 06
This article highlights the vulnerability of programs, especially those handling user-provided PDFs, to security threats. Because PDF is a ubiquitous and trusted format, vulnerabilities in rendering libraries such as PDFium (used in Google Chrome as well as other software) pose significant risks to billions of users.
The blog delves into the subtle threat of Access Violations, explaining their causes (uninitialized pointers, dangling pointers, and buffer overflows) using pirate analogies. It further explores how Apryse uses tools like WinDbg and GFlags to detect and mitigate such vulnerabilities, helping to keep its software secure.
Any program that works with user-provided input is exposed to the risk that the input has been crafted maliciously in an attempt to compromise security. The internet is a hostile place.
Working with user-provided PDFs is no different. PDFs are a phenomenally widely used format, with trillions of them in existence. Their ubiquity also means that users tend to trust them, which is hardly surprising when PDFs are used for bank statements, invoices, pay-slips, and many other documents that we see on a daily basis.
If you couple the ubiquity of PDFs with the fact that Google Chrome has 60% of the worldwide browser market, and that Chrome uses the PDFium library to render PDFs, then a bug in PDFium has the potential to affect billions of users. This makes PDFium a library of interest to cyber-criminals looking for bugs that may allow cyber-attacks.
Managing PDF Security Risks: Quality Maintenance Against Access Violations
In the realm of cybersecurity, the subtlest vulnerabilities can sometimes become the most dangerous. One such inconspicuous threat is the Access Violation (also known as a Segmentation Fault, or SegFault).
This article explores what Access Violations are, some of their underlying causes, and why they pose a substantial security threat.
At its core, a segmentation fault occurs when a program tries to access a region of memory that it should not be accessing. Many modern operating systems contain mechanisms to protect each process by isolating its memory space, preventing one application from interfering with the memory of another. An Access Violation signals that a program has tried to cross this protective barrier, and the bug behind it may, in less fortunate circumstances, allow unauthorized access to memory regions.
Three common causes of Access Violations are uninitialized pointers, dangling pointers, and buffer overflows.
Since this could be a rather dull subject, we will use pirate-based analogies to illustrate it.
There are lots of great technical explanations of what a pointer is, but for this article let’s just consider it to be an item on a pirate’s shopping list.
The pirate adds something to the list, gets that item, and having got it, they can refer to the item from the list (fire the 3rd cannon on the port side), and when they are finished with the item they take it off the list.
Before we move on, I should say that the example code isn’t written in any particular coding language; it’s just meant to illustrate the idea. As such, the code is more what you would call guidelines than actual rules.
With that proviso, let’s add something to the list:
Figure 2 - An analogy for a pointer. The item on the list can be referred to and used, but only after it has actually been created. Adding it to the list is not enough.
Which would be done in code as:
Cannon* c;
This means that we have created a location in memory where the Cannon object can be referred to. At the moment that memory is potentially just garbage.
Next, we need to actually instantiate the object that the pointer is referring to:
c = new Cannon();
So now the pointer is pointing at an actual cannon object, not just an effectively random piece of memory.
(In reality, if we were pirates, we would do that lots of times, creating a new pointer for each cannon, so that we had lots of shiny new cannons, and we could refer to each one using its own pointer, but for now let’s assume that we only have one.)
Then, when we want to use the cannon, we can call the “fire” function:
c->fire();
And we would expect the cannon to fire, sending a cannonball through the air, causing the other ship to surrender and leading to a wealth of plunder.
Figure 3 - A properly constructed cannon, working exactly as it should.
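To make those steps concrete, here is a minimal C++ sketch of the same lifecycle. The Cannon type, its fire function, and the fireOneCannon helper are all made up for this illustration, so treat it as one possible rendering of the idea and not as code from any real library.

```cpp
#include <string>

// Hypothetical Cannon type, used only to illustrate the analogy.
struct Cannon {
    int shotsFired = 0;

    // "Firing" records the shot and reports what happened.
    std::string fire() {
        ++shotsFired;
        return "Boom!";
    }
};

// Walks through the steps: add to the list, create the item, use it, remove it.
std::string fireOneCannon() {
    Cannon* c = nullptr;             // on the list, but no cannon exists yet
    c = new Cannon();                // now the pointer refers to a real cannon
    std::string result = c->fire();  // safe: the cannon has been created
    delete c;                        // finished with it, so release the memory
    c = nullptr;                     // and take it off the list
    return result;
}
```

Calling fireOneCannon() returns "Boom!". The pointer starts as nullptr here so that it never holds garbage, which is a slightly more cautious choice than the bare declaration in the pseudocode.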
So far so good. But what would happen if we hadn’t initialised the cannon, and our code instead looked like this?
Cannon* c;
c->fire();
In this case the pointer doesn’t actually point at the thing that we intended – the memory is still just garbage. What actually happens depends a little on the operating system, but one possibility is that whatever was located at the place in memory that the pointer referred to, might do something (we’ll talk more about that in a minute). But most likely, the program will just crash.
Figure 4 - If you try to use a pointer that has not been initialised then your program is likely to crash.
Crashing programs are more than just an inconvenience. Programs crashing in response to a specific file could be used to initiate a Denial of Service (DoS) attack, potentially shutting down a website.
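One common defence, sketched below in C++, is to make sure that a pointer always holds a known "nothing here" value until the object exists, and to check for that value before use. The Cannon type and the tryFire helper are hypothetical names invented for this example.

```cpp
// Hypothetical Cannon type, used only to illustrate the analogy.
struct Cannon {
    bool fired = false;

    void fire() { fired = true; }
};

// Fires the cannon only if one actually exists.
// Returns false, instead of crashing, when the pointer is null.
bool tryFire(Cannon* c) {
    if (c == nullptr) {
        return false;  // nothing on the other end of the pointer
    }
    c->fire();
    return true;
}
```

This only helps if the pointer was set to nullptr in the first place. A pointer that was never initialised holds an arbitrary value, which no check of this kind can reliably recognise.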
A very similar issue can occur when a pointer is created, correctly initialised, and then freed (telling the operating system that the memory used by the object is no longer needed).
Freeing pointers is good practice since it stops memory leaks, and leaks on board a pirate ship are always a bad thing.
As an example, the following code allocates exactly the amount of memory needed for a treasure chest object, creates the treasure chest, which is then exchanged for rum, and finally the memory is tidied up.
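chest* ptrChest = (chest *)malloc(sizeof(chest));  // reserve memory for the chest
rum = swapForRum(ptrChest);                        // trade the chest for rum
free(ptrChest);                                    // hand the memory back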
As soon as we call free, the pointer no longer refers to a valid treasure chest object.
However, the pointer is still pointing at the place in memory where the treasure chest was located. And because it takes some time for the memory to be used for a different object, it is very likely that the object still exists, at least for a little while.
As such, attempting to get more rum will probably work, even though the memory has been freed.
rum = swapForRum(ptrChest)
But at some point, the memory will get recycled, and then anything could happen; the memory that the pointer refers to might contain a different treasure chest, or a cannon, parrot food, or maybe nothing at all.
That might be a major problem if you’ve promised the crew a barrel of rum for their hard work, and instead you give them parrot food. It’s the kind of thing that might cause you to find yourself cast adrift on a small boat, with only your parrots for company.
Figure 5 - If you try to access a pointer after it has been freed then your crew may end up very unhappy.
From a cyber-security point of view this kind of issue might just crash the program (which can lead to a DoS attack), but it might also allow data to be read, or written, when that should not be possible. That potentially could lead to private information being exposed, or even the running of code that is not part of the program (leading to the machine becoming part of a botnet, or allowing hackers access to the entire network, or the installation of ransomware).
And that kind of bug can be very difficult to find – the code may behave correctly on many occasions, then fail when something entirely unrelated occurs, simply because only at that point is the memory reused for a different object.
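A simple habit that reduces this risk is to clear the pointer at the moment the memory is freed, so that any later use can be detected. The C++ sketch below assumes a made-up Chest type, an invented exchange rate, and hypothetical helpers makeChest and releaseChest. It is one way of expressing the idea, not a complete cure, because other copies of the pointer elsewhere in a program would still dangle.

```cpp
#include <cstdlib>

// Hypothetical treasure chest, for illustration only.
struct Chest {
    int goldCoins;
};

// Allocates a chest with malloc, mirroring the pseudocode above.
Chest* makeChest(int goldCoins) {
    Chest* chest = static_cast<Chest*>(std::malloc(sizeof(Chest)));
    if (chest != nullptr) {
        chest->goldCoins = goldCoins;
    }
    return chest;
}

// Returns the barrels of rum obtained, or 0 when there is no chest to trade.
int swapForRum(const Chest* chest) {
    if (chest == nullptr) {
        return 0;
    }
    return chest->goldCoins / 10;  // assumed rate: 10 coins per barrel
}

// Frees the chest AND clears the caller's pointer so that it cannot dangle.
void releaseChest(Chest*& chest) {
    std::free(chest);
    chest = nullptr;
}
```

With this version, a second call to swapForRum after releaseChest returns 0 every time, so the behaviour no longer depends on whether the memory happens to have been recycled yet.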
An interesting mathematical problem is calculating the number of cannon balls in a pyramid.
Figure 6 - An analogy of a buffer that is the correct size. The number of cannon balls fits perfectly in the frame.
The total number of cannon balls in a square frame can be exactly calculated using just the length of the sides – for a pyramid of side length 6, the answer is 91.
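The calculation is the sum of the squares of each layer, 1² + 2² + ... + n², which has the closed form n(n + 1)(2n + 1) / 6. As a quick check, here is a small C++ sketch; the function name pyramidCount is our own choice.

```cpp
// Number of cannon balls in a square pyramid whose base has `side` balls
// along each edge: 1^2 + 2^2 + ... + side^2 = side(side + 1)(2*side + 1) / 6.
constexpr long long pyramidCount(long long side) {
    return side * (side + 1) * (2 * side + 1) / 6;
}

// Checked at compile time: 6 * 7 * 13 / 6 = 91.
static_assert(pyramidCount(6) == 91, "a pyramid of side length 6 holds 91 cannon balls");
```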
But what happens if you could specify not just the side length, but also the number of cannon balls that had to fit, perhaps saying that there were 95?
If you were very careful you might manage to balance all of the balls so that they stacked.
Figure 7 - An analogy of a buffer overrun - there are more cannon balls than can fit into the pyramid, so the heap is unstable.
Sometime very soon though, some of those cannon balls are going to fall off the stack and go running down the deck.
Figure 8 - An analogy for a buffer overflow. The cannon balls didn't fit into the heap, and may now cause all kinds of problems.
A buffer overflow is exactly the same kind of problem – the amount of data doesn’t fit into the space that has been reserved for it, and therefore gets written into parts of memory where it shouldn’t be.
And this is a big problem. At the very least we have lost some data. But potentially the extra data was malicious and had been designed to deliberately overwrite parts of the memory, possibly including the area that was storing the actual program. When that happens, the program may then do things that weren’t intended by the developer, such as connecting to a hacker’s remote server.
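The usual defence is to compare the amount of incoming data with the space that was reserved before copying anything. The C++ sketch below shows the idea with a hypothetical storeInFrame function, where the frame plays the part of the buffer and the cannon balls are bytes.

```cpp
#include <algorithm>
#include <vector>

// Copies the cannon balls into the frame only if they all fit.
// Returns false, and leaves the frame untouched, when there are too many.
bool storeInFrame(std::vector<unsigned char>& frame,
                  const std::vector<unsigned char>& balls) {
    if (balls.size() > frame.size()) {
        return false;  // would overflow: refuse rather than spill onto the deck
    }
    std::copy(balls.begin(), balls.end(), frame.begin());
    return true;
}
```

A frame sized for 91 cannon balls accepts 91 and rejects 95. The important design choice is that the size of the destination, not a number supplied along with the data, decides how much gets written.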
One of the ways that this bug can occur is when a PDF contains an image in JBig2 format.
JBig2 was created more than 20 years ago, and is a very effective algorithm for compressing fax images. However, its specification (in simplified terms) allows the image size to be specified by width, height and total length.
This ability to specify an image length that is greater than it should be was used to enable a zero-click attack on iPhones in 2021, which started the process of installing the Pegasus spyware onto the phone without the user’s knowledge.
Two useful tools are available within Windows (and similar ones exist for Linux and macOS). These are WinDbg and GFlags.
WinDbg is a low-level debugger that can be used to execute a program, potentially with command line arguments. The output is logged, along with the stack trace and register details if an error occurs.
GFlags allows the way that Windows behaves when executing a program to be tweaked. While it has many options, the one that is of interest in this article is “Enable Page Heap”.
Figure 9 - The UI of GFlags. In this case options are being specified for a program called ChConTest, and the Enable Page Heap option has been set.
The page heap option results in a specific pattern of bytes being added at the end of each heap allocation, and these patterns are then examined when the allocations are freed. There should be no difference in these patterns, so if there is, then something is wrong.
In addition, with full-page heap verification, an inaccessible page is added at the end of each allocation so that the program stops immediately if it accesses memory beyond the allocation. One issue with this from a practical point of view is that because full-page heap verification uses a full page of memory for each allocation, its widespread use can cause system memory shortages.
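The fill-pattern check can be pictured with the C++ sketch below. It illustrates the idea only and is not how Windows implements it: the GuardedBuffer type, the guard byte value, and the guard size are all arbitrary choices made for this example.

```cpp
#include <cstddef>
#include <vector>

// Arbitrary values chosen for this sketch; they are not the ones Windows uses.
constexpr unsigned char kGuardByte = 0xAB;
constexpr std::size_t kGuardSize = 8;

// The usable bytes, followed by a run of guard bytes.
struct GuardedBuffer {
    std::size_t usableSize;
    std::vector<unsigned char> bytes;
};

// Reserves the requested space plus a guard region filled with the pattern.
GuardedBuffer allocateGuarded(std::size_t usableSize) {
    GuardedBuffer buffer{usableSize,
                         std::vector<unsigned char>(usableSize + kGuardSize, 0)};
    for (std::size_t i = usableSize; i < buffer.bytes.size(); ++i) {
        buffer.bytes[i] = kGuardByte;
    }
    return buffer;
}

// Checked when the buffer is released: true if the pattern is untouched.
bool guardIsIntact(const GuardedBuffer& buffer) {
    for (std::size_t i = buffer.usableSize; i < buffer.bytes.size(); ++i) {
        if (buffer.bytes[i] != kGuardByte) {
            return false;  // something wrote past the end of the usable area
        }
    }
    return true;
}
```

Note that a check like this only finds the damage when the buffer is released, which may be long after the faulty write. That delay is what the inaccessible page in full-page heap verification avoids, at the cost of extra memory.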
By using these tools together, it is possible to run an application, have WinDbg check that memory allocations are correct and, if not, log the stack trace and register values, and terminate the program.
This mechanism can then be automated, allowing the process to work with a specific version of the software under test, and a range of PDFs, parsing the generated log file after each conversion looking for reported errors.
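The log-parsing step could look something like the C++ sketch below. The findReportedErrors function is hypothetical, and the marker strings are assumptions about what the debugger writes; they would need to be adjusted to match the wording in real log files.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns every line of the log that mentions one of the error markers.
// The marker strings are assumed wording, not a definitive list.
std::vector<std::string> findReportedErrors(const std::string& logText) {
    const std::vector<std::string> markers = {
        "Access violation", "Stack overflow", "Buffer overrun"};

    std::vector<std::string> hits;
    std::istringstream lines(logText);
    std::string line;
    while (std::getline(lines, line)) {
        for (const std::string& marker : markers) {
            if (line.find(marker) != std::string::npos) {
                hits.push_back(line);
                break;  // record each line once, even if it matches twice
            }
        }
    }
    return hits;
}
```

An automated run would call this once per converted PDF and flag the file whenever the returned list is not empty.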
That all sounds rather complex, so let’s look at an example of this being used.
The xPDF library is a widely used open-source library which was used in some versions of iOS and macOS.
The 2021 zero-click exploit involved a carefully designed PDF that contained a malicious JBig2 image.
We can test the issue by using WinDbg to run the tool PdfToPng (which is also based on xPDF) to open the problem PDF.
In version 4.0.4, in which the bug has been fixed, an error is reported.
Figure 10 - The output when converting a deliberately corrupt PDF in the latest version of xPDF.
That’s as expected – the PDF is genuinely invalid.
However, repeating the test with version 4.0.3 of xPDF, which contains the bug, throws an Access Violation when the same PDF is processed.
Figure 11 - The Access Violation that occurs when using WinDbg to run a program that opens a maliciously corrupt file in xPDF. You can see the name of the function – readTextRegionSeg, in which the error occurs, as well as the values of the Registers at that time.
We can use the information from the log files to track down the problem function as the first step to fixing the error.
Apryse recognize that Access Violations are a potentially significant risk to cyber-security. One way that we are working to minimise the risk is to use the tools described in this article to test the behavior of the Apryse SDK when working with a collection of known problem PDFs.
Before each release, the Release Candidate version of the software is used to try to convert these PDFs, and the resulting log files are checked for the presence of Access Violations, buffer overruns and stack overflows.
Figure 12 - The number of Access Violation and other issues found using WinDbg and GFlags when converting PDFs to Word with different versions of the Apryse software.
This has allowed us to identify that bugs exist, when they were fixed, and crucially, that they stay fixed.
The tools described in this article offer one method for reducing the risk of Access Violations occurring with software.
While pirates in the movies can be endearing and charming, modern-day pirates (cyber-criminals) lack that charm. They don’t care who they hurt, or whether it destroys your business, or exposes confidential medical information, or costs tens of millions of dollars or your career.
The precautions taken by Apryse reduce the risk of exposure to these threats, while providing great functionality. Head over to apryse.com for more information.