By Aman Kumar, Valerie Yates | 2022 Apr 27
Apryse’s AI platform offers superior PDF table detection and extraction compared to products developed by Adobe, Amazon, and Google when tested on industry-standard, benchmark datasets.
We recently updated Apryse’s artificial intelligence platform, which uses deep learning methods to help companies extract complex tables in PDFs accurately into multiple formats.
Naturally, we care a lot about the tool’s performance – most of all, because in the world of content extraction, accuracy trumps all. We wanted to create a tool that would be a table extraction game changer in the data accuracy department. Yielding nearly perfect accuracy on key parameters, this tool would make intensive manual review a thing of the past.
Therefore, to assess how far we’ve progressed towards our vision, we needed to know how our updated platform fares against titans in the document recognition space, such as Adobe’s AI systems, Amazon Textract, and Google Table Parser.
We’ve packaged our experiment highlights in the following section, but read the full blog to get the whole story. If you’re hungry for details, including all the output per evaluated system, don’t hesitate to reach out to our deep learning team. We’d be more than happy to share our results and answer any questions about our methodologies.
In a nutshell:
→ Apryse SDK offers the highest accuracy when it comes to correctly detecting tables in PDFs. In our experiment, we came out on top on overall accuracy results at nearly 98%.
→ Apryse SDK is significantly more accurate in recognizing the contents of tables – the trickiest aspect of table recognition. We got an overall score of nearly 94%, which outscored our competitors by a margin of nearly 3%.
You can experience Apryse’s AI-powered table recognition and extraction for yourself by visiting our online demo.
Does a 3% difference in overall accuracy between two systems matter that much in the end? We think it does – data is useful only if it tells an accurate and complete story.
When dirty data happens, teams spend a lot of time manually correcting outputs. If not caught and corrected, errors flow into downstream data analysis, further compounding errors. And if you’re dealing with thousands of documents and tables each day, correcting what seems to be a small margin of error upfront translates to a huge burden of manual correction further down the workstream.
A 3% difference equals a huge amount of time saved for teams over the course of a year – time saved they can then spend on more meaningful work, like data analysis and not data entry.
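To put that margin in rough numbers, here is a minimal back-of-the-envelope sketch. All of the figures in it (document volume, cells per document, seconds per fix) are hypothetical assumptions chosen for illustration, not measurements from our experiment:

```python
# Hypothetical workload: what a 3% content-accuracy gap can mean at scale.
DOCS_PER_DAY = 1000      # documents processed daily (assumed)
CELLS_PER_DOC = 200      # table cells per document (assumed)
SECONDS_PER_FIX = 10     # manual review time per bad cell (assumed)

def daily_fix_hours(accuracy):
    """Hours of manual correction per day at a given cell-level accuracy."""
    bad_cells = DOCS_PER_DAY * CELLS_PER_DOC * (1 - accuracy)
    return bad_cells * SECONDS_PER_FIX / 3600

gap = daily_fix_hours(0.91) - daily_fix_hours(0.94)
print(f"Extra manual review at 91% vs 94% accuracy: {gap:.0f} hours/day")
# Roughly 17 extra hours per day under these assumed volumes.
```

Even under modest assumed volumes, a few percentage points of accuracy compound into weeks of correction work over a year.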
And now let’s explore PDF tables and the experiment we ran last fall.
Tables are a great way to represent information in a structured form. Countless online PDFs contain valuable tabular data, and knowledge workers across many fields need to unlock this data for processing and analysis.
Accurate PDF table detection, table structure recognition, and table data extraction are essential across many fields.
However, the two pivotal problems in the domain of table understanding and extraction are:

1. Detecting table boundaries
2. Recognizing internal table structure

Modern OCR- and algorithm-based systems are adequate at detecting table boundaries (#1). Where they fall short is recognizing internal table structure (#2). They stumble when table layouts are heterogeneous, tables sit side by side or span pages, rulings are absent, and cells are unruly: think of misaligned content in cells, empty cells, cells with multi-line content, and spanning cells. A single cell can span several cells vertically or horizontally, and spanning cells can even cross multiple pages.
Traditional approaches just don’t get the contextual meaning of the contents, which is vital to accurate and meaningful extraction. So, we’ve been working on a better way.
We ran our experiment from August to November 2021. It consisted of two separate parts:
We analyzed how Apryse, Adobe, Amazon, and Google did at correctly recognizing table boundaries and table contents.
We picked two sample tables to capture and demonstrate differences in system performance on common extraction challenges, such as merged cells and spanning cells.
This test was designed to capture the overall accuracy of the four tools on a large body of tables. We ran the systems against three standard public datasets and an in-house, private dataset, producing overall accuracy scores expressed as percentages.
Again, the scores pertained to accurate recognition of table boundaries and table content.
For information about the four systems we evaluated, visit:
The two sample tables we used represent many of the challenges we find in table and table content recognition, such as spanning cells, misaligned cell boundaries, boundaries that overlap other objects, and headers that are not clearly configured. Extraction systems often encounter problems with these factors, which are common in real-world tables, and we found evidence of this difficulty in our evaluations.
Tested against the first sample table, all systems except for Adobe correctly detected table boundaries. With the second sample table, all four systems correctly detected table boundaries.
Half of the systems had trouble recognizing cell boundaries, column headers, and spanning cells.
The following tables summarize how the four systems fared with the two sample tables.
Now you get to see what inaccurate table content detection actually looks like. First, the image below shows the second sample table from dataset ICDAR-2013 in its original PDF form, without detection boundaries.
And now here’s the output for this table produced by one of the tested systems, with problem areas called out.
We ran the four systems against three standard public datasets and an in-house, private dataset, producing overall accuracy scores expressed as percentages. The standard public datasets were cTDaR-modern, ICDAR, and SciTSR. Research and commercial institutions frequently use these public datasets for unbiased evaluation of their machine learning models.
Our in-house evaluation dataset was generated from multiple sources. We used it to represent thousands of documents collected from our customers and the types of complex tables they typically use for extraction.
To ensure equal footing for each system during comparison, our model had no prior training on our in-house evaluation repository.
A standard metric used by deep learning teams worldwide to assess the success of their image recognition tools is Intersection-over-Union (IoU). IoU measures how accurately a system localizes a detected object.
We used IoU and sometimes an extended version of IoU in our tests to crunch the overall accuracy statistics for each system’s ability to detect table boundaries and then its contents, such as rows, columns, and individual cells.
The image above shows the accurate column boundary in green. The purple box represents an inaccurate prediction of the column’s left and right boundaries: the left purple boundary crosses the word 'Acquisition' and cuts the '$' symbols out of their cells. To score a prediction, we compute the area of overlap between the predicted bounding box and the accurate bounding box, then divide it by the area of their union; the result is the IoU.
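To make the metric concrete, here is a minimal sketch of the IoU computation for two axis-aligned bounding boxes. The boxes are given as (x1, y1, x2, y2) tuples, and the coordinates below are illustrative examples, not values from our evaluation:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # perfect overlap: 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # partial overlap: ≈ 0.333
```

A perfect prediction scores 1.0, a completely disjoint one scores 0.0, and a threshold on IoU (commonly 0.5 or higher) decides whether a detection counts as correct.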
Accurate table detection was a challenge for some systems with some datasets. The common stumbling block proved to be recognition of tabular cell structure (cell boundaries, rows/columns overlapping with text, spanning cells, etc.).
To our delight, Apryse earned top scores for overall accuracy in both table detection and table content recognition, achieving the highest percentage in each case. These scores confirm that our model outperforms the others when results are aggregated across all datasets.
Overall Scores for Table Detection
Our experiment demonstrates that Apryse SDK and its deep learning model performed strongly against solutions from Adobe’s AI systems, Google Table Parser, and Amazon Textract.
As we’ve said earlier, we are proud that the Apryse solution not only keeps pace with commercially successful solutions developed by larger competitors but in fact outperforms them in two key areas:
Apryse accurately recognized table boundaries in all cases in the first part of the experiment. Overall, Apryse is nearly 98% accurate compared to Adobe’s 95%, Google’s 87%, and Amazon’s 95%.
Apryse is the only system that in all cases accurately recognized table structure in the first part of the experiment. Overall, Apryse is nearly 94% accurate compared to Adobe’s 82%, Google’s 72%, and Amazon’s 89%.
Apryse outscored our next most accurate competitor by a margin of nearly 3% when it came to table content recognition – a small but significant percentage that lets your teams crunch data and make decisions with confidence.
We’ve been improving our accuracy by training our model on high-quality, diverse training data. As a result, our evaluation numbers for the metrics and datasets used in this experiment have improved over past results, and the quality of our deep learning model itself has advanced as well.
We’re looking forward to testing Apryse SDK again against the competition in the near future, as we’re confident that we’ll see further improvements in accuracy next time around. We’re encouraged that applications branching from our core system will build on this foundation of accuracy, and we look forward to testing these applications too one day.
To learn more about Apryse’s advanced table detection and extraction system, try our online demo or visit Apryse.
And don’t hesitate to contact us with any questions! Our engineers would be happy to chat with you about your project and requirements, and answer any questions about our technology and how we can help support you in meeting your goals.