Semantic Content Recognition in PDF

By Apryse | 2016 Jan 07

Semantic content recognition is the ability to identify the components of a document by their “class”: that is, whether a given piece of content constitutes a title, subtitle, section, paragraph, word, figure, caption, table, and so on. Despite decades of research, this problem remains open. Available solutions are unreliable and fall far short of what a human reader can do.

At the 2015 PDF Technical Conference, Apryse’s CTO gave a presentation addressing the problem of semantic content recognition in PDF. The presentation gives an overview of the problem itself, why it has been such a hard problem to solve, and how the industry as a whole might organize itself to finally develop solutions that perform with the same accuracy as a person.
