
Converting PDF to PDF/A Using Python

By Isaac Maw | 2025 Mar 05


Summary: PDF/A is the ISO 19005 standard for archiving digital documents. With the PDF/A conversion SDK, you can batch process documents and convert 20+ file types to PDF/A in Python.

Maintaining a secure, reliable archive of documents is part of business operations in a wide variety of industries, from legal and insurance to healthcare, engineering, and finance. Software developers are tasked with finding the solutions that make digital archiving work. With PDF/A conversion in Python, you can bring this archival file format standard into your existing systems.

Historically, paper documents have been stored in filing rooms and storage facilities, supported by complex categorization, filing, indexing, and retention systems. Digital documents have long since eclipsed paper for many business purposes, and digital archiving saves space, cost, and organizational and maintenance effort while making retrieval easier.

As platforms, standards, and operating systems evolve, digital files that aren’t designed for archiving can become unreliable over time. File formats, fonts, and storage media can become obsolete, leaving important documents corrupted or unusable.

For example, using standard PDF files or .DOCX files for archiving may result in challenges such as:

  • Font Embedding: These file types may not embed all fonts used in the document. If a font is not embedded, the document might not display correctly on systems that do not have that font installed.
  • Security Risks: .DOCX files can be vulnerable to macro viruses and other security threats. In addition, metadata, previous versions, and tracked changes may be accessible in .DOCX files, causing security and privacy compliance issues.
  • Long-Term Accessibility: Over time, the software required to open .DOCX files may become obsolete. This could make it difficult to access the documents in the future.

What is PDF/A?


The PDF/A standard is designed to solve these issues. The ‘A’ in PDF/A stands for ‘archival,’ and the format has features designed to preserve documents, including formatting and fonts as well as raster and vector graphics, for long-term storage.

PDF/A is defined by ISO 19005, the standard for an electronic document file format for long-term preservation. By embedding all necessary elements in the file itself, PDF/A ensures that documents render consistently and remain reliable over time.

PDF to PDF/A Conversion


To convert your documents into the PDF/A archival format, our cross-platform PDF/A SDK converts 20+ file formats, including PDF, JPG, HTML, Word, and TIFF, into ISO-compliant PDF/A files that pass veraPDF validation. The SDK can also repair non-compliant PDF/A files, and it supports all PDF/A versions and conformance levels: PDF/A-1A, PDF/A-1B, PDF/A-2A, PDF/A-2B, PDF/A-2U, PDF/A-3A, PDF/A-3B, PDF/A-3U, PDF/A-4, PDF/A-4E, PDF/A-4F.
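When the target conformance level comes from configuration, it can help to turn a label from the list above into the name of the matching SDK constant. The following is a minimal sketch, not part of the SDK: only `PDFACompliance.e_Level2B` appears in the Python sample in this post, and the assumption that the other constants follow the same `e_Level<part><level>` naming should be checked against the SDK's API reference. The helper names `enum_name_for` and `resolve_level` are hypothetical.

```python
# Conformance labels copied from the list in this article.
SUPPORTED_LABELS = (
    "PDF/A-1A", "PDF/A-1B",
    "PDF/A-2A", "PDF/A-2B", "PDF/A-2U",
    "PDF/A-3A", "PDF/A-3B", "PDF/A-3U",
    "PDF/A-4", "PDF/A-4E", "PDF/A-4F",
)


def enum_name_for(label):
    """Return the assumed constant name for a label, e.g. 'PDF/A-2B' -> 'e_Level2B'."""
    normalized = label.strip().upper()
    if normalized not in SUPPORTED_LABELS:
        raise ValueError("Not a listed PDF/A conformance label: %r" % label)
    # Assumption: constants are named e_Level + the part after "PDF/A-".
    return "e_Level" + normalized[len("PDF/A-"):]


def resolve_level(compliance_cls, label):
    """Look the constant up on the given class; None if this build does not define it."""
    return getattr(compliance_cls, enum_name_for(label), None)
```

With the SDK imported, `resolve_level(PDFACompliance, "PDF/A-2B")` would return the same value the sample passes to the constructor; a `None` result signals that the installed build does not expose that level under the assumed name.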

The SDK supports high-volume PDF to PDF/A batch conversion from the command line, or as a development library integrated into an automated document workflow.
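As a rough illustration of the library route, the sketch below walks a folder and converts each PDF in turn. It is an assumption-laden outline rather than the SDK's own batch tool: `plan_batch`, `convert_with_sdk`, and `convert_batch` are hypothetical helper names, the `_pdfa.pdf` output suffix is arbitrary, and the only SDK calls used are the ones that appear in the Python sample in this post (`PDFACompliance(...)`, `SaveAs`, `Destroy`). It also assumes `PDFNetPython` is importable and that `PDFNet.Initialize(...)` and `PDFNet.SetColorManagement()` have already been called, as in that sample.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, pattern="*.pdf"):
    """Return sorted (source, destination) path pairs for every matching file."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [(src, output_dir / (src.stem + "_pdfa.pdf"))
            for src in sorted(input_dir.glob(pattern))]


def convert_with_sdk(src, dst):
    """Convert one file with the same calls the SDK's Python sample uses."""
    # Assumption: PDFNetPython is on the path and PDFNet is already initialized.
    from PDFNetPython import PDFACompliance
    pdf_a = PDFACompliance(True, str(src), None, PDFACompliance.e_Level2B, 0, 0, 10)
    try:
        pdf_a.SaveAs(str(dst), False)
    finally:
        pdf_a.Destroy()


def convert_batch(input_dir, output_dir, convert_one=convert_with_sdk):
    """Convert every planned pair; a failure on one file does not stop the run."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for src, dst in plan_batch(input_dir, output_dir):
        try:
            convert_one(src, dst)
            converted.append(dst)
        except Exception as exc:
            # Record the failure and move on to the next document.
            failed.append((src, str(exc)))
    return converted, failed
```

Passing the converter in as `convert_one` keeps the folder-walking logic separate from the SDK call, so the loop can be exercised with a stand-in function before a license key is configured.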

The process analyzes the content of existing PDF files and performs a sequence of modifications in order to produce a PDF/A compliant document. Features that are not suitable for long-term archiving (such as encryption, obsolete compression schemes, missing fonts, or device-dependent color) are replaced with their PDF/A compliant equivalents. Because the conversion process applies only necessary changes to the source file, the information loss is minimal. Also, because the converter provides a detailed report for each change, it is simple to inspect changes and to determine whether the conversion loss is acceptable.

How to Use the SDK in Python


Check out our documentation guide for all the details on using the PDF/A SDK. We also provide sample code for using the PDF/A conversion SDK in Python.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2023 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------
import site
site.addsitedir("../../../PDFNetC/Lib")
import sys
from PDFNetPython import *
sys.path.append("../../LicenseKey/PYTHON")
from LicenseKey import *
#---------------------------------------------------------------------------------------
# The following sample illustrates how to parse and check if a PDF document meets the
#    PDFA standard, using the PDFACompliance class object. 
#---------------------------------------------------------------------------------------
def PrintResults(pdf_a, filename):
    err_cnt = pdf_a.GetErrorCount()
    if err_cnt == 0:
        print(filename + ": OK.")
        return
    print(filename + " is NOT a valid PDFA.")
    for i in range(err_cnt):
        c = pdf_a.GetError(i)
        str1 = " - e_PDFA " + str(c) + ": " + PDFACompliance.GetPDFAErrorMessage(c) + "."
        num_refs = pdf_a.GetRefObjCount(c)
        if num_refs > 0:
            # List the object numbers collected for this error code.
            objs = ", ".join(str(pdf_a.GetRefObj(c, j)) for j in range(num_refs))
            str1 = str1 + "\n   Objects: " + objs
        print(str1)
    print('')


def main():
    # Relative path to the folder containing the test files.
    input_path = "../../TestFiles/"
    output_path = "../../TestFiles/Output/"
    
    PDFNet.Initialize(LicenseKey)
    PDFNet.SetColorManagement()     # Enable color management (required for PDFA validation).
    
    #-----------------------------------------------------------
    # Example 1: PDF/A Validation
    #-----------------------------------------------------------
    filename = "newsletter.pdf"
    # The max_ref_objs parameter to the PDFACompliance constructor controls the maximum number 
    # of object numbers that are collected for particular error codes. The default value is 10 
    # in order to prevent spam. If you need all the object numbers, pass 0 for max_ref_objs.
    pdf_a = PDFACompliance(False, input_path+filename, None, PDFACompliance.e_Level2B, 0, 0, 10)
    PrintResults(pdf_a, filename)
    pdf_a.Destroy()
    
    #-----------------------------------------------------------
    # Example 2: PDF/A Conversion
    #-----------------------------------------------------------
    filename = "fish.pdf"
    pdf_a = PDFACompliance(True, input_path + filename, None, PDFACompliance.e_Level2B, 0, 0, 10)
    filename = "pdfa.pdf"
    pdf_a.SaveAs(output_path + filename, False)
    pdf_a.Destroy()
    
    # Re-validate the document after the conversion...
    pdf_a = PDFACompliance(False, output_path + filename, None, PDFACompliance.e_Level2B, 0, 0, 10)
    PrintResults(pdf_a, filename)
    pdf_a.Destroy()
	
    PDFNet.Terminate()
    print("PDFACompliance test completed.")


if __name__ == '__main__':
    main()
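
The sample prints its validation results to the console. For an archive pipeline, it may be more convenient to keep the same information as structured data that can be logged or stored next to the file. The sketch below is one possible approach, not an SDK feature: `collect_errors` and `to_json` are hypothetical helpers that rely only on the methods the sample already calls (`GetErrorCount`, `GetError`, `GetRefObjCount`, `GetRefObj`), and the message lookup is passed in as a function so that `PDFACompliance.GetPDFAErrorMessage` can be supplied by the caller.

```python
import json


def collect_errors(pdf_a, message_for):
    """Gather each error code, its message, and the object numbers it refers to."""
    report = []
    for i in range(pdf_a.GetErrorCount()):
        code = pdf_a.GetError(i)
        refs = [pdf_a.GetRefObj(code, j)
                for j in range(pdf_a.GetRefObjCount(code))]
        report.append({
            "code": str(code),
            "message": message_for(code),
            "objects": refs,
        })
    return report


def to_json(filename, report):
    """Serialize the report; default=str guards against non-JSON value types."""
    payload = {"file": filename, "compliant": not report, "errors": report}
    return json.dumps(payload, indent=2, default=str)
```

Inside `main()`, this could be called as `to_json(filename, collect_errors(pdf_a, PDFACompliance.GetPDFAErrorMessage))` before `pdf_a.Destroy()`.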

To get started, try the SDK now or contact our team.


Isaac Maw

Technical Content Creator
