Convert PDF to HTML in Python

3 min read · May 23, 2024

Converting PDF to HTML in Python lets developers automate the conversion and integrate it into their workflows. The result is a web-compatible version of the PDF document that is more accessible and easier to view in a browser.

This article walks through two examples of converting PDF files to HTML in Python:

1. Convert PDF to HTML Simply in Python

2. Convert PDF to HTML with Options in Python (Whether to Embed SVG, Image, etc.)

Python Library for PDF to HTML Conversion

Spire.PDF for Python is a library for working with PDF files. It allows developers to create, read, manipulate, and convert PDF files in Python. To begin, install it with pip:

pip install Spire.PDF

Convert PDF to HTML Simply in Python

The example below converts a PDF to HTML in just a few lines of code: load the PDF file, then save it in HTML format using the SaveToFile() method.

Python code:

from spire.pdf.common import *
from spire.pdf import *

# Load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("file.pdf")

# Save the PDF document to HTML format
pdf.SaveToFile("PdfToHtml.html", FileFormat.HTML)
pdf.Close()
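
To automate this over many files, the same two calls can be wrapped in a helper and applied to every PDF in a folder. The following is a minimal sketch, not part of Spire.PDF: html_target, spire_convert, and convert_folder are hypothetical helper names, and it assumes that PdfDocument and FileFormat are reachable as attributes of the spire.pdf package, as the wildcard imports above suggest.

```python
from pathlib import Path


def html_target(pdf_path, out_dir):
    """Return the output path: same file name as the PDF, with an .html suffix."""
    return Path(out_dir) / (Path(pdf_path).stem + ".html")


def spire_convert(pdf_path, html_path):
    """Convert one PDF using the same calls as the example above."""
    # Imported lazily so the path helpers work even without Spire.PDF installed.
    # Assumption: both names are exposed by the spire.pdf package itself.
    import spire.pdf as spire_pdf

    pdf = spire_pdf.PdfDocument()
    try:
        pdf.LoadFromFile(str(pdf_path))
        pdf.SaveToFile(str(html_path), spire_pdf.FileFormat.HTML)
    finally:
        pdf.Close()  # release the document even if the conversion fails


def convert_folder(in_dir, out_dir, convert=spire_convert):
    """Convert every *.pdf directly inside in_dir; return the target HTML paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    targets = []
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = html_target(pdf_path, out_dir)
        convert(pdf_path, target)
        targets.append(target)
    return targets
```

Calling convert_folder("pdfs", "html") would then produce one HTML file per PDF in the (hypothetical) pdfs folder. The convert argument can be replaced by any callable that takes the same two paths, which also makes the loop easy to test without the library.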

Note: the output carries a red evaluation watermark; a trial license can be requested to remove it.

Convert PDF to HTML with Options in Python (Whether to Embed SVG, Image, etc.)

If additional settings are required during conversion, Spire.PDF for Python provides the SetPdfToHtmlOptions() method of the PdfConvertOptions class to specify the conversion options. Through it, we can choose whether to embed SVG or images, set the maximum number of pages contained in each HTML file, and so on.

The SetPdfToHtmlOptions() method accepts the following parameters:

useEmbeddedSvg (bool): Indicates whether to embed SVG in the HTML file.
useEmbeddedImg (bool): Indicates whether to embed images in the HTML file. (This option only works when useEmbeddedSvg is set to False.)
maxPageOneFile (int): Specifies the maximum number of pages contained per HTML file. (This option only works when useEmbeddedSvg is set to False.)
useHighQualityEmbeddedSvg (bool): Indicates whether to use high-quality embedded SVG in the HTML file. (This option only works when useEmbeddedSvg is set to True.)
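
The dependencies between these parameters can be made explicit with a small helper. This is only a sketch of the rules listed above: effective_options is a hypothetical function, not part of Spire.PDF, and it merely reports which of the four arguments would take effect for a given combination.

```python
def effective_options(useEmbeddedSvg, useEmbeddedImg,
                      maxPageOneFile, useHighQualityEmbeddedSvg):
    """Return only the settings that take effect, per the rules listed above."""
    if useEmbeddedSvg:
        # SVG mode: the image and page-count settings are ignored.
        return {
            "useEmbeddedSvg": True,
            "useHighQualityEmbeddedSvg": bool(useHighQualityEmbeddedSvg),
        }
    # Non-SVG mode: the high-quality SVG setting is ignored.
    return {
        "useEmbeddedSvg": False,
        "useEmbeddedImg": bool(useEmbeddedImg),
        "maxPageOneFile": int(maxPageOneFile),
    }


print(effective_options(False, True, 1, False))
```

For the combination (False, True, 1, False), the helper reports that only useEmbeddedImg and maxPageOneFile matter, since the high-quality SVG flag has no effect once SVG embedding is turned off.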

Python Code:

from spire.pdf.common import *
from spire.pdf import *

# Load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("file.pdf")

# Embed images instead of SVG, with one page per HTML file
pdfToHtmlOptions = pdf.ConvertOptions
pdfToHtmlOptions.SetPdfToHtmlOptions(False, True, 1, False)

# Save the PDF document to HTML format
pdf.SaveToFile("PdfToHtmlWithOptions.html", FileFormat.HTML)
pdf.Close()

Conclusion

With the code samples above, we can convert PDF to HTML in a few lines, or customize the conversion options to control how SVG and images are embedded and how pages are split across HTML files.

Check other PDF conversion and processing features of the Python library:

Spire.PDF for Python Program Guide Content

Spire.PDF for Python is a professional PDF development component that enables developers to create, read, edit…

Written by Andrew Wilson
