Extract Text and Images from PDF with Python

Andrew Wilson
3 min read · Dec 28, 2023

Extracting content from a PDF makes the information in the document available for further analysis and processing, and the extracted text or images can be reused in other projects. To extract text and images from PDFs in Python, you can use the third-party library Spire.PDF for Python. The sections below show how to do it.

Python Library for PDF Extraction

Spire.PDF for Python supports PDF processing functions such as creating, writing, reading and converting PDFs. Before starting, install the library with the following pip command.

pip install Spire.PDF
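
To confirm the install worked before running the examples, a quick check like the one below may help. It is only a sketch: the helper name spire_pdf_installed is hypothetical, and it only looks for the top-level spire package (the import name used in the code in this article), so it cannot tell Spire.PDF apart from other packages that install under the same top-level name.

```python
import importlib.util


def spire_pdf_installed() -> bool:
    """Return True if a top-level 'spire' package can be found (hypothetical helper)."""
    # find_spec returns None when the package is not installed
    return importlib.util.find_spec("spire") is not None


if not spire_pdf_installed():
    print("spire package not found; try: pip install Spire.PDF")
```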

Extract Text from PDF with Python

Spire.PDF for Python provides the PdfPageBase.ExtractText() method to extract text from a PDF page. Based on different needs, you can choose to extract only the text of a page or traverse all pages to extract the text of the entire PDF file.

Python code:

from spire.pdf import *
from spire.pdf.common import *

# Create an instance of the PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Budget.pdf")

# Create a TXT file to save the extracted text
extractedText = open("Output/ExtractedText.txt", "w", encoding="utf-8")

# Iterate through the pages of the document
for i in range(pdf.Pages.Count):
    # Get the page
    page = pdf.Pages.get_Item(i)
    # Extract text from the page
    text = page.ExtractText()
    # Write the text to the text file
    extractedText.write(text + "\n")

extractedText.close()
pdf.Close()
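
Note that open() in the code above fails if the Output folder does not exist yet. As a minimal sketch of one way around that, the helper below creates the folder first and writes each page under its own header. The name save_page_texts and the "--- Page N ---" header are assumptions made for illustration; the helper only expects plain strings, such as the values returned by page.ExtractText().

```python
from pathlib import Path
from typing import Iterable


def save_page_texts(page_texts: Iterable[str], output_path: str) -> int:
    """Write one block of text per page; return the number of pages written."""
    path = Path(output_path)
    # Create the parent folder (for example "Output") if it is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    # The with-statement closes the file even if a write fails
    with path.open("w", encoding="utf-8") as out:
        for count, text in enumerate(page_texts, start=1):
            out.write(f"--- Page {count} ---\n")
            out.write(text + "\n")
    return count
```

With the document loaded as above, it could be called as save_page_texts((pdf.Pages.get_Item(i).ExtractText() for i in range(pdf.Pages.Count)), "Output/ExtractedText.txt").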

Output:

[Image: Extract Text from PDF]

Extract Text from a Rectangular Area in a PDF Page in Python

If you only need the text in a specific area of a PDF page, you can define a rectangle and pass it to the PdfPageBase.ExtractText(RectangleF rectangleF) method.

Python code:

from spire.pdf import *
from spire.pdf.common import *

# Create an object of PdfDocument class
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("Budget.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Extract text from a rectangular area on the page
text = page.ExtractText(RectangleF(90.0, 300.0, 770.0, 100.0))

# Save the extracted text to a text file
extractedText = open("Output/Extracted.txt", "w", encoding="utf-8")
extractedText.write(text)
extractedText.close()
pdf.Close()
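
The four numbers passed to RectangleF above are easy to mix up. Assuming they mean x, y, width and height, measured from the top-left corner of the page (the usual RectangleF convention, but worth checking against the library documentation), the sketch below converts corner coordinates into that form. The helper rect_from_corners is hypothetical and not part of Spire.PDF.

```python
from typing import Tuple


def rect_from_corners(left: float, top: float,
                      right: float, bottom: float) -> Tuple[float, float, float, float]:
    """Convert corner coordinates to an (x, y, width, height) tuple."""
    if right <= left or bottom <= top:
        raise ValueError("right must exceed left and bottom must exceed top")
    return (float(left), float(top), float(right - left), float(bottom - top))


# Under the assumption above, the example area spans x 90 to 860 and y 300 to 400
x, y, width, height = rect_from_corners(90, 300, 860, 400)
print(x, y, width, height)  # 90.0 300.0 770.0 100.0
```

The resulting values could then be passed on as RectangleF(x, y, width, height).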

Output:

[Image: Extract Text from a Rectangular Area in a PDF Page]

Extract Images from PDF in Python

In addition to extracting text, Spire.PDF for Python provides the PdfPageBase.ExtractImages() method to extract images from a PDF page. The following Python code extracts all images in a PDF file and saves them to a specified path.

from spire.pdf import *
from spire.pdf.common import *

# Create an instance of PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Budget.pdf")

# Create a list to store the images
images = []

# Iterate through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Extract the images from the page and store them in the created list
    for img in page.ExtractImages():
        images.append(img)

# Save the images in the list as PNG files, numbered from 1
for i, image in enumerate(images, start=1):
    image.Save("Images/Image-{0:d}.png".format(i), ImageFormat.get_Png())

pdf.Close()
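
The save step above assumes the Images folder already exists, and names such as Image-10.png sort before Image-2.png in most file browsers. The sketch below shows one possible way to prepare the folder and build zero-padded file names. The helper image_output_paths, its defaults and the padding rule are illustrative assumptions, not part of Spire.PDF.

```python
from pathlib import Path
from typing import List


def image_output_paths(count: int, folder: str = "Images",
                       prefix: str = "Image") -> List[Path]:
    """Create the output folder and return `count` zero-padded PNG paths."""
    out_dir = Path(folder)
    # Create the folder (and any parents) if it is missing
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pad to at least two digits so the names sort in numeric order
    width = max(2, len(str(count)))
    return [out_dir / f"{prefix}-{n:0{width}d}.png" for n in range(1, count + 1)]
```

It could be paired with the list built above: for image, path in zip(images, image_output_paths(len(images))), call image.Save(str(path), ImageFormat.get_Png()).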

Output:

[Image: Extract Images from PDF]

For more information about the PDF Python library, see:

https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Program-Guide/Spire.PDF-for-Python-Program-Guide-Content.html
