Extract Tables from PDFs in Python
Extracting tables from a PDF lets you export the table data to editable file formats such as Excel. This makes it easier to analyze, compare, and summarize the data.
This article explains how to extract data from PDF tables using Python. It covers the following two examples:
- Export PDF Table to TXT in Python
- Export PDF Table to Excel in Python
Python Libraries for Extracting PDF Tables
- Spire.PDF for Python: a Python PDF library, used here to extract data from PDF tables.
- Spire.XLS for Python: a Python Excel library, used here to export PDF table data to Excel files.
Before you begin, install both libraries using the following pip commands.
pip install Spire.PDF
pip install Spire.XLS
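To confirm that both packages are available before running the examples, a quick check along these lines may help. This is a minimal sketch: the helper name is_installed is hypothetical, and the module names spire.pdf and spire.xls are taken from the import statements used later in this article.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be located on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example "spire") is missing
        return False


# Module names assumed from the imports used in the examples in this article
for name in ("spire.pdf", "spire.xls"):
    print(name, "installed" if is_installed(name) else "missing")
```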
Export PDF Table to TXT in Python
To complete the task, follow the main steps below:
- Load a PDF file through the LoadFromFile() method of the PdfDocument class.
- Iterate through the pages in the PDF file and extract tables from them using the ExtractTable() method of the PdfTableExtractor class.
- Iterate through the extracted tables and each row and column within them, and then get the cell data using the GetText(rowIndex, columnIndex) method of the PdfTable class.
- Write the PDF table data into a txt file.
Python Code:
from spire.pdf.common import *
from spire.pdf import *

# Load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("table.pdf")

# Create a list to store the extracted data
builder = []

# Create a PdfTableExtractor object
extractor = PdfTableExtractor(pdf)

# Loop through the pages to extract tables within them
for page in range(pdf.Pages.Count):
    tables = extractor.ExtractTable(page)
    # Loop through the tables
    if tables is not None and len(tables) > 0:
        for table in tables:
            row = table.GetRowCount()
            column = table.GetColumnCount()
            # Loop through the rows and columns
            for i in range(row):
                for j in range(column):
                    # Get text from the specific cell
                    text = table.GetText(i, j)
                    # Add the text to the list
                    builder.append(text + " ")
                # End the current row
                builder.append("\n")
            # Separate tables with a blank line
            builder.append("\n")

# Write to a txt file
with open("ExtractPDFTable.txt", "w", encoding="utf-8") as file:
    file.write("".join(builder))
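The example above separates cells with a space, which becomes ambiguous when a cell itself contains spaces. One possible variation is to collect each table into a list of rows and serialize it with Python's standard csv module. The sketch below is an illustration under stated assumptions, not part of the Spire API: table_to_rows and rows_to_csv are hypothetical helper names, and FakeTable is a stand-in that only mimics the GetRowCount(), GetColumnCount() and GetText() methods so the snippet runs without a PDF.

```python
import csv
import io


def table_to_rows(table):
    """Collect cell text from a table-like object into a list of rows."""
    return [
        [table.GetText(i, j) for j in range(table.GetColumnCount())]
        for i in range(table.GetRowCount())
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes cells when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


class FakeTable:
    """Hypothetical stand-in for PdfTable, used only for this demonstration."""

    def __init__(self, cells):
        self._cells = cells

    def GetRowCount(self):
        return len(self._cells)

    def GetColumnCount(self):
        return len(self._cells[0]) if self._cells else 0

    def GetText(self, row, column):
        return self._cells[row][column]


demo = FakeTable([["Name", "Unit price"], ["Widget, large", "4.50"]])
csv_text = rows_to_csv(table_to_rows(demo))
print(csv_text)
```

In the extraction loop, the same table_to_rows helper could be called on each table returned by ExtractTable(), assuming the real table objects expose the three methods used here.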
Export PDF Table to Excel in Python
This task requires the Spire.PDF for Python library in conjunction with the Spire.XLS for Python library. By exporting the PDF table data to an Excel file, you can perform further analysis using the advanced features of Excel.
The main steps for extracting data from PDF tables are the same as in the previous example. The difference is that the previous code writes the table data directly to a TXT file, whereas in this example you also need to do the following:
- Create an Excel workbook and add worksheets to it.
- Write the text obtained from PDF table cells to specific cells in the worksheet through the Worksheet.Range[rowIndex, columnIndex].Value property.
- Save the Excel file using the SaveToFile() method of the Workbook class.
Python Code:
from spire.pdf import *
from spire.xls import *

# Load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("table.pdf")

# Create a Workbook object and remove its default worksheets
workbook = Workbook()
workbook.Worksheets.Clear()

# Create a PdfTableExtractor object
extractor = PdfTableExtractor(pdf)

sheetNumber = 1

# Loop through the pages to extract tables within them
for page in range(pdf.Pages.Count):
    tables = extractor.ExtractTable(page)
    # Loop through the tables
    if tables is not None and len(tables) > 0:
        for table in tables:
            # Add a worksheet for each table
            sheet = workbook.Worksheets.Add(f"sheet{sheetNumber}")
            # Get the row count and column count of the table
            row = table.GetRowCount()
            column = table.GetColumnCount()
            # Loop through the rows and columns
            for i in range(row):
                for j in range(column):
                    # Get text from the specific cell
                    text = table.GetText(i, j)
                    # Write PDF table cell data to the specified cell in the worksheet
                    sheet.Range[i + 1, j + 1].Value = text
            # Auto-fit columns
            sheet.AllocatedRange.AutoFitColumns()
            sheetNumber += 1

# Save the Excel file
workbook.SaveToFile("ExportPDFTableToExcel.xlsx", ExcelVersion.Version2016)
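Note the i + 1 and j + 1 in the code above: the loops count table cells from 0, while the worksheet range is addressed starting from 1. The following sketch makes that mapping explicit and also shows the matching A1-style address. The helper names column_letters and worksheet_cell are hypothetical and use only plain Python, so the snippet runs without either library.

```python
def column_letters(column_number):
    """Convert a 1-based column number to Excel letters (1 -> A, 27 -> AA)."""
    if column_number < 1:
        raise ValueError("column_number must be 1 or greater")
    letters = ""
    while column_number > 0:
        column_number, remainder = divmod(column_number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def worksheet_cell(i, j):
    """Map zero-based table indices (i, j) to a 1-based row, column and A1 address."""
    row, column = i + 1, j + 1
    return row, column, f"{column_letters(column)}{row}"


print(worksheet_cell(0, 0))   # (1, 1, 'A1')
print(worksheet_cell(2, 27))  # (3, 28, 'AB3')
```

So the first table cell, read with GetText(0, 0), lands in worksheet cell A1.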
Conclusion:
Extracting PDF table data with Python can help users increase productivity and automate data processing. This is especially useful for users or organizations that regularly deal with large amounts of data. If you have any questions while using the libraries, visit the forum.