Extract/Read Text from PDF in Java

PDF is a file format that cannot be easily edited, and it is difficult to extract its content directly. This article demonstrates how to extract text from a PDF file with a free Java API (Free Spire.PDF for Java).

Installation
Method 1: Download the free API and unzip it. Then add the Spire.Pdf.jar file to your project as a dependency.

Method 2: If you use Maven, add the following configuration to your pom.xml instead.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>

Java Code

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.FileWriter;
import java.io.IOException;

public class Extract_Text {

    public static void main(String[] args) {

        // Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();
        // Load the PDF file (adjust the path to point at your own file)
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\test.pdf");

        // Collect the text of all pages in a single StringBuilder
        StringBuilder sb = new StringBuilder();

        // Loop through the PDF pages and extract the text of each page
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            PdfPageBase page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        // Write the text to a .txt file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Release the resources held by the document
        doc.close();
    }
}
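
The program above appends the text of every page back to back and writes it with the platform's default character encoding. If you want page boundaries to remain visible in the output, or need predictable encoding for non-ASCII text, the writing step could be handled along the lines of the sketch below. It uses only the Java standard library; the class name TextFileWriterSketch and the helpers joinPages and writeUtf8 are hypothetical names chosen for illustration and are not part of Spire.PDF. The strings in main stand in for the values that page.extractText(true) would return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Spire.PDF
class TextFileWriterSketch {

    // Joins per-page text, putting a marker line before each page
    static String joinPages(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            sb.append("--- Page ").append(i + 1).append(" ---\n");
            sb.append(pageTexts.get(i)).append("\n");
        }
        return sb.toString();
    }

    // Writes the text as UTF-8 regardless of the platform default encoding
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder strings standing in for the text extracted from each page
        List<String> pages = Arrays.asList("First page text", "Second page text");

        // A temporary file is used here; the article writes to ExtractText.txt
        Path out = Files.createTempFile("ExtractText", ".txt");
        writeUtf8(out, joinPages(pages));

        // Read the file back and print it to confirm what was written
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

To combine this with the extraction loop, you would collect each page's text into a List<String> instead of appending it straight to the StringBuilder, then pass that list to joinPages.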

Running the program writes the extracted text to ExtractText.txt in the working directory.
