Read/Extract Text & Images from Word in C#

Andrew Wilson
4 min read · Nov 20, 2024

When processing Word documents, you may occasionally need to extract document content, including text and images, for reuse in other projects, documents, or marketing materials.

Manual extraction is cumbersome and time-consuming, and for large or repetitive jobs, automating it saves considerable effort. In this article, we’ll cover how to programmatically extract text and images from a Word document in C# using a free third-party library.

  • Extract Text from a Specified Paragraph in C#
  • Extract Text from a Word document in C#
  • Extract Images from a Word document in C#

Free .NET Word Library

The free third-party library we need is called Free Spire.Doc for .NET. You can either download the library and manually add a reference to it in your project, or install it directly via NuGet (the package is published as FreeSpire.Doc).

Extract Text from a Specified Paragraph in C#

The Paragraph.Text property retrieves the text content of a specified paragraph. The following steps extract text from a Word paragraph and export it to a .txt file.

  1. Import the necessary namespaces;
  2. Load a Word document through the LoadFromFile() method;
  3. Create a StringBuilder instance to store extracted text;
  4. Access a specified section, and then a specified paragraph in that section (both indexes are zero-based);
  5. Get the text of the paragraph using the Paragraph.Text property;
  6. Append the extracted text to the StringBuilder instance;
  7. Write the text in the StringBuilder instance to a .txt file.

C# code:

using Spire.Doc;
using Spire.Doc.Documents;
using System.Text;
using System.IO;

namespace ExtractParagraphText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load a Word document
            Document doc = new Document();
            doc.LoadFromFile("Roche Limit.docx");

            // Create a StringBuilder instance to store extracted text
            StringBuilder sb = new StringBuilder();

            // Get the first section
            Section section = doc.Sections[0];

            // Get the second paragraph in the section
            Paragraph paragraph = section.Paragraphs[1];

            // Get text from the paragraph and append to the StringBuilder instance
            sb.AppendLine(paragraph.Text);

            // Write to a text file
            File.WriteAllText("ParagraphText.txt", sb.ToString());
        }
    }
}

[Figure: Extract the text of the second paragraph in Word with C#]
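
The example above assumes the document has at least one section and two paragraphs; if it does not, the indexers throw. As a minimal sketch, the lookup can be wrapped in a guard that checks both indexes first. The class name ParagraphTextReader and the method TryGetParagraphText are hypothetical helpers written for this article, not part of Free Spire.Doc; only Sections, Paragraphs, Count and Paragraph.Text come from the library.

```csharp
using Spire.Doc;
using Spire.Doc.Documents;

public static class ParagraphTextReader
{
    // Hypothetical helper (not a Spire.Doc API): returns true and sets 'text'
    // only when both indexes point at an existing section and paragraph.
    public static bool TryGetParagraphText(
        Document doc, int sectionIndex, int paragraphIndex, out string text)
    {
        text = string.Empty;

        // Reject a missing document or an out-of-range section index
        if (doc == null || sectionIndex < 0 || sectionIndex >= doc.Sections.Count)
            return false;

        Section section = doc.Sections[sectionIndex];

        // Reject an out-of-range paragraph index
        if (paragraphIndex < 0 || paragraphIndex >= section.Paragraphs.Count)
            return false;

        text = section.Paragraphs[paragraphIndex].Text;
        return true;
    }
}
```

With this in place, the Main method could call ParagraphTextReader.TryGetParagraphText(doc, 0, 1, out string text) and write the file only when the call returns true.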

Extract Text from a Word document in C#

The free .NET Word library also provides a simple method, Document.GetText(), that retrieves the text content of an entire Word document. The following steps extract text from a Word document and export it to a .txt file.

  1. Import the necessary namespaces;
  2. Load a Word document through the LoadFromFile() method;
  3. Create a StringBuilder instance to store extracted text;
  4. Get the text of the Word document using the Document.GetText() method;
  5. Append the extracted text to the StringBuilder instance;
  6. Write the text in the StringBuilder instance to a .txt file.

C# code:

using Spire.Doc;
using System.Text;
using System.IO;

namespace ExtractWordText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load a Word document
            Document doc = new Document();
            doc.LoadFromFile("Roche Limit.docx");

            // Create a StringBuilder instance to store extracted text
            StringBuilder sb = new StringBuilder();

            // Get text from the Word document
            string text = doc.GetText();

            // Append the extracted text to the StringBuilder instance
            sb.AppendLine(text);

            // Write to a text file
            File.WriteAllText("ExtractWordText.txt", sb.ToString());
        }
    }
}

[Figure: Extract the text of the entire Word document with C#]
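
Document.GetText() returns everything as one string. If you would rather work paragraph by paragraph, for instance to skip blank paragraphs, a rough sketch is to loop over the sections and paragraphs yourself and read Paragraph.Text on each one. DocumentTextCollector and GetNonEmptyParagraphs are hypothetical names made up for this illustration. The sketch only visits the paragraphs exposed through Section.Paragraphs, so it is worth checking whether text in tables, headers or footers matters for your documents.

```csharp
using System.Collections.Generic;
using Spire.Doc;
using Spire.Doc.Documents;

public static class DocumentTextCollector
{
    // Hypothetical helper (not a Spire.Doc API): collects the trimmed text
    // of every paragraph that contains something other than whitespace.
    public static List<string> GetNonEmptyParagraphs(Document doc)
    {
        var result = new List<string>();

        // Walk each section, then each paragraph exposed by that section
        foreach (Section section in doc.Sections)
        {
            foreach (Paragraph paragraph in section.Paragraphs)
            {
                string text = paragraph.Text;

                // Skip paragraphs that are empty or whitespace only
                if (!string.IsNullOrWhiteSpace(text))
                {
                    result.Add(text.Trim());
                }
            }
        }

        return result;
    }
}
```

The returned list can be passed straight to File.WriteAllLines("Paragraphs.txt", ...) to get one paragraph per line in the output file.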

Extract Images from a Word document in C#

To extract images from a Word document, you need to iterate through the child objects of each paragraph and determine whether each one is a DocPicture. If it is, you can save the image out of the document. The following steps extract images from Word and save them to a specified file path.

  1. Import the necessary namespaces;
  2. Load a Word document through the LoadFromFile() method;
  3. Iterate through each section and then each paragraph of each section;
  4. Iterate through each child object of a paragraph;
  5. Determine whether each child object is a DocPicture. If it is, save the image out of the document using the DocPicture.Image.Save(String, ImageFormat) method.

C# code:

using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;
using System;
using System.IO;

namespace ExtractImages
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load a Word document
            Document doc = new Document();
            doc.LoadFromFile("Roche Limit.docx");

            // Make sure the output folder exists; Image.Save() does not create it
            Directory.CreateDirectory("Images");

            int index = 0;

            // Iterate through each section of the document
            foreach (Section section in doc.Sections)
            {
                // Iterate through each paragraph of the section
                foreach (Paragraph paragraph in section.Paragraphs)
                {
                    // Iterate through each child object of the paragraph
                    foreach (DocumentObject docObject in paragraph.ChildObjects)
                    {
                        // Determine whether the child object is a picture
                        if (docObject.DocumentObjectType == DocumentObjectType.Picture)
                        {
                            // If so, save the image out of the document
                            DocPicture picture = docObject as DocPicture;
                            picture.Image.Save(string.Format("Images\\image_{0}.png", index), System.Drawing.Imaging.ImageFormat.Png);
                            index++;
                        }
                    }
                }
            }
        }
    }
}

[Figure: Extract all images from a Word document in C#]
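
The loops above only look at paragraphs that sit directly in a section, so a picture nested deeper (inside a table cell, for example) may be missed. One possible way around this is a breadth-first walk over the whole document tree. The sketch below assumes that Document and other container objects implement ICompositeObject and that children are exposed as IDocumentObject, both from the Spire.Doc.Interface namespace; please verify this against the version of the library you have installed. PictureFinder and FindAllPictures are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using Spire.Doc;
using Spire.Doc.Fields;
using Spire.Doc.Interface;

public static class PictureFinder
{
    // Hypothetical helper (not a Spire.Doc API): walks the document tree
    // breadth-first and returns every DocPicture it encounters.
    public static List<DocPicture> FindAllPictures(Document doc)
    {
        var pictures = new List<DocPicture>();

        // Queue of container objects whose children still need visiting
        var pending = new Queue<ICompositeObject>();
        pending.Enqueue(doc);

        while (pending.Count > 0)
        {
            ICompositeObject node = pending.Dequeue();

            foreach (IDocumentObject child in node.ChildObjects)
            {
                if (child is DocPicture picture)
                {
                    // Found a picture: collect it
                    pictures.Add(picture);
                }
                else if (child is ICompositeObject composite)
                {
                    // Another container (section, body, table, paragraph...):
                    // visit its children later
                    pending.Enqueue(composite);
                }
            }
        }

        return pictures;
    }
}
```

Each item in the returned list can then be saved exactly as in the example above, by calling picture.Image.Save(...) with an indexed file name.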
