How to Convert PDF and DOCX to Structured TXT Formats for RAG: A Tutorial on Unstructured

In this post, we'll explore Unstructured, an open-source Python library for handling unstructured data. It converts file formats such as PDF, DOCX, HTML, and email into structured text, a step that matters for applications such as preparing training data for large language models (LLMs) and retrieval-augmented generation (RAG). Let's look at what Unstructured offers and how you can use it to streamline your data processing.

What is Unstructured?

Unstructured is a library for preprocessing documents in unstructured formats and turning them into structured text. It does not train or query language models itself, but it prepares the data those tasks depend on. Whether you need to train an LLM or build a RAG pipeline, Unstructured can help convert messy source documents into a clean, structured form.

Supported File Formats

Unstructured supports a wide range of file formats, including:

PDF
HTML
Emails
DOCX
PPTX
JSON
XML
And many more

Key Concepts

Elements

The core of Unstructured's functionality revolves around "elements." Elements represent different parts of a document, such as:

Titles
Captions
Paragraphs
Tables
Figures
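To make the idea concrete, here is a minimal sketch of working with a list of elements. The Element dataclass below is a stand-in, not the library's own class; it assumes only that each element carries a category label and a text string. The category names and sample texts are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Element:
    """Stand-in for a document element (hypothetical, not the library's class)."""
    category: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_category(elements):
    """Map each category label to the texts of the elements carrying it."""
    groups = defaultdict(list)
    for element in elements:
        groups[element.category].append(element.text)
    return dict(groups)


sample = [
    Element("Title", "Quarterly Report"),
    Element("NarrativeText", "Revenue grew in the third quarter."),
    Element("Title", "Outlook"),
    Element("NarrativeText", "We expect steady demand."),
]

grouped = group_by_category(sample)
for category, texts in grouped.items():
    print(f"{category}: {len(texts)} element(s)")
```

Grouping by category like this is a quick way to see what a parser found in a document before deciding which element types to keep.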

Data Ingestion and Processing

Unstructured helps in two main areas:

Data Ingestion: Importing data from various file formats into your pipeline.
Data Processing: Preprocessing data through partitioning, cleaning, and chunking.
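The cleaning and chunking steps can be sketched with the standard library alone. Unstructured ships its own cleaning and chunking functions; the code below is a simplified illustration of the idea, not the library's implementation. The function names, the (category, text) tuple layout, and the max_chars default are assumptions made for this example.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_by_heading(elements, max_chars=500):
    """Group (category, text) pairs into chunks.

    A new chunk starts at each "Title" or when adding the next text would
    push the current chunk past max_chars. A single text longer than
    max_chars is kept whole as its own chunk rather than split.
    """
    chunks, current = [], ""
    for category, text in elements:
        text = clean_text(text)
        if not text:
            continue
        too_long = len(current) + len(text) + 1 > max_chars
        if current and (category == "Title" or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks


# Illustrative output of a partitioning step
partitioned = [
    ("Title", "Setup"),
    ("NarrativeText", "Install   the package\n first."),
    ("Title", "Usage"),
    ("NarrativeText", "Call the  partition function."),
]
chunks = chunk_by_heading(partitioned)
for chunk in chunks:
    print(repr(chunk))
```

Starting a new chunk at each title keeps a section's heading together with its body, which tends to give a retriever more self-contained passages than fixed-size splitting.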

Installation

You can install Unstructured using pip. Note that some file formats require additional dependencies.

pip install unstructured

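Formats such as PDF, DOCX, and PPTX need extra dependencies. The all-docs extra installs them for every supported document type (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "unstructured[all-docs]"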
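After installing, a quick check like the one below can confirm which packages are visible to your Python environment. This is a small sketch using only the standard library; the import names docx and pptx are what the python-docx and python-pptx packages install as, and the assumption here is that the DOCX and PPTX extras rely on them.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


# Assumed import names: python-docx installs as "docx", python-pptx as "pptx".
for name in ("unstructured", "docx", "pptx"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module shows as missing, installing the matching extra is usually the fix.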

Step-by-Step Tutorial

We'll demonstrate how to use Unstructured to convert data from three file formats: HTML, PDF, and email.

HTML Parsing

First, let's parse an HTML file.

Install Necessary Libraries (the leading ! runs the command from a notebook cell; drop it in a regular shell):

!pip install unstructured
!pip install pillow

Download and Prepare Files:

Download the HTML file into your working directory.

Parse the HTML File:

from unstructured.partition.html import partition_html

# partition_html is the documented entry point for HTML files;
# it returns the document as a list of elements
html_elements = partition_html(filename='path/to/your/file.html')

# Print the text of each element
for element in html_elements:
    print(element.text)
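Once you have a list of elements, turning it into a structured TXT file is mostly a matter of deciding how each element type should look. The sketch below is one possible layout, not something the library prescribes: it assumes each element exposes category and text attributes, and it uses SimpleNamespace stand-ins so it runs without any parsed file. The "Title" and "ListItem" labels and the output markers are choices made for this example.

```python
from types import SimpleNamespace


def elements_to_text(elements):
    """Render elements as plain text: titles become '# ' headings,
    list items become '- ' bullets, everything else is a paragraph."""
    lines = []
    for element in elements:
        text = element.text.strip()
        if not text:
            continue  # skip empty elements
        if element.category == "Title":
            lines.append(f"# {text}")
        elif element.category == "ListItem":
            lines.append(f"- {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)


# Stand-in elements (hypothetical); real ones would come from a partition call
sample_elements = [
    SimpleNamespace(category="Title", text="Getting Started"),
    SimpleNamespace(category="NarrativeText", text="Install the package. "),
    SimpleNamespace(category="ListItem", text="Step one"),
]

structured = elements_to_text(sample_elements)
print(structured)
```

Writing the result to disk is then a single call such as Path("output.txt").write_text(structured, encoding="utf-8") from the standard pathlib module.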

PDF Parsing

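Next, let's parse a PDF file. The partition_pdf function takes a strategy argument; the two used here are "fast", which extracts the text already embedded in the PDF, and "hi_res", which runs a layout detection model for higher accuracy at the cost of speed and extra dependencies.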

Import and Parse PDF:

from unstructured.partition.pdf import partition_pdf

# Load the PDF document with the high-accuracy (hi_res) strategy
pdf_elements = partition_pdf(filename='path/to/your/file.pdf', strategy='hi_res')

# Print the elements
for element in pdf_elements:
    print(element.text)

Using Fast Strategy:

pdf_elements_fast = partition_pdf('path/to/your/file.pdf', strategy='fast')

for element in pdf_elements_fast:
    print(element.text)
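Because the high-accuracy strategy depends on extra models and can fail where the fast one succeeds, a common pattern is to try one and fall back to the other. The helper below is a hypothetical sketch of that pattern, not part of the library: it takes the partition function as an argument and assumes it accepts filename and strategy keyword arguments. A fake partitioner stands in for the real one so the example runs anywhere.

```python
def partition_with_fallback(partition_fn, filename, strategies=("hi_res", "fast")):
    """Try each strategy in order and return (strategy, elements) for the
    first one that succeeds; re-raise the last error if all of them fail."""
    if not strategies:
        raise ValueError("at least one strategy is required")
    last_error = None
    for strategy in strategies:
        try:
            return strategy, partition_fn(filename=filename, strategy=strategy)
        except Exception as error:  # broad on purpose: missing models, OCR errors
            last_error = error
    raise last_error


# Stand-in partitioner (hypothetical): the "hi_res" strategy fails here
def fake_partition(filename, strategy):
    if strategy == "hi_res":
        raise RuntimeError("layout model not installed")
    return [f"text from {filename}"]


used, elements = partition_with_fallback(fake_partition, "report.pdf")
print(used, elements)
```

With the library installed, you would pass partition_pdf in place of fake_partition. Returning the strategy alongside the elements lets you log which documents needed the fallback.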

Email Parsing

Finally, let's parse an email.

Import and Parse Email:

from unstructured.partition.auto import partition

# Load the email; partition() detects the file type and picks the matching partitioner
email_elements = partition(filename='path/to/your/file.eml')

# Print the email elements
for element in email_elements:
    print(element.text)

Extract Metadata:

for element in email_elements:
    print(element.metadata.to_dict())
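For RAG, it is often useful to store each element's text together with a few metadata fields so retrieved passages can be traced back to their source. The sketch below shows one way to do that. It assumes elements expose text and metadata.to_dict() as in the loop above; the field names filename, subject, and sent_from are assumptions about what an email's metadata contains, and the stand-in objects are hypothetical.

```python
import json
from types import SimpleNamespace


def to_records(elements, keep=("filename", "subject", "sent_from")):
    """Build one dict per element: its text plus the selected metadata fields."""
    records = []
    for element in elements:
        metadata = element.metadata.to_dict()
        record = {"text": element.text}
        # Copy only the requested fields that are actually present
        record.update({key: metadata[key] for key in keep if key in metadata})
        records.append(record)
    return records


# Stand-in element (hypothetical) mimicking text + metadata.to_dict()
fake_metadata = SimpleNamespace(
    to_dict=lambda: {"filename": "file.eml", "subject": "Hello", "languages": ["eng"]}
)
fake_element = SimpleNamespace(text="Hi there", metadata=fake_metadata)

records = to_records([fake_element])
for record in records:
    print(json.dumps(record))  # one JSON object per line (JSONL)
```

Printing one JSON object per line gives a JSONL stream that most vector store loaders and indexing scripts can read directly.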

Conclusion

Unstructured converts a wide range of unstructured file formats into structured text, which is the groundwork for tasks like LLM training and RAG. With it handling preprocessing, you can simplify your data pipeline and spend more of your time building the systems that use the data.

Summary

Introduction to Unstructured: Understanding its utility in AI and data preprocessing.
Key Concepts: Elements, data ingestion, and processing.
Installation: How to install the Unstructured library.
Step-by-Step Tutorial: HTML parsing, PDF parsing, and email parsing.

Conclusion: Summarizing the utility and benefits of Unstructured.

For more details and to access the library, visit the GitHub repository.

Feel free to leave a comment if you have any questions or thoughts on this tutorial. Happy prompting!