Welcome to pdf-text-extractor’s documentation!

PDFTextExtractor

A Python utility for extracting text and images from PDF files. The extracted text includes content from PDF pages and OCR-processed text from images embedded in the PDF. Results are returned as a combined list of dictionaries, preserving the order of appearance.

Features

  • Extract text directly from PDF pages.

  • Extract and OCR-process images embedded in PDFs.

  • Return results in a combined, ordered list of text and image content.

  • Preprocess images to improve OCR accuracy.

Requirements

### Python

  • Python Version: 3.12 or higher

Install the package with pip:

pip install pdf-text-extractor

### Tesseract OCR

  • Tesseract Installation: Install Tesseract OCR and ensure it is accessible via the system’s PATH. Follow the Tesseract Installation Guide for details.
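One way to confirm the PATH requirement from Python is to look the executable up with the standard library. This is a minimal sketch: the helper name tesseract_available is hypothetical and not part of pdf-text-extractor, and it assumes the executable is named tesseract.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable called `binary` is found on PATH."""
    # shutil.which returns the full path, or None when nothing is found.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found at:", shutil.which("tesseract"))
    else:
        print("Tesseract not found; install it and add it to PATH.")
```

Running this check before creating the extractor separates a missing Tesseract installation from a genuine OCR failure.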

Usage

### Import and Initialize

from pdf_text_extractor import PDFTextExtractor

# Provide the PDF file path and image directory
pdf_path = "example.pdf"
image_dir = "output_images"

# Initialize the extractor
extractor = PDFTextExtractor(pdf_path, image_dir)

### Process PDF and Extract Content

# Extract text and image content
results = extractor.process_and_extract_text()

# Display extracted content
for item in results:
    if "text" in item:
        print("PDF Text:", item["text"])
    elif "image_text" in item:
        print("Image Text:", item["image_text"])

### Text and Image Extraction with LLM

The latest version can refine OCR-processed text with a large language model (LLM), for example one served through Ollama. This is intended to improve the accuracy and readability of text extracted from images embedded in the PDF.

# Extract text and image content with LLM refinement for image-based text
results = extractor.process_and_extract_text(use_llm_for_image_text=True)

Output Format

The method process_and_extract_text() returns a list of dictionaries in order of appearance. Each dictionary has either a text key (content extracted directly from a PDF page) or an image_text key (OCR output from an image embedded in the PDF).

Example Output

[
  {
    "text": "This is text from the first page of the PDF."
  },
  {
    "image_text": "Text extracted from an image on the first page."
  },
  {
    "text": "Another page of the PDF with textual content."
  },
  {
    "image_text": "Additional image-based text extracted."
  }
]
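Because every entry carries either text or image_text, the combined list can be separated by source when needed. The sketch below shows one way to do that; split_results is a hypothetical helper name, not part of the library, and the sample data mirrors the example output above.

```python
def split_results(results):
    """Separate page text from OCR text, keeping the original order in each list."""
    page_texts = [item["text"] for item in results if "text" in item]
    image_texts = [item["image_text"] for item in results if "image_text" in item]
    return page_texts, image_texts


# Sample data in the documented output format.
sample = [
    {"text": "This is text from the first page of the PDF."},
    {"image_text": "Text extracted from an image on the first page."},
    {"text": "Another page of the PDF with textual content."},
    {"image_text": "Additional image-based text extracted."},
]

page_texts, image_texts = split_results(sample)
print(len(page_texts), "page text entries;", len(image_texts), "image text entries")
```

Splitting loses the interleaving between page text and image text, so keep the original list if the order of appearance matters.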

How It Works

### Text Extraction

  • Text from PDF pages is extracted using PyMuPDF.

### Image Extraction

  • Embedded images are extracted and saved to the specified directory.

  • Images are preprocessed before OCR.

### Image Preprocessing

  • Convert to Grayscale: Converts the image to grayscale.

  • Enhance Contrast: Increases contrast to make text stand out.

  • Binarization: Uses Otsu’s thresholding to create a binary image.

  • Denoising: Applies Gaussian blur to reduce noise.
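To make the binarization step concrete, here is a dependency-free sketch of Otsu's thresholding on a flat list of 8-bit grayscale values. It illustrates the technique only and is not the library's implementation, which presumably relies on an image-processing library; the function names otsu_threshold and binarize are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    weight_bg = 0      # number of pixels with value <= t
    sum_bg = 0         # sum of those pixel values
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue               # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break                  # all pixels are background from here on
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 255 (white) and the rest to 0 (black)."""
    return [255 if p > threshold else 0 for p in pixels]


# Three dark and three bright pixels: the threshold falls between the groups.
gray = [10, 12, 11, 200, 205, 210]
t = otsu_threshold(gray)
binary = binarize(gray, t)
```

Otsu's method chooses the threshold from the image histogram itself, so no fixed cutoff has to be tuned per document.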

### OCR

  • Preprocessed images are processed with Tesseract OCR to extract text.

Error Handling

  • If an image fails to process, an empty image_text value is added to the results.

    Example:

    {
      "image_text": ""
    }
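Callers that do not want these placeholders can filter them out after extraction. The following is a minimal sketch; drop_empty_image_text is a hypothetical helper, and treating whitespace-only strings as empty is an assumption made here rather than documented behavior.

```python
def drop_empty_image_text(results):
    """Return results without entries whose image_text is empty or whitespace-only."""
    return [
        item
        for item in results
        # Keep page text entries and any image entry that has real content.
        if not ("image_text" in item and not item["image_text"].strip())
    ]


sample = [
    {"text": "Page text."},
    {"image_text": ""},        # image that failed to process
    {"image_text": "   "},     # assumption: also treated as empty
    {"image_text": "OCR text."},
]
cleaned = drop_empty_image_text(sample)
```

Filtering after the fact leaves the extractor's output untouched, so the number of failed images can still be counted from the original list.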
    

Methods

### __init__(pdf_path, image_dir)

Parameters:

  • pdf_path (str): Path to the input PDF file.

  • image_dir (str): Directory to save extracted images.

### process_and_extract_text()

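Description: Processes the PDF to extract page text and OCR text from embedded images.

Parameters:

  • use_llm_for_image_text (bool, optional): When set to True, OCR-processed text from images is refined with an LLM.

Returns:

  • A list of dictionaries, each containing either text or image_text.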

Contribution

Contributions are welcome! If you have suggestions or improvements, please open an issue or submit a pull request.