How to OCR a PDF using the prebuilt read API in Python?

Question

I'm using the Prebuilt Read API in Python to perform OCR on PDF documents from a folder. I can successfully upload and OCR the PDFs, but I'm having trouble downloading the resulting PDFs with the extracted text overlaid onto them. How can I modify my code to download the processed PDFs with the OCR text included? Is there any sample code or method that allows me to do this efficiently?

Answer

@Liam Slade, welcome to the Microsoft Q&A forum, and thank you for posting your query here!

The Prebuilt Read API in Azure AI Document Intelligence is great for extracting text, but it doesn’t directly support overlaying the extracted text onto the original PDFs. However, you can achieve this by combining the OCR results with a PDF manipulation library in Python, such as PyMuPDF or reportlab. Here’s a step-by-step approach to help you:

Extract Text Using Prebuilt Read API: Continue using the Prebuilt Read API to extract text from your PDFs.
Overlay Text on PDF: Use a PDF manipulation library to overlay the extracted text onto the original PDF. Here’s a sample code snippet that I haven't tested on my end, so you may need to debug and adapt it:

import os

import fitz  # PyMuPDF
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Azure Form Recognizer credentials
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

# Initialize the client
client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Function to extract text using Prebuilt Read API
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
    result = poller.result()
    return result

# Function to overlay text on PDF
def overlay_text_on_pdf(pdf_path, result):
    doc = fitz.open(pdf_path)

    for page_num, page in enumerate(doc):
        ocr_page = result.pages[page_num]
        # For PDF input the service reports coordinates in inches;
        # PyMuPDF works in points (72 per inch)
        scale = 72 if ocr_page.unit == "inch" else 1

        for word in ocr_page.words:
            if not word.polygon:
                continue
            # word.polygon is a list of Point(x, y) corners in azure-ai-formrecognizer 3.2+
            x_coords = [point.x * scale for point in word.polygon]
            y_coords = [point.y * scale for point in word.polygon]

            # Size the text to the height of the word's bounding box
            fontsize = max(max(y_coords) - min(y_coords), 1)

            # Insert the word at the bottom-left corner of its bounding box.
            # render_mode=3 keeps the text invisible but searchable;
            # use render_mode=0 if the text should be visible.
            page.insert_text((min(x_coords), max(y_coords)), word.content,
                             fontsize=fontsize, render_mode=3)

    # Prefix only the file name so paths that include folders still work
    folder, name = os.path.split(pdf_path)
    output_path = os.path.join(folder, "output_" + name)
    doc.save(output_path)
    doc.close()
    return output_path

# Example usage
pdf_path = "path_to_your_pdf.pdf"
result = extract_text_from_pdf(pdf_path)
output_pdf = overlay_text_on_pdf(pdf_path, result)
print(f"Processed PDF saved at: {output_pdf}")
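
Two details are worth isolating because they are easy to get wrong: for PDF input, Document Intelligence reports coordinates in inches while PyMuPDF rectangles are in points (72 per inch), and the question involves a whole folder of PDFs rather than a single file. The sketch below is only an illustration under those assumptions. The helper names polygon_to_rect_points and plan_outputs are hypothetical (they are not part of either library), and the polygon is assumed to be a sequence of corners that unpack as (x, y) pairs.

```python
from pathlib import Path

POINTS_PER_INCH = 72  # PDF user space uses 72 points per inch


def polygon_to_rect_points(polygon, unit="inch"):
    """Return (x0, y0, x1, y1) in points enclosing a polygon of (x, y) corners.

    Hypothetical helper: `unit` is assumed to be "inch" for PDF input;
    any other unit (for example "pixel") is passed through unscaled.
    """
    scale = POINTS_PER_INCH if unit == "inch" else 1
    xs = [x * scale for x, _ in polygon]
    ys = [y * scale for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def plan_outputs(input_dir, output_dir):
    """Pair every PDF in input_dir with an "output_"-prefixed path in output_dir."""
    output_dir = Path(output_dir)
    return [
        (pdf, output_dir / f"output_{pdf.name}")
        for pdf in sorted(Path(input_dir).glob("*.pdf"))
    ]
```

The tuple returned by polygon_to_rect_points can be unpacked into fitz.Rect(*rect), and each input path from plan_outputs can be passed to extract_text_from_pdf and overlay_text_on_pdf from the snippet above, saving to the paired output path if you adapt the function to accept one.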


Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
