How to OCR a PDF using the prebuilt read API in Python?

Liam Slade 0 Reputation points
2024-10-03T18:03:43.48+00:00

I'm using the Prebuilt Read API in Python to perform OCR on PDF documents from a folder. I can successfully upload and OCR the PDFs, but I'm having trouble downloading the resulting PDFs with the extracted text overlaid onto them. How can I modify my code to download the processed PDFs with the OCR text included? Is there any sample code or method that allows me to do this efficiently?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. navba-MSFT 24,465 Reputation points Microsoft Employee
    2024-10-04T04:00:45.7733333+00:00

    @Liam Slade Welcome to the Microsoft Q&A forum, and thank you for posting your query here!

    The Prebuilt Read API in Azure AI Document Intelligence is great for extracting text, but the analysis result it returns through the azure-ai-formrecognizer SDK is text and layout data, not a PDF with the text overlaid. However, you can achieve this by combining the OCR results with a PDF manipulation library in Python, such as PyMuPDF or ReportLab. Here’s a step-by-step approach to help you:

    1. Extract Text Using Prebuilt Read API: Continue using the Prebuilt Read API to extract text from your PDFs.
    2. Overlay Text on PDF: Use a PDF manipulation library to overlay the extracted text onto the original PDF. Here’s a sample code snippet, which I haven't tested on my end, so you may need to debug and adapt it:
    import os

    import fitz  # PyMuPDF
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Azure Form Recognizer credentials
    endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
    key = "YOUR_FORM_RECOGNIZER_KEY"

    # Initialize the client
    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # The Read API reports PDF coordinates in inches; PyMuPDF works in points
    POINTS_PER_INCH = 72

    # Function to extract text using Prebuilt Read API
    def extract_text_from_pdf(pdf_path):
        with open(pdf_path, "rb") as f:
            poller = client.begin_analyze_document("prebuilt-read", document=f)
            result = poller.result()
        return result

    # Function to overlay text on PDF
    def overlay_text_on_pdf(pdf_path, result):
        doc = fitz.open(pdf_path)

        # Pair each PDF page with the OCR result for the same page
        for page, ocr_page in zip(doc, result.pages):
            for word in ocr_page.words:
                # word.polygon is a list of Point(x, y) corners (azure-ai-formrecognizer 3.2+)
                x_coords = [p.x * POINTS_PER_INCH for p in word.polygon]
                y_coords = [p.y * POINTS_PER_INCH for p in word.polygon]

                # Create a rectangular bounding box that encloses the polygon
                rect = fitz.Rect(min(x_coords), min(y_coords), max(x_coords), max(y_coords))

                # Write the word starting at the bottom-left corner of its box.
                # insert_textbox writes nothing when the text does not fit, so
                # insert_text is used with a font size derived from the box height.
                # render_mode=3 keeps the text invisible but searchable;
                # remove it to draw visible text.
                fontsize = max(rect.height * 0.8, 1)
                page.insert_text(rect.bl, word.content, fontsize=fontsize, render_mode=3)

        # Save next to the input file, even when pdf_path includes a folder
        folder, name = os.path.split(pdf_path)
        output_path = os.path.join(folder, "output_" + name)
        doc.save(output_path)
        doc.close()
        return output_path

    # Example usage
    pdf_path = "path_to_your_pdf.pdf"
    result = extract_text_from_pdf(pdf_path)
    output_pdf = overlay_text_on_pdf(pdf_path, result)
    print(f"Processed PDF saved at: {output_pdf}")
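
One detail to watch in the overlay step is units. The Read API reports word coordinates in the unit of the analyzed page (inches for PDFs, pixels for images), while PyMuPDF positions content in points (1/72 inch). Below is a minimal sketch of a unit-independent conversion that scales by the ratio of the two page sizes. The helper name polygon_to_rect and its parameters are made up for illustration and belong to neither library, and the sketch assumes an unrotated page:

```python
def polygon_to_rect(polygon, ocr_size, pdf_size):
    """Convert an OCR polygon to an (x0, y0, x1, y1) box in PDF points.

    polygon:  iterable of (x, y) corner pairs in the OCR page's unit
    ocr_size: (width, height) of the page as reported by the OCR result
    pdf_size: (width, height) of the same page in points
    """
    scale_x = pdf_size[0] / ocr_size[0]
    scale_y = pdf_size[1] / ocr_size[1]
    xs = [x * scale_x for x, _ in polygon]
    ys = [y * scale_y for _, y in polygon]
    # Enclose the (possibly skewed) polygon in an axis-aligned box
    return (min(xs), min(ys), max(xs), max(ys))


# A word on a US Letter page: 8.5 x 11 inches in the OCR result,
# 612 x 792 points in the PDF
box = polygon_to_rect(
    [(1.0, 1.0), (2.0, 1.0), (2.0, 1.5), (1.0, 1.5)],
    ocr_size=(8.5, 11.0),
    pdf_size=(612.0, 792.0),
)
print(box)
```

With the real objects, and assuming a recent azure-ai-formrecognizer version, the inputs would be [(p.x, p.y) for p in word.polygon], (ocr_page.width, ocr_page.height) and (page.rect.width, page.rect.height), and the returned tuple can be passed to fitz.Rect(*box).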
    
    

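Since the question mentions processing every PDF in a folder, a short sketch of the batch bookkeeping may help. It only plans input and output paths, so it runs without Azure or PyMuPDF. The function names, the "_ocr" suffix and the folder names are assumptions for illustration. Writing results to a separate folder keeps the originals untouched and avoids picking up already processed files on a second run:

```python
from pathlib import Path


def output_path_for(pdf_path, output_dir):
    """Return where the processed copy of pdf_path should be saved."""
    pdf_path = Path(pdf_path)
    return Path(output_dir) / f"{pdf_path.stem}_ocr{pdf_path.suffix}"


def plan_batch(input_dir, output_dir):
    """Pair every PDF in input_dir with its output path, in name order."""
    pdfs = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(p, output_path_for(p, output_dir)) for p in pdfs]


# Hypothetical paths, only to show the naming scheme
print(output_path_for("scans/invoice.pdf", "processed"))
```

Each (source, target) pair from plan_batch can then go through the extract and overlay steps, with the overlay function adapted to save to the target path. Create the output folder first, for example with Path(output_dir).mkdir(parents=True, exist_ok=True).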

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
