Bio Bots

Bridging Medical Texts and AI

Created on 7th June 2025


The problem Bio Bots solves

PDF/Image to Text Converter with Table Extraction
A comprehensive Python tool that converts PDFs and images to text with advanced table extraction, OCR, and AI enhancement using Groq models.
✨ Features
• 📄 PDF Text Extraction: Extract regular text from PDFs
• 📊 Advanced Table Detection: Multiple extraction methods (PyMuPDF, Tabula, Camelot)
• 🖼️ Image Processing: Extract and process embedded images from PDFs
• 🔍 OCR with Table Detection: Tesseract OCR optimized for tables in images
• 🤖 AI Enhancement: Clean and format text using Groq models
• 📝 Multiple Formats: Support for PDF, PNG, JPG, TIFF, and more
🚀 Quick Start
Installation

```shell
# Core dependencies
pip install PyMuPDF Pillow pytesseract groq pandas

# Optional (recommended for better table extraction)
pip install tabula-py 'camelot-py[cv]'
```

Install Tesseract OCR:

```shell
# Windows: download an installer from GitHub releases
# macOS
brew install tesseract
# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr
```

Basic Usage
```python
from pdf_converter import document_to_text_converter

# Convert PDF with all features
success = document_to_text_converter(
    input_path="document.pdf",
    output_path="extracted_text.txt",
    extract_tables=True,
    use_ocr=True,
    use_groq=True,
    groq_api_key="your_groq_api_key"
)
```
Command Line
```shell
# Basic conversion
python pdf_converter.py document.pdf output.txt

# With all features enabled
python pdf_converter.py document.pdf output.txt --table-method all --use-groq --api-key your_key

# Process image with table detection
python pdf_converter.py screenshot.png text.txt --use-groq
```
📋 Configuration Options
| Parameter | Description | Default |
|---|---|---|
| `extract_tables` | Extract tables from PDFs | `True` |
| `table_method` | Table extraction method (`pymupdf`, `tabula`, `camelot`, `all`) | `all` |
| `extract_images` | Extract images from PDFs | `True` |
| `use_ocr` | Perform OCR on images | `True` |
| `detect_tables_in_images` | Detect tables in images via OCR | `True` |
| `ocr_lang` | OCR language (`eng`, `spa`, `fra`, etc.) | `eng` |
| `use_groq` | Enhance text with Groq AI | `False` |
🛠️ Table Extraction Methods
• PyMuPDF: Fast, built-in table detection
• Tabula: Excellent for bordered tables
• Camelot: High-quality extraction with accuracy scores
• All: Use all methods for best results
📁 Output Structure
```
=== EXTRACTED TEXT FROM PDF ===
[Regular text content]

=== TABLES (PyMuPDF) ===
[Structured table data]

=== OCR TEXT FROM IMAGES ===
[Text from embedded images with table detection]
```
🔧 Environment Variables
Set your Groq API key:
```shell
export GROQ_API_KEY="your_groq_api_key_here"
```
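As a small sketch of how a tool like this can pick the key up at runtime (the variable name `GROQ_API_KEY` matches the export above; the empty-string fallback and the `use_groq` toggle are illustrative, not the converter's actual code):

```python
import os

# Read the Groq API key from the environment so it is never hard-coded.
groq_api_key = os.environ.get("GROQ_API_KEY", "")

# Enable AI enhancement only when a key is actually present.
use_groq = bool(groq_api_key)
```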
📋 Supported Formats
• Input: PDF, PNG, JPG, JPEG, TIFF, BMP, GIF, WebP
• Output: Plain text with structured tables
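A converter supporting these formats typically routes each file by extension before processing. A minimal sketch, assuming the input list above (the `route` helper and its return values are hypothetical names, not the converter's API):

```python
from pathlib import Path

# Image extensions from the supported-formats list above
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp"}

def route(path):
    """Decide which pipeline handles a file, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf"      # text + table + embedded-image extraction
    if ext in IMAGE_EXTS:
        return "image"    # OCR with table detection
    raise ValueError(f"Unsupported format: {ext}")
```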
🏥 Medical Domain Applications
Research & Clinical Documentation
• 📚 Medical Literature Review: Extract text and data tables from research papers, clinical studies, and medical journals for systematic reviews and meta-analyses
• 📊 Clinical Trial Data: Convert PDF trial reports into structured text for data analysis and regulatory submissions
• 🔬 Lab Reports: Extract numerical values and test results from scanned laboratory reports for electronic health records
Patient Care & Records
• 📋 Medical Record Digitization: Convert paper-based patient records, discharge summaries, and consultation notes into searchable digital text
• 💊 Prescription Processing: Extract medication information from handwritten or printed prescriptions for pharmacy management systems
• 🩺 Diagnostic Reports: Process radiology reports, pathology results, and imaging studies for clinical decision support
Administrative & Compliance
• 💰 Insurance Claims: Extract patient information and procedure codes from medical forms for automated claim processing
• 📑 Regulatory Documentation: Convert FDA submissions, clinical protocols, and compliance documents into analyzable formats
• 📈 Quality Metrics: Extract performance data from hospital reports and quality improvement documents
How It Makes Medical Tasks Easier & Safer
⚡ Efficiency Gains
• Faster Data Entry: Eliminates most manual typing of patient information and medical data
• Batch Processing: Convert hundreds of medical documents simultaneously
• Automated Workflows: Integrate with hospital information systems for seamless data flow
🛡️ Safety & Accuracy Improvements
• Reduced Human Error: AI-powered error correction minimizes transcription mistakes in critical medical data
• Consistent Formatting: Standardized output reduces misinterpretation of medical information
• Audit Trails: Maintain original documents while creating searchable digital copies
🏥 Clinical Benefits
• More Patient Time: Reduce administrative burden, allowing healthcare providers to focus on patient care
• Faster Diagnosis: Quick access to historical patient data and test results
• Research Acceleration: Rapid literature review and data extraction for evidence-based medicine
🔒 Privacy & Compliance
• HIPAA Compliance: Process documents locally without cloud transmission
• Data Security: No patient data sent to external APIs unless explicitly configured

Challenges we ran into

  1. Table Detection Accuracy Issues
    Problem: Different PDF creation methods store tables differently: some as actual table objects, others as positioned text, and some as images.
    Solution in the code:
```python
# Multiple extraction methods for redundancy
if table_method in ['pymupdf', 'all']:
    pymupdf_tables = extract_tables_with_pymupdf(input_path)
if table_method in ['tabula', 'all'] and TABULA_AVAILABLE:
    tabula_tables = extract_tables_with_tabula(input_path)
if table_method in ['camelot', 'all'] and CAMELOT_AVAILABLE:
    camelot_tables = extract_tables_with_camelot(input_path)
```
    This "shotgun approach" ensures we catch tables regardless of how they're stored in the PDF.
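Running several extractors means the same table can be found more than once, so the results need de-duplication before output. A minimal sketch of that merge step, assuming each extractor yields tables as lists of rows (the `merge_tables` helper and the cell-fingerprint dedup are illustrative, not the project's actual code):

```python
def merge_tables(*table_lists):
    """Combine tables from several extractors, dropping exact duplicates."""
    seen = set()
    merged = []
    for tables in table_lists:
        for table in tables:
            # Hashable fingerprint of the cell contents for dedup
            key = tuple(tuple(row) for row in table)
            if key not in seen:
                seen.add(key)
                merged.append(table)
    return merged

# Two extractors both found the medication table; only Tabula found the labs.
pymupdf_tables = [[["Name", "Dose"], ["Aspirin", "100 mg"]]]
tabula_tables = [[["Name", "Dose"], ["Aspirin", "100 mg"]],
                 [["Test", "Result"], ["HbA1c", "5.4%"]]]
combined = merge_tables(pymupdf_tables, tabula_tables)
# combined holds the two unique tables
```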
  2. OCR Text Positioning Chaos
    Problem: Tesseract OCR returns words with pixel coordinates, but reconstructing table structure from scattered coordinates is notoriously difficult.
    Solution implemented:
```python
def process_tsv_for_tables(tsv_data):
    # Group text by line and word positions
    lines = {}
    for i in range(len(tsv_data['text'])):
        if tsv_data['conf'][i] > 30:  # Only use confident detections
            top = tsv_data['top'][i]
            # Group by approximate line (with tolerance)
            line_key = top // 10 * 10  # Group within 10 pixels
```
    The key insight was using "fuzzy grouping" - words within 10 pixels vertically are considered the same row.
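The fuzzy-grouping idea can be shown end to end on a tiny example. The dictionary below mimics the shape of Tesseract's TSV output (`text`, `conf`, `top`, `left` arrays); the sample values and the tab-joined row output are illustrative:

```python
# Four confident words in two visual rows, plus one low-confidence entry
tsv_data = {
    "text": ["Name", "Dose", "Aspirin", "100mg", ""],
    "conf": [95, 93, 91, 90, -1],
    "top":  [12, 14, 42, 45, 0],
    "left": [10, 120, 10, 120, 0],
}

lines = {}
for i in range(len(tsv_data["text"])):
    if tsv_data["conf"][i] > 30:                    # only confident detections
        line_key = tsv_data["top"][i] // 10 * 10    # same 10-pixel band = same row
        lines.setdefault(line_key, []).append(
            (tsv_data["left"][i], tsv_data["text"][i])
        )

# Sort rows top-to-bottom and words left-to-right, then join into cells
rows = ["\t".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())]
# rows == ["Name\tDose", "Aspirin\t100mg"]
```

Note how tops 12 and 14 land in the same band (10) and 42 and 45 in another (40), so words that are a few pixels apart vertically still form one row.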
  3. Memory Issues with Large PDFs
    Problem: Loading entire large medical documents into memory can crash the application.
    Chunking strategy:
```python
# Split text into chunks if it's too long
max_chunk_size = 4000
chunks = [text[i:i+max_chunk_size] for i in range(0, len(text), max_chunk_size)]
```
    Process documents in smaller pieces rather than all at once.
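The slicing above can be checked on a concrete length (the 9,500-character sample text is illustrative):

```python
# A 9,500-character document split with the 4,000-character chunk size above
text = "x" * 9500
max_chunk_size = 4000
chunks = [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]
# → three chunks of 4000, 4000 and 1500 characters; nothing is lost
```

Because the slices are contiguous, joining the chunks reproduces the original text exactly, so chunked AI enhancement never drops content at the boundaries.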

Tracks Applied (1)

CoreAgent Track

🤖 CoreAgent Development Track Alignment This PDF/Image to Text converter project demonstrates several core agent develo…
