Bio Bots
Bridging Medical Texts and AI
The problem Bio Bots solves
PDF/Image to Text Converter with Table Extraction
A comprehensive Python tool that converts PDFs and images to text with advanced table extraction, OCR, and AI enhancement using Groq models.
✨ Features
• 📄 PDF Text Extraction: Extract regular text from PDFs
• 📊 Advanced Table Detection: Multiple extraction methods (PyMuPDF, Tabula, Camelot)
• 🖼️ Image Processing: Extract and process embedded images from PDFs
• 🔍 OCR with Table Detection: Tesseract OCR optimized for tables in images
• 🤖 AI Enhancement: Clean and format text using Groq models
• 📝 Multiple Formats: Support for PDF, PNG, JPG, TIFF, and more
🚀 Quick Start
Installation
Core dependencies
pip install PyMuPDF Pillow pytesseract groq pandas
Optional (recommended for better table extraction)
pip install tabula-py 'camelot-py[cv]'
Install Tesseract OCR
Windows: Download from GitHub releases
macOS: brew install tesseract
Linux: sudo apt-get install tesseract-ocr
Basic Usage
from pdf_converter import document_to_text_converter

# Convert PDF with all features
success = document_to_text_converter(
    input_path="document.pdf",
    output_path="extracted_text.txt",
    extract_tables=True,
    use_ocr=True,
    use_groq=True,
    groq_api_key="your_groq_api_key"
)
Command Line
Basic conversion
python pdf_converter.py document.pdf output.txt
With all features enabled
python pdf_converter.py document.pdf output.txt --table-method all --use-groq --api-key your_key
Process image with table detection
python pdf_converter.py screenshot.png text.txt --use-groq
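The commands above imply two positional arguments plus the `--table-method`, `--use-groq`, and `--api-key` flags. A minimal sketch of how such an interface could be declared with `argparse` follows; the function name `build_parser` and the help texts are assumptions for illustration, and the real `pdf_converter.py` may define its options differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Convert a PDF or image to text.")
    parser.add_argument("input_path", help="PDF or image file to convert")
    parser.add_argument("output_path", help="Where to write the extracted text")
    parser.add_argument(
        "--table-method",
        choices=["pymupdf", "tabula", "camelot", "all"],
        default="all",  # matches the documented default
        help="Table extraction method",
    )
    parser.add_argument("--use-groq", action="store_true", help="Enhance text with Groq")
    parser.add_argument("--api-key", default=None, help="Groq API key")
    return parser


# Parse the "all features" command line from above.
args = build_parser().parse_args(
    ["document.pdf", "output.txt", "--table-method", "all", "--use-groq", "--api-key", "your_key"]
)
print(args.input_path, args.table_method, args.use_groq)
```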
📋 Configuration Options
Parameter | Description | Default
--- | --- | ---
extract_tables | Extract tables from PDFs | True
table_method | Table extraction method (pymupdf, tabula, camelot, all) | all
extract_images | Extract images from PDFs | True
use_ocr | Perform OCR on images | True
detect_tables_in_images | Detect tables in images via OCR | True
ocr_lang | OCR language (eng, spa, fra, etc.) | eng
use_groq | Enhance text with Groq AI | False
🛠️ Table Extraction Methods
• PyMuPDF: Fast, built-in table detection
• Tabula: Excellent for bordered tables
• Camelot: High-quality extraction with accuracy scores
• All: Use all methods for best results
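Because Tabula and Camelot are optional installs, the "all" setting can only run the backends that are actually present. The sketch below shows one way to detect installed backends and resolve a `table_method` value into a list of methods to run. The helper names `detect_available` and `select_methods` are hypothetical, and the import names (`fitz` for PyMuPDF, `tabula` for tabula-py, `camelot` for camelot-py) are the usual ones for those packages rather than something this README states.

```python
import importlib.util

# Assumed import name for each backend package.
BACKEND_MODULES = {"pymupdf": "fitz", "tabula": "tabula", "camelot": "camelot"}


def detect_available(modules=BACKEND_MODULES):
    """Return the set of backend names whose module can be imported."""
    return {
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is not None
    }


def select_methods(table_method, available):
    """Expand 'all' and drop any backend that is not installed."""
    requested = list(BACKEND_MODULES) if table_method == "all" else [table_method]
    return [method for method in requested if method in available]


# Example: Tabula is missing, so "all" falls back to the other two backends.
print(select_methods("all", {"pymupdf", "camelot"}))
```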
📁 Output Structure
=== EXTRACTED TEXT FROM PDF ===
[Regular text content]
=== TABLES (PyMuPDF) ===
[Structured table data]
=== OCR TEXT FROM IMAGES ===
[Text from embedded images with table detection]
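Since the output file uses `=== TITLE ===` lines as section markers, downstream code can split it back into sections. This is a small sketch under the assumption that every section header sits on its own line in exactly that form; `split_sections` is a hypothetical helper, not part of the tool.

```python
import re

# A header is a full line of the form "=== TITLE ===".
SECTION_HEADER = re.compile(r"^=== (.+?) ===$")


def split_sections(output_text):
    """Map each section title to the text that follows it."""
    sections = {}
    current = None
    for line in output_text.splitlines():
        match = SECTION_HEADER.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


sample = """=== EXTRACTED TEXT FROM PDF ===
Patient summary

=== TABLES (PyMuPDF) ===
Test | Value
"""
sections = split_sections(sample)
print(list(sections))
```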
🔧 Environment Variables
Set your Groq API key:
export GROQ_API_KEY="your_groq_api_key_here"
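On the Python side, a key passed explicitly can take priority over the environment variable. The sketch below illustrates that lookup order; `resolve_groq_api_key` is a hypothetical helper, and whether the tool itself resolves the key this way is an assumption.

```python
import os


def resolve_groq_api_key(explicit_key=None, env=os.environ):
    """Prefer an explicitly passed key, then fall back to GROQ_API_KEY."""
    key = explicit_key or env.get("GROQ_API_KEY")
    # Treat empty or whitespace-only values as "no key configured".
    return key.strip() if key and key.strip() else None


api_key = resolve_groq_api_key()
if api_key is None:
    print("No Groq API key found; AI enhancement would be skipped.")
```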
📋 Supported Formats
• Input: PDF, PNG, JPG, JPEG, TIFF, BMP, GIF, WebP
• Output: Plain text with structured tables
🏥 Medical Domain Applications
Research & Clinical Documentation
• 📚 Medical Literature Review: Extract text and data tables from research papers, clinical studies, and medical journals for systematic reviews and meta-analyses
• 📊 Clinical Trial Data: Convert PDF trial reports into structured text for data analysis and regulatory submissions
• 🔬 Lab Reports: Extract numerical values and test results from scanned laboratory reports for electronic health records
Patient Care & Records
• 📋 Medical Record Digitization: Convert paper-based patient records, discharge summaries, and consultation notes into searchable digital text
• 💊 Prescription Processing: Extract medication information from handwritten or printed prescriptions for pharmacy management systems
• 🩺 Diagnostic Reports: Process radiology reports, pathology results, and imaging studies for clinical decision support
Administrative & Compliance
• 💰 Insurance Claims: Extract patient information and procedure codes from medical forms for automated claim processing
• 📑 Regulatory Documentation: Convert FDA submissions, clinical protocols, and compliance documents into analyzable formats
• 📈 Quality Metrics: Extract performance data from hospital reports and quality improvement documents
How It Makes Medical Tasks Easier & Safer
⚡ Efficiency Gains
• 10x Faster Data Entry: Eliminate manual typing of patient information and medical data
• Batch Processing: Convert hundreds of medical documents simultaneously
• Automated Workflows: Integrate with hospital information systems for seamless data flow
🛡️ Safety & Accuracy Improvements
• Reduced Human Error: AI-powered error correction minimizes transcription mistakes in critical medical data
• Consistent Formatting: Standardized output reduces misinterpretation of medical information
• Audit Trails: Maintain original documents while creating searchable digital copies
🏥 Clinical Benefits
• More Patient Time: Reduce administrative burden, allowing healthcare providers to focus on patient care
• Faster Diagnosis: Quick access to historical patient data and test results
• Research Acceleration: Rapid literature review and data extraction for evidence-based medicine
🔒 Privacy & Compliance
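• HIPAA-Friendly Local Processing: With use_groq left at its default (False), text extraction, table extraction, and OCR run locally with no cloud transmission
• Data Security: Document text is sent to an external API only when Groq enhancement is explicitly enabled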
Challenges we ran into
- Table Detection Accuracy Issues
Problem: PDF creation tools store tables in different ways. Some emit actual table objects, others emit positioned text, and some embed the table as an image.
Solution in the code:
# Multiple extraction methods for redundancy
if table_method in ['pymupdf', 'all']:
    pymupdf_tables = extract_tables_with_pymupdf(input_path)
if table_method in ['tabula', 'all'] and TABULA_AVAILABLE:
    tabula_tables = extract_tables_with_tabula(input_path)
if table_method in ['camelot', 'all'] and CAMELOT_AVAILABLE:
    camelot_tables = extract_tables_with_camelot(input_path)
This "shotgun approach" improves the chance of catching a table regardless of how it is stored in the PDF.
- OCR Text Positioning Chaos
Problem: Tesseract OCR returns words with pixel coordinates, but reconstructing table structure from scattered coordinates is notoriously difficult.
Solution implemented:
def process_tsv_for_tables(tsv_data):
    # Group text by line and word positions
    lines = {}
    for i in range(len(tsv_data['text'])):
        if tsv_data['conf'][i] > 30:  # Only use confident detections
            top = tsv_data['top'][i]
            # Group by approximate line (with tolerance)
            line_key = top // 10 * 10  # Group within 10 pixels
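The excerpt stops once `line_key` is computed. A self-contained sketch of how the grouping could be carried through to ordered rows is shown below. It assumes `tsv_data` has the dictionary shape that `pytesseract.image_to_data(..., output_type=Output.DICT)` produces (parallel `text`, `conf`, `top`, and `left` lists); the function name `group_words_into_rows` and the sample values are illustrative, not taken from the project.

```python
from collections import defaultdict


def group_words_into_rows(tsv_data, min_conf=30, row_height=10):
    """Bucket OCR words into rows by vertical position, then sort left to right."""
    rows = defaultdict(list)
    for text, conf, top, left in zip(
        tsv_data["text"], tsv_data["conf"], tsv_data["top"], tsv_data["left"]
    ):
        # Skip empty tokens and low-confidence detections.
        if not text.strip() or float(conf) <= min_conf:
            continue
        # Words whose top coordinate falls in the same band share a row key.
        rows[top // row_height * row_height].append((left, text))
    # Order rows top to bottom and words within a row left to right.
    return [[word for _, word in sorted(words)] for _, words in sorted(rows.items())]


# Synthetic OCR output: two table rows plus one empty token.
sample = {
    "text": ["Test", "Result", "Glucose", "5.4", ""],
    "conf": [96, 91, 88, 93, -1],
    "top": [12, 14, 41, 43, 0],
    "left": [10, 120, 10, 120, 0],
}
rows = group_words_into_rows(sample)
print(rows)
```

One limitation of fixed bands: two words at `top` values 19 and 21 land in different rows even though they are only 2 pixels apart, so the band height may need tuning per document.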
The key insight was "fuzzy grouping": words whose top coordinates fall within the same 10-pixel band are treated as one row.
- Memory Issues with Large PDFs
Problem: Loading entire large medical documents into memory can crash the application.
Chunking strategy:
# Split text into chunks if it's too long
max_chunk_size = 4000
chunks = [text[i:i+max_chunk_size] for i in range(0, len(text), max_chunk_size)]
Process documents in smaller pieces rather than all at once.
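Fixed-size slicing can cut a word or table row in half. As a variation on the slicing above (not the project's actual code), this sketch yields chunks lazily and prefers to break at the last newline inside each window; `iter_chunks` is a hypothetical name.

```python
def iter_chunks(text, max_chunk_size=4000):
    """Yield pieces of text no longer than max_chunk_size, breaking at newlines when possible."""
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        if end < len(text):
            # Look for the last newline in the window and cut just after it.
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1
        yield text[start:end]
        start = end


text = "line one\nline two\nline three"
chunks = list(iter_chunks(text, max_chunk_size=12))
print(chunks)
```

Joining the chunks reproduces the original text exactly, so no content is lost at the boundaries.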
Tracks Applied (1)
CoreAgent Track
Technologies used