Bio Bots

Bridging Medical Texts and AI

Built at Bio x AI Hackathon 2025

Created on 7th June 2025


The problem Bio Bots solves

PDF/Image to Text Converter with Table Extraction
A comprehensive Python tool that converts PDFs and images to text with advanced table extraction, OCR, and AI enhancement using Groq models.
✨ Features
• 📄 PDF Text Extraction: Extract regular text from PDFs
• 📊 Advanced Table Detection: Multiple extraction methods (PyMuPDF, Tabula, Camelot)
• 🖼️ Image Processing: Extract and process embedded images from PDFs
• 🔍 OCR with Table Detection: Tesseract OCR optimized for tables in images
• 🤖 AI Enhancement: Clean and format text using Groq models
• 📝 Multiple Formats: Support for PDF, PNG, JPG, TIFF, and more
🚀 Quick Start
Installation

# Core dependencies
pip install PyMuPDF Pillow pytesseract groq pandas

# Optional (recommended for better table extraction)
pip install tabula-py 'camelot-py[cv]'

# Install Tesseract OCR
# Windows: Download from GitHub releases
# macOS: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr

Basic Usage
from pdf_converter import document_to_text_converter

# Convert PDF with all features
success = document_to_text_converter(
    input_path="document.pdf",
    output_path="extracted_text.txt",
    extract_tables=True,
    use_ocr=True,
    use_groq=True,
    groq_api_key="your_groq_api_key"
)
Command Line

# Basic conversion
python pdf_converter.py document.pdf output.txt

# With all features enabled
python pdf_converter.py document.pdf output.txt --table-method all --use-groq --api-key your_key

# Process image with table detection
python pdf_converter.py screenshot.png text.txt --use-groq
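The command lines above suggest an argument parser along the following lines. This is a minimal sketch, not the project's actual parser: only the two positional arguments and the `--table-method`, `--use-groq`, and `--api-key` flags appear in the examples. The function names `build_parser` and `parse_cli`, and the fallback to the `GROQ_API_KEY` environment variable, are assumptions made for illustration.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Convert a PDF or image to text")
    parser.add_argument("input_path", help="PDF or image file to convert")
    parser.add_argument("output_path", help="Destination text file")
    parser.add_argument(
        "--table-method",
        choices=["pymupdf", "tabula", "camelot", "all"],
        default="all",
        help="Table extraction method",
    )
    parser.add_argument("--use-groq", action="store_true",
                        help="Enhance text with a Groq model")
    parser.add_argument("--api-key", default=None,
                        help="Groq API key (assumed to fall back to GROQ_API_KEY)")
    return parser


def parse_cli(argv):
    args = build_parser().parse_args(argv)
    # Assumption: an explicit --api-key wins over the environment variable.
    if args.api_key is None:
        args.api_key = os.environ.get("GROQ_API_KEY")
    return args
```

Restricting `--table-method` with `choices` means a misspelled method name is rejected at parse time rather than silently skipping table extraction.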
📋 Configuration Options
| Parameter | Description | Default |
|---|---|---|
| extract_tables | Extract tables from PDFs | True |
| table_method | Table extraction method (pymupdf, tabula, camelot, all) | all |
| extract_images | Extract images from PDFs | True |
| use_ocr | Perform OCR on images | True |
| detect_tables_in_images | Detect tables in images via OCR | True |
| ocr_lang | OCR language (eng, spa, fra, etc.) | eng |
| use_groq | Enhance text with Groq AI | False |
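One way to keep these options together is a small dataclass whose fields and defaults are copied from the list above. The `ConverterOptions` class is hypothetical (it is not part of the project), and the rule that disables table detection in images when OCR is off is an assumption about how the two flags interact.

```python
from dataclasses import dataclass

TABLE_METHODS = ("pymupdf", "tabula", "camelot", "all")


@dataclass
class ConverterOptions:
    """Hypothetical container; field names and defaults copied from the options list."""
    extract_tables: bool = True
    table_method: str = "all"
    extract_images: bool = True
    use_ocr: bool = True
    detect_tables_in_images: bool = True
    ocr_lang: str = "eng"
    use_groq: bool = False

    def __post_init__(self):
        if self.table_method not in TABLE_METHODS:
            raise ValueError(f"table_method must be one of {TABLE_METHODS}")
        # Assumption: table detection in images only makes sense when OCR runs.
        if self.detect_tables_in_images and not self.use_ocr:
            self.detect_tables_in_images = False
```

If `document_to_text_converter` accepts all of these names as keyword arguments (the usage example only confirms some of them), the options could be passed as `**dataclasses.asdict(options)`.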
🛠️ Table Extraction Methods
• PyMuPDF: Fast, built-in table detection
• Tabula: Excellent for bordered tables
• Camelot: High-quality extraction with accuracy scores
• All: Use all methods for best results
📁 Output Structure
=== EXTRACTED TEXT FROM PDF ===
[Regular text content]

=== TABLES (PyMuPDF) ===
[Structured table data]

=== OCR TEXT FROM IMAGES ===
[Text from embedded images with table detection]
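The sectioned layout above could be assembled with a helper like the one below. This is a sketch under stated assumptions: `assemble_output` and its parameters are hypothetical names, and skipping empty sections is a design choice made here, not confirmed behavior of the tool.

```python
def assemble_output(pdf_text, tables_by_method, ocr_text):
    """Join the extraction results into the sectioned layout shown above.

    tables_by_method maps a method label such as "PyMuPDF" to rendered table text.
    """
    sections = [("EXTRACTED TEXT FROM PDF", pdf_text)]
    for method, table_text in tables_by_method.items():
        sections.append((f"TABLES ({method})", table_text))
    sections.append(("OCR TEXT FROM IMAGES", ocr_text))

    parts = []
    for title, body in sections:
        if body and body.strip():  # skip empty sections (an assumption)
            parts.append(f"=== {title} ===\n{body.strip()}")
    return "\n\n".join(parts) + "\n"
```

Keeping one `TABLES (...)` section per method lets a reader compare what each extractor found when `table_method` is `all`.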
🔧 Environment Variables
Set your Groq API key:
export GROQ_API_KEY="your_groq_api_key_here"
📋 Supported Formats
• Input: PDF, PNG, JPG, JPEG, TIFF, BMP, GIF, WebP
• Output: Plain text with structured tables
🏥 Medical Domain Applications
Research & Clinical Documentation
• 📚 Medical Literature Review: Extract text and data tables from research papers, clinical studies, and medical journals for systematic reviews and meta-analyses
• 📊 Clinical Trial Data: Convert PDF trial reports into structured text for data analysis and regulatory submissions
• 🔬 Lab Reports: Extract numerical values and test results from scanned laboratory reports for electronic health records
Patient Care & Records
• 📋 Medical Record Digitization: Convert paper-based patient records, discharge summaries, and consultation notes into searchable digital text
• 💊 Prescription Processing: Extract medication information from handwritten or printed prescriptions for pharmacy management systems
• 🩺 Diagnostic Reports: Process radiology reports, pathology results, and imaging studies for clinical decision support
Administrative & Compliance
• 💰 Insurance Claims: Extract patient information and procedure codes from medical forms for automated claim processing
• 📑 Regulatory Documentation: Convert FDA submissions, clinical protocols, and compliance documents into analyzable formats
• 📈 Quality Metrics: Extract performance data from hospital reports and quality improvement documents
How It Makes Medical Tasks Easier & Safer
⚡ Efficiency Gains
• 10x Faster Data Entry: Eliminate manual typing of patient information and medical data
• Batch Processing: Convert hundreds of medical documents simultaneously
• Automated Workflows: Integrate with hospital information systems for seamless data flow
🛡️ Safety & Accuracy Improvements
• Reduced Human Error: AI-powered error correction minimizes transcription mistakes in critical medical data
• Consistent Formatting: Standardized output reduces misinterpretation of medical information
• Audit Trails: Maintain original documents while creating searchable digital copies
🏥 Clinical Benefits
• More Patient Time: Reduce administrative burden, allowing healthcare providers to focus on patient care
• Faster Diagnosis: Quick access to historical patient data and test results
• Research Acceleration: Rapid literature review and data extraction for evidence-based medicine
🔒 Privacy & Compliance
• HIPAA Compliance: Process documents locally without cloud transmission (the default, since use_groq is False)
• Data Security: No patient data sent to external APIs unless explicitly configured

Challenges we ran into

Table Detection Accuracy Issues
Problem: Different PDF creation methods store tables differently - some as actual table objects, others as positioned text, and some as images.
Solution in the code:
# Multiple extraction methods for redundancy
if table_method in ['pymupdf', 'all']:
    pymupdf_tables = extract_tables_with_pymupdf(input_path)
if table_method in ['tabula', 'all'] and TABULA_AVAILABLE:
    tabula_tables = extract_tables_with_tabula(input_path)
if table_method in ['camelot', 'all'] and CAMELOT_AVAILABLE:
    camelot_tables = extract_tables_with_camelot(input_path)
This "shotgun approach" improves the odds of catching a table regardless of how it is stored in the PDF, at the cost of possibly extracting the same table more than once.
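The dispatch above can be generalized so that each backend is optional and a failure in one does not stop the others. The sketch below is illustrative only: deriving `TABULA_AVAILABLE` and `CAMELOT_AVAILABLE` from `importlib.util.find_spec` is an assumption about how the flags might be set, and `run_table_extractors` with its `extractors` and `available` parameters is a hypothetical helper, not the project's code.

```python
import importlib.util

# Assumption: the availability flags are derived from whether the optional
# packages can be imported; the project may set them differently.
TABULA_AVAILABLE = importlib.util.find_spec("tabula") is not None
CAMELOT_AVAILABLE = importlib.util.find_spec("camelot") is not None


def run_table_extractors(input_path, table_method, extractors, available):
    """Run every requested and available extractor, keyed by method name.

    extractors: dict of method name -> callable(input_path) returning a list of tables.
    available: dict of method name -> bool.
    """
    results = {}
    for name, extract in extractors.items():
        if table_method not in (name, "all") or not available.get(name, False):
            continue
        try:
            results[name] = extract(input_path)
        except Exception as exc:  # one failing backend should not stop the others
            results[name] = []
            print(f"{name} extraction failed: {exc}")
    return results
```

A caller would pass something like `{"pymupdf": extract_tables_with_pymupdf, "tabula": extract_tables_with_tabula, "camelot": extract_tables_with_camelot}` together with the availability flags.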
OCR Text Positioning Chaos
Problem: Tesseract OCR returns words with pixel coordinates, but reconstructing table structure from scattered coordinates is notoriously difficult.
Solution implemented:
def process_tsv_for_tables(tsv_data):
    # Group text by line and word positions
    lines = {}
    for i in range(len(tsv_data['text'])):
        if tsv_data['conf'][i] > 30:  # Only use confident detections
            top = tsv_data['top'][i]
            # Group by approximate line (with tolerance)
            line_key = top // 10 * 10  # Group within 10 pixels
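The key insight was "fuzzy grouping": each word's top coordinate is rounded down to a multiple of 10, so words that land in the same 10-pixel band are treated as the same row. Words that straddle a band boundary (for example, top values of 19 and 20) end up in different rows, so the band size may need tuning to the text height.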
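A complete version of this grouping might look like the sketch below. It assumes the input has the dictionary shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with `text`, `conf`, `top`, and `left` lists. The function name `group_words_into_rows`, the left-to-right ordering step, and the `float()` conversion of confidence values are additions for illustration, not taken from the project.

```python
def group_words_into_rows(tsv_data, min_conf=30, band=10):
    """Bucket OCR words into rows by vertical band, then order each row left to right."""
    rows = {}
    for i, word in enumerate(tsv_data["text"]):
        if not word.strip():
            continue  # skip empty entries that carry no text
        if float(tsv_data["conf"][i]) <= min_conf:
            continue  # drop low-confidence detections
        line_key = tsv_data["top"][i] // band * band
        rows.setdefault(line_key, []).append((tsv_data["left"][i], word))

    # Sort rows top to bottom, and words within a row by x position.
    return [
        [word for _, word in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]


sample = {
    "text": ["Dose", "Drug", "", "5mg", "Aspirin"],
    "conf": [95, 90, -1, 88, 91],
    "top": [12, 14, 0, 41, 43],
    "left": [200, 20, 0, 200, 20],
}
rows = group_words_into_rows(sample)
```

Each inner list is one reconstructed table row; splitting rows into columns would additionally need the `left` gaps between words, which this sketch does not attempt.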
Memory Issues with Large PDFs
Problem: Loading entire large medical documents into memory can crash the application.
Chunking strategy:
# Split text into chunks if it's too long
max_chunk_size = 4000
chunks = [text[i:i+max_chunk_size] for i in range(0, len(text), max_chunk_size)]
Process documents in smaller pieces rather than all at once.
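Fixed-size slicing can cut a word or a number in half at a chunk boundary, which matters for values such as lab results. The variant below is a sketch of one possible refinement, not the project's code: the hypothetical `chunk_text` helper prefers to cut just after the last whitespace in each window and falls back to a hard cut when there is none.

```python
def chunk_text(text, max_chunk_size=4000):
    """Split text into chunks of at most max_chunk_size characters.

    Prefers to cut just after the last whitespace inside the window so that
    words and numbers are not split; falls back to a hard cut otherwise.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        if end < len(text):
            # Look for the last space or newline in text[start:end].
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Joining the returned chunks reproduces the original text exactly, so the chunks can be processed one at a time and the results concatenated in order.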

Tracks Applied (1)

CoreAgent Track

🤖 CoreAgent Development Track Alignment: This PDF/Image to Text converter project demonstrates several core agent develo...

Technologies used

HTML

React

CSS

JavaScript

Python

Tailwind CSS
