Resume RAG Chatbot
Resume RAG Chatbot (LangChain + Streamlit)
Created on 7th September 2025
The problem Resume RAG Chatbot solves
- HR Efficiency Crisis
Manual screening takes 6-8 hours/day → This tool reduces it to minutes
Finding specific skills is like finding a needle in a haystack → Natural language search makes it instant
Candidate information gets forgotten → Quick contextual retrieval
- Business Impact Issues
High recruitment costs ($4,200+ per hire) → Reduces to ~$1,800
Slow hiring cycles (2+ weeks) → Down to 3 days
Poor quality of hire → Better data-driven decisions
- Technical Challenges
Unstructured PDF data is unsearchable → AI embeddings make it semantic
Scaling manual processes is impossible → Handles hundreds of resumes effortlessly
Context-blind keyword search → RAG provides intelligent, context-aware answers
- Strategic Advantages
Reactive hiring → Proactive talent pipeline management
Gut-feeling decisions → Data-driven candidate evaluation
Inconsistent evaluation → Standardized AI-powered screening
Challenges we ran into
- PDF Text Extraction Nightmare
The Problem:
```python
# This looked simple but was a disaster
pdf_reader = PdfReader(file)
text = pdf_reader.pages[0].extract_text()
```
Result: garbled text, missing spaces, weird formatting:
`"JohnDoeExperience:SoftwareDeveloper2020-2023Skills:Python,React"`
The Solution:
```python
# Switched to PyPDFLoader + proper text processing
def load_pdfs_to_docs(files: List[io.BytesIO]) -> List[Document]:
    docs: List[Document] = []
    for f in files:
        # Key insight: write to a temp file first!
        tmp_path = os.path.join("/tmp", f"upload_{time.time_ns()}.pdf")
        with open(tmp_path, "wb") as tmp:
            tmp.write(f.read())
        loader = PyPDFLoader(tmp_path)  # Much better extraction
        file_docs = loader.load()
        docs.extend(file_docs)  # Clean up temp files later
    return docs
```
What I Learned: Different PDF libraries handle formatting differently. PyPDFLoader preserves structure better than PyPDF2/pypdf.
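The "clean up temp files later" note above can be handled automatically with the stdlib `tempfile` module; here is a minimal sketch (the `load_with_tempfile` helper and its `loader_factory` callback are illustrative names, not part of the app — in practice `loader_factory` would be something like `lambda p: PyPDFLoader(p).load()`):

```python
import os
import tempfile

def load_with_tempfile(data: bytes, loader_factory):
    """Write uploaded bytes to a real temp file, run a path-based loader,
    and always delete the temp file afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close so the loader sees the complete file
        return loader_factory(tmp.name)
    finally:
        os.unlink(tmp.name)  # cleanup happens even if the loader raises
```

The `try/finally` guarantees no orphaned files pile up in `/tmp` across Streamlit reruns.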
- Memory Explosion with Large Resume Batches
The Problem:
```python
# This killed my local development server
def process_all_at_once(files):
    all_docs = []
    for file in files:  # 50+ PDF files
        docs = load_pdf(file)
        all_docs.extend(docs)  # Memory keeps growing...
    vectorstore = FAISS.from_documents(all_docs, embeddings)
```
BOOM! `MemoryError` when processing 100+ resumes.
The Solution:
```python
# Batch processing + session state management
def chunk_docs(docs: List[Document], chunk_size=800, chunk_overlap=150):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", ".", "!", "?", ",", " "]
    )
    return splitter.split_documents(docs)

# Process in smaller batches and extend the existing index
if st.session_state.vectorstore is None:
    st.session_state.vectorstore = FAISS.from_documents(chunks, embed)
else:
    # Add to existing index instead of recreating
    st.session_state.vectorstore.add_documents(chunks)
```
What I Learned: Always think about scalability from day 1. Batch processing and incremental updates are crucial.
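The batching idea above boils down to one small stdlib helper; a minimal sketch (the `batched` helper and the batch size of 10 are my own illustration, not code from the app):

```python
from typing import Iterable, List

def batched(items: List, batch_size: int = 10) -> Iterable[List]:
    """Yield successive fixed-size slices so only one batch of PDFs
    is held in memory at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage sketch against the functions above:
# for batch in batched(uploaded_files, 10):
#     chunks = chunk_docs(load_pdfs_to_docs(batch))
#     st.session_state.vectorstore.add_documents(chunks)
```

Each batch is chunked, embedded, and added to the existing FAISS index, so peak memory stays proportional to the batch size rather than the total upload.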
- Candidate Identity Extraction Hell
The Problem:
```python
# Naive regex approach failed miserably
def extract_name(text):
    # This caught everything: "Microsoft Excel", "New York", "Dear Sir"
    name_pattern = r"[A-Z][a-z]+ [A-Z][a-z]+"
    matches = re.findall(name_pattern, text)
    return matches[0]  # Usually wrong!
```
The Solution:
```python
# Multi-step heuristic approach
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NAME_HINT_RE = re.compile(
    r"^(?:name\s*[:-]\s*)?([A-Z][a-zA-Z-']+\s+[A-Z][a-zA-Z-']+(?:\s+[A-Z][a-zA-Z-']+)*)$"
)

def extract_contact(text: str) -> Dict[str, str]:
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)  # PHONE_RE defined elsewhere in the app
    # Smart heuristic: check the first 20 lines only
    name = None
    for line in text.splitlines()[:20]:
        line = line.strip()
        if not line or len(line) > 50:  # Skip empty and long lines
            continue
        m = NAME_HINT_RE.match(line)
        if m and not any(word in line.lower() for word in ['company', 'university', 'street']):
            name = m.group(1).strip()
            break
    # Fallback to the email prefix
    if not name and emails:
        name = emails[0].split('@')[0].replace('.', ' ').title()
    return {"name": name or "", "email": emails[0] if emails else "", "phone": phones[0] if phones else ""}
```
What I Learned: Resume formats are wildly inconsistent. You need multiple fallback strategies and domain knowledge.
- FAISS Metadata Filtering Disaster
The Problem:
```python
# Expected this to work like a SQL WHERE clause
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"candidate_key": "[email protected]"}  # NOPE!
    }
)
```
Error: FAISS doesn't support metadata filtering in LangChain.
The Solution:
```python
# Workaround: retrieve more, then filter manually
def get_candidate_docs(vectorstore, query, candidate_key):
    if candidate_key:
        # Get more results than needed
        retr = vectorstore.as_retriever(search_kwargs={"k": 12})
        docs = retr.get_relevant_documents(query)
        # Manual filtering
        filtered_docs = [d for d in docs if d.metadata.get("candidate_key") == candidate_key]
        # Fallback if vector search misses the candidate
        if not filtered_docs:
            all_chunks = st.session_state.chunks
            filtered_docs = [d for d in all_chunks if d.metadata.get("candidate_key") == candidate_key][:8]
        return filtered_docs
    else:
        # Normal search across all candidates
        retr = vectorstore.as_retriever(search_kwargs={"k": 5})
        return retr.get_relevant_documents(query)
```
What I Learned: Not all vector databases support metadata filtering. Always have a backup plan.
- Streamlit-Specific Nightmares
