Resume RAG Chatbot
Resume RAG Chatbot (LangChain + Streamlit)
Created on 7th September 2025
The problem Resume RAG Chatbot solves
- HR Efficiency Crisis
Manual screening takes 6-8 hours/day → This tool reduces it to minutes
Finding specific skills is like finding a needle in a haystack → Natural language search makes it instant
Candidate information gets forgotten → Quick contextual retrieval
- Business Impact Issues
High recruitment costs ($4,200+ per hire) → Reduces to ~$1,800
Slow hiring cycles (2+ weeks) → Down to 3 days
Poor quality of hire → Better data-driven decisions
- Technical Challenges
Unstructured PDF data is unsearchable → AI embeddings make it semantic
Scaling manual processes is impossible → Handles hundreds of resumes effortlessly
Context-blind keyword search → RAG provides intelligent, context-aware answers
- Strategic Advantages
Reactive hiring → Proactive talent pipeline management
Gut-feeling decisions → Data-driven candidate evaluation
Inconsistent evaluation → Standardized AI-powered screening
Challenges we ran into
- PDF Text Extraction Nightmare
The Problem:
```python
# This looked simple but was a disaster
pdf_reader = PdfReader(file)
text = pdf_reader.pages[0].extract_text()
```
Result: garbled text, missing spaces, weird formatting:
`"JohnDoeExperience:SoftwareDeveloper2020-2023Skills:Python,React"`
The Solution:
```python
# Switched to PyPDFLoader + proper text processing
def load_pdfs_to_docs(files: List[io.BytesIO]) -> List[Document]:
    docs: List[Document] = []
    for f in files:
        # Key insight: write to a temp file first!
        tmp_path = os.path.join("/tmp", f"upload_{time.time_ns()}.pdf")
        with open(tmp_path, "wb") as tmp:
            tmp.write(f.read())
        loader = PyPDFLoader(tmp_path)  # Much better extraction
        file_docs = loader.load()
        docs.extend(file_docs)  # Clean up temp files later
    return docs
```
What I Learned: Different PDF libraries handle formatting differently. PyPDFLoader preserves structure better than PyPDF2/pypdf.
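The "clean up temp files later" note above can be handled automatically with the stdlib `tempfile` module; here is a minimal sketch (the `load_with_tempfile` helper and its `loader_factory` callback are illustrative names, not part of the app — in practice `loader_factory` would be something like `lambda p: PyPDFLoader(p).load()`):

```python
import os
import tempfile

def load_with_tempfile(data: bytes, loader_factory):
    """Write uploaded bytes to a real temp file, run a path-based loader,
    and always delete the temp file afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close so the loader sees the complete file
        return loader_factory(tmp.name)
    finally:
        os.unlink(tmp.name)  # cleanup happens even if the loader raises
```

The `try/finally` guarantees no orphaned files pile up in `/tmp` across Streamlit reruns.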
- Memory Explosion with Large Resume Batches
The Problem:
```python
# This killed my local development server
def process_all_at_once(files):
    all_docs = []
    for file in files:  # 50+ PDF files
        docs = load_pdf(file)
        all_docs.extend(docs)  # Memory keeps growing...
    vectorstore = FAISS.from_documents(all_docs, embeddings)
```
BOOM! `MemoryError` when processing 100+ resumes.
The Solution:
```python
# Batch processing + session state management
def chunk_docs(docs: List[Document], chunk_size=800, chunk_overlap=150):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", ".", "!", "?", ",", " "]
    )
    return splitter.split_documents(docs)

# Process in smaller batches and extend the existing index
if st.session_state.vectorstore is None:
    st.session_state.vectorstore = FAISS.from_documents(chunks, embed)
else:
    # Add to existing index instead of recreating
    st.session_state.vectorstore.add_documents(chunks)
```
What I Learned: Always think about scalability from day 1. Batch processing and incremental updates are crucial.
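The batching idea above boils down to one small stdlib helper; a minimal sketch (the `batched` helper and the batch size of 10 are my own illustration, not code from the app):

```python
from typing import Iterable, List

def batched(items: List, batch_size: int = 10) -> Iterable[List]:
    """Yield successive fixed-size slices so only one batch of PDFs
    is held in memory at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage sketch against the functions above:
# for batch in batched(uploaded_files, 10):
#     chunks = chunk_docs(load_pdfs_to_docs(batch))
#     st.session_state.vectorstore.add_documents(chunks)
```

Each batch is chunked, embedded, and added to the existing FAISS index, so peak memory stays proportional to the batch size rather than the total upload.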
- Candidate Identity Extraction Hell
The Problem:
```python
# Naive regex approach failed miserably
def extract_name(text):
    # This caught everything: "Microsoft Excel", "New York", "Dear Sir"
    name_pattern = r"[A-Z][a-z]+ [A-Z][a-z]+"
    matches = re.findall(name_pattern, text)
    return matches[0]  # Usually wrong!
```
The Solution:
```python
# Multi-step heuristic approach
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NAME_HINT_RE = re.compile(
    r"^(?:name\s*[:-]\s*)?([A-Z][a-zA-Z-']+\s+[A-Z][a-zA-Z-']+(?:\s+[A-Z][a-zA-Z-']+)*)$"
)

def extract_contact(text: str) -> Dict[str, str]:
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)  # PHONE_RE defined elsewhere in the app
    # Smart heuristic: check the first 20 lines only
    name = None
    for line in text.splitlines()[:20]:
        line = line.strip()
        if not line or len(line) > 50:  # Skip empty and long lines
            continue
        m = NAME_HINT_RE.match(line)
        if m and not any(word in line.lower() for word in ['company', 'university', 'street']):
            name = m.group(1).strip()
            break
    # Fallback to the email prefix
    if not name and emails:
        name = emails[0].split('@')[0].replace('.', ' ').title()
    return {"name": name or "", "email": emails[0] if emails else "", "phone": phones[0] if phones else ""}
```
What I Learned: Resume formats are wildly inconsistent. You need multiple fallback strategies and domain knowledge.
- FAISS Metadata Filtering Disaster
The Problem:
```python
# Expected this to work like a SQL WHERE clause
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"candidate_key": "[email protected]"}  # NOPE!
    }
)
```
Error: FAISS doesn't support metadata filtering in LangChain.
The Solution:
```python
# Workaround: retrieve more, then filter manually
def get_candidate_docs(vectorstore, query, candidate_key):
    if candidate_key:
        # Get more results than needed
        retr = vectorstore.as_retriever(search_kwargs={"k": 12})
        docs = retr.get_relevant_documents(query)
        # Manual filtering
        filtered_docs = [d for d in docs if d.metadata.get("candidate_key") == candidate_key]
        # Fallback if vector search misses the candidate
        if not filtered_docs:
            all_chunks = st.session_state.chunks
            filtered_docs = [d for d in all_chunks if d.metadata.get("candidate_key") == candidate_key][:8]
        return filtered_docs
    else:
        # Normal search across all candidates
        retr = vectorstore.as_retriever(search_kwargs={"k": 5})
        return retr.get_relevant_documents(query)
```
What I Learned: Not all vector databases support metadata filtering. Always have a backup plan.
- Streamlit-Specific Nightmares
