Resume RAG Chatbot (LangChain + Streamlit)

Created on 7th September 2025

The problem Resume RAG Chatbot solves

  1. HR Efficiency Crisis

Manual screening takes 6-8 hours/day → this tool reduces it to minutes
Finding specific skills is like finding a needle in a haystack → Natural language search makes it instant
Candidate information gets forgotten → Quick contextual retrieval

  2. Business Impact Issues

High recruitment costs ($4,200+ per hire) → Reduces to ~$1,800
Slow hiring cycles (2+ weeks) → Down to 3 days
Poor quality of hire → Better data-driven decisions

  3. Technical Challenges

Unstructured PDF data is unsearchable → AI embeddings make it semantic
Scaling manual processes is impossible → Handles hundreds of resumes effortlessly
Context-blind keyword search → RAG provides intelligent, context-aware answers

  4. Strategic Advantages

Reactive hiring → Proactive talent pipeline management
Gut-feeling decisions → Data-driven candidate evaluation
Inconsistent evaluation → Standardized AI-powered screening
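The contrast above between keyword matching and context-aware retrieval can be illustrated with a toy, dependency-free ranking sketch. This is hypothetical code, not the project's: real semantic search uses embedding vectors, while this sketch only scores bag-of-words overlap — but it shows why ranking by similarity beats boolean keyword presence.

```python
# Toy sketch (hypothetical, stdlib only): rank resumes by similarity to a
# query instead of requiring exact keyword hits.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

resumes = {
    "alice": "built REST APIs in Python and Django, deployed on AWS",
    "bob":   "frontend developer, React and TypeScript, UI design",
}
query = "python backend aws"

scores = {name: cosine(Counter(query.split()), Counter(text.lower().split()))
          for name, text in resumes.items()}
best = max(scores, key=scores.get)
print(best)  # alice ranks highest for a backend query
```

An embedding model would additionally match "backend" to "REST APIs" even without shared words; that is the gap RAG closes.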

Challenges we ran into

  1. PDF Text Extraction Nightmare

The Problem:

```python
# This looked simple but was a disaster
pdf_reader = PdfReader(file)
text = pdf_reader.pages[0].extract_text()
```

Result: Garbled text, missing spaces, weird formatting

"JohnDoeExperience:SoftwareDeveloper2020-2023Skills:Python,React"
The Solution:

```python
# Switched to PyPDFLoader + proper text processing
import io
import os
import time
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

def load_pdfs_to_docs(files: List[io.BytesIO]) -> List[Document]:
    docs: List[Document] = []
    for f in files:
        # Key insight: write the upload to a temp file first!
        tmp_path = os.path.join("/tmp", f"upload_{time.time_ns()}.pdf")
        with open(tmp_path, "wb") as tmp:
            tmp.write(f.read())

        loader = PyPDFLoader(tmp_path)  # Much better extraction
        file_docs = loader.load()
        docs.extend(file_docs)
        os.remove(tmp_path)             # Clean up the temp file
    return docs
```

What I Learned: Different PDF libraries handle formatting differently. PyPDFLoader preserves structure better than PyPDF2/pypdf.
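The write-to-temp-file step can be made more robust with the stdlib `tempfile` module, which picks a unique path in the platform's temp directory instead of hard-coding `/tmp`. A minimal sketch (not the project's exact code; `with_temp_pdf` is a hypothetical helper):

```python
# Sketch: safer temp-file handling for uploaded PDF bytes (stdlib only).
import os
import tempfile

def with_temp_pdf(data: bytes) -> str:
    # NamedTemporaryFile chooses a unique, platform-appropriate path;
    # delete=False lets a loader open the file by path afterwards.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        return tmp.name

path = with_temp_pdf(b"%PDF-1.4 fake content")
try:
    assert os.path.exists(path)   # the loader would read from `path` here
finally:
    os.remove(path)               # always clean up, even if loading fails
```

The try/finally guarantees cleanup even when extraction raises, which matters once hundreds of uploads flow through the app.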

  2. Memory Explosion with Large Resume Batches

The Problem:

```python
# This killed my local development server
def process_all_at_once(files):
    all_docs = []
    for file in files:           # 50+ PDF files
        docs = load_pdf(file)
        all_docs.extend(docs)    # Memory keeps growing

    # BOOM! MemoryError when processing 100+ resumes
    vectorstore = FAISS.from_documents(all_docs, embeddings)
```
The Solution:

```python
# Batch processing + session state management
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_docs(docs: List[Document], chunk_size=800, chunk_overlap=150):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", ".", "!", "?", ",", " "]
    )
    return splitter.split_documents(docs)

# Process in smaller batches and extend the existing index
if st.session_state.vectorstore is None:
    st.session_state.vectorstore = FAISS.from_documents(chunks, embed)
else:
    # Add to the existing index instead of recreating it
    st.session_state.vectorstore.add_documents(chunks)
```
What I Learned: Always think about scalability from day 1. Batch processing and incremental updates are crucial.
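The batching idea in isolation: a small helper (hypothetical, stdlib only) that walks a list of uploads in fixed-size groups, so peak memory is bounded by the batch size rather than the total corpus:

```python
# Sketch: process items in fixed-size batches instead of all at once.
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: List[T], size: int) -> Iterator[List[T]]:
    # Yield consecutive slices of at most `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

files = [f"resume_{i}.pdf" for i in range(10)]
batches = list(batched(files, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch would be loaded, chunked, and added to the index before the next one starts, so document text from earlier batches can be garbage-collected.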

  3. Candidate Identity Extraction Hell

The Problem:

```python
# Naive regex approach failed miserably
def extract_name(text):
    # This caught everything: "Microsoft Excel", "New York", "Dear Sir"
    name_pattern = r"[A-Z][a-z]+ [A-Z][a-z]+"
    matches = re.findall(name_pattern, text)
    return matches[0]  # Usually wrong!
```
The Solution:

```python
# Multi-step heuristic approach
import re
from typing import Dict

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone matcher (reconstructed; the original pattern was lost)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
NAME_HINT_RE = re.compile(
    r"^(?:[Nn]ame\s*[:-]\s*)?([A-Z][a-zA-Z-']+\s+[A-Z][a-zA-Z-']+(?:\s+[A-Z][a-zA-Z-']+)*)$"
)

def extract_contact(text: str) -> Dict[str, str]:
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)

    # Smart heuristic: check the first 20 lines only
    name = None
    for line in text.splitlines()[:20]:
        line = line.strip()
        if not line or len(line) > 50:  # Skip empty and long lines
            continue
        m = NAME_HINT_RE.match(line)
        if m and not any(word in line.lower()
                         for word in ['company', 'university', 'street']):
            name = m.group(1).strip()
            break

    # Fallback to the email prefix if no name line matched
    if not name and emails:
        name = emails[0].split('@')[0].replace('.', ' ').title()

    return {"name": name or "", "email": emails[0] if emails else "",
            "phone": phones[0] if phones else ""}
```

What I Learned: Resume formats are wildly inconsistent. You need multiple fallback strategies and some domain knowledge.
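The email-prefix fallback is worth exercising on its own, since it is the last line of defense when no name line matches. A minimal, self-contained sketch (the helper name is hypothetical):

```python
# Sketch: derive a display name from an email's local part when no
# name line is found in the resume header.
import re
from typing import Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def name_from_email(text: str) -> Optional[str]:
    emails = EMAIL_RE.findall(text)
    if not emails:
        return None
    # "jane.q.doe@corp.com" -> "Jane Q Doe"
    local = emails[0].split("@")[0]
    return local.replace(".", " ").replace("_", " ").title()

print(name_from_email("Contact: jane.q.doe@corp.com"))  # Jane Q Doe
```

It fails gracefully on handles like "dev42@…", which is why it is only a fallback behind the header heuristic.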

  4. FAISS Metadata Filtering Disaster

The Problem:

```python
# Expected this to work like a SQL WHERE clause
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"candidate_key": "[email protected]"}  # NOPE!
    }
)
```

Error: FAISS doesn't support metadata filtering in LangChain

The Solution:

```python
# Workaround: retrieve more, then filter manually
def get_candidate_docs(vectorstore, query, candidate_key):
    if candidate_key:
        # Get more results than needed
        retr = vectorstore.as_retriever(search_kwargs={"k": 12})
        docs = retr.get_relevant_documents(query)

        # Manual filtering on metadata
        filtered_docs = [d for d in docs
                         if d.metadata.get("candidate_key") == candidate_key]

        # Fallback if the vector search misses the candidate entirely
        if not filtered_docs:
            all_chunks = st.session_state.chunks
            filtered_docs = [d for d in all_chunks
                             if d.metadata.get("candidate_key") == candidate_key][:8]
        return filtered_docs
    else:
        # Normal search across all candidates
        retr = vectorstore.as_retriever(search_kwargs={"k": 5})
        return retr.get_relevant_documents(query)
```

What I Learned: Not all vector stores support metadata filtering. Always have a backup plan.

  5. Streamlit-Specific Nightmares


Technologies used
