PromptArmour
Zero-Day Prompt Injection Firewall
The problem PromptArmour solves
PromptArmour is designed to detect and explain prompt injection attacks targeting Large Language Models (LLMs).
Prompt injection is a security vulnerability in which an attacker manipulates a model's behavior by embedding hidden or adversarial instructions in its input. This can lead the model to:
Ignore original instructions
Leak sensitive information
Execute unintended or harmful actions
This is especially dangerous in LLM-powered apps that process untrusted user input (e.g., chatbots, AI agents, retrieval-augmented systems).
Challenges we ran into
Fine-tuning a Robust Classifier
Adapting the DeBERTa model for prompt injection detection required careful preprocessing and dataset balancing. It was particularly challenging to make the model generalize across various forms of prompt injection attacks, including obfuscated, adversarial, and reverse instructions.
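The dataset-balancing step can be sketched as follows. `balance_binary_dataset` is a hypothetical helper, and downsampling the majority class is only one possible strategy (oversampling or class weights are common alternatives, and the project may have used either):

```python
import random

def balance_binary_dataset(examples, seed=0):
    """Downsample the majority class so injection and benign examples
    appear in equal numbers. `examples` is a list of dicts with "text"
    and "label" keys (an assumed schema, not the project's actual one).
    The balanced set would then feed a standard fine-tuning run of a
    DeBERTa sequence-classification model."""
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex["label"], []).append(ex)
    # Size of the smallest class determines how many we keep per class.
    n = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced
```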
Trade-off Between Speed and Accuracy
Lightweight models like BERT-tiny offered fast inference but missed subtle injection cues. On the other hand, larger models provided better accuracy but came with increased latency, making it hard to meet real-time performance goals.
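Comparing candidate models on this trade-off requires a consistent latency measurement. A minimal sketch, assuming each classifier is exposed as a plain callable (the benchmarking harness below is illustrative, not the project's actual code):

```python
import time
from statistics import median

def latency_ms(classify, prompt, runs=20, warmup=3):
    """Median wall-clock latency of one classifier call, in milliseconds.
    Warmup runs absorb one-time costs (caching, JIT, lazy model loading)
    so they do not skew the measurement."""
    for _ in range(warmup):
        classify(prompt)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        classify(prompt)
        samples.append((time.perf_counter() - t0) * 1000)
    return median(samples)
```

The same harness can be run against BERT-tiny and a larger model with a shared evaluation set, so speed and accuracy numbers come from identical inputs.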
Integrating Reasoning Models for Chain-of-Thought (CoT)
Implementing a local reasoning model using Mistral via Ollama posed challenges in prompt formatting, output consistency, and response streaming. Ensuring that CoT outputs were meaningful and aligned with the classification pipeline took iterative prompt engineering.
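The streaming interaction with Ollama can be sketched like this. It uses Ollama's default local REST endpoint (`/api/generate`), which returns one JSON object per line when `stream` is true; the prompt template and function names are assumptions for illustration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_cot_prompt(user_prompt, flagged):
    """Assumed prompt template asking Mistral for a step-by-step rationale."""
    verdict = "INJECTION" if flagged else "BENIGN"
    return (
        "You are a security analyst. A classifier labeled the prompt below "
        f"as {verdict}. Explain step by step why that label fits.\n\n"
        f"Prompt:\n{user_prompt}"
    )

def parse_stream(lines):
    """Ollama streams one JSON object per line; concatenate the partial
    'response' fields until a chunk reports done=true."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def explain(user_prompt, flagged, model="mistral"):
    payload = json.dumps({
        "model": model,
        "prompt": build_cot_prompt(user_prompt, flagged),
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_stream(resp)
```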
Model Packaging and Unified Deployment
Hosting both a sequence classification model and a generative reasoning model in a single Flask application required managing different dependencies, memory usage, and model loading constraints. It was also important to maintain compatibility across different devices (CPU/GPU).
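One generic pattern for keeping both heavy models from loading eagerly at import time is a thread-safe lazy singleton. This is a sketch of the pattern, not the project's actual loading code; the Flask routes would call `classifier.get()` / `reasoner.get()` on first request:

```python
import threading

class LazyModel:
    """Load an expensive model at most once, on first use, safely across
    the threads of a Flask worker. The loader callable decides the device
    (e.g. check for a GPU, fall back to CPU) at load time."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        if self._model is None:            # fast path, no lock
            with self._lock:
                if self._model is None:    # double-checked under the lock
                    self._model = self._loader()
        return self._model
```

Keeping the classifier and the generative model behind two independent `LazyModel` instances also means an Ollama failure never blocks the classification path.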
Handling Edge Cases
Some prompt injection attempts closely mimicked normal inputs, so tightening the model to catch them risked misclassifying legitimate prompts. Conversely, benign prompts were occasionally flagged as attacks because the training set contained a limited diversity of safe examples.
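One way to soften this trade-off is a three-way decision instead of a hard block/allow split. The thresholds below are illustrative placeholders, not the project's tuned values:

```python
def triage(p_injection, block_at=0.9, review_at=0.5):
    """Map the classifier's injection probability to an action.
    Borderline scores go to a 'review' band rather than being blocked
    outright, which reduces the cost of false positives."""
    if p_injection >= block_at:
        return "block"
    if p_injection >= review_at:
        return "review"
    return "allow"
```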
API Reliability and CORS
Setting up reliable REST endpoints that could handle malformed requests, CORS issues, and JSON parsing failures took careful error handling. Additionally, ensuring robustness when the reasoning backend (Ollama) was unavailable added another layer of complexity.
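The validation and fallback logic can be factored out of the route handlers so it is testable on its own. A minimal sketch with hypothetical helper names (`parse_request`, `with_reasoning_fallback`); the assumed request schema is a JSON body with a `"prompt"` string:

```python
import json

def parse_request(body):
    """Validate a raw request body. Returns (prompt, None) on success or
    (None, (message, http_status)) on failure, so the route handler can
    respond without raising."""
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return None, ("invalid JSON", 400)
    prompt = data.get("prompt") if isinstance(data, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return None, ("missing or empty 'prompt' field", 400)
    return prompt, None

def with_reasoning_fallback(explain_fn, prompt, flagged):
    """Degrade gracefully when the Ollama backend is down: return the
    classification result with a placeholder explanation instead of a 500."""
    try:
        return explain_fn(prompt, flagged)
    except Exception:
        return "Reasoning backend unavailable; returning classification only."
```

CORS itself was handled separately (e.g. via the Flask-CORS extension), so this layer only needs to worry about malformed payloads and backend outages.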
