A vector database is a specialized type of database designed to store, index, and query high-dimensional vector representations of data, often produced by large language models or other embedding models. These vectors capture semantic information: vectors that are close to each other in the embedding space represent semantically similar data. By storing and managing these embeddings, vector databases power applications such as recommendation systems, semantic search, and anomaly detection.
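As a minimal sketch (not code from this project), "small distance means semantically close" under the Euclidean metric looks like the following; the vectors and the `euclidean_distance` helper are illustrative:

```rust
/// Euclidean (L2) distance between two equal-length vectors.
/// Smaller distance => more semantically similar embeddings.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    let query = [1.0, 0.0, 0.0];
    let doc_a = [0.9, 0.1, 0.0]; // near the query: semantically similar
    let doc_b = [0.0, 1.0, 1.0]; // far from the query: dissimilar
    assert!(euclidean_distance(&query, &doc_a) < euclidean_distance(&query, &doc_b));
}
```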
In some cases there is an issue of trust: a recommendation system could recommend you something from the vector database, but how do we make sure the returned vector is actually the most similar one, and not some sponsored, monetized recommendation slipped in behind the scenes? The answer is verifiable computation!
We implement flat indexing with the Euclidean distance metric inside the SP1 zkVM: the entire index is split into batches, and we run similarity queries recursively until the final iteration yields the most similar vector. This recursion is necessary because processing the whole index at once runs out of memory, which is expected for an application operating on large vectors of floating-point values. Each iteration of the recursion produces a separate proof with its own public inputs, and each of these proofs is submitted to Aligned Layer for verification.
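The batched search can be sketched as follows. This is a plain-Rust illustration of the computation only, not the project's actual guest program: function names like `nearest_in_batch` and `nearest_overall` are hypothetical, and the real version runs each batch inside the zkVM and commits its result as public inputs.

```rust
/// Squared Euclidean distance (the square root is monotonic, so it can be
/// skipped when only comparing distances).
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index and squared distance of the vector in `batch` closest to `query`.
fn nearest_in_batch(query: &[f32], batch: &[Vec<f32>]) -> (usize, f32) {
    batch
        .iter()
        .enumerate()
        .map(|(i, v)| (i, squared_distance(query, v)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

/// Flat-index search over batches: find each batch's winner, then keep
/// reducing the winners batch by batch until a single nearest vector remains.
/// Returns the index of that vector in `index`.
fn nearest_overall(query: &[f32], index: &[Vec<f32>], batch_size: usize) -> usize {
    // First pass: the best candidate from each batch (as a global index).
    let mut candidates: Vec<usize> = index
        .chunks(batch_size)
        .enumerate()
        .map(|(b, batch)| b * batch_size + nearest_in_batch(query, batch).0)
        .collect();
    // Recursive reduction: each round is one more batched query (and, in the
    // real system, one more proof) over the previous round's winners.
    while candidates.len() > 1 {
        candidates = candidates
            .chunks(batch_size)
            .map(|chunk| {
                let vecs: Vec<Vec<f32>> = chunk.iter().map(|&i| index[i].clone()).collect();
                chunk[nearest_in_batch(query, &vecs).0]
            })
            .collect();
    }
    candidates[0]
}
```

Keeping each batch small bounds the zkVM's memory footprint per proof, at the cost of one proof per batch per round.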
We first tried Approximate Nearest-Neighbor methods, but they ran into out-of-memory issues even faster and demanded far more compute. We opted for a flat index instead for its simplicity and lower cost, given that we prove the index batch by batch.
The runtime of the zkVM itself was a problem: generating a single proof took 5 minutes even with very few vectors of relatively small dimension, and with many proofs the whole process could take almost an hour!
Using zkRust was a bit challenging due to unexpected errors. We worked around this by using the SP1 project template instead and integrating the Aligned SDK into it with some helper scripts.
We first tried proof aggregation, submitting one final proof to Aligned Layer instead of sending each proof separately, but that also turned out to be expensive: (1) the proof aggregation circuit takes a long time, and (2) with more than a few proofs the aggregator itself runs out of memory, failing the entire proof at the final step. Our solution was simply to send each proof to Aligned Layer separately.
Technologies used
Discussion