Even with only 1-10% of the words changed, small shifts in grammar, combined sentences, or synonym substitutions can create thousands of variations of a document. Storing these versions is a solved problem, and brute-force matching of a pasted paragraph against the recorded variants can be optimized further.

At the crux of this project is using multivariate testing and large language models (LLMs) to create a unique variation of every page of a document before it is shared, making it possible to map a leaked paragraph back to the exact copy it came from. By tracking the specific patterns and changes embedded in each copy, this method can pinpoint which recipient or employee leaked confidential data, offering a new approach to data loss prevention.

Obtaining a Gemini API key is required but out of scope for this project. Cloning and setting up the project in a self-hosted, on-premise environment is doable, but a guide for doing so is also out of scope of this documentation. Although this pilot starts with textual content from documents and emails, it can eventually expand to all forms of sensitive enterprise information: images, video frames, or even LLM weights.
Neural networks and LLMs do not maintain word or character order and counts well; this is why you find typos in images generated by DALL-E 2 and similar models. This can be mitigated with better prompts and positional encoding.