EmbeddingTown

Open-source, free, and hosted collection of vector embeddings of public and third-party datasets.

The problem EmbeddingTown solves

What problem does it solve?

  • Creating embeddings is a time-consuming task. Computing all the vectors for a 12 MB file takes around 15-20 minutes.
  • Everyone is creating embeddings for the same popular data sources, but there is no shared layer for accessing them in one place.
  • Creating embeddings is costly. Using OpenAI's text-embedding-ada-002 model to compute embeddings for the Steve Jobs book costs around $0.14 when the document is indexed in just one way. Indexing it in multiple ways costs more.
  • Having pre-computed embeddings makes it easy for developers to play, tinker, and build LLM applications with memory. Otherwise the process is time-consuming and difficult for first-time AI developers.
  • LangChain and GPT-index make it easy to create LLM apps. Vector databases like Pinecone and Chroma make it easy to add memory to LLM apps. But a simple, collaborative layer for getting and using embeddings is missing. EmbeddingTown is an effort to solve this.
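
To make the cost point above concrete, here is a rough back-of-the-envelope sketch. The per-token price and the token count are assumptions, not figures from this project: the price is what text-embedding-ada-002 was believed to cost at the time of writing, and the token count is simply the value that would produce the quoted $0.14 at that price. The function name is hypothetical.

```python
def embedding_cost_usd(num_tokens, usd_per_1k_tokens, num_indexes=1):
    """Estimate the cost of embedding a document.

    num_indexes models indexing the same text in several ways
    (for example, with different chunk sizes), each of which
    re-embeds the full text.
    """
    return num_tokens / 1000 * usd_per_1k_tokens * num_indexes


# Assumption: price per 1,000 tokens for text-embedding-ada-002.
ASSUMED_USD_PER_1K_TOKENS = 0.0004
# Assumption: token count back-derived from the quoted $0.14 cost.
ASSUMED_BOOK_TOKENS = 350_000

one_way = embedding_cost_usd(ASSUMED_BOOK_TOKENS, ASSUMED_USD_PER_1K_TOKENS)
three_ways = embedding_cost_usd(
    ASSUMED_BOOK_TOKENS, ASSUMED_USD_PER_1K_TOKENS, num_indexes=3
)

print(f"indexed one way:    ${one_way:.2f}")
print(f"indexed three ways: ${three_ways:.2f}")
```

The cost scales linearly with the number of indexing strategies, which is why sharing one set of pre-computed embeddings across many developers saves money.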

How it can be used

  • EmbeddingTown can be used by language model developers, data scientists, and researchers to streamline their workflow. It makes obtaining pre-computed vector embeddings easier and faster, eliminating the need to compute them from scratch, which is both time-consuming and computationally expensive.

  • Users can leverage these embeddings for a variety of tasks such as semantic search, text classification, sentiment analysis, and more. By providing embeddings from diverse sources, EmbeddingTown ensures that users have access to a rich, varied set of data, enhancing the performance and generalizability of their models.

  • Moreover, EmbeddingTown promotes safe and ethical use of data by only including open-source embeddings, ensuring transparency and adherence to data privacy standards. It also fosters a collaborative environment where users can request specific embeddings, promoting knowledge sharing within the community.
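
As an illustration of the semantic search use case mentioned above, here is a minimal sketch of ranking pre-computed embeddings by cosine similarity. It does not use EmbeddingTown's actual API or data format: the toy three-dimensional vectors, the chunk IDs, and the function names are all hypothetical stand-ins for real embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in corpus.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hypothetical pre-computed embeddings, keyed by chunk ID.
toy_corpus = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.9, 0.1, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_corpus, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

In practice the query vector must be produced by the same embedding model that produced the stored vectors, otherwise the similarity scores are not meaningful.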

Challenges I ran into

  • My local Elasticsearch instance was somehow bound to my home IP address and wasn't working at first. I dug into the logs and config file and found that my IP was hardcoded in the host settings. Changing it fixed the problem.
  • Running the Instructor XL model and Elasticsearch on the same machine was a challenge. At times Elasticsearch stopped during indexing. Right now the Instructor XL model lives in the Django project, where it was using a lot of memory and blocking the thread. I had two options: move the Instructor model to a separate repo or upgrade the server. With time constraints in mind, I upgraded the server for the time being.
