In the first part, we discussed how to interact with a model to obtain a dialogue based on information it was not trained on. In short, you add the desired information to the context. But what if you want to use an entire knowledge base? That would be far too much information to fit into the context. Instead, we need to put all the information we wish to provide to users into a database. We will break our content down into paragraphs and build a vector index over them. This index converts text into numerical vectors that represent the semantic meaning of words and sentences. The great advantage is that it allows searching by meaning rather than by exact words.

Among the databases that allow this kind of search, we find:

  1. Elasticsearch: It is a distributed search and analytics engine, widely used for text search. Elasticsearch can be used to create vector indexes through its plugins and integrations, especially for natural language processing.

  2. Faiss (Facebook AI Similarity Search): Developed by Facebook AI, Faiss is a library for efficient vector indexing and similarity search. It is particularly suitable for handling large sets of vectors and is often used in recommendation systems and semantic search.

  3. Milvus: Milvus is an open-source vector database management system. It is designed to handle large-scale vector indexes and is compatible with various machine learning models, including those used for natural language processing.

  4. Pinecone: Pinecone is a vector database designed for machine learning applications. It offers large-scale vector management and search, which is useful for applications in semantic search and NLP.

  5. Weaviate: Weaviate is a vector knowledge base, allowing the storage of data in vector form. It supports semantic queries and is optimized for use cases involving natural language processing.

  6. Annoy (Approximate Nearest Neighbors Oh Yeah): Annoy is a C++ library with Python bindings designed to search for nearest neighbors in high-dimensional spaces. It is used to create vector indexes and is efficient for fast search queries.

  7. HNSW (Hierarchical Navigable Small World): HNSW is a popular algorithm for nearest neighbor search in high-dimensional spaces. Several database management systems incorporate HNSW for creating vector indexes.

  8. PostgreSQL (pgvector): PostgreSQL supports vector indexes, including HNSW, through the pgvector extension.

You can find a more complete list at this address: https://js.langchain.com/docs/integrations/vectorstores This already offers quite a lot of possibilities.

Regarding storage, how are these vectors created? They use what are referred to as 'embeddings'. Embeddings transform complex data, such as words, into vectors in a multidimensional space. For instance, every word in a text can be represented by a vector of 50, 100, 300 dimensions, or more. These vectors are designed so that words with similar meanings are close together in this vector space. For example, 'king' and 'queen' would be close, reflecting their semantic relationship. There are pre-trained embedding models such as Word2Vec, GloVe, or BERT, which have been trained on vast text corpora and can be used to obtain high-quality word embeddings.
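To make this concrete, here is a minimal sketch using the sentence-transformers library (the model name and example words are only illustrative; the exact similarity values depend on the model):

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained sentence embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

words = ["king", "queen", "bicycle"]
vectors = model.encode(words)

print(vectors.shape)  # (3, 384): each word becomes a 384-dimensional vector

# Related meanings end up closer together (higher cosine similarity)
print(util.cos_sim(vectors[0], vectors[1]))  # king vs queen  -> higher
print(util.cos_sim(vectors[0], vectors[2]))  # king vs bicycle -> lower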

You can find a list of different embedding systems available at this address: https://js.langchain.com/docs/integrations/text_embedding

It is important to understand that when you use a model to create your embeddings, you will need to use the same model to embed your queries later on. This consistency is essential so that the database engine can correctly compare the distance between the vectors of your query and those stored in your database.

In summary, you have a series of vectors in your database, and you're going to make a query with another series of vectors. The database will compare the distances; the smaller they are, the more similar the meanings will be.
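As a minimal sketch of this index-then-query flow with txtai (the sentences and the query are illustrative; the default model would work just as well):

from txtai.embeddings import Embeddings

# The same model embeds both the indexed texts and the queries
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "The cat sleeps on the sofa",
    "Stock markets fell sharply today",
    "A kitten is napping on the couch",
]
embeddings.index(documents)

# Without content storage, results are (index, score) pairs
# The two cat sentences should outrank the finance one, despite sharing no exact words
print(embeddings.search("feline resting on furniture", 2))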

To go further, I found the examples in this article quite illustrative: https://medium.com/@aravilliatchutaram/intent-classification-using-vector-database-txtai-sqlite-and-large-language-models-821f939b87ba

If you wish to use a different system than the one provided as standard by txtai, here is an excellent resource: https://neuml.hashnode.dev/customize-your-own-embeddings-database There, you will find examples of generating embeddings with NumPy, PyTorch, Faiss, HNSW, and even the use of an external API like Hugging Face.

Note that txtai, by default, creates vectors with the model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, using Faiss as the ANN (approximate nearest neighbor) backend.

To summarize:

  • Create an embedding
from txtai.embeddings import Embeddings

embeddings = Embeddings()
  • Create an embedding with a database record: SQLite / all-MiniLM-L6-v2 / Faiss
from txtai.embeddings import Embeddings

embeddings = Embeddings(content=True)
  • Create an embedding with PostgreSQL / gte-large / Faiss
## shell: install the PostgreSQL driver
pipenv shell
pipenv install psycopg2-binary
from datasets import load_dataset
import txtai

# Load dataset
ds = load_dataset("ag_news", split="train")

embeddings = txtai.Embeddings(
    content="postgresql+psycopg2://testuser:testpwd@localhost:5432/vectordb",
    objects=True,
    backend="faiss",
    path="thenlper/gte-large"
)

# index the dataset
embeddings.index(ds["text"])
# save the index
embeddings.save("./index")
# load the saved index
embeddings.load("./index")
## or, in the cloud, compressed
embeddings.save("/path/to/save/index.tar.gz", cloud={...})

In this example, we are recording the contents in a PostgreSQL database and creating our index with Faiss. This requires you to save your indexes in order to reload them later. You also have the option of saving them in the cloud or in Elasticsearch. Note that if you modify the contents in the database, you will need to recreate your index.

import txtai

embeddings = txtai.Embeddings(
    content="postgresql+psycopg2://testuser:testpwd@localhost:5432/vectordb",
    objects=True,
    backend="faiss",
    path="thenlper/gte-large"
)


# rebuild the index with a different model and ANN backend
embeddings.reindex(path="sentence-transformers/all-MiniLM-L6-v2", backend="hnsw")
embeddings.save("./index")
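To use the result, here is a minimal sketch of reloading the saved index and querying it, assuming the PostgreSQL instance from the example above is still running and the index was saved to ./index (the query string is just an illustration):

import txtai

# Reload the saved index; the content database connection is restored from the saved config
embeddings = txtai.Embeddings()
embeddings.load("./index")

# With content enabled, each result carries the matching text and a similarity score
for result in embeddings.search("business and financial news", 3):
    print(result["score"], result["text"])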

Chunking Strategy (Segmenting Your Documents):

There are several strategies when it comes to creating your data segments (chunks).

  1. Fixed-Size Chunking

This is the most common and simplest method. It involves determining a fixed number of tokens for each chunk, possibly with some overlap between chunks to preserve semantic context (a short sketch follows this list).

  2. Content-Aware Chunking

This approach uses the nature of the content for more sophisticated segmentation. It includes:

  • Sentence Splitting: Using tools like NLTK or spaCy to divide text into sentences, offering better preservation of the context.

  • Recursive Chunking: Divides the text into smaller chunks in a hierarchical and iterative manner, using different separators or criteria until achieving the desired chunk size or structure.

These techniques vary depending on the content and the intended application, and the choice of the appropriate method depends on several factors, including the nature of the content, the encoding model used, and the application's objective.
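As announced above, here is a minimal sketch of fixed-size chunking with overlap. It splits on words for simplicity, whereas real pipelines usually count model tokens; document_text, the sizes, and the helper name are placeholders:

def fixed_size_chunks(text, chunk_size=100, overlap=20):
    # Split text into chunks of roughly chunk_size words, overlapping by overlap words
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk then becomes its own entry in the vector index, e.g.:
# embeddings.index(fixed_size_chunks(document_text))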

Simple PostgreSQL Database Setup:

No mystery here, we keep things simple. We will use the Docker image provided by the developer of the pgvector extension.

## docker-compose.yaml
services:
  db:
    hostname: db
    image: ankane/pgvector
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=vectordb
      - POSTGRES_USER=testuser
      - POSTGRES_PASSWORD=testpwd
      - POSTGRES_HOST_AUTH_METHOD=trust
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
-- init.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS embeddings (
  id SERIAL PRIMARY KEY,
  embedding vector,
  text text,
  created_at timestamptz DEFAULT now()
);
Then start the container:

docker compose up -d
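To quickly check that the container and the extension work, here is a minimal sketch using psycopg2 against the embeddings table created by init.sql (the connection values come from the compose file above; the inserted vector and the query vector are arbitrary):

import psycopg2

# Connect using the credentials defined in docker-compose.yaml
conn = psycopg2.connect("postgresql://testuser:testpwd@localhost:5432/vectordb")
cur = conn.cursor()

# Insert a small vector with its text
cur.execute(
    "INSERT INTO embeddings (embedding, text) VALUES (%s::vector, %s)",
    ("[0.1, 0.2, 0.3]", "hello pgvector"),
)
conn.commit()

# <-> is pgvector's L2 distance operator: the closest rows come back first
cur.execute(
    "SELECT text FROM embeddings ORDER BY embedding <-> %s::vector LIMIT 1",
    ("[0.1, 0.2, 0.25]",),
)
print(cur.fetchone())

cur.close()
conn.close()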

So that's it for this second part, which is somewhat theoretical. I've only skimmed these subjects, and you can imagine that each of the components involved has plenty of fine-tuning parameters. The beauty of txtai is that everything is simple and pre-configured: for a basic setup, you don't really need the examples above. In the reality of a project, however, we rarely use things exactly as they come, which is why I wanted to show, for each item, the possibilities offered by this framework. Everything is configurable; nothing is left to chance. What's great about this framework, especially if you're new to the field, is that the documentation lets you discover the building blocks such a system needs. Things can become very complex as soon as you want to move beyond the established framework.

In the third part of our series on the RAG system, we're going to get our hands a bit dirtier. Having already covered the more tedious aspect of theory, we will be able to focus on the code.