In the first part, we discussed how to converse with a model so that the dialogue draws on information it was not trained on. In the second part, we looked at how to create our indexes and store them. In this final part, through a practical case, we will cover how to build a complete RAG system and start a dialogue, giving users the chance to converse with a model enriched with our own data.

Just a reminder: all the information is in the documentation available here, so you do not strictly need these articles. Nevertheless, it's always easier to understand when additional explanations are given. Some of the information may seem scattered, which is normal: it is very difficult to summarize such a vast field in a few lines. Moreover, the sector is booming, as are the available tools. We deliberately chose to detail certain aspects while only touching on others. It's up to you, if you need to implement such a system, to delve deeper into the subject according to your needs.

So let's start with a small objective:

Our site partitech.fr has a technical blog. We publish various content as long as it can be useful to someone. Usually, a colleague or a client asks us a question, and that leads us to think: 'Well, if this person has this question, why not write a little summary? It could surely be useful to others.' It's an opportunity for us to delve deeper into a subject and keep it as a personal reminder. In short, we have a blog...

We can access our blog's pages via the sitemap. This is convenient because you probably have access to the same kind of resource on your own site. So we will go through our sitemap and index its content so that we can run queries on it. Great!

To begin with, we need our raw material, so we will quickly develop a small script to fetch our content. We will call it 'SonataExtraBlog'.

sonata_extra_blog.py:

import requests
import xml.etree.ElementTree as ET
from txtai.pipeline import Textractor
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import tempfile
import os


class SonataExtraBlog:
    def __init__(self, sitemap_url):
        self.sitemap_url = sitemap_url
        self.textractor = Textractor(sentences=True)

    def is_valid_url(self, url):
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)

    def getData(self):
        # declare the data structure we will return
        data_list = []

        # Download the sitemap file
        response = requests.get(self.sitemap_url)
        sitemap_content = response.content

        # Parse the XML content
        root = ET.fromstring(sitemap_content)

        # Extract the URLs
        urls = [url.text for url in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')]

        # Go through each URL to download and process the content
        for url in urls:
            # Check whether the URL is valid
            if not self.is_valid_url(url):
                print(f"Invalid URL: {url}")
                continue

            try:
                print(f"Processing: {url}")
                # Download the HTML content
                response = requests.get(url)
                html_content = response.content.decode('utf-8')

                # Use BeautifulSoup to extract the content of the div with the 'site-content' class
                soup = BeautifulSoup(html_content, 'html.parser')
                content_div = soup.find('div', class_='site-content')
                if content_div:
                    # Create a temporary file for the HTML content
                    with tempfile.NamedTemporaryFile(delete=False, suffix='.html', prefix='txtextractor_',
                                                     mode='w') as temp_file:
                        temp_file.write(str(content_div))
                        temp_file_path = temp_file.name

                    # Extract the text with Textractor
                    text = self.textractor(temp_file_path)

                    # Remove the temporary file
                    os.remove(temp_file_path)
                    # print(text)
                    data_list.append({"id": url, "text": text})
                else:
                    print(f"No content found in the 'site-content' div for {url}")

            except requests.RequestException as e:
                print(f"Error while downloading {url}: {e}")
        return data_list

The comments are in the code, so no need for lengthy explanations. Basically, we go through the sitemap links, extract the content area from each page, retrieve the textual content, and then put it into a list that we return. It would be difficult to simplify further (although we actually do simplify it further in part 4, but that is not the topic of this article; technical considerations and code aesthetics will be for another time).

So, we have our class to retrieve our information. Now, we will create the file, index.py, that will retrieve this information and build the indexes.
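
Here is a minimal sketch of what index.py can look like, based on the configuration reused by the query script further down. The sitemap URL and the sentence-transformers model are assumptions; adjust them to your own setup.

index.py:

from txtai.embeddings import Embeddings

from sonata_extra_blog import SonataExtraBlog

# Fetch the blog content through our sitemap crawler
# (the sitemap URL is a placeholder, point it to your own sitemap)
sonata_extra_blog = SonataExtraBlog("https://www.partitech.fr/sitemap.xml")
data = sonata_extra_blog.getData()

# Same configuration as the query script below: textual content and objects
# stored in PostgreSQL, vectors stored in a FAISS index
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",
    content="postgresql+psycopg2://testuser:testpwd@localhost:5432/vectordb",
    objects=True,
    backend="faiss"
)

# Textractor(sentences=True) returns a list of sentences per page,
# so we join them back into one text block per URL before indexing
documents = [
    (row["id"], " ".join(row["text"]) if isinstance(row["text"], list) else row["text"], None)
    for row in data
]

# Build the index and save the FAISS files locally
embeddings.index(documents)
embeddings.save("./index_blog_partitech")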

Let's not forget to start our PostgreSQL server:

services:
  db:
    hostname: db
    image: ankane/pgvector
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=vectordb
      - POSTGRES_USER=testuser
      - POSTGRES_PASSWORD=testpwd
      - POSTGRES_HOST_AUTH_METHOD=trust
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
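
The compose file mounts an init.sql that is not reproduced here. With the ankane/pgvector image, such an init script typically just enables the pgvector extension in the target database; the content below is an assumption, adapt it if your init script does more.

init.sql:

CREATE EXTENSION IF NOT EXISTS vector;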

We launch everything:

docker compose up -d
pipenv shell
python3 index.py

Our FAISS format files are created.
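
To double-check the PostgreSQL side, we can list the tables txtai created, reusing the connection settings from the compose file (a quick sanity check, assuming psql is installed locally):

psql postgresql://testuser:testpwd@localhost:5432/vectordb -c '\dt'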

And we end up with three automatically created and well-populated tables. Now we can run our query, asking txtai to kindly add context retrieved based on our question.

from txtai.embeddings import Embeddings
from llama_cpp import Llama

# declare our embeddings system
embeddings = Embeddings(
    content="postgresql+psycopg2://testuser:testpwd@localhost:5432/vectordb",
    objects=True,
    backend="faiss"
)

# load the embeddings
embeddings.load("./index_blog_partitech")

llm = Llama(
    model_path="/Data/Projets/Llm/Models/openchat_3.5.Q2_K.gguf", n_ctx=90000
)


def execute(question, context):
    prompt = f"""GPT4 User: system You are a friendly assistant. You answer questions from users.
        user Answer the following question using only the context below. Only include information 
        specifically discussed or general AI and LLMs related subject. 
      question: {question}
      context: {context} <|end_of_turn|>
      GPT4 Assistant:
      """
    return llm(prompt,
               temperature=0,
               max_tokens=10000,
               top_p=0.2,
               top_k=10,
               repeat_penalty=1.2)


def rag(question):
    context = "\n".join(x["text"] for x in embeddings.search(question))
    return execute(question, context)


result = rag("What about sonata-extra ?")
print(result)

result = rag("Who wrote sonata-extra ?")
print(result)

And here is the answer to the first question:

{
  "id": "cmpl-63ebabf2-ec6b-4a0e-a0ae-4433c2df6ece",
  "object": "text_completion",
  "created": 1702485091,
  "model": "/Data/Projets/Llm/Models/openchat_3.5.Q2_K.gguf",
  "choices": [
    {
      "text": "\nThe Sonata-Extra Bundle is an extension to Symfony that enhances your experience with additional functionalities such as Activity Log, Approval Workflow, Assets Management, Blog integration, Content Security Policy management, Header Redirect Manager, Language Switcher, Multisite and multilingual support for SonataPageBundle, Sitemap generation, Smart services (AI-powered), WordPress import, Cookie Consent Block, Gutenberg Editor Integration, FAQ manager, Article manager with Gutenberg editor, additional form types, and more.\n\nThe bundle provides features like automatic translation through smart service functionality, integration of the Gutenberg editor for content creation, cookie consent management in compliance with GDPR regulations, and efficient loading of assets only when necessary. It also offers a flexible way to manage CSS and JavaScript assets in Sonata blocks, allowing developers to include external files or inline styles and scripts easily.\n\nTo use these features, you need to inject the required services into your block service using autowireDependencies method. Then, add assets to your block by calling methods like addCss, addJs, addJsInline, and addCssInline. To render the assets in Twig templates, use the provided functions such as sonata_extra_get_blocks_css('default'), sonata_extra_get_blocks_js('default'), etc., with custom indexes for grouping assets when developing custom blocks.",
      "index": 0,
      "logprobs": "None",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 2560,
    "completion_tokens": 299,
    "total_tokens": 2859
  }
}

The answer to the second question:

{
  "id": "cmpl-cf77a2a2-de5b-45ca-905e-a11094a805aa",
  "object": "text_completion",
  "created": 1702485287,
  "model": "/Data/Projets/Llm/Models/openchat_3.5.Q2_K.gguf",
  "choices": [
    {
      "text": "\\nThe authors of the Sonata-extra bundle are Geraud Bourdin and Thomas Bourdin. They work for partITech, a company that specializes in Symfony, Sonata, and other technologies to enhance digital experiences. The context provided does not mention any specific individual who wrote \"sonata-extra.\"",
      "index": 0,
      "logprobs": "None",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1360,
    "completion_tokens": 67,
    "total_tokens": 1427
  }
}

As we can see, the responses are well-aligned with the content we have indexed. We can easily imagine how such a system could enhance company documentation. Just as easily, we can index content such as images, Word documents, PDFs... As long as the data is well-organized, it can be seamlessly integrated into your IT system.
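
For example, the same Textractor pipeline used in SonataExtraBlog can extract text from a PDF or Word document, and the result can be indexed exactly like our blog pages. Here is a small sketch, assuming the optional pipeline dependencies are installed (Textractor relies on Apache Tika, which needs a Java runtime) and using a placeholder file path:

from txtai.pipeline import Textractor

# Textractor delegates PDF/Word extraction to Apache Tika
textractor = Textractor(sentences=True)

# "docs/manual.pdf" is a placeholder, use any local document
sentences = textractor("docs/manual.pdf")

# Each extracted sentence can then be indexed like our blog content
documents = [(f"manual-{i}", text, None) for i, text in enumerate(sentences)]
# embeddings.index(documents)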

Txtai is really great when you consider how few lines of code were needed to produce this. What remains now is to couple it with a chat system. Conveniently, the developer of txtai has also written a tool for that, txtchat. We will certainly explore how to organize all this in another article.

In our next part, we will see how to do the same thing directly with LangChain, using a PostgreSQL database to host all the embeddings. No more FAISS files. We'll do it with Python and LangChain, and then with JavaScript and LangChain. Indeed, this world is also opening up to web developers.