RAG System and TxtAi Part 1: Transforming Language Generation with Intelligence

For this first part of our series of articles, we're going to base our exploration on this tutorial to implement a RAG process. But what is a RAG? RAG, or "Retrieval-Augmented Generation," is an advanced technique in artificial intelligence, specifically in the field of natural language processing (NLP), which involves enriching the text generation process by incorporating an information retrieval phase. This hybrid method combines the power of deep learning-based language models with the efficiency of information retrieval systems to produce more accurate and contextually relevant responses.

Through this article, we will explore how RAGs transform the traditional approach to language generation by integrating dynamic elements of information search. We will examine the key components of this technology, including how the model identifies and extracts relevant information from a vast database or text corpus before starting the response generation process. This allows the model to rely not only on its internal learning but also on specific external data, which is particularly beneficial for questions requiring updated or specialized knowledge.

Furthermore, we will discuss the challenges and opportunities associated with implementing RAGs, highlighting their potential in various fields of application, from virtual assistance to personalized recommendation systems. Finally, we will illustrate how to set up a RAG system using cutting-edge technologies and tools, by providing a step-by-step guide based on the tutorial from the official documentation of the Txtai tool. Whether you are an experienced developer in PHP and data science or just curious to discover the latest advances in AI, this article aims to provide an in-depth understanding of RAG technology and its impact on the future of natural language processing.

Key Components of a RAG System

RAGs, by their hybrid nature, integrate two essential components: the information retrieval system (Information Retrieval, IR) and the language generation model (Language Generation Model).

Information Retrieval System: The first stage of a RAG is to search for and retrieve relevant information in response to a query. This phase uses advanced search techniques to quickly scan vast data sets, selecting text excerpts that are most likely to be relevant to the question posed.

Language Generation Model: Once the relevant information is retrieved, it is fed into a language generation model. This model, often based on neural network architectures like Transformers, uses this information to construct a coherent and contextually appropriate response.
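To make this two-stage flow concrete, here is a minimal sketch in Python. It is purely illustrative: retrieve() stands in for a real search engine and generate() for a call to an LLM; neither is a txtai or llama.cpp API.

# Illustrative RAG flow: retrieve relevant passages, then generate from them.
def retrieve(query, corpus, k=2):
    # Naive keyword overlap stands in for a real vector or full-text search engine.
    words = query.lower().split()
    scored = sorted(corpus, key=lambda doc: -sum(w in doc.lower() for w in words))
    return scored[:k]

def generate(question, passages):
    # In a real system this would be a prompt sent to an LLM with the passages as context.
    context = " ".join(passages)
    return f"Answer to '{question}' using context: {context}"

corpus = [
    "txtai builds embeddings databases for semantic search.",
    "RAG combines an information retrieval step with text generation.",
    "PHP is a scripting language used for web development.",
]
print(generate("What is RAG?", retrieve("What is RAG?", corpus)))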

Challenges and Opportunities

RAGs pose several technical and practical challenges. For instance, the accuracy of information retrieval is critical; inaccurate retrieval can lead to erroneous responses. Additionally, balancing processing speed and accuracy is an important consideration, especially for real-time applications.

However, the opportunities presented by RAGs are considerable. In customer support, for example, RAGs can provide more precise and personalized answers than traditional language generation systems. In scientific research, they can aid in formulating responses based on the latest publications and discoveries.

Practical Implementation of a RAG

Several steps are necessary to implement a RAG:

Technology Selection: Transformer models like BERT or GPT can be used for the generation model, while systems like Elasticsearch can be employed for information retrieval. Here, we will use MistralAI and the internal system of txtai. We will then see how to integrate the indexes directly into a PostgreSQL database.

Data Preparation: Prepare a large and diverse data corpus for the retrieval phase. This corpus should be representative of the types of queries expected. To do this, we will retrieve the content pages of this same blog.

Training and Fine-tuning: Train your language generation model using both the data corpus and retrieved information. Fine-tuning may be necessary to adapt the model to specific use cases.

Integration and Testing: Integrate the RAG system into your application or service and perform thorough testing to evaluate its performance and accuracy.
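To give a first taste of what this looks like with txtai (the full pipeline, including PostgreSQL storage, comes in the next part), here is a minimal indexing and search sketch; the embedding model name and the three documents are placeholders for our own content.

from txtai.embeddings import Embeddings

# Build a small embeddings index and query it.
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
embeddings.index([
    (0, "RAG combines a retrieval step with text generation.", None),
    (1, "txtai is an all-in-one embeddings database.", None),
    (2, "Indexes can be stored in a PostgreSQL database.", None),
])

# Returns the passages most similar to the question, with their scores.
print(embeddings.search("What is retrieval-augmented generation?", 2))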

Conclusion

"Retrieval-Augmented Generation" is an exciting advancement in AI, opening new avenues for more sophisticated and accurate natural language processing applications. Although it presents challenges, its potential in various domains makes it invaluable for businesses and developers looking to leverage the latest AI innovations. By following the steps and principles outlined in this article, PHP developers and data science experts can begin to explore the fascinating world of RAGs to enhance the natural language processing capabilities of their applications.

Author's note: Let's not kid ourselves. The beginning of this article was written with the help of artificial intelligence. It must be admitted that it did a good job. It seems clear that an artificial intelligence was trained with its own concepts. This made the writing exercise very playful. My role was simply to "show" the way interactively to obtain the clearest information possible. Now that we've had a little theoretical interlude, we can move on more easily to the practical part.

First of all, the technology chosen for this first article is TxtAi. Why? Because it is built exactly for this. And once you understand how it works, you realize that it represents a significant time saving. Given that there will be data to prepare, an infrastructure to set up, and other technical aspects to handle, any time saved in development is time gained for the final client. You will see that not everything is done in three clicks: you need data, and that is ultimately where most of the time goes.

Let's move on to the code. We assume that you have already completed the Llama.cpp installation step, that you have a Hugging Face account with a token, and that your Python environment is already functional. We start by installing our dependencies:

pipenv shell
pipenv install "txtai[pipeline]" autoawq nltk
pipenv install git+https://github.com/abetlen/llama-cpp-python.git
pip install hf-transfer

We will need a model, and we have chosen TheBloke/Mistral-7B-OpenOrca-GGUF. You will see that the Mistral AI LLM is very impressive when it comes to writing content. Depending on your hardware, use a lighter version of the model: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/tree/main

huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q8_0.gguf --local-dir /Data/Projets/Llm/Models/MistralAi-OpenOrqa/mistral-7b-openorca.Q8_0.gguf --cache-dir /Data/Projets/Llm/Models/huggingface_cache/

export HF_HUB_ENABLE_HF_TRANSFER=1 &&  huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q2_K.gguf --local-dir /Data/Projets/Llm/Models/MistralAi-OpenOrqa/mistral-7b-openorca.Q2_K.gguf --cache-dir /Data/Projets/Llm/Models/huggingface_cache/

The advantage of this repository is that the gguf-format file is already prepared for us, so we can use it directly with llama.cpp and TxtAi. Therefore, no GPU is needed. Once again, the idea is to demonstrate how it works; for real workloads, you would plan for hardware intended for that purpose.

We will need some source material to perform our data extractions. Thanks to neuml for providing sample files. Download the documents from this address, or use one of your own by adapting the following code.

from llama_cpp import Llama
from txtai.pipeline import Textractor
import nltk
# Create our connection to Llama.cpp with these default parameters.
llm = Llama(
    model_path="/Data/Projets/Llm/Models/MistralAi-OpenOrqa/mistral-7b-openorca.Q8_0.gguf/mistral-7b-openorca.Q8_0.gguf", n_ctx=2048)
# Retrieve the text from our docx file.
textractor = Textractor()
text = textractor("datas/tuto1/document.docx")


# Execute our call to the model.
def execute(question, context):
    # We create a starting prompt.
    prompt = f"""<|im_start|>system You are a friendly assistant. You answer questions from users.<|im_end|> 
    <|im_start|>user Answer the following question using only the context below. Only include information 
    specifically discussed or general AI and LLMs related subject. 
  question: {question}
  context: {context} <|im_end|>
  <|im_start|>assistant
  """

    tokens = nltk.word_tokenize(text)
    print(len(tokens))
    return llm(prompt,
               temperature=0.1,
               max_tokens=10000,
               top_p=0.2,
               top_k=10,
               repeat_penalty=1.2)

# Run our method with a question and a context taken from our docx file.
result = execute("Write a long article about TxtAi explaining how it works. You can show code example. Explain how it could solve problem in small and larger company. Make the accurate answer you can do. Make an intro and a conclusion. This could be a general speech in a conference or a learning lesson at Partitech University. Make your answer in french", text)
print(result['choices'][0]['text'])
print(result['usage'])

Here is the result.

 Titre : TxtAi, la base de données d'embeddings pour les recherches sémantiques et l'orchestration des LLMs (Langage de Modèle Linguistique)

Dans cet article, nous allons expliquer comment TxtAi fonctionne en tant que base de données d'embeddings complète pour la recherche sémantique, l'orchestration des Langages de Modèles Linguistiques (LLMs) et les workflows linguistiques. Nous allons également présenter un exemple de code et montrer comment TxtAi peut résoudre des problèmes dans des entreprises de taille différente, que ce soit une petite ou grande entreprise.

TxtAi est une base de données d'embeddings conçue pour faciliter la recherche sémantique et l'orchestration des LLMs. Elle offre plusieurs fonctionnalités importantes, telles que :
1. Recherche vectorielle avec SQL, stockage objet, modélisation du sujet ;
2. Créer des embeddings pour le texte, les documents, la audio et l'image;
3. Pipelines alimentés par des LLMs qui exécutent des prompts de LLM;
4. Workflows permettant de joindre plusieurs pipelines ensemble et d'aggreguer des logiques métier ;
5. Construire avec Python ou YAML, avec des bindings API disponibles pour JavaScript, Java, Rust et Go.

Pour illustrer comment TxtAi fonctionne, voici un exemple de code en Python :


***python
from txtai import TextSearch, DocumentEmbeddings, ImageEmbeddings, Pipeline, Workflow

# Création d'un pipeline avec une LLM
llm_pipeline = Pipeline(model="gpt-3")

# Créer des embeddings pour le texte et l'image
text_embeddings = DocumentEmbeddings("documents.txt", model="all-MiniLM-L6-v2")
image_embeddings = ImageEmbeddings("images/*.jpg", model="BLIP")

# Créer une recherche vectorielle avec SQL et stockage objet
text_search = TextSearch(sql_query="SELECT * FROM documents WHERE similarity > 0.8", storage="object-store")

# Construire un workflow pour joindre les pipelines ensemble
workflow = Workflow([llm_pipeline, text_search, text_embeddings, image_embeddings])

# Exécutez le workflow et affichez les résultats
results = workflow.run()
print(results)
*** 
Ce code montre comment créer un pipeline avec une LLM, générer des embeddings pour du texte et des images, puis construire un workflow qui utilise ces pipelines pour effectuer une recherche vectorielle sur la base de données.

En ce qui concerne l'impact dans les entreprises, TxtAi peut aider à résoudre divers problèmes en améliorant la recherche sémantique et en facilitant le travail des LLMs. Pour une petite entreprise, cela signifie que les employés peuvent accélérer leurs processus de recherche grâce à l'utilisation d'embeddings pour indexer rapidement leur contenu. Dans un contexte plus grand, TxtAi peut aider à améliorer la performance des systèmes de recommandation et à simplifier le développement de workflows linguistiques complexes.

En conclusion, TxtAi est une base d'embeddings complète qui facilite la recherche sémantique et l'orchestration des LLMs pour les entreprises de toutes tailles. Grâce à ses fonctionnalités avancées en matière de vectorisation, de modélisation du sujet et de workflows linguistiques, TxtAi peut améliorer considérablement la productivité et l'efficacité des organisations dans divers domaines.

Note that the output is directly in Markdown format. I changed the code marker, otherwise it wouldn't be very readable.

Let's move on to additional explanations regarding the parameters used in our call:

First, the prompt. How do we define the prompt we send to the model? No magic here. You will need to go directly to the model's documentation. Here, on HuggingFace: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF

As you can see, the expected prompt format is clear: it requires an <|im_start|>system zone, an <|im_start|>user zone, and an <|im_start|>assistant zone that leaves space for the model's response. The trick here is to inject a context into the prompt. It's important to understand that the space available for the prompt is predefined.

n_ctx=2048: the number of tokens allocated to the context, as specified in the model's documentation on Hugging Face. How do you determine a number of tokens? The count depends on the tokenization method used; tokenization is the process of breaking a text into tokens. Here is a simple way to count tokens with NLTK:

import nltk
nltk.download('punkt')  # Make sure the 'punkt' package is downloaded for tokenization
text = "Your text here."
tokens = nltk.word_tokenize(text)
number_of_tokens = len(tokens)

In our example, the number of tokens counted for our prompt is 264, and 3182 for the returned response. In reality, the model reports the following information:

{'prompt_tokens': 552, 'completion_tokens': 938, 'total_tokens': 1490}

It is therefore a bit more complicated to predict the number of tokens in advance: the model uses its own tokenizer, which splits text differently from NLTK's word tokenizer.
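If you want a count that matches what the model reports, you can ask the model's own tokenizer rather than NLTK. A minimal sketch, reusing the llm object created earlier; the prompt string here is just an example:

# Count tokens with the model's own tokenizer instead of NLTK word tokens.
prompt = "Answer the following question using only the context below."
model_tokens = llm.tokenize(prompt.encode("utf-8"))
print(len(model_tokens))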

temperature=0.1

Temperature controls the degree of randomness in the choice of words or phrases. It affects the probabilities associated with each possible choice during text generation.

  • At a low temperature (near 0), the model tends to choose the most probable words, leading to more predictable and coherent texts. However, this can also make the generated texts less varied and sometimes repetitive.
  • At a high temperature, the model takes greater risks by choosing less probable words. This can lead to more creative, original, or surprising texts, but also sometimes less coherent or relevant.

In a text generation model, temperature acts as a regulator between coherence and creativity. Careful adjustment of the temperature can greatly influence the quality and style of the generated text.
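A small numeric sketch of what temperature does to a toy next-token distribution; the logits here are made up purely for illustration.

import math

# Toy logits for three candidate next words.
logits = {"the": 4.0, "a": 3.0, "banana": 1.0}

def softmax_with_temperature(scores, temperature):
    exps = {w: math.exp(s / temperature) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: round(e / total, 3) for w, e in exps.items()}

print(softmax_with_temperature(logits, 0.1))  # almost all the probability mass on "the"
print(softmax_with_temperature(logits, 1.5))  # flatter distribution, "banana" becomes plausible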

max_tokens=10000

The max_tokens parameter is used in natural language processing (NLP) models, like MistralAI and other neural network-based models. It plays a crucial role in defining the length of the output generated by the model. Here are some key details:

  • max_tokens specifies the maximum number of tokens (words, punctuation, symbols, etc.) that the model can generate or process at a single time. A "token" is the smallest unit of processing in NLP.
  • When generating text, max_tokens limits the length of the text produced. If the model reaches this limit during generation, it stops producing more text.
  • In text comprehension or analysis tasks, this parameter can limit the amount of text analyzed at once.
  • Limiting the number of tokens is important for managing computing resource usage. The higher the max_tokens value, the more memory and computing power is required to generate or process the text. This is particularly relevant for large NLP models, which can consume a lot of resources.
  • For Short Generations: A low max_tokens value can be used for short responses, as in chatbots or question-answering systems.
  • For Longer Texts: A higher value is needed for tasks such as generating articles, narratives, or other forms of longer content. The choice of the appropriate value for max_tokens depends on the specific application and the capabilities of the system on which the model is running.

It is often necessary to find a balance between the quality and length of the generated text, and performance constraints. In summary, max_tokens is an essential parameter that helps to control the length of the text generated by an NLP model while managing the necessary computing resources. Its proper adjustment is crucial for obtaining optimal results according to the specific needs of the application.
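A small sketch, reusing the llm object from the script above, to see max_tokens in action; the finish_reason field of the response tells you whether the limit was reached:

# Ask for a deliberately short completion and check whether it was cut off.
out = llm("Q: List three use cases for embeddings. A:", max_tokens=32)
print(out["choices"][0]["text"])
# "length" means the max_tokens limit stopped the generation, "stop" means the model finished on its own.
print(out["choices"][0]["finish_reason"])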

top_p=0.2

The top_p parameter, also known as "nucleus sampling," is a crucial setting in text-generation models such as MistralAI's. It is used to control the diversity and creativity of the generated texts. Here is a detailed explanation: when the model is generating text, it assigns a probability to each potential word as the next word. top_p filters these words based on their cumulative probabilities. In practice, this means that the model only considers the subset of possible words whose cumulative probabilities reach the top_p threshold.

Text diversity control:

  • With a low top_p, the model restricts itself to a smaller number of highly probable choices, leading to more predictable and less diversified texts.
  • With a high top_p, the model considers a greater number of possible choices, thereby increasing the diversity and creativity of the generated text, but at the expense of coherence.
  • Unlike temperature, which adjusts the probabilities of all possible choices, top_p simply limits the pool of considered words for generation based on their cumulative probabilities.

top_p offers an alternative to sampling based on the top_k parameter, where only the k most probable words are considered, regardless of their absolute probabilities.

  • For more standard text: Use a lower top_p if you want texts that are more coherent and aligned with training data.
  • For more creative text: A higher top_p is suitable for tasks requiring creativity or originality, such as creative writing or generating unique concepts.

The value of top_p is typically a real number between 0 and 1. The appropriate value choice for top_p depends on the type of text you want to generate and the desired balance between creativity and coherence. In summary, top_p is an important parameter for controlling diversity and creativity in text generation. Adjusting it allows for finding the right balance between producing original texts and maintaining certain consistency with the style and content of the training data.
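Here is a toy illustration of how nucleus sampling narrows the candidate pool; the probabilities are invented for the example:

# Toy next-token distribution, sorted by probability.
probs = {"the": 0.50, "a": 0.25, "this": 0.15, "banana": 0.07, "quantum": 0.03}

def nucleus(probs, top_p):
    kept, cumulative = [], 0.0
    for word, p in sorted(probs.items(), key=lambda item: -item[1]):
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

print(nucleus(probs, 0.2))  # ['the'] : only the most probable word survives
print(nucleus(probs, 0.9))  # ['the', 'a', 'this'] : a wider pool of candidates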

top_k=10

The top_k parameter is another significant setting. It plays a vital role in determining the model's word choices during text generation. In text generation, the model evaluates the probability of many potential words as the next word. top_k restricts this selection to the k most probable words. For example, if top_k is set to 10, the model will only consider the top 10 most probable words to continue the sentence at each generation step.

  • A low top_k (such as 5 or 10) makes the generated text more predictable and coherent, as it confines itself to the most likely choices.
  • A high top_k increases the diversity of the generated words, as it allows the model to choose from a broader array of possibilities, which can lead to more creative or surprising texts.
  • Unlike top_p, which selects words based on a threshold of cumulative probability, top_k focuses solely on a fixed number of the most probable choices.
  • top_k is often used in conjunction with temperature to further refine the generated output.
  • For more standardized texts: a lower top_k is often used for tasks requiring consistency and precision, like answering questions.
  • For more innovative texts: a higher top_k can be used to stimulate creativity, useful in creative writing or idea generation.
  • The value of top_k is usually an integer. The choice of this value depends on the desired balance between creativity and consistency in the generated text.

In summary, top_k is a parameter that directly influences the variety and originality of the text generated by an NLP model. It must be adjusted according to the specific goals of the text generation task, keeping in mind the balance between content diversity and fidelity to the style and structure of the training data.
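And the top_k equivalent on the same toy distribution; here only the number of candidates changes, not a probability threshold:

probs = {"the": 0.50, "a": 0.25, "this": 0.15, "banana": 0.07, "quantum": 0.03}

def top_k(probs, k):
    # Keep only the k most probable words, regardless of their absolute probabilities.
    return [w for w, _ in sorted(probs.items(), key=lambda item: -item[1])[:k]]

print(top_k(probs, 2))  # ['the', 'a']
print(top_k(probs, 4))  # adds less likely candidates to the pool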

repeat_penalty=1.2

The repeat_penalty is a parameter used to minimize repetitions in the generated text. It's an important aspect for improving the quality and variety of the produced text.

  • This parameter aims to penalize the repetition of words or phrases that have already appeared in the generated text. In other words, it decreases the likelihood of reselecting words or phrases that have been used recently.
  • When the model generates text, it assigns scores (probabilities) to potential words to use next. If a word has already been used, the repeat_penalty reduces that word's score, making it less likely to be chosen again.
  • The higher the repeat_penalty, the stronger the penalization for word repetition.
  • With a low or null repeat_penalty, the model may generate texts with frequent repetitions, which can make the text monotonous or less natural.
  • With a high repeat_penalty, the model tends to avoid repetitions, which can contribute to text variety and richness. However, too high a penalty can also force the model to use less appropriate words, potentially affecting the text's coherence.
  • For more fluid texts: A moderate repeat_penalty is useful for avoiding repetitions while maintaining consistency and natural flow of the text.
  • For highly varied texts: A higher repeat_penalty can be used to stimulate diversity in the text, especially in creative contexts.

The exact value of repeat_penalty will depend on the specifics of the model and the type of text you want to generate. It often requires experimentation with different values to find the ideal balance.
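A deliberately simplified sketch of the idea behind repeat_penalty (the real implementation in llama.cpp works on logits and handles negative values differently):

# Words already generated see their score divided by the penalty.
scores = {"paris": 2.0, "france": 1.5, "capital": 1.0}
already_generated = {"paris"}

def penalize(scores, history, repeat_penalty):
    return {w: round(s / repeat_penalty, 2) if w in history else s for w, s in scores.items()}

print(penalize(scores, already_generated, 1.2))  # "paris" drops from 2.0 to 1.67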

For further information, here is an excerpt from the documentation of llama_cpp_python. It can be useful for a better understanding of some parameters.

https://llama-cpp-python.readthedocs.io/en/latest/api-reference/

  • model_path (str) – Path to the model.

    • Explanation: Indicates where the model file is located on your computer.
  • n_gpu_layers (int, default: 0) – Number of layers to transfer to the GPU (-ngl). If -1, all layers are transferred.

    • Explanation: Determines how many layers of the model should be run on the GPU. Useful for managing the workload between the CPU and GPU.

  • main_gpu (int, default: 0) - The GPU used for temporary calculations and small tensors.

    • Explanation: Selects which GPU (if multiple are available) is used for the main operations.
  • tensor_split (Optional[List[float]], default: None) - How to split tensors across GPUs. If None, the model is not split.

    • Explanation: Allows for distribution of the model's data across multiple GPUs, if available.
  • vocab_only (bool, default: False) - Load only the vocabulary, not the model weights.

    • Explanation: Useful if you need only the built-in vocabulary of the model, without the trained parts.
  • use_mmap (bool, default: True) - Use mmap if possible.

    • Explanation: mmap allows for more efficient memory management when loading large files.
  • use_mlock (bool, default: False) - Forces the system to keep the model in RAM.

    • Explanation: Ensures that the model stays in live memory for quicker access.
  • seed (int, default: LLAMA_DEFAULT_SEED) - Seed for the random number generator, -1 for random.

    • Explanation: Determines the sequence of random numbers used, important for reproducibility of results.
  • n_ctx (int, default: 512) - Textual context, 0 = the model's own.

    • Explanation: Defines the size of the context (number of tokens) that the model takes into account for its predictions.
  • n_batch (int, default: 512) - Maximum processing size per batch for prompts.

    • Explanation: Determines how many prompts can be processed at the same time.
  • n_threads (Optional[int], default: None) - Number of threads to use for generation.

    • Explanation: Allows configuring multithreading to speed up generation.
  • n_threads_batch (Optional[int], default: None) - Number of threads to use for batch processing.

    • Explanation: Similar to n_threads, but specific to batch processing.
  • rope_scaling_type (Optional[int], default: LLAMA_ROPE_SCALING_UNSPECIFIED) - Type of RoPE scaling, according to llama_rope_scaling_type enumeration. Ref: llama.cpp pull request #2054

    • Explanation: Advanced parameter related to the scaling of certain calculations in the model.
  • rope_freq_base (float, default: 0.0) - RoPE base frequency, 0 = the model’s own.

    • Explanation: Sets a base value for the frequency calculation in RoPE, an advanced model parameter.
  • rope_freq_scale (float, default: 0.0) - RoPE frequency scaling factor, 0 = the model’s own.

    • Explanation: Adjusts the scale of the frequency used in RoPE.
  • yarn_ext_factor (float, default: -1.0) - YaRN mix extrapolation factor, negative = the model’s own.

    • Explanation: Advanced parameter influencing how YaRN, a part of the model, blends data.
  • yarn_attn_factor (float, default: 1.0) - YaRN attention magnitude scaling factor.

    • Explanation: Adjusts the scale of some calculations in YaRN.
  • yarn_beta_fast (float, default: 32.0) – YaRN low correction dimension.

    • Explanation: Technical parameter related to fast correction in YaRN.
  • yarn_beta_slow (float, default: 1.0) – YaRN high correction dimension.

    • Explanation: Technical parameter for slow correction in YaRN.
  • yarn_orig_ctx (int, default: 0) – Original context size of YaRN.

    • Explanation: Determines the original context size for certain calculations in YaRN.
  • f16_kv (bool, default: True) – Use fp16 for the KV cache, fp32 otherwise.

    • Explanation: Determines the data format in the key/value cache, influencing performance and memory usage.
  • logits_all (bool, default: False) – Returns logits for all tokens, not just the last one. Must be True for completion to return logprobs.

    • Explanation: If True, the model will provide logits (scores before activation function) for each generated token.
  • embedding (bool, default: False) – Embedding only mode.

    • Explanation: If True, the model operates in embedding mode, used to obtain vector representations of data.
  • last_n_tokens_size (int, default: 64) – Maximum number of tokens to keep in the last_n_tokens deque.

    • Explanation: Defines the buffer size for the last generated tokens.
  • lora_base (Optional[str], default: None) – Optional path to the base model, useful if you're using a quantized base model and want to apply LoRA to an f16 model.

    • Explanation: Allows applying LoRA adjustments to a specific base model.
  • lora_path (Optional[str], default: None) – Path to a LoRA file to apply to the model.

    • Explanation: Specifies a LoRA file to modify the model.
  • numa (bool, default: False) – Enable NUMA support. (NOTE: The initial value of this parameter is used for the rest of the program as this value is set in llama_backend_init)

    • Explanation: Enables or disables support for NUMA, a memory management optimization method on systems with multiple CPUs.
  • chat_format (str, default: 'llama-2') – String specifying the chat format to use when calling create_chat_completion.

    • Explanation: Determines the output format for generated chat sessions.
  • chat_handler (Optional[LlamaChatCompletionHandler], default: None) – Optional chat handler to use when calling create_chat_completion.

    • Explanation: Allows customization of the processing of chat sessions.
  • verbose (bool, default: True) – Display detailed output on stderr.

    • Explanation: If True, the model will provide detailed information about its operation during execution.
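To close this reference, here is a sketch combining a few of these parameters; the model path is the one used earlier in this article, and the values are only examples to adjust to your own machine.

from llama_cpp import Llama

llm = Llama(
    model_path="/Data/Projets/Llm/Models/MistralAi-OpenOrqa/mistral-7b-openorca.Q8_0.gguf/mistral-7b-openorca.Q8_0.gguf",
    n_ctx=2048,       # context window in tokens
    n_threads=8,      # CPU threads used for generation
    n_gpu_layers=0,   # keep everything on the CPU, as in this article
    seed=42,          # fixed seed for reproducible sampling
    verbose=False,    # silence the detailed stderr output
)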

That wraps up this introduction to RAG. We haven't yet gotten to the heart of the matter, but this first part allowed us to present the concept and show how things work. See you in part 2, where we will start loading data into the database and see how TxtAi manages, on its own, to run the queries that retrieve content and provide a coherent response suitable for the user.