Learning and Implementing RAG (Retrieval-Augmented Generation) using Hugging Face and LangChain
#web development#javascript#ai#programming


Author
Kurt Chan
Coffee Driven Developer
February 15, 2024

6 mins read


I've been learning about AI chatbots and their widespread use as company assistants and personal AI companions. Based on what I've seen from TikTok reels and YouTube videos, here's what I understand about building RAG Applications:

  • LLMs rely only on their public pre-training data, which can sometimes lead to inaccurate or outdated answers.
  • To improve accuracy, they need to be supplemented with custom knowledge bases like documents, PDFs, and company SOPs (Standard Operating Procedures).

In essence, if you want an AI to provide accurate answers, you'll need to implement the RAG approach.

What is RAG (Retrieval-Augmented Generation)?

  • According to Data Camp, RAG, or Retrieval Augmented Generation, combines a pre-trained large language model with an external data source. It merges the generative capabilities of LLMs like GPT-3 or GPT-4 with precise data search mechanisms to deliver nuanced responses.

Why do we need to use RAG?

  • Enhanced Accuracy: RAG allows LLMs to access up-to-date and domain-specific information, reducing hallucinations and improving response accuracy.
  • Customization: Organizations can integrate their proprietary data, enabling AI to provide responses tailored to their specific context and needs.
  • Transparency: RAG makes it possible to trace the sources of information, making the AI's responses more verifiable and trustworthy.
  • Cost-Effective: Instead of fine-tuning entire models, RAG allows for updating knowledge by simply modifying the external knowledge base.
  • Real-Time Updates: The knowledge base can be continuously updated without retraining the entire model, ensuring responses stay current.

How does it work?

According to LangChain, the typical RAG approach has 2 main components:

  • Indexing: a pipeline for ingesting data from a source and indexing it. This usually happens offline.
  • Retrieval and generation: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

Indexing

  1. Load: Begin by loading your data, such as documents, PDFs, URLs, or JSON files.
  2. Split: Break large documents into smaller, digestible chunks. This enhances both data indexing and model processing, making searches more efficient and ensuring the text fits within the model’s context window (e.g., max tokens).
  3. Embed: Use an embedding model to convert these text chunks into vector representations. This process is typically handled by a separate model that transforms text into numerical embeddings.
  4. Store: Finally, store the vectorized chunks in a database, allowing for efficient search and retrieval.
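To make the Split step concrete, here's a toy fixed-size chunker with overlap. It's a simplified stand-in for LangChain's RecursiveCharacterTextSplitter (which prefers to break on paragraph and sentence boundaries rather than raw character counts), but the sliding-window idea is the same:

```typescript
// Toy character-based splitter: fixed-size windows with overlap.
// Assumes overlap < chunkSize. Overlapping windows help preserve context
// that would otherwise be cut in half at a chunk boundary.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

// Example: 10-character chunks with a 4-character overlap
console.log(chunkText("Bloom's Taxonomy has six cognitive levels.", 10, 4));
```

Each chunk repeats the tail of the previous one, which is exactly what the `chunkOverlap: 100` option does in the real splitter later in this post.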

Retrieval and Generation

  1. Retrieve: When a user enters a question, the system looks up the most relevant text chunks from storage using a Retriever. Think of it like searching for the best puzzle pieces to answer the question.
  2. Generate: Once the right chunks are found, a ChatModel or LLM takes over. It uses a prompt that combines the user's question with the retrieved info to generate a meaningful response—kind of like piecing together a story from key details.
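The Retrieve step boils down to nearest-neighbor search over embeddings. Here's a minimal in-memory retriever using cosine similarity (the same metric we'll configure Pinecone with later). The hand-made 2D vectors are purely illustrative, and real vector databases use approximate-nearest-neighbor indexes instead of a full sort:

```typescript
type Doc = { text: string; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank all stored documents against the query vector and keep the top K
function retrieve(queryVector: number[], docs: Doc[], topK: number): Doc[] {
  return [...docs]
    .sort((x, y) =>
      cosineSimilarity(queryVector, y.vector) - cosineSimilarity(queryVector, x.vector))
    .slice(0, topK);
}

// Tiny demo with made-up 2D "embeddings" (real ones have hundreds of dimensions)
const docs: Doc[] = [
  { text: "Bloom's Taxonomy levels", vector: [0.9, 0.1] },
  { text: "Cooking recipes", vector: [0.1, 0.9] },
];
console.log(retrieve([1, 0], docs, 1)[0].text); // most similar to the query
```

This is conceptually what `index.query({ vector, topK })` does against Pinecone in Step 4, just without the index structures that make it fast.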

Essential Components for Building RAG AI Applications:

  • Embedding Model: Converts text chunks from documents or web pages into vector representations that can be compared by similarity. For web content, this requires scraping the URL first.
  • LLM: The core AI model that generates responses from the user's question plus the retrieved context.
  • Vector Database: A specialized database that stores the vectorized chunks for similarity search at query time.

Let’s try to build!

Of course! When learning something new, I always make sure to apply it in practice. Recently, I built an AI Exam Question Generator called Nexar, which creates questions based on the specific Bloom’s Taxonomy level the user selects. However, the AI has limited knowledge of Bloom’s Taxonomy, which is a drawback. To improve this, we’ll be implementing Retrieval-Augmented Generation (RAG).

Tech stack

  • LangChainJS: A powerful framework for building applications that leverage large language models (LLMs).
  • Node.js: The runtime environment for developing and running our application.
  • PineconeDB: Our vector database for storing and indexing vectorized text chunks.
  • Hugging Face: A platform for accessing and using open-source LLMs.

Setup

Install the dependencies:

```bash
# Install LangChain
npm install @langchain/community @langchain/core

# Install Hugging Face Inference and Pinecone DB
npm install @huggingface/inference @langchain/pinecone

# Install other libraries that we need
npm install dotenv ts-node playwright
```

Step 1: Scrape URL Contents

To gather information on Bloom's Taxonomy, we'll find relevant URLs and extract their content through web scraping, then split the text into smaller chunks. This provides the data our project needs. We'll put the scraping logic in a separate function.

```typescript
// scrape.ts
import * as playwright from 'playwright';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

export const scrape = async (url: string) => {
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // Get the text in the body
  const text = await page.innerText('body');
  await browser.close();

  // Replace all newlines with spaces
  const cleanedText = text.replace(/\n/g, ' ');

  // Split the scraped data into chunks
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 512,
    chunkOverlap: 100,
  });

  // Create documents out of the chunked texts
  const output = await splitter.createDocuments([cleanedText]);
  console.log("Scraped text into chunks: ", output);

  return output.map((doc) => ({ text: doc.pageContent, url }));
};
```

Step 2: Embed the chunks

Each chunk must be embedded and converted into vectorized data before being stored in our database. To achieve this, we'll use the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face for embedding the text. First, we’ll create a separate function for embedding texts.

```typescript
// generateEmbeddings.ts
import { HuggingFaceInferenceEmbeddings } from '@langchain/community/embeddings/hf';
import dotenv from 'dotenv';

dotenv.config();

// Embedding model using Hugging Face
const client = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HF_ACCESSTOKEN, // Your Hugging Face API key
  model: "sentence-transformers/all-MiniLM-L6-v2",
});

export const generateEmbedding = async (text: string) => {
  const embedding = await client.embedQuery(text);
  return embedding;
};
```

Next, we'll create an ingest function that loops over our URLs and embeds each chunk of text.

```typescript
// ingest.ts
import { generateEmbedding } from "./generateEmbeddings";
import { scrape } from "./scrape";
import { upsert } from "./upsert";

// URLs to scrape
const urls = [
  'https://www.coloradocollege.edu/other/assessment/how-to-assess-learning/learning-outcomes/blooms-revised-taxonomy.html',
  'https://whatfix.com/blog/blooms-taxonomy/'
];

async function ingest() {
  let chunks: { id: number; text: string; $vector: number[]; url: string }[] = [];

  await Promise.all(
    urls.map(async (url) => {
      const data = await scrape(url);

      // Embed the scraped data chunks
      const embeddings = await Promise.all(
        data.map(async (doc) => await generateEmbedding(doc.text))
      );

      // Combine each chunk with its vectorized data
      chunks = chunks.concat(
        data.map((doc, index) => ({
          id: index + 1,
          text: doc.text,
          $vector: embeddings[index],
          url: doc.url,
        }))
      );
    })
  );

  // Upsert the chunks to Pinecone
  await upsert(chunks);
  console.log("Vectorized Chunks", chunks);
}

ingest().catch((error) => console.error(error));
```

If you run this with the command npx ts-node ingest.ts, you will get all the chunked texts along with their vectorized data. Next, we need to store this data in our Pinecone vector database.

Step 3: Store the vectorized chunks

For this step, make sure you have already created a Pinecone account. Creating your first index in the UI is optional, because we will create the index programmatically. First, we'll write a function that checks whether our index exists and creates it if it doesn't. Then we'll write another function to upsert the vectorized data into our Pinecone index.

```typescript
// upsert.ts
import { Pinecone as PineconeClient } from "@pinecone-database/pinecone";
import dotenv from 'dotenv';

dotenv.config();

interface Document {
  id: number;
  text: string;
  $vector: number[];
  url: string;
}

export const client = new PineconeClient({ apiKey: `${process.env.PINECONE_API_KEY}` });
export const indexName = "pinecone-index-1";

// Check for an existing index; create one if none exists
const createPineconeIndex = async () => {
  const existingIndexes = await client.listIndexes();
  if (!existingIndexes.indexes?.find((a) => a.name.includes(indexName))) {
    console.log(`Creating "${indexName}"...`);
    const createClient = await client.createIndex({
      name: indexName,
      dimension: 384, // Vector dimension for `sentence-transformers/all-MiniLM-L6-v2`
      metric: 'cosine',
      spec: { serverless: { cloud: 'aws', region: 'us-east-1' } }
    });
    console.log(`Created with client:`, createClient);
    // Wait 60 seconds for index initialization
    await new Promise((resolve) => setTimeout(resolve, 60000));
  } else {
    console.log(`"${indexName}" already exists.`);
  }
};

export const upsert = async (docs: Document[]) => {
  await createPineconeIndex(); // Create the index if it doesn't exist yet
  const index = client.Index(indexName);

  // Batch the chunks in groups of 100 so we stay within Pinecone's upsert limits
  const batchSize = 100;
  let batch = [];
  for (let i = 0; i < docs.length; i++) {
    const vector = {
      id: `${docs[i].id}_${docs[i].url}`,
      values: docs[i].$vector,
      metadata: { text: docs[i].text, url: docs[i].url },
    };
    batch.push(vector);

    // When the batch is full (or we reach the last document), upsert and reset it
    if (batch.length === batchSize || i === docs.length - 1) {
      await index.upsert(batch);
      batch = [];
    }
  }
  console.log(`Pinecone index updated with ${docs.length} vectors`);
};
```

After storing the vectorized data, you should be able to see the data in the Pinecone UI.


Step 4: Prompting the AI

In this step, we'll ask the LLM to generate questions based on Bloom's Taxonomy. To give the model the relevant context, we'll also embed our prompt and use the resulting vector to query Pinecone for the most similar chunks. We'll create an askAi function that does exactly that.

```typescript
// askAi.ts
import { generateEmbedding } from "./generateEmbeddings";
import { HuggingFaceInference } from "@langchain/community/llms/hf";
import { RunnableSequence } from "@langchain/core/runnables";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StructuredOutputParser } from "langchain/output_parsers";
import { client, indexName } from "./upsert";
import { z } from "zod";
import dotenv from "dotenv";

dotenv.config();

const question = `Generate exactly 5 Multiple-choice questions that assess the Remembering level of Bloom's Taxonomy.
The questions should be based solely on the following content or topic: "Data Structures & Algorithms"
Each question must:
- Start with Number 1.
- Focus on the Remembering level (e.g., Remembering behavior).
- Clearly specify the correct answer.`;

// Initialize the Hugging Face model
const model = new HuggingFaceInference({
  model: process.env.HF_MODEL,
  apiKey: process.env.HF_ACCESSTOKEN,
  temperature: 0.5
});

const askAi = async () => {
  try {
    // Retrieve the Pinecone index
    const index = client.Index(indexName);

    // Create the query embedding
    const queryEmbedding = await generateEmbedding(question);

    // Query the Pinecone index and return the top 10 matches
    const queryResponse = await index.query({
      vector: queryEmbedding, // Vectorized query
      topK: 10,
      includeMetadata: true,
      includeValues: true,
    });

    console.log(`Found ${queryResponse.matches.length} matches...`);
    console.log("Asking question...");

    if (queryResponse.matches.length) {
      // Concatenate the matched documents into a single string
      const concatenatedPageContent = queryResponse.matches
        .map((match) => match.metadata?.text)
        .join(" ");

      // Create a chat-based prompt template
      const prompt = ChatPromptTemplate.fromMessages([
        [
          "system",
          `You are an expert teacher assistant specializing in creating examination questions.
You will help the user by generating tricky examination questions based on the content, paragraph, or topic they provide and their preferred type of questions (e.g., true or false, short answer, multiple-choice). The user will specify the level of difficulty based on the Revised Bloom's Taxonomy:

{content}

**Output Requirements**:
- Always generate the questions and answers in the following format:
{format_instructions}

Instructions:
- Directly generate the requested questions and answers.
- Do not provide any commentary, explanations, or context about the topic or your process.
- Do not think step-by-step or describe your reasoning.
- If you encounter unfamiliar content, still generate questions based on the provided instructions.
- Your output must only include the generated questions and their answers in the exact requested format.`
        ],
        ["user", "{user_query}"],
      ]);

      // Define the schema for the desired output structure
      const schema = z.object({
        questions: z.array(
          z.object({
            questionNumber: z.number(),
            question: z.string(),
            options: z.object({
              a: z.string(),
              b: z.string(),
              c: z.string(),
              d: z.string(),
            }),
            answer: z.string(),
          })
        ),
      });

      // Create an output parser from the Zod schema
      const parser = StructuredOutputParser.fromZodSchema(schema);

      // Create a RunnableSequence chain: prompt -> model -> parser
      const chain = RunnableSequence.from([prompt, model, parser]);

      // Execute the chain with our inputs
      const result = await chain.invoke({
        content: concatenatedPageContent,
        user_query: question,
        format_instructions: parser.getFormatInstructions()
      });

      console.log(result);
    } else {
      console.log("No relevant content found. Skipping AI query.");
    }
  } catch (error) {
    console.error("Error occurred:", error);
  }
};

askAi();
```

As you can see, we used a Zod schema to describe the JSON we want. Since our AI is a chat-based model, it tends to produce free-form text around its answers; the StructuredOutputParser forces the output to conform to the JSON schema.
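The idea behind the parser is just schema validation: reject any model output that doesn't match the expected shape. A dependency-free sketch of that check might look like the following (Zod and StructuredOutputParser do this far more thoroughly, and also generate the {format_instructions} text for the prompt):

```typescript
type GeneratedQuestion = {
  questionNumber: number;
  question: string;
  options: { a: string; b: string; c: string; d: string };
  answer: string;
};

// Minimal runtime check that parsed LLM output matches our schema.
// Zod reports *which* field failed; this toy version only answers yes/no.
function isGeneratedQuestion(value: unknown): value is GeneratedQuestion {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const opts = v.options as Record<string, unknown> | undefined;
  return (
    typeof v.questionNumber === "number" &&
    typeof v.question === "string" &&
    typeof v.answer === "string" &&
    typeof opts === "object" && opts !== null &&
    ["a", "b", "c", "d"].every((k) => typeof opts[k] === "string")
  );
}

// Example: parse raw model output and validate it before trusting it
const raw = '{"questionNumber":1,"question":"What is a stack?","options":{"a":"LIFO list","b":"FIFO list","c":"Tree","d":"Graph"},"answer":"a"}';
const parsed = JSON.parse(raw);
console.log(isGeneratedQuestion(parsed)); // true for well-formed output
```

If the check fails, a real application would typically retry the LLM call or surface an error instead of passing malformed data downstream.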

Output:

```
Found 10 matches...
Asking question...
{
  questions: [
    { questionNumber: 1, question: 'What is the basic unit of data in a linked list?', options: [Object], answer: 'a' },
    { questionNumber: 2, question: 'Which data structure follows the Last In First Out (LIFO) principle?', options: [Object], answer: 'b' },
    { questionNumber: 3, question: 'In algorithm analysis, what does O(1) represent?', options: [Object], answer: 'b' },
    { questionNumber: 4, question: 'What is the process of adding an element to a stack called?', options: [Object], answer: 'b' },
    { questionNumber: 5, question: 'Which of the following is not a primitive data type in most programming languages?', options: [Object], answer: 'd' }
  ]
}
```

See the repository here!

Conclusion

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances AI chatbots by combining the generative capabilities of LLMs with external knowledge sources. By indexing and retrieving relevant data, RAG improves response accuracy, reduces hallucinations, and allows for real-time updates without the need for fine-tuning. Implementing RAG requires key components like embedding models, vector databases, and retrieval mechanisms, which have been demonstrated in the Nexar AI Exam Question Generator project. By integrating this approach, AI applications can deliver more precise, customized, and transparent answers tailored to specific domains.


Written by Kurt Chan