Retrieval-Augmented Generation (RAG)

RAG is a widely used technique for improving the quality of LLM-generated responses by grounding the model in external sources of knowledge. In this example, we’ll use BAML to manage the prompts for a RAG pipeline.

Creating BAML functions

The most common way to implement RAG is to use a vector store that contains embeddings of the data. First, let’s define our BAML model for RAG.

BAML Code

rag.baml
class Response {
  question string
  answer string
}

function RAG(question: string, context: string) -> Response {
  client "openai/gpt-4o-mini"
  prompt #"
    Answer the question in full sentences using the provided context.
    Do not make up an answer. If the information is not provided in the context, say so clearly.

    QUESTION: {{ question }}
    RELEVANT CONTEXT: {{ context }}

    {{ ctx.output_format }}

    RESPONSE:
  "#
}

test TestOne {
  functions [RAG]
  args {
    question "When was SpaceX founded?"
    context #"
      SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.
    "#
  }
}

test TestTwo {
  functions [RAG]
  args {
    question "Where is Fiji located?"
    context #"
      Fiji is a country in the South Pacific known for its rugged landscapes, palm-lined beaches, and coral reefs with clear lagoons.
    "#
  }
}

test TestThree {
  functions [RAG]
  args {
    question "What is the primary product of BoundaryML?"
    context #"
      BoundaryML is the company that makes BAML, the best way to get structured outputs with LLMs.
    "#
  }
}

test TestMissingContext {
  functions [RAG]
  args {
    question "Who founded SpaceX?"
    context #"
      BoundaryML is the company that makes BAML, the best way to get structured outputs with LLMs.
    "#
  }
}

Note how in the TestMissingContext test, the model correctly says that it doesn’t know the answer, because the information isn’t provided in the context. The prompt is written so that the model does not make up an answer.

You can generate the BAML client code for this prompt by running baml-cli generate.
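
Once the client is generated, the function can be called directly from Python. Here’s a minimal sketch, assuming the generated baml_client package is importable and your OpenAI API key is set in the environment:

from baml_client import b

# Call the BAML function directly with a question and a hand-written context string.
response = b.RAG(
    "When was SpaceX founded?",
    "SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.",
)
print(response.answer)  # should say the company was founded in 2002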

Creating a VectorStore

Next, let’s create our own minimal vector store and retriever using scikit-learn.

Python Code

rag.py
# Install scikit-learn and use its TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class VectorStore:
    """
    Adapted from https://github.com/MadcowD/ell/blob/main/examples/rag/rag.py
    """
    def __init__(self, vectorizer, tfidf_matrix, documents):
        self.vectorizer = vectorizer
        self.tfidf_matrix = tfidf_matrix
        self.documents = documents

    @classmethod
    def from_documents(cls, documents: list[str]) -> "VectorStore":
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(documents)
        return cls(vectorizer, tfidf_matrix, documents)

    def retrieve_with_scores(self, query: str, k: int = 2) -> list[dict]:
        query_vector = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vector, self.tfidf_matrix).flatten()
        top_k_indices = np.argsort(similarities)[-k:][::-1]
        return [
            {"document": self.documents[i], "relevance": float(similarities[i])}
            for i in top_k_indices
        ]

    def retrieve_context(self, query: str, k: int = 2) -> str:
        documents = self.retrieve_with_scores(query, k)
        return "\n".join([item["document"] for item in documents])
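
Before wiring the store into BAML, it can be useful to sanity-check retrieval on its own. A quick sketch, assuming the VectorStore class above is in scope:

# Build a tiny store and inspect the top-ranked document with its similarity score.
store = VectorStore.from_documents([
    "SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.",
    "Fiji is a country in the South Pacific known for its rugged landscapes, palm-lined beaches, and coral reefs with clear lagoons.",
])
print(store.retrieve_with_scores("When was SpaceX founded?", k=1))
# -> the SpaceX document, along with its cosine-similarity relevance score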

We can then build our RAG application in Python by calling the BAML client.

rag.py
from baml_client import b

# class VectorStore:
#     ...

if __name__ == "__main__":
    documents = [
        "SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.",
        "Fiji is a country in the South Pacific known for its rugged landscapes, palm-lined beaches, and coral reefs with clear lagoons.",
        "Dunkirk is a 2017 war film depicting the Dunkirk evacuation of World War II, featuring intense aerial combat scenes with Spitfire aircraft.",
        "BoundaryML is the company that makes BAML, the best way to get structured outputs with LLMs."
    ]

    vector_store = VectorStore.from_documents(documents)

    questions = [
        "What is BAML?",
        "Which aircraft was featured in Dunkirk?",
        "When was SpaceX founded?",
        "Where is Fiji located?",
        "What is the capital of Fiji?"
    ]

    for question in questions:
        context = vector_store.retrieve_context(question)
        response = b.RAG(question, context)
        print(response)
        print("-" * 10)

When you run the Python script, you should see output like the following:

question='What is BAML?' answer='BAML is a product made by BoundaryML, and it is described as the best way to get structured outputs with LLMs.'
----------
question='Which aircraft was featured in Dunkirk?' answer='The aircraft featured in Dunkirk were Spitfire aircraft.'
----------
question='When was SpaceX founded?' answer='SpaceX was founded in 2002.'
----------
question='Where is Fiji located?' answer='Fiji is located in the South Pacific.'
----------
question='What is the capital of Fiji?' answer='The information about the capital of Fiji is not provided in the context.'
----------

Once again, in the last question, the model correctly says that it doesn’t know the answer because it’s not provided in the context.

That’s it! You can now apply the same RAG workflow to a larger dataset backed by a vector database. All you have to do is point BAML to the retriever class you’ve implemented.
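
One way to keep that swap painless is to code against a minimal retriever interface: any class that exposes retrieve_context can back the same BAML call. A sketch, where the Retriever protocol and answer helper are illustrative names rather than part of BAML:

from typing import Protocol

from baml_client import b

class Retriever(Protocol):
    """Anything that can turn a query into a context string."""
    def retrieve_context(self, query: str, k: int = 2) -> str: ...

def answer(retriever: Retriever, question: str):
    # The BAML call stays the same no matter which store backs the retriever.
    context = retriever.retrieve_context(question)
    return b.RAG(question, context)

# Works with the scikit-learn VectorStore above, and equally with the Pinecone
# store introduced later in this guide.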

Creating Citations with the LLM

In this advanced section, we’ll explore how to enhance our RAG implementation so that responses include citations. This is particularly useful when you need to track the source of the information in a generated answer.

First, let’s extend our BAML model to support citations. We’ll create a new response type and function that explicitly handles citations:

rag.baml
class ResponseWithCitations {
  question string
  answer string
  citations string[]
}

function RAGWithCitations(question: string, context: string) -> ResponseWithCitations {
  client "openai/gpt-4o-mini"
  prompt #"
    Answer the question in full sentences using the provided context.
    If the statement contains information from the context, put the exact cited quotes in complete sentences in the citations array.
    Do not make up an answer. If the information is not provided in the context, say so clearly.

    QUESTION: {{ question }}
    RELEVANT CONTEXT: {{ context }}
    {{ ctx.output_format }}
    RESPONSE:
  "#
}

Let’s add a test to verify our citation functionality:

rag.baml
test TestCitations {
  functions [RAGWithCitations]
  args {
    question "What can you tell me about SpaceX and its founder?"
    context #"
      SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.
      The company has developed several launch vehicles and spacecraft.
      Einstein was born on March 14, 1879.
    "#
  }
}

This test will demonstrate how the model:

  1. Provides relevant information about SpaceX and its founder
  2. Includes the exact source quotes in the citations array
  3. Only uses information that’s actually present in the context

To use this enhanced RAG implementation in our Python code, we simply need to update our loop to use the new RAGWithCitations function:

rag.py
for question in questions:
    context = vector_store.retrieve_context(question)
    response = b.RAGWithCitations(question, context)
    print(response)
    print("-" * 10)

When you run this modified code, you’ll see responses that include both answers and their supporting citations. For example:

question='What is BAML?' answer='BAML is a product made by BoundaryML that provides the best way to get structured outputs with LLMs.' citations=['BoundaryML is the company that makes BAML, the best way to get structured outputs with LLMs.']
----------
question='Which aircraft was featured in Dunkirk?' answer='The aircraft featured in Dunkirk were Spitfire aircraft.' citations=['Dunkirk is a 2017 war film depicting the Dunkirk evacuation of World War II, featuring intense aerial combat scenes with Spitfire aircraft.']
----------
question='When was SpaceX founded?' answer='SpaceX was founded in 2002.' citations=['SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.']
----------
question='Where is Fiji located?' answer='Fiji is located in the South Pacific.' citations=['Fiji is a country in the South Pacific.']
----------
question='What is the capital of Fiji?' answer='The capital of Fiji is not provided in the context.' citations=[]
----------

Notice how each piece of information in the answer is backed by a specific citation from the source context. This makes the responses more transparent and verifiable, which is especially important in applications where the source of information matters.
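
Because the citations are meant to be exact quotes, you can also verify them programmatically against the retrieved context. A small sketch of such a check (the helper below is illustrative, not part of the BAML client; in practice you may want a fuzzier match than exact substring containment):

def citations_are_grounded(response, context: str) -> bool:
    # Every citation should appear verbatim in the context that was passed in.
    return all(citation in context for citation in response.citations)

context = vector_store.retrieve_context("When was SpaceX founded?")
response = b.RAGWithCitations("When was SpaceX founded?", context)
assert citations_are_grounded(response, context)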

Using Pinecone as Vector Database

Instead of using our custom vector store, we can use Pinecone, a production-ready vector database. Here’s how to implement the same RAG pipeline using Pinecone:

First, install the required packages:

$ pip install pinecone sentence-transformers

Now, let’s modify our Python code to use Pinecone:

rag_pinecone.py
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
from baml_client import b

# Initialize Pinecone
pc = Pinecone(api_key="YOUR_API_KEY")

class PineconeStore:
    def __init__(self, index_name: str):
        self.index_name = index_name
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

        # Create index if it doesn't exist
        if index_name not in pc.list_indexes().names():
            pc.create_index(
                name=index_name,
                dimension=self.encoder.get_sentence_embedding_dimension(),
                metric='cosine',
                spec=ServerlessSpec(
                    cloud='aws',
                    region='us-east-1'
                )
            )
        self.index = pc.Index(index_name)

    def add_documents(self, documents: list[str], ids: list[str] | None = None):
        if ids is None:
            ids = [str(i) for i in range(len(documents))]

        # Create embeddings
        embeddings = self.encoder.encode(documents)

        # Create vector records
        vectors = [(id, emb.tolist(), {"text": doc})
                   for id, emb, doc in zip(ids, embeddings, documents)]

        # Upsert to Pinecone
        self.index.upsert(vectors=vectors)

    def retrieve_context(self, query: str, k: int = 2) -> str:
        # Create query embedding
        query_embedding = self.encoder.encode(query).tolist()

        # Query Pinecone
        results = self.index.query(
            vector=query_embedding,
            top_k=k,
            include_metadata=True
        )

        # Extract and join the document texts
        contexts = [match.metadata["text"] for match in results.matches]
        return "\n".join(contexts)

if __name__ == "__main__":
    # Initialize Pinecone store
    vector_store = PineconeStore("baml-rag-demo")

    # Sample documents (same as before)
    documents = [
        "SpaceX is an American spacecraft manufacturer and space transportation company founded by Elon Musk in 2002.",
        "Fiji is a country in the South Pacific known for its rugged landscapes, palm-lined beaches, and coral reefs with clear lagoons.",
        "Dunkirk is a 2017 war film depicting the Dunkirk evacuation of World War II, featuring intense aerial combat scenes with Spitfire aircraft.",
        "BoundaryML is the company that makes BAML, the best way to get structured outputs with LLMs."
    ]

    # Add documents to Pinecone
    vector_store.add_documents(documents)

    # Test questions (same as before)
    questions = [
        "What is BAML?",
        "Which aircraft was featured in Dunkirk?",
        "When was SpaceX founded?",
        "Where is Fiji located?",
        "What is the capital of Fiji?"
    ]

    # Query using the same BAML functions
    for question in questions:
        context = vector_store.retrieve_context(question)
        response = b.RAGWithCitations(question, context)
        print(response)
        print("-" * 10)

The key differences when using Pinecone are:

  1. Documents are stored in Pinecone’s serverless infrastructure on AWS instead of in memory
  2. We can persist our vector database across sessions (see the reconnection sketch after this list)
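
For example, in a later session you can reconnect to the same index and query it without re-adding any documents. A minimal sketch, assuming the PineconeStore class above and the same index name:

from baml_client import b

# Reconnects to the existing "baml-rag-demo" index; no add_documents call is
# needed because the vectors upserted earlier are still stored in Pinecone.
vector_store = PineconeStore("baml-rag-demo")

context = vector_store.retrieve_context("What is BAML?")
print(b.RAGWithCitations("What is BAML?", context))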

In the Pinecone console, you can see the entries that were upserted to this index.

Note that you’ll need to:

  1. Create a Pinecone account
  2. Get your API key from the Pinecone console
  3. Replace YOUR_API_KEY with your actual Pinecone credentials (or load the key from an environment variable, as sketched after this list)
  4. Make sure you have access to the serverless offering in your Pinecone account
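
Rather than hardcoding the key, a common pattern is to read it from an environment variable. A minimal sketch (the PINECONE_API_KEY name is just a convention used here, not something the code above requires):

import os
from pinecone import Pinecone

# Fails loudly if the variable isn't set, and keeps secrets out of source control.
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])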

The BAML functions (RAG and RAGWithCitations) remain exactly the same, demonstrating how BAML cleanly separates the prompt engineering from the implementation details of your vector database.

When you run this code, you’ll get the same type of responses as before, but now you’re using a production-ready serverless vector database that can scale automatically based on your usage.