3. Embedding New Content#

This guide explains how to embed your own documents into a Maeser-compatible vector store, enabling Retrieval‑Augmented Generation (RAG) over custom knowledge bases.


3.1. Prerequisites#

  • A Maeser development or user environment set up (see Development Setup).

  • Python 3.10+ virtual environment activated.

  • Maeser and its dependencies installed (pip install -e . or make setup).

  • Your documents in plain text format (e.g., .txt, Markdown .md, or PDF converted to text).


3.2. Prepare Your Text Data#

  1. Collect all source files you want to embed into a single folder (e.g., docs/homework, docs/labs, etc.).

  2. If your files are not plain text (e.g., PDF), convert them first. A simple example using pdftotext (see also the Python sketch after this list):

    pdftotext input.pdf output.txt
    
  3. Ensure each file’s encoding is UTF‑8 to avoid errors when reading in Python.
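If you have many PDFs, you can also batch the conversion in Python. Below is a minimal sketch using the pypdf package (an assumption; any PDF-to-text library will work), which also writes the output as UTF-8:

from pathlib import Path
from pypdf import PdfReader

# Convert every PDF in a folder to a UTF-8 .txt file alongside it
folder = Path("path/to/pdfs")
for pdf in folder.glob("*.pdf"):
    reader = PdfReader(pdf)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    pdf.with_suffix(".txt").write_text(text, encoding="utf-8")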


Your script for storing data will consist of two parts: chunking and storing.

3.3. Chunk Your Documents#

Large documents must be split into smaller, semantically meaningful chunks before embedding:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

# Configure text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# Read all text files and generate chunks
texts = [] # this is where the text chunks will be stored
metadatas = [] # this will keep track of which text chunks belong to which text file
folder = Path("path/to/txts")
for f in folder.glob("*.txt"):
    doc = f.read_text(encoding="utf8")
    chunks = splitter.split_text(doc)
    texts.extend(chunks) # store text chunks in texts
    metadatas.extend([{"source": f.name}] * len(chunks)) # store file name in metadatas

Tip: Adjust chunk_size and chunk_overlap based on your document structure. Smaller chunks improve retrieval precision but increase vector store size.
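
A quick sanity check makes it easier to tune those parameters. For example, to see how many chunks each file produced (using texts and metadatas from the code above):

from collections import Counter

# Summarize chunk counts per source file
counts = Counter(m["source"] for m in metadatas)
print(f"{len(texts)} chunks from {len(counts)} files")
for name, n in counts.most_common():
    print(f"  {name}: {n} chunks")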


3.4. Create the Vector Store#

Use FAISS (via LangChain) to embed and store your chunks:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Build or load FAISS vector store
vectorstore = FAISS.from_texts(
    texts,
    embeddings,
    metadatas=metadatas
)

# Persist to disk
vectorstore.save_local("my_vectorstore")
  • my_vectorstore/ will contain the index files you can reuse in Maeser.

  • The metadata (the source field) lets you trace which document each vector originated from.

Tip: The code above assumes the OPENAI_API_KEY environment variable is set. If you would rather pass your API key into the script directly, supply it when initializing the embeddings:

embeddings = OpenAIEmbeddings(api_key="<your_api_key_here>")
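
To verify the saved store, you can load it back and run a quick similarity search. A minimal sketch (note: recent versions of langchain_community require the allow_dangerous_deserialization flag, since loading the store unpickles its docstore):

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
store = FAISS.load_local(
    "my_vectorstore",
    embeddings,
    allow_dangerous_deserialization=True
)
for doc in store.similarity_search("your test query", k=2):
    print(doc.metadata["source"], doc.page_content[:80])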

3.5. Integrate with Maeser#

The following code snippet assumes that you have an initialized ChatSessionManager object called sessions_manager and that you have imported config variables from config_example.py. Add the following code to your Maeser application (e.g., in flask_example.py or your custom script):
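
For reference, those assumptions correspond to something like the following (variable names taken from the example config; adjust to your setup):

# Assumed earlier in your script:
from config_example import VEC_STORE_PATH, LOG_SOURCE_PATH, LLM_MODEL_NAME
# sessions_manager is an initialized ChatSessionManager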

from maeser.graphs.simple_rag import get_simple_rag
from langgraph.graph.graph import CompiledGraph

# Create a system prompt for your chatbot with appended context. Example prompt:
my_prompt: str = """
    You are a helpful teacher helping a student with course material.
    You will answer a question based on the context provided.
    Don't answer questions about other things.

    {context}
"""

# Create a RAG graph pointing to your vector store
my_simple_rag: CompiledGraph = get_simple_rag(
    vectorstore_path=f"{VEC_STORE_PATH}/my_vectorstore",
    vectorstore_index="index", # the name of the .faiss and .pkl files in your vectorstore
    memory_filepath=f"{LOG_SOURCE_PATH}/my_branch.db",
    system_prompt_text=my_prompt,
    model=LLM_MODEL_NAME
)
# Register it as a new branch
sessions_manager.register_branch(
    branch_name="my_branch",
    branch_label="My Custom Knowledge",
    graph=my_simple_rag,    
)

Now, when you run the Maeser app, your new “My Custom Knowledge” branch will retrieve answers from your embedded content.


3.6. Automated Vector Store Building Tool#

The maeser folder contains a folder titled populate_data, which holds a tool for quickly and efficiently building vector stores for your model.

  • Make sure to run source .venv/bin/activate so you are working in your virtual environment.

  • cd into the populate_data folder.

  • Run make to install the required dependencies.

  • To launch the applet, run python WebClient.py.

    • In the applet you can upload any number of PDFs and name the dataset. The resulting vector store will appear in populate_data/data_stores.

    • Each run overwrites the previous output, so move the data to your desired location before generating a new set.


3.7. Additional Tips#

  • Rebuilding embeddings: Whenever you update your source documents, delete my_vectorstore/ and rerun the vector store creation step.

  • Batch embeddings: For large corpora, consider parallelizing the work or increasing the chunk_size parameter of OpenAIEmbeddings, which controls how many texts are embedded per API request (see the sketch after this list).

  • Alternative backends: You can swap FAISS for another vector store supported by LangChain (e.g., Chroma, Pinecone) by changing the import and the store-creation call, as sketched below.
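
For illustration, a minimal sketch covering both tips (assumptions: the langchain_chroma package is installed, and the chunk_size value is arbitrary):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# chunk_size controls how many texts are sent per embedding API request
embeddings = OpenAIEmbeddings(chunk_size=500)

# Swap FAISS for Chroma: same texts/metadatas as in section 3.3,
# persisted to a local directory instead of via save_local()
vectorstore = Chroma.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    persist_directory="my_chroma_store"
)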


3.8. Next Steps#