5. Embedding New Content#
For the Maeser chatbot to use your documents, they must first be converted to a vector store. Each vector store is a directory containing an index.faiss and an index.pkl file. When you embed your documents as a vector store, the chatbot can pull relevant excerpts from them to aid its interactions with users.
This guide explains how to use a Python script to embed your own documents into a vector store, enabling Retrieval‑Augmented Generation (RAG) over custom knowledge bases.
Note: You can also use the Admin Portal to create vector stores without any coding required.
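For example, after embedding, a vector store directory looks like this on disk (the file names match the save step later in this guide; the contents themselves are managed by FAISS):

my_vectorstore/
├── index.faiss   # the FAISS similarity index
└── index.pkl     # pickled docstore (chunks and their metadata)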
5.1. Prerequisites#
A Maeser development or user environment set up (see Development Setup).
Your documents in plain text format (e.g., .txt, Markdown .md, or PDF converted to text).
5.2. Prepare Your Text Documents#
Collect all source files you want to embed into a single folder (e.g., docs/homework, docs/labs, etc.).
If your files are not plain text (e.g., PDF), convert them first. A simple example using pdftotext on Linux:
pdftotext input.pdf output.txt
Ensure each file’s encoding is UTF‑8 to avoid errors when reading in Python.
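If you are unsure whether a file is UTF-8, a quick check like the following flags any problem files (a minimal sketch; the folder path and extension are placeholders for your own):

from pathlib import Path

for f in Path("path/to/docs").glob("*.txt"):
    try:
        f.read_text(encoding="utf8")
    except UnicodeDecodeError as err:
        print(f"{f.name} is not valid UTF-8: {err}")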
5.3. Embed Your Documents into a New Vector Store#
This guide walks you through building your own Python script for storing your data in vector stores. The code mirrors embeddings_example.py in the example/tools/ directory; feel free to use that file as a basis and modify it to your liking.
Note: The example/tools/ directory also contains the scripts create_byu_vectorstore.py and create_maeser_vectorstore.py. These scripts were used to create the example byu and maeser vector stores and can be a helpful reference for creating your own vector store script.
Your script for storing data will consist of two parts:
Chunking the documents into smaller segments.
Storing the chunks in FAISS vector store format.
5.3.1. Chunk Your Documents#
Large documents must be split into smaller, semantically meaningful chunks before embedding:
DOCS_PATH = "path/to/docs"  # Path to the files you want to embed
DOCS_TYPE = "txt"           # Type of documents to embed, e.g. "txt" or "md"

### Chunk the documents ###
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# Read all text files and generate chunks
texts = []      # this is where the text chunks will be stored
metadatas = []  # this keeps track of which text chunks belong to which file
folder = Path(DOCS_PATH)
for f in folder.glob(f"*.{DOCS_TYPE}"):
    doc = f.read_text(encoding="utf8")
    doc_name = f.stem  # file name without extension
    chunks = splitter.split_text(doc)
    texts.extend(chunks)                                    # store text chunks in texts
    metadatas.extend([{"source": doc_name}] * len(chunks))  # record each chunk's source document
Tip: Adjust chunk_size and chunk_overlap based on your document structure. Smaller chunks improve retrieval precision but increase vector store size.
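To see the effect of these settings, you can print a few statistics after chunking. This optional check is not part of the example script; it is a small sketch that reuses the texts and metadatas lists built above:

# Optional sanity check: how many chunks were produced, and how long are they?
num_docs = len({m["source"] for m in metadatas})
avg_len = sum(len(t) for t in texts) / max(len(texts), 1)
print(f"{len(texts)} chunks from {num_docs} documents, ~{avg_len:.0f} characters per chunk")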
5.3.2. Store the Chunked Document as a FAISS Vector Store#
Use FAISS (via LangChain) to embed and store your chunks:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Build the FAISS vector store from the chunks and their metadata
vectorstore = FAISS.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
)

# Persist to disk
vectorstore.save_local("my_vectorstore")
my_vectorstore/ will contain the index files (index.faiss and index.pkl) that you can reuse in Maeser.
The metadatas (source field) trace which document each chunk originated from. This information is passed to your chatbot when it retrieves context from your vector store.
Tip: The code above assumes the OPENAI_API_KEY environment variable is defined like so:

# Linux and Mac
$ export OPENAI_API_KEY="<your_api_key_here>"

# Windows (PowerShell)
> $Env:OPENAI_API_KEY = "<your_api_key_here>"

If you instead want to pass your API key into the script without using the environment variable, assign it when initializing your embeddings:
embeddings = OpenAIEmbeddings(api_key="<your_api_key_here>")
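Once saved, you can sanity-check the vector store by loading it back and running a similarity search. The sketch below is not part of the example script; the query string is a placeholder, and depending on your langchain-community version you may need allow_dangerous_deserialization=True because index.pkl is a pickle file:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "my_vectorstore",
    embeddings,
    allow_dangerous_deserialization=True,  # index.pkl is a pickle file
)

# Retrieve the two most similar chunks and show where each came from
for doc in vectorstore.similarity_search("example question about your documents", k=2):
    print(doc.metadata["source"], "->", doc.page_content[:80])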
5.4. Integrate with Maeser#
Now that you have created a new vector store for your documents, you must configure your chatbot to use it when generating a response. To do this, you'll need to:
Write a system prompt for your chatbot to follow
Initialize a RAG graph with your vector store
Register the system prompt and RAG graph to a new chat branch
If you are using one of the Simple RAG scripts in example/apps/simple/, add this to your Flask script:
# Add rules in my_prompt for your chatbot to follow. Example:
my_prompt: str = """You are a helpful professor.
You will answer students' questions based on the context provided.
If the question is unrelated to the context, politely inform the user that their question is outside the context of your resources.
{context}
"""
# Initialize the Simple RAG graph
my_simple_rag: CompiledGraph = get_simple_rag(
    vectorstore_path=f"{VEC_STORE_PATH}/my_vectorstore",
    vectorstore_index="index",
    memory_filepath=f"{LOG_SOURCE_PATH}/my_course.db",
    api_key=OPENAI_API_KEY,
    system_prompt_text=my_prompt,
    model=LLM_MODEL_NAME,
)
# Register the system prompt and RAG graph to a new chat branch
sessions_manager.register_branch(branch_name="my_course", branch_label="My Course", graph=my_simple_rag)
If you are using one of the Universal RAG or Pipeline RAG scripts, add this to your Flask script:
# NOTE: universal_prompt, vectorstore_config, and my_universal_rag already exist in your Flask script and only need to be altered
# Add rules in universal_prompt for your chatbot to follow. Example:
universal_prompt: str = """You are a helpful professor.
You will answer students' questions based on the context provided.
If the question is unrelated to the context, politely inform the user that their question is outside the context of your resources.
{context}
"""
# Add your vector store to vectorstore_config
# Ensure that topic names are all lowercase, with spaces between words
vectorstore_config = {
    # -- Other vector stores listed here -- #
    "my course": f"{VEC_STORE_PATH}/my_vectorstore",  # Path to your vector store
}
# Initialize the Universal RAG graph
my_universal_rag: CompiledGraph = get_universal_rag(
    vectorstore_config=vectorstore_config,
    memory_filepath=f"{LOG_SOURCE_PATH}/my_course.db",
    api_key=OPENAI_API_KEY,
    system_prompt_text=universal_prompt,
    model=LLM_MODEL_NAME,
)

# Register your RAG graph and chat branch
sessions_manager.register_branch(branch_name="my_course", branch_label="My Course", graph=my_universal_rag)
5.5. Additional Tips#
Rebuilding embeddings: Whenever you update your source documents, delete my_vectorstore/ and rerun the vector store creation step.
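For example, a minimal rebuild step might look like this (a sketch, assuming the chunking and embedding code above lives in the same script):

import shutil

# Remove the stale vector store, then rerun the chunking and
# FAISS.from_texts / save_local steps above to regenerate it.
shutil.rmtree("my_vectorstore", ignore_errors=True)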
5.6. Next Steps#
Review the example Flask scripts in Maeser Example (with Flask & User Management).
Explore custom graph workflows for advanced RAG pipelines in Graphs.
Prepare your custom Maeser application for server deployment.