基于PieCloudVector的 RAG实践

什么是RAG ？

检索增强生成（RAG）主要作用是对大型语言模型（LLM）的输出进行优化，使其能够在生成响应结果之前引用训练数据来源之外的权威知识库。大型语言模型用海量数据进行训练，使用数十亿个参数为回答问题、翻译语言和完成句子等任务生成原始输出。在 LLM 强大的功能基础上，RAG 将其扩展为能访问特定领域或组织的内部知识库，所有这些都无需重新训练模型。这是一种经济高效地改进 LLM 输出的方法，可以针对不同的应用场景，对大语言模型进行一些“定制化”，将通用大模型转化为契合自身独特业务需求和用例场景的专属模型。

如果没有 RAG，通常会直接将用户输入作为LLM的输入，并根据它所接受训练的信息或它已经知道的信息返回输出。RAG 引入了一个信息检索组件，此时对于一个用户输入，会首先基于该用户输入从信息检索组件提取相关的信息，提取出的相关信息会作为模型输入的上下文信息。然后将用户查询和相关上下文都提供给 LLM。LLM 使用提供的上下文及其训练数据来创建输出。具体如下图所示

创建外部数据源。 LLM 原始训练数据集之外的新数据称为外部数据。它可以来自多个数据来源，例如 API、数据库或文档存储库。数据可能以各种格式存在，例如文件、数据库记录或长篇文本，向量embedding等。本文中，外部数据会存储在OpenPie 开发的向量数据库 PieCloudVector中，外部数据保存其原始文本形式和原始文本对应的embedding信息。
处理用户输入。对于用户输入的一个Query, 在查询外部数据源前，进行预处理。例如可以对用户输入进行提取embedding处理，以通过向量相似搜索的方式到外部数据源检索相关上下文数据。
执行相关性搜索。用户输入提取embedding向量后，到外部数据源进行相关性搜索。PieCloudVector支持HNSW, IVFFLAT, IVFQD等高效的向量索引来加速该过程。
构建模型输入上下文。基于从外部数据源中检索到的与用户查询相似的数据构建模型上下文，例如可以使用最相似的top k 条数据的原始文本构建模型输入上下文。
模型输入。将用户输入和相关的上下文信息同时输入到模型。
模型输出。

相比于模型重新训练和微调， RAG具有如下特点：

经济高效： 针对内部或特定领域的数据，对LLM进行重新训练的硬件成本很高。RAG 是一种将新数据引入 LLM 的更加经济高效的方法。它使生成式人工智能技术更广泛地获得和使用。
实时性：使用 RAG 可以将 LLM 与实时更新的社交媒体、新闻网站等外部数据源进行连接，此时LLM 可以基于最新的数据向用户提供最新推理结果。
结果更加可信： 基于RAG，可以使得 LLM的输出可以包括对来源的引文或引用。用户如果需要进一步对模型输出结果进行确认，用户也可以自己查找源文档。这可以增加用户对生成式人工智能的信任和信心
模型输入控制：根据不同类型的任务， RAG可以控制使用不同的模型输入信息，例如，根据不同的权限级别控制模型输入，将不同敏感程度的数据作为模型的输入，以保护隐私。

什么是PieCloudVector？

PieCloudVector是拓数派开发的一款云原生向量计算引擎。PieCloudVector支持向量数据的高效存储和相似性检索、向量聚类和分类，高性能并行计算、强大的可扩展性和容错能力等特点。在架构设计方面，如下图所示，PieCloudVector的每个Executor对应一个PieCloudVector实例，从而实现向量存储和相似性搜索服务的高性能、可扩展性和可靠性。PieCloudVector支持丰富的客户端，借助PieCloudVector，用户不仅可以存储和管理原始数据对应的向量，还可以调用PieCloudVector相关工具进行模糊搜索。与全局搜索相比，牺牲了一定的精度来实现毫秒级的搜索，进一步提高了查询效率。

接下来，本文将以PieCloudVector来存放外部数据，语言模型使用Llama2，基于langchain实现一个完整的RAG工作流程。

准备外部数据源和模型

对于外部数据，我们使用根据拓数派发布的英文文档和博客整理的内部数据集，每条数据仅包含一段英文文本，格式如下所示。

Openpie is dedicated to "Data Computing for New Discoveries" and has successfully completed three rounds of strategic financing....
OpenPie's flagship product, PieCloudDB realizes cutting-edge data warehouse virtualization technology  ....
With continuous innovation of artificial intelligence (AI) technology, we can observe its increasingly widespread applications ...

我们使用langchain提供的 VectorStore接口对 PieCloudVector 进行了封装，将其封装为VectorStore的一个实现类。使用langchain提供的api 对外部数据进行切分，提取embedding 等处理后，将原始文本数据和embedding数据存储到PieCloudVector中。同时，为了加速相似向量的检索过程，还创建了HNSW索引。核心代码如下：

raw_doc_path = "./RAG-data/context-text"
loader = DirectoryLoader(raw_doc_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
doc_splits = text_splitter.split_documents(docs)
model_name = "BAAI/bge-base-en"
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)
CONNECTION_STRING = "postgresql+psycopg2://openpie@xx.xx.xx.xx:5432/openpie"
vectordb = PieCloudVector.from_documents(
    documents=doc_splits,  # text data that you want to embed and store
    embedding=embedding_function,  # used to convert the documents into embeddings
    connection_string=CONNECTION_STRING,
    collection_name="docs_v1"
)
vectordb.create_hnsw_index(dims=768, index_key="HNSW32", ef_construction=40, ef_search=16)

外部数据写入到PieCloudVector后，每条数据包含两个重要字段 embedding 和document, 格式如下所示：

{                                                                                                                   
  "embedding": [-0.0087991655,-0.027009273,0.0033726105,0.018299054,0549,0.045432627,-0.038479857,...],
	"document": "Openpie is dedicated to 'Data Computing for New Discoveries' and ... ",
}

使用huggingface的transformers库加载 Llama2模型，并构造任务流水线

MODEL_NAME = "NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    use_fast=True,
    add_eos_token=True,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    use_safetensors=True,
    trust_remote_code=True,
    device_map='auto',
    load_in_8bit=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

llm = HuggingFacePipeline(pipeline=pipe)

推理

LangChain定义了一个Retriever接口，它包装了一个可以返回与查询字符串相关的文档的方法。PieCloudVector类也实现了这个接口。

推理过程中，首先将PieCloudVector转换为一个Retriever, 针对每一个查询，该retriever会到PieCloudVector中进行查询，返回最相似的3条数据。然后整合大模型，外部数据源构造问答任务链。最后输入问题执行推理任务。

retriever = vectordb.as_retriever(search_kwargs={"k": 3})
retrieval_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

query = "What is PieCloudVector? and any advantages of PieCloudVector?  please describe in short words"
response = retrieval_qa_chain(query)

使用了RAG后，对于问题 “What is PieCloudVector? and any advantages of PieCloudVector? please describe in short words” ，模型的输入不仅包含了问题信息，必要的提示，还包含了从外部数据源检索到的问题的上下文信息，具体形式如下所示：

{
"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know,
 don't try to make up an answer.",
 'PieCloudVector vector database has the capability to perform fast queries on trillion-scale vector databases. 
  It supports single-node multi-threaded index creation, effectively utilizing all available hardware computational resources.
  This results in a five-fold improvement in index creation performance, 
  a six-fold improvement in retrieval performance, and a three-fold improvement in interactive response speed.
  PieCloudVector, in conjunction with Soochow Securities Xiucai GPT, 
  forms the overall RAG architecture. PieCloudVector primarily stores the embedded vector data 
  while also supporting storage of scalar data for applications. Additionally,  ....', 
  'Question: What is PieCloudVector? and any advantages of PieCloudVector?  please describe in short words',
}

推理结果分析

使用了RAG后，问题 “What is PieCloudVector? and any advantages of PieCloudVector? please describe in short words” 的输出如下所示。可见Llama2模型根据输入的上下文信息，基本可以输出一个正确结果。

'Helpful Answer:
PieCloudVector is a distributed vector database developed by OpenPie. 
It offers high scalability, low latency, and efficient query processing, 
making it suitable for large-scale vector data analysis tasks such as 
recommendation systems, image recognition, and natural language processing.
Some key features include support for multiple indexing methods (e.g., B+ tree, hash table), 
parallelized query execution, and fault tolerance through replication and redundancy techniques. 
Overall, PieCloudVector can help organizations process massive amounts of 
unstructured data quickly and efficiently, leading to 
improved decision-making and better customer experiences.'

而如果不使用RAG, 直接将问题输入 Llama2，得到的输出如下。由于Llama2的训练数据中缺少对PieCloudVector的知识，因此不能很到的回答这个问题也在情理之中，这也从侧面反映了RAG的强大。

Question: What is PieCloudVector? and any advantages of PieCloudVector?  please describe in short words.
Answer: Comment: @user1095108 I've added a link to the documentation, which should answer your questions.