{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZjlwUPWugM37"
   },
   "source": [
    "# Use ChromaDocumentStore with Haystack\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "135w48jbgRRU"
   },
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "znSRD-hO2doM"
   },
   "outputs": [],
   "source": [
    "# Install the Chroma integration, Haystack will come as a dependency\n",
    "!pip install -U chroma-haystack \"huggingface_hub>=0.22.0\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gt_XhGXBgU-I"
   },
   "source": [
    "## Indexing Pipeline: preprocess, split and index documents\n",
    "In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the [VIM User Manuel](https://vimhelp.org/) into the Haystack [`ChromaDocumentStore`](https://haystack.deepset.ai/integrations/chroma-documentstore).\n",
    "\n",
    " We have the `.txt` files for these pages in the examples folder for the `ChromaDocumentStore`, so we are using the [`TextFileToDocument`](https://docs.haystack.deepset.ai/docs/textfiletodocument) and [`DocumentWriter`](https://docs.haystack.deepset.ai/docs/documentwriter) components to build this indexing pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "fGxsA9C74BWr"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: main: File exists\n",
      "mv: rename main/integrations/chroma/example/data to ./data: Directory not empty\n"
     ]
    }
   ],
   "source": [
    "# Fetch data files from the Github repo\n",
    "!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar\n",
    "!mkdir main\n",
    "!tar xf main.tar -C main --strip-components 1\n",
    "!mv main/integrations/chroma/example/data ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "ayyBKQIC3jGo"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'writer': {'documents_written': 36}}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from haystack import Pipeline\n",
    "from haystack.components.converters import TextFileToDocument\n",
    "from haystack.components.writers import DocumentWriter\n",
    "\n",
    "from haystack_integrations.document_stores.chroma import ChromaDocumentStore\n",
    "\n",
    "file_paths = [\"data\" / Path(name) for name in os.listdir(\"data\")]\n",
    "\n",
    "# Chroma is used in-memory so we use the same instances in the two pipelines below\n",
    "document_store = ChromaDocumentStore()\n",
    "\n",
    "indexing = Pipeline()\n",
    "indexing.add_component(\"converter\", TextFileToDocument())\n",
    "indexing.add_component(\"writer\", DocumentWriter(document_store))\n",
    "indexing.connect(\"converter\", \"writer\")\n",
    "indexing.run({\"converter\": {\"sources\": file_paths}})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "44cRT55agw2e"
   },
   "source": [
    "## Query Pipeline: build retrieval-augmented generation (RAG) pipelines\n",
    "\n",
    "Once we have documents in the `ChromaDocumentStore`, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma's [query API](https://docs.trychroma.com/usage-guide#querying-a-collection).\n",
    "\n",
    "You can change the indexing pipeline and query pipelines here for embedding search by using one of the [`Haystack Embedders`](https://docs.haystack.deepset.ai/docs/embedders) accompanied by the  `ChromaEmbeddingRetriever`.\n",
    "\n",
    "\n",
    "In this example we are using:\n",
    "- The [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/docs/openaichatgenerator) with `gpt-4o-mini`. (You will need a OpenAI API key to use this model). You can replace this with any of the other [`Generators`](https://docs.haystack.deepset.ai/docs/generators)\n",
    "- The [`ChatPromptBuilder`](https://docs.haystack.deepset.ai/docs/chatpromptbuilder) which holds the prompt template. You can adjust this to a prompt of your choice\n",
    "- The [`ChromaQueryTextRetriver`](https://docs.haystack.deepset.ai/docs/chromaqueryretriever) which expects a list of queries and retieves the `top_k` most relevant documents from your Chroma collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "id": "WGGApIR3pllW"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter OpenAI API key: ········\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter OpenAI API key:\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "id": "YQJTPYNreNV-"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>\n",
       "🚅 Components\n",
       "  - retriever: ChromaQueryTextRetriever\n",
       "  - prompt_builder: ChatPromptBuilder\n",
       "  - llm: OpenAIChatGenerator\n",
       "🛤️ Connections\n",
       "  - retriever.documents -> prompt_builder.documents (List[Document])\n",
       "  - prompt_builder.prompt -> llm.messages (List[ChatMessage])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever\n",
    "from haystack.components.generators.chat import OpenAIChatGenerator\n",
    "from haystack.components.builders import ChatPromptBuilder\n",
    "from haystack.dataclasses.chat_message import ChatMessage\n",
    "\n",
    "prompt = \"\"\"\n",
    "Answer the query based on the provided context.\n",
    "If the context does not contain the answer, say 'Answer not found'.\n",
    "Context:\n",
    "{% for doc in documents %}\n",
    "  {{ doc.content }}\n",
    "{% endfor %}\n",
    "query: {{query}}\n",
    "Answer:\n",
    "\"\"\"\n",
    "\n",
    "template = [ChatMessage.from_user(prompt)]\n",
    "prompt_builder = ChatPromptBuilder(template=template)\n",
    "\n",
    "llm = OpenAIChatGenerator()\n",
    "retriever = ChromaQueryTextRetriever(document_store)\n",
    "\n",
    "querying = Pipeline()\n",
    "querying.add_component(\"retriever\", retriever)\n",
    "querying.add_component(\"prompt_builder\", prompt_builder)\n",
    "querying.add_component(\"llm\", llm)\n",
    "\n",
    "querying.connect(\"retriever.documents\", \"prompt_builder.documents\")\n",
    "querying.connect(\"prompt_builder\", \"llm\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "O8jcmcdqrGu1"
   },
   "outputs": [],
   "source": [
    "query = \"Should I write documentation for my plugin?\"\n",
    "results = querying.run({\"retriever\": {\"query\": query, \"top_k\": 3}, \"prompt_builder\": {\"query\": query}})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "id": "Pa7f7EzjtBXw"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.\n"
     ]
    }
   ],
   "source": [
    "print(results[\"llm\"][\"replies\"][0].text)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
