{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_coq_qCuItbN"
      },
      "source": [
        "# Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis\n",
        "\n",
        "One foggy day in October 2023, I was narrowly excused from jury duty. I had mixed feelings about it, since it actually seemed like a pretty interesting case (Google v. Sonos). A few months later, I idly wondered how the proceedings turned out. I could just read the news, but what's the fun in that? Let's see how AI can solve this problem.\n",
        "\n",
        "[Jina.ai](https://jina.ai/) recently released `jina-embeddings-v2-base-en`. It's an open-source text embedding model capable of accommodating up to 8192 tokens. Splitting text into larger chunks is helpful for understanding longer documents. One of the use cases this model is especially suited for is legal document analysis.\n",
        "\n",
        "In this demo, we'll build a [RAG pipeline](https://www.deepset.ai/blog/llms-retrieval-augmentation) to discover the outcome of the Google v. Sonos case, using the following technologies:\n",
        "- the [`jina-embeddings-v2-base-en`](https://arxiv.org/abs/2310.19923) model\n",
        "- [Haystack](https://haystack.deepset.ai/), the open source LLM orchestration framework\n",
        "- [Chroma](https://docs.trychroma.com/getting-started) to store our vector embeddings, via the [Chroma Document Store Haystack integration](https://haystack.deepset.ai/integrations/chroma-documentstore)\n",
        "- the open source [Mistral 7B Instruct LLM](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)\n",
        "\n",
        "\n",
        "## Prerequisites:\n",
        "- You need a Jina AI key - [get a free one here](https://jina.ai/embeddings/).\n",
        "- You also need an [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bKgrm01CD3af"
      },
      "source": [
        "First, install all our required dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zFbYg2Yb_UgT",
        "outputId": "0d6d6753-19bc-4e73-fb15-aa8fae5b50b6"
      },
      "outputs": [],
      "source": [
        "!pip3 install pypdf"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "prmChq6T_T4l",
        "outputId": "2768c613-7cb4-4b23-de77-1c6a8cf24850"
      },
      "outputs": [],
      "source": [
        "!pip install haystack-ai jina-haystack chroma-haystack \"huggingface_hub>=0.22.0\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3fGOt677EYuE"
      },
      "source": [
        "Then input our credentials."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JRogtDXaAMIF",
        "outputId": "79ea9fda-6595-4d74-9462-fc35ba18476f"
      },
      "outputs": [],
      "source": [
        "from getpass import getpass\n",
        "import os\n",
        "\n",
        "os.environ[\"JINA_API_KEY\"]  = getpass(\"JINA api key:\")\n",
        "os.environ[\"HF_API_TOKEN\"] = getpass(\"Enter your HuggingFace api token: \")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KlMkLSVjJVoW"
      },
      "source": [
        "## Build an Indexing Pipeline\n",
        "\n",
        "At a high level, the `LinkContentFetcher` pulls this document from its URL. Then we convert it from a PDF into a Document object Haystack can understand.\n",
        "\n",
        "We preprocess it by removing whitespace and redundant substrings. Then split it into chunks, generate embeddings, and write these embeddings into the `ChromaDocumentStore`."
      ]
    },
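    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the chunking step concrete, here is a minimal plain-Python sketch of word-based splitting. It is not Haystack's implementation: `split_by_word` is a hypothetical helper, and it assumes that splitting on whitespace is a close enough stand-in for what `DocumentSplitter(split_by='word')` does.\n",
        "\n",
        "```python\n",
        "def split_by_word(text, split_length=500):\n",
        "    # Split on whitespace, then regroup into chunks of at most split_length words.\n",
        "    words = text.split()\n",
        "    return [\n",
        "        ' '.join(words[i:i + split_length])\n",
        "        for i in range(0, len(words), split_length)\n",
        "    ]\n",
        "\n",
        "\n",
        "# A stand-in document of 1200 words (made-up data, not the court filing).\n",
        "sample = ' '.join(f'word{i}' for i in range(1200))\n",
        "chunks = split_by_word(sample, split_length=500)\n",
        "print([len(chunk.split()) for chunk in chunks])  # prints [500, 500, 200]\n",
        "```\n",
        "\n",
        "The last chunk simply holds whatever words are left over, so chunk sizes are not guaranteed to be uniform."
      ]
    },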
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 339
        },
        "id": "3lx84bCtM4aS",
        "outputId": "bb99c198-fb05-4c64-d5b7-192e60957d1f"
      },
      "outputs": [],
      "source": [
        "from haystack_integrations.document_stores.chroma import ChromaDocumentStore\n",
        "document_store = ChromaDocumentStore()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YpY3UQiN07J-",
        "outputId": "77b55084-c547-4749-9745-24edf5883333"
      },
      "outputs": [],
      "source": [
        "from haystack import Pipeline\n",
        "\n",
        "from haystack.components.fetchers import LinkContentFetcher\n",
        "from haystack.components.converters import PyPDFToDocument\n",
        "from haystack.components.writers import DocumentWriter\n",
        "from haystack.components.preprocessors import DocumentCleaner\n",
        "from haystack.components.preprocessors import DocumentSplitter\n",
        "from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever\n",
        "from haystack.document_stores.types import DuplicatePolicy\n",
        "\n",
        "from haystack_integrations.components.embedders.jina import JinaDocumentEmbedder\n",
        "from haystack_integrations.components.embedders.jina import JinaTextEmbedder\n",
        "\n",
        "fetcher = LinkContentFetcher()\n",
        "converter = PyPDFToDocument()\n",
        "# remove repeated substrings to get rid of headers/footers\n",
        "cleaner = DocumentCleaner(remove_repeated_substrings=True)\n",
        "\n",
        "# Since jina-v2 can handle 8192 tokens, 500 words seems like a safe chunk size\n",
        "splitter = DocumentSplitter(split_by=\"word\", split_length=500)\n",
        "\n",
        "# DuplicatePolicy.SKIP is optional but helps avoid errors if you want to re-run the pipeline\n",
        "writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)\n",
        "\n",
        "retriever = ChromaEmbeddingRetriever(document_store=document_store)\n",
        "\n",
        "# There are both small and large embedding models available, depending on your computing resources and requirements.\n",
        "# Here we're using the larger model.\n",
        "document_embedder = JinaDocumentEmbedder(model=\"jina-embeddings-v2-base-en\")\n",
        "\n",
        "indexing_pipeline = Pipeline()\n",
        "indexing_pipeline.add_component(instance=fetcher, name=\"fetcher\")\n",
        "indexing_pipeline.add_component(instance=converter, name=\"converter\")\n",
        "indexing_pipeline.add_component(instance=cleaner, name=\"cleaner\")\n",
        "indexing_pipeline.add_component(instance=splitter, name=\"splitter\")\n",
        "indexing_pipeline.add_component(instance=document_embedder, name=\"embedder\")\n",
        "indexing_pipeline.add_component(instance=writer, name=\"writer\")\n",
        "\n",
        "indexing_pipeline.connect(\"fetcher.streams\", \"converter.sources\")\n",
        "indexing_pipeline.connect(\"converter.documents\", \"cleaner.documents\")\n",
        "indexing_pipeline.connect(\"cleaner.documents\", \"splitter.documents\")\n",
        "indexing_pipeline.connect(\"splitter.documents\", \"embedder.documents\")\n",
        "indexing_pipeline.connect(\"embedder.documents\", \"writer.documents\")\n",
        "\n",
        "# This case references Google V Sonos, October 2023\n",
        "urls = [\"https://cases.justia.com/federal/district-courts/california/candce/3:2020cv06754/366520/813/0.pdf\"]\n",
        "\n",
        "indexing_pipeline.run(data={\"fetcher\": {\"urls\": urls}})\n"
      ]
    },
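    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`DuplicatePolicy.SKIP` matters when you re-run the indexing cell against the same store. The sketch below models the idea with a plain dictionary. `doc_id` and `write_skipping_duplicates` are hypothetical helpers, and hashing only the chunk text is an assumption made for illustration, not a description of how Haystack derives document IDs.\n",
        "\n",
        "```python\n",
        "import hashlib\n",
        "\n",
        "\n",
        "def doc_id(content):\n",
        "    # Hypothetical ID scheme: a SHA-256 hash of the chunk text.\n",
        "    return hashlib.sha256(content.encode('utf-8')).hexdigest()\n",
        "\n",
        "\n",
        "def write_skipping_duplicates(store, contents):\n",
        "    # Write only chunks whose ID is not in the store yet; return the number written.\n",
        "    written = 0\n",
        "    for content in contents:\n",
        "        key = doc_id(content)\n",
        "        if key not in store:\n",
        "            store[key] = content\n",
        "            written += 1\n",
        "    return written\n",
        "\n",
        "\n",
        "store = {}\n",
        "first_run = write_skipping_duplicates(store, ['chunk one', 'chunk two'])\n",
        "second_run = write_skipping_duplicates(store, ['chunk one', 'chunk two', 'chunk three'])\n",
        "print(first_run, second_run, len(store))  # prints 2 1 3\n",
        "```\n",
        "\n",
        "On the second run, the two chunks already in the store are skipped without raising an error, and only the new chunk is written."
      ]
    },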
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OvBHvwqoa3YJ"
      },
      "source": [
        "# Query pipeline\n",
        "\n",
        "Now the real fun begins. Let's create a query pipeline so we can actually start asking questions. We write a prompt allowing us to pass our documents to the Mistral-7B LLM. Then we initiatialize the LLM via the `HuggingFaceAPIGenerator`.\n",
        "\n",
        "To use this model, you need to accept the conditions here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1\n",
        "\n",
        "In Haystack 2.0 `retriever`s are tightly coupled to `DocumentStores`. If we pass in the `retriever` we initialized earlier, this pipeline can access those embeddings we generated, and pass them to the LLM."
      ]
    },
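    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of what embedding retrieval does, the sketch below ranks a few toy chunks against a query vector and keeps the best `top_k`. Everything here is an assumption for illustration: the two-dimensional vectors are made up, `cosine_similarity` and `retrieve_top_k` are hypothetical helpers, and the metric Chroma actually uses depends on how the collection is configured.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def cosine_similarity(a, b):\n",
        "    # Dot product divided by the product of the vector lengths.\n",
        "    dot = sum(x * y for x, y in zip(a, b))\n",
        "    norm_a = math.sqrt(sum(x * x for x in a))\n",
        "    norm_b = math.sqrt(sum(y * y for y in b))\n",
        "    return dot / (norm_a * norm_b)\n",
        "\n",
        "\n",
        "def retrieve_top_k(query_embedding, indexed_chunks, top_k=3):\n",
        "    # Sort chunks from most to least similar to the query and keep the first top_k.\n",
        "    ranked = sorted(\n",
        "        indexed_chunks,\n",
        "        key=lambda chunk: cosine_similarity(query_embedding, chunk['embedding']),\n",
        "        reverse=True,\n",
        "    )\n",
        "    return ranked[:top_k]\n",
        "\n",
        "\n",
        "# Made-up embeddings; real ones come from the Jina embedder and are much longer.\n",
        "toy_chunks = [\n",
        "    {'content': 'chunk A', 'embedding': [1.0, 0.0]},\n",
        "    {'content': 'chunk B', 'embedding': [0.0, 1.0]},\n",
        "    {'content': 'chunk C', 'embedding': [0.7, 0.7]},\n",
        "]\n",
        "top = retrieve_top_k([1.0, 0.1], toy_chunks, top_k=2)\n",
        "print([chunk['content'] for chunk in top])  # prints ['chunk A', 'chunk C']\n",
        "```\n",
        "\n",
        "This is also why the question and the documents need to be embedded with the same model: the comparison only makes sense when both vectors live in the same space."
      ]
    },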
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bZefXI2cBRME"
      },
      "outputs": [],
      "source": [
        "\n",
        "from haystack.components.generators import HuggingFaceAPIGenerator\n",
        "from haystack.components.builders.prompt_builder import PromptBuilder\n",
        "\n",
        "from haystack_integrations.components.embedders.jina import JinaTextEmbedder\n",
        "prompt = \"\"\" Answer the question, based on the\n",
        "content in the documents. If you can't answer based on the documents, say so.\n",
        "\n",
        "Documents:\n",
        "{% for doc in documents %}\n",
        "  {{doc.content}}\n",
        "{% endfor %}\n",
        "\n",
        "question: {{question}}\n",
        "\"\"\"\n",
        "\n",
        "text_embedder = JinaTextEmbedder(model=\"jina-embeddings-v2-base-en\")\n",
        "generator = HuggingFaceAPIGenerator(\n",
        "    api_type=\"serverless_inference_api\",\n",
        "    api_params={\"model\": \"mistralai/Mixtral-8x7B-Instruct-v0.1\"})  \n",
        "\n",
        "\n",
        "prompt_builder = PromptBuilder(template=prompt)\n",
        "query_pipeline = Pipeline()\n",
        "query_pipeline.add_component(\"text_embedder\",text_embedder)\n",
        "query_pipeline.add_component(instance=prompt_builder, name=\"prompt_builder\")\n",
        "query_pipeline.add_component(\"retriever\", retriever)\n",
        "query_pipeline.add_component(\"generator\", generator)\n",
        "\n",
        "query_pipeline.connect(\"text_embedder.embedding\", \"retriever.query_embedding\")\n",
        "query_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n",
        "query_pipeline.connect(\"prompt_builder.prompt\", \"generator.prompt\")\n"
      ]
    },
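    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see roughly what the LLM receives, here is a plain-Python approximation of the rendered prompt. In the real pipeline the `PromptBuilder` renders the Jinja template, so exact whitespace may differ. `render_prompt_lines` is a hypothetical helper, the instruction text is shortened, and the two chunks are placeholders rather than real retrieved text.\n",
        "\n",
        "```python\n",
        "def render_prompt_lines(instructions, documents, question):\n",
        "    # Mirror the template layout: instructions, then each chunk, then the question.\n",
        "    lines = [instructions, '', 'Documents:']\n",
        "    lines.extend('  ' + doc for doc in documents)\n",
        "    lines.extend(['', 'question: ' + question])\n",
        "    return lines\n",
        "\n",
        "\n",
        "# Placeholder chunks standing in for whatever the retriever returns.\n",
        "retrieved = ['First retrieved chunk.', 'Second retrieved chunk.']\n",
        "prompt_lines = render_prompt_lines(\n",
        "    'Answer the question, based on the content in the documents.',\n",
        "    retrieved,\n",
        "    'Summarize what happened in Google v. Sonos',\n",
        ")\n",
        "for line in prompt_lines:\n",
        "    print(line)\n",
        "```\n",
        "\n",
        "Each retrieved chunk is pasted into the prompt in full, so `top_k` and the chunk size together determine how much text the LLM has to read."
      ]
    },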
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WuHpvfGZGB7b"
      },
      "source": [
        "Time to ask a question!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "O3VxIscXJMOS",
        "outputId": "f2ed03f3-efe9-44f0-e7e3-a5bdc8500c56"
      },
      "outputs": [],
      "source": [
        "question = \"Summarize what happened in Google v. Sonos\"\n",
        "\n",
        "result = query_pipeline.run(data={\"text_embedder\":{\"text\": question},\n",
        "                                  \"retriever\": {\"top_k\": 3},\n",
        "                                  \"prompt_builder\":{\"question\": question},\n",
        "                                  \"generator\": {\"generation_kwargs\": {\"max_new_tokens\": 350}}})\n",
        "\n",
        "print(result['generator']['replies'][0])\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cJE9E6LmzEwB"
      },
      "source": [
        "### Other questions you could try:\n",
        "- What role did If This Then That play in Google v. Sonos?\n",
        "- What judge presided over Google v. Sonos?\n",
        "- What should Sonos have done differently?\n",
        "\n",
        "\n",
        "### Alternate cases to explore\n",
        "The indexing pipeline is written so that you can swap in other documents and analyze them. can You can try plugging the following URLs (or any PDF written in English) into the indexing pipeline and re-running all the code blocks below it.\n",
        "- Google v. Oracle: https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf\n",
        "- JACK DANIEL’S PROPERTIES, INC. v. VIP PRODUCTS\n",
        "LLC: https://www.supremecourt.gov/opinions/22pdf/22-148_3e04.pdf\n",
        "\n",
        "Note: if you want to change the prompt template, you'll also need to re-run the code blocks starting where the `DocumentStore` is defined.\n",
        "\n",
        "### Wrapping it up\n",
        "Thanks for reading! If you're interested in learning more about the technologies used here, check out these blog posts:\n",
        "- [Embeddings in Depth](https://jina.ai/news/embeddings-in-depth/)\n",
        "- [What is text vectorization in NLP?](https://haystack.deepset.ai/blog/what-is-text-vectorization-in-nlp)\n",
        "- [The definitive guide to BERT models](https://haystack.deepset.ai/blog/the-definitive-guide-to-bertmodels)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
