{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cb7RLf9gpEoN"
      },
      "source": [
        "# Hybrid Retrieval: BM42 + Dense Retrieval\n",
        "\n",
        "<img src=\"https://qdrant.tech/articles_data/bm42/preview/title.webp\" width=\"800\" style=\"display:inline;\"/>\n",
        "\n",
        "In this notebook, we will see how to create Hybrid Retrieval pipelines, combining BM42 (a new Sparse embedding Retrieval approach) and Dense embedding Retrieval.\n",
        "\n",
        "We will use the Qdrant Document Store and Fastembed Embedders.\n",
        "\n",
        "⚠️ Recent evaluations have raised questions about the validity of BM42. Future developments may address these concerns. Please keep this in mind while reviewing the content."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I2ATFcgVpTWc"
      },
      "source": [
        "## Why BM42?\n",
        "\n",
        "[Qdrant introduced BM42](https://qdrant.tech/articles/bm42/), an algorithm designed to replace BM25 in hybrid RAG pipelines (dense + sparse retrieval).\n",
        "\n",
        "They found that BM25, while relevant for a long time, has some limitations in common RAG scenarios.\n",
        "\n",
        "Let's first take a look at BM25 and SPLADE to understand the motivation and the inspiration for BM42.\n",
        "\n",
        "**BM25**\n",
        "\\begin{equation}\n",
        "\\text{score}(D,Q) = \\sum_{i=1}^{N} \\text{IDF}(q_i) \\times \\frac{f(q_i, D) \\cdot (k_1 + 1)}{f(q_i, D) + k_1 \\cdot \\left(1 - b + b \\cdot \\frac{|D|}{\\text{avgdl}}\\right)}\\\n",
        "\\end{equation}\n",
        "\n",
        "\n",
        "BM25 is an evolution of TF-IDF and has two components:\n",
        "- Inverse Document Frequency = term importance within a collection\n",
        "- a component incorporating Term Frequency = term importance within a document\n",
        "\n",
        "Qdrant folks observed that the TF component relies on document statistics, which only makes sense for longer texts.\n",
        "This is not the case with common RAG pipelines, where documents are short.\n",
        "\n",
        "**SPLADE**\n",
        "\n",
        "Another interesting approach is SPLADE, which uses a BERT-based model to create a bag-of-words representation of the text.\n",
        "While it generally performs better than BM25, it has some drawbacks:\n",
        "- tokenization issues with out-of-vocabulary words\n",
        "- adaptation to new domains requires fine-tuning\n",
        "- computationally heavy\n",
        "\n",
        "*For using SPLADE with Haystack, see [this notebook](https://github.com/deepset-ai/haystack-cookbook/blob/main/notebooks/sparse_embedding_retrieval.ipynb).*\n",
        "\n",
        "**BM42**\n",
        "\n",
        "\\begin{equation}\n",
        "\\text{score}(D,Q) = \\sum_{i=1}^{N} \\text{IDF}(q_i) \\times \\text{Attention}(\\text{CLS}, q_i)\n",
        "\\end{equation}\n",
        "\n",
        "Taking inspiration from SPLADE, the Qdrant team developed BM42 to improve BM25.\n",
        "\n",
        "IDF works well, so they kept it.\n",
        "\n",
        "But how to quantify term importance within a document?\n",
        "\n",
        "The attention matrix of Transformer models comes to our aid:\n",
        "we can the use attention row for the [CLS] token!\n",
        "\n",
        "To fix tokenization issues, BM42 merges subwords and sums their attention weights.\n",
        "\n",
        "In their implementation, Qdrant team used all-MiniLM-L6-v2 model, but this technique can work with any Transformer, no fine-tuning needed.\n",
        "\n",
        "\n",
        "⚠️ Recent evaluations have raised questions about the validity of BM42. Future developments may address these concerns. Please keep this in mind while reviewing the content."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LQ-L4Gf2Hfci"
      },
      "source": [
        "## Install dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "tnSq1XK_ovZV",
        "outputId": "c3ef26d6-457a-4a6b-f739-42065d2fe203"
      },
      "outputs": [],
      "source": [
        "!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9RHRrrQh3wqL"
      },
      "source": [
        "## Hybrid Retrieval"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XxKy73D1wPhH"
      },
      "source": [
        "### Indexing"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8pw_uDcZwdDb"
      },
      "source": [
        "#### Create a Qdrant Document Store"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "eAKP9icf1Inj"
      },
      "outputs": [],
      "source": [
        "from haystack_integrations.document_stores.qdrant import QdrantDocumentStore\n",
        "\n",
        "document_store = QdrantDocumentStore(\n",
        "    \":memory:\",\n",
        "    recreate_index=True,\n",
        "    embedding_dim=384,\n",
        "    return_embedding=True,\n",
        "    use_sparse_embeddings=True,  # set this parameter to True, otherwise the collection schema won't allow to store sparse vectors\n",
        "    sparse_idf=True  # required for BM42, allows streaming updates of the sparse embeddings while keeping the IDF calculation up-to-date\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x8Bpy1ri_Ipx"
      },
      "source": [
        "#### Download Wikipedia pages and create raw documents\n",
        "\n",
        "We download a few Wikipedia pages about animals and create Haystack documents from them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "7FpCSSnUzuuP"
      },
      "outputs": [],
      "source": [
        "nice_animals= [\"Capybara\", \"Dolphin\", \"Orca\", \"Walrus\"]\n",
        "\n",
        "import wikipedia\n",
        "from haystack.dataclasses import Document\n",
        "\n",
        "raw_docs=[]\n",
        "for title in nice_animals:\n",
        "    page = wikipedia.page(title=title, auto_suggest=False)\n",
        "    doc = Document(content=page.content, meta={\"title\": page.title, \"url\":page.url})\n",
        "    raw_docs.append(doc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DLiNhYKV_g8u"
      },
      "source": [
        "#### Indexing pipeline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wO4GvIkEGoJE"
      },
      "source": [
        "Our indexing pipeline includes both a Sparse Document Embedder (based on BM42) and a Dense Document Embedder."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "a1taDmfx1HCM"
      },
      "outputs": [],
      "source": [
        "from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter\n",
        "from haystack.components.writers import DocumentWriter\n",
        "from haystack.document_stores.types import DuplicatePolicy\n",
        "from haystack import Pipeline\n",
        "from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Bs-oOLF1Y7PB",
        "outputId": "f5e0e3bb-52ff-4e27-f700-b31a911664aa"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33a2f0>\n",
              "🚅 Components\n",
              "  - cleaner: DocumentCleaner\n",
              "  - splitter: DocumentSplitter\n",
              "  - sparse_doc_embedder: FastembedSparseDocumentEmbedder\n",
              "  - dense_doc_embedder: FastembedDocumentEmbedder\n",
              "  - writer: DocumentWriter\n",
              "🛤️ Connections\n",
              "  - cleaner.documents -> splitter.documents (List[Document])\n",
              "  - splitter.documents -> sparse_doc_embedder.documents (List[Document])\n",
              "  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])\n",
              "  - dense_doc_embedder.documents -> writer.documents (List[Document])"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "hybrid_indexing = Pipeline()\n",
        "hybrid_indexing.add_component(\"cleaner\", DocumentCleaner())\n",
        "hybrid_indexing.add_component(\"splitter\", DocumentSplitter(split_by='sentence', split_length=4))\n",
        "hybrid_indexing.add_component(\"sparse_doc_embedder\", FastembedSparseDocumentEmbedder(model=\"Qdrant/bm42-all-minilm-l6-v2-attentions\", meta_fields_to_embed=[\"title\"]))\n",
        "hybrid_indexing.add_component(\"dense_doc_embedder\", FastembedDocumentEmbedder(model=\"BAAI/bge-small-en-v1.5\", meta_fields_to_embed=[\"title\"]))\n",
        "hybrid_indexing.add_component(\"writer\", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))\n",
        "\n",
        "hybrid_indexing.connect(\"cleaner\", \"splitter\")\n",
        "hybrid_indexing.connect(\"splitter\", \"sparse_doc_embedder\")\n",
        "hybrid_indexing.connect(\"sparse_doc_embedder\", \"dense_doc_embedder\")\n",
        "hybrid_indexing.connect(\"dense_doc_embedder\", \"writer\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hmLUyhjZyfWv"
      },
      "source": [
        "#### Let's index our documents!\n",
        "⚠️ If you are running this notebook on Google Colab, please note that Google Colab only provides 2 CPU cores, so the embedding generation with Fastembed could be not as fast as it can be on a standard machine."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "dyBwOzM-9Tqm",
        "outputId": "b49c2cba-28a9-432f-ccfe-7cb26724f275"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Calculating sparse embeddings: 100%|██████████| 340/340 [00:27<00:00, 12.52it/s]\n",
            "Calculating embeddings: 100%|██████████| 340/340 [01:23<00:00,  4.07it/s]\n",
            "400it [00:00, 1179.66it/s]                         \n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "{'writer': {'documents_written': 340}}"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "hybrid_indexing.run({\"documents\":raw_docs})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HMB-AxGwyx1c",
        "outputId": "a3269ef5-9a91-4587-f639-899dbe90b6d8"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "340"
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "document_store.count_documents()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AZFoiFczyvBx"
      },
      "source": [
        "### Retrieval"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L-mUxbqn3l63"
      },
      "source": [
        "#### Retrieval pipeline\n",
        "\n",
        "As already mentioned, BM42 is designed to perform best in Hybrid Retrieval (and Hybrid RAG) pipelines.\n",
        "\n",
        "- `FastembedSparseTextEmbedder`: transforms the query into a sparse embedding\n",
        "- `FastembedTextEmbedder`: transforms the query into a dense embedding\n",
        "- `QdrantHybridRetriever`: looks for relevant documents, based on the similarity of both the embeddings\n",
        "\n",
        "Qdrant Hybrid Retriever compares dense and sparse query and document embeddings and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.\n",
        "\n",
        "If you want to customize the fusion behavior more, see Hybrid Retrieval Pipelines ([tutorial](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval))."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HTlEGit3_XFk",
        "outputId": "c598858a-a611-4c1d-cda3-20509a9877f8"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33ae30>\n",
              "🚅 Components\n",
              "  - sparse_text_embedder: FastembedSparseTextEmbedder\n",
              "  - dense_text_embedder: FastembedTextEmbedder\n",
              "  - retriever: QdrantHybridRetriever\n",
              "🛤️ Connections\n",
              "  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)\n",
              "  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])"
            ]
          },
          "execution_count": 28,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever\n",
        "from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder, FastembedSparseTextEmbedder\n",
        "\n",
        "\n",
        "hybrid_query = Pipeline()\n",
        "hybrid_query.add_component(\"sparse_text_embedder\", FastembedSparseTextEmbedder(model=\"Qdrant/bm42-all-minilm-l6-v2-attentions\"))\n",
        "hybrid_query.add_component(\"dense_text_embedder\", FastembedTextEmbedder(model=\"BAAI/bge-small-en-v1.5\", prefix=\"Represent this sentence for searching relevant passages: \"))\n",
        "hybrid_query.add_component(\"retriever\", QdrantHybridRetriever(document_store=document_store, top_k=5))\n",
        "\n",
        "hybrid_query.connect(\"sparse_text_embedder.sparse_embedding\", \"retriever.query_sparse_embedding\")\n",
        "hybrid_query.connect(\"dense_text_embedder.embedding\", \"retriever.query_embedding\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sQnk_qCW890T"
      },
      "source": [
        "#### Try the retrieval pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "NpUwlxIj6O0R",
        "outputId": "c4ff0b37-bf61-4190-b868-fd4b94d4617e"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 82.10it/s]\n",
            "Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.75it/s]\n"
          ]
        }
      ],
      "source": [
        "question = \"Who eats fish?\"\n",
        "\n",
        "results = hybrid_query.run(\n",
        "    {\"dense_text_embedder\": {\"text\": question},\n",
        "     \"sparse_text_embedder\": {\"text\": question}}\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 721
        },
        "id": "rcFsJVoqQ4zQ",
        "outputId": "ee110953-12b3-4a61-dbf8-de23819198e7"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 370071638e221257cf77702716695626d9b1b4dfe4212b4a10e255434bfeb08b\n",
              "Orca\n",
              " Some populations in the Norwegian and Greenland sea specialize in herring and follow that fish's autumnal \n",
              "migration to the Norwegian coast. Salmon account for <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">96</span>% of northeast Pacific residents' diet, including <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">65</span>% of \n",
              "large, fatty Chinook. Chum salmon are also eaten, but smaller sockeye and pink salmon are not a significant food \n",
              "item. Depletion of specific prey species in an area is, therefore, cause for concern for local populations, despite\n",
              "the high diversity of prey.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.5</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 370071638e221257cf77702716695626d9b1b4dfe4212b4a10e255434bfeb08b\n",
              "Orca\n",
              " Some populations in the Norwegian and Greenland sea specialize in herring and follow that fish's autumnal \n",
              "migration to the Norwegian coast. Salmon account for \u001b[1;36m96\u001b[0m% of northeast Pacific residents' diet, including \u001b[1;36m65\u001b[0m% of \n",
              "large, fatty Chinook. Chum salmon are also eaten, but smaller sockeye and pink salmon are not a significant food \n",
              "item. Depletion of specific prey species in an area is, therefore, cause for concern for local populations, despite\n",
              "the high diversity of prey.\n",
              "score: \u001b[1;36m0.5\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 1ed8f49561630f10202b55c8c7619a32cd9f6a11675cbb56c64a578826e488ef\n",
              "Orca\n",
              "; Ellis, Graeme M. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2006</span><span style=\"font-weight: bold\">)</span>. <span style=\"color: #008000; text-decoration-color: #008000\">\"Selective foraging by fish-eating killer whales Orcinus orca in British Columbia\"</span>. \n",
              "Marine Ecology Progress Series.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.5</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 1ed8f49561630f10202b55c8c7619a32cd9f6a11675cbb56c64a578826e488ef\n",
              "Orca\n",
              "; Ellis, Graeme M. \u001b[1m(\u001b[0m\u001b[1;36m2006\u001b[0m\u001b[1m)\u001b[0m. \u001b[32m\"Selective foraging by fish-eating killer whales Orcinus orca in British Columbia\"\u001b[0m. \n",
              "Marine Ecology Progress Series.\n",
              "score: \u001b[1;36m0.5\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: a9bb77dac4747c4fba48a7464038c9da206d7e3663d837f2c95f6d882de8111e\n",
              "Orca\n",
              " On average, an orca eats <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">227</span> kilograms <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">500</span> lb<span style=\"font-weight: bold\">)</span> each day. While salmon are usually hunted by an individual whale \n",
              "or a small group, herring are often caught using carousel feeding: the orcas force the herring into a tight ball by\n",
              "releasing bursts of bubbles or flashing their white undersides. They then slap the ball with their tail flukes, \n",
              "stunning or killing up to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">15</span> fish at a time, then eating them one by one. Carousel feeding has been documented only\n",
              "in the Norwegian orca population, as well as some oceanic dolphin species.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.41666666666666663</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: a9bb77dac4747c4fba48a7464038c9da206d7e3663d837f2c95f6d882de8111e\n",
              "Orca\n",
              " On average, an orca eats \u001b[1;36m227\u001b[0m kilograms \u001b[1m(\u001b[0m\u001b[1;36m500\u001b[0m lb\u001b[1m)\u001b[0m each day. While salmon are usually hunted by an individual whale \n",
              "or a small group, herring are often caught using carousel feeding: the orcas force the herring into a tight ball by\n",
              "releasing bursts of bubbles or flashing their white undersides. They then slap the ball with their tail flukes, \n",
              "stunning or killing up to \u001b[1;36m15\u001b[0m fish at a time, then eating them one by one. Carousel feeding has been documented only\n",
              "in the Norwegian orca population, as well as some oceanic dolphin species.\n",
              "score: \u001b[1;36m0.41666666666666663\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 33fdef8b4f33f4c5ce00cbbc9e3cb3605b778131854436d4bb7e54f5adaf79ae\n",
              "Dolphin\n",
              " === Consumption === ==== Cuisine ==== In some parts of the world, such as Taiji, Japan and the Faroe Islands, \n",
              "dolphins are traditionally considered as food, and are killed in harpoon or drive hunts.\n",
              "Dolphin meat is consumed in a small number of countries worldwide, which include Japan and Peru <span style=\"font-weight: bold\">(</span>where it is \n",
              "referred to as chancho marino, or <span style=\"color: #008000; text-decoration-color: #008000\">\"sea pork\"</span><span style=\"font-weight: bold\">)</span>. While Japan may be the best-known and most controversial example, \n",
              "only a very small minority of the population has ever sampled it.\n",
              "Dolphin meat is dense and such a dark shade of red as to appear black.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.3333333333333333</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 33fdef8b4f33f4c5ce00cbbc9e3cb3605b778131854436d4bb7e54f5adaf79ae\n",
              "Dolphin\n",
              " === Consumption === ==== Cuisine ==== In some parts of the world, such as Taiji, Japan and the Faroe Islands, \n",
              "dolphins are traditionally considered as food, and are killed in harpoon or drive hunts.\n",
              "Dolphin meat is consumed in a small number of countries worldwide, which include Japan and Peru \u001b[1m(\u001b[0mwhere it is \n",
              "referred to as chancho marino, or \u001b[32m\"sea pork\"\u001b[0m\u001b[1m)\u001b[0m. While Japan may be the best-known and most controversial example, \n",
              "only a very small minority of the population has ever sampled it.\n",
              "Dolphin meat is dense and such a dark shade of red as to appear black.\n",
              "score: \u001b[1;36m0.3333333333333333\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 6b643c8aa3d47fc198063f8bbc98828bd1d2368d22c95b6b97c36beb60b7fbd0\n",
              "Orca\n",
              "<span style=\"color: #008000; text-decoration-color: #008000\">\" Although large variation in the ecological distinctiveness of different orca groups complicate simple </span>\n",
              "<span style=\"color: #008000; text-decoration-color: #008000\">differentiation into types, research off the west coast of North America has identified fish-eating \"</span>residents\", \n",
              "mammal-eating <span style=\"color: #008000; text-decoration-color: #008000\">\"transients\"</span> and <span style=\"color: #008000; text-decoration-color: #008000\">\"offshores\"</span>. Other populations have not been as well studied, although specialized \n",
              "fish and mammal eating orcas have been distinguished elsewhere. Mammal-eating orcas in different regions were long \n",
              "thought likely to be closely related, but genetic testing has refuted this hypothesis. A <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2024</span> study supported the \n",
              "elevation of Eastern North American resident and transient orcas as distinct species, O.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.3333333333333333</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 6b643c8aa3d47fc198063f8bbc98828bd1d2368d22c95b6b97c36beb60b7fbd0\n",
              "Orca\n",
              "\u001b[32m\" Although large variation in the ecological distinctiveness of different orca groups complicate simple \u001b[0m\n",
              "\u001b[32mdifferentiation into types, research off the west coast of North America has identified fish-eating \"\u001b[0mresidents\", \n",
              "mammal-eating \u001b[32m\"transients\"\u001b[0m and \u001b[32m\"offshores\"\u001b[0m. Other populations have not been as well studied, although specialized \n",
              "fish and mammal eating orcas have been distinguished elsewhere. Mammal-eating orcas in different regions were long \n",
              "thought likely to be closely related, but genetic testing has refuted this hypothesis. A \u001b[1;36m2024\u001b[0m study supported the \n",
              "elevation of Eastern North American resident and transient orcas as distinct species, O.\n",
              "score: \u001b[1;36m0.3333333333333333\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "import rich\n",
        "\n",
        "for d in results['retriever']['documents']:\n",
        "  rich.print(f\"\\nid: {d.id}\\n{d.meta['title']}\\n{d.content}\\nscore: {d.score}\\n---\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wC-CwKnkKzxn",
        "outputId": "7a3f3be9-5eac-4705-bda9-8be836794108"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 71.98it/s]\n",
            "Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.90it/s]\n"
          ]
        }
      ],
      "source": [
        "question = \"capybara social behavior\"\n",
        "\n",
        "results = hybrid_query.run(\n",
        "    {\"dense_text_embedder\": {\"text\": question},\n",
        "     \"sparse_text_embedder\": {\"text\": question}}\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 609
        },
        "id": "IGayh5GTK_pb",
        "outputId": "ce00cc9d-3e50-418d-ceca-c90aa477e6b3"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: d35c090ebfdad52eb882915b0ee2a9578c751a243ecef3a2c941ef0713a7c9aa\n",
              "Capybara\n",
              " The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species \n",
              "and can be found in groups as large as <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">100</span> individuals, but usually live in groups of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">10</span>–<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20</span> individuals. The \n",
              "capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology ==\n",
              "Its common name is derived from Tupi ka'apiûara, a complex agglutination of kaá <span style=\"font-weight: bold\">(</span>leaf<span style=\"font-weight: bold\">)</span> + píi <span style=\"font-weight: bold\">(</span>slender<span style=\"font-weight: bold\">)</span> + ú <span style=\"font-weight: bold\">(</span>eat<span style=\"font-weight: bold\">)</span> + \n",
              "ara <span style=\"font-weight: bold\">(</span>a suffix for agent nouns<span style=\"font-weight: bold\">)</span>, meaning <span style=\"color: #008000; text-decoration-color: #008000\">\"one who eats slender leaves\"</span>, or <span style=\"color: #008000; text-decoration-color: #008000\">\"grass-eater\"</span>.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.7</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: d35c090ebfdad52eb882915b0ee2a9578c751a243ecef3a2c941ef0713a7c9aa\n",
              "Capybara\n",
              " The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species \n",
              "and can be found in groups as large as \u001b[1;36m100\u001b[0m individuals, but usually live in groups of \u001b[1;36m10\u001b[0m–\u001b[1;36m20\u001b[0m individuals. The \n",
              "capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology ==\n",
              "Its common name is derived from Tupi ka'apiûara, a complex agglutination of kaá \u001b[1m(\u001b[0mleaf\u001b[1m)\u001b[0m + píi \u001b[1m(\u001b[0mslender\u001b[1m)\u001b[0m + ú \u001b[1m(\u001b[0meat\u001b[1m)\u001b[0m + \n",
              "ara \u001b[1m(\u001b[0ma suffix for agent nouns\u001b[1m)\u001b[0m, meaning \u001b[32m\"one who eats slender leaves\"\u001b[0m, or \u001b[32m\"grass-eater\"\u001b[0m.\n",
              "score: \u001b[1;36m0.7\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: e1b0dcc9a1d01481052af5964616f438073f201ffaa0605282a7ddaf90fcafaf\n",
              "Capybara\n",
              " Males establish social bonds, dominance, or general group consensus. They can make dog-like barks when threatened \n",
              "or when females are herding young.\n",
              "Capybaras have two types of scent glands: a morrillo, located on the snout, and anal glands. Both sexes have these \n",
              "glands, but males have much larger morrillos and use their anal glands more frequently.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.6666666666666666</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: e1b0dcc9a1d01481052af5964616f438073f201ffaa0605282a7ddaf90fcafaf\n",
              "Capybara\n",
              " Males establish social bonds, dominance, or general group consensus. They can make dog-like barks when threatened \n",
              "or when females are herding young.\n",
              "Capybaras have two types of scent glands: a morrillo, located on the snout, and anal glands. Both sexes have these \n",
              "glands, but males have much larger morrillos and use their anal glands more frequently.\n",
              "score: \u001b[1;36m0.6666666666666666\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: fd11addea30e8ae2f1d60274beae4d42646b075eb0579bef1c2899cde1e1bb2b\n",
              "Capybara\n",
              "<span style=\"color: #00ff00; text-decoration-color: #00ff00; font-weight: bold\">1.31.0.1</span>.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.5</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: fd11addea30e8ae2f1d60274beae4d42646b075eb0579bef1c2899cde1e1bb2b\n",
              "Capybara\n",
              "\u001b[1;92m1.31.0.1\u001b[0m.\n",
              "score: \u001b[1;36m0.5\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 1600c15a21aa722965ef2cc4fab4e622474fc8d1ff9e0c555c955e78b038ee2d\n",
              "Capybara\n",
              " In addition, a female alerts males she is in estrus by whistling through her nose. During mating, the female has \n",
              "the advantage and mating choice. Capybaras mate only in water, and if a female does not want to mate with a certain\n",
              "male, she either submerges or leaves the water. Dominant males are highly protective of the females, but they \n",
              "usually cannot prevent some of the subordinates from copulating.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.25</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 1600c15a21aa722965ef2cc4fab4e622474fc8d1ff9e0c555c955e78b038ee2d\n",
              "Capybara\n",
              " In addition, a female alerts males she is in estrus by whistling through her nose. During mating, the female has \n",
              "the advantage and mating choice. Capybaras mate only in water, and if a female does not want to mate with a certain\n",
              "male, she either submerges or leaves the water. Dominant males are highly protective of the females, but they \n",
              "usually cannot prevent some of the subordinates from copulating.\n",
              "score: \u001b[1;36m0.25\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
              "id: 994f31c23e46c16744558b3a499cff0c446da33661a74bb2ddaede9e26e64e11\n",
              "Capybara\n",
              "<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">40</span> ft<span style=\"font-weight: bold\">)</span> in length, stand <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">50</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">62</span> cm <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">24</span> in<span style=\"font-weight: bold\">)</span> tall at the withers, and typically weigh <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">35</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">66</span> kg <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">77</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">146</span> \n",
              "lb<span style=\"font-weight: bold\">)</span>, with an average in the Venezuelan llanos of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">48.9</span> kg <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">108</span> lb<span style=\"font-weight: bold\">)</span>. Females are slightly heavier than males. The top\n",
              "recorded weights are <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">91</span> kg <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">201</span> lb<span style=\"font-weight: bold\">)</span> for a wild female from Brazil and <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">73</span>.\n",
              "score: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.25</span>\n",
              "---\n",
              "</pre>\n"
            ],
            "text/plain": [
              "\n",
              "id: 994f31c23e46c16744558b3a499cff0c446da33661a74bb2ddaede9e26e64e11\n",
              "Capybara\n",
              "\u001b[1;36m40\u001b[0m ft\u001b[1m)\u001b[0m in length, stand \u001b[1;36m50\u001b[0m to \u001b[1;36m62\u001b[0m cm \u001b[1m(\u001b[0m\u001b[1;36m20\u001b[0m to \u001b[1;36m24\u001b[0m in\u001b[1m)\u001b[0m tall at the withers, and typically weigh \u001b[1;36m35\u001b[0m to \u001b[1;36m66\u001b[0m kg \u001b[1m(\u001b[0m\u001b[1;36m77\u001b[0m to \u001b[1;36m146\u001b[0m \n",
              "lb\u001b[1m)\u001b[0m, with an average in the Venezuelan llanos of \u001b[1;36m48.9\u001b[0m kg \u001b[1m(\u001b[0m\u001b[1;36m108\u001b[0m lb\u001b[1m)\u001b[0m. Females are slightly heavier than males. The top\n",
              "recorded weights are \u001b[1;36m91\u001b[0m kg \u001b[1m(\u001b[0m\u001b[1;36m201\u001b[0m lb\u001b[1m)\u001b[0m for a wild female from Brazil and \u001b[1;36m73\u001b[0m.\n",
              "score: \u001b[1;36m0.25\u001b[0m\n",
              "---\n"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "import rich\n",
        "\n",
        "# Show each retrieved document with its id, title, content, and score\n",
        "for d in results['retriever']['documents']:\n",
        "    rich.print(f\"\\nid: {d.id}\\n{d.meta['title']}\\n{d.content}\\nscore: {d.score}\\n---\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5mHu_jbeFcyQ"
      },
      "source": [
        "## 📚 Resources\n",
        "- [BM42: New Baseline for Hybrid Search - article by Qdrant](https://qdrant.tech/articles/bm42/)\n",
        "- [Sparse Embedding Retrieval with SPLADE - notebook](https://github.com/deepset-ai/haystack-cookbook/blob/main/notebooks/sparse_embedding_retrieval.ipynb)\n",
        "- Haystack docs:\n",
        "  - [Retrievers](https://docs.haystack.deepset.ai/docs/retrievers)\n",
        "  - [Qdrant Sparse Embedding Retriever](https://docs.haystack.deepset.ai/docs/qdrantsparseembeddingretriever)\n",
        "  - [Qdrant Hybrid Retriever](https://docs.haystack.deepset.ai/docs/qdranthybridretriever)\n",
        "  - [FastEmbed Sparse Text Embedder](https://docs.haystack.deepset.ai/docs/fastembedsparsetextembedder)\n",
        "  - [FastEmbed Sparse Document Embedder](https://docs.haystack.deepset.ai/docs/fastembedsparsedocumentembedder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w4i2BQL9En-H"
      },
      "source": [
        "(*Notebook by [Stefano Fiorucci](https://github.com/anakin87)*)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
