{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3hGCrW6Ue-7v"
   },
   "source": [
    "# Improving Retrieval with Auto-Merging and Hierarchical Document Retrieval\n",
    "\n",
    "This notebook shows how to use two Haystack components, `AutoMergingRetriever` and `HierarchicalDocumentSplitter`, to improve retrieval.\n",
    "\n",
    "- 📚[Read the full article here](https://haystack.deepset.ai/blog/improve-retrieval-with-auto-merging)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TeOAvHZsBXhf"
   },
   "source": [
    "## Setting up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LaJsFx4P1o_l",
    "outputId": "a5b29fa2-6d74-4ccf-e732-77c8a4f68491"
   },
   "outputs": [],
   "source": [
    "!pip install haystack-ai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2bgDKD-GgX4l"
   },
   "source": [
    "## Let's get a dataset to index and explore\n",
    "\n",
    "- We will use a dataset of 2225 news articles that accompanies the paper \"Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering\" by D. Greene and P. Cunningham, Proc. ICML 2006.\n",
    "\n",
    "- The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but we will instead use a CSV processed version available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "cpMYVx1VY7Z7",
    "outputId": "521dbe20-c6dc-4897-c4d7-764b6b82cea1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-09-06 09:41:04--  https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 5080260 (4.8M) [text/plain]\n",
      "Saving to: ‘bbc-news-data.csv’\n",
      "\n",
      "bbc-news-data.csv   100%[===================>]   4.84M  --.-KB/s    in 0.09s   \n",
      "\n",
      "2024-09-06 09:41:05 (56.4 MB/s) - ‘bbc-news-data.csv’ saved [5080260/5080260]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4gm-olXqhyoE"
   },
   "source": [
    "\n",
    "## Let's convert the raw data into Haystack Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "BNfHo3tN_8mQ"
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "from typing import List\n",
    "from haystack import Document\n",
    "\n",
    "def read_documents() -> List[Document]:\n",
    "    with open(\"bbc-news-data.csv\", \"r\") as file:\n",
    "        reader = csv.reader(file, delimiter=\"\\t\")\n",
    "        next(reader, None)  # skip the headers\n",
    "        documents = []\n",
    "        for row in reader:\n",
    "            category = row[0].strip()\n",
    "            title = row[2].strip()\n",
    "            text = row[3].strip()\n",
    "            documents.append(Document(content=text, meta={\"category\": category, \"title\": title}))\n",
    "\n",
    "    return documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1ThlSJguh7ok"
   },
   "outputs": [],
   "source": [
    "docs = read_documents()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "_OSZntiEh7zs",
    "outputId": "3af289d5-b829-4717-c1b3-ac7a1fe2ad27"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(id=8b0eec9b4039d3c21eed119c9cbf1022a172f6b96661a391c76ee9a00b388334, content: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to...', meta: {'category': 'business', 'title': 'Ad sales boost Time Warner profit'}),\n",
       " Document(id=0b20edb280b3c492d81751d97aa67f008759b242f2596d56c6816bacb5ea0c08, content: 'The dollar has hit its highest level against the euro in almost three months after the Federal Reser...', meta: {'category': 'business', 'title': 'Dollar gains on Greenspan speech'}),\n",
       " Document(id=9465b0a3c9e81843db56beb8cb3183b14810e8fc7b3195bd37718296f3a13e31, content: 'The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit t...', meta: {'category': 'business', 'title': 'Yukos unit buyer faces loan claim'}),\n",
       " Document(id=151d64ed92b61b1b9e58c52a90e7ab4be964c0e47aaf1a233dfb93110986d9cd, content: 'British Airways has blamed high fuel prices for a 40% drop in profits.  Reporting its results for th...', meta: {'category': 'business', 'title': \"High fuel prices hit BA's profits\"}),\n",
       " Document(id=4355d611f770b814f9e7d33959ad9d16b69048650ed0eaf24f1bce3e8ab5bf4c, content: 'Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the targe...', meta: {'category': 'business', 'title': 'Pernod takeover talk lifts Domecq'})]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jJKxxkysiacd"
   },
   "source": [
    "We can see that we have successfully created Documents.\n",
    "\n",
    "## Document Splitting and Indexing\n",
    "\n",
    "Now we split each document into smaller ones, creating a hierarchical document structure that connects each smaller child document with its corresponding parent document.\n",
    "\n",
    "We also create two document stores, one for the leaf documents and the other for the parent documents."
   ]
  },
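  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the parent/child structure concrete, here is a minimal pure-Python sketch of word-based hierarchical splitting. It is not Haystack's implementation: the `Node` class, the `split_hierarchically` function, and the integer ids are hypothetical names used only for illustration, and the block sizes are chosen to suit the toy sentence.\n",
    "\n",
    "```python\n",
    "from dataclasses import dataclass, field\n",
    "from itertools import count\n",
    "from typing import List, Optional\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class Node:\n",
    "    # Hypothetical stand-in for a document; not Haystack's Document class.\n",
    "    id: int\n",
    "    content: str\n",
    "    level: int\n",
    "    parent_id: Optional[int] = None\n",
    "    children_ids: List[int] = field(default_factory=list)\n",
    "\n",
    "\n",
    "def split_hierarchically(text: str, block_sizes: List[int]) -> List[Node]:\n",
    "    # Split by words into nested blocks, applying the largest block size first.\n",
    "    ids = count()\n",
    "    root = Node(id=next(ids), content=text, level=0)\n",
    "    nodes = [root]\n",
    "    frontier = [root]\n",
    "    for level, size in enumerate(sorted(block_sizes, reverse=True), start=1):\n",
    "        next_frontier = []\n",
    "        for parent in frontier:\n",
    "            words = parent.content.split()\n",
    "            for start in range(0, len(words), size):\n",
    "                chunk = ' '.join(words[start:start + size])\n",
    "                child = Node(id=next(ids), content=chunk, level=level, parent_id=parent.id)\n",
    "                parent.children_ids.append(child.id)  # link parent -> child\n",
    "                nodes.append(child)\n",
    "                next_frontier.append(child)\n",
    "        frontier = next_frontier\n",
    "    return nodes\n",
    "\n",
    "\n",
    "nodes = split_hierarchically('one two three four five six seven', block_sizes=[3, 6])\n",
    "# Leaves are the nodes that were never split further.\n",
    "leaves = [n for n in nodes if not n.children_ids]\n",
    "for n in leaves:\n",
    "    print(n.level, n.parent_id, repr(n.content))\n",
    "```\n",
    "\n",
    "In this sketch the root sits at level 0, the six-word blocks at level 1, and the three-word blocks at level 2. Every child records its `parent_id`, which is the link a merging step needs in order to walk back up the hierarchy."
   ]
  },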
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7KV0CNqJAVM5"
   },
   "outputs": [],
   "source": [
    "from typing import Tuple\n",
    "\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "\n",
    "from haystack.components.preprocessors import HierarchicalDocumentSplitter\n",
    "\n",
    "def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:\n",
    "    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by=\"word\")\n",
    "    docs = splitter.run(documents)\n",
    "\n",
    "    # Store the leaf documents in one document store\n",
    "    leaf_documents = [doc for doc in docs[\"documents\"] if doc.meta[\"__level\"] == 1]\n",
    "    leaf_doc_store = InMemoryDocumentStore()\n",
    "    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)\n",
    "\n",
    "    # Store the parent documents in another document store\n",
    "    parent_documents = [doc for doc in docs[\"documents\"] if doc.meta[\"__level\"] == 0]\n",
    "    parent_doc_store = InMemoryDocumentStore()\n",
    "    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)\n",
    "\n",
    "    return leaf_doc_store, parent_doc_store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VJDoInqQnQaT"
   },
   "outputs": [],
   "source": [
    "leaf_doc_store, parent_doc_store = indexing(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VDMahtVkpMAT"
   },
   "source": [
    "## Retrieving Documents with Auto-Merging\n",
    "\n",
    "We are now ready to query the document store using the `AutoMergingRetriever`. Let's build a pipeline in which an `InMemoryBM25Retriever` handles the user query and passes the retrieved leaf documents to the `AutoMergingRetriever`. Based on those documents and the hierarchical structure, the `AutoMergingRetriever` decides whether to return the leaf documents or their parent document."
   ]
  },
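  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The merging decision can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: `auto_merge`, `parent_of`, and `children_of` are hypothetical names, only one level of merging is shown, and the strict `>` comparison against the threshold is an assumption.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def auto_merge(\n",
    "    retrieved_leaf_ids: List[str],\n",
    "    parent_of: Dict[str, str],\n",
    "    children_of: Dict[str, List[str]],\n",
    "    threshold: float = 0.6,\n",
    ") -> List[str]:\n",
    "    # Group the retrieved leaves by their parent.\n",
    "    hits = defaultdict(list)\n",
    "    for leaf_id in retrieved_leaf_ids:\n",
    "        hits[parent_of[leaf_id]].append(leaf_id)\n",
    "\n",
    "    merged = []\n",
    "    for parent_id, leaf_ids in hits.items():\n",
    "        # Fraction of this parent's children that were retrieved.\n",
    "        score = len(leaf_ids) / len(children_of[parent_id])\n",
    "        if score > threshold:  # assumption: strict comparison\n",
    "            merged.append(parent_id)  # enough children matched: return the parent\n",
    "        else:\n",
    "            merged.extend(leaf_ids)  # too few matched: keep the leaves\n",
    "    return merged\n",
    "\n",
    "\n",
    "children_of = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}\n",
    "parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}\n",
    "\n",
    "result = auto_merge(['a1', 'a2', 'b1'], parent_of, children_of, threshold=0.6)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "With two of the three children of `A` retrieved (about 0.67), `A` replaces its leaves, while `B` has only one of three (about 0.33), so `b1` is returned unchanged. Raising the threshold to 0.7 in this toy example would leave all three leaves unmerged."
   ]
  },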
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ysiW52XcAWRB"
   },
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
    "from haystack.components.retrievers import AutoMergingRetriever\n",
    "\n",
    "def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):\n",
    "    pipeline = Pipeline()\n",
    "    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)\n",
    "    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)\n",
    "    pipeline.add_component(instance=bm25_retriever, name=\"BM25Retriever\")\n",
    "    pipeline.add_component(instance=auto_merge_retriever, name=\"AutoMergingRetriever\")\n",
    "    pipeline.connect(\"BM25Retriever.documents\", \"AutoMergingRetriever.documents\")\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CCf73zK6sdZ1"
   },
   "source": [
    "Let's create this pipeline with the `AutoMergingRetriever` threshold set to 0.6."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PTNb-HZCp3fm"
   },
   "outputs": [],
   "source": [
    "pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BzlnybXQstXB"
   },
   "source": [
    "Let's now query the pipeline for articles related to cybersecurity. We also use the pipeline parameter `include_outputs_from` to get the outputs of the `BM25Retriever` component alongside the final results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Gem7W5JmsY1u"
   },
   "outputs": [],
   "source": [
    "result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'}, include_outputs_from={'BM25Retriever'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "S62-2ScPs0l0",
    "outputId": "3e4e6b52-70e6-4da0-aecb-078b46996764"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(result['AutoMergingRetriever']['documents'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ZB0XGbCgs9jO",
    "outputId": "48fdbcbb-9add-46db-fa44-67cc4b576fc9"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(result['BM25Retriever']['documents'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "m60y5n2Qs-8n"
   },
   "outputs": [],
   "source": [
    "retrieved_doc_titles_bm25 = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "88VzuL25tAPr",
    "outputId": "5e339156-1026-49eb-e1f3-6849b1a2284f"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Bad e-mail habits sustains spam',\n",
       " 'Cyber criminals step up the pace',\n",
       " 'Cyber criminals step up the pace',\n",
       " 'More women turn to net security',\n",
       " 'Rich pickings for hi-tech thieves',\n",
       " 'Screensaver tackles spam websites',\n",
       " 'Security scares spark browser fix',\n",
       " 'Solutions to net security fears',\n",
       " 'Solutions to net security fears',\n",
       " 'Spam e-mails tempt net shoppers']"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retrieved_doc_titles_bm25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PfmhKk-_tBX-"
   },
   "outputs": [],
   "source": [
    "retrieved_doc_titles_automerging = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "hMKXEjOmtDCf",
    "outputId": "6f17ddff-b08b-4671-dedf-8358506ff725"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Bad e-mail habits sustains spam',\n",
       " 'Cyber criminals step up the pace',\n",
       " 'Cyber criminals step up the pace',\n",
       " 'More women turn to net security',\n",
       " 'Rich pickings for hi-tech thieves',\n",
       " 'Screensaver tackles spam websites',\n",
       " 'Security scares spark browser fix',\n",
       " 'Solutions to net security fears',\n",
       " 'Solutions to net security fears',\n",
       " 'Spam e-mails tempt net shoppers']"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retrieved_doc_titles_automerging"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
