{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0ac8d556-0ff2-4919-a159-2242a953ce44",
   "metadata": {},
   "source": [
    "# LLMMetaDataExtractor: seamless metadata extraction from documents with just a prompt\n",
    "\n",
    "Notebook by [David S. Batista](https://www.davidsbatista.net/)\n",
    "\n",
    "This notebook shows how to use [`LLMMetadataExtractor`](https://docs.haystack.deepset.ai/docs/llmmetadataextractor), we will use a arge Language Model to perform metadata extraction from a Document."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e1dbc3-807a-43c9-84af-a3fe6d3ffee5",
   "metadata": {},
   "source": [
    "## Setting Up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c15ea1c-2e4f-456d-9f4a-689196a00df6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!uv pip install haystack-ai\n",
    "!uv pip install \"sentence-transformers>=3.0.0\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd3037f2-5b03-4446-87c2-1e5275d5053c",
   "metadata": {},
   "source": [
    "## Initialize LLMMetadataExtractor\n",
    "\n",
    "Let's define what kind of metadata we want to extract from our documents, we wil do it through a LLM prompt, which will then be used by the `LLMMetadataExtractor` component. In this case we want to extract named-entities from our documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "60d608ef-fa00-468f-888f-f6e22f4d113c",
   "metadata": {},
   "outputs": [],
   "source": [
    "NER_PROMPT = \"\"\"\n",
    "    -Goal-\n",
    "    Given text and a list of entity types, identify all entities of those types from the text.\n",
    "\n",
    "    -Steps-\n",
    "    1. Identify all entities. For each identified entity, extract the following information:\n",
    "    - entity: Name of the entity\n",
    "    - entity_type: One of the following types: [organization, product, service, industry]\n",
    "    Format each entity as a JSON like: {\"entity\": <entity_name>, \"entity_type\": <entity_type>}\n",
    "\n",
    "    2. Return output in a single list with all the entities identified in steps 1.\n",
    "\n",
    "    -Examples-\n",
    "    ######################\n",
    "    Example 1:\n",
    "    entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]\n",
    "    text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top\n",
    "    10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of\n",
    "    our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer\n",
    "    base and high cross-border usage.\n",
    "    We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership\n",
    "    with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global\n",
    "    Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the\n",
    "    United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent\n",
    "    agreement with Emirates Skywards.\n",
    "    And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital\n",
    "    issuers are equally\n",
    "    ------------------------\n",
    "    output:\n",
    "    {\"entities\": [{\"entity\": \"Visa\", \"entity_type\": \"company\"}, {\"entity\": \"Alaska Airlines\", \"entity_type\": \"company\"}, {\"entity\": \"Qatar Airways\", \"entity_type\": \"company\"}, {\"entity\": \"British Airways\", \"entity_type\": \"company\"}, {\"entity\": \"National Bank of Kuwait\", \"entity_type\": \"company\"}, {\"entity\": \"Marriott\", \"entity_type\": \"company\"}, {\"entity\": \"Qatar Islamic Bank\", \"entity_type\": \"company\"}, {\"entity\": \"Emirates Skywards\", \"entity_type\": \"company\"}, {\"entity\": \"Royal Air Maroc\", \"entity_type\": \"company\"}]}\n",
    "    #############################\n",
    "    -Real Data-\n",
    "    ######################\n",
    "    entity_types: [company, organization, person, country, product, service]\n",
    "    text: {{ document.content }}\n",
    "    ######################\n",
    "    output:\n",
    "    \"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce98f9f8-a9b3-4289-827d-9a98e1e8063b",
   "metadata": {},
   "source": [
    "Let's initialise an instance of the `LLMMetadataExtractor` using OpenAI as the LLM provider and the prompt defined above to perform metadata extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "548be4fb-36af-4dde-a100-635530299820",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db457f51-45e3-48de-a7e5-25396d2a08e2",
   "metadata": {},
   "source": [
    "We will also need to set the OPENAI_API_KEY"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1599015d-6101-4c2b-9b13-4db357674aee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "Enter OpenAI API key: ········\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "  os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter OpenAI API key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7508ce5b-e987-4ab4-8fe3-aa292e7ab6d3",
   "metadata": {},
   "source": [
    "We will instatiate a `LLMMetadataExtractor` instance using the OpenAI as LLM provider. Notice that the parameter `prompt` is set to the prompt we defined above, and that we also need to set which keys should be present in the JSON ouput, in this case \"entities\".\n",
    "\n",
    "Another important aspect is the `raise_on_failure=False`, if for some document the LLM fails (e.g.: network error, or doesn't return a valid JSON object) we continue the processing of all the documents in the input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "1257ebd8-4e5e-4db6-a80a-680735c3fe2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor\n",
    "from haystack.components.generators.chat import OpenAIChatGenerator\n",
    "\n",
    "chat_generator = OpenAIChatGenerator(\n",
    "    generation_kwargs={\n",
    "        \"max_tokens\": 500,\n",
    "        \"temperature\": 0.0,\n",
    "        \"seed\": 0,\n",
    "        \"response_format\": {\"type\": \"json_object\"},\n",
    "    },\n",
    "    max_retries=1,\n",
    "    timeout=60.0,\n",
    ")\n",
    "\n",
    "metadata_extractor = LLMMetadataExtractor(\n",
    "    prompt=NER_PROMPT,\n",
    "    chat_generator=chat_generator,\n",
    "    expected_keys=[\"entities\"],\n",
    "    raise_on_failure=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21fb3a38-dd30-4239-a004-141bbdf99273",
   "metadata": {},
   "source": [
    "### Let's define documents from which the component will extract metadata, i.e.: named-entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "953b8e85-c673-4d61-a93c-2ae2995546a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "87671953-ad8c-4469-b164-af6ada2923a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = [\n",
    "    Document(content=\"deepset was founded in 2018 in Berlin, and is known for its Haystack framework\"),    \n",
    "    Document(content=\"Hugging Face is a company that was founded in New York, USA and is known for its Transformers library\"),\n",
    "    Document(content=\"Google was founded in 1998 by Larry Page and Sergey Brin\"),\n",
    "    Document(content=\"Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot\"),\n",
    "    Document(content=\"Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens\")\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4620031-6e35-4b4a-8663-cc36d493d83e",
   "metadata": {},
   "source": [
    "and let's extract :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "52c1df77-dced-4edb-90f5-a032c3951eb1",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = metadata_extractor.run(documents=docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0078fa4d-7481-435a-9458-ecc2b99eb075",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'documents': [Document(id=05fe6674dd4faf3dcaa991f9e6d520c9185d5644c4ac2b8b52276e6b70a831f2, content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework', meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}),\n",
       "  Document(id=37364c858185cf02abc43b43db613d236baa4dd501453d6942681842863c313a, content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers librar...', meta: {'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}),\n",
       "  Document(id=eb4e2410115dfb7edc47b84853d0cdc845699120509346383896ed7d47354e2d, content: 'Google was founded in 1998 by Larry Page and Sergey Brin', meta: {'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}),\n",
       "  Document(id=ee28eff307d3a1d435f0515195e0a86e592b72b5570dcaddc4d3108769632596, content: 'Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot', meta: {'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}),\n",
       "  Document(id=0a56bf794d37839113a73634cc0f3ecab33744eeea7b682b49fd2dc51737aed8, content: 'Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded i...', meta: {'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]})],\n",
       " 'failed_documents': []}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbf38465-95df-459e-8ae3-f5620c5d7d5b",
   "metadata": {},
   "source": [
    "## Indexing Pipeline with Extraction\n",
    "\n",
    "Let's now build an indexing pipeline, where we simply give the Documents as input and get a Document Store with the documents indexed with metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "730c66fa-411b-4704-a6be-0b852b436d36",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<haystack.core.pipeline.pipeline.Pipeline object at 0x320d71010>\n",
       "🚅 Components\n",
       "  - metadata_extractor: LLMMetadataExtractor\n",
       "  - embedder: SentenceTransformersDocumentEmbedder\n",
       "  - writer: DocumentWriter\n",
       "🛤️ Connections\n",
       "  - metadata_extractor.documents -> embedder.documents (List[Document])\n",
       "  - embedder.documents -> writer.documents (List[Document])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "from haystack.components.writers import DocumentWriter\n",
    "\n",
    "doc_store = InMemoryDocumentStore()\n",
    "\n",
    "p = Pipeline()\n",
    "p.add_component(instance=metadata_extractor, name=\"metadata_extractor\")\n",
    "p.add_component(instance=SentenceTransformersDocumentEmbedder(model=\"sentence-transformers/all-MiniLM-L6-v2\"), name=\"embedder\")\n",
    "p.add_component(instance=DocumentWriter(document_store=doc_store), name=\"writer\")\n",
    "p.connect(\"metadata_extractor.documents\", \"embedder.documents\")\n",
    "p.connect(\"embedder.documents\", \"writer.documents\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2285213c",
   "metadata": {},
   "source": [
    "## Try it Out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "9367eeb5-26be-4bfd-8038-c8f3ce1ffb54",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6fd611dbaf48426b926e3d1e105e45bc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Batches:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'metadata_extractor': {'failed_documents': []},\n",
       " 'writer': {'documents_written': 5}}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.run(data={\"metadata_extractor\": {\"documents\": docs}})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "239d362b-5e16-4abf-965b-43dab482433f",
   "metadata": {},
   "source": [
    "Let's inspect the documents metadata in the document store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "658af74a-90bc-4271-95bb-6abe2684edaa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "deepset was founded in 2018 in Berlin, and is known for its Haystack framework\n",
      "{'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}\n",
      "\n",
      "---------\n",
      "Hugging Face is a company that was founded in New York, USA and is known for its Transformers library\n",
      "{'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}\n",
      "\n",
      "---------\n",
      "Google was founded in 1998 by Larry Page and Sergey Brin\n",
      "{'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}\n",
      "\n",
      "---------\n",
      "Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot\n",
      "{'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}\n",
      "\n",
      "---------\n",
      "Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens\n",
      "{'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]}\n",
      "\n",
      "---------\n"
     ]
    }
   ],
   "source": [
    "for doc in doc_store.storage.values():\n",
    "    print(doc.content)\n",
    "    print(doc.meta)\n",
    "    print(\"\\n---------\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
