{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t1BeKtSo7KzI"
      },
      "source": [
        "# Crawl Website Content for Question Answering with Apify\n",
        "\n",
        "Author: Jiri Spilka ([Apify](https://apify.com/jiri.spilka))\n",
        "\n",
        "In this tutorial, we'll use the [apify-haystack](https://github.com/apify/apify-haystack/tree/main) integration to call the [Website Content Crawler](https://apify.com/apify/website-content-crawler) and scrape text content from the [Haystack website](https://haystack.deepset.ai). Then, we'll use the [OpenAIDocumentEmbedder](https://docs.haystack.deepset.ai/docs/openaidocumentembedder) to compute text embeddings and the [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore) to store documents in a temporary in-memory database. The final step will be a retrieval-augmented generation (RAG) pipeline that answers users' questions from the scraped data.\n",
        "\n",
        "\n",
        "## Install dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "r5AJeMOE1Cou",
        "outputId": "4b47b345-bc7f-4e07-f69d-a9bbc324b5b6"
      },
      "outputs": [],
      "source": [
        "!pip install -q apify-haystack"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h6MmIG9K1HkK"
      },
      "source": [
        "## Set up the API keys\n",
        "\n",
        "You need an Apify account and an [APIFY_API_TOKEN](https://docs.apify.com/platform/integrations/api).\n",
        "\n",
        "You also need an OpenAI account and an [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yiUTwYzP36Yr",
        "outputId": "28c743be-d94f-447f-8839-7cdd5cf258be"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter YOUR APIFY_API_TOKEN··········\n",
            "Enter YOUR OPENAI_API_KEY··········\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "os.environ[\"APIFY_API_TOKEN\"] = getpass(\"Enter YOUR APIFY_API_TOKEN\")\n",
        "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter YOUR OPENAI_API_KEY\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HQzAujMc505k"
      },
      "source": [
        "## Use the Website Content Crawler to scrape data from the Haystack website\n",
        "\n",
        "Now, let's call the Website Content Crawler using the Haystack component `ApifyDatasetFromActorCall`. First, we define the crawler's input parameters, and then we specify which fields of the scraped data to save into the vector database.\n",
        "\n",
        "The `actor_id` and a detailed description of the input parameters (the `run_input` variable) can be found on the [Website Content Crawler input page](https://apify.com/apify/website-content-crawler/input-schema).\n",
        "\n",
        "For this example, we will define `startUrls` and limit the number of crawled pages to five."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "_AYgcfBx681h"
      },
      "outputs": [],
      "source": [
        "actor_id = \"apify/website-content-crawler\"\n",
        "run_input = {\n",
        "    \"maxCrawlPages\": 5,  # limit the number of pages to crawl\n",
        "    \"startUrls\": [{\"url\": \"https://haystack.deepset.ai/\"}],\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yIODy29t-_JY"
      },
      "source": [
        "Next, we need to define a dataset mapping function, which requires knowing the structure of the Website Content Crawler's output. Typically, it is a JSON object that looks like this (truncated for brevity):\n",
        "\n",
        "```json\n",
        "[\n",
        "  {\n",
        "    \"url\": \"https://haystack.deepset.ai/overview/quick-start\",\n",
        "    \"text\": \"Haystack is an open-source AI framework to build custom production-grade LLM ...\"\n",
        "  },\n",
        "  {\n",
        "    \"url\": \"https://haystack.deepset.ai/cookbook\",\n",
        "    \"text\": \"You can use these examples as guidelines on how to make use of different mod... \"\n",
        "  }\n",
        "]\n",
        "```\n",
        "\n",
        "We will convert this JSON to a Haystack `Document` using the `dataset_mapping_function` as follows:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "OZ0PAVHI_mhn"
      },
      "outputs": [],
      "source": [
        "from haystack import Document\n",
        "\n",
        "def dataset_mapping_function(dataset_item: dict) -> Document:\n",
        "    return Document(content=dataset_item.get(\"text\"), meta={\"url\": dataset_item.get(\"url\")})"
      ]
    },
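    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check, we can apply the mapping function to a hypothetical dataset item (mirroring the crawler output shown above) and inspect the resulting `Document`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# A hypothetical dataset item, shaped like the crawler output above\n",
        "sample_item = {\"url\": \"https://example.com\", \"text\": \"Sample page text\"}\n",
        "\n",
        "sample_doc = dataset_mapping_function(sample_item)\n",
        "print(sample_doc.content)  # Sample page text\n",
        "print(sample_doc.meta)  # {'url': 'https://example.com'}"
      ]
    },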
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xtFquWflA5kf"
      },
      "source": [
        "Next, we instantiate the `ApifyDatasetFromActorCall` component:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "gdN7baGrA_lR"
      },
      "outputs": [],
      "source": [
        "from apify_haystack import ApifyDatasetFromActorCall\n",
        "\n",
        "apify_dataset_loader = ApifyDatasetFromActorCall(\n",
        "    actor_id=actor_id,\n",
        "    run_input=run_input,\n",
        "    dataset_mapping_function=dataset_mapping_function,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3hG6SvMm_mAB"
      },
      "source": [
        "Before actually running the Website Content Crawler, we need to define an embedding function and a document store:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "zKr0KTfhAQz6"
      },
      "outputs": [],
      "source": [
        "from haystack.components.embedders import OpenAIDocumentEmbedder\n",
        "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
        "\n",
        "document_store = InMemoryDocumentStore()\n",
        "docs_embedder = OpenAIDocumentEmbedder()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GxDNZ7LqAsWV"
      },
      "source": [
        "After that, we can call the Website Content Crawler and print the scraped data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qfaWI6BaAko9"
      },
      "outputs": [],
      "source": [
        "# Crawl the website and store documents in the document_store\n",
        "# Crawling takes some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs\n",
        "\n",
        "docs = apify_dataset_loader.run()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LXcfam6pFJG-",
        "outputId": "70b23d91-6ac4-4daa-a422-c636546225f6"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'documents': [Document(id=3650d4d2050c97d0b20d6bb9202eb72494e2dc6ad0222a7e4a7bad038780ab31, content: 'Haystack | Haystack\n",
            "Multimodal\n",
            "AI\n",
            "Architect a next generation AI app around all modalities, not just...', meta: {'url': 'https://haystack.deepset.ai/'}, embedding: vector of size 1536), Document(id=a441728f7b8c8f7541304f23be229372f526306c6d39f634fecf245923d2f239, content: 'What is Haystack? | Haystack\n",
            "Haystack is an open-source AI orchestration framework built by deepset ...', meta: {'url': 'https://haystack.deepset.ai/overview/intro'}, embedding: vector of size 1536), Document(id=82282e7eb3115bf0e8efbaaa4de70fd68bcd1bebf25218a68973c3441ff9638f, content: 'Demos | Haystack\n",
            "Check out demos built with Haystack!\n",
            "AutoQuizzer\n",
            "Try out our AutoQuizzer demo built...', meta: {'url': 'https://haystack.deepset.ai/overview/demo'}, embedding: vector of size 1536), Document(id=55f775825a43a52c8f51f4ba08713389a652e05eb992ed15d7c18bbe68bbe38a, content: 'Get Started | Haystack\n",
            "Haystack is an open-source AI framework to build custom production-grade LLM ...', meta: {'url': 'https://haystack.deepset.ai/overview/quick-start'}, embedding: vector of size 1536), Document(id=1b7ed59f60d536b9e1903b9c66f86e942bfed9bab5ae9f32dcecc6645b95daab, content: '🧑‍🍳 Cookbook | Haystack\n",
            "You can use these examples as guidelines on how to make use of different mod...', meta: {'url': 'https://haystack.deepset.ai/cookbook'}, embedding: vector of size 1536)]}\n"
          ]
        }
      ],
      "source": [
        "print(docs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OxIGcmHcClQo"
      },
      "source": [
        "Compute the embeddings and store them in the database:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YrKAkHLuCp6N",
        "outputId": "2df583d4-7f65-4ec5-a94f-7a756e06d3e4"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Calculating embeddings: 1it [00:01,  1.07s/it]\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "5"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "embeddings = docs_embedder.run(docs.get(\"documents\"))\n",
        "document_store.write_documents(embeddings[\"documents\"])"
      ]
    },
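    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To confirm that the embedded documents were written, we can check the number of documents in the store; it should match the number of crawled pages:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# count_documents() is part of the standard Haystack document store interface\n",
        "print(document_store.count_documents())  # 5 (one per crawled page)"
      ]
    },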
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "18tOCjLEDNGu"
      },
      "source": [
        "## Retrieval and LLM generative pipeline\n",
        "\n",
        "Once we have the crawled data in the database, we can set up a classic retrieval-augmented generation (RAG) pipeline. Refer to the [RAG Haystack tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline) for details.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "31W_jlNWFkz3",
        "outputId": "f23e06af-1fd9-4e6f-c276-26e8ff9056a8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Initializing pipeline...\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "<haystack.core.pipeline.pipeline.Pipeline object at 0x79d0f361ea90>\n",
              "🚅 Components\n",
              "  - embedder: OpenAITextEmbedder\n",
              "  - retriever: InMemoryEmbeddingRetriever\n",
              "  - prompt_builder: ChatPromptBuilder\n",
              "  - llm: OpenAIChatGenerator\n",
              "🛤️ Connections\n",
              "  - embedder.embedding -> retriever.query_embedding (List[float])\n",
              "  - retriever.documents -> prompt_builder.documents (List[Document])\n",
              "  - prompt_builder.prompt -> llm.messages (List[ChatMessage])"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from haystack import Pipeline\n",
        "from haystack.components.builders import ChatPromptBuilder\n",
        "from haystack.components.embedders import OpenAITextEmbedder\n",
        "from haystack.components.generators.chat import OpenAIChatGenerator\n",
        "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n",
        "from haystack.dataclasses import ChatMessage\n",
        "\n",
        "text_embedder = OpenAITextEmbedder()\n",
        "retriever = InMemoryEmbeddingRetriever(document_store)\n",
        "generator = OpenAIChatGenerator(model=\"gpt-4o-mini\")\n",
        "\n",
        "template = \"\"\"\n",
        "Given the following information, answer the question.\n",
        "\n",
        "Context:\n",
        "{% for document in documents %}\n",
        "    {{ document.content }}\n",
        "{% endfor %}\n",
        "\n",
        "Question: {{question}}\n",
        "Answer:\n",
        "\"\"\"\n",
        "\n",
        "prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables=\"*\")\n",
        "\n",
        "# Add components to your pipeline\n",
        "print(\"Initializing pipeline...\")\n",
        "pipe = Pipeline()\n",
        "pipe.add_component(\"embedder\", text_embedder)\n",
        "pipe.add_component(\"retriever\", retriever)\n",
        "pipe.add_component(\"prompt_builder\", prompt_builder)\n",
        "pipe.add_component(\"llm\", generator)\n",
        "\n",
        "# Now, connect the components to each other\n",
        "pipe.connect(\"embedder.embedding\", \"retriever.query_embedding\")\n",
        "pipe.connect(\"retriever\", \"prompt_builder.documents\")\n",
        "pipe.connect(\"prompt_builder\", \"llm\")\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CXP-_TqcGU2Z"
      },
      "source": [
        "Now, you can ask questions about Haystack and get answers grounded in the crawled content:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uPtoRZEdF1BN",
        "outputId": "f0e63536-f4e1-4d04-c6b7-75c872b71749"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "question: What is haystack?\n",
            "answer: Haystack is an open-source AI orchestration framework developed by deepset that enables Python developers to create real-world applications using large language models (LLMs). It provides tools for building various types of applications, including autonomous agents, multi-modal apps, and scalable retrieval-augmented generation (RAG) systems. Haystack's modular architecture allows users to customize components, experiment with state-of-the-art methods, and manage their technology stack effectively. It caters to developers at all levels, from prototyping to full-scale deployment, and is supported by a community that values open-source collaboration. Haystack can be utilized directly in Python or through a visual interface called deepset Studio.\n"
          ]
        }
      ],
      "source": [
        "question = \"What is haystack?\"\n",
        "\n",
        "response = pipe.run({\"embedder\": {\"text\": question}, \"prompt_builder\": {\"question\": question}})\n",
        "\n",
        "print(f\"question: {question}\")\n",
        "print(f\"answer: {response['llm']['replies'][0].text}\")"
      ]
    }
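    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you also want to see which crawled pages the answer was based on, you can ask the pipeline to include the retriever's output in its results. The `include_outputs_from` parameter used below is part of `Pipeline.run` in Haystack 2.x:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "response = pipe.run(\n",
        "    {\"embedder\": {\"text\": question}, \"prompt_builder\": {\"question\": question}},\n",
        "    include_outputs_from={\"retriever\"},\n",
        ")\n",
        "\n",
        "# Print the source URLs of the retrieved documents\n",
        "for doc in response[\"retriever\"][\"documents\"]:\n",
        "    print(doc.meta[\"url\"])"
      ]
    }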
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
