{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gnKi6wEermZf"
      },
      "source": [
        "# LinkedIn, Company Intelligence & Lead Enrichment with Haystack, MongoDB Atlas, and Bright Data\n",
        "\n",
        "## 🚀 Build Your Own AI Sales Research Assistant\n",
        "\n",
        "This cookbook demonstrates how to build an AI-powered sales research assistant that:\n",
        "\n",
        "- **Extracts live data** from LinkedIn, Crunchbase, news sources, and job postings\n",
        "- **Stores and indexes** data in MongoDB Atlas for semantic search\n",
        "- **Answers complex questions** like \"What pain points is this company facing?\" and \"Generate a personalized outreach angle\"\n",
        "\n",
        "**The Tech Stack:**\n",
        "\n",
        "- **🌐 Bright Data**: Web scraping for 40+ data sources (LinkedIn, Crunchbase, news, job boards)\n",
        "- **🍃 MongoDB Atlas**: Vector database for semantic search + structured metadata filtering\n",
        "- **🔧 Haystack**: Open-source LLM framework for building RAG pipelines\n",
        "- **🤖 Google Gemini 2.5**: Generate actionable sales intelligence from raw data\n",
        "\n",
        "**What You'll Build:**\n",
        "\n",
        "1. **Find companies** matching your Ideal Customer Profile (ICP) criteria\n",
        "2. **Identify decision makers** and research their backgrounds\n",
        "3. **Extract pain points** from job postings, news articles, and company data\n",
        "4. **Generate personalized outreach** angles based on comprehensive company intelligence\n",
        "\n",
        "Let's get started! 🎯"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gjzGi7bormZg"
      },
      "source": [
        "## 🏗️ Architecture Overview\n",
        "\n",
        "### How the Sales Research Assistant Works\n",
        "\n",
        "Our AI assistant combines three powerful technologies to deliver comprehensive lead intelligence:\n",
        "\n",
        "```\n",
        "┌─────────────────┐\n",
        "│   User Query    │  \"Find AI startups in NYC with Series A funding\"\n",
        "└────────┬────────┘\n",
        "         │\n",
        "         ▼\n",
        "┌─────────────────────────────────────────────────────────────┐\n",
        "│                   HAYSTACK PIPELINE                         │\n",
        "│  ┌──────────────┐    ┌──────────────┐    ┌───────────────┐  │\n",
        "│  │   Embedder   │───▶│  Retriever   │───▶│ Prompt Builder│  │\n",
        "│  └──────────────┘    └──────┬───────┘    └──────┬────────┘  │\n",
        "│                              │                    │         │\n",
        "│                              ▼                    ▼         │\n",
        "│                    ┌──────────────────┐  ┌──────────────┐   │\n",
        "│                    │  MongoDB Atlas   │  │ Gemini 2.5   │   │\n",
        "│                    │ Vector Search +  │  │  Generator   │   │\n",
        "│                    │ Metadata Filter  │  └──────────────┘   │\n",
        "│                    └────────▲─────────┘                     │\n",
        "└─────────────────────────────┼───────────────────────────────┘\n",
        "                              │\n",
        "                    ┌─────────┴─────────┐\n",
        "                    │  INDEXING LAYER   │\n",
        "                    └─────────▲─────────┘\n",
        "                              │\n",
        "                    ┌─────────┴─────────┐\n",
        "                    │   BRIGHT DATA     │\n",
        "                    │  Web Scraping API │\n",
        "                    └─────────┬─────────┘\n",
        "                              │\n",
        "              ┌───────────────┼───────────────┐\n",
        "              │               │               │\n",
        "         ┌────▼────┐    ┌─────▼─────┐  ┌─────▼─────┐\n",
        "         │LinkedIn │    │ Crunchbase│  │Google SERP│\n",
        "         │Profiles │    │ Companies │  │   News    │\n",
        "         └─────────┘    └───────────┘  └───────────┘\n",
        "```\n",
        "\n",
        "### Component Breakdown\n",
        "\n",
        "#### 1. **Bright Data Layer** (Data Collection)\n",
        "- **Web Scraper API**: Extracts structured data from 40+ sources\n",
        "  - `linkedin_company_profile`: Company size, industry, description, location\n",
        "  - `linkedin_person_profile`: Decision maker titles, backgrounds, experience\n",
        "  - `crunchbase_company`: Funding rounds, investors, employee count\n",
        "- **SERP API**: Real-time search results from Google/Bing\n",
        "  - Company news and press releases\n",
        "  - Job postings (signal for pain points)\n",
        "  - Industry trends and mentions\n",
        "- **Compliance Built-in**: Respects robots.txt, handles CAPTCHAs, rotates IPs automatically\n",
        "\n",
        "#### 2. **MongoDB Atlas** (Storage & Retrieval)\n",
        "- **Vector Search**: Semantic similarity matching on embedded company/person descriptions\n",
        "- **Metadata Filtering**: Hybrid search combining vectors with structured filters\n",
        "  - Filter by: industry, funding stage, location, company size, job titles\n",
        "- **Document Storage**: Stores raw scraped data + embeddings + metadata\n",
        "- **Scalable**: Handles millions of leads with sub-second query times\n",
        "\n",
        "#### 3. **Haystack Pipeline** (Orchestration)\n",
        "- **Embedder**: Converts queries and documents to vector representations using Google's text-embedding-004\n",
        "- **Retriever**: Finds most relevant leads from MongoDB based on semantic + metadata match\n",
        "- **Prompt Builder**: Constructs context-rich prompts with retrieved lead data\n",
        "- **LLM Generator**: Gemini 2.5 Flash synthesizes insights and generates actionable intelligence\n",
        "\n",
        "### Agent Capabilities\n",
        "\n",
        "This architecture enables four key workflows:\n",
        "\n",
        "**1. Company Discovery**\n",
        "- Input: ICP criteria (industry, funding stage, location, size)\n",
        "- Process: Scrape Crunchbase/LinkedIn → Index in MongoDB → Semantic search\n",
        "- Output: Ranked list of companies matching criteria\n",
        "\n",
        "**2. Decision Maker Identification**\n",
        "- Input: Company name or URL\n",
        "- Process: Scrape LinkedIn company page → Extract employee profiles → Identify key roles\n",
        "- Output: List of decision makers with titles, backgrounds, and contact hints\n",
        "\n",
        "**3. Pain Point Analysis**\n",
        "- Input: Company name\n",
        "- Process: SERP search for job postings + news → Analyze requirements and challenges\n",
        "- Output: Inferred pain points, hiring priorities, growth signals\n",
        "\n",
        "**4. Personalized Outreach Generation**\n",
        "- Input: Prospect name/company + context from above\n",
        "- Process: RAG retrieval of all data → Gemini synthesis with sales prompts\n",
        "- Output: Personalized email/message angle with specific talking points\n",
        "\n",
        "### Data Flow Example\n",
        "\n",
        "**Query**: *\"Find AI startups in NYC that raised Series A in the last 6 months\"*\n",
        "\n",
        "1. **Scraping**: Bright Data queries Crunchbase for AI companies in NYC with recent Series A funding\n",
        "2. **Indexing**: Companies are converted to Documents with embeddings and metadata (industry=AI, location=NYC, funding_stage=Series A)\n",
        "3. **Retrieval**: Query embedding matches semantically similar companies + metadata filters enforce ICP criteria\n",
        "4. **Generation**: Gemini 2.5 receives the top 10 matching companies and synthesizes a detailed report with key insights\n",
        "\n",
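        "The four steps can be sketched with mocked data (everything below is illustrative -- the real scraping, embedding, and generation happen in the cells that follow):\n",
        "\n",
        "```python\n",
        "# Mocked walk-through of the data flow -- all values are made up\n",
        "scraped = {  # step 1: what a scraper might return for one company\n",
        "    \"name\": \"Acme AI\",\n",
        "    \"industry\": \"AI\",\n",
        "    \"location\": \"New York, NY\",\n",
        "    \"funding_stage\": \"Series A\",\n",
        "}\n",
        "\n",
        "# Step 2: indexing -- pair human-readable content with structured metadata\n",
        "document = {\n",
        "    \"content\": \"Company: {} | Industry: {}\".format(scraped[\"name\"], scraped[\"industry\"]),\n",
        "    \"meta\": {k: scraped[k] for k in (\"industry\", \"location\", \"funding_stage\")},\n",
        "}\n",
        "\n",
        "# Step 3: retrieval -- metadata filters enforce the ICP criteria exactly,\n",
        "# while (in the real pipeline) vector similarity ranks the survivors\n",
        "def matches_icp(doc, **criteria):\n",
        "    return all(doc[\"meta\"].get(k) == v for k, v in criteria.items())\n",
        "\n",
        "print(matches_icp(document, location=\"New York, NY\", funding_stage=\"Series A\"))  # prints True\n",
        "```\n",
        "\n",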
        "Now let's build it! 🛠️"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lU8xln2wrmZg"
      },
      "source": [
        "## Setup\n",
        "\n",
        "First, we need to install the required dependencies for our sales research assistant."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pSiJNcburmZg"
      },
      "outputs": [],
      "source": [
        "! pip install haystack-ai haystack-brightdata mongodb-atlas-haystack google-genai-haystack python-dotenv"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RMt6PDx9rmZh"
      },
      "source": [
        "### API Configuration\n",
        "\n",
        "Next, we'll configure the API keys needed for our sales research assistant. You'll need:\n",
        "\n",
        "1. **Bright Data API Key**: Get yours from the [Bright Data Dashboard](https://brightdata.com/cp/setting/users)\n",
        "2. **MongoDB Connection String**: From your [MongoDB Atlas cluster](https://www.mongodb.com/docs/atlas/getting-started/)\n",
        "3. **Google API Key**: For Gemini access from [Google AI Studio](https://aistudio.google.com/)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qpiWeX7PrmZh",
        "outputId": "d0ef3ac3-5ec2-4316-8e51-3f42c8736a55"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ All environment variables loaded successfully\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from dotenv import load_dotenv\n",
        "\n",
        "# Load environment variables from .env file\n",
        "load_dotenv(override=True)\n",
        "\n",
        "if not os.environ.get(\"GOOGLE_API_KEY\") and os.environ.get(\"GOOGLE_AI_API_KEY\"):\n",
        "    os.environ[\"GOOGLE_API_KEY\"] = os.environ[\"GOOGLE_AI_API_KEY\"]\n",
        "\n",
        "# Verify all required keys are loaded\n",
        "required_keys = [\"BRIGHT_DATA_API_KEY\", \"MONGO_CONNECTION_STRING\", \"GOOGLE_API_KEY\"]\n",
        "missing_keys = [key for key in required_keys if not os.environ.get(key)]\n",
        "\n",
        "if missing_keys:\n",
        "    print(f\"❌ Missing keys: {', '.join(missing_keys)}\")\n",
        "    raise ValueError(f\"Please add {', '.join(missing_keys)} to your .env file\")\n",
        "else:\n",
        "    print(\"✅ All environment variables loaded successfully\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KT1vZQLIrmZh"
      },
      "source": [
        "### Bright Data Datasets Reference\n",
        "\n",
        "- [Crunchbase Company](https://brightdata.com/products/datasets/crunchbase)\n",
        "- [LinkedIn Company Profile](https://brightdata.com/products/datasets/linkedin/company)\n",
        "- [LinkedIn Person Profile](https://brightdata.com/products/datasets/linkedin/profiles)\n",
        "- [Google SERP API](https://brightdata.com/products/serp-api/google-search)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y6SFAHoNrmZh",
        "outputId": "ee4d5848-bca3-4b3a-e7c7-6aa3e6acf08b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Total available datasets: 43\n",
            "\n",
            "Sales research relevant datasets:\n",
            "--------------------------------------------------\n",
            "📊 linkedin_person_profile\n",
            "   Extract structured LinkedIn person profile data. Requires a valid LinkedIn profile URL.\n",
            "\n",
            "📊 linkedin_company_profile\n",
            "   Extract structured LinkedIn company profile data. Requires a valid LinkedIn company URL.\n",
            "\n",
            "📊 linkedin_job_listings\n",
            "   Extract structured LinkedIn job listings data. Requires a valid LinkedIn job URL.\n",
            "\n",
            "📊 linkedin_posts\n",
            "   Extract structured LinkedIn posts data. Requires a valid LinkedIn post URL.\n",
            "\n",
            "📊 linkedin_people_search\n",
            "   Extract structured LinkedIn people search data. Requires URL, first_name, and last_name.\n",
            "\n",
            "📊 crunchbase_company\n",
            "   Extract structured Crunchbase company data. Requires a valid Crunchbase company URL.\n",
            "\n",
            "📊 zoominfo_company_profile\n",
            "   Extract structured ZoomInfo company profile data. Requires a valid ZoomInfo company URL.\n",
            "\n",
            "📊 instagram_profiles\n",
            "   Extract structured Instagram profile data. Requires a valid Instagram profile URL.\n",
            "\n",
            "📊 facebook_company_reviews\n",
            "   Extract structured Facebook company reviews. Requires a valid Facebook company URL and num_of_reviews.\n",
            "\n",
            "📊 tiktok_profiles\n",
            "   Extract structured TikTok profile data. Requires a valid TikTok profile URL.\n",
            "\n",
            "📊 youtube_profiles\n",
            "   Extract structured YouTube channel profile data. Requires a valid YouTube channel URL.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "from haystack_brightdata import BrightDataWebScraper\n",
        "\n",
        "# List all supported datasets\n",
        "datasets = BrightDataWebScraper.get_supported_datasets()\n",
        "\n",
        "print(f\"Total available datasets: {len(datasets)}\\n\")\n",
        "print(\"Sales research relevant datasets:\")\n",
        "print(\"-\" * 50)\n",
        "\n",
        "# Filter for relevant datasets\n",
        "relevant_keywords = [\"linkedin\", \"crunchbase\", \"company\", \"profile\"]\n",
        "for dataset in datasets:\n",
        "    if any(keyword in dataset['id'].lower() for keyword in relevant_keywords):\n",
        "        print(f\"📊 {dataset['id']}\")\n",
        "        print(f\"   {dataset['description']}\")\n",
        "        print()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6Jp4C9AvrmZi"
      },
      "source": [
        "### MongoDB Atlas Setup\n",
        "\n",
        "MongoDB Atlas will serve as our vector database for storing embedded lead data and enabling semantic search.\n",
        "\n",
        "**1. Create a MongoDB Atlas Cluster**\n",
        "\n",
        "Follow the [Get Started with Atlas](https://www.mongodb.com/docs/atlas/getting-started/) guide to:\n",
        "- Create a free cluster (M0 tier is sufficient for testing)\n",
        "- Set up database access credentials\n",
        "- Configure network access (allow your IP or use 0.0.0.0/0 for testing)\n",
        "- Get your connection string\n",
        "\n",
        "**2. Create Vector Search Index**\n",
        "\n",
        "1. Go to your cluster in the Atlas UI\n",
        "2. Click the \"Search\" tab → \"Create Search Index\"\n",
        "3. Select \"Atlas Vector Search\" → \"JSON Editor\"\n",
        "4. Configure:\n",
        "   - Index name: `lead_vector_index`\n",
        "   - Database: `sales_intelligence`\n",
        "   - Collection: `leads`\n",
        "\n",
        "5. Paste this configuration:\n",
        "\n",
        "```json\n",
        "{\n",
        "  \"fields\": [\n",
        "    {\n",
        "      \"type\": \"vector\",\n",
        "      \"path\": \"embedding\",\n",
        "      \"numDimensions\": 768,\n",
        "      \"similarity\": \"cosine\"\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "```\n",
        "\n",
        "6. Wait for the index status to change from \"Building\" to \"Active\"\n",
        "\n",
        "**This is how it should look after setup:**\n",
        "\n",
        "![image.png](https://github.com/deepset-ai/haystack-cookbook/blob/main/data/ai_sales_research_assistant_assets/mongo_setup.png?raw=1)\n",
        "\n",
        "Let's initialize the document store:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QIYGglSjrmZi",
        "outputId": "15e17394-b169-4fe7-82b7-9f97d7fea02c"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ MongoDB Atlas DocumentStore initialized\n",
            "   Database: sales_intelligence\n",
            "   Collection: leads\n",
            "   Vector Search Index: lead_vector_index\n",
            "   Full-Text Search Index: lead_fulltext_index\n"
          ]
        }
      ],
      "source": [
        "from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore\n",
        "\n",
        "# Initialize MongoDB Atlas Document Store\n",
        "# Note: It automatically reads from MONGO_CONNECTION_STRING environment variable\n",
        "document_store = MongoDBAtlasDocumentStore(\n",
        "    database_name=\"sales_intelligence\",\n",
        "    collection_name=\"leads\",\n",
        "    vector_search_index=\"lead_vector_index\",\n",
        "    full_text_search_index=\"lead_fulltext_index\"\n",
        ")\n",
        "\n",
        "print(\"✅ MongoDB Atlas DocumentStore initialized\")\n",
        "print(f\"   Database: sales_intelligence\")\n",
        "print(f\"   Collection: leads\")\n",
        "print(f\"   Vector Search Index: lead_vector_index\")\n",
        "print(f\"   Full-Text Search Index: lead_fulltext_index\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Pnp3MixdrmZi"
      },
      "source": [
        "## Data Model Design\n",
        "\n",
        "Our lead intelligence database uses a flexible schema that accommodates data from multiple sources while enabling powerful **hybrid search capabilities**.\n",
        "\n",
        "This structure enables three search modes:\n",
        "\n",
        "1. **Semantic Search**: Find similar companies/people based on meaning\n",
        "   - Query: \"AI startups focused on enterprise automation\"\n",
        "   - Matches: Companies with similar descriptions, even if wording differs\n",
        "\n",
        "2. **Metadata Filtering**: Exact match on structured fields\n",
        "   - Filter: `funding_stage = \"Series A\" AND location = \"New York, NY\"`\n",
        "   - Returns: Only companies meeting exact criteria\n",
        "\n",
        "3. **Hybrid Search**: Combine both approaches\n",
        "   - Semantic query: \"Companies building developer tools\"\n",
        "   - + Filters: `funding_stage = \"Series A\"` AND `location = \"San Francisco, CA\"`\n",
        "   - Result: Semantically relevant companies that also match exact criteria\n",
        "\n",
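        "As a concrete sketch, the Series A + San Francisco combination above maps onto Haystack's filter syntax roughly like this (the `meta.*` field names follow the schema defined later in this section):\n",
        "\n",
        "```python\n",
        "# Haystack 2.x filter dict: exact criteria for the hybrid search,\n",
        "# passed to the retriever alongside the semantic query embedding\n",
        "filters = {\n",
        "    \"operator\": \"AND\",\n",
        "    \"conditions\": [\n",
        "        {\"field\": \"meta.funding_stage\", \"operator\": \"==\", \"value\": \"Series A\"},\n",
        "        {\"field\": \"meta.location\", \"operator\": \"==\", \"value\": \"San Francisco, CA\"},\n",
        "    ],\n",
        "}\n",
        "```\n",
        "\n",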
        "### Example Documents\n",
        "\n",
        "Each document has three components: `content` (human-readable text for LLM context), `embedding` (768-dim vector from text-embedding-004 for semantic search), and `meta` (structured fields for filtering).\n",
        "\n",
        "**Company Document (Crunchbase):**\n",
        "```python\n",
        "{\n",
        "  \"content\": \"Company: Acme AI\\nIndustry: Artificial Intelligence\\nFunding: $15M Series A...\",\n",
        "  \"embedding\": [0.123, -0.456, ...],  # 768 dimensions\n",
        "  \"meta\": {\n",
        "    \"source_url\": \"https://www.crunchbase.com/organization/acme-ai\",\n",
        "    \"dataset_type\": \"crunchbase_company\",\n",
        "    \"company_name\": \"Acme AI\",\n",
        "    \"industry\": \"AI/ML\",\n",
        "    \"funding_stage\": \"Series A\",\n",
        "    \"location\": \"San Francisco, CA\",\n",
        "    \"scraped_date\": \"2026-01-19\"\n",
        "  }\n",
        "}\n",
        "```\n",
        "\n",
        "**Person Document (LinkedIn):**\n",
        "```python\n",
        "{\n",
        "  \"content\": \"Name: Jane Smith\\nTitle: VP of Engineering\\nCompany: Acme AI\\nExperience: 10+ years...\",\n",
        "  \"embedding\": [0.234, -0.567, ...],  # 768 dimensions\n",
        "  \"meta\": {\n",
        "    \"source_url\": \"https://www.linkedin.com/in/janesmith\",\n",
        "    \"dataset_type\": \"linkedin_person\",\n",
        "    \"person_name\": \"Jane Smith\",\n",
        "    \"person_title\": \"VP of Engineering\",\n",
        "    \"company\": \"Acme AI\",\n",
        "    \"location\": \"San Francisco, CA\",\n",
        "    \"scraped_date\": \"2026-01-19\"\n",
        "  }\n",
        "}\n",
        "```\n",
        "\n",
        "**News Signal Document (SERP):**\n",
        "```python\n",
        "{\n",
        "  \"content\": \"News: Acme AI raises $15M Series A\\nSource: TechCrunch\\nSnippet: AI startup...\",\n",
        "  \"embedding\": [0.345, -0.678, ...],  # 768 dimensions\n",
        "  \"meta\": {\n",
        "    \"source_url\": \"https://techcrunch.com/...\",\n",
        "    \"dataset_type\": \"news\",\n",
        "    \"company_name\": \"Acme AI\",\n",
        "    \"scraped_date\": \"2026-01-19\"\n",
        "  }\n",
        "}\n",
        "```\n",
        "\n",
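        "A minimal sketch of mapping one scraped record into this shape (pure Python; `to_lead_document` and its field choices are illustrative -- in the actual pipeline the embedder adds the `embedding` vector):\n",
        "\n",
        "```python\n",
        "from datetime import date\n",
        "\n",
        "def to_lead_document(record, dataset_type):\n",
        "    \"\"\"Map one scraped record (dict) into the content/meta shape above.\"\"\"\n",
        "    content = \"Company: {}\\nIndustry: {}\".format(\n",
        "        record.get(\"name\", \"N/A\"),\n",
        "        record.get(\"industry\", \"N/A\"),\n",
        "    )\n",
        "    return {\n",
        "        \"content\": content,\n",
        "        \"meta\": {\n",
        "            \"source_url\": record.get(\"url\", \"\"),\n",
        "            \"dataset_type\": dataset_type,\n",
        "            \"company_name\": record.get(\"name\", \"\"),\n",
        "            \"scraped_date\": date.today().isoformat(),\n",
        "        },\n",
        "    }\n",
        "\n",
        "doc = to_lead_document(\n",
        "    {\"name\": \"Acme AI\", \"industry\": \"AI/ML\", \"url\": \"https://example.com\"},\n",
        "    dataset_type=\"crunchbase_company\",\n",
        ")\n",
        "```\n",
        "\n",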
        "This flexible schema allows us to enrich lead profiles with multiple data sources while maintaining fast, accurate search capabilities."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rKZJBl_irmZi",
        "outputId": "72d932fe-b512-460f-e0b7-a17ac097919f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ MongoDB Atlas Retriever initialized\n",
            "   Connected to: leads\n",
            "   Using vector index: lead_vector_index\n"
          ]
        }
      ],
      "source": [
        "from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever\n",
        "\n",
        "# Initialize the retriever for vector search\n",
        "retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)\n",
        "\n",
        "print(\"✅ MongoDB Atlas Retriever initialized\")\n",
        "print(f\"   Connected to: {document_store.collection_name}\")\n",
        "print(f\"   Using vector index: {document_store.vector_search_index}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HQxu2FvlrmZi",
        "outputId": "4278835e-ddd1-44dd-bd0a-7c64fe622789"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Bright Data Web Scraper initialized\n",
            "   API Key configured: (value hidden)\n",
            "   Ready to scrape from 40+ supported datasets\n"
          ]
        }
      ],
      "source": [
        "from haystack_brightdata import BrightDataWebScraper\n",
        "\n",
        "# Initialize the Web Scraper\n",
        "# Note: Automatically uses BRIGHT_DATA_API_KEY from environment\n",
        "scraper = BrightDataWebScraper()\n",
        "\n",
        "print(\"✅ Bright Data Web Scraper initialized\")\n",
        "print(\"   API Key configured: (value hidden)\")\n",
        "print(\"   Ready to scrape from 40+ supported datasets\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m1S0njTgrmZi"
      },
      "source": [
        "### Example 1: Scraping Crunchbase Company Data\n",
        "\n",
        "Let's start by extracting company intelligence from Crunchbase. This gives us funding information, investors, employee count, and more."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2gn560f3rmZi",
        "outputId": "0486a41c-4eca-4a88-a6af-0b9b72668357"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Scraping Crunchbase data for: https://www.crunchbase.com/organization/openai\n",
            "\n",
            "✅ Successfully scraped company data!\n",
            "\n",
            "📊 Key Information:\n",
            "   Company: OpenAI\n",
            "   Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\n",
            "   Industries: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS\n",
            "   Operating Status: active\n",
            "   Website: https://www.openai.com\n",
            "   Employees: 1001-5000\n",
            "   Phone: +1 800-242-8478\n",
            "   Active Tech Count: 79\n",
            "   Tech (sample): DNSSEC, SSL by Default, HSTS, U.S. Server Location, Mobile Non Scaleable Content\n",
            "   Latest News Date: 2026-01-25\n",
            "\n",
            "📄 Full data structure (first 500 chars):\n",
            "{\n",
            "  \"name\": \"OpenAI\",\n",
            "  \"url\": \"https://www.crunchbase.com/organization/openai\",\n",
            "  \"id\": \"openai\",\n",
            "  \"cb_rank\": 3,\n",
            "  \"region\": \"California\",\n",
            "  \"about\": \"OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\",\n",
            "  \"industries\": [\n",
            "    {\n",
            "      \"id\": \"agentic-ai-17fa\",\n",
            "      \"value\": \"Agentic AI\"\n",
            "    },\n",
            "    {\n",
            "      \"id\": \"artificial-intelligence\",\n",
            "      \"value\": \"Artificial Intelligence (AI)\"\n",
            "    },\n",
            "    {\n",
            "      \"id\": \"foundational-ai\",\n",
            "      \"value\": \"Fou...\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "# Example: Scrape company data from Crunchbase\n",
        "# Replace with an actual Crunchbase company URL you want to research\n",
        "company_url = \"https://www.crunchbase.com/organization/openai\"\n",
        "\n",
        "print(\"Scraping Crunchbase data for: {}\".format(company_url))\n",
        "print()\n",
        "\n",
        "def coalesce(data, *keys, default=\"N/A\"):\n",
        "    for key in keys:\n",
        "        value = data.get(key)\n",
        "        if value not in (None, \"\", [], {}):\n",
        "            return value\n",
        "    return default\n",
        "\n",
        "def format_industries(industries):\n",
        "    if not industries:\n",
        "        return \"N/A\"\n",
        "    if isinstance(industries, list):\n",
        "        values = []\n",
        "        for item in industries:\n",
        "            if isinstance(item, dict):\n",
        "                value = item.get(\"value\") or item.get(\"name\") or item.get(\"id\")\n",
        "                if value:\n",
        "                    values.append(value)\n",
        "            else:\n",
        "                values.append(str(item))\n",
        "        return \", \".join(values) if values else \"N/A\"\n",
        "    return industries\n",
        "\n",
        "def parse_company(result):\n",
        "    raw = result.get(\"data\", result)\n",
        "    if isinstance(raw, str):\n",
        "        raw = json.loads(raw)\n",
        "    if isinstance(raw, list):\n",
        "        return raw[0] if raw else {}\n",
        "    if isinstance(raw, dict):\n",
        "        return raw\n",
        "    return {}\n",
        "\n",
        "try:\n",
        "    result = scraper.run(\n",
        "        dataset=\"crunchbase_company\",\n",
        "        url=company_url\n",
        "    )\n",
        "\n",
        "    company_data = parse_company(result)\n",
        "\n",
        "    industries = format_industries(company_data.get(\"industries\"))\n",
        "    tech_list = company_data.get(\"builtwith_tech\") or company_data.get(\"built_with_tech\") or []\n",
        "    tech_names = [\n",
        "        item.get(\"name\")\n",
        "        for item in tech_list\n",
        "        if isinstance(item, dict) and item.get(\"name\")\n",
        "    ]\n",
        "    tech_preview = \", \".join(tech_names[:5]) if tech_names else \"N/A\"\n",
        "\n",
        "    news_items = company_data.get(\"news\") or []\n",
        "    news_dates = [\n",
        "        item.get(\"date\")\n",
        "        for item in news_items\n",
        "        if isinstance(item, dict) and item.get(\"date\")\n",
        "    ]\n",
        "    latest_news_date = max(news_dates) if news_dates else \"N/A\"\n",
        "\n",
        "    print(\"✅ Successfully scraped company data!\")\n",
        "    print()\n",
        "    print(\"📊 Key Information:\")\n",
        "    print(\"   Company: {}\".format(coalesce(company_data, \"name\", \"legal_name\")))\n",
        "    print(\"   Overview: {}\".format(coalesce(company_data, \"about\", \"company_overview\")))\n",
        "    print(\"   Industries: {}\".format(industries))\n",
        "    print(\"   Operating Status: {}\".format(coalesce(company_data, \"operating_status\")))\n",
        "    print(\"   Website: {}\".format(coalesce(company_data, \"website\", \"url\")))\n",
        "    print(\"   Employees: {}\".format(coalesce(company_data, \"num_employees\", \"number_of_employee_profiles\")))\n",
        "    print(\"   Phone: {}\".format(coalesce(company_data, \"contact_phone\", \"phone_number\")))\n",
        "    print(\n",
        "        \"   Active Tech Count: {}\".format(\n",
        "            coalesce(\n",
        "                company_data,\n",
        "                \"active_tech_count\",\n",
        "                \"builtwith_num_technologies_used\",\n",
        "                \"built_with_num_technologies_used\"\n",
        "            )\n",
        "        )\n",
        "    )\n",
        "    print(\"   Tech (sample): {}\".format(tech_preview))\n",
        "    print(\"   Latest News Date: {}\".format(latest_news_date))\n",
        "\n",
        "    print()\n",
        "    print(\"📄 Full data structure (first 500 chars):\")\n",
        "    print(json.dumps(company_data, indent=2)[:500] + \"...\")\n",
        "except Exception as e:\n",
        "    print(\"❌ Error scraping data: {}\".format(e))\n",
        "    print(\"   This might be due to invalid URL or rate limiting\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2JgOAzVJrmZj"
      },
      "source": [
        "### Example 2: Scraping LinkedIn Company Data\n",
        "\n",
        "Now we'll extract company data from LinkedIn. This adds broader profile information about the requested company, such as its description, headcount, headquarters, and followers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "guQnSmdvrmZj",
        "outputId": "694aebc9-5465-4b22-bffd-118b23f5391f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Scraping LinkedIn company data for: https://www.linkedin.com/company/openai/\n",
            "\n",
            "✅ Successfully scraped LinkedIn company data!\n",
            "\n",
            "📊 Key Information:\n",
            "   Company: OpenAI\n",
            "   Description: OpenAI | 9,797,179 followers on LinkedIn. OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremel...\n",
            "   Industry: N/A\n",
            "   Company Size: 201-500 employees\n",
            "   Headquarters: San Francisco, CA\n",
            "   Website: https://openai.com/\n",
            "   Followers: N/A\n",
            "   Specialties: a, r, t, i, f\n",
            "\n",
            "📄 Full data structure (first 500 chars):\n",
            "{\n",
            "  \"id\": \"openai\",\n",
            "  \"name\": \"OpenAI\",\n",
            "  \"country_code\": \"US\",\n",
            "  \"locations\": [\n",
            "    \"San Francisco, CA 94110, US\"\n",
            "  ],\n",
            "  \"followers\": 9797179,\n",
            "  \"employees_in_linkedin\": 7020,\n",
            "  \"about\": \"OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremely powerful tool that must be created with safety and human needs at its core. OpenAI is dedicated to putting that alignment of interests first \\u2014 ahe...\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "# Example: Scrape LinkedIn company profile\n",
        "# Replace with an actual LinkedIn company URL you want to research\n",
        "linkedin_url = \"https://www.linkedin.com/company/openai/\"\n",
        "\n",
        "print(f\"Scraping LinkedIn company data for: {linkedin_url}\")\n",
        "print()\n",
        "\n",
        "try:\n",
        "    result = scraper.run(\n",
        "        dataset=\"linkedin_company_profile\",\n",
        "        url=linkedin_url\n",
        "    )\n",
        "\n",
        "    # Parse the JSON response\n",
        "    if isinstance(result[\"data\"], str):\n",
        "        company_data = json.loads(result[\"data\"])\n",
        "    else:\n",
        "        company_data = result[\"data\"]\n",
        "\n",
        "    # Handle list response\n",
        "    if isinstance(company_data, list):\n",
        "        company_data = company_data[0] if company_data else {}\n",
        "\n",
        "    print(\"✅ Successfully scraped LinkedIn company data!\")\n",
        "    print(\"\\n📊 Key Information:\")\n",
        "    print(f\"   Company: {company_data.get('name', 'N/A')}\")\n",
        "    print(f\"   Description: {(company_data.get('description') or 'N/A')[:200]}...\")\n",
        "    print(f\"   Industry: {company_data.get('industry', 'N/A')}\")\n",
        "    print(f\"   Company Size: {company_data.get('company_size', 'N/A')}\")\n",
        "    print(f\"   Headquarters: {company_data.get('headquarters', 'N/A')}\")\n",
        "    print(f\"   Website: {company_data.get('website', 'N/A')}\")\n",
        "    print(f\"   Followers: {company_data.get('followers', company_data.get('follower_count', 'N/A'))}\")\n",
        "    specialties = company_data.get('specialties') or []\n",
        "    if isinstance(specialties, str):\n",
        "        specialties = [s.strip() for s in specialties.split(',')]  # some responses return a comma-separated string\n",
        "    print(f\"   Specialties: {', '.join(specialties[:5]) if specialties else 'N/A'}\")\n",
        "\n",
        "    print(\"\\n📄 Full data structure (first 500 chars):\")\n",
        "    print(json.dumps(company_data, indent=2)[:500] + \"...\")\n",
        "\n",
        "except Exception as e:\n",
        "    print(f\"❌ Error scraping data: {e}\")\n",
        "    print(\"   This might be due to an invalid URL, rate limiting, or authentication requirements\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uiIGSiH6rmZj"
      },
      "source": [
        "### Example 3: Scraping LinkedIn Person Profile\n",
        "\n",
        "Now let's extract decision maker profiles from LinkedIn. This helps identify key contacts, their backgrounds, and experience."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ixcx45fkrmZj",
        "outputId": "026934c2-b2de-464f-9345-d7f1abd9206d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Scraping LinkedIn person profile for: https://www.linkedin.com/in/satyanadella/\n",
            "\n",
            "✅ Successfully scraped LinkedIn person profile!\n",
            "\n",
            "📊 Key Information:\n",
            "   Name: Satya Nadella\n",
            "   Position: Chairman and CEO at Microsoft\n",
            "   Location: Redmond, Washington, United States, US\n",
            "   Current Company: Microsoft\n",
            "   Followers: 11816477\n",
            "   Connections: 500\n",
            "\n",
            "   About: As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more....\n",
            "\n",
            "   Experience (5 roles):\n",
            "      1. Chairman and CEO at Microsoft (N/A)\n",
            "      2. Member Board Of Trustees at University of Chicago (N/A)\n",
            "      3. Board Member at Starbucks (N/A)\n",
            "\n",
            "   Education (3 entries):\n",
            "      1. The University of Chicago Booth School of Business (1994-1996)\n",
            "      2. Manipal Institute of Technology, Manipal (-)\n",
            "\n",
            "📄 Full data structure (first 500 chars):\n",
            "{\n",
            "  \"id\": \"satyanadella\",\n",
            "  \"name\": \"Satya Nadella\",\n",
            "  \"city\": \"Redmond, Washington, United States\",\n",
            "  \"country_code\": \"US\",\n",
            "  \"position\": \"Chairman and CEO at Microsoft\",\n",
            "  \"about\": \"As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more.\",\n",
            "  \"posts\": [\n",
            "    {\n",
            "      \"title\": \"A Positive-Sum Future\",\n",
            "      \"attribution\": \"I\\u2019ve been thinking a lot about what the net benefit of the AI platform...\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "# Example: Scrape LinkedIn person profile\n",
        "person_url = \"https://www.linkedin.com/in/satyanadella/\"\n",
        "\n",
        "print(f\"Scraping LinkedIn person profile for: {person_url}\")\n",
        "print()\n",
        "\n",
        "try:\n",
        "    result = scraper.run(\n",
        "        dataset=\"linkedin_person_profile\",\n",
        "        url=person_url\n",
        "    )\n",
        "\n",
        "    # Parse the JSON response\n",
        "    if isinstance(result[\"data\"], str):\n",
        "        person_data = json.loads(result[\"data\"])\n",
        "    else:\n",
        "        person_data = result[\"data\"]\n",
        "\n",
        "    # Handle list response - LinkedIn returns a list with one person object\n",
        "    if isinstance(person_data, list):\n",
        "        person_data = person_data[0] if person_data else {}\n",
        "\n",
        "    print(\"✅ Successfully scraped LinkedIn person profile!\")\n",
        "    print(\"\\n📊 Key Information:\")\n",
        "    print(f\"   Name: {person_data.get('name', 'N/A')}\")\n",
        "    print(f\"   Position: {person_data.get('position', 'N/A')}\")\n",
        "    print(f\"   Location: {person_data.get('city', 'N/A')}, {person_data.get('country_code', 'N/A')}\")\n",
        "\n",
        "    # Current company\n",
        "    current_company = person_data.get('current_company', {})\n",
        "    if current_company:\n",
        "        print(f\"   Current Company: {current_company.get('name', 'N/A')}\")\n",
        "    else:\n",
        "        print(f\"   Current Company: N/A\")\n",
        "\n",
        "    print(f\"   Followers: {person_data.get('followers', 'N/A')}\")\n",
        "    print(f\"   Connections: {person_data.get('connections', 'N/A')}\")\n",
        "\n",
        "    # About section\n",
        "    about = person_data.get('about')\n",
        "    if about:\n",
        "        print(f\"\\n   About: {about[:200]}...\")\n",
        "\n",
        "    # Experience\n",
        "    experience = person_data.get('experience', [])\n",
        "    if experience:\n",
        "        print(f\"\\n   Experience ({len(experience)} roles):\")\n",
        "        for i, exp in enumerate(experience[:3]):  # Show first 3 roles\n",
        "            company = exp.get('company', 'N/A')\n",
        "            title = exp.get('title', 'N/A')\n",
        "            duration = exp.get('duration', 'N/A')\n",
        "            print(f\"      {i+1}. {title} at {company} ({duration})\")\n",
        "\n",
        "    # Education\n",
        "    education = person_data.get('education', [])\n",
        "    if education:\n",
        "        print(f\"\\n   Education ({len(education)} entries):\")\n",
        "        for i, edu in enumerate(education[:2]):  # Show first 2 education entries\n",
        "            title = edu.get('title', 'N/A')\n",
        "            years = f\"{edu.get('start_year', '')}-{edu.get('end_year', '')}\".strip('-') or 'N/A'\n",
        "            print(f\"      {i+1}. {title} ({years})\")\n",
        "\n",
        "    print(\"\\n📄 Full data structure (first 500 chars):\")\n",
        "    print(json.dumps(person_data, indent=2)[:500] + \"...\")\n",
        "\n",
        "except Exception as e:\n",
        "    print(f\"❌ Error scraping data: {e}\")\n",
        "    print(\"   This might be due to an invalid URL, rate limiting, or authentication requirements\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Mr--7WsarmZj"
      },
      "source": [
        "## SERP API for Market Signals\n",
        "\n",
        "Bright Data's SERP API lets us gather market signals through search results: hiring signals, news, and pain points.\n",
        "\n",
        "### Example SERP Queries for Sales Research\n",
        "\n",
        "```python\n",
        "# Hiring signals\n",
        "query = 'site:linkedin.com/jobs \"Company Name\" engineering'\n",
        "\n",
        "# Funding news\n",
        "query = '\"Company Name\" funding Series A announcement'\n",
        "\n",
        "# Recent news\n",
        "query = '\"Company Name\" news (2024 OR 2025)'\n",
        "```\n",
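        "These query templates can also be composed programmatically. Here is a minimal sketch (`build_signal_query` is our own illustrative helper, not part of Bright Data's API):\n",
        "\n",
        "```python\n",
        "def build_signal_query(company: str, signal: str) -> str:\n",
        "    \"\"\"Compose a search query string for a given market-signal type.\"\"\"\n",
        "    templates = {\n",
        "        \"hiring\": f'site:linkedin.com/jobs \"{company}\" engineering',\n",
        "        \"funding\": f'\"{company}\" funding Series A announcement',\n",
        "        \"news\": f'\"{company}\" news (2024 OR 2025)',\n",
        "    }\n",
        "    return templates[signal]\n",
        "```\n",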
        "\n",
        "### Data Structure\n",
        "\n",
        "SERP API returns search results:\n",
        "\n",
        "```python\n",
        "{\n",
        "  \"results\": [\n",
        "    {\n",
        "      \"title\": \"Company raises $50M Series B...\",\n",
        "      \"url\": \"https://techcrunch.com/...\",\n",
        "      \"snippet\": \"AI startup Company announced today...\",\n",
        "      \"date\": \"2025-01-15\"\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "```\n",
        "\n",
        "Let's see it in action!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DHMCNYKgrmZj",
        "outputId": "03e93ad1-e6fa-45c4-d55e-db4d2ca9fe26"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Bright Data SERP API initialized\n",
            "   API Key configured: ********************...\n",
            "   Ready to search Google/Bing for market signals\n"
          ]
        }
      ],
      "source": [
        "from haystack_brightdata import BrightDataSERP\n",
        "\n",
        "# Initialize the SERP API component\n",
        "# Note: Automatically uses BRIGHT_DATA_API_KEY from environment\n",
        "serp = BrightDataSERP()\n",
        "\n",
        "print(\"✅ Bright Data SERP API initialized\")\n",
        "print(f\"   API Key configured: {(os.environ.get('BRIGHT_DATA_API_KEY') or 'NOT SET')[:20]}...\")\n",
        "print(\"   Ready to search Google/Bing for market signals\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Sd3PZwuTrmZj"
      },
      "source": [
        "### Example: Using SERP API to Find Company News\n",
        "\n",
        "Let's use SERP to discover recent news and signals about a company. This is perfect for identifying buying signals like funding announcements, product launches, or hiring initiatives."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y59babkIrmZj",
        "outputId": "b6e435b9-46f4-4cb4-da6f-6965f749ab1c"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Searching for recent news about: OpenAI\n",
            "Query: \"OpenAI\" news funding OR announcement OR launch 2025 OR 2026\n",
            "\n",
            "✅ Found 9 results\n",
            "\n",
            "📰 Recent News & Signals:\n",
            "\n",
            "1. OpenAI seek investments from Middle East for multibillion- ...\n",
            "   URL: https://www.cnbc.com/2026/01/21/openai-seek-investments-from-middle-east-for-multibillion-dollar-round.html\n",
            "   Snippet: OpenAI is in talks with sovereign wealth funds in the Middle East to try to secure investments for a new multibillion dollar funding round, CNBC ...Re...\n",
            "\n",
            "2. Horizon 1000: Advancing AI for primary healthcare\n",
            "   URL: https://openai.com/index/horizon-1000/\n",
            "   Snippet: Together, the Gates Foundation and OpenAI are committing $50 million in funding, technology, and technical support to support their work ...Read more...\n",
            "\n",
            "3. OpenAI is coming for those sweet enterprise dollars in 2026\n",
            "   URL: https://techcrunch.com/2026/01/22/openai-is-coming-for-those-sweet-enterprise-dollars-in-2026/\n",
            "   Snippet: OpenAI on the other hand has seen its usage market share drop from 50% in 2023 to 27% at the end of 2025 — a trend that appears to concern the ...Read...\n",
            "\n",
            "4. OpenAI's Altman Meets Mideast Investors for $50 Billion ...\n",
            "   URL: https://www.bloomberg.com/news/articles/2026-01-21/openai-s-altman-meets-mideast-investors-for-50-billion-round\n",
            "   Snippet: OpenAI Chief Executive Officer Sam Altman has been meeting with top investors in the Middle East to line up funding for a new investment round ...Read...\n",
            "\n",
            "5. Inside OpenAI's Plan To Make Money\n",
            "   URL: https://www.forbes.com/sites/the-prompt/2026/01/20/inside-openais-plan-to-make-money/\n",
            "   Snippet: OpenAI ended 2025 with back-to-back massive infrastructure deals with the likes of Oracle, AMD and Broadcom that tallied up to $1.4 trillion of ...Rea...\n",
            "\n",
            "\n",
            "💡 Sales Intelligence Use Cases:\n",
            "   • Store these results in MongoDB with embeddings\n",
            "   • Use Gemini to summarize key developments\n",
            "   • Set up alerts for specific keywords (funding, hiring, launch)\n",
            "   • Identify warm leads (companies announcing growth)\n",
            "\n",
            "📄 Full data structure (first 500 chars):\n",
            "{\n",
            "  \"general\": {\n",
            "    \"search_engine\": \"google\",\n",
            "    \"query\": \"\\\"OpenAI\\\" news funding OR announcement OR launch 2025 OR 2026\",\n",
            "    \"language\": \"en\",\n",
            "    \"location\": \"San Antonio, Texas\",\n",
            "    \"mobile\": false,\n",
            "    \"basic_view\": false,\n",
            "    \"search_type\": \"text\",\n",
            "    \"page_title\": \"\\\"OpenAI\\\" news funding OR announcement OR launch 2025 OR 2026 - Google Search\",\n",
            "    \"timestamp\": \"2026-01-25T12:05:32.212Z\"\n",
            "  },\n",
            "  \"input\": {\n",
            "    \"original_url\": \"https://www.google.com/search?q=%22OpenAI%22+news+funding...\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "# Example: Search for recent company news and announcements\n",
        "company_name = \"OpenAI\"\n",
        "search_query = f'\"{company_name}\" news funding OR announcement OR launch 2025 OR 2026'\n",
        "\n",
        "print(f\"Searching for recent news about: {company_name}\")\n",
        "print(f\"Query: {search_query}\")\n",
        "print()\n",
        "\n",
        "try:\n",
        "    result = serp.run(\n",
        "        query=search_query,\n",
        "        num_results=10\n",
        "    )\n",
        "\n",
        "    # Parse the results\n",
        "    if isinstance(result[\"results\"], str):\n",
        "        serp_data = json.loads(result[\"results\"])\n",
        "    else:\n",
        "        serp_data = result[\"results\"]\n",
        "\n",
        "    # Extract organic results (may be at root level or nested)\n",
        "    organic_results = serp_data.get(\"organic\", [])\n",
        "    if not organic_results and \"results\" in serp_data:\n",
        "        organic_results = serp_data.get(\"results\", [])\n",
        "\n",
        "    if not organic_results:\n",
        "        print(\"⚠️ No results found\")\n",
        "    else:\n",
        "        print(f\"✅ Found {len(organic_results)} results\")\n",
        "        print(\"\\n📰 Recent News & Signals:\\n\")\n",
        "\n",
        "        for i, item in enumerate(organic_results[:5], 1):  # Show top 5 results\n",
        "            title = item.get(\"title\", \"N/A\")\n",
        "            link = item.get(\"link\", item.get(\"url\", \"N/A\"))\n",
        "            snippet = item.get(\"snippet\", item.get(\"description\", \"N/A\"))\n",
        "\n",
        "            print(f\"{i}. {title}\")\n",
        "            print(f\"   URL: {link}\")\n",
        "            print(f\"   Snippet: {snippet[:150]}...\")\n",
        "            print()\n",
        "\n",
        "        print(\"\\n💡 Sales Intelligence Use Cases:\")\n",
        "        print(\"   • Store these results in MongoDB with embeddings\")\n",
        "        print(\"   • Use Gemini to summarize key developments\")\n",
        "        print(\"   • Set up alerts for specific keywords (funding, hiring, launch)\")\n",
        "        print(\"   • Identify warm leads (companies announcing growth)\")\n",
        "\n",
        "    print(\"\\n📄 Full data structure (first 500 chars):\")\n",
        "    print(json.dumps(serp_data, indent=2)[:500] + \"...\")\n",
        "\n",
        "except Exception as e:\n",
        "    print(f\"❌ Error searching: {e}\")\n",
        "    print(\"   This might be due to rate limiting or API issues\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B2DxO_80rmZj"
      },
      "source": [
        "## Data Processing & Indexing Pipeline\n",
        "\n",
        "Now we need to process and index our scraped data into MongoDB Atlas for semantic search.\n",
        "\n",
        "### The Indexing Pipeline Flow\n",
        "\n",
        "```\n",
        "Raw Scraped Data → Document Creation → Embedding Generation → MongoDB Storage\n",
        "     (JSON)            (Haystack)         (Gemini 768d)         (Vector DB)\n",
        "```\n",
        "\n",
        "### Document Structure\n",
        "\n",
        "Each document in MongoDB has three components:\n",
        "\n",
        "```python\n",
        "{\n",
        "  \"content\": \"Human-readable text about company/person\",\n",
        "  \"embedding\": [0.123, -0.456, ...],  # 768-dimensional vector\n",
        "  \"meta\": {\n",
        "    \"source_url\": \"...\",\n",
        "    \"dataset_type\": \"crunchbase_company\",\n",
        "    \"company_name\": \"...\",\n",
        "    \"industry\": \"...\",\n",
        "    \"funding_stage\": \"...\",\n",
        "    \"location\": \"...\",\n",
        "    \"scraped_date\": \"2026-01-19\"\n",
        "  }\n",
        "}\n",
        "```\n",
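        "Because `meta` holds structured fields, a retriever can pair vector search with exact filters. A sketch of a Haystack 2.x-style filter over these fields (field names follow the `meta` keys above; the date cutoff is an arbitrary example):\n",
        "\n",
        "```python\n",
        "# Example filter: restrict semantic search to recent Crunchbase company docs\n",
        "company_filter = {\n",
        "    \"operator\": \"AND\",\n",
        "    \"conditions\": [\n",
        "        {\"field\": \"meta.dataset_type\", \"operator\": \"==\", \"value\": \"crunchbase_company\"},\n",
        "        {\"field\": \"meta.scraped_date\", \"operator\": \">=\", \"value\": \"2026-01-01\"},\n",
        "    ],\n",
        "}\n",
        "```\n",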
        "\n",
        "Let's build it!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0o0vtLbfrmZj"
      },
      "source": [
        "### Helper Functions: Transform Scraped Data into Haystack Documents\n",
        "\n",
        "Before we can index data, we need to transform raw scraper responses into Haystack `Document` objects. Let's create helper functions for each data source.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9ifgt0BNrmZj",
        "outputId": "9b6f714f-d83e-4e93-b916-2b497086239b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Helper function created: create_company_documents()\n",
            "   Supports: crunchbase_company, linkedin_company_profile\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "from datetime import datetime\n",
        "from haystack import Document\n",
        "\n",
        "def create_company_documents(scraper_result, source_url, dataset_type):\n",
        "    \"\"\"\n",
        "    Transform company data from Crunchbase or LinkedIn into Haystack Documents.\n",
        "\n",
        "    Args:\n",
        "        scraper_result: Raw result from BrightDataWebScraper.run()\n",
        "        source_url: Original URL that was scraped\n",
        "        dataset_type: \"crunchbase_company\" or \"linkedin_company_profile\"\n",
        "\n",
        "    Returns:\n",
        "        List of Document objects ready for indexing\n",
        "    \"\"\"\n",
        "    # Parse the JSON response\n",
        "    if isinstance(scraper_result[\"data\"], str):\n",
        "        data = json.loads(scraper_result[\"data\"])\n",
        "    else:\n",
        "        data = scraper_result[\"data\"]\n",
        "\n",
        "    # Handle both list and single object responses\n",
        "    if not isinstance(data, list):\n",
        "        data = [data]\n",
        "\n",
        "    documents = []\n",
        "    scraped_date = datetime.now().strftime(\"%Y-%m-%d\")\n",
        "\n",
        "    for item in data:\n",
        "        # Create content string based on dataset type\n",
        "        if dataset_type == \"crunchbase_company\":\n",
        "            content = f\"\"\"Company: {item.get('name', 'N/A')}\n",
        "Overview: {item.get('about', 'N/A')}\n",
        "Industries: {item.get('industries', 'N/A')}\n",
        "Operating Status: {item.get('operating_status', 'N/A')}\n",
        "Location: {item.get('headquarters', 'N/A')}\n",
        "Founded: {item.get('founded_year') or item.get('founded_date', 'N/A')}\n",
        "Employees: {item.get('num_employees', 'N/A')}\n",
        "Website: {item.get('website', 'N/A')}\"\"\"\n",
        "\n",
        "        elif dataset_type == \"linkedin_company_profile\":\n",
        "            content = f\"\"\"Company: {item.get('name', 'N/A')}\n",
        "About: {item.get('about') or item.get('description', 'N/A')}\n",
        "Industries: {item.get('industries', 'N/A')}\n",
        "Company Size: {item.get('company_size', 'N/A')}\n",
        "Headquarters: {item.get('headquarters', 'N/A')}\n",
        "Founded: {item.get('founded', 'N/A')}\n",
        "Website: {item.get('website', 'N/A')}\n",
        "Followers: {item.get('followers', 'N/A')}\n",
        "Employees on LinkedIn: {item.get('employees_in_linkedin', 'N/A')}\"\"\"\n",
        "\n",
        "        else:\n",
        "            content = f\"Company: {item.get('name', 'N/A')}\"\n",
        "\n",
        "        # Extract industry - handle both string and list formats\n",
        "        industries = item.get('industries', item.get('industry', ''))\n",
        "        if isinstance(industries, list):\n",
        "            industries = ', '.join([\n",
        "                ind.get('value', ind) if isinstance(ind, dict) else str(ind)\n",
        "                for ind in industries\n",
        "            ])\n",
        "\n",
        "        # Create Document with metadata\n",
        "        documents.append(Document(\n",
        "            content=content,\n",
        "            meta={\n",
        "                \"source_url\": source_url,\n",
        "                \"dataset_type\": dataset_type,\n",
        "                \"company_name\": item.get('name', ''),\n",
        "                \"industry\": industries,\n",
        "                \"location\": item.get('headquarters') or item.get('location', ''),\n",
        "                \"scraped_date\": scraped_date\n",
        "            }\n",
        "        ))\n",
        "\n",
        "    return documents\n",
        "\n",
        "print(\"✅ Helper function created: create_company_documents()\")\n",
        "print(\"   Supports: crunchbase_company, linkedin_company_profile\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Gpg4UC8VrmZj",
        "outputId": "639a7200-83ff-4ea0-8cab-82cc7f81a13f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Helper function created: create_person_documents()\n",
            "   Supports: linkedin_person_profile\n"
          ]
        }
      ],
      "source": [
        "def create_person_documents(scraper_result, source_url):\n",
        "    \"\"\"\n",
        "    Transform LinkedIn person profile data into Haystack Documents.\n",
        "\n",
        "    Args:\n",
        "        scraper_result: Raw result from BrightDataWebScraper.run()\n",
        "        source_url: Original LinkedIn profile URL\n",
        "\n",
        "    Returns:\n",
        "        List of Document objects ready for indexing\n",
        "    \"\"\"\n",
        "    # Parse the JSON response\n",
        "    if isinstance(scraper_result[\"data\"], str):\n",
        "        data = json.loads(scraper_result[\"data\"])\n",
        "    else:\n",
        "        data = scraper_result[\"data\"]\n",
        "\n",
        "    # Handle both list and single object responses\n",
        "    if not isinstance(data, list):\n",
        "        data = [data]\n",
        "\n",
        "    documents = []\n",
        "    scraped_date = datetime.now().strftime(\"%Y-%m-%d\")\n",
        "\n",
        "    for person in data:\n",
        "        # Extract experience summary (first 3 roles)\n",
        "        experience = person.get('experience', [])\n",
        "        experience_summary = []\n",
        "        for i, exp in enumerate(experience[:3]):\n",
        "            company = exp.get('company', 'N/A')\n",
        "            title = exp.get('title', 'N/A')\n",
        "            duration = exp.get('duration', 'N/A')\n",
        "            experience_summary.append(f\"{title} at {company} ({duration})\")\n",
        "        experience_text = '\\n'.join(experience_summary) if experience_summary else 'N/A'\n",
        "\n",
        "        # Extract education summary\n",
        "        education = person.get('education', [])\n",
        "        education_summary = []\n",
        "        for edu in education[:2]:\n",
        "            title = edu.get('title', 'N/A')\n",
        "            years = f\"{edu.get('start_year', '')}-{edu.get('end_year', '')}\".strip('-') or 'N/A'\n",
        "            education_summary.append(f\"{title} ({years})\")\n",
        "        education_text = '\\n'.join(education_summary) if education_summary else 'N/A'\n",
        "\n",
        "        # Get current company info\n",
        "        current_company = person.get('current_company', {})\n",
        "        current_company_name = current_company.get('name', 'N/A') if current_company else 'N/A'\n",
        "\n",
        "        # Create content string\n",
        "        content = f\"\"\"Name: {person.get('name', 'N/A')}\n",
        "Position: {person.get('position', 'N/A')}\n",
        "Current Company: {current_company_name}\n",
        "Location: {person.get('city', 'N/A')}, {person.get('country_code', 'N/A')}\n",
        "About: {person.get('about', 'N/A')}\n",
        "Followers: {person.get('followers', 'N/A')}\n",
        "Connections: {person.get('connections', 'N/A')}\n",
        "\n",
        "Recent Experience:\n",
        "{experience_text}\n",
        "\n",
        "Education:\n",
        "{education_text}\"\"\"\n",
        "\n",
        "        # Create Document with metadata\n",
        "        documents.append(Document(\n",
        "            content=content,\n",
        "            meta={\n",
        "                \"source_url\": source_url,\n",
        "                \"dataset_type\": \"linkedin_person_profile\",\n",
        "                \"person_name\": person.get('name', ''),\n",
        "                \"person_title\": person.get('position', ''),\n",
        "                \"company\": current_company_name,\n",
        "                \"location\": f\"{person.get('city', '')}, {person.get('country_code', '')}\",\n",
        "                \"scraped_date\": scraped_date\n",
        "            }\n",
        "        ))\n",
        "\n",
        "    return documents\n",
        "\n",
        "print(\"✅ Helper function created: create_person_documents()\")\n",
        "print(\"   Supports: linkedin_person_profile\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oBC1aYXqrmZj"
      },
      "source": [
        "### Build the Indexing Pipeline\n",
        "\n",
        "Now let's create a Haystack pipeline that automatically:\n",
        "1. Takes Document objects\n",
        "2. Generates embeddings using Gemini\n",
        "3. Writes to MongoDB Atlas\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bA6mVkgWrmZj",
        "outputId": "439cdc38-fb11-482c-ef38-ea0f36a50f01"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Indexing pipeline created\n",
            "\n",
            "Pipeline structure:\n",
            "   Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)\n",
            "\n",
            "Components:\n",
            "   • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)\n",
            "   • Writer: MongoDB Atlas (leads)\n"
          ]
        }
      ],
      "source": [
        "from haystack import Pipeline\n",
        "from haystack.components.writers import DocumentWriter\n",
        "from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder\n",
        "\n",
        "# Create the indexing pipeline\n",
        "indexing_pipeline = Pipeline()\n",
        "\n",
        "# Add components - create a fresh embedder instance for this pipeline\n",
        "indexing_pipeline.add_component(\"embedder\", GoogleGenAIDocumentEmbedder(model=\"text-embedding-004\"))\n",
        "indexing_pipeline.add_component(\"writer\", DocumentWriter(document_store=document_store))\n",
        "\n",
        "# Connect components\n",
        "indexing_pipeline.connect(\"embedder.documents\", \"writer.documents\")\n",
        "\n",
        "print(\"✅ Indexing pipeline created\")\n",
        "print(\"\\nPipeline structure:\")\n",
        "print(\"   Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)\")\n",
        "print(\"\\nComponents:\")\n",
        "print(f\"   • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)\")\n",
        "print(f\"   • Writer: MongoDB Atlas ({document_store.collection_name})\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SppZYQ0TrmZj"
      },
      "source": [
        "### Index Sample Companies\n",
        "\n",
        "Let's test the complete indexing flow: first make sure the MongoDB collection exists, then scrape a company and index it into MongoDB Atlas.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IW75J5sNrmZk",
        "outputId": "fb9c7abb-5cbc-46d0-d9cf-b147c044c1ca"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Collection 'leads' already exists\n",
            "   Current document count: 2\n"
          ]
        }
      ],
      "source": [
        "# Initialize the collection in MongoDB if it doesn't exist\n",
        "# This creates the collection and ensures it's ready for indexing\n",
        "\n",
        "try:\n",
        "    # Get the MongoDB client and database\n",
        "    from pymongo import MongoClient\n",
        "\n",
        "    client = MongoClient(os.environ.get(\"MONGO_CONNECTION_STRING\"))\n",
        "    db = client[document_store.database_name]\n",
        "\n",
        "    # Create the collection if it doesn't exist\n",
        "    if document_store.collection_name not in db.list_collection_names():\n",
        "        db.create_collection(document_store.collection_name)\n",
        "        print(f\"✅ Created collection '{document_store.collection_name}' in database '{document_store.database_name}'\")\n",
        "    else:\n",
        "        print(f\"✅ Collection '{document_store.collection_name}' already exists\")\n",
        "\n",
        "    # Count existing documents\n",
        "    collection = db[document_store.collection_name]\n",
        "    doc_count = collection.count_documents({})\n",
        "    print(f\"   Current document count: {doc_count}\")\n",
        "\n",
        "except Exception as e:\n",
        "    print(f\"⚠️ Error initializing collection: {e}\")\n",
        "    print(\"   You may need to create the collection manually in MongoDB Atlas\")"
      ]
    },
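    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Optional: create the vector search index programmatically.** Semantic retrieval needs an Atlas Vector Search index on the collection. If you haven't already created one in the Atlas UI, the sketch below creates it with PyMongo (assumes pymongo >= 4.6; the index name `vector_index` is an assumption and must match the `vector_search_index` your `MongoDBAtlasDocumentStore` was configured with)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hedged sketch: create the Atlas Vector Search index with PyMongo (>= 4.6).\n",
        "# The index name below is an assumption; it must match the `vector_search_index`\n",
        "# configured on the document store, so adjust it if yours differs.\n",
        "import os\n",
        "\n",
        "# 768 dimensions to match text-embedding-004\n",
        "index_definition = {\n",
        "    \"fields\": [\n",
        "        {\"type\": \"vector\", \"path\": \"embedding\", \"numDimensions\": 768, \"similarity\": \"cosine\"}\n",
        "    ]\n",
        "}\n",
        "\n",
        "if os.environ.get(\"MONGO_CONNECTION_STRING\"):\n",
        "    from pymongo import MongoClient\n",
        "    from pymongo.operations import SearchIndexModel\n",
        "\n",
        "    client = MongoClient(os.environ[\"MONGO_CONNECTION_STRING\"])\n",
        "    collection = client[document_store.database_name][document_store.collection_name]\n",
        "    try:\n",
        "        collection.create_search_index(\n",
        "            SearchIndexModel(definition=index_definition, name=\"vector_index\", type=\"vectorSearch\")\n",
        "        )\n",
        "        print(\"✅ Vector search index requested (it can take a minute to become queryable)\")\n",
        "    except Exception as e:\n",
        "        print(f\"⚠️ Could not create index (it may already exist): {e}\")\n",
        "else:\n",
        "    print(\"⚠️ MONGO_CONNECTION_STRING not set; create the index in the Atlas UI instead\")"
      ]
    },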
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Au9oC2bsrmZk"
      },
      "source": [
        "With the collection initialized, let's run the full flow: scrape a company from Crunchbase, transform the result into a Haystack document, and index it into MongoDB Atlas:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TE7Tw544rmZk",
        "outputId": "3694185f-c3d4-4248-d3cb-f7b5c4ae9a5b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Step 1: Scraping company data from https://www.crunchbase.com/organization/openai\n",
            "------------------------------------------------------------\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Scraping complete\n",
            "\n",
            "Step 2: Transforming into Haystack Documents\n",
            "------------------------------------------------------------\n",
            "✅ Created 1 document(s)\n",
            "\n",
            "Document preview:\n",
            "   Content (first 200 chars): Company: OpenAI\n",
            "Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\n",
            "Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'ar...\n",
            "   Metadata: {'source_url': 'https://www.crunchbase.com/organization/openai', 'dataset_type': 'crunchbase_company', 'company_name': 'OpenAI', 'industry': 'Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS', 'location': [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}], 'scraped_date': '2026-01-25'}\n",
            "\n",
            "Step 3: Generating embeddings and indexing into MongoDB\n",
            "------------------------------------------------------------\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Calculating embeddings: 1it [00:00,  1.18it/s]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ Indexed 1 document(s) into MongoDB\n",
            "\n",
            "🎉 Complete! The company is now searchable in your vector database\n",
            "   • Semantic search: Find similar companies\n",
            "   • Metadata filters: Filter by industry, location, etc.\n",
            "   • RAG pipeline: Answer questions about this company\n"
          ]
        }
      ],
      "source": [
        "# Example: Scrape and index a company from Crunchbase\n",
        "company_url = \"https://www.crunchbase.com/organization/openai\"\n",
        "\n",
        "print(f\"Step 1: Scraping company data from {company_url}\")\n",
        "print(\"-\" * 60)\n",
        "\n",
        "# Scrape the company\n",
        "scraper_result = scraper.run(\n",
        "    dataset=\"crunchbase_company\",\n",
        "    url=company_url\n",
        ")\n",
        "\n",
        "print(\"✅ Scraping complete\")\n",
        "\n",
        "# Transform into Haystack Documents\n",
        "print(\"\\nStep 2: Transforming into Haystack Documents\")\n",
        "print(\"-\" * 60)\n",
        "\n",
        "documents = create_company_documents(\n",
        "    scraper_result=scraper_result,\n",
        "    source_url=company_url,\n",
        "    dataset_type=\"crunchbase_company\"\n",
        ")\n",
        "\n",
        "print(f\"✅ Created {len(documents)} document(s)\")\n",
        "print(\"\\nDocument preview:\")\n",
        "print(f\"   Content (first 200 chars): {documents[0].content[:200]}...\")\n",
        "print(f\"   Metadata: {documents[0].meta}\")\n",
        "\n",
        "# Index into MongoDB\n",
        "print(\"\\nStep 3: Generating embeddings and indexing into MongoDB\")\n",
        "print(\"-\" * 60)\n",
        "\n",
        "result = indexing_pipeline.run({\"embedder\": {\"documents\": documents}})\n",
        "\n",
        "print(f\"✅ Indexed {result['writer']['documents_written']} document(s) into MongoDB\")\n",
        "print(\"\\n🎉 Complete! The company is now searchable in your vector database\")\n",
        "print(\"   • Semantic search: Find similar companies\")\n",
        "print(\"   • Metadata filters: Filter by industry, location, etc.\")\n",
        "print(\"   • RAG pipeline: Answer questions about this company\")"
      ]
    },
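    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Optional: make indexing idempotent.** Each re-run of the cell above creates new `Document` objects with fresh random ids, so the store accumulates duplicate copies of the same company. One way to avoid this (a sketch; the `deterministic_id` helper is hypothetical, not part of Haystack): derive a stable document id from the source URL and write with `DuplicatePolicy.OVERWRITE`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import hashlib\n",
        "\n",
        "def deterministic_id(source_url: str, dataset_type: str) -> str:\n",
        "    \"\"\"Stable id so re-scraping the same URL overwrites the old document.\"\"\"\n",
        "    return hashlib.sha256(f\"{dataset_type}:{source_url}\".encode()).hexdigest()\n",
        "\n",
        "doc_id = deterministic_id(\"https://www.crunchbase.com/organization/openai\", \"crunchbase_company\")\n",
        "print(f\"Deterministic id: {doc_id[:12]}...\")\n",
        "\n",
        "# Then pass id=deterministic_id(...) when constructing each Document, and create\n",
        "# the writer with an overwrite policy:\n",
        "#   from haystack.document_stores.types import DuplicatePolicy\n",
        "#   DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)"
      ]
    },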
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rNGz7-EqrmZk"
      },
      "source": [
        "## RAG Pipeline for Sales Intelligence\n",
        "\n",
        "RAG combines **retrieval** (finding relevant documents) with **generation** (LLM synthesis) to answer questions based on your indexed data.\n",
        "\n",
        "```\n",
        "User Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer\n",
        "```\n",
        "\n",
        "### Components\n",
        "\n",
        "- [GoogleGenAITextEmbedder](https://docs.haystack.deepset.ai/docs/googlegenaitextembedder)\n",
        "- [MongoDBAtlasEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/mongodbatlasembeddingretriever)\n",
        "- [ChatPromptBuilder](https://docs.haystack.deepset.ai/docs/chatpromptbuilder)\n",
        "- [GoogleGenAIChatGenerator](https://docs.haystack.deepset.ai/docs/googlegenaichatgenerator)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xySYIHaurmZk"
      },
      "source": [
        "### Build the RAG Pipeline\n",
        "\n",
        "Now let's assemble all components into a complete RAG pipeline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5nwZSovnrmZk",
        "outputId": "2a1c6d74-21a2-4851-8e1f-3a59e2a3ca38"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.\n",
            "ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n",
            "Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "✅ RAG pipeline created\n",
            "\n",
            "Pipeline structure:\n",
            "   Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer\n",
            "\n",
            "Components:\n",
            "   • Text Embedder: text-embedding-004 (768d)\n",
            "   • Retriever: MongoDB Atlas (top_k=5)\n",
            "   • Prompt Builder: Sales intelligence template\n",
            "   • Generator: gemini-2.5-flash\n"
          ]
        }
      ],
      "source": [
        "from haystack import Pipeline\n",
        "from haystack.components.builders import ChatPromptBuilder\n",
        "from haystack.dataclasses import ChatMessage\n",
        "from haystack_integrations.components.embedders.google_genai import GoogleGenAITextEmbedder\n",
        "from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator\n",
        "from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever\n",
        "\n",
        "# Define the prompt template for sales intelligence\n",
        "system_message = ChatMessage.from_system(\"\"\"\n",
        "You are a sales intelligence assistant. Your role is to analyze company and people data to provide actionable sales intelligence.\n",
        "\n",
        "When answering queries:\n",
        "- Cite specific company names and details from the data\n",
        "- Provide insights relevant for sales outreach\n",
        "- Highlight key information like funding, company size, location, recent news\n",
        "- Suggest talking points for personalized outreach\n",
        "\"\"\")\n",
        "\n",
        "user_template = \"\"\"\n",
        "Based on the following company/person data, answer the user's question.\n",
        "\n",
        "Context:\n",
        "{% for document in documents %}\n",
        "{{ document.content }}\n",
        "---\n",
        "{% endfor %}\n",
        "\n",
        "Question: {{ question }}\n",
        "\n",
        "Provide a detailed, actionable answer based on the retrieved data.\n",
        "\"\"\"\n",
        "\n",
        "user_message = ChatMessage.from_user(user_template)\n",
        "\n",
        "# Create the RAG pipeline\n",
        "rag_pipeline = Pipeline()\n",
        "\n",
        "# Add components\n",
        "rag_pipeline.add_component(\"text_embedder\", GoogleGenAITextEmbedder(model=\"text-embedding-004\"))\n",
        "rag_pipeline.add_component(\"retriever\", MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=5))\n",
        "rag_pipeline.add_component(\"prompt_builder\", ChatPromptBuilder(template=[system_message, user_message], required_variables=[\"question\", \"documents\"]))\n",
        "rag_pipeline.add_component(\"generator\", GoogleGenAIChatGenerator(model=\"gemini-2.5-flash\"))\n",
        "\n",
        "# Connect components\n",
        "rag_pipeline.connect(\"text_embedder.embedding\", \"retriever.query_embedding\")\n",
        "rag_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n",
        "rag_pipeline.connect(\"prompt_builder.prompt\", \"generator.messages\")\n",
        "\n",
        "print(\"✅ RAG pipeline created\")\n",
        "print(\"\\nPipeline structure:\")\n",
        "print(\"   Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer\")\n",
        "print(\"\\nComponents:\")\n",
        "print(\"   • Text Embedder: text-embedding-004 (768d)\")\n",
        "print(\"   • Retriever: MongoDB Atlas (top_k=5)\")\n",
        "print(\"   • Prompt Builder: Sales intelligence template\")\n",
        "print(\"   • Generator: gemini-2.5-flash\")"
      ]
    },
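    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Optional: filter retrieval by metadata.** Haystack retrievers accept metadata `filters` at run time, so you can restrict results to, say, Crunchbase documents only. A sketch using Haystack's filter syntax (note: for Atlas pre-filtering, fields you filter on generally must also be declared as `filter` fields in the vector search index definition)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hedged sketch: restrict retrieval with metadata filters at query time.\n",
        "# Field names follow the `meta` keys set during indexing (e.g. dataset_type).\n",
        "filters = {\n",
        "    \"operator\": \"AND\",\n",
        "    \"conditions\": [\n",
        "        {\"field\": \"meta.dataset_type\", \"operator\": \"==\", \"value\": \"crunchbase_company\"},\n",
        "    ],\n",
        "}\n",
        "\n",
        "# Pass the filters alongside the query when running the pipeline, e.g.:\n",
        "# result = rag_pipeline.run(data={\n",
        "#     \"text_embedder\": {\"text\": question},\n",
        "#     \"retriever\": {\"filters\": filters},\n",
        "#     \"prompt_builder\": {\"question\": question},\n",
        "# })\n",
        "print(f\"Filter: {filters}\")"
      ]
    },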
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6Kbl4odTrmZk"
      },
      "source": [
        "## Demo: Query the Sales Research Assistant\n",
        "\n",
        "Let's test our RAG pipeline with a sales intelligence question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iJnIHcVDrmZk",
        "outputId": "47870e81-3b7b-4aae-e3dc-44a47094afcc"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Question: What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach.\n",
            "\n",
            "================================================================================\n",
            "Processing...\n",
            "================================================================================\n",
            "\n",
            "Answer:\n",
            "--------------------------------------------------------------------------------\n",
            "Based on the provided data, here's what you can tell about OpenAI and relevant information for sales outreach:\n",
            "\n",
            "**Company Overview:**\n",
            "\n",
            "*   **Name:** OpenAI\n",
            "*   **Core Business:** OpenAI is a leading AI research and deployment company. They are known for developing advanced AI models, most notably **ChatGPT**.\n",
            "*   **Key Industries:** They operate at the forefront of several cutting-edge AI fields, including:\n",
            "    *   Agentic AI\n",
            "    *   Artificial Intelligence (AI)\n",
            "    *   Foundational AI\n",
            "    *   Generative AI\n",
            "    *   Machine Learning\n",
            "    *   Natural Language Processing (NLP)\n",
            "    *   SaaS (indicating they deploy their models as services)\n",
            "*   **Operating Status:** Active\n",
            "*   **Size:** The data indicates their employee count is substantial, either **1,001-5,000** or **5,001-10,000**. This suggests a rapidly growing, large enterprise.\n",
            "*   **Website:** https://www.openai.com\n",
            "\n",
            "**Sales Intelligence and Talking Points:**\n",
            "\n",
            "1.  **Pioneers in AI:** OpenAI is a major player and innovator in the AI space, particularly in generative AI and foundational models. This indicates they are constantly looking for cutting-edge solutions, talent, and infrastructure to maintain their leadership.\n",
            "    *   **Sales Angle:** Any product or service that enhances AI research, model development, deployment efficiency, or security for advanced AI systems would be highly relevant.\n",
            "    *   **Talking Point:** \"Given OpenAI's groundbreaking work in [Generative AI/Foundational AI] with models like ChatGPT, I imagine you're constantly seeking ways to optimize your [data processing/compute infrastructure/model deployment/AI safety protocols].\"\n",
            "\n",
            "2.  **SaaS Provider:** Their inclusion in the \"SaaS\" industry means they are not just developing AI, but also productizing and deploying it as services. This implies needs related to scaling, customer support, API management, cloud infrastructure, and enterprise-grade reliability.\n",
            "    *   **Sales Angle:** Solutions for large-scale SaaS operations, particularly those with high computational demands, would be valuable.\n",
            "    *   **Talking Point:** \"As a key SaaS provider in the AI space, managing the scalability and reliability of services like ChatGPT must be a critical focus. How are you currently addressing [specific SaaS challenge, e.g., low-latency inference at scale/secure API access]?\"\n",
            "\n",
            "3.  **Large and Growing Organization:** With thousands of employees, OpenAI likely faces challenges typical of rapidly scaling enterprises, such as internal communication, talent management, complex project coordination, and managing diverse research and engineering teams.\n",
            "    *   **Sales Angle:** Solutions for enterprise collaboration, project management, developer tools, or specialized HR/recruitment for AI talent could be relevant.\n",
            "    *   **Talking Point:** \"With OpenAI's rapid growth and the complexity of your AI projects, I'm curious how you manage [specific internal challenge, e.g., cross-functional collaboration between research and engineering/onboarding specialized AI talent].\"\n",
            "\n",
            "4.  **Focus on Advanced AI:** Their specific industry tags like \"Agentic AI\" and \"Foundational AI\" highlight their focus on the most complex and impactful areas of AI. This implies a need for robust, high-performance, and secure infrastructure.\n",
            "    *   **Sales Angle:** If your product or service provides a competitive advantage in areas like high-performance computing, specialized hardware (e.g., GPUs), data privacy, or ethical AI frameworks, it would resonate.\n",
            "    *   **Talking Point:** \"Your work in Foundational AI is truly pushing the boundaries. We've seen companies tackling similar challenges find significant value in our [specific solution, e.g., secure data pipelines for large models/compute orchestration for distributed AI training].\"\n",
            "\n",
            "**Overall Sales Strategy:**\n",
            "\n",
            "When reaching out to OpenAI, emphasize how your solution directly supports their mission to advance AI, enhances their existing AI models or infrastructure, improves their SaaS offerings, or addresses the operational complexities of a leading, fast-growing AI enterprise. Tailor your message to their specific industry focus (e.g., Generative AI, Foundational AI) to demonstrate you understand their unique challenges and priorities.\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "📄 Retrieved 3 relevant documents from MongoDB\n",
            "\n",
            "================================================================================\n",
            "RETRIEVED DOCUMENTS:\n",
            "================================================================================\n",
            "\n",
            "Document 1:\n",
            "   Company: OpenAI\n",
            "   Source: crunchbase_company\n",
            "   Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]\n",
            "   Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS\n",
            "\n",
            "   Content:\n",
            "   Company: OpenAI\n",
            "Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\n",
            "Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "Document 2:\n",
            "   Company: OpenAI\n",
            "   Source: crunchbase_company\n",
            "   Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]\n",
            "   Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS\n",
            "\n",
            "   Content:\n",
            "   Company: OpenAI\n",
            "Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\n",
            "Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "Document 3:\n",
            "   Company: OpenAI\n",
            "   Source: crunchbase_company\n",
            "   Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]\n",
            "   Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS\n",
            "\n",
            "   Content:\n",
            "   Company: OpenAI\n",
            "Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.\n",
            "Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...\n",
            "--------------------------------------------------------------------------------\n"
          ]
        }
      ],
      "source": [
        "# Example query: Ask about companies in the database\n",
        "question = \"What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach.\"\n",
        "\n",
        "print(f\"Question: {question}\")\n",
        "print(\"\\n\" + \"=\"*80)\n",
        "print(\"Processing...\")\n",
        "print(\"=\"*80 + \"\\n\")\n",
        "\n",
        "try:\n",
        "    # Run the RAG pipeline with include_outputs_from to get retriever results\n",
        "    result = rag_pipeline.run(\n",
        "        data={\n",
        "            \"text_embedder\": {\"text\": question},\n",
        "            \"prompt_builder\": {\"question\": question}\n",
        "        },\n",
        "        include_outputs_from={\"retriever\"}\n",
        "    )\n",
        "\n",
        "    # Extract the answer using .text\n",
        "    answer = result[\"generator\"][\"replies\"][0].text\n",
        "\n",
        "    print(\"Answer:\")\n",
        "    print(\"-\" * 80)\n",
        "    print(answer)\n",
        "    print(\"-\" * 80)\n",
        "\n",
        "    # Show retrieved documents\n",
        "    if \"retriever\" in result:\n",
        "        retrieved_docs = result[\"retriever\"][\"documents\"]\n",
        "        print(f\"\\n📄 Retrieved {len(retrieved_docs)} relevant documents from MongoDB\")\n",
        "        print(\"\\n\" + \"=\"*80)\n",
        "        print(\"RETRIEVED DOCUMENTS:\")\n",
        "        print(\"=\"*80)\n",
        "\n",
        "        for i, doc in enumerate(retrieved_docs, 1):\n",
        "            print(f\"\\nDocument {i}:\")\n",
        "            print(f\"   Company: {doc.meta.get('company_name', 'N/A')}\")\n",
        "            print(f\"   Source: {doc.meta.get('dataset_type', 'N/A')}\")\n",
        "            print(f\"   Location: {doc.meta.get('location', 'N/A')}\")\n",
        "            print(f\"   Industry: {doc.meta.get('industry', 'N/A')}\")\n",
        "            print(f\"\\n   Content:\")\n",
        "            print(f\"   {doc.content[:300]}...\")\n",
        "            print(\"-\" * 80)\n",
        "    else:\n",
        "        print(\"\\n⚠️ Retriever output not available\")\n",
        "\n",
        "except Exception as e:\n",
        "    print(f\"❌ Error: {e}\")\n",
        "    import traceback\n",
        "    traceback.print_exc()\n",
        "    print(\"\\nMake sure you have:\")\n",
        "    print(\"   1. Indexed at least one company (run the indexing demo cell)\")\n",
        "    print(\"   2. MongoDB collection exists and has data\")\n",
        "    print(\"   3. Vector search index is properly configured\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
