
AI Scraper

An interactive tool that lets users choose between a local (Ollama) or remote (Sambanova) LLM to scrape, process, and structure web data into tabular form for download. It uses Selenium or Crawl4AI for scraping and Streamlit for the UI.

Video Demo

AI Scraper Demo

🚀 App Description

AI Scraper is a Python‑based utility that automates the end‑to‑end flow of:

  1. Scraping
    – Choose between Selenium (browser‑driven) or Crawl4AI (headless crawler)
    – Input any URL and custom query (e.g. “Extract all product names and prices”)
  2. Content Extraction & Conversion
    – Fetch raw HTML
    – Convert to Markdown for cleaner context
  3. LLM‑Based Structuring
    – Send the Markdown + query to your selected provider (Ollama local model or Sambanova API)
    – Instruct the model to output CSV‑formatted rows
  4. DataFrame & Export
    – Parse the returned CSV into a pandas DataFrame
    – Allow users to download as CSV, JSON, Excel, or other formats

This gives analysts and developers a frictionless way to spin up custom scrapers + LLM data‑structuring pipelines—no manual parsing, no glue code.
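
The sketch below illustrates this four-step flow in plain Python. The helper names and the html2text converter are illustrative assumptions, not the repo's actual code; scraper.py and generate_response.py hold the real logic.

    import io

    import html2text
    import pandas as pd
    from selenium import webdriver


    def scrape_html(url: str) -> str:
        """Step 1: fetch the rendered page (Selenium shown; Crawl4AI is the alternative)."""
        driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()


    def html_to_markdown(html: str) -> str:
        """Step 2: convert raw HTML to Markdown for a cleaner LLM context."""
        converter = html2text.HTML2Text()
        converter.ignore_images = True
        return converter.handle(html)


    def build_prompt(markdown: str, query: str) -> str:
        """Step 3: combine the page Markdown with the user query; the selected
        LLM (Ollama or Sambanova) is instructed to answer with CSV rows only."""
        return f"{query}\n\nReturn the result as CSV rows only.\n\n{markdown}"


    def to_dataframe(csv_text: str) -> pd.DataFrame:
        """Step 4: parse the model's CSV output into a pandas DataFrame for export."""
        return pd.read_csv(io.StringIO(csv_text))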

✨ App Features

  • Provider‑agnostic LLM selection:
    Ollama (runs locally)
    Sambanova (cloud API)
  • Dual‐mode scraping:
    Selenium: full browser automation (JS‑heavy sites)
    Crawl4AI: lightweight, headless crawling
  • Markdown intermediate: cleans up HTML for more reliable LLM prompts
  • CSV output from LLM: specify structure via natural‑language prompt
  • DataFrame inspection: preview results in‑app
  • Multi‑format download: CSV, JSON, Markdown, HTML, TXT, etc.
  • Environment‑based config: keys & options live in a .env
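
As a small sketch of the environment-based config (assuming python-dotenv is installed; API_KEY is the variable name this project's .env uses):

    import os

    from dotenv import load_dotenv

    load_dotenv()                              # reads the project's .env file
    SAMBANOVA_API_KEY = os.getenv("API_KEY")   # key name from the .env described below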

🚀 App Architecture

AI Scraper architecture

📁 Project Structure

The project is organized as follows:

├── .env                         # Environment variables (holds the Sambanova API key as API_KEY)
├── .gitignore                   # Files/folders to be ignored by git
├── README.md                    # Project documentation (you're reading it!)
├── app.py                       # Main Streamlit UI
├── assets.py                    # Helper or static data (HEADLESS OPTIONS and USER AGENTS)
├── generate_response.py         # Handles LLM interaction
├── scraper.py                   # Selenium and Crawl4AI scraping logic
├── urls_&_queries.txt           # Sample URLs and queries for scraping
├── chrome-linux64/              # Chrome WebDriver for Linux (required for Selenium)
│   └── chrome                   # Chrome driver binary
└── imgs/                        # Screenshots and static image assets
    └── (app_screenshots.png)    # Placeholder for UI screenshots

⚙️ Installation

(It's recommended to use Linux to run the app.)

  1. Clone the repo

    git clone https://github.com/drisskhattabi6/AI-Scraper.git
    cd AI-Scraper
  2. Create & activate a Python virtual environment (optional)

    • For Linux or macOS
    python3 -m venv venv
    source venv/bin/activate 
    • For Windows
    python -m venv venv
    venv\Scripts\activate
  3. Install Python dependencies

    pip install -r requirements.txt
  4. Configure Sambanova (if using the API)
    Create a Sambanova account (new accounts get $5 of credits for free), then get your API key from https://cloud.sambanova.ai/apis (a minimal call sketch follows the installation steps).

    • In .env set:

      API_KEY="your_Sambanova_key_here"
  5. Setup Ollama (if using local LLM)

    • Install Ollama: for Linux

      curl -fsSL https://ollama.com/install.sh | sh
    • For Windows or macOS: https://ollama.com/download

    • Pull a model :

      ollama pull qwen2.5  # or your preferred LLM
  6. Chrome WebDriver

    • Linux:
      1. Download the matching ChromeDriver for your Chrome version.
      2. Unzip it into chrome-linux64/ so that chrome-linux64/chrome is executable.
    • Windows:
      1. Download chromedriver.exe.
      2. Place it in your project root or somewhere on your PATH.
  7. Crawl4AI

    pip install crawl4ai
    crawl4ai-setup
    crawl4ai-doctor
    playwright install  # run this if crawl4ai-doctor reports an error

    – Follow Crawl4AI installation instructions
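
To sanity-check the Crawl4AI install, a minimal crawl based on Crawl4AI's quickstart (the URL is just an example) looks like this:

    import asyncio

    from crawl4ai import AsyncWebCrawler


    async def main():
        # Headless crawl; the result comes back already converted to Markdown
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com")
            print(result.markdown)

    asyncio.run(main())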

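Once a provider is configured (steps 4 and 5), a minimal structuring call can be sketched as below. Assumptions: the ollama Python package is installed for the local path, Sambanova is reached through its OpenAI-compatible API at https://api.sambanova.ai/v1, and the model names are placeholders; generate_response.py holds the repo's actual implementation.

    import os

    import ollama                      # local provider client (pip install ollama)
    from dotenv import load_dotenv
    from openai import OpenAI          # used here for the OpenAI-compatible Sambanova API

    load_dotenv()

    PROMPT = "Extract all product names and prices. Return CSV rows only.\n\n<page markdown goes here>"

    # Option A: local model served by Ollama
    local = ollama.chat(model="qwen2.5", messages=[{"role": "user", "content": PROMPT}])
    print(local["message"]["content"])

    # Option B: Sambanova cloud API (base URL and model name are assumptions)
    client = OpenAI(api_key=os.getenv("API_KEY"), base_url="https://api.sambanova.ai/v1")
    remote = client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(remote.choices[0].message.content)
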
▶️ Usage

  1. Start the app

    streamlit run app.py
  2. In your browser:

    • Select LLM provider (Ollama or Sambanova)
    • Paste target URL and type your scraping query
    • Choose the scraping method (Selenium or Crawl4AI); Selenium is recommended on Windows
    • Click Start Scraping
  3. Inspect results:

    • View DataFrame preview
    • Download as CSV / JSON / HTML / TXT / Markdown
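
Inside app.py, the preview and download step can be sketched with standard Streamlit calls; the DataFrame below is a stand-in for the one parsed from the LLM's CSV output, and the file names are arbitrary:

    import pandas as pd
    import streamlit as st

    # Stand-in for the DataFrame parsed from the LLM's CSV output
    df = pd.DataFrame({"name": ["item A", "item B"], "price": [9.99, 4.50]})

    st.dataframe(df)  # in-app preview

    st.download_button("Download CSV", df.to_csv(index=False).encode("utf-8"),
                       file_name="scraped.csv", mime="text/csv")
    st.download_button("Download JSON", df.to_json(orient="records", indent=2),
                       file_name="scraped.json", mime="application/json")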

Screenshots

The full app

Sidebar settings

Download options

Example 1: Scraping pip libraries from PIP Libraries

Example 2: Scraping ChatGPT prompts for YouTubers from chat-gpt-prompts-for-youtubers

Example 3: Scraping avatar products

Example 4: Scraping news from ycombinator

Example 5: Scraping avatar products

Example 6: Scraping books from Quotes

Troubleshooting

  1. It's recommended to use Linux to run the app (although Selenium scraping can raise an error on Linux).
  2. If you're using Windows, use the Selenium WebDriver for scraping (Crawl4AI is not supported on Windows).
  3. In short: prefer Crawl4AI on Linux and Selenium on Windows.
  4. On Windows, the Chrome or Brave browser works best.
  5. Some websites cannot be scraped with either Selenium or Crawl4AI (anti-scraping protections).
  6. If you want to use another WebDriver, edit the Selenium init function in scraper.py (see the sketch after this list).
  7. If you don't have the Chrome browser, you can install it from https://www.google.com/chrome/
  8. If you don't have ChromeDriver, you can install it from https://sites.google.com/chromium.org/driver/
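
For item 6, switching drivers mostly means changing how the driver object is constructed. The sketch below shows the usual Selenium 4 pattern; the paths and the Firefox alternative are examples, not the repo's exact scraper.py code:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless=new")                 # run without a visible window
    # options.binary_location = "chrome-linux64/chrome"    # point at a bundled Chrome, if used

    driver = webdriver.Chrome(service=Service("/path/to/chromedriver"), options=options)

    # Example alternative: Firefox + geckodriver instead of Chrome
    # from selenium.webdriver.firefox.service import Service as FirefoxService
    # driver = webdriver.Firefox(service=FirefoxService("/path/to/geckodriver"))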

Contributing

We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.
