An interactive tool that lets users choose between local (Ollama) or remote (Sambanova) LLMs to scrape, process, and structure web data into tabular form for download. It uses Selenium or Crawl4AI for scraping and Streamlit for the UI.
AI Scraper is a Python-based utility that automates the end-to-end flow of:

- Scraping
  - Choose between Selenium (browser-driven) or Crawl4AI (headless crawler)
  - Input any URL and a custom query (e.g. "Extract all product names and prices")
- Content Extraction & Conversion
  - Fetch raw HTML
  - Convert it to Markdown for cleaner context
- LLM-Based Structuring
  - Send the Markdown + query to your selected provider (Ollama local model or Sambanova API)
  - Instruct the model to output CSV-formatted rows
- DataFrame & Export
  - Parse the returned CSV into a pandas DataFrame
  - Allow users to download as CSV, JSON, Excel, or other formats
This gives analysts and developers a frictionless way to spin up custom scrapers + LLM data‑structuring pipelines—no manual parsing, no glue code.
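The flow above can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual code: the Markdown conversion is a naive tag-stripper, the LLM call is stubbed, and the function names (`html_to_markdown`, `ask_llm`, `structure_page`) are hypothetical.

```python
import io
import re

import pandas as pd

def html_to_markdown(html: str) -> str:
    """Crude HTML -> text step; the app uses a real Markdown converter."""
    return re.sub(r"<[^>]+>", " ", html).strip()

def ask_llm(markdown: str, query: str) -> str:
    """Stubbed LLM call; the app sends this to Ollama or Sambanova."""
    # A real call would return CSV rows answering the user's query.
    return "name,price\nWidget,9.99\nGadget,19.99"

def structure_page(html: str, query: str) -> pd.DataFrame:
    markdown = html_to_markdown(html)
    csv_text = ask_llm(markdown, query)
    return pd.read_csv(io.StringIO(csv_text))

df = structure_page("<ul><li>Widget $9.99</li><li>Gadget $19.99</li></ul>",
                    "Extract all product names and prices")
print(df.shape)  # (2, 2)
```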
- Provider-agnostic LLM selection:
  - Ollama (runs locally)
  - Sambanova (cloud API)
- Dual-mode scraping:
  - Selenium: full browser automation (for JS-heavy sites)
  - Crawl4AI: lightweight, headless crawling
- Markdown intermediate: cleans up HTML for more reliable LLM prompts
- CSV output from LLM: specify the structure via a natural-language prompt
- DataFrame inspection: preview results in-app
- Multi-format download: CSV, JSON, Markdown, HTML, TXT, etc.
- Environment-based config: keys & options live in a `.env` file
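Environment-based config can be as simple as the sketch below. This is only an illustration of the idea: the app may well use `python-dotenv` or `os.environ` directly, and `load_env` is a hypothetical helper.

```python
def load_env(path: str = ".env") -> dict:
    """Minimal .env parser (a real app might use python-dotenv instead)."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

# Example: write a sample .env-style file and read it back
with open(".env.example", "w") as f:
    f.write('# Sambanova credentials\nAPI_KEY="your_Sambanova_key_here"\n')
cfg = load_env(".env.example")
print(cfg["API_KEY"])  # your_Sambanova_key_here
```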
The project is organized as follows:

```
├── .env                  # Environment variables (contains the Sambanova API key as API_KEY)
├── .gitignore            # Files/folders ignored by git
├── README.md             # Project documentation (you're reading it!)
├── app.py                # Main Streamlit UI
├── assets.py             # Helper/static data (headless options and user agents)
├── generate_response.py  # Handles LLM interaction
├── scraper.py            # Selenium and Crawl4AI scraping logic
├── urls_&_queries.txt    # Sample URLs and queries for scraping
├── chrome-linux64/       # Chrome WebDriver for Linux (required for Selenium)
│   └── chrome            # The Chrome driver binary
└── imgs/                 # Screenshots and static image assets
    └── app_screenshots.png  # Placeholder for UI screenshots
```
(It's recommended to use Linux to run the app.)

- Clone the repo

  ```bash
  git clone https://github.com/drisskhattabi6/AI-Scraper.git
  cd AI-Scraper
  ```

- Create & activate a Python environment (optional)

  For Linux or macOS:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

  For Windows:

  ```bash
  python -m venv venv
  venv\Scripts\activate
  ```

- Install Python dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Configure Sambanova (if using the API)

  Create a Sambanova account to get $5 of free credits, then get your API key from https://cloud.sambanova.ai/apis. In `.env`, set:

  ```
  API_KEY="your_Sambanova_key_here"
  ```
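Calling Sambanova then amounts to a standard OpenAI-compatible chat-completion request. The sketch below only builds the payload (sending it requires a valid key); the model name is just an example, so check Sambanova's docs for currently available models, and note `generate_response.py` may structure this differently.

```python
SAMBANOVA_URL = "https://api.sambanova.ai/v1/chat/completions"  # OpenAI-compatible endpoint

def build_request(markdown: str, query: str,
                  model: str = "Meta-Llama-3.1-8B-Instruct") -> dict:
    """Chat-completion payload asking the model for CSV-only output."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Return only CSV rows answering the user's query."},
            {"role": "user", "content": f"{query}\n\n{markdown}"},
        ],
    }

payload = build_request("# Page\nWidget $9.99", "Extract product names and prices")
# Sending it would look roughly like:
#   import json, os, urllib.request
#   req = urllib.request.Request(
#       SAMBANOVA_URL, data=json.dumps(payload).encode(),
#       headers={"Authorization": f"Bearer {os.environ['API_KEY']}",
#                "Content-Type": "application/json"})
#   body = json.loads(urllib.request.urlopen(req).read())
```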
- Set up Ollama (if using a local LLM)

  Install Ollama on Linux:

  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ```

  For Windows or macOS, download it from https://ollama.com/download

  Pull a model:

  ```bash
  ollama pull qwen2.5  # or your preferred LLM
  ```
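Once `ollama serve` is running, the app can reach it over Ollama's local REST API. Here's a minimal sketch using only the standard library; the actual code in `generate_response.py` may use the `ollama` Python package or a different prompt format.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "qwen2.5") -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "qwen2.5") -> str:
    """Send one prompt to the local Ollama server and return its text reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_payload("Extract product names and prices as CSV:\n\n# Page ...")
# ask_ollama(...) requires `ollama serve` to be running locally.
```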
- Chrome WebDriver

  - Linux:
    - Download the ChromeDriver matching your Chrome version.
    - Unzip it into `chrome-linux/` so that `chrome-linux/chrome` is executable.
  - Windows:
    - Download `chromedriver.exe`.
    - Place it in the project root or somewhere on your `PATH`.
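For reference, here is roughly how the driver setup could resolve the binary and headless flags per platform. This is a hypothetical sketch (`chrome_launch_config` is not a function from `scraper.py`, whose init logic may differ); the Selenium wiring is shown only in comments.

```python
import os
import sys

def chrome_launch_config(project_root: str = ".") -> tuple:
    """Pick a Chrome binary path and common headless flags per platform.
    The Linux path mirrors this repo's layout; adjust if yours differs."""
    if sys.platform.startswith("linux"):
        binary = os.path.join(project_root, "chrome-linux64", "chrome")
    else:
        # On Windows, rely on the installed Chrome plus chromedriver.exe on PATH.
        binary = None
    flags = ["--headless=new", "--no-sandbox", "--disable-gpu"]
    return binary, flags

binary, flags = chrome_launch_config()
# Feeding this into Selenium would look roughly like:
#   from selenium import webdriver
#   opts = webdriver.ChromeOptions()
#   if binary:
#       opts.binary_location = binary
#   for f in flags:
#       opts.add_argument(f)
#   driver = webdriver.Chrome(options=opts)
```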
- Crawl4AI

  ```bash
  pip install crawl4ai
  crawl4ai-setup
  crawl4ai-doctor      # diagnose setup issues
  playwright install   # run this if setup showed an error
  ```

  Follow the Crawl4AI installation instructions if you run into trouble.
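A minimal Crawl4AI fetch, per its documented `AsyncWebCrawler` API, looks like the sketch below; `scraper.py` may add browser or run configuration on top of this. The import is deferred so the snippet loads even before `pip install crawl4ai`.

```python
import asyncio

async def fetch_markdown(url: str) -> str:
    """Fetch a page and return Crawl4AI's Markdown rendering of it."""
    from crawl4ai import AsyncWebCrawler  # deferred: needs `pip install crawl4ai`
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

# Usage (requires crawl4ai installed and network access):
#   md = asyncio.run(fetch_markdown("https://news.ycombinator.com"))
is_coro = asyncio.iscoroutinefunction(fetch_markdown)
```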
- Start the app

  ```bash
  streamlit run app.py
  ```
- In your browser:
  - Select the LLM provider (Ollama or Sambanova)
  - Paste the target URL and type your scraping query
  - Choose the scraping method (Selenium or Crawl4AI; Selenium is recommended on Windows)
  - Click Start Scraping
- Inspect the results:
  - View the DataFrame preview
  - Download as CSV / JSON / HTML / TXT / Markdown
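Under the hood, the multi-format download can be produced directly from the pandas DataFrame. A sketch of how that might look (the app's actual export code may differ):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Widget", "Gadget"], "price": [9.99, 19.99]})

csv_text = df.to_csv(index=False)          # CSV
json_text = df.to_json(orient="records")   # JSON (list of row objects)
html_text = df.to_html(index=False)        # HTML table
# Markdown needs the `tabulate` package:  df.to_markdown(index=False)
# Excel needs an engine such as openpyxl: df.to_excel("data.xlsx", index=False)

print(csv_text.splitlines()[0])  # name,price
```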
- Example 1: Scraping pip libraries
- Example 2: Scraping ChatGPT prompts for YouTubers
- Example 3: Scraping news from ycombinator (Hacker News)
- Example 4: Scraping books
- It's recommended to use Linux to run the app (though there is a known error when scraping with Selenium on Linux).
- If you're using Windows, use the Selenium WebDriver for scraping (Crawl4AI is not supported on Windows).
- In short: prefer Crawl4AI on Linux and Selenium on Windows.
- On Windows, it's better to use the Chrome or Brave browser.
- Some websites may not be scrapable with either Selenium or Crawl4AI (anti-scraping protections).
- To use another WebDriver, edit the init function in `scraper.py` (Selenium).
- If you don't have the Chrome browser, you can install it from https://www.google.com/chrome/
- If you don't have the ChromeDriver, you can install it from https://sites.google.com/chromium.org/driver/
We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.