
AI Scraper

An interactive tool that lets users choose between a local (Ollama) or remote (Sambanova) LLM to scrape, process, and structure web data into tabular form for download. It uses Selenium or Crawl4AI for scraping and Streamlit for the UI.

Video Demo

AI Scraper Demo

🚀 App Description

AI Scraper is a Python‑based utility that automates the end‑to‑end flow of:

  1. Scraping
    – Choose between Selenium (browser‑driven) or Crawl4AI (headless crawler)
    – Input any URL and custom query (e.g. “Extract all product names and prices”)
  2. Content Extraction & Conversion
    – Fetch raw HTML
    – Convert to Markdown for cleaner context
  3. LLM‑Based Structuring
    – Send the Markdown + query to your selected provider (Ollama local model or Sambanova API)
    – Instruct the model to output CSV‑formatted rows
  4. DataFrame & Export
    – Parse the returned CSV into a pandas DataFrame
    – Allow users to download as CSV, JSON, Excel, or other formats

This gives analysts and developers a frictionless way to spin up custom scrapers + LLM data‑structuring pipelines—no manual parsing, no glue code.
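
The sketch below illustrates this four-step flow in plain Python. The helper names and the html2text converter are illustrative assumptions, not the repo's actual code; scraper.py and generate_response.py hold the real logic.

    import io

    import html2text
    import pandas as pd
    from selenium import webdriver


    def scrape_html(url: str) -> str:
        """Step 1: fetch the rendered page (Selenium shown; Crawl4AI is the alternative)."""
        driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()


    def html_to_markdown(html: str) -> str:
        """Step 2: convert raw HTML to Markdown for a cleaner LLM context."""
        converter = html2text.HTML2Text()
        converter.ignore_images = True
        return converter.handle(html)


    def build_prompt(markdown: str, query: str) -> str:
        """Step 3: combine the page Markdown with the user query; the selected
        LLM (Ollama or Sambanova) is instructed to answer with CSV rows only."""
        return f"{query}\n\nReturn the result as CSV rows only.\n\n{markdown}"


    def to_dataframe(csv_text: str) -> pd.DataFrame:
        """Step 4: parse the model's CSV output into a pandas DataFrame for export."""
        return pd.read_csv(io.StringIO(csv_text))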

✨ App Features

  • Provider‑agnostic LLM selection:
    Ollama (runs locally)
    Sambanova (cloud API)
  • Dual‐mode scraping:
    Selenium: full browser automation (JS‑heavy sites)
    Crawl4AI: lightweight, headless crawling
  • Markdown intermediate: cleans up HTML for more reliable LLM prompts
  • CSV output from LLM: specify structure via natural‑language prompt
  • DataFrame inspection: preview results in‑app
  • Multi‑format download: CSV, JSON, Markdown, HTML, TXT, etc.
  • Environment‑based config: keys & options live in a .env
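
As a small sketch of the environment-based config (assuming python-dotenv is installed; API_KEY is the variable name this project's .env uses):

    import os

    from dotenv import load_dotenv

    load_dotenv()                              # reads the project's .env file
    SAMBANOVA_API_KEY = os.getenv("API_KEY")   # key name from the .env described below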

🚀 App Architecture

AI Scraper architecture

📁 Project Structure

The project is organized as follows:

├── .env                         # Environment variables (holds the Sambanova API key as API_KEY)
├── .gitignore                   # Files/folders to be ignored by git
├── README.md                    # Project documentation (you're reading it!)
├── app.py                       # Main Streamlit UI
├── assets.py                    # Helper or static data (HEADLESS OPTIONS and USER AGENTS)
├── generate_response.py         # Handles LLM interaction
├── scraper.py                   # Selenium and Crawl4AI scraping logic
├── urls_&_queries.txt           # Sample URLs and queries for scraping
├── chrome-linux64/              # Chrome WebDriver for Linux (required for Selenium)
│   └── chrome                   # Chrome driver binary
└── imgs/                        # Screenshots and static image assets
    └── (app_screenshots.png)    # Placeholder for UI screenshots

⚙️ Installation

(It's recommended to use Linux to run the app.)

  1. Clone the repo

    git clone https://github.com/drisskhattabi6/AI-Scraper.git
    cd AI-Scraper
  2. Create & activate a Python virtual environment (optional)

    • For Linux or macOS
    python3 -m venv venv
    source venv/bin/activate 
    • For Windows
    python -m venv venv
    venv\Scripts\activate
  3. Install Python dependencies

    pip install -r requirements.txt
  4. Configure Sambanova (if using the API)
    Create a Sambanova account (new accounts get $5 of credits for free), then get your API key from https://cloud.sambanova.ai/apis (a minimal call sketch follows the installation steps).

    • In .env set:

      API_KEY="your_Sambanova_key_here"
  5. Setup Ollama (if using local LLM)

    • Install Ollama: for Linux

      curl -fsSL https://ollama.com/install.sh | sh
    • For Windows or macOS: https://ollama.com/download

    • Pull a model :

      ollama pull qwen2.5  # or your preferred LLM
  6. Chrome WebDriver

    • Linux:
      1. Download the matching ChromeDriver for your Chrome version.
      2. Unzip it into chrome-linux64/ so that chrome-linux64/chrome is executable.
    • Windows:
      1. Download chromedriver.exe.
      2. Place it in your project root or somewhere on your PATH.
  7. Crawl4AI

    pip install crawl4ai
    crawl4ai-setup
    crawl4ai-doctor
    playwright install  # run this if crawl4ai-doctor reports an error

    – Follow Crawl4AI installation instructions
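
To sanity-check the Crawl4AI install, a minimal crawl based on Crawl4AI's quickstart (the URL is just an example) looks like this:

    import asyncio

    from crawl4ai import AsyncWebCrawler


    async def main():
        # Headless crawl; the result comes back already converted to Markdown
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com")
            print(result.markdown)

    asyncio.run(main())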

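Once a provider is configured (steps 4 and 5), a minimal structuring call can be sketched as below. Assumptions: the ollama Python package is installed for the local path, Sambanova is reached through its OpenAI-compatible API at https://api.sambanova.ai/v1, and the model names are placeholders; generate_response.py holds the repo's actual implementation.

    import os

    import ollama                      # local provider client (pip install ollama)
    from dotenv import load_dotenv
    from openai import OpenAI          # used here for the OpenAI-compatible Sambanova API

    load_dotenv()

    PROMPT = "Extract all product names and prices. Return CSV rows only.\n\n<page markdown goes here>"

    # Option A: local model served by Ollama
    local = ollama.chat(model="qwen2.5", messages=[{"role": "user", "content": PROMPT}])
    print(local["message"]["content"])

    # Option B: Sambanova cloud API (base URL and model name are assumptions)
    client = OpenAI(api_key=os.getenv("API_KEY"), base_url="https://api.sambanova.ai/v1")
    remote = client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(remote.choices[0].message.content)
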
▶️ Usage

  1. Start the app

    streamlit run app.py
  2. In your browser:

    • Select LLM provider (Ollama or Sambanova)
    • Paste target URL and type your scraping query
    • Choose the scraping method (Selenium or Crawl4AI); Selenium is recommended on Windows
    • Click Start Scraping
  3. Inspect results:

    • View DataFrame preview
    • Download as CSV / JSON / HTML / TXT / Markdown
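
Inside app.py, the preview and download step can be sketched with standard Streamlit calls; the DataFrame below is a stand-in for the one parsed from the LLM's CSV output, and the file names are arbitrary:

    import pandas as pd
    import streamlit as st

    # Stand-in for the DataFrame parsed from the LLM's CSV output
    df = pd.DataFrame({"name": ["item A", "item B"], "price": [9.99, 4.50]})

    st.dataframe(df)  # in-app preview

    st.download_button("Download CSV", df.to_csv(index=False).encode("utf-8"),
                       file_name="scraped.csv", mime="text/csv")
    st.download_button("Download JSON", df.to_json(orient="records", indent=2),
                       file_name="scraped.json", mime="application/json")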

Screenshots

The full app

Sidebar settings

Download options

Example 1: Scraping pip libraries from PIP Libraries

Example 2: Scraping ChatGPT prompts for YouTubers from chat-gpt-prompts-for-youtubers

Example 3: Scraping avatar products

Example 4: Scraping news from ycombinator

Example 5: Scraping avatar products

Example 6: Scraping books from Quotes

Troubleshooting

  1. It's recommended to use Linux to run the app (although Selenium scraping can raise an error on Linux).
  2. If you're using Windows, use the Selenium WebDriver for scraping (Crawl4AI is not supported on Windows).
  3. In short: prefer Crawl4AI on Linux and Selenium on Windows.
  4. On Windows, the Chrome or Brave browser works best.
  5. Some websites cannot be scraped with either Selenium or Crawl4AI (anti-scraping protections).
  6. If you want to use another WebDriver, edit the Selenium init function in scraper.py (see the sketch after this list).
  7. If you don't have the Chrome browser, you can install it from https://www.google.com/chrome/
  8. If you don't have ChromeDriver, you can install it from https://sites.google.com/chromium.org/driver/
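
For item 6, switching drivers mostly means changing how the driver object is constructed. The sketch below shows the usual Selenium 4 pattern; the paths and the Firefox alternative are examples, not the repo's exact scraper.py code:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless=new")                 # run without a visible window
    # options.binary_location = "chrome-linux64/chrome"    # point at a bundled Chrome, if used

    driver = webdriver.Chrome(service=Service("/path/to/chromedriver"), options=options)

    # Example alternative: Firefox + geckodriver instead of Chrome
    # from selenium.webdriver.firefox.service import Service as FirefoxService
    # driver = webdriver.Firefox(service=FirefoxService("/path/to/geckodriver"))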

Contributing

We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.
