Description
Self Checks
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-English title submissions will be closed directly (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
Is your feature request related to a problem?
Yes. Currently, RAGFlow excels at processing uploaded files (PDF, DOCX, etc.), but it lacks the ability to ingest data directly from web URLs.
Many knowledge bases rely on dynamic online documentation (e.g., GitBook, Wiki, Official Docs, Notion pages). To use RAGFlow with these sources now, I have to manually save webpages as PDFs or Markdown files and then upload them. This process is:
1. **Inefficient:** Time-consuming for large documentation sites.
2. **Hard to maintain:** If the website updates, the RAGFlow knowledge base becomes stale immediately, requiring a manual re-upload.
Describe the feature you'd like
I would like to see a "Web Crawler" or "URL Import" option when creating or updating a Knowledge Base.
Core Requirements:
- Input Source: Allow inputting a specific URL or a Sitemap.xml.
- Scraping Strategy:
- Single Page: Fetch and parse the content of a single link.
- Recursive/Site-wide: Ability to crawl sub-pages (with a depth limit setting) or follow a sitemap to ingest an entire documentation site.
- Content Parsing:
- Automatically extract the main content (Article) and remove noise (Navbars, Footers, Ads).
- Convert the HTML to Markdown or Text to feed into the existing DeepDoc/Chunking pipeline (a rough sketch follows this list).
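To make the requirements above concrete, here is a rough, non-authoritative sketch of what the ingestion step could look like. The library choices (requests, BeautifulSoup, markdownify) and the two-function split are illustrative assumptions, not a proposal for how RAGFlow must implement it:

```python
# Minimal sketch of the requested pipeline, NOT RAGFlow code: fetch a page,
# strip obvious noise, return Markdown-ish text plus same-site links, and
# crawl with a depth limit. Library choices are illustrative assumptions.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify  # example choice for HTML -> Markdown


def fetch_page(url: str) -> tuple[str, list[str]]:
    """Return (markdown_text, same-domain links) for one URL."""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop obvious non-content noise before extraction.
    for tag in soup(["nav", "footer", "header", "aside", "script", "style"]):
        tag.decompose()

    main = soup.find("article") or soup.find("main") or soup.body
    markdown = markdownify(str(main))

    base_domain = urlparse(url).netloc
    links = [
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if urlparse(urljoin(url, a["href"])).netloc == base_domain
    ]
    return markdown, links


def crawl(start_url: str, max_depth: int = 2) -> dict[str, str]:
    """Breadth-first crawl limited by depth; returns {url: markdown}."""
    seen, results = set(), {}
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        markdown, links = fetch_page(url)
        results[url] = markdown  # would be handed to the DeepDoc/chunking step
        frontier.extend((link, depth + 1) for link in links)
    return results
```

The important part is the output shape: cleaned Markdown per URL that can be fed straight into the existing chunking pipeline, with the crawl depth and domain scope as user-facing settings.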
Describe implementation you've considered
Describe alternatives you've considered
- Manual Workaround: Using browser extensions (e.g., "Save to Notion" or "Print to PDF") to save pages one by one and uploading them to RAGFlow.
- External Scripts: Writing a Python script using BeautifulSoup or Selenium to scrape data locally, save it as .txt/.md, and then upload it. This disconnects the data from the source and makes synchronization difficult (a rough sketch of such a script follows).
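For context, the throwaway script this workaround typically involves looks roughly like the following; the URLs, file layout, and libraries are placeholders, and the upload to RAGFlow still happens by hand afterwards:

```python
# Rough illustration of the current manual workaround (not RAGFlow code):
# scrape a handful of URLs locally and write .md files that then have to be
# uploaded to RAGFlow manually. requests/BeautifulSoup are just example tools.
from pathlib import Path

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://docs.example.com/getting-started",  # placeholder URLs
    "https://docs.example.com/api-reference",
]

out_dir = Path("scraped_docs")
out_dir.mkdir(exist_ok=True)

for url in URLS:
    soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
    text = soup.get_text("\n", strip=True)
    # File name derived from the URL path; the upload step is still manual.
    name = url.rstrip("/").split("/")[-1] or "index"
    (out_dir / f"{name}.md").write_text(text, encoding="utf-8")
```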
Documentation, adoption, use case
Use Cases:
This feature is critical for the following scenarios:
Technical Documentation Q&A:
Most developer documentation (e.g., Python docs, Stripe API, internal Wikis) exists as static websites. Users want to point RAGFlow to https://docs.example.com and immediately start querying.
Policy & Compliance Bots:
Government or Corporate regulations are often published on public web portals. Updates happen frequently, and manual re-uploading is prone to error.
Competitor Analysis:
Marketing teams need to ingest competitor pricing pages or product descriptions directly from the web to perform comparative analysis using the LLM.
Adoption Impact:
Implementing this feature will significantly drive adoption for RAGFlow:
Competitive Parity: Many popular RAG frameworks (like Dify, FastGPT, or EmbedAI) already support URL scraping. Adding this removes a major reason for users to choose a competitor.
Lower Time-to-Value: New users can test RAGFlow's capabilities instantly by pasting a URL, rather than gathering and cleaning local datasets.
Automation Friendly: It enables "Set and Forget" workflows where the knowledge base stays synchronized with the live website (if scheduled updates are added later).
Additional information
Integration with existing open-source scraping tools could speed up implementation. For example:
- Firecrawl (Excellent for turning websites into LLM-ready Markdown).
- Jina Reader API (lightweight URL-to-text conversion; see the sketch below).
- Scrapy / Playwright (For headless browsing).
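As a rough illustration of how lightweight the Jina Reader option is, a minimal sketch could be as simple as prefixing the target URL with the public r.jina.ai reader endpoint (treat the exact URL scheme as an assumption to verify against current Jina documentation):

```python
# Sketch of how small the integration could be if a reader-style service were
# used under the hood. The r.jina.ai prefix endpoint reflects how Jina Reader
# is publicly documented; the exact scheme should be verified before use.
import requests


def url_to_text(url: str) -> str:
    """Fetch an LLM-ready plain-text/Markdown rendering of a web page."""
    return requests.get(f"https://r.jina.ai/{url}", timeout=30).text


print(url_to_text("https://docs.example.com/getting-started")[:500])
```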