Description
Self Checks
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-English title submissions will be closed directly (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
Is your feature request related to a problem?
Yes. Currently, RAGFlow excels at processing uploaded files (PDF, DOCX, etc.), but it lacks the ability to ingest data directly from web URLs.
Many knowledge bases rely on dynamic online documentation (e.g., GitBook, Wiki, Official Docs, Notion pages). To use RAGFlow with these sources now, I have to manually save webpages as PDFs or Markdown files and then upload them. This process is:
1. **Inefficient:** Time-consuming for large documentation sites.
2. **Hard to maintain:** If the website updates, the RAGFlow knowledge base becomes stale immediately, requiring a manual re-upload.
Describe the feature you'd like
I would like to see a "Web Crawler" or "URL Import" option when creating or updating a Knowledge Base.
Core Requirements:
- Input Source: Allow inputting a specific URL or a Sitemap.xml.
- Scraping Strategy:
- Single Page: Fetch and parse the content of a single link.
- Recursive/Site-wide: Ability to crawl sub-pages (with a depth limit setting) or follow a sitemap to ingest an entire documentation site.
- Content Parsing:
- Automatically extract the main content (Article) and remove noise (Navbars, Footers, Ads).
- Convert the HTML to Markdown or Text to feed into the existing DeepDoc/Chunking pipeline (a rough sketch follows this list).
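To make the requirements above concrete, here is a rough, non-authoritative sketch of what the ingestion step could look like. The library choices (requests, BeautifulSoup, markdownify) and the two-function split are illustrative assumptions, not a proposal for how RAGFlow must implement it:

```python
# Minimal sketch of the requested pipeline, NOT RAGFlow code: fetch a page,
# strip obvious noise, return Markdown-ish text plus same-site links, and
# crawl with a depth limit. Library choices are illustrative assumptions.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify  # example choice for HTML -> Markdown


def fetch_page(url: str) -> tuple[str, list[str]]:
    """Return (markdown_text, same-domain links) for one URL."""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop obvious non-content noise before extraction.
    for tag in soup(["nav", "footer", "header", "aside", "script", "style"]):
        tag.decompose()

    main = soup.find("article") or soup.find("main") or soup.body
    markdown = markdownify(str(main))

    base_domain = urlparse(url).netloc
    links = [
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if urlparse(urljoin(url, a["href"])).netloc == base_domain
    ]
    return markdown, links


def crawl(start_url: str, max_depth: int = 2) -> dict[str, str]:
    """Breadth-first crawl limited by depth; returns {url: markdown}."""
    seen, results = set(), {}
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        markdown, links = fetch_page(url)
        results[url] = markdown  # would be handed to the DeepDoc/chunking step
        frontier.extend((link, depth + 1) for link in links)
    return results
```

The important part is the output shape: cleaned Markdown per URL that can be fed straight into the existing chunking pipeline, with the crawl depth and domain scope as user-facing settings.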
Describe implementation you've considered
Describe alternatives you've considered
- Manual Workaround: Using browser extensions (e.g., "Save to Notion" or "Print to PDF") to save pages one by one and uploading them to RAGFlow.
- External Scripts: Writing a Python script using BeautifulSoup or Selenium to scrape data locally, save it as .txt/.md, and then upload it. This disconnects the data from the source and makes synchronization difficult (a rough sketch of such a script follows).
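For context, the throwaway script this workaround typically involves looks roughly like the following; the URLs, file layout, and libraries are placeholders, and the upload to RAGFlow still happens by hand afterwards:

```python
# Rough illustration of the current manual workaround (not RAGFlow code):
# scrape a handful of URLs locally and write .md files that then have to be
# uploaded to RAGFlow manually. requests/BeautifulSoup are just example tools.
from pathlib import Path

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://docs.example.com/getting-started",  # placeholder URLs
    "https://docs.example.com/api-reference",
]

out_dir = Path("scraped_docs")
out_dir.mkdir(exist_ok=True)

for url in URLS:
    soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
    text = soup.get_text("\n", strip=True)
    # File name derived from the URL path; the upload step is still manual.
    name = url.rstrip("/").split("/")[-1] or "index"
    (out_dir / f"{name}.md").write_text(text, encoding="utf-8")
```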
Documentation, adoption, use case
Use Cases:
This feature is critical for the following scenarios:
Technical Documentation Q&A:
Most developer documentation (e.g., Python docs, Stripe API, internal Wikis) exists as static websites. Users want to point RAGFlow to https://docs.example.com and immediately start querying.
Policy & Compliance Bots:
Government or Corporate regulations are often published on public web portals. Updates happen frequently, and manual re-uploading is prone to error.
Competitor Analysis:
Marketing teams need to ingest competitor pricing pages or product descriptions directly from the web to perform comparative analysis using the LLM.
Adoption Impact:
Implementing this feature will significantly drive adoption for RAGFlow:
Competitive Parity: Many popular RAG frameworks (like Dify, FastGPT, or EmbedAI) already support URL scraping. Adding this removes a major reason for users to choose a competitor.
Lower Time-to-Value: New users can test RAGFlow's capabilities instantly by pasting a URL, rather than gathering and cleaning local datasets.
Automation Friendly: It enables "Set and Forget" workflows where the knowledge base stays synchronized with the live website (if scheduled updates are added later).
Additional information
Integration with existing open-source scraping tools could speed up implementation. For example:
- Firecrawl (Excellent for turning websites into LLM-ready Markdown).
- Jina Reader API (lightweight URL-to-text conversion; see the sketch below).
- Scrapy / Playwright (For headless browsing).
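As a rough illustration of how lightweight the Jina Reader option is, a minimal sketch could be as simple as prefixing the target URL with the public r.jina.ai reader endpoint (treat the exact URL scheme as an assumption to verify against current Jina documentation):

```python
# Sketch of how small the integration could be if a reader-style service were
# used under the hood. The r.jina.ai prefix endpoint reflects how Jina Reader
# is publicly documented; the exact scheme should be verified before use.
import requests


def url_to_text(url: str) -> str:
    """Fetch an LLM-ready plain-text/Markdown rendering of a web page."""
    return requests.get(f"https://r.jina.ai/{url}", timeout=30).text


print(url_to_text("https://docs.example.com/getting-started")[:500])
```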