Diffbot

Technology, Information and Internet

Menlo Park, California 5,359 followers

We structure the world's knowledge.

About us

We Structure the World's Knowledge. Diffbot is a world-class group of AI engineers building a universal database of structured information, to provide knowledge as a service to all intelligent applications. Whether you are building an app that uses web content, an enterprise business application, or a smart robotic assistant, we've got you covered. Thousands of leading companies rely on Diffbot data for their enterprise and consumer applications.

Website: https://www.diffbot.com/
Industry: Technology, Information and Internet
Company size: 11-50 employees
Headquarters: Menlo Park, California
Type: Privately Held
Founded: 2011
Specialties: machine learning, relation extraction, truth discovery, knowledge fusion, computer vision, web scraping, data extraction, information retrieval, artificial intelligence, and ecommerce

Updates

  • Diffbot reposted this

    2025 was the year LLMs went from being good at simple chat interactions (ChatGPT) to being good at applications that involve calling tools and reading documents (coding with Claude Code).

    Yet thus far, there hasn't really been any good local alternative to the frontier models for terminal coding or web research applications. Why is that?

    Well, the shape of the data in these two applications couldn't be more different. A chat looks like short text snippets that ping-pong back and forth, while agentic tool use reads in large text blocks, potentially with multiple turns interleaving human messages. If you've played at all with LLMs on consumer hardware, you'll notice that chit-chatting with local LLMs works fine. But as soon as you enable tool use or try to use your local LLM as a backend to Claude Code, your session grinds to a halt right after the first tool call is returned, no matter how much VRAM you have.

    The reason? Standard transformer dot-product attention is fundamentally quadratic in the sequence length. What's interesting is that this situation has changed with the last wave of open-source models released by the major players. NVIDIA (Nemotron 3), Qwen (qwen3-coder-next/qwen3.5), and GLM (4.7-flash/5) all introduced some form of hybrid, sub-quadratic attention mechanism in their latest models. Even DeepSeek, the first breakout open-source model, updated its architecture with its own DeepSeek Sparse Attention (DSA) in v3.2.

    What these next-gen models have in common is that they replace the standard quadratic dot-product attention with "linear" variants of attention. (In practice they aren't fully linear but stack full and linear layers in some ratio, hence "hybrid".) For example, Qwen 3.5, probably the leading small open-source model right now, uses a linear attention variant called GatedDeltaNet.

    Instead of a quadratic dot-product of matrices, this is a simple for-loop through the token sequence (see code in first image) that carries forward a fixed-size state, last_recurrent_state, which is decayed by g and updated with the new data by beta. Think of full attention as taking an open-book exam with all of the pages of the book laid out on your desk so that you can access any of them at random, and of the linear recurrent state as a notecard you keep with you that you write on and erase from as you read the pages. Full attention works, but you need a really large desk in order to answer questions that way! Nevertheless, that is essentially how DeepSeek V3, Kimi2.5, and Llama 4 work, and there's a clear limit to how far that approach goes.

    Check out our latest model, which fine-tunes qwen3-coder-next 80B A3B for GraphRAG. It can fit on a single consumer GPU or on your MacBook with Q4 GGUFs. I've been using it as the backend model for our LLM demo (try it at https://diffy.chat) and as a fully local alternative to Claude Code, which is increasingly down.

    Link to our model in the comments.

    • No alternative text description for this image
    • No alternative text description for this image
    • No alternative text description for this image
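
The recurrent loop described in the post above can be sketched roughly as follows. This is a simplified, single-head illustration in plain Python, not the code from the post's image and not Qwen's implementation. The function name `gated_delta_recurrence`, the argument layout, and the treatment of `g` as a log-decay (so the state is scaled by `exp(g)`) are assumptions made for the example; only `last_recurrent_state`, `g`, and `beta` are names taken from the post.

```python
import math

def gated_delta_recurrence(q, k, v, g, beta):
    """Hypothetical single-head sketch of a gated delta-rule recurrence.

    q, k: lists of T vectors of length d_k
    v:    list of T vectors of length d_v
    g:    list of T log-decay values (assumed <= 0)
    beta: list of T write strengths (assumed in [0, 1])
    """
    d_k, d_v = len(k[0]), len(v[0])
    # Fixed-size state: d_k x d_v, independent of sequence length.
    last_recurrent_state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, g_t, beta_t in zip(q, k, v, g, beta):
        # 1. Decay the old state ("erase" part of the notecard).
        decay = math.exp(g_t)
        last_recurrent_state = [[decay * s for s in row] for row in last_recurrent_state]
        # 2. Read what the state currently stores under key k_t.
        kv_mem = [sum(k_t[i] * last_recurrent_state[i][j] for i in range(d_k))
                  for j in range(d_v)]
        # 3. Delta rule: write only the difference from the new value, scaled by beta.
        delta = [beta_t * (v_t[j] - kv_mem[j]) for j in range(d_v)]
        for i in range(d_k):
            for j in range(d_v):
                last_recurrent_state[i][j] += k_t[i] * delta[j]
        # 4. Query the updated state to produce this token's output.
        outputs.append([sum(q_t[i] * last_recurrent_state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs, last_recurrent_state


# Two tokens that reuse the same key: the second write replaces the first.
outputs, state = gated_delta_recurrence(
    q=[[1.0, 0.0], [1.0, 0.0]],
    k=[[1.0, 0.0], [1.0, 0.0]],
    v=[[3.0, 4.0], [5.0, 6.0]],
    g=[0.0, 0.0],
    beta=[1.0, 1.0],
)
```

Each step does a fixed amount of work on a state whose size does not depend on how many tokens came before, so the cost of the loop grows linearly with sequence length rather than quadratically.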
  • Diffbot reposted this

    Meet Diffbot - the Menlo Park team that's been transforming the entire public web into the world's largest queryable knowledge graph since 2011. What makes them different:

    • Knowledge Graph at unprecedented scale - 10+ billion entities and 1 trillion facts extracted from 60+ billion web pages, rebuilt every 4-5 days
    • AI that reads like humans - Computer vision and ML that visually parses any page and extracts structured facts, no rules required
    • Facts, not flat data - Entities are linked across the web; "Diffbot" in MIT Tech Review connects to the same "Diffbot" on LinkedIn or their site
    • Complete web coverage - 246M+ organizations, 1.6B+ articles, 3M+ products, events, and discussions structured through their API

    Companies across finance, intelligence, news, and ML (DuckDuckGo, Snapchat, FactSet, Dow Jones, Sequoia Capital) use Diffbot when they need the web as a structured database, not a collection of HTML. Powering their operations, Massive provides the proxy infrastructure that enables Diffbot's continuous web-wide crawling at scale. We're proud to support platforms turning unstructured chaos into connected knowledge.

  • How does ANN search work at scale? At scale, semantic search can’t compare every vector to every other one. ANN uses a vector index to retrieve likely nearest candidates first, then applies exact similarity on those candidates. This gives major latency gains with typically minor recall/accuracy degradation.
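
The two-stage pattern in the ANN post above (use an index to pull likely candidates, then score only those candidates exactly) can be illustrated with a small sketch. It uses random-hyperplane hashing as the index, which is one simple ANN technique chosen here for brevity; the post does not say which index type it has in mind, and production systems typically use more sophisticated structures. The class name `HyperplaneIndex` and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class HyperplaneIndex:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the example reproducible
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)
        self.vectors = {}

    def _key(self, vec):
        # One sign bit per hyperplane forms the bucket key.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, item_id, vec):
        self.vectors[item_id] = vec
        self.buckets[self._key(vec)].append(item_id)

    def search(self, query, top_k=3):
        # Stage 1: candidate retrieval from the query's bucket only (approximate).
        candidates = self.buckets.get(self._key(query), [])
        # Stage 2: exact similarity, computed on the candidates only.
        scored = [(item_id, cosine(query, self.vectors[item_id])) for item_id in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = HyperplaneIndex(dim=3)
index.add("a", [1.0, 0.0, 0.0])
index.add("b", [0.0, 1.0, 0.0])
index.add("c", [0.9, 0.1, 0.0])
hits = index.search([1.0, 0.0, 0.0], top_k=1)
```

Neighbors that land in a different bucket are never scored, which is where the recall loss mentioned in the post comes from; using more hash tables or probing nearby buckets trades latency back for recall.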

  • Diffbot reposted this

    Started 2026 with a quick trip to Austin during the Texas winter freeze. This was the last ever Data Day Texas, but it was the kind of event where every room had something to say, from LLM debates to the weird edges of AI most people aren’t thinking about yet. Met some new folks. Reconnected with old ones. Altogether, one for the books. #AI #DataDayTexas #Diffbot Jerome Choo 🤖 Lynn Bender Mac McCarty Alexandra Pasi, PhD Jennifer Colehower Nnaemeka Akpunonu Vanessa McMahan Denise M.J. Harders Joel Anderson Mark H. O. Jaya Zenchenko Sarah McKenna Si Ran W.

  • Diffbot reposted this

    Investors in AI Web Infrastructure - Q1 2026

    The "Web" is being rebuilt for Agents, not Humans. We are seeing a massive infrastructure shift in Q1 2026. The new battleground isn't just the AI models themselves, it's the pipes that feed them. Here is where the smart capital is concentrating in the AI Web Infrastructure stack right now:

    Browser Infrastructure & Web Automation
    • CRV, Kleiner Perkins, Notable Capital (Browserbase)
    • Sequoia Capital, Spark Capital (Airtop)
    • Accel, SV Angel (Hyperbrowser)
    • Felicis (Browser Use)

    Crawling, Extraction & Web Data APIs
    • J&T Ventures (Apify)
    • Nexus Venture Partners (Firecrawl)
    • Tencent, Felicis (Diffbot)
    • AltaIR Capital, Goodwater Capital (Browse AI)

    Web Data Orchestration & Pipelines
    • Sequoia Capital, Accel, Redpoint, Meritech Capital (n8n)
    • OpenView, Matrix, Thrive Capital (Parabola)
    • NVIDIA's NVentures (Unstructured)

    AI Web Access & Indexing
    • Lightspeed (Exa, Composio)
    • Y Combinator (ParseHub, Inspector (YC F25), Prox (YC F25))
    • NVIDIA's NVentures (Contextual AI)

  • Diffbot reposted this

    What a crazy way to start the year 🚀 Just published a new Google for Developers Codelab on building GraphRAG-powered multi-agent systems with Google ADK + Neo4j. If you’re building RAG apps and want to go beyond vector search into relationship-aware, multi-hop reasoning, this is for you. This codelab is based on the amazing work done by my colleague Michael Hunger with MCP + Diffbot KG and Kurtis Van Gent from Google Cloud team last year. 👉 https://lnkd.in/gpYDSc9h Big thanks to Romin Irani for collaborating on this 🙌 Google Developer Experts Google Cloud #GraphRAG #AIAgents #KnowledgeGraphs #Neo4j #GoogleADK #GenAI

  • Diffbot reposted this

    State of E-commerce Data Providers - Q4 2025

    E-commerce runs on constant measurement: prices, promos, availability, seller changes, and "what the shelf actually looks like" across retailers and marketplaces. The challenge is stable collection at scale, retries when sites break, anti-bot evasion, clean geo signals, and then turning messy HTML into usable structured data. In preparation for the holiday season, we mapped the landscape of e-commerce data providers:

    • Competitive intel + digital shelf: DataWeave, Price2Spy, Intelligence Node, Profitero+, Wiser Solutions, Inc.
    • Marketplace intelligence + data: Jungle Scout, Helium 10, DataHawk, SellerSprite
    • Trade, Supply Chain, Imports / Exports: Trademo, ImportYeti, Descartes Datamyne
    • Scraper APIs & Extraction Platforms: Zyte, Diffbot, Stratalis, AutoScraping, SerpApi
    • Managed Data Extraction & Services: GroupBWT, DataOx, epctex, MrScraper
    • Retail Media & Ad Platforms: Pacvue, Perpetua, Teikametrics
    • Network & runtime infra for e-com scraping: Playwright, Puppeteer, browserless


Funding

Diffbot: 3 total rounds
Last round: Series A, US$ 10.0M
