Diffbot

Technology, Information and Internet

Menlo Park, California 5,359 followers

We structure the world's knowledge.

About us

We Structure the World's Knowledge. Diffbot is a world-class group of AI engineers building a universal database of structured information, to provide knowledge as a service to all intelligent applications. Whether you are building an app that uses web content, an enterprise business application, or a smart robotic assistant, we've got you covered. Thousands of leading companies rely on Diffbot data for their enterprise and consumer applications.

Website: https://www.diffbot.com/
Industry: Technology, Information and Internet
Company size: 11-50 employees
Headquarters: Menlo Park, California
Type: Privately Held
Founded: 2011
Specialties: machine learning, relation extraction, truth discovery, knowledge fusion, computer vision, web scraping, data extraction, information retrieval, artificial intelligence, and ecommerce

Locations

Primary

333 Ravenswood Ave

Menlo Park, California 94025, US

Updates

Diffbot reposted this
Mike Tung 🤖
2mo Edited
2025 was the year LLMs went from being good at simple chat interactions (ChatGPT) to being good at applications that involve calling tools and reading documents (coding with Claude Code). Yet thus far, there hasn't really been any good local alternative to the frontier models for terminal coding or web research applications. Why is that?

Well, the shape of the data in these two applications couldn't be more different. A chat looks like short text snippets that ping-pong back and forth, while agentic tool use reads in large text blocks, potentially over multiple turns interleaved with human messages. If you've played at all with LLMs on consumer hardware, you'll notice that chit-chatting with local LLMs works fine. But as soon as you enable tool use or try to use your local LLM as a backend to Claude Code, your session grinds to a halt right after the first tool call is returned, no matter how much VRAM you have. The reason? Standard transformer dot-product attention is fundamentally quadratic in sequence length.

What's interesting is that this situation has changed with the last wave of open-source models released by the major players. NVIDIA (Nemotron 3), Qwen (qwen3-coder-next/qwen3.5), and GLM (4.7-flash/5) all introduced some form of hybrid, sub-quadratic attention mechanism in their latest models. Even DeepSeek, the first breakout open-source model, updated their architecture with their own DeepSeek Sparse Attention (DSA) in v3.2. What these next-gen models have in common is that they replace the standard quadratic dot-product attention with "linear" variants of attention. (In practice they aren't fully linear but stack full and linear layers in some ratio, hence "hybrid".)

For example, probably the leading small open-source model right now, Qwen 3.5, uses a linear attention variant called GatedDeltaNet. Instead of a quadratic dot-product of matrices, this is a simple for-loop through the token sequence (see code in first image) that carries forward a fixed-size state, last_recurrent_state, which is decayed by g and updated with the new data by beta.

Think of full attention as taking an open-book exam with all of the pages of the book laid out on your desk so that you can jump to any of them, and of the linear recurrent state as a notecard you keep with you, writing on it and erasing from it as you read the pages. Full attention works, but you need a really large desk in order to answer questions the first way! Nevertheless, that is essentially how DeepSeek V3, Kimi2.5, and Llama 4 work, and there's a clear limit to how far that approach goes.

Check out our latest model, which fine-tunes qwen3-coder-next 80B A3B for GraphRAG. It can fit on a single consumer GPU or on your MacBook with Q4 GGUFs. I've been using it as the backend model for our LLM demo (try it at https://diffy.chat) and as a fully local alternative to Claude Code for the increasingly frequent times it is down. Link to our model in the comments.
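To make the for-loop described in the post concrete, here is a minimal single-head sketch of a gated delta-rule recurrence in plain Python. It is not Qwen's implementation: the function name, the argument names, and the choice to store the state as a small key-by-value matrix are assumptions made for this illustration. Each step decays the state by exp(g), reads back what the state currently holds for the incoming key, and writes only the difference, scaled by beta.

```python
import math


def gated_delta_recurrence(queries, keys, values, log_decays, betas):
    """Illustrative single-head gated delta rule (hypothetical helper).

    queries, keys: per-token lists of length d_k
    values:        per-token lists of length d_v
    log_decays:    per-token log decay g (exp(g) multiplies the old state)
    betas:         per-token write strength
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # The "notecard": a d_k x d_v state whose size never grows with the sequence.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g, beta in zip(queries, keys, values, log_decays, betas):
        decay = math.exp(g)
        # 1. Fade what is already on the notecard.
        state = [[decay * s for s in row] for row in state]
        # 2. Read back the value currently stored under key k.
        recalled = [sum(k[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        # 3. Delta rule: write only the correction, scaled by beta.
        delta = [beta * (v[j] - recalled[j]) for j in range(d_v)]
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * delta[j]
        # 4. Answer the current query from the updated state.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs, state


# Toy run: the same key is written twice; with beta = 1 and no decay,
# the second write replaces the first instead of piling on top of it.
toy_keys = [[1, 0], [1, 0]]
toy_values = [[2, 3], [9, 9]]
toy_outputs, toy_state = gated_delta_recurrence(
    toy_keys, toy_keys, toy_values, log_decays=[0.0, 0.0], betas=[1.0, 1.0]
)
print(toy_outputs[-1])
```

The work per token depends only on the state size, not on how many tokens came before, which is where the sub-quadratic behaviour comes from. Full attention, by contrast, compares each new token against every earlier one.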

Diffbot

5,359 followers
2mo
Recall vs. precision: which comes first in web search?

Diffbot reposted this
Massive

3,069 followers
2mo
Meet Diffbot - the Menlo Park team that's been transforming the entire public web into the world's largest queryable knowledge graph since 2011.

What makes them different:
• Knowledge Graph at unprecedented scale - 10+ billion entities and 1 trillion facts extracted from 60+ billion web pages, rebuilt every 4-5 days
• AI that reads like humans - Computer vision and ML that visually parses any page and extracts structured facts, no rules required
• Facts, not flat data - Entities are linked across the web; "Diffbot" in MIT Tech Review connects to the same "Diffbot" on LinkedIn or their site
• Complete web coverage - 246M+ organizations, 1.6B+ articles, 3M+ products, events, and discussions structured through their API

Companies across finance, intelligence, news, and ML (DuckDuckGo, Snapchat, FactSet, Dow Jones, Sequoia Capital) use Diffbot when they need the web as a structured database, not a collection of HTML.

Powering their operations, Massive provides the proxy infrastructure that enables Diffbot's continuous web-wide crawling at scale. We're proud to support platforms turning unstructured chaos into connected knowledge.

Diffbot

5,359 followers
3mo
How does ANN search work at scale? At scale, semantic search can’t compare every vector to every other one. ANN uses a vector index to retrieve likely nearest candidates first, then applies exact similarity on those candidates. This gives major latency gains with typically minor recall/accuracy degradation.
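One common way to realise the two stages described in the post is an inverted-file style index: vectors are grouped under a few coarse centroids, a query probes only the closest groups, and exact similarity is computed just for the vectors found there. The sketch below is a toy version in plain Python. The function names, the hand-picked centroids, and the `n_probe` parameter are assumptions for illustration; production systems typically learn the centroids and rely on an optimised ANN library.

```python
import math


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_index(vectors, centroids):
    """Assign each vector id to the bucket of its most similar centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets


def ann_search(query, vectors, centroids, buckets, top_k=2, n_probe=1):
    """Stage 1: pick the n_probe closest buckets. Stage 2: exact re-rank."""
    probed = sorted(
        range(len(centroids)),
        key=lambda c: cosine(query, centroids[c]),
        reverse=True,
    )[:n_probe]
    candidates = [vec_id for c in probed for vec_id in buckets[c]]
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:top_k]


# Toy data: two obvious clusters and two hand-picked centroids (assumed).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
buckets = build_index(vectors, centroids)
print(ann_search([1.0, 0.0], vectors, centroids, buckets, top_k=2, n_probe=1))
```

With `n_probe=1` only half of the toy vectors are ever scored, which is the latency gain. The recall cost appears when a true neighbour sits in a bucket that was not probed; raising `n_probe` trades speed back for recall.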

Diffbot reposted this
Wale Edu
4mo
Started 2026 with a quick trip to Austin during the Texas winter freeze. This was the last ever Data Day Texas, but it was the kind of event where every room had something to say from LLM debates to the weird edges of AI most people aren’t thinking about yet. Met some new folks. Reconnected with old ones. Altogether, one for the books. #AI #DataDayTexas #Diffbot Jerome Choo 🤖 Lynn Bender Mac McCarty Alexandra Pasi, PhD Jennifer Colehower Nnaemeka Akpunonu Vanessa McMahan Denise M.J. Harders Joel Anderson Mark H. O. Jaya Zenchenko Sarah McKenna Si Ran W.

Diffbot reposted this
Jason Grad
4mo
Investors in AI Web Infrastructure - Q1 2026

The "Web" is being rebuilt for Agents, not Humans. We are seeing a massive infrastructure shift in Q1 2026. The new battleground isn't just the AI models themselves, it's the pipes that feed them. Here is where the smart capital is concentrating in the AI Web Infrastructure stack right now:

Browser Infrastructure & Web Automation
CRV, Kleiner Perkins, Notable Capital (Browserbase)
Sequoia Capital, Spark Capital (Airtop)
Accel, SV Angel (Hyperbrowser)
Felicis (Browser Use)

Crawling, Extraction & Web Data APIs
J&T Ventures (Apify)
Nexus Venture Partners (Firecrawl)
Tencent, Felicis (Diffbot)
AltaIR Capital, Goodwater Capital (Browse AI)

Web Data Orchestration & Pipelines
Sequoia Capital, Accel, Redpoint, Meritech Capital (n8n)
OpenView, Matrix, Thrive Capital (Parabola)
NVIDIA's NVentures (Unstructured)

AI Web Access & Indexing
Lightspeed (Exa, Composio)
Y Combinator (ParseHub, Inspector (YC F25), Prox (YC F25))
NVIDIA's NVentures (Contextual AI)

Diffbot reposted this
Siddhant Agarwal
4mo
What a crazy way to start the year 🚀 Just published a new Google for Developers Codelab on building GraphRAG-powered multi-agent systems with Google ADK + Neo4j. If you’re building RAG apps and want to go beyond vector search into relationship-aware, multi-hop reasoning, this is for you. This codelab is based on the amazing work done by my colleague Michael Hunger with MCP + Diffbot KG and Kurtis Van Gent from Google Cloud team last year. 👉 https://lnkd.in/gpYDSc9h Big thanks to Romin Irani for collaborating on this 🙌 Google Developer Experts Google Cloud #GraphRAG #AIAgents #KnowledgeGraphs #Neo4j #GoogleADK #GenAI

Diffbot reposted this
Jason Grad
5mo
State of E-commerce Data Providers - Q4 2025

E-commerce runs on constant measurement: prices, promos, availability, seller changes, and "what the shelf actually looks like" across retailers and marketplaces. The challenge is stable collection at scale, retries when sites break, anti-bot evasion, clean geo signals, and then turning messy HTML into usable structured data. In preparation for the holiday season, we mapped the landscape of e-commerce data providers:

Competitive intel + digital shelf: DataWeave, Price2Spy, Intelligence Node, Profitero+, Wiser Solutions, Inc.
Marketplace intelligence + data: Jungle Scout, Helium 10, DataHawk, SellerSprite
Trade, Supply Chain, Imports / Exports: Trademo, ImportYeti, Descartes Datamyne
Scraper APIs & Extraction Platforms: Zyte, Diffbot, Stratalis, AutoScraping, SerpApi
Managed Data Extraction & Services: GroupBWT, DataOx, epctex, MrScraper
Retail Media & Ad Platforms: Pacvue, Perpetua, Teikametrics
Network & runtime infra for e-com scraping: Playwright, Puppeteer, browserless

Diffbot

5,359 followers
7mo
That's a 3fer for 2025!

Affiliated pages

LeadGraph

Software Development

Menlo Park, California

Funding

Diffbot: 3 total rounds
Last round: Series A, Mar 11, 2016, US$ 10.0M
Investors: Felicis, Tencent, and 4 other investors

Crawl: Data Extraction Software
Extract: Data Extraction Software
Knowledge Graph: Data Extraction Software
Natural Language: Natural Language Processing (NLP) Software

Employees at Diffbot

Sky Dayton

Kris Negulescu

Andy Chou

Aaron Lee

Similar pages

Import.io

Apify

Zyte

Bright Data

Oxylabs.io

Octoparse - Octopus Data Inc.

ScraperAPI

Neo4j

ScrapeHero

Kumu
