fix: optimize knowledge graph clustering for large corpus #1967

Open: aubford wants to merge 9 commits into main

@aubford commented on Mar 17, 2025

The current implementation of find_indirect_clusters runs in exponential time because its depth-first search always explores every path in the graph. A KG with ~3000 relationships takes over 3 hours on a 2024 M4 MacBook Pro.
This PR brings that down to quadratic-to-cubic time relative to the testset size (instead of the number of KG relationships). Generating a 100-sample testset of abstract multi-hop queries for a KG of 1 million relationships takes about 20 seconds.
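To illustrate why exhaustive path enumeration blows up, here is a minimal sketch that counts every simple path a depth-first search would visit in a small, fully connected graph. It is not the original find_indirect_clusters code; `count_simple_paths` and `complete_graph` are hypothetical helpers written only for this illustration.

```python
def complete_graph(n):
    """Return an adjacency map for a fully connected graph on n nodes."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def count_simple_paths(adjacency):
    """Count all simple paths with at least two nodes via exhaustive DFS."""

    def dfs(node, visited):
        total = 0
        for neighbor in adjacency[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                # One path ends at this neighbor, plus every extension of it.
                total += 1 + dfs(neighbor, visited)
                visited.remove(neighbor)
        return total

    return sum(dfs(start, {start}) for start in adjacency)


if __name__ == "__main__":
    for n in range(3, 8):
        print(n, count_simple_paths(complete_graph(n)))
```

In this worst case the count grows factorially with the number of nodes, so a search that must visit every path quickly becomes impractical as the graph gets denser.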

  • Added a new find_n_indirect_clusters method and applied it to MultiHopAbstractQuerySynthesizer.
  • It no longer searches the entire graph, only enough of it to get a well-diversified, randomized sample of clusters for the desired testset size.
  • Added lots of tests. I can scale them back. They're mostly there to demo the spec so you can easily compare the new behavior with the original find_indirect_clusters, which I left in place.
  • The new behavior adds randomization and diversity to cluster sampling.
    • It no longer returns subsets along with their supersets, only the superset.
  • Also fixed the other performance bottleneck in MultiHopAbstractQuerySynthesizer (collecting child nodes).
  • Eliminates a potentially expensive edge case where LLM calls could be made for every possible cluster in the KG.
  • In-depth details can be found in the find_n_indirect_clusters docstring.
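The general idea in the list above (sample only as many clusters as needed, randomize the search, and keep supersets rather than their subsets) can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `sample_clusters`, `build_adjacency`, `depth_limit`, and `seed` are hypothetical names, and the real behavior is described in the find_n_indirect_clusters docstring.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def sample_clusters(edges, n, depth_limit=3, seed=None):
    """Return up to n clusters (frozensets of nodes), none a subset of another.

    Hypothetical sketch: follow one random, depth-limited path per start
    node and stop once n clusters are collected, rather than enumerating
    every path in the graph.
    """
    rng = random.Random(seed)
    adjacency = build_adjacency(edges)
    starts = sorted(adjacency)
    rng.shuffle(starts)  # randomized start order diversifies the sample

    clusters = []
    for start in starts:
        if len(clusters) >= n:
            break  # enough clusters; leave the rest of the graph unexplored
        path, visited = [start], {start}
        while len(path) < depth_limit:
            candidates = sorted(adjacency[path[-1]] - visited)
            if not candidates:
                break
            nxt = rng.choice(candidates)
            path.append(nxt)
            visited.add(nxt)
        cluster = frozenset(path)
        if len(cluster) < 2:
            continue  # a lone node is not a cluster
        if any(cluster <= kept for kept in clusters):
            continue  # already covered by an equal or larger cluster
        # Drop previously kept clusters that the new one fully contains.
        clusters = [kept for kept in clusters if not kept < cluster]
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    demo_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
    print(sample_clusters(demo_edges, n=2, depth_limit=3, seed=0))
```

Because the loop stops once `n` clusters are found, the work in this sketch depends mainly on the requested sample size and the depth limit rather than on the total number of relationships, apart from the one-time pass that builds the adjacency map.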
@dosubot (bot) added the size:XXL label ("This PR changes 1000+ lines, ignoring generated files.") on Mar 17, 2025
@shahules786 requested review from jjmachan and shahules786, and removed the request for jjmachan, on March 17, 2025 18:22