fix: optimize knowledge graph clustering for large corpus #1967

aubford · 2025-03-17T18:11:03Z

The current implementation of find_indirect_clusters runs in exponential time because its depth-first search always explores every path in the graph. A KG with ~3000 relationships takes over 3 hours on a 2024 M4 MacBook Pro.
This change brings it down to quadratic/cubic time relative to testset size (instead of the number of KG relationships). Generating a 100-sample testset of abstract multihop queries for a KG of 1 million relationships takes about 20 seconds.
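To illustrate why exhaustive path exploration gets so expensive, here is a minimal sketch that counts every simple path a depth-first search would visit from one start node. It is only a toy: `count_simple_paths`, `complete_graph`, and the adjacency-dict representation are hypothetical names for illustration and are not taken from the ragas code.

```python
def count_simple_paths(adj, start, max_nodes):
    """Count every simple path of at most `max_nodes` nodes beginning at `start`.

    `adj` is a hypothetical adjacency dict {node: [neighbors]}; this mirrors
    the cost of a DFS that explores every path, not the ragas implementation.
    """
    count = 0
    # Each stack entry is (current node, set of nodes already on this path).
    stack = [(start, frozenset([start]))]
    while stack:
        node, seen = stack.pop()
        count += 1  # every stack entry is one distinct path
        if len(seen) >= max_nodes:
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                stack.append((nxt, seen | {nxt}))
    return count


def complete_graph(n):
    """Build a fully connected toy graph on nodes 0..n-1."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

On this toy input the count goes from 16 paths for 4 nodes to 65 for 5 and 326 for 6, so even small, dense graphs become impractical to search exhaustively.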

- Added a new find_n_indirect_clusters method and applied it to MultiHopAbstractQuerySynthesizer.
- It no longer searches the entire graph, just enough to get a well-diversified, randomized sample of clusters for the desired testset size.
- Added lots of tests. I can scale them back. They're mostly there to demo the spec so you can easily compare the new behavior with the original find_indirect_clusters, which I left in place.
- The new behavior adds randomization and diversity to cluster sampling.
  - It will no longer return subsets along with their supersets, only the superset.
- Also fixed the other performance bottleneck in MultiHopAbstractQuerySynthesizer (collecting child nodes).
- Eliminates a potentially expensive edge case where LLM calls could be made for every possible cluster in the KG.
- In-depth details can be found in the find_n_indirect_clusters docstring.
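The sampling behavior described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `sample_clusters`, `add_cluster`, the `depth_limit` parameter, and the adjacency-dict input are all hypothetical, and the real algorithm is described in the find_n_indirect_clusters docstring.

```python
import random


def add_cluster(clusters, new):
    """Add `new` unless an existing cluster already covers it.

    Existing clusters that are subsets of `new` are dropped, so only
    supersets survive.
    """
    if any(new <= c for c in clusters):
        return clusters
    return [c for c in clusters if not c <= new] + [new]


def sample_clusters(adj, n, depth_limit=3, seed=None):
    """Sample up to `n` clusters using randomized, depth-limited walks.

    Hypothetical sketch: `adj` is an adjacency dict {node: [neighbors]}.
    The search stops as soon as `n` clusters are collected instead of
    enumerating every path in the graph.
    """
    rng = random.Random(seed)
    starts = list(adj)
    rng.shuffle(starts)  # randomized start order gives a varied sample
    clusters = []
    for start in starts:
        if len(clusters) >= n:
            break  # enough clusters for the requested testset size
        path, seen = [start], {start}
        while len(path) < depth_limit:
            candidates = [x for x in adj[path[-1]] if x not in seen]
            if not candidates:
                break
            nxt = rng.choice(candidates)
            path.append(nxt)
            seen.add(nxt)
        if len(path) > 1:
            clusters = add_cluster(clusters, frozenset(path))
    return clusters
```

In this sketch the work is bounded by the number of start nodes tried and the depth limit rather than by the number of paths in the graph, which is the general idea behind tying the cost to testset size.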


aubford added 8 commits March 17, 2025 09:39

add unit test test_knowledge_graph_clusters for testing kg clustering…

b0c3a69

… algos before making changes to production code

add new dfs algo find_n_indirect_clusters

6e98e14

handle edge case where all nodes are in groupings of n nodes

5d7e09b

performance optimizations

3de71ce

cleanup

c4318b3

add synth test

88951c6

optimize child node search

cd19ca5

formatting

25dc0d7

dosubot bot added the size:XXL label (this PR changes 1000+ lines, ignoring generated files) Mar 17, 2025

shahules786 requested review from jjmachan and shahules786 and removed request for jjmachan March 17, 2025 18:22

aubrey-ford-nutrien mentioned this pull request Apr 2, 2025

Testset Generation: Is going into continuous loop #662

Open

apply n=1 for default_query_distribution case

d333365
