Conversation

ggerganov (Member) commented May 22, 2025

cont #13194

The KV cells editing logic is now implemented via the new struct llama_kv_cells_unified in the new src/llama-kv-cells.h source file. The goal is to simplify the implementation in llama-kv-cache.cpp and make it easier to understand and update in the future.

One of the primary simplifications is that llama_kv_cache_unified no longer tracks the number of used cells manually. This is now tracked automatically by llama_kv_cells_unified based on the edits that are applied, such as adding and removing sequences from the cells. The same goes for the has_shift flag.

  • The cell information (i.e. pos, delta, seq) is now a structure of arrays for better cache locality
  • The sequences that belong to a cell are now tracked with a std::bitset instead of std::set
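To make the two bullets concrete, here is a minimal sketch of what structure-of-arrays cell bookkeeping with std::bitset sequence tracking could look like. The names (kv_cells_sketch, MAX_SEQ, pos_set) are illustrative assumptions, not the actual llama.cpp implementation — the point is only to show how used and has_shift can be maintained automatically by the edit methods:

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// illustrative stand-ins, not the actual llama.cpp definitions
static constexpr int MAX_SEQ = 64;

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

struct kv_cells_sketch {
    // structure of arrays: one contiguous array per field for cache locality
    std::vector<llama_pos>            pos;   // position per cell, -1 == empty
    std::vector<llama_pos>            shift; // accumulated shift per cell
    std::vector<std::bitset<MAX_SEQ>> seq;   // sequences occupying each cell

    uint32_t used      = 0;     // maintained by the edit methods, not by the caller
    bool     has_shift = false; // set whenever any cell's position is shifted

    explicit kv_cells_sketch(uint32_t n) : pos(n, -1), shift(n, 0), seq(n) {}

    uint32_t size() const { return (uint32_t) pos.size(); }

    bool seq_has(uint32_t i, llama_seq_id s) const { return seq[i].test(s); }

    bool pos_in(uint32_t i, llama_pos p0, llama_pos p1) const {
        return pos[i] >= p0 && pos[i] < p1;
    }

    // place a token of sequence s into cell i
    void pos_set(uint32_t i, llama_seq_id s, llama_pos p) {
        if (pos[i] == -1) {
            used++;
        }
        pos[i] = p;
        seq[i].set(s);
    }

    // shift cell i by delta; returns true if the cell became empty
    bool pos_add(uint32_t i, llama_pos delta) {
        assert(pos[i] != -1);

        pos[i]   += delta;
        shift[i] += delta;
        has_shift = true;

        if (pos[i] < 0) {
            pos[i] = -1;
            seq[i].reset();
            used--;
            return true;
        }
        return false;
    }
};
```

Because the empty-check and the used decrement live inside pos_add, the caller no longer has to remember to do either — which is exactly what shortens the shift loop below.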

Here is an example of the position shift logic before and after the change:

// before
    for (uint32_t i = 0; i < size; ++i) {
        if (cells[i].has_seq_id(seq_id) && cells[i].pos >= p0 && cells[i].pos < p1) {
            has_shift = true;

            cells[i].pos   += delta;
            cells[i].delta += delta;

            if (cells[i].pos < 0) {
                if (!cells[i].is_empty()) {
                    used--;
                }
                cells[i].pos = -1;
                cells[i].seq_id.clear();
                if (new_head == size) {
                    new_head = i;
                }
            }
        }
    }

// after
    for (uint32_t i = 0; i < cells.size(); ++i) {
        if (!cells.pos_in(i, p0, p1)) {
            continue;
        }

        if (cells.seq_has(i, seq_id)) {
            if (cells.pos_add(i, delta)) {
                if (new_head == cells.size()) {
                    new_head = i;
                }
            }
        }
    }

Next

  • Add a heap for efficiently tracking min and max pos per sequence
  • Efficiently track the max occupied KV cell (i.e. n = cell_max()) instead of searching for it on every batch
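One common way to track the min pos per sequence with a heap, as planned in the first item, is a lazy-deletion priority queue: removals only decrement a live count, and stale heap entries are discarded when they reach the top. The sketch below is an assumption about how this could be done, not the planned llama.cpp implementation; all names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

using llama_pos = int32_t;

struct seq_pos_min {
    // min-heap of candidate positions; stale entries are skipped on access
    std::priority_queue<llama_pos, std::vector<llama_pos>, std::greater<llama_pos>> heap;
    std::unordered_map<llama_pos, int> count; // live multiplicity per position

    void add(llama_pos p)    { heap.push(p); count[p]++; }
    void remove(llama_pos p) { assert(count[p] > 0); count[p]--; }

    // amortized O(log n): discard entries whose live count dropped to zero
    llama_pos min() {
        while (!heap.empty() && count[heap.top()] == 0) {
            heap.pop();
        }
        return heap.empty() ? -1 : heap.top();
    }
};
```

The max per sequence is symmetric (a max-heap with the same lazy-deletion trick).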
ggerganov (Member, Author) commented May 22, 2025

In the next PR I will try to rework these 3 methods into a single entry point along the lines of batch_process_info_t init(const llama_batch & batch):

// =============================================================================================================
// TODO: refactor and simplify this [TAG: KV_API]
virtual llama_sbatch sbatch_init(const llama_batch & batch, bool logits_all) = 0;
// different KV caches require different batch splitting strategies
virtual llama_ubatch ubatch_next(llama_sbatch & sbatch, uint32_t n_ubatch, bool embd_pooled) const = 0;
// find an empty slot of size "n_tokens" in the cache
virtual bool find_slot(const llama_ubatch & batch) = 0;

The main goal is to be able to run SWA caches with just n_swa + n_ubatch cells as explained in #13194 (comment). This refactor will also absorb the find_slot logic so that the llama_context won't need to be aware of this implementation detail. The batch -> sbatch -> ubatch logic will also be better contained within the llama_kv_cache and won't leak into llama_decode() as it does now, which will be useful later for implementing new KV caches.
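As a rough illustration of the idea, the init(batch) result could carry the pre-split ubatches together with the slot-finding outcome, so llama_context only sees "ready to process or not". This is purely a guess at the shape of batch_process_info_t (the author's working name); the types and splitting logic below are stand-ins:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// stand-in for the llama.cpp type, for illustration only
struct llama_ubatch {
    uint32_t n_tokens;
};

// hypothetical result of init(batch): the KV cache splits the batch and
// finds slots internally, so llama_decode() no longer needs to call
// sbatch_init/ubatch_next/find_slot itself
struct batch_process_info_t {
    bool success = false;               // a slot was found for every ubatch
    std::vector<llama_ubatch> ubatches; // ready-to-process micro-batches
};

// sketch of the splitting step: chunk n_tokens into pieces of at most n_ubatch
batch_process_info_t init_sketch(uint32_t n_tokens, uint32_t n_ubatch) {
    batch_process_info_t res;
    for (uint32_t i = 0; i < n_tokens; i += n_ubatch) {
        res.ubatches.push_back({ std::min(n_ubatch, n_tokens - i) });
    }
    res.success = true;
    return res;
}
```

Under this shape, "simulating a full cache" for worst-case buffer allocation is just building the info and never processing the ubatches, which is what makes set_full() removable.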

When this rework is ready, I will use the new init(batch) mechanism to remove the set_full() call from the interface:

// simulate full cache, used for allocating worst-case compute buffers
// TODO: remove
virtual void set_full() = 0;

Simulating a full cache will now be achieved by initializing the appropriate batches and simply not processing them.

Any suggestions about the plan are welcome.

@ggerganov ggerganov requested a review from slaren May 22, 2025 13:26
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part2 branch from 0a8cdc3 to eda2e13 Compare May 23, 2025 09:03
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part2 branch from 1ec785c to 0dc4804 Compare May 24, 2025 10:37
ggerganov (Member, Author) commented
While this change does not have a measurable impact on the performance under normal conditions, when building in Debug I observe a significant performance gain, which is nice:

./scripts/compare-commits.sh master gg/kv-cache-simplify-part2 -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -fa 1 -d 8192 -n 128 -p 0,1024 -r 5
| Model         | Test         | t/s master | t/s gg/kv-cache-simplify-part2 | Speedup |
| ------------- | ------------ | ---------- | ------------------------------ | ------- |
| llama 1B Q8_0 | pp1024@d8192 | 1562.31    | 2040.47                        | 1.31    |
| llama 1B Q8_0 | tg128@d8192  | 157.03     | 166.46                         | 1.06    |
diff --git a/scripts/compare-commits.sh b/scripts/compare-commits.sh
index e40d1cc6d..7d9ca79cf 100755
--- a/scripts/compare-commits.sh
+++ b/scripts/compare-commits.sh
@@ -24,7 +24,7 @@ dir="build-bench"
 
 function run {
     rm -fr ${dir} > /dev/null
-    cmake -B ${dir} -S . $cmake_opts > /dev/null
+    cmake -DCMAKE_BUILD_TYPE=Debug -B ${dir} -S . $cmake_opts > /dev/null
     cmake --build ${dir} -t llama-bench > /dev/null
     ${dir}/bin/llama-bench -o sql -oe md $bench_args | sqlite3 llama-bench.sqlite
 }
@ggerganov ggerganov merged commit de2ef53 into master May 25, 2025
53 checks passed
@ggerganov ggerganov deleted the gg/kv-cache-simplify-part2 branch May 25, 2025 13:34