
Conversation

@ikawrakow
Contributor

This PR improves k_quants perplexity scores by tweaking the quantization approach and quantization mixes. It is fully backward compatible (but obviously one needs to re-quantize the models to take advantage of these improvements).

The most significant gains are for LLAMA_FTYPE_MOSTLY_Q2_K, where perplexity drops by a large margin while the model size even shrinks slightly (e.g., from 2.67 GiB to 2.63 GiB for 7B). See graphs below.

Significant improvements are also observed for LLAMA_FTYPE_MOSTLY_Q3_K_M and LLAMA_FTYPE_MOSTLY_Q4_K_S for LLaMA-v2-7B. This comes at the expense of a slightly increased model size (e.g., at 7B, 3.59 GiB vs 3.56 GiB for Q4_K_S and 3.07 GiB vs 3.06 GiB for Q3_K_M).

Other quantization types / models are slightly better for LLaMA-v2 (but the change is much smaller compared to those mentioned above), or basically the same for LLaMA-v1.

Note on LLAMA_FTYPE_MOSTLY_Q2_K: strictly speaking, this is now mostly a Q3_K quantization. All tensors are quantized using Q3_K, except for attention K and Q, which are Q2_K, and output.weight, which is Q6_K as usual. I considered naming it LLAMA_FTYPE_MOSTLY_Q3_K_XS or similar, but given that this model is smaller and better than the previous LLAMA_FTYPE_MOSTLY_Q2_K (which would have been useless in comparison), I decided it is simpler to just re-use the LLAMA_FTYPE_MOSTLY_Q2_K designation for this new quantization mix.
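For illustration, here is a minimal sketch of the per-tensor type choice this mix implies (a hypothetical helper, not the actual llama.cpp selection code, which also special-cases things such as the first layers of attention.wv and feed_forward.w2, see the commit notes below):

```cpp
#include <string>
#include "ggml.h"

// Hypothetical sketch of the new "Q2_K" mix: everything Q3_K, except
// attention K/Q in Q2_K (the commit message below also lists tok_embeddings)
// and output.weight in Q6_K as usual.
static ggml_type q2_k_mix_type(const std::string & name) {
    if (name == "output.weight") {
        return GGML_TYPE_Q6_K;
    }
    if (name.find("attention.wk") != std::string::npos ||
        name.find("attention.wq") != std::string::npos) {
        return GGML_TYPE_Q2_K;
    }
    return GGML_TYPE_Q3_K; // everything else
}
```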

The following graph shows perplexity vs model size for the LLaMA-v2-7B model and a context length of 512. Black dots/lines are for current master (i.e., after the merge of the GGUF related changes). Red dots/lines depict the results of this PR. Results for Q4_0, Q4_1, Q5_0 and Q5_1 on current master are shown in blue for comparison. The perplexity of the fp16 model is 5.7963. The new Q6_K quantization arrives at 5.8067 (0.18% higher), compared to 5.8118 (0.27% higher) on master.

[graph: ppl_vs_size_l2_new]

The following graph is the same as the above, but with a smaller plot range to better appreciate the perplexity differences in the 4-6 bit quantization range.

[graph: ppl_vs_size_l2_new1]

Similar to the above graphs, but for the LLaMA-v1-7B model.

[graph: ppl_vs_size_l1_new]

Iwan Kawrakow added 12 commits August 22, 2023 08:47
* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
  attention.wv and feed_forward.w2

This leads to a slight model size increase as follows:
Q2_K  : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G

LLaMA-2 PPL for context 512 changes as follows:
Q2_K  : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041

There are improvements for LLaMA-1 as well, but they are
way smaller than the above.
For the same model size as the previous commit, we get
PPL = 5.9069 vs 5.9138.
With it, we get PPL = 5.8828 for L2-7B Q4_K_S.
Smaller model, lower perplexity.
 7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
13B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178

It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
which are Q2_K
@Green-Sky
Collaborator

Would it be possible to add some kind of meta information to the gguf to be able to determine if it was generated using the improvements of this PR? Maybe generic like date/build number, or more specific like k-quants-v1.1 or something. (Whatever makes sense, but gguf now has easy extensibility.)

The following graph is the same as the above, but with a smaller plot range to better appreciate the perplexity differences in the 4-6 bit quantization range.

I loled

@klosax
Contributor

klosax commented Aug 22, 2023

Would it be possible to add some kind of meta information to the gguf to be able to determine if it was generated using the improvements of this PR

Currently main will print the number of tensors of each quantization format.

@ikawrakow
Contributor Author

Currently main will print the number of tensors of each quantization format.

Yes, but one might decide to change the quantization strategy, so even though all tensors are quantized with the same type, the result is still different. For instance, in this PR I have changed Q4_K and Q5_K to use the newly added function make_qkx2_quants() instead of the previous approach in make_qkx1_quants(). Hence, a date (or any other version identifier such as the commit hash), as requested by @Green-Sky, would be a very useful thing to have in the meta data (and printed, so people know what kind of quantized model they are using).

As it stands, when running with this PR, LLAMA_FTYPE_MOSTLY_Q2_K through LLAMA_FTYPE_MOSTLY_Q3_K_L all get reported as Mostly Q3_K - Medium, LLAMA_FTYPE_MOSTLY_Q4_K_S and LLAMA_FTYPE_MOSTLY_Q4_K_M are both reported as Mostly Q4_K - Medium, etc. This will be really confusing if it remains this way.

@ggerganov
Member

#2710 adds the ftype field back to the meta data.

Feel free to extend the meta info further with version/date/commit/etc. As long as the added KV info is optional, we can extend it any way we like.
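As a sketch of what such an optional KV could look like with the gguf API (the key name and value below are made up for illustration; readers that do not know the key simply ignore it, which keeps the file backward compatible):

```cpp
#include "ggml.h" // gguf_* API

// Hypothetical: stamp a quantization-revision marker into the meta data at
// quantize time, so tools can tell which quantization revision wrote the file.
static void stamp_quantization_revision(struct gguf_context * ctx) {
    gguf_set_val_str(ctx, "quantize.revision", "k-quants-2023-08-22"); // made-up key/value
}
```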

@KerfuffleV2
Contributor

@TheBloke - in case you didn't see this. Might be a reason to hold off on conversion for a bit if you haven't started yet.

@IgnacioFDM
Contributor

Any benchmarks on models larger than 7B?

@TheBloke
Contributor

TheBloke commented Aug 22, 2023

Looks fantastic! 6.0x for Q3 is amazing

@ikawrakow
Contributor Author

Any benchmarks on models larger than 7B?

Yes, I did comparisons for both 13B LLaMAs. But development was done on a branch that did not have the GGUF changes. When I was ready to submit the PR, I rebased on master, which brought in the GGUF changes, and that changes the perplexity results. The change is actually quite dramatic for LLaMA-v2-13B: fp16 perplexity on current master is 5.1195, while what I had before was PPL = 5.1000. This is due to the change in rms_eps. Before GGUF, the default rms_eps was 5e-6. After GGUF, it is taken from the model meta data and there is no way to modify it, so it ends up being 1e-5. It was discussed e.g. in #2384 that 5e-6 gives lower perplexities than 1e-5, despite 1e-5 having been used during training. In any case, I'm re-running all calculations for the 13B models and will post them when they are ready.

@Green-Sky changed the title Quantization imrovements for k_quants Aug 22, 2023
@ikawrakow
Contributor Author

OK, here is the LLaMA-v2-13B result.

First an overview:
[graph: ppl_vs_size_l2_13_new]

Then with focus on the 4-6 bit quantization range:
[graph: ppl_vs_size_l2_13_new1]

@IgnacioFDM
Contributor

Massive improvement with Q2_K it looks like

@ikawrakow merged commit bac6699 into master Aug 22, 2023
@ikawrakow deleted the ik/better_q234_k branch August 22, 2023 16:14
@cebtenzzre
Collaborator

The part of the graphs that surprises me the most is that q4_1 and q5_1 have higher perplexity than their q4_0/q5_0 counterparts on LLaMA-v2.

@TheBloke It makes me wonder whether you should even bother providing q4_1/q5_1 quantizations for LLaMA-v2 models, since they are bigger, slower, and lower quality. Maybe you could at least make a note on the READMEs that they are probably not useful.

@Green-Sky
Collaborator

The values in the help of the quantization tool were not updated. @ikawrakow

@mirek190

q4.x or q5.x should be banned already, as the k-quant models are just better in everything ...

@cebtenzzre
Collaborator

I ran some performance tests. The most noticeable change is Q2_K, which is now 40% slower.

| GPU | Model | Test | t/s before | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| P40 | 7b q4_0 | tg128 | 55.94 | 55.91 | no change |
| P40 | 7b q2_K | tg128 | 55.71 | 33.38 | 0.599 |
| P40 | 7b q3_K_S | tg128 | 30.84 | 30.81 | no change |
| P40 | 7b q3_K_M | tg128 | 37.35 | 37.20 | 0.996 |
| P40 | 7b q3_K_L | tg128 | 34.79 | 34.77 | no change |
| P40 | 7b q4_K_S | tg128 | 53.97 | 53.32 | 0.988 |
| P40 | 7b q4_K_M | tg128 | 49.12 | 49.13 | no change |
| P40 | 7b q5_K_S | tg128 | 42.90 | 42.90 | no change |
| P40 | 7b q5_K_M | tg128 | 41.06 | 41.07 | no change |
| P40 | 7b q6_K | tg128 | 33.50 | 33.50 | no change |
@KerfuffleV2
Contributor

The most noticeable change is Q2_K, which is now 40% slower.

That seems surprising, since this is a backward compatible change. You should be able to quantize with this version and then test with a version from before the pull was committed - if you do that, do you still see a large performance difference?

@ikawrakow
Contributor Author

I cannot confirm a change in performance for Q2_K on RTX-4080. I get 147.7 t/s for Q2_K quantized the old and the new way. On an older GTX-1660, I get 41.1 t/s using the old Q2_K quantization, and 33.2 t/s using the new, so 0.808X. This is due to the fact that there are now a lot of Q3_K quantized tensors, and Q3_K performance is not as good as the others on older GPUs. But a 40% drop in inference performance seems too much. @cebtenzzre Can you share details of your tests (GPU, CUDA settings)? Thanks.

@TheBloke
Contributor

Hey guys, a couple of quick questions:

When I run ./quantize -h I see this table:

   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K   :  5.15G, -0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16    : 13.00G              @ 7B
   0  or  F32    : 26.00G              @ 7B

Is it correct that Q6_K has better perplexity than Q8_0? In which case there'd be no reason to include Q8_0 any more?

Also I assume it must be a measurement error that Q6_K has better perplexity than FP16? :) Like one figure is from before GGUF and one after or something? Would that also affect the Q6_K vs Q8_0 figures?

There used to be some text information displayed when ./quantize -h was run, explaining the different formats and indicating which were recommended or not. Any reason that was removed? I was going to use it for writing my new GGUF README description of the quant methods.

@klosax
Contributor

klosax commented Aug 23, 2023

I suggest all the +0.0004 ppl figures should be the difference from the F32 of the PTH model, which gives the lowest possible ppl. The original PTH models use BF16, which gets truncated when converting to F16.

@ikawrakow
Contributor Author

ikawrakow commented Aug 23, 2023

A PPL difference of +/- 0.001 is within the statistical noise for the amount of tokens in Wikitext. In the case of LLaMA-v1-7B it happens that Q6_K by chance arrives at a better PPL than fp16 (and Q8_0). But this will not be the case in general. For instance, for LLaMA-v2-7B, Q6_K has PPL = 5.8067, which is 0.18% higher than the PPL = 5.7963 for fp16 (and this is likely outside the statistical uncertainty of the result, but one would need a proper uncertainty estimate added to the perplexity tool to confirm that). When editing the numbers I just took what was currently on master and adapted the numbers. Not sure why the previous explanations were removed. I considered updating with the LLaMA-v2-7B numbers, but then decided to go for continuity and keep the LLaMA-v1-7B results in the help.
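For reference, a minimal sketch of the kind of uncertainty estimate mentioned above, assuming the per-token negative log-likelihoods are available: since PPL = exp(mean NLL), the standard error of the mean propagates through the exponential (delta method):

```cpp
#include <cmath>
#include <vector>

struct ppl_result { double ppl; double sigma; };

// PPL = exp(mean NLL); sigma_PPL ~= PPL * sem(NLL) by the delta method.
// An estimate like this is what would tell us whether e.g. a 0.18%
// difference is outside the statistical noise.
static ppl_result perplexity_with_error(const std::vector<double> & nll) {
    const double n = (double) nll.size();
    double mean = 0.0;
    for (double x : nll) mean += x;
    mean /= n;
    double var = 0.0;
    for (double x : nll) var += (x - mean)*(x - mean);
    var /= (n - 1.0);                      // sample variance
    const double sem = std::sqrt(var / n); // standard error of the mean
    const double ppl = std::exp(mean);
    return { ppl, ppl * sem };
}
```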

@klosax
Contributor

klosax commented Aug 23, 2023

The +/- ppl statistic may be confusing for normal users. Printing the real ppl may be better?

@KerfuffleV2
Contributor

Printing the real ppl may be better?

As a user, what could you do with the raw ppl number except for subtracting it from some other value (like unquantized) to get a relative value?

@klosax
Contributor

klosax commented Aug 23, 2023

At least print the real PPL value of the unquantized F32.

@TheBloke
Contributor

TheBloke commented Aug 23, 2023

OK thanks for the explanations!

What is the feeling regarding Q6_K vs Q8_0? Is there enough of a statistically significant difference between Q8_0 and Q6_K to make it worthwhile to still include Q8_0?

For example do you have a Q8_0 figure for the Llama V2 7B case you mentioned?

@KerfuffleV2
Contributor

At least print the real PPL value of the unquantized F32.

Seems pretty reasonable, though I think it's still kind of hard for the user to do anything with.

I was actually the one that added the additional information to the quantize tool, and my first pass included a lot more. Some of it came from this post: #406 (comment) (note, the values are outdated)

One metric I think is actually pretty useful is % PPL increase relative to going from a 13B to 7B model. I think users that have messed with LLMs a bit will have some conception of the difference between a 13B and 7B model, so saying "this increases perplexity 50% as much as going from 13B to 7B" means more than +0.68 ppl.
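A sketch of that metric (illustrative names; the fp16 PPLs of the 7B and 13B models serve as the reference scale):

```cpp
// "Quantizing costs X% as much quality as going from 13B to 7B":
// the quantized model's PPL increase over its own fp16 baseline,
// expressed as a percentage of the fp16 PPL gap between 7B and 13B.
static double ppl_cost_vs_downsize(double ppl_quant,   double ppl_fp16,
                                   double ppl_fp16_7b, double ppl_fp16_13b) {
    return 100.0 * (ppl_quant - ppl_fp16) / (ppl_fp16_7b - ppl_fp16_13b);
}
```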

What is the feeling regarding Q6_K vs Q8_0? Is there enough of a statistically significant difference between Q8_0 vs Q6_K to make it worthwhile including Q8_0 still?

The main use case I can think of is people who want to keep a high quality version of the model to requantize, but don't want to keep the full 16-bit model around, i.e. when using the quantize tool with --allow-requantize. The difference seems too small to care about for just running inference.

@IgnacioFDM
Contributor

OK thanks for the explanations!

What is the feeling regarding Q6_K vs Q8_0? Is there enough of a statistically significant difference between Q8_0 vs Q6_K to make it worthwhile including Q8_0 still?

For example do you have a Q8_0 figure for the Llama V2 7B case you mentioned?

Q8_0 could potentially be significantly faster than Q6_K if properly optimized I'd think (especially if we did INT8 activations instead of converting to FP16). But I might be mistaken.

@ikawrakow
Contributor Author

For example do you have a Q8_0 figure for the Llama V2 7B case you mentioned?

Yes. Q8_0 PPL for LLaMA-v2-7B is 5.7986, so 0.14% better than Q6_K.

While experimenting with the k_quants refinement (PR #2707), at some point I tried using Q8_0 instead of Q6_K for the output.weight tensor. This improved the PPL for all quantization types by ~0.003 for LLaMA-v2-7B, but made it worse by ~0.003 for LLaMA-v1-7B. So, basically, it very much depends on the model and the distribution of weight values in the tensors. Q6_K has only 6 bits available, but it does some extra work to minimize the difference to the float weights, so in some cases this can end up being better than the 8-bit round-to-nearest used in Q8_0. But it is unlikely for this to be true in general; 2 extra bits are going to be beneficial more often than not. Overall I think Q6_K is within 0.2% of fp16 or better, so most likely indistinguishable from fp16 for most practical purposes. But I wouldn't quite remove Q8_0 yet.
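To illustrate the two ideas being contrasted here, a toy sketch (not the actual make_qkx* code, which works on super-blocks with quantized sub-block scales): round-to-nearest fixes the scale from the max magnitude, while the "extra work" variant scans nearby scales and keeps the one with the smallest squared error to the float weights:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize n floats to integers in [-qmax, qmax]; returns the scale d with
// x[i] ~= d*q[i]. With search == false this is plain round-to-nearest
// (Q8_0-style, d = amax/qmax); with search == true, nearby scales are tried
// and the one minimizing the squared error is kept.
static float quantize_block(const float * x, int8_t * q, int n, int qmax, bool search) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.0f) { std::fill(q, q + n, 0); return 0.0f; }
    float best_d = amax / qmax, best_err = HUGE_VALF;
    const int nstep = search ? 4 : 0;
    for (int step = -nstep; step <= nstep; ++step) {
        const float d = (amax / qmax) * (1.0f + 0.02f * step);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            const int v = std::max(-qmax, std::min(qmax, (int) std::lround(x[i] / d)));
            const float diff = x[i] - d * v;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) std::max(-qmax, std::min(qmax, (int) std::lround(x[i] / best_d)));
    }
    return best_d;
}
```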

@KerfuffleV2
Contributor

Q6_K has only 6 bits available, but it does some extra work to minimize the difference to the float weights, so in some cases this can end up being better than the 8-bit round-to-nearest used in Q8_0.

Theoretically a new Q8_something could be added that does this extra work and is always better than Q6_K and Q8_0. Correct?

@ikawrakow
Contributor Author

Q8_0 could potentially be significantly faster than Q6_K if properly optimized I'd think (especially if we did INT8 activations instead of converting to FP16). But I might be mistaken.

Q8_0 will never be faster than Q6_K for token prediction, which, at least on current hardware, is totally memory bound. The ~30% difference in size cannot be recovered by the fewer computations needed in Q8_0 matrix multiplications. For prompt processing, yes, Q8_0 is faster than Q6_K. To give specific numbers: TG-128 on RTX-4080 is 91.3 tokens/second for Q6_K vs 78.3 t/s for Q8_0. Perplexity takes 134.6 seconds for Q8_0 and 146.9 seconds for Q6_K.
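A back-of-the-envelope check of the memory-bound claim (the bandwidth figure is assumed; the model sizes come from the ./quantize table earlier in the thread). If every weight is read once per generated token, t/s is bounded by bandwidth / model size:

```cpp
#include <cstdio>

int main() {
    const double bw_gb_s = 717.0;  // assumed RTX 4080 peak memory bandwidth
    const double q6_k_gb = 5.15;   // 7B Q6_K size from the quantize table
    const double q8_0_gb = 6.70;   // 7B Q8_0 size from the quantize table
    // Upper bounds: ~139 t/s (Q6_K) vs ~107 t/s (Q8_0). The measured 91.3 and
    // 78.3 t/s sit below these bounds, but show the same size-driven ordering.
    printf("Q6_K bound: %.0f t/s, Q8_0 bound: %.0f t/s\n",
           bw_gb_s / q6_k_gb, bw_gb_s / q8_0_gb);
    return 0;
}
```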

@TheBloke
Contributor

OK thanks very much! I will keep making Q8_0s then.

I'm definitely dropping Q4_0, Q4_1, Q5_0 and Q5_1.

@ikawrakow
Contributor Author

Theoretically a new Q8_something could be added that does this extra work and is always better than Q6_K and Q8_0. Correct?

Theoretically, yes. In practice, it is not so easy to make sure that it always beats (or at least matches) Q6_K, given how small the difference between Q6_K and fp16 is.

@IgnacioFDM
Contributor

Q8_0 will never be faster than Q6_K for token prediction, which, at least on current hardware, is totally memory bound. The ~30% difference in size cannot be recovered by the fewer computations needed in Q8_0 matrix multiplications.

Isn't that only the case for consumer hardware? I'd expect tensor core INT8 inference to be significantly faster on A100 than the current setup with quantized mulmat.

@cebtenzzre
Collaborator

cebtenzzre commented Aug 23, 2023

I'm definitely dropping Q4_0, Q4_1, Q5_0 and Q5_1.

On my P40, Q5_0 is about 9% faster at token generation than Q5_K_S for a negligible difference in perplexity and file size on LLaMA-v2-7b. Could you keep that one at least?

@Dampfinchen

Q8_0 will never be faster than Q6_K for token prediction, which, at least on current hardware, is totally memory bound. The ~30% difference in size cannot be recovered by the fewer computations needed in Q8_0 matrix multiplications.

Isn't that only the case for consumer hardware? I'd expect tensor core INT8 inference to be significantly faster on A100 than the current setup with quantized mulmat.

Consumer GPUs support INT4 and INT8 inference on tensor cores as well.

@ikawrakow
Contributor Author

ikawrakow commented Aug 24, 2023

OK, I have most of the LLaMA-v2-70B results now. Did not (yet) do Q5_1. Cannot run Q8_0 and fp16 on the computers I have available (not enough RAM).

As table:

| Quantization | Model size (GiB) | Perplexity | Delta to fp16 |
| --- | --- | --- | --- |
| Q4_0 | 36.20 | 3.5550 | 3.61% |
| Q4_1 | 40.20 | 3.5125 | 2.37% |
| Q5_0 | 44.20 | 3.4744 | 1.26% |
| Q2_K | 27.11 | 3.8164 | 11.2% |
| Q3_K_S | 27.70 | 3.7800 | 10.2% |
| Q3_K_M | 30.83 | 3.5932 | 4.72% |
| Q3_K_L | 33.67 | 3.5617 | 3.80% |
| Q4_K_S | 36.31 | 3.4923 | 1.78% |
| Q4_K_M | 38.54 | 3.4725 | 1.20% |
| Q5_K_S | 44.20 | 3.4483 | 0.50% |
| Q5_K_M | 45.41 | 3.4451 | 0.40% |
| Q6_K | 52.70 | 3.4367 | 0.16% |
| fp16 | 128.5 | 3.4313 | - |

As graph:
[graph: ppl_vs_size_l2_70]

@KerfuffleV2
Contributor

I have most of the LLaMA-v2-70B results now.

Based on the 13B results, I guess we can expect the difference between the previous version and this pull to be very small, so not really worth comparing?

@ggerganov
Member

The PPL for LLaMA v2 70B F16 is 3.4313

Here is the full Metal run. Not sure why the estimated time was so off (~4 hours); it took just 1.2 hours.

main: build = 1044 (44d5462)
main: seed  = 1692857864
llama_model_loader: loaded meta data with 15 key-value pairs and 723 tensors from models/llama-70b-v2/ggml-model-f16.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight f16      [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   10:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.0.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensors 12 onward elided: blk.1 through the remaining layers repeat the same per-layer pattern as blk.0 above (attn_q/attn_k/attn_v/attn_output and ffn_gate/ffn_down/ffn_up in f16; attn_norm and ffn_norm in f32)
llama_model_loader: - tensor  379:          blk.41.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  380:           blk.41.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  381:             blk.42.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  382:             blk.42.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  383:             blk.42.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  384:        blk.42.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  385:           blk.42.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  386:           blk.42.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  387:             blk.42.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  388:          blk.42.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  389:           blk.42.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  390:             blk.43.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  391:             blk.43.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  392:             blk.43.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  393:        blk.43.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  394:           blk.43.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  395:           blk.43.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  396:             blk.43.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  397:          blk.43.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  398:           blk.43.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  399:             blk.44.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  400:             blk.44.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  401:             blk.44.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  402:        blk.44.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  403:           blk.44.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  404:           blk.44.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  405:             blk.44.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  406:          blk.44.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  407:           blk.44.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  408:             blk.45.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  409:             blk.45.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  410:             blk.45.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  411:        blk.45.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  412:           blk.45.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  413:           blk.45.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  414:             blk.45.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  415:          blk.45.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  416:           blk.45.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  417:             blk.46.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  418:             blk.46.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  419:             blk.46.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  420:        blk.46.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  421:           blk.46.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  422:           blk.46.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  423:             blk.46.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  424:          blk.46.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  425:           blk.46.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  426:             blk.47.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  427:             blk.47.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  428:             blk.47.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  429:        blk.47.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  430:           blk.47.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  431:           blk.47.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  432:             blk.47.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  433:          blk.47.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  434:           blk.47.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  435:             blk.48.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  436:             blk.48.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  437:             blk.48.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  438:        blk.48.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  439:           blk.48.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  440:           blk.48.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  441:             blk.48.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  442:          blk.48.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  443:           blk.48.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  444:             blk.49.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  445:             blk.49.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  446:             blk.49.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  447:        blk.49.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  448:           blk.49.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  449:           blk.49.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  450:             blk.49.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  451:          blk.49.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  452:           blk.49.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  453:             blk.50.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  454:             blk.50.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  455:             blk.50.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  456:        blk.50.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  457:           blk.50.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  458:           blk.50.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  459:             blk.50.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  460:          blk.50.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  461:           blk.50.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  462:             blk.51.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  463:             blk.51.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  464:             blk.51.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  465:        blk.51.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  466:           blk.51.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  467:           blk.51.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  468:             blk.51.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  469:          blk.51.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  470:           blk.51.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  471:             blk.52.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  472:             blk.52.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  473:             blk.52.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  474:        blk.52.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  475:           blk.52.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  476:           blk.52.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  477:             blk.52.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  478:          blk.52.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  479:           blk.52.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  480:             blk.53.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  481:             blk.53.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  482:             blk.53.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  483:        blk.53.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  484:           blk.53.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  485:           blk.53.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  486:             blk.53.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  487:          blk.53.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  488:           blk.53.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  489:             blk.54.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  490:             blk.54.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  491:             blk.54.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  492:        blk.54.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  493:           blk.54.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  494:           blk.54.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  495:             blk.54.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  496:          blk.54.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  497:           blk.54.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  498:             blk.55.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  499:             blk.55.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  500:             blk.55.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  501:        blk.55.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  502:           blk.55.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  503:           blk.55.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  504:             blk.55.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  505:          blk.55.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  506:           blk.55.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  507:             blk.56.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  508:             blk.56.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  509:             blk.56.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  510:        blk.56.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  511:           blk.56.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  512:           blk.56.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  513:             blk.56.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  514:          blk.56.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  515:           blk.56.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  516:             blk.57.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  517:             blk.57.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  518:             blk.57.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  519:        blk.57.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  520:           blk.57.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  521:           blk.57.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  522:             blk.57.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  523:          blk.57.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  524:           blk.57.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  525:             blk.58.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  526:             blk.58.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  527:             blk.58.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  528:        blk.58.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  529:           blk.58.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  530:           blk.58.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  531:             blk.58.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  532:          blk.58.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  533:           blk.58.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  534:             blk.59.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  535:             blk.59.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  536:             blk.59.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  537:        blk.59.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  538:           blk.59.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  539:           blk.59.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  540:             blk.59.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  541:          blk.59.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  542:           blk.59.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  543:             blk.60.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  544:             blk.60.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  545:             blk.60.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  546:        blk.60.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  547:           blk.60.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  548:           blk.60.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  549:             blk.60.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  550:          blk.60.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  551:           blk.60.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  552:             blk.61.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  553:             blk.61.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  554:             blk.61.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  555:        blk.61.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  556:           blk.61.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  557:           blk.61.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  558:             blk.61.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  559:          blk.61.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  560:           blk.61.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  561:             blk.62.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  562:             blk.62.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  563:             blk.62.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  564:        blk.62.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  565:           blk.62.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  566:           blk.62.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  567:             blk.62.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  568:          blk.62.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  569:           blk.62.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  570:             blk.63.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  571:             blk.63.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  572:             blk.63.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  573:        blk.63.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  574:           blk.63.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  575:           blk.63.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  576:             blk.63.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  577:          blk.63.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  578:           blk.63.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  579:             blk.64.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  580:             blk.64.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  581:             blk.64.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  582:        blk.64.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  583:           blk.64.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  584:           blk.64.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  585:             blk.64.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  586:          blk.64.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  587:           blk.64.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  588:             blk.65.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  589:             blk.65.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  590:             blk.65.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  591:        blk.65.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  592:           blk.65.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  593:           blk.65.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  594:             blk.65.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  595:          blk.65.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  596:           blk.65.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  597:             blk.66.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  598:             blk.66.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  599:             blk.66.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  600:        blk.66.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  601:           blk.66.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  602:           blk.66.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  603:             blk.66.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  604:          blk.66.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  605:           blk.66.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  606:             blk.67.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  607:             blk.67.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  608:             blk.67.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  609:        blk.67.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  610:           blk.67.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  611:           blk.67.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  612:             blk.67.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  613:          blk.67.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  614:           blk.67.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  615:             blk.68.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  616:             blk.68.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  617:             blk.68.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  618:        blk.68.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  619:           blk.68.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  620:           blk.68.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  621:             blk.68.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  622:          blk.68.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  623:           blk.68.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  624:             blk.69.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  625:             blk.69.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  626:             blk.69.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  627:        blk.69.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  628:           blk.69.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  629:           blk.69.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  630:             blk.69.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  631:          blk.69.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  632:           blk.69.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  633:             blk.70.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  634:             blk.70.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  635:             blk.70.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  636:        blk.70.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  637:           blk.70.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  638:           blk.70.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  639:             blk.70.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  640:          blk.70.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  641:           blk.70.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  642:             blk.71.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  643:             blk.71.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  644:             blk.71.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  645:        blk.71.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  646:           blk.71.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  647:           blk.71.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  648:             blk.71.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  649:          blk.71.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  650:           blk.71.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  651:             blk.72.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  652:             blk.72.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  653:             blk.72.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  654:        blk.72.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  655:           blk.72.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  656:           blk.72.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  657:             blk.72.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  658:          blk.72.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  659:           blk.72.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  660:             blk.73.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  661:             blk.73.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  662:             blk.73.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  663:        blk.73.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  664:           blk.73.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  665:           blk.73.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  666:             blk.73.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  667:          blk.73.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  668:           blk.73.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  669:             blk.74.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  670:             blk.74.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  671:             blk.74.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  672:        blk.74.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  673:           blk.74.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  674:           blk.74.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  675:             blk.74.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  676:          blk.74.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  677:           blk.74.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  678:             blk.75.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  679:             blk.75.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  680:             blk.75.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  681:        blk.75.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  682:           blk.75.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  683:           blk.75.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  684:             blk.75.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  685:          blk.75.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  686:           blk.75.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  687:             blk.76.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  688:             blk.76.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  689:             blk.76.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  690:        blk.76.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  691:           blk.76.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  692:           blk.76.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  693:             blk.76.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  694:          blk.76.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  695:           blk.76.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  696:             blk.77.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  697:             blk.77.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  698:             blk.77.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  699:        blk.77.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  700:           blk.77.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  701:           blk.77.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  702:             blk.77.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  703:          blk.77.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  704:           blk.77.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  705:             blk.78.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  706:             blk.78.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  707:             blk.78.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  708:        blk.78.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  709:           blk.78.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  710:           blk.78.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  711:             blk.78.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  712:          blk.78.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  713:           blk.78.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  714:             blk.79.attn_q.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  715:             blk.79.attn_k.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  716:             blk.79.attn_v.weight f16      [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  717:        blk.79.attn_output.weight f16      [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  718:           blk.79.ffn_gate.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  719:           blk.79.ffn_down.weight f16      [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  720:             blk.79.ffn_up.weight f16      [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  721:          blk.79.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  722:           blk.79.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:  562 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly F16
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 131565.25 MB
llm_load_tensors: mem required  = 131565.25 MB (+  160.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x113106ee0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                        0x113107620 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                            0x113107b60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                        0x1131081b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                          0x1131086f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                           0x113108c30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                           0x113109170 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                           0x1131096b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max                       0x113109d80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf                  0x11310a400 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                   0x113205590 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x113205ef0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x1132065c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x113206c90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x113207360 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x113207a30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x113208100 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x113306f40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                       0x113307770 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                           0x1133080d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x113704530 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x113704dd0 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x113705550 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x113705e50 | th_max =  640 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x1137065d0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x11310aa60 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x11310b300 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x11310bc80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x1132088a0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x113209180 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x113706c70 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x113308770 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x113308e10 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x1133095d0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x113309d90 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x11310c320 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_rope                           0x11310c980 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_alibi_f32                      0x1137071b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x113707b80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x113708430 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x113708ce0 | th_max = 1024 | th_width =   32
ggml_metal_init: recommendedMaxWorkingSetSize  = 147456.00 MB
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size =  145.41 MB
llama_new_context_with_model: max tensor size =   500.00 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 110592.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 21473.28 MB, offs = 115439812608, (132065.72 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.42 MB, (132067.14 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (132229.14 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   144.02 MB, (132373.16 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity: tokenizing the input ..
perplexity: calculating perplexity over 655 chunks, batch_size=512
perplexity: 21.47 seconds per pass - ETA 3 hours 54.40 minutes
[1]3.0177,[2]3.3428,[3]3.8718,[4]3.3837,[5]3.1850,[6]3.0186,[7]3.0333,[8]3.0344,[9]2.9744,[10]2.9192,[11]2.8739,[12]2.9128,[13]3.0101,[14]3.1590,[15]3.3160,[16]3.2632,[17]3.3276,[18]3.4259,[19]3.3624,[20]3.4237,[21]3.4344,[22]3.3608,[23]3.3745,[24]3.3487,[25]3.3381,[26]3.2516,[27]3.1746,[28]3.1237,[29]3.0685,[30]2.9913,[31]2.9330,[32]2.9428,[33]2.9190,[34]2.9255,[35]2.9385,[36]2.9692,[37]2.9755,[38]2.9795,[39]3.0035,[40]3.0395,[41]3.0563,[42]3.0874,[43]3.0924,[44]3.1373,[45]3.1631,[46]3.1706,[47]3.2040,[48]3.2140,[49]3.2255,[50]3.2191,[51]3.2372,[52]3.2472,[53]3.2885,[54]3.2967,[55]3.2900,[56]3.2434,[57]3.2204,[58]3.2464,[59]3.2732,[60]3.3094,[61]3.3189,[62]3.3675,[63]3.3935,[64]3.4052,[65]3.4332,[66]3.4374,[67]3.4546,[68]3.4731,[69]3.5022,[70]3.5340,[71]3.5619,[72]3.5952,[73]3.6369,[74]3.6527,[75]3.6676,[76]3.6844,[77]3.7020,[78]3.7016,[79]3.7280,[80]3.7360,[81]3.7569,[82]3.7490,[83]3.7277,[84]3.7284,[85]3.7362,[86]3.7348,[87]3.6951,[88]3.6524,[89]3.6128,[90]3.5817,[91]3.5583,[92]3.5413,[93]3.5320,[94]3.5054,[95]3.5123,[96]3.4905,[97]3.4703,[98]3.4532,[99]3.4390,[100]3.4341,[101]3.4333,[102]3.4269,[103]3.4249,[104]3.4189,[105]3.4138,[106]3.4103,[107]3.4050,[108]3.4076,[109]3.3945,[110]3.3808,[111]3.3814,[112]3.3858,[113]3.3781,[114]3.3634,[115]3.3566,[116]3.3469,[117]3.3343,[118]3.3498,[119]3.3640,[120]3.3877,[121]3.3983,[122]3.4198,[123]3.4489,[124]3.4667,[125]3.4725,[126]3.5028,[127]3.5299,[128]3.5527,[129]3.5316,[130]3.5422,[131]3.5481,[132]3.5520,[133]3.5500,[134]3.5624,[135]3.5695,[136]3.5720,[137]3.5776,[138]3.5749,[139]3.5751,[140]3.5788,[141]3.5666,[142]3.5685,[143]3.5555,[144]3.5500,[145]3.5472,[146]3.5493,[147]3.5592,[148]3.5662,[149]3.5694,[150]3.5756,[151]3.5854,[152]3.5889,[153]3.5875,[154]3.5901,[155]3.5994,[156]3.6032,[157]3.6160,[158]3.6200,[159]3.6256,[160]3.6348,[161]3.6465,[162]3.6372,[163]3.6361,[164]3.6274,[165]3.6174,[166]3.6078,[167]3.5933,[168]3.5751,[169]3.5644,[170]3.5612,[171]3.5509,[172]3.5486,[173]3.5446,[174]3.5332,[175]3.5237,[176]3.5201,[177]3.5141,[178]3.5061,[179]3.5017,[180]3.5009,[181]3.4939,[182]3.4867,[183]3.4847,[184]3.4878,[185]3.4811,[186]3.4831,[187]3.4899,[188]3.4845,[189]3.4960,[190]3.4979,[191]3.5121,[192]3.5233,[193]3.5342,[194]3.5410,[195]3.5583,[196]3.5687,[197]3.5830,[198]3.5937,[199]3.5990,[200]3.5866,[201]3.5713,[202]3.5632,[203]3.5575,[204]3.5494,[205]3.5488,[206]3.5483,[207]3.5440,[208]3.5426,[209]3.5441,[210]3.5511,[211]3.5609,[212]3.5684,[213]3.5777,[214]3.5840,[215]3.5887,[216]3.6001,[217]3.6130,[218]3.6250,[219]3.6287,[220]3.6308,[221]3.6298,[222]3.6332,[223]3.6331,[224]3.6303,[225]3.6294,[226]3.6438,[227]3.6478,[228]3.6567,[229]3.6647,[230]3.6657,[231]3.6782,[232]3.6755,[233]3.6708,[234]3.6656,[235]3.6534,[236]3.6529,[237]3.6508,[238]3.6572,[239]3.6547,[240]3.6530,[241]3.6562,[242]3.6604,[243]3.6618,[244]3.6590,[245]3.6593,[246]3.6547,[247]3.6516,[248]3.6513,[249]3.6506,[250]3.6561,[251]3.6534,[252]3.6535,[253]3.6470,[254]3.6439,[255]3.6408,[256]3.6328,[257]3.6293,[258]3.6271,[259]3.6275,[260]3.6238,[261]3.6225,[262]3.6224,[263]3.6213,[264]3.6074,[265]3.6105,[266]3.6119,[267]3.6111,[268]3.6160,[269]3.6169,[270]3.6214,[271]3.6292,[272]3.6259,[273]3.6268,[274]3.6286,[275]3.6335,[276]3.6377,[277]3.6476,[278]3.6562,[279]3.6634,[280]3.6675,[281]3.6749,[282]3.6812,[283]3.6927,[284]3.7010,[285]3.7093,[286]3.7185,[287]3.7157,[288]3.7206,[289]3.7187,[290]3.7072,[291]3.6944,[292]3.6817,[293]3.6710,[294]3.6658,[295]3.6677,[296]3.6693,[297]3.6683,[298]3.6676,[299]3.6635,[300]3.6532,[301]3.6469,[302]3.6400,[303]3.6353,[304]3.6298,[305]3.6259,[306]3.6197,[307]3.6140,[308]3.6111,[309]3.6038,[310]3.5991,[311]3.5946,[312]3.5918,[313]3.5887,[314]3.5865,[315]3.5774,[316]3.5714,[317]3.5638,[318]3.5534,[319]3.5606,[320]3.5698,[321]3.5747,[322]3.5744,[323]3.5705,[324]3.5709,[325]3.5781,[326]3.5811,[327]3.5834,[328]3.5871,[329]3.5912,[330]3.5929,[331]3.6018,[332]3.6007,[333]3.6069,[334]3.6029,[335]3.6013,[336]3.6042,[337]3.6050,[338]3.6043,[339]3.5963,[340]3.5952,[341]3.6003,[342]3.6043,[343]3.6083,[344]3.6092,[345]3.6125,[346]3.6134,[347]3.6171,[348]3.6215,[349]3.6254,[350]3.6257,[351]3.6280,[352]3.6296,[353]3.6273,[354]3.6264,[355]3.6289,[356]3.6330,[357]3.6325,[358]3.6403,[359]3.6388,[360]3.6378,[361]3.6391,[362]3.6445,[363]3.6518,[364]3.6558,[365]3.6591,[366]3.6621,[367]3.6685,[368]3.6691,[369]3.6719,[370]3.6750,[371]3.6745,[372]3.6803,[373]3.6834,[374]3.6843,[375]3.6844,[376]3.6899,[377]3.6891,[378]3.6923,[379]3.6947,[380]3.6918,[381]3.6910,[382]3.6879,[383]3.6873,[384]3.6886,[385]3.6886,[386]3.6889,[387]3.6905,[388]3.6888,[389]3.6865,[390]3.6838,[391]3.6814,[392]3.6801,[393]3.6785,[394]3.6819,[395]3.6828,[396]3.6802,[397]3.6852,[398]3.6895,[399]3.6949,[400]3.6952,[401]3.6963,[402]3.6967,[403]3.6994,[404]3.7043,[405]3.6937,[406]3.6836,[407]3.6760,[408]3.6734,[409]3.6793,[410]3.6862,[411]3.6941,[412]3.7035,[413]3.7092,[414]3.7121,[415]3.7162,[416]3.7198,[417]3.7259,[418]3.7247,[419]3.7277,[420]3.7313,[421]3.7381,[422]3.7396,[423]3.7413,[424]3.7459,[425]3.7499,[426]3.7534,[427]3.7555,[428]3.7584,[429]3.7607,[430]3.7640,[431]3.7717,[432]3.7735,[433]3.7725,[434]3.7708,[435]3.7730,[436]3.7739,[437]3.7801,[438]3.7859,[439]3.7847,[440]3.7831,[441]3.7758,[442]3.7734,[443]3.7743,[444]3.7747,[445]3.7743,[446]3.7756,[447]3.7775,[448]3.7776,[449]3.7763,[450]3.7771,[451]3.7755,[452]3.7665,[453]3.7577,[454]3.7497,[455]3.7409,[456]3.7409,[457]3.7332,[458]3.7271,[459]3.7194,[460]3.7099,[461]3.7004,[462]3.6912,[463]3.6846,[464]3.6765,[465]3.6681,[466]3.6596,[467]3.6504,[468]3.6435,[469]3.6346,[470]3.6274,[471]3.6192,[472]3.6116,[473]3.6049,[474]3.5978,[475]3.5906,[476]3.5834,[477]3.5800,[478]3.5788,[479]3.5735,[480]3.5676,[481]3.5622,[482]3.5558,[483]3.5482,[484]3.5399,[485]3.5362,[486]3.5299,[487]3.5255,[488]3.5242,[489]3.5169,[490]3.5089,[491]3.5027,[492]3.4950,[493]3.4898,[494]3.4846,[495]3.4782,[496]3.4741,[497]3.4668,[498]3.4591,[499]3.4557,[500]3.4489,[501]3.4410,[502]3.4359,[503]3.4306,[504]3.4231,[505]3.4184,[506]3.4189,[507]3.4168,[508]3.4165,[509]3.4189,[510]3.4212,[511]3.4254,[512]3.4295,[513]3.4328,[514]3.4375,[515]3.4333,[516]3.4341,[517]3.4344,[518]3.4343,[519]3.4366,[520]3.4383,[521]3.4399,[522]3.4420,[523]3.4431,[524]3.4478,[525]3.4504,[526]3.4515,[527]3.4526,[528]3.4502,[529]3.4523,[530]3.4518,[531]3.4534,[532]3.4586,[533]3.4624,[534]3.4618,[535]3.4633,[536]3.4607,[537]3.4606,[538]3.4616,[539]3.4626,[540]3.4626,[541]3.4602,[542]3.4614,[543]3.4636,[544]3.4660,[545]3.4664,[546]3.4683,[547]3.4676,[548]3.4661,[549]3.4672,[550]3.4669,[551]3.4664,[552]3.4669,[553]3.4661,[554]3.4661,[555]3.4660,[556]3.4669,[557]3.4682,[558]3.4674,[559]3.4689,[560]3.4656,[561]3.4671,[562]3.4666,[563]3.4673,[564]3.4720,[565]3.4745,[566]3.4763,[567]3.4754,[568]3.4776,[569]3.4783,[570]3.4813,[571]3.4831,[572]3.4845,[573]3.4864,[574]3.4851,[575]3.4846,[576]3.4854,[577]3.4852,[578]3.4850,[579]3.4854,[580]3.4837,[581]3.4831,[582]3.4845,[583]3.4869,[584]3.4894,[585]3.4869,[586]3.4847,[587]3.4865,[588]3.4902,[589]3.4949,[590]3.4980,[591]3.5007,[592]3.5012,[593]3.4948,[594]3.4906,[595]3.4851,[596]3.4863,[597]3.4850,[598]3.4815,[599]3.4825,[600]3.4827,[601]3.4779,[602]3.4753,[603]3.4750,[604]3.4737,[605]3.4732,[606]3.4733,[607]3.4733,[608]3.4728,[609]3.4737,[610]3.4760,[611]3.4758,[612]3.4776,[613]3.4766,[614]3.4747,[615]3.4722,[616]3.4749,[617]3.4733,[618]3.4719,[619]3.4706,[620]3.4649,[621]3.4628,[622]3.4585,[623]3.4587,[624]3.4598,[625]3.4611,[626]3.4618,[627]3.4638,[628]3.4650,[629]3.4657,[630]3.4680,[631]3.4705,[632]3.4742,[633]3.4743,[634]3.4769,[635]3.4769,[636]3.4751,[637]3.4701,[638]3.4666,[639]3.4616,[640]3.4558,[641]3.4519,[642]3.4488,[643]3.4436,[644]3.4391,[645]3.4348,[646]3.4321,[647]3.4272,[648]3.4254,[649]3.4251,[650]3.4270,[651]3.4297,[652]3.4299,[653]3.4331,[654]3.4314,[655]3.4313,
llama_print_timings:        load time = 57976.93 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 4176637.30 ms / 335360 tokens (   12.45 ms per token,    80.29 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 4242444.38 ms
ggml_metal_free: deallocating
@KerfuffleV2
Contributor

> Not sure why the estimated time is so off (~4 hours). It took just 1.2 hours

Anything like a random CPU usage spike or swapping while the first block is running will throw off the whole time estimate, since it is based only on the time taken by the first block. I've seen it happen from time to time.
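
For reference, the printed estimate is a straight extrapolation from that first pass: 21.47 s/pass × 655 chunks ≈ 14,063 s ≈ 3.9 h, exactly the "ETA 3 hours 54.40 minutes" line above, while the run actually finished in ~71 minutes (total time 4,242,444 ms). A minimal C++ sketch of that extrapolation (not the actual perplexity code, just the idea):

// Sketch of the first-chunk ETA logic (not the actual llama.cpp code):
// the estimate extrapolates from pass 1 alone, so any slowdown there
// (cold page cache, a CPU spike, swapping) inflates the whole ETA.
#include <cstdio>

int main() {
    const int    n_chunks = 655;    // chunk count from the log above
    const double t_first  = 21.47;  // seconds measured for the first pass
    const double eta_s    = t_first * n_chunks;  // later passes never correct it
    std::printf("ETA %.2f hours\n", eta_s / 3600.0);  // prints ~3.91
    return 0;
}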

@ikawrakow
Contributor Author

> Not sure why the estimated time is so off (~4 hours). It took just 1.2 hours

I'm observing this on a regular basis. The very first time you load a model, the time estimate is off by a sizable margin. If you stop the process after getting the time estimate, the estimate on the next run with the same model is fairly reliable.
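
This is consistent with a cold OS page cache: on the very first run the model file is still being read from disk while the first pass is timed, while on later runs it is already resident in memory. A hedged sketch of one way to pre-warm the cache before timing, assuming a POSIX system with posix_fadvise (not available on macOS) and a hypothetical model path:

// Hypothetical pre-warm helper, not part of llama.cpp: hint the kernel
// to fetch the whole model file into the page cache before timing starts.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

static bool prewarm_file(const char * path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    // offset 0 with length 0 means "through the end of the file"
    const int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return rc == 0;
}

int main() {
    if (!prewarm_file("models/70B/ggml-model-f16.gguf"))  // example path
        std::fprintf(stderr, "pre-warm failed\n");
    return 0;
}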

@cebtenzzre
Collaborator

cebtenzzre commented Sep 1, 2023

> @cebtenzzre Can you share details of your tests (GPU, CUDA settings)? Thanks.

Sorry, I missed this. I have a Tesla P40 24GB. I was comparing commit bac6699 ("PR") with commit 519c981 ("before"). I used ehartford/dolphin-llama2-7b because that's what I had on hand at the time. I compiled with make LLAMA_CUBLAS=1 and benchmarked with this command, which runs main three times so the cache is warm:

{ for i in 0 1 2; do CUDA_VISIBLE_DEVICES=0 ./main -n 128 -m dolphin-llama2-7b.q2_k.gguf -ngl 100 -mmq --ignore-eos -t 1; done } |& tail

I then wrote down the "eval time" t/s.
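
(Not part of the original comment: a small helper one might pipe those three runs through to average the eval throughput, assuming the llama_print_timings format shown earlier in this thread; the "prompt eval time" line is deliberately not matched.)

// Reads llama.cpp output on stdin and averages the eval-time throughput.
// Matches lines like:
//   llama_print_timings:        eval time = ... ( ... ms per token, 55.67 tokens per second)
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::regex re(R"(llama_print_timings:\s+eval time.*?([0-9.]+) tokens per second)");
    double sum = 0.0;
    int    n   = 0;
    std::string line;
    std::smatch m;
    while (std::getline(std::cin, line)) {
        if (std::regex_search(line, m, re)) { sum += std::stod(m[1]); ++n; }
    }
    if (n > 0) std::cout << "avg eval: " << (sum / n) << " t/s over " << n << " runs\n";
    return 0;
}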

I just re-quantized and re-tested to make sure, and for Q2_K I get 55.67 t/s before this PR and 33.32 t/s after, both within 0.2% of the numbers I previously provided.

Johannes has a few P40s, so he should be able to reproduce my results.
