@klauspost
Contributor

@klauspost klauspost commented Apr 9, 2020

Improve inflate decompression speed, mainly through 3 optimizations:

  1. Read further ahead on non-final blocks.

The reader guarantees that it will not read beyond the end of the stream.
This limits how far ahead the decoder can read. The limit is set to the
size of an end-of-block marker in f.h1.min = f.bits[endBlockMarker].

We can however take advantage of the fact that each block gives
information on whether it is the final block on a stream.
So if we are not reading the final block we can safely add the size
of the smallest block possible with nothing but an EOB marker.

That is a block with a predefined table and a single EOB marker.
Since we know the size of the block header and the encoding
of the EOB this totals to 10 additional bits.
Adding 10 bits reduces the number of stream reads significantly.

Approximately 5% throughput increase.
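
The 10-bit figure can be checked against the DEFLATE format (RFC 1951). The sketch below is only an illustration of that arithmetic, not code from this PR; `smallestBlockBits` and `safeReadAhead` are hypothetical names.

```go
package main

import "fmt"

// Bit sizes as defined by RFC 1951 (DEFLATE).
const (
	bfinalBits   = 1 // BFINAL: set on the last block of the stream
	btypeBits    = 2 // BTYPE: selects stored, fixed or dynamic Huffman coding
	fixedEOBBits = 7 // symbol 256 (end of block) is a 7-bit code in the fixed table
)

// smallestBlockBits returns the size in bits of the smallest possible block:
// a fixed-Huffman block that contains nothing but an end-of-block marker.
func smallestBlockBits() int {
	return bfinalBits + btypeBits + fixedEOBBits
}

// safeReadAhead is a hypothetical helper. Given the size of the current
// block's end-of-block code, it returns how many bits are guaranteed to
// remain in the stream. A non-final block must be followed by at least
// one more block, so the smallest block size can be added.
func safeReadAhead(eobBits int, final bool) int {
	if final {
		return eobBits
	}
	return eobBits + smallestBlockBits()
}

func main() {
	fmt.Println(smallestBlockBits())     // 10
	fmt.Println(safeReadAhead(7, true))  // 7
	fmt.Println(safeReadAhead(7, false)) // 17
}
```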

  2. Manually inline the f.huffSym call

This change by itself gives about a 13% throughput increase.

  3. Generate decoders for stdlib io.ByteReader types

We generate decoders for the known implementations of io.ByteReader,
namely *bytes.Buffer, *bytes.Reader, *bufio.Reader and *strings.Reader.

This change by itself gives about 20-25% throughput increase,
including when an io.Reader is passed.

I would say only *strings.Reader probably isn't that common.

Minor changes:

  • Reuse h.chunks and h.links.
  • Trade some bounds checks for AND operations.
  • Change chunks from uint32 to uint16.
  • Avoid padding of decompressor struct members.
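
The bounds-check trade can be sketched as follows. This is a generic illustration rather than the PR's code; `tableBits`, `tableSize` and `lookup` are hypothetical names. Masking the index with a constant that is one less than a power-of-two array length lets the Go compiler prove the index is in range, so it can usually drop the bounds check.

```go
package main

import "fmt"

const (
	tableBits = 9
	tableSize = 1 << tableBits // 512 entries, a power of two
)

// lookup indexes a fixed-size table of uint16 chunks. The AND with
// tableSize-1 keeps the index within [0, tableSize), replacing a
// compare-and-branch bounds check with a single AND instruction.
func lookup(table *[tableSize]uint16, bits uint32) uint16 {
	return table[bits&(tableSize-1)]
}

func main() {
	var table [tableSize]uint16
	table[5] = 42
	// Bits above the low 9 are ignored by the mask.
	fmt.Println(lookup(&table, 5+tableSize)) // 42
}
```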

Per loop allocation removed from benchmarks.
The 'old' numbers in the benchmark below include this change.

name                              old time/op    new time/op    delta
Decode/Digits/Huffman/1e4-32        63.8µs ± 0%    41.3µs ± 0%   -35.22%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32         625µs ± 0%     404µs ± 0%   -35.31%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32        6.25ms ± 0%    4.02ms ± 1%   -35.64%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32          72.1µs ± 1%    48.0µs ± 0%   -33.36%   (p=0.000 n=10+9)
Decode/Digits/Speed/1e5-32           792µs ± 1%     578µs ± 1%   -27.04%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32          8.09ms ± 0%    5.85ms ± 0%   -27.68%   (p=0.000 n=9+10)
Decode/Digits/Default/1e4-32        74.1µs ± 1%    49.7µs ± 1%   -32.87%   (p=0.000 n=10+9)
Decode/Digits/Default/1e5-32         775µs ± 1%     579µs ± 0%   -25.35%  (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32        7.84ms ± 1%    5.84ms ± 1%   -25.59%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32    74.1µs ± 0%    49.8µs ± 0%   -32.83%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32     777µs ± 1%     579µs ± 0%   -25.47%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32    7.83ms ± 1%    5.83ms ± 0%   -25.59%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32        72.9µs ± 0%    45.6µs ± 1%   -37.48%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32         712µs ± 1%     471µs ± 1%   -33.92%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32        7.11ms ± 0%    4.70ms ± 1%   -33.98%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32          67.0µs ± 1%    45.4µs ± 1%   -32.19%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32           616µs ± 1%     447µs ± 0%   -27.49%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32          6.17ms ± 0%    4.50ms ± 0%   -26.98%  (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32        60.7µs ± 0%    39.6µs ± 0%   -34.84%  (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32         492µs ± 0%     360µs ± 0%   -26.84%  (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32        4.87ms ± 1%    3.59ms ± 0%   -26.34%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32    60.8µs ± 1%    39.6µs ± 1%   -34.92%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32     491µs ± 1%     357µs ± 1%   -27.23%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32    4.84ms ± 0%    3.58ms ± 0%   -26.17%  (p=0.000 n=10+10)

name                              old speed      new speed      delta
Decode/Digits/Huffman/1e4-32       157MB/s ± 0%   242MB/s ± 0%   +54.37%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32       160MB/s ± 0%   247MB/s ± 0%   +54.58%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32       160MB/s ± 0%   249MB/s ± 1%   +55.39%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32         139MB/s ± 1%   208MB/s ± 0%   +50.06%   (p=0.000 n=10+9)
Decode/Digits/Speed/1e5-32         126MB/s ± 1%   173MB/s ± 1%   +37.05%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32         124MB/s ± 0%   171MB/s ± 0%   +38.28%   (p=0.000 n=9+10)
Decode/Digits/Default/1e4-32       135MB/s ± 1%   201MB/s ± 1%   +48.95%   (p=0.000 n=10+9)
Decode/Digits/Default/1e5-32       129MB/s ± 1%   173MB/s ± 0%   +33.95%  (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32       127MB/s ± 1%   171MB/s ± 1%   +34.39%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32   135MB/s ± 0%   201MB/s ± 0%   +48.88%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32   129MB/s ± 1%   173MB/s ± 0%   +34.17%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32   128MB/s ± 1%   172MB/s ± 0%   +34.39%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32       137MB/s ± 0%   219MB/s ± 1%   +59.96%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32       140MB/s ± 1%   212MB/s ± 1%   +51.32%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32       141MB/s ± 0%   213MB/s ± 1%   +51.46%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32         149MB/s ± 1%   220MB/s ± 1%   +47.48%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32         162MB/s ± 1%   224MB/s ± 0%   +37.92%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32         162MB/s ± 0%   222MB/s ± 0%   +36.95%  (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32       165MB/s ± 0%   253MB/s ± 0%   +53.47%  (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32       203MB/s ± 0%   278MB/s ± 0%   +36.68%  (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32       205MB/s ± 1%   279MB/s ± 0%   +35.77%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32   164MB/s ± 1%   253MB/s ± 1%   +53.66%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32   204MB/s ± 1%   280MB/s ± 1%   +37.41%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32   206MB/s ± 0%   280MB/s ± 0%   +35.44%  (p=0.000 n=10+10)

name                              old alloc/op   new alloc/op   delta
Decode/Digits/Huffman/1e4-32        0.00B ±NaN%    16.00B ± 0%     +Inf%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32         6.00B ± 0%    36.00B ± 0%  +500.00%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32         64.0B ± 0%    304.0B ± 0%  +375.00%   (p=0.000 n=10+9)
Decode/Digits/Speed/1e4-32           80.0B ± 0%     16.0B ± 0%   -80.00%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e5-32            296B ± 0%       39B ± 0%   -86.82%   (p=0.000 n=10+8)
Decode/Digits/Speed/1e6-32          3.78kB ± 0%    0.33kB ± 0%   -91.29%  (p=0.000 n=10+10)
Decode/Digits/Default/1e4-32         40.0B ± 0%     16.0B ± 0%   -60.00%  (p=0.000 n=10+10)
Decode/Digits/Default/1e5-32          287B ± 0%       54B ± 1%   -81.04%  (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32        4.15kB ± 0%    0.44kB ± 0%   -89.43%   (p=0.000 n=10+9)
Decode/Digits/Compression/1e4-32     40.0B ± 0%     16.0B ± 0%   -60.00%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32      288B ± 0%       55B ± 1%   -81.01%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32    4.15kB ± 0%    0.44kB ± 0%   -89.43%   (p=0.000 n=10+8)
Decode/Newton/Huffman/1e4-32          705B ± 0%       16B ± 0%   -97.73%   (p=0.000 n=9+10)
Decode/Newton/Huffman/1e5-32        4.49kB ± 0%    0.04kB ± 0%   -99.15%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32        39.4kB ± 0%     0.3kB ± 0%   -99.18%   (p=0.000 n=9+10)
Decode/Newton/Speed/1e4-32            617B ± 0%       16B ± 0%   -97.41%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32          3.19kB ± 0%    0.04kB ± 0%   -98.84%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32          40.5kB ± 0%     0.3kB ± 0%   -99.15%  (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32          513B ± 0%       16B ± 0%   -96.88%  (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32        2.35kB ± 0%    0.04kB ± 0%   -98.47%  (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32        21.1kB ± 0%     0.3kB ± 0%   -98.80%    (p=0.000 n=8+8)
Decode/Newton/Compression/1e4-32      513B ± 0%       16B ± 0%   -96.88%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32    2.35kB ± 0%    0.04kB ± 0%   -98.47%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32    22.9kB ± 0%     0.2kB ± 0%   -98.92%    (p=0.000 n=8+8)

name                              old allocs/op  new allocs/op  delta
Decode/Digits/Huffman/1e4-32         0.00 ±NaN%      1.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32         0.00 ±NaN%      2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32         0.00 ±NaN%     16.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32            3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e5-32            6.00 ± 0%      2.00 ± 0%   -66.67%  (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32            68.0 ± 0%      16.0 ± 0%   -76.47%  (p=0.000 n=10+10)
Decode/Digits/Default/1e4-32          2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.000 n=10+10)
Decode/Digits/Default/1e5-32          8.00 ± 0%      3.00 ± 0%   -62.50%  (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32          74.0 ± 0%      23.0 ± 0%   -68.92%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32      2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32      8.00 ± 0%      3.00 ± 0%   -62.50%  (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32      74.0 ± 0%      23.0 ± 0%   -68.92%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32          9.00 ± 0%      1.00 ± 0%   -88.89%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32          18.0 ± 0%       2.0 ± 0%   -88.89%  (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32           156 ± 0%        16 ± 0%   -89.74%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32            13.0 ± 0%       1.0 ± 0%   -92.31%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32            26.0 ± 0%       2.0 ± 0%   -92.31%  (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32             223 ± 0%        16 ± 0%   -92.83%  (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32          10.0 ± 0%       1.0 ± 0%   -90.00%  (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32          27.0 ± 0%       2.0 ± 0%   -92.59%  (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32           153 ± 0%        12 ± 0%   -92.16%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32      10.0 ± 0%       1.0 ± 0%   -90.00%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32      27.0 ± 0%       2.0 ± 0%   -92.59%  (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32       145 ± 0%        12 ± 0%   -91.72%  (p=0.000 n=10+10)
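
The time/op and speed deltas in the tables above describe the same measurement from two sides. The sketch below shows the conversion; `speedDelta` is a hypothetical helper, and benchstat computes its deltas from unrounded means, so small differences from the printed figures are expected.

```go
package main

import "fmt"

// speedDelta converts a relative change in time/op into the equivalent
// relative change in throughput. Throughput is inversely proportional
// to time, so new/old speed = 1 / (new/old time).
func speedDelta(timeDelta float64) float64 {
	return 1/(1+timeDelta) - 1
}

func main() {
	// Decode/Digits/Huffman/1e4: -35.22% time/op.
	fmt.Printf("%+.2f%%\n", 100*speedDelta(-0.3522)) // about +54.37%
}
```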

These changes have been included in github.com/klauspost/compress
for a little more than a month now, which includes fuzz testing.

Change-Id: I7e346330512116baa27e448aa606a2f4e551054c

@googlebot googlebot added the cla: yes label Apr 9, 2020
@gopherbot
Contributor

This PR (HEAD: 6180f3c) has been imported to Gerrit for code review.

Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.

Tip: You can toggle comments from me using the comments slash command (e.g. /comments off)
See the Wiki page for more info

@mertakman

Why is this still not merged?

@ianlancetaylor
Contributor

@volknanebo We don't use GitHub for code review. If you want to make a comment, please make it at https://golang.org/cl/227737. Thanks.

@heschi heschi closed this Dec 15, 2021
@klauspost
Contributor Author

@heschi What happened here?

@heschi
Contributor

heschi commented Dec 16, 2021

I closed old PRs to reduce load on the Gerrit importer (#50197), sorry for the trouble. I'll reopen the CL and PR.

@heschi heschi reopened this Dec 16, 2021
@gopherbot
Contributor

Message from Ian Lance Taylor:

Patch Set 2:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

Message from Klaus Post:

Patch Set 2:

(1 comment)



@gopherbot
Contributor

This PR (HEAD: c00babd) has been imported to Gerrit for code review.

Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.


@gopherbot
Contributor

This PR (HEAD: ae9b62a) has been imported to Gerrit for code review.

Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.


@gopherbot
Contributor

Message from Klaus Post:

Patch Set 5:

(2 comments)



@gopherbot
Contributor

This PR (HEAD: 161f021) has been imported to Gerrit for code review.

Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.


@greatroar

> I would say only *strings.Reader probably isn't that common.

Syncthing uses that. It keeps compressed web assets in strings to ensure they're in the RODATA section and can decompress them for HTTP clients without gzip support.
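
A minimal sketch of that usage pattern, assuming raw DEFLATE data held in a Go string. This is not Syncthing's code; `compressToString` and `decompressString` are hypothetical helpers, and a real program would embed the compressed string at build time instead of producing it at run time.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"strings"
)

// compressToString deflates data and returns the result as a string,
// standing in for an asset that was compressed ahead of time.
func compressToString(data []byte) (string, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestCompression)
	if err != nil {
		return "", err
	}
	if _, err := w.Write(data); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// decompressString inflates s by reading directly from a *strings.Reader,
// so the compressed data is never copied into a []byte first.
func decompressString(s string) ([]byte, error) {
	r := flate.NewReader(strings.NewReader(s))
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	asset, err := compressToString([]byte("hello hello hello"))
	if err != nil {
		panic(err)
	}
	out, err := decompressString(asset)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```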

@klauspost
Contributor Author

Ping @ianlancetaylor - if there is interest for this in 1.20 it would be good to get started on CR.

@gopherbot
Contributor

Message from Ian Lance Taylor:

Patch Set 6:

(1 comment)



@gopherbot
Contributor

Message from Ian Lance Taylor:

Patch Set 6:

(1 comment)



@gopherbot
Contributor

Message from Joseph Tsai:

Patch Set 6:

(1 comment)



@gopherbot
Contributor

Message from Joseph Tsai:

Patch Set 6:

(1 comment)



@gopherbot
Contributor

Message from Joseph Tsai:

Patch Set 6:

(1 comment)



Improve decompression speed, mainly through 3 optimizations:

1) Take advantage of the fact that we can read further ahead when we know the current block isn't the last.

The reader guarantees that it will not read beyond the end of the stream.
This limits how far ahead the decoder can read. The limit is set to the size of an end-of-block marker in `f.h1.min = f.bits[endBlockMarker]`.

We can however take advantage of the fact that each block gives information on whether it is the final block on a stream. So if we are not reading the final block we can safely add the size of the smallest block possible with nothing but an EOB marker.

That is a block with a predefined table and a single EOB marker. Since we know the size of the block header and the encoding of the EOB this totals to 10 additional bits. Adding 10 bits reduces the number of stream reads significantly.

Approximately 5% throughput increase.

2) Manually inline f.huffSym call

This change by itself gives about a 13% throughput increase.

3) Generate decoders for stdlib io.ByteReader types

We generate decoders for the known implementations of `io.ByteReader`, namely `*bytes.Buffer`, `*bytes.Reader`, `*bufio.Reader` and `*strings.Reader`.

This change by itself gives about 20-25% throughput increase, including when an `io.Reader` is passed.

I would say only `*strings.Reader` probably isn't that common.

Minor changes:

* Reuse `h.chunks` and `h.links`.
* Trade some bounds checks for AND operations.
* Change chunks from uint32 to uint16.
* Avoid padding of decompressor struct members.

Per loop allocation removed from benchmarks. The 'old' numbers in the benchmark below include this change.

```
name                              old time/op    new time/op    delta
Decode/Digits/Huffman/1e4-32        78.0µs ± 0%    50.5µs ± 1%   -35.26%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32         779µs ± 2%     487µs ± 0%   -37.48%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32        7.68ms ± 0%    4.88ms ± 1%   -36.44%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32          88.5µs ± 1%    59.9µs ± 1%   -32.33%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32           963µs ± 1%     678µs ± 1%   -29.58%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32          9.75ms ± 1%    6.90ms ± 0%   -29.21%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32        91.2µs ± 1%    61.4µs ± 0%   -32.72%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32         954µs ± 0%     675µs ± 0%   -29.25%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32        9.67ms ± 0%    6.79ms ± 1%   -29.76%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32    90.7µs ± 1%    61.5µs ± 1%   -32.21%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32     953µs ± 1%     672µs ± 0%   -29.46%  (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32    9.76ms ± 4%    6.78ms ± 0%   -30.54%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32        90.4µs ± 0%    54.7µs ± 0%   -39.52%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32         885µs ± 0%     538µs ± 0%   -39.19%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32        8.84ms ± 0%    5.44ms ± 0%   -38.46%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32          81.5µs ± 0%    55.1µs ± 1%   -32.42%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32           751µs ± 4%     528µs ± 0%   -29.70%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32          7.49ms ± 2%    5.32ms ± 0%   -28.92%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32        73.3µs ± 1%    48.9µs ± 1%   -33.36%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32         601µs ± 2%     418µs ± 0%   -30.40%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32        5.92ms ± 0%    4.17ms ± 0%   -29.60%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32    72.7µs ± 0%    48.5µs ± 0%   -33.21%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32     597µs ± 0%     418µs ± 0%   -29.90%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32    5.90ms ± 0%    4.15ms ± 0%   -29.63%  (p=0.016 n=4+5)

name                              old speed      new speed      delta
Decode/Digits/Huffman/1e4-32       128MB/s ± 0%   198MB/s ± 1%   +54.46%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32       128MB/s ± 2%   205MB/s ± 0%   +59.92%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32       130MB/s ± 0%   205MB/s ± 1%   +57.33%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32         113MB/s ± 1%   167MB/s ± 1%   +47.79%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32         104MB/s ± 1%   147MB/s ± 1%   +42.01%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32         103MB/s ± 1%   145MB/s ± 0%   +41.26%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32       110MB/s ± 1%   163MB/s ± 0%   +48.63%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32       105MB/s ± 0%   148MB/s ± 0%   +41.34%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32       103MB/s ± 0%   147MB/s ± 1%   +42.37%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32   110MB/s ± 1%   163MB/s ± 1%   +47.51%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32   105MB/s ± 1%   149MB/s ± 0%   +41.77%  (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32   102MB/s ± 4%   147MB/s ± 0%   +43.91%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32       111MB/s ± 0%   183MB/s ± 0%   +65.35%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32       113MB/s ± 0%   186MB/s ± 0%   +64.44%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32       113MB/s ± 0%   184MB/s ± 0%   +62.50%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32         123MB/s ± 0%   182MB/s ± 1%   +47.98%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32         133MB/s ± 4%   189MB/s ± 0%   +42.20%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32         134MB/s ± 2%   188MB/s ± 0%   +40.67%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32       136MB/s ± 1%   205MB/s ± 1%   +50.05%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32       166MB/s ± 2%   239MB/s ± 0%   +43.67%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32       169MB/s ± 0%   240MB/s ± 0%   +42.04%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32   138MB/s ± 0%   206MB/s ± 0%   +49.73%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32   168MB/s ± 0%   239MB/s ± 0%   +42.66%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32   170MB/s ± 0%   241MB/s ± 0%   +42.11%  (p=0.016 n=4+5)

name                              old alloc/op   new alloc/op   delta
Decode/Digits/Huffman/1e4-32        0.00B ±NaN%    16.00B ± 0%     +Inf%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32         7.60B ± 8%    32.00B ± 0%  +321.05%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32         79.6B ± 1%    264.0B ± 0%  +231.66%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32           80.0B ± 0%     16.0B ± 0%   -80.00%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32            297B ± 0%       33B ± 0%      ~     (p=0.079 n=4+5)
Decode/Digits/Speed/1e6-32          3.86kB ± 0%    0.27kB ± 0%   -92.98%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32         48.0B ± 0%     16.0B ± 0%   -66.67%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32          297B ± 0%       49B ± 0%   -83.50%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32        4.28kB ± 0%    0.38kB ± 0%      ~     (p=0.079 n=4+5)
Decode/Digits/Compression/1e4-32     48.0B ± 0%     16.0B ± 0%   -66.67%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32      297B ± 0%       49B ± 0%      ~     (p=0.079 n=4+5)
Decode/Digits/Compression/1e6-32    4.28kB ± 0%    0.38kB ± 0%   -91.09%  (p=0.000 n=4+5)
Decode/Newton/Huffman/1e4-32          705B ± 0%       16B ± 0%   -97.73%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32        4.50kB ± 0%    0.03kB ± 0%   -99.27%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32        39.4kB ± 0%     0.3kB ± 0%   -99.29%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32            625B ± 0%       16B ± 0%   -97.44%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32          3.21kB ± 0%    0.03kB ± 0%   -98.97%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32          40.6kB ± 0%     0.3kB ± 0%   -99.25%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32          513B ± 0%       16B ± 0%   -96.88%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32        2.37kB ± 0%    0.03kB ± 0%   -98.61%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32        21.2kB ± 0%     0.2kB ± 0%   -98.97%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32      513B ± 0%       16B ± 0%   -96.88%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32    2.37kB ± 0%    0.03kB ± 0%   -98.61%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32    23.0kB ± 0%     0.2kB ± 0%   -99.07%  (p=0.008 n=5+5)

name                              old allocs/op  new allocs/op  delta
Decode/Digits/Huffman/1e4-32         0.00 ±NaN%      1.00 ± 0%     +Inf%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32         0.00 ±NaN%      2.00 ± 0%     +Inf%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32         0.00 ±NaN%     16.00 ± 0%     +Inf%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32            3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32            6.00 ± 0%      2.00 ± 0%   -66.67%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32            68.0 ± 0%      16.0 ± 0%   -76.47%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32          2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32          8.00 ± 0%      3.00 ± 0%   -62.50%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32          74.0 ± 0%      23.0 ± 0%   -68.92%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32      2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32      8.00 ± 0%      3.00 ± 0%   -62.50%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e6-32      74.0 ± 0%      23.0 ± 0%   -68.92%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32          9.00 ± 0%      1.00 ± 0%   -88.89%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32          18.0 ± 0%       2.0 ± 0%   -88.89%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32           156 ± 0%        16 ± 0%   -89.74%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32            13.0 ± 0%       1.0 ± 0%   -92.31%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32            26.0 ± 0%       2.0 ± 0%   -92.31%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32             223 ± 0%        16 ± 0%   -92.83%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32          10.0 ± 0%       1.0 ± 0%   -90.00%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32          27.0 ± 0%       2.0 ± 0%   -92.59%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32           153 ± 0%        12 ± 0%   -92.16%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32      10.0 ± 0%       1.0 ± 0%   -90.00%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32      27.0 ± 0%       2.0 ± 0%   -92.59%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32       145 ± 0%        12 ± 0%   -91.72%  (p=0.008 n=5+5)
```

These changes have been included in https://github.com/klauspost/compress for a little more than a month now, which includes fuzz testing.

Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
* Inline moreBits
* Put values on stack.
* Also generate the fallback.

Change-Id: I64d03424438ebc5dbacd4f364e3e6d3c4936a008
Change-Id: If11b81d2de23a2588f3d4c7baa088ed5d614de70
Change-Id: Ibe8034438ac4a7fe53d686e39154dbe869f864a2
@klauspost klauspost force-pushed the inflate-improve-speed branch from 161f021 to 3f1778f Compare October 23, 2025 10:31
@gopherbot
Contributor

This PR (HEAD: 3f1778f) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/227737.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.
@gopherbot
Contributor

Message from Klaus Post:

Patch Set 7:

(1 comment)



@gopherbot
Contributor

Message from Klaus Post:

Patch Set 7:

(2 comments)



@gopherbot
Contributor

Message from AHMAD ابو وليد:

Patch Set 7:

(1 comment)


