adler32-go

command module

v0.0.0-...-f0f4298 Latest Latest Go to latest Published: Apr 10, 2025 License: Unlicense Imports: 7 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/jasonrandrews/adler32-go

Links

Open Source Insights

README ¶

Accelerating Adler32 Checksum with Arm NEON Instructions

This project demonstrates how to significantly improve the performance of Adler32 checksum calculations using Arm NEON SIMD (Single Instruction, Multiple Data) instructions. The implementation shows how to integrate optimized C code with Go applications to achieve substantial performance gains over the standard library implementation.

What is Adler32?

Adler32 is a checksum algorithm that was invented by Mark Adler in 1995. It's used in the zlib compression library and is faster than CRC32 but provides less reliable error detection. The algorithm works by calculating two 16-bit sums:

s1: A simple sum of all bytes
s2: A sum of all s1 values after each byte

The final checksum is (s2 << 16) | s1.

Project Overview

This project includes:

A standard Go implementation using the hash/adler32 package
A NEON-optimized C implementation that processes multiple bytes in parallel
CGO bindings to call the optimized C code from Go
Benchmarking code to compare performance between implementations

Implementation Details

Standard Go Implementation

The Go standard library provides an Adler32 implementation in the hash/adler32 package. This serves as our baseline for performance comparison.

NEON-Optimized Implementation

The NEON implementation leverages Arm's SIMD instructions to process 16 bytes at a time, significantly improving throughput. Key optimizations include:

Processing data in 16-byte chunks using NEON vector instructions
Handling blocks of 5552 bytes to avoid overflow (as per the Adler32 algorithm requirements)
Carefully managing the position-dependent nature of the s2 calculation
Falling back to scalar code for any remaining bytes

Integration with Go

The project uses CGO to call the optimized C code from Go:

// calculateNeonAdler32 calculates Adler32 using the C NEON implementation
func calculateNeonAdler32(data []byte) uint32 {
    cData := (*C.uint8_t)(unsafe.Pointer(&data[0]))
    cLen := C.size_t(len(data))
    return uint32(C.adler32_neon(cData, cLen))
}

Performance Results

Neoverse N1 Performance Results

The following performance results were measured on a system with a Neoverse N1 processor:

Implementation Comparison

The project includes three different implementations of Adler32:

Simple: A basic scalar implementation
Block: An optimized scalar implementation that processes data in blocks to avoid overflow
NEON: A SIMD-optimized implementation using Arm NEON instructions

Here are the performance results for each implementation:

Data Size	Go Time	Simple Time	Block Time	NEON Time	NEON Speedup vs Go
1 KB	715ns	5.433µs	1.053µs	3.389µs	0.21×
10 KB	6.862µs	44.755µs	9.209µs	2.057µs	3.34×
100 KB	68.965µs	330.981µs	89.59µs	35.299µs	1.95×
1 MB	704.285µs	3.349ms	917.258µs	368.478µs	1.91×
10 MB	7.054ms	33.495ms	9.176ms	3.706ms	1.90×

Compiler Comparison for NEON Implementation

The NEON implementation was compiled with both GCC and Clang to compare performance:

Data Size	Go Time	GCC Time	Clang Time	Clang Speedup vs Go	Clang vs GCC
1 KB	718ns	6.003µs	3.431µs	0.21×	1.75×
10 KB	10.972µs	3.049µs	1.536µs	4.45×	1.98×
100 KB	89.965µs	45.367µs	25.816µs	2.67×	1.76×
1 MB	705.664µs	367.438µs	276.053µs	2.55×	1.33×
10 MB	7.054ms	3.695ms	2.691ms	2.62×	1.37×

Note: Actual performance numbers may vary based on your specific ARM hardware. These results were measured on the Neoverse N1 test system.

Neoverse V2 Performance Results (April 2025)

We also tested the implementations on a Neoverse V2 processor, which showed different performance characteristics:

Simple Implementation (Scalar)

Data Size	Go Time	Simple Time	Speedup vs Go
1 KB	392ns	2.628µs	0.15×
10 KB	3.935µs	25.772µs	0.15×
100 KB	39.679µs	257.517µs	0.15×
1 MB	403.678µs	2.630887ms	0.15×
10 MB	4.025969ms	26.309335ms	0.15×

Block Implementation (Optimized Scalar)

Data Size	Go Time	Block Time	Speedup vs Go
1 KB	393ns	645ns	0.61×
10 KB	3.927µs	5.867µs	0.67×
100 KB	39.25µs	58.176µs	0.67×
1 MB	401.945µs	594.972µs	0.68×
10 MB	4.028003ms	5.957824ms	0.68×

NEON Implementation (Clang vs GCC)

Data Size	Go Time	Clang Time	GCC Time	GCC Speedup vs Go
1 KB	392ns	2.631µs	2.627µs	0.15×
10 KB	3.944µs	1.19µs	908ns	4.38×
100 KB	39.234µs	19.8µs	15.267µs	2.60×
1 MB	402.348µs	206.185µs	159.349µs	2.52×
10 MB	4.024462ms	2.079317ms	1.578822ms	2.55×

Key observations:

For small data sizes (1 KB), all C implementations are slower than Go due to function call overhead
The NEON implementation is significantly faster than both scalar implementations for data sizes ≥ 10 KB
On the original test system, Clang consistently produces faster code than GCC for the NEON implementation, with up to 98% better performance
On the Neoverse V2 system, GCC produces faster code than Clang for the NEON implementation, with approximately 30% better performance
The Block implementation is slower than Go on the Neoverse V2 system, unlike the results on the original test system
The optimized implementations show the greatest advantage at the 10 KB data size

Building and Running

Prerequisites

Go 1.15 or later
GCC or Clang compiler (Clang is used by default as it produces faster code)
Arm processor with NEON support

Build Instructions

Clone the repository
Build the shared library:
```
make compile
```
Run the benchmark:
```
make run
```

Makefile Targets

The project includes a flexible Makefile that allows you to easily switch between different implementations:

make compile - Builds the library using the default implementation (adler32_neon.c)
make run - Runs the benchmark using the compiled library
make all - Combines compile and run targets

You can specify which implementation to use:

make simple - Builds using the simple scalar implementation (adler32_simple.c)
make block - Builds using the block-based implementation (adler32_block.c)
make neon - Builds using the NEON-optimized implementation (adler32_neon.c)

You can also specify the implementation directly:

make IMPL=adler32_simple.c compile

To build and run with a specific implementation in one command:

make IMPL=adler32_block.c all

or

make simple run

How It Works

The NEON implementation works by:

Processing 16 bytes at a time using NEON vector registers
Calculating s1 and s2 in parallel for these 16 bytes
Handling the position-dependent nature of s2 calculation with vector multiplications
Processing data in blocks of 5552 bytes to prevent overflow
Combining the results and applying the modulo operation

Conclusion

This project demonstrates how leveraging Arm NEON instructions can significantly improve the performance of checksum calculations. The techniques shown here can be applied to other algorithms that process data sequentially and can benefit from SIMD parallelism.

By using hardware-specific optimizations while maintaining a clean Go API, we can achieve the best of both worlds: the development experience of Go with the performance of optimized C code.

License

This project is released into the public domain under the Unlicense.

This means you can copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means, without any attribution or restrictions.

Documentation ¶

Overview ¶

Package main provides benchmarking and testing for Adler32 checksum implementations.

This package demonstrates how to integrate optimized C code with Go applications to achieve substantial performance gains over the standard library implementation. It compares the performance of the standard Go Adler32 implementation with an optimized implementation using Arm NEON SIMD instructions.

Package main provides a Go wrapper for the optimized Adler-32 implementation.

This file contains the CGO bindings to call the optimized C implementation of the Adler-32 checksum algorithm using Arm NEON instructions.

Source Files ¶

View all Source files