The Billion Rows Project

A cross-language benchmark project for processing large datasets efficiently. This project provides a practical playground for learning new programming languages and optimization techniques through a common, real-world challenge: processing and analyzing CSV data at scale.

Project Goals

Language Exploration: Learn multiple programming languages by implementing the same task
Optimization Skills: Discover how to squeeze maximum performance from each language
Comparative Analysis: Understand each language's strengths and trade-offs for data processing
Performance at Scale: See how implementations behave across datasets from small to massive

The project emphasizes implementation quality over abstract language comparisons. Often, a well-optimized solution in a "slower" language can outperform a naive implementation in a "faster" language.

Implementations

Language	Status	Notes
C++	✅	Strong raw performance with manual memory management; excels at maps/aggregation
Carbon	🚧	Google's experimental C++ successor; implementation not started
Go	✅	Excellent for concurrent data processing; efficient string handling and map operations
Haskell	🚧	★ Elegant pattern matching; lazy evaluation suits streaming large data; currently fails to build
JavaScript	✅	Surprisingly fast with V8's JIT; hash map operations well-optimized
Lua	✅	Lightweight with efficient tables; less overhead than other scripting languages
Mojo	🚧	★ Not started; would leverage LLVM and vectorization for CSV number crunching
Python	✅	Strong standard library CSV parser; dictionary operations well-suited for aggregation
Rust	✅	Zero-cost abstractions with memory safety; exceptional CSV parsing and map performance
Zig	✅	First optimized implementation completed; competitive with scripting languages
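
The aggregation pattern the notes refer to can be sketched in a few lines of Python. This is a minimal illustration only: the two-column `key,value` layout, the `aggregate` function name, and the statistics tracked are assumptions made for the example, not the project's actual schema or code. It also assumes well-formed input.

```python
import csv
import io
from collections import defaultdict


def aggregate(lines):
    """Stream rows of a hypothetical `key,value` CSV; track count, sum, min, max per key."""
    # Each entry holds [count, sum, min, max] for one key.
    stats = defaultdict(lambda: [0, 0.0, float("inf"), float("-inf")])
    for key, raw_value in csv.reader(lines):
        value = float(raw_value)
        entry = stats[key]
        entry[0] += 1
        entry[1] += value
        entry[2] = min(entry[2], value)
        entry[3] = max(entry[3], value)
    return {
        key: {"count": c, "mean": s / c, "min": lo, "max": hi}
        for key, (c, s, lo, hi) in stats.items()
    }


# Tiny in-memory stand-in for a data file (assumed layout).
sample = io.StringIO("a,1.0\nb,4.0\na,3.0\n")
result = aggregate(sample)
print(result["a"])
```

Because `csv.reader` consumes any iterable of lines, the same function works on an open file handle, so memory use depends on the number of distinct keys rather than the number of rows.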

Benchmark Results

M2 Mac Pro 2023 (32GB RAM)

10,000 Row Dataset

Language	Time
C++	0.0006s
Rust	0.0016s
Go	0.0021s
Zig	0.0117s
Python	0.0139s
Lua	0.0169s
JavaScript	0.0278s

1,000,000 Row Dataset

Language	Time
C++	0.036s
Rust	0.073s
Go	0.184s
Python	0.671s
Zig	0.755s
JavaScript	0.936s
Lua	1.622s

1,000,000,000 Row Dataset (The Billion Rows Challenge)

Language	Time
C++	0.543s
Rust	0.757s
Go	2.652s
Python	9.668s
Zig	10.844s
JavaScript	13.532s
Lua	24.211s
Haskell	Build error
Mojo	Not created
Carbon	Not created
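
To make the gaps easier to compare, the times in the 1,000,000,000-row table can be turned into slowdown factors relative to the fastest entry. The snippet below is only a convenience sketch; the variable names are illustrative and the numbers are copied from that table.

```python
# Times in seconds, copied from the 1,000,000,000-row table.
times = {
    "C++": 0.543,
    "Rust": 0.757,
    "Go": 2.652,
    "Python": 9.668,
    "Zig": 10.844,
    "JavaScript": 13.532,
    "Lua": 24.211,
}

fastest = min(times.values())
# Slowdown factor of each language relative to the fastest time.
slowdown = {lang: seconds / fastest for lang, seconds in times.items()}

for lang, ratio in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{lang:<11}{ratio:6.1f}x")
```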

Analysis & Optimization

C++ remains the performance leader: Fastest across all dataset sizes
Rust consistently shows excellent performance: Close second to C++
Zig is now competitive with Python: 10.844s vs 9.668s on the billion-row dataset after optimization
Performance rankings are largely consistent across scales:
- Systems languages (C++, Rust, Go) hold the top three places at every dataset size
- The Zig implementation now performs in the mid-range
- Scripting languages (Python, JavaScript, Lua) trail the systems languages, with Lua slowest at the two larger sizes
Implementation quality matters enormously: Zig improved from >600s to 10.8s (a more than 55× speedup)
All completed implementations gracefully handle the malformed line (line 14413142)
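
One way to tolerate a malformed line without aborting the run is to skip it and record its line number. The sketch below is an assumption about how this could be done, not the approach any of the implementations actually takes; `parse_tolerant` and the two-field `key,value` layout are hypothetical.

```python
def parse_tolerant(lines, bad_lines):
    """Yield (key, value) pairs; append the 1-based number of each unparseable line to bad_lines."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(",")
        try:
            if len(fields) != 2:
                raise ValueError("expected exactly two fields")
            # float() raises ValueError on non-numeric text.
            yield fields[0], float(fields[1])
        except ValueError:
            bad_lines.append(line_number)


bad_lines = []
rows = list(parse_tolerant(["a,1.5\n", "garbage\n", "b,not-a-number\n", "c,2\n"], bad_lines))
print(rows)
print(bad_lines)
```

Writing it as a generator keeps the parser streaming, so good rows can be aggregated as they arrive while the bad line numbers are reported at the end.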

Learning Opportunities

This project provides several educational benefits:

CSV Processing Patterns: Learn efficient techniques for handling text data
Memory Management: Compare GC vs manual management across implementations
Algorithm Optimization: Profile and optimize hot spots in code
Language Idioms: Discover each language's natural approach to the same problem
Scaling Challenges: Handle increasingly large datasets efficiently
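
As an illustration of the "profile and optimize hot spots" point, here is a minimal sketch using Python's built-in `cProfile` and `pstats` modules. The workload (`parse_line`, `total_by_key`, and the synthetic input) is made up for the example and is not taken from the project.

```python
import cProfile
import io
import pstats


def parse_line(line):
    """Split one hypothetical `key,value` line."""
    key, raw_value = line.split(",")
    return key, float(raw_value)


def total_by_key(lines):
    """Sum values per key; parse_line is the expected hot spot."""
    totals = {}
    for line in lines:
        key, value = parse_line(line)
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Synthetic input: 50,000 lines spread over 10 keys.
lines = [f"k{i % 10},{i}.5" for i in range(50_000)]

profiler = cProfile.Profile()
profiler.enable()
totals = total_by_key(lines)
profiler.disable()

# Write the five entries with the highest cumulative time to a string buffer.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and cumulative time per function, which is usually enough to decide where optimization effort should go first.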

Usage

# Generate data and run all implementations
make

# Run with specific dataset size
make INPUT=ten-million
# Options: hundred, thousand, ten-thousand, hundred-thousand, million, ten-million

# Run all implementations (recommended)
make run INPUT=million

# Run using Lua script (for learning purposes)
make lua
# Note: The Bash script (make run) is generally more reliable

License

MIT License - See LICENSE file for details.
