Billion Rows - Doing something a billion times in all programming languages

cloudwalksolutions/billion-rows

The Billion Rows Project

A cross-language benchmark project for processing large datasets efficiently. This project provides a practical playground for learning new programming languages and optimization techniques through a common, real-world challenge: processing and analyzing CSV data at scale.

Project Goals

  • Language Exploration: Learn multiple programming languages by implementing the same task
  • Optimization Skills: Discover how to squeeze maximum performance from each language
  • Comparative Analysis: Understand each language's strengths and trade-offs for data processing
  • Performance at Scale: See how implementations behave across datasets from small to massive

The project emphasizes implementation quality over abstract language comparisons. Often, a well-optimized solution in a "slower" language can outperform a naive implementation in a "faster" language.

Implementations

| Language | Status | Notes |
|---|---|---|
| C++ | ✅ | Strong raw performance with manual memory management; excels at maps/aggregation |
| Carbon | 🚧 | Google's experimental C++ successor; implementation not started |
| Go | ✅ | Excellent for concurrent data processing; efficient string handling and map operations |
| Haskell | 🚧 ★ | Elegant pattern matching and lazy evaluation good for streaming large data |
| JavaScript | ✅ | Surprisingly fast with V8's JIT; hash map operations well-optimized |
| Lua | ✅ | Lightweight with efficient tables; less overhead than other scripting languages |
| Mojo | 🚧 ★ | Not started; would leverage LLVM and vectorization for CSV number crunching |
| Python | ✅ | Strong standard library CSV parser; dictionary operations well-suited for aggregation |
| Rust | ✅ | Zero-cost abstractions with memory safety; exceptional CSV parsing and map performance |
| Zig | ✅ | First optimized implementation completed; competitive with scripting languages |
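
The notes above keep returning to two operations: parsing CSV rows and aggregating them in a map. As a rough illustration of that pattern, here is a minimal Python sketch. The two-column `key,value` layout, the `aggregate` function name, and the sample data are assumptions made for this example; the project's actual dataset format and per-language implementations may differ.

```python
import csv
import io
from collections import defaultdict


def aggregate(lines):
    """Sum and count the numeric value column per key.

    `lines` is any iterable of text lines (an open file works), so the
    input is streamed row by row rather than loaded into memory at once.
    Assumes a hypothetical two-column `key,value` layout.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in csv.reader(lines):
        key, value = row[0], float(row[1])
        entry = totals[key]
        entry[0] += value
        entry[1] += 1
    return {key: (total, count) for key, (total, count) in totals.items()}


# Hypothetical sample data standing in for a generated dataset file.
sample = io.StringIO("a,1.5\nb,2.0\na,0.5\n")
result = aggregate(sample)
print(result)
```

Passing an open file object instead of the `io.StringIO` sample would stream a dataset from disk with the same function.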

Benchmark Results

M2 Mac Pro 2023 (32GB RAM)

10,000 Row Dataset

| Language | Time |
|---|---|
| C++ | 0.0006s |
| Rust | 0.0016s |
| Go | 0.0021s |
| Zig | 0.0117s |
| Python | 0.0139s |
| Lua | 0.0169s |
| JavaScript | 0.0278s |

1,000,000 Row Dataset

| Language | Time |
|---|---|
| C++ | 0.036s |
| Rust | 0.073s |
| Go | 0.184s |
| Python | 0.671s |
| Zig | 0.755s |
| JavaScript | 0.936s |
| Lua | 1.622s |

1,000,000,000 Row Dataset (The Billion Rows Challenge)

| Language | Time |
|---|---|
| C++ | 0.543s |
| Rust | 0.757s |
| Go | 2.652s |
| Python | 9.668s |
| Zig | 10.844s |
| JavaScript | 13.532s |
| Lua | 24.211s |
| Haskell | Build error |
| Mojo | Not created |
| Carbon | Not created |
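
To compare how the ordering holds up across dataset sizes, the timings from the three tables above can be ranked programmatically. The sketch below copies those numbers verbatim; the `results` dictionary and the `ranking` helper are names chosen for this example and are not part of the repository.

```python
# Timings in seconds, copied from the three result tables above.
results = {
    "10,000": {
        "C++": 0.0006, "Rust": 0.0016, "Go": 0.0021, "Zig": 0.0117,
        "Python": 0.0139, "Lua": 0.0169, "JavaScript": 0.0278,
    },
    "1,000,000": {
        "C++": 0.036, "Rust": 0.073, "Go": 0.184, "JavaScript": 0.936,
        "Zig": 0.755, "Python": 0.671, "Lua": 1.622,
    },
    "1,000,000,000": {
        "C++": 0.543, "Rust": 0.757, "Go": 2.652, "Python": 9.668,
        "Zig": 10.844, "JavaScript": 13.532, "Lua": 24.211,
    },
}


def ranking(timings):
    """Return the languages ordered from fastest to slowest."""
    return sorted(timings, key=timings.get)


rankings = {size: ranking(timings) for size, timings in results.items()}
for size, order in rankings.items():
    print(f"{size:>13} rows: {' < '.join(order)}")

# Languages that hold the same position in every table.
stable = [
    lang
    for lang in results["10,000"]
    if len({order.index(lang) for order in rankings.values()}) == 1
]
print("Same rank at every size:", stable)
```

On these numbers, only the three systems languages keep the same rank at every size; Zig, Python, JavaScript, and Lua trade places between the 10,000-row run and the two larger runs.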

Analysis & Optimization

  • C++ remains the performance leader: fastest at every dataset size
  • Rust consistently shows excellent performance: a close second to C++
  • Zig is now competitive with Python: our optimization efforts paid off dramatically
  • Performance tiers remain consistent across scales:
    • Systems languages (C++, Rust, Go) hold the top three places at every size
    • The Zig implementation now performs in the mid-range, next to Python
    • Scripting languages (Python, JavaScript, Lua) trail the systems languages, as expected; only their exact order shifts between the 10,000-row run and the larger ones
  • Implementation quality matters enormously: Zig improved from >600s to 10.8s (a speedup of more than 55×)
  • All benchmarked implementations handle the malformed line at line 14413142 gracefully
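
The last point, tolerating a malformed line in the input, could look something like the following sketch. It assumes a hypothetical two-column `key,value` layout and records the offending line numbers rather than aborting; the function name `aggregate_tolerant` and the sample data are invented for illustration and do not reflect how any particular implementation in the repository does it.

```python
import io


def aggregate_tolerant(lines):
    """Sum values per key, skipping lines that do not parse.

    Returns (totals, bad_lines), where bad_lines holds the 1-based
    numbers of the skipped lines so they can be reported afterwards.
    """
    totals = {}
    bad_lines = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split(",")
        try:
            if len(parts) != 2:
                raise ValueError("expected exactly two fields")
            key, value = parts[0], float(parts[1])
        except ValueError:
            # Record the line number and keep going instead of crashing.
            bad_lines.append(lineno)
            continue
        totals[key] = totals.get(key, 0.0) + value
    return totals, bad_lines


# Hypothetical input with two malformed lines (line 2 and line 4).
sample = io.StringIO("a,1.0\nthis line is malformed\nb,2.5\na,oops\na,3.0\n")
totals, bad_lines = aggregate_tolerant(sample)
print(totals)
print(bad_lines)
```

Collecting line numbers instead of raising keeps a single bad row from invalidating a long run, while still leaving a trace of what was skipped.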

Learning Opportunities

This project provides several educational benefits:

  • CSV Processing Patterns: Learn efficient techniques for handling text data
  • Memory Management: Compare GC vs manual management across implementations
  • Algorithm Optimization: Profile and optimize hot spots in code
  • Language Idioms: Discover each language's natural approach to the same problem
  • Scaling Challenges: Handle increasingly large datasets efficiently
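
For the profiling point above, one possible starting place in Python is the standard library's `cProfile` module. The sketch below profiles a deliberately simple aggregation loop over synthetic in-memory data; `parse_line`, `aggregate`, and the generated input are assumptions for this example, not code from the repository.

```python
import cProfile
import io
import pstats


def parse_line(line):
    """Split one `key,value` line (hypothetical layout) into typed fields."""
    key, value = line.split(",")
    return key, float(value)


def aggregate(lines):
    """Sum values per key using a plain dictionary."""
    totals = {}
    for line in lines:
        key, value = parse_line(line)
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Synthetic in-memory input (hypothetical; not one of the project's datasets).
lines = [f"k{i % 10},{i}" for i in range(100_000)]

# Run the function under the profiler and keep its return value.
profiler = cProfile.Profile()
totals = profiler.runcall(aggregate, lines)

# Print the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and cumulative time per function, which is usually enough to see whether parsing or the map update dominates before attempting any optimization.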

Usage

# Generate data and run all implementations
make

# Run with specific dataset size
make INPUT=ten-million
# Options: hundred, thousand, ten-thousand, hundred-thousand, million, ten-million

# Run all implementations (recommended)
make run INPUT=million

# Run using Lua script (for learning purposes)
make lua
# Note: The Bash script (make run) is generally more reliable

License

MIT License - See LICENSE file for details.
