A cross-language benchmark project for processing large datasets efficiently. This project provides a practical playground for learning new programming languages and optimization techniques through a common, real-world challenge: processing and analyzing CSV data at scale.
- Language Exploration: Learn multiple programming languages by implementing the same task
- Optimization Skills: Discover how to squeeze maximum performance from each language
- Comparative Analysis: Understand each language's strengths and trade-offs for data processing
- Performance at Scale: See how implementations behave across datasets from small to massive
The project emphasizes implementation quality over abstract language comparisons. Often, a well-optimized solution in a "slower" language can outperform a naive implementation in a "faster" language.
| Language | Status | Notes |
|---|---|---|
| C++ | ✅ | Strong raw performance with manual memory management; excels at maps/aggregation |
| Carbon | 🚧 | Google's experimental C++ successor; implementation not started |
| Go | ✅ | Excellent for concurrent data processing; efficient string handling and map operations |
| Haskell | 🚧 | Elegant pattern matching and lazy evaluation good for streaming large data |
| JavaScript | ✅ | Surprisingly fast with V8's JIT; hash map operations well-optimized |
| Lua | ✅ | Lightweight with efficient tables; less overhead than other scripting languages |
| Mojo | 🚧 | Not started; would leverage LLVM and vectorization for CSV number crunching |
| Python | ✅ | Strong standard library CSV parser; dictionary operations well-suited for aggregation |
| Rust | ✅ | Zero-cost abstractions with memory safety; exceptional CSV parsing and map performance |
| Zig | ✅ | First optimized implementation completed; competitive with scripting languages |
| Language | Time |
|---|---|
| C++ | 0.0006s |
| Rust | 0.0016s |
| Go | 0.0021s |
| Zig | 0.0117s |
| Python | 0.0139s |
| Lua | 0.0169s |
| JavaScript | 0.0278s |
| Language | Time |
|---|---|
| C++ | 0.036s |
| Rust | 0.073s |
| Go | 0.184s |
| Python | 0.671s |
| Zig | 0.755s |
| JavaScript | 0.936s |
| Lua | 1.622s |
| Language | Time |
|---|---|
| C++ | 0.543s |
| Rust | 0.757s |
| Go | 2.652s |
| Python | 9.668s |
| Zig | 10.844s |
| JavaScript | 13.532s |
| Lua | 24.211s |
| Haskell | Build error |
| Mojo | Not created |
| Carbon | Not created |
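
To make the gaps in the table directly above easier to compare, the timings can be expressed as a slowdown relative to the fastest entry. The following is a minimal sketch, not part of the project: the helper name `slowdown_vs_fastest` is hypothetical, and the numbers are copied by hand from that table.

```python
# Times in seconds, copied from the table directly above.
times = {
    "C++": 0.543,
    "Rust": 0.757,
    "Go": 2.652,
    "Python": 9.668,
    "Zig": 10.844,
    "JavaScript": 13.532,
    "Lua": 24.211,
}


def slowdown_vs_fastest(times):
    """Return {language: time / fastest_time}, ordered fastest first.

    Hypothetical helper for illustration; not part of the project.
    """
    fastest = min(times.values())
    ranked = sorted(times.items(), key=lambda item: item[1])
    return {lang: t / fastest for lang, t in ranked}


ratios = slowdown_vs_fastest(times)
for lang, ratio in ratios.items():
    print(f"{lang:<11}{ratio:6.1f}x")
```

Ratios like these describe a single run on a single machine, so they are best read as rough orders of magnitude rather than precise rankings.
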
- C++ remains the performance leader: Fastest across all dataset sizes
- Rust consistently shows excellent performance: Close second to C++
- Zig now competitive with Python: Our optimization efforts paid off dramatically
- Performance rankings remain consistent across scales:
  - Systems languages (C++, Rust, Go) maintain their advantage
  - Zig implementation now performs in the mid-range
  - Scripting languages (Python, JavaScript, Lua) perform as expected
- Implementation quality matters enormously: Zig improved from >600s to 10.8s (a more than 55× speedup)
- All implementations gracefully handle the malformed line at line 14413142
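
One way to tolerate a malformed line during aggregation is to skip the bad row, record its position, and keep going. The sketch below illustrates that pattern only. The two-column `key,value` layout, the `aggregate` function, and the sample data are all assumptions made for this example, since the project's actual CSV schema is not described here.

```python
import csv
import io
from collections import defaultdict


def aggregate(lines):
    """Sum the numeric second column per key found in the first column.

    Assumes a hypothetical two-column "key,value" layout.
    Returns (totals, skipped), where skipped holds the 1-based row
    numbers of rows that could not be parsed.
    """
    totals = defaultdict(float)
    skipped = []
    for row_no, row in enumerate(csv.reader(lines), start=1):
        try:
            key, value = row[0], float(row[1])
        except (IndexError, ValueError):
            # Too few columns or a non-numeric value: note it and move on.
            skipped.append(row_no)
            continue
        totals[key] += value
    return dict(totals), skipped


# Made-up sample input with one malformed row.
sample = io.StringIO("a,1.5\nb,2\nthis line is malformed\na,0.5\n")
totals, skipped = aggregate(sample)
print(totals, skipped)
```

Because `csv.reader` consumes its input lazily, the same function can be handed an open file object and will process a large dataset row by row without loading it into memory.
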
This project provides several educational benefits:
- CSV Processing Patterns: Learn efficient techniques for handling text data
- Memory Management: Compare GC vs manual management across implementations
- Algorithm Optimization: Profile and optimize hot spots in code
- Language Idioms: Discover each language's natural approach to the same problem
- Scaling Challenges: Handle increasingly large datasets efficiently
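
As an illustration of the "profile and optimize hot spots" step, the standard-library `cProfile` and `pstats` modules can show where a Python implementation spends its time. This is a minimal sketch under assumptions: `parse_and_sum` is a hypothetical stand-in workload, not code from this project.

```python
import cProfile
import io
import pstats


def parse_and_sum(n):
    """Hypothetical workload: parse n "key,value" lines and sum the values."""
    total = 0
    for i in range(n):
        _key, value = f"k{i % 10},{i}".split(",")
        total += int(value)
    return total


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = parse_and_sum(100_000)
profiler.disable()

# Print the five most expensive entries, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by cumulative time highlights the functions worth optimizing first. Other languages in the project would use their own profilers for the same purpose.
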
# Generate data and run all implementations
make
# Run with specific dataset size
make INPUT=ten-million
# Options: hundred, thousand, ten-thousand, hundred-thousand, million, ten-million
# Run all implementations (recommended)
make run INPUT=million
# Run using Lua script (for learning purposes)
make lua
# Note: The Bash script (make run) is generally more reliable

MIT License - See LICENSE file for details.