Rewrite datafusion-sqlancer in Rust #14535

Open
@2010YOUY01

Is your feature request related to a problem or challenge?

This is a project idea for GSoC 2025 (#14478).

datafusion-sqlancer is a SQL-level fuzz testing implementation for DataFusion (see #11030).

Current implementation status

datafusion-sqlancer covers a subset of SQL features and data types, and implements 3 relatively simple testing oracles [1]. With occasional manual runs, it has found around 50 bugs.
The implementation is in Java, and it's a fork of the original SQLancer.

Why rewrite in Rust

SQLancer was first implemented in Java for very good reasons: it has to evaluate several testing oracles against many major databases, and JDBC is a common interface to them.
DataFusion's SQLancer implementation currently extends the SQLancer framework, which has saved us some effort on CLI parsing, result comparison, etc.

There are several reasons I think it's a good idea to rewrite in Rust at this point:

  • (major) Making test oracles also apply to sqllogictests
    datafusion-sqlancer consists of two modules: random query generation, and property validation for test oracles. Those properties can also be applied to enhance existing SQL tests. If we have those properties implemented in Rust, enhancing existing sqllogictests would be easier.
    So far only 3 simple test oracles have been implemented, and I believe around 10 novel SQL testing algorithms have been proposed. One example is Equivalent Expression Transformation (EET, https://www.usenix.org/conference/osdi24/presentation/jiang), which I think is very suitable for enhancing existing SQL tests.
    Overall, I think it's a good time to switch to a native Rust implementation before implementing more complex testing algorithms.
  • Simpler implementation
    One simplification is that we no longer need JDBC to connect the testing framework to DataFusion core. Configuration fuzzing can also be easier, and there might be some existing code we can reuse.
  • More contributors
    The DataFusion ecosystem is mainly in Rust, so IMO it would be easier to find people to help if the testing framework is written in Rust instead of Java.
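
To make "property validation" concrete, here is a minimal sketch of how the NoREC property (the oracle mentioned in the footnote) could be expressed in plain Rust. The names `NorecQueries`, `norec_queries` and `check_norec` are hypothetical and not part of datafusion-sqlancer or DataFusion, and the `CASE WHEN` form is just one common way to write the reference query. Running the two queries against DataFusion is left out on purpose.

```rust
/// A pair of queries that should return the same count on a correct engine.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub struct NorecQueries {
    /// Filters with WHERE, so the optimizer is free to rewrite the predicate.
    pub optimized: String,
    /// Evaluates the same predicate as a projected expression and counts
    /// the rows where it is true, which is much harder to optimize.
    pub reference: String,
}

/// Build the NoREC query pair for a table and a boolean predicate.
/// Both arguments are pasted into the SQL text as-is.
pub fn norec_queries(table: &str, predicate: &str) -> NorecQueries {
    NorecQueries {
        optimized: format!("SELECT COUNT(*) FROM {} WHERE {}", table, predicate),
        // NULL and FALSE both fall into the ELSE branch, matching WHERE semantics.
        // COALESCE covers the empty-table case, where SUM yields NULL.
        reference: format!(
            "SELECT COALESCE(SUM(CASE WHEN ({}) THEN 1 ELSE 0 END), 0) FROM {}",
            predicate, table
        ),
    }
}

/// The property itself: both counts must agree, otherwise report a mismatch.
pub fn check_norec(optimized_count: i64, reference_count: i64) -> Result<(), String> {
    if optimized_count == reference_count {
        Ok(())
    } else {
        Err(format!(
            "NoREC mismatch: optimized query returned {}, reference returned {}",
            optimized_count, reference_count
        ))
    }
}
```

Because the property only needs a table name and a predicate, the same check could in principle be derived from a query that already exists in a sqllogictest file, which is the reuse described above.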

Describe the solution you'd like

See #11030 for the background

  • Generate random queries into a DataFusion internal data structure (perhaps Statement)
  • Implement testing oracles. To also support running them on existing SQL tests, we might want:
    • For query mutation: mutate the query's internal representation, and convert it back to a SQL string
    • For property checks: implement them by extending the sqllogictest framework
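
The "mutate the internal representation, then convert back to SQL" step could look roughly like the sketch below. It uses a tiny hypothetical `Expr` enum rather than DataFusion's real `Statement` or expression types, and the rewrite shown (wrapping an expression in a `CASE` whose two branches are identical) is only a simple equivalence-preserving mutation in the spirit of EET, not necessarily a rule from the paper. It assumes the guard expression is deterministic and does not raise an error when evaluated.

```rust
/// Toy expression tree. A stand-in for a real internal representation;
/// these variants are assumptions made for this example.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Int(i64),
    Binary { left: Box<Expr>, op: &'static str, right: Box<Expr> },
    Case { when: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

impl Expr {
    /// Convert the tree back to a SQL string.
    /// Binary expressions are always parenthesized to avoid precedence issues.
    pub fn to_sql(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Int(value) => value.to_string(),
            Expr::Binary { left, op, right } => {
                format!("({} {} {})", left.to_sql(), op, right.to_sql())
            }
            Expr::Case { when, then, otherwise } => format!(
                "CASE WHEN {} THEN {} ELSE {} END",
                when.to_sql(),
                then.to_sql(),
                otherwise.to_sql()
            ),
        }
    }
}

/// Mutation: wrap `expr` in a CASE whose branches are both `expr`.
/// Whatever `guard` evaluates to (TRUE, FALSE or NULL), the value is unchanged,
/// so a query using the mutated expression should return the same result.
pub fn wrap_in_redundant_case(expr: &Expr, guard: Expr) -> Expr {
    Expr::Case {
        when: Box::new(guard),
        then: Box::new(expr.clone()),
        otherwise: Box::new(expr.clone()),
    }
}
```

For example, mutating the predicate `(a > 1)` with the guard `(b = 0)` yields `CASE WHEN (b = 0) THEN (a > 1) ELSE (a > 1) END`. The property check then compares the results of the original and the mutated query.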

Describe alternatives you've considered

I believe the project idea proposed above is advanced in terms of difficulty.
A medium-level project could be extending the existing implementation with support for more SQL features and types, implementing more test oracles, and improving CI integration.
I'm also open to a fully LLM-based alternative, but I don't have a very good idea so far. Reference: https://fuzz4all.github.io/

Additional context

No response

Footnotes

  1. https://github.com/apache/datafusion/issues/11030 has a minimal example for the NoREC testing oracle
