This repo contains the source code and data for Pasta (SIGMOD'25 Paper).
Pasta is built on top of Apache Texera (Incubating), a collaborative data analytics workflow system.
The experiments in the paper were performed on a branch of Texera in July 2024. Since then, the Pasta scheduler has been fully integrated into Texera's master branch, and we have moved the additional source code needed to run Pasta's experiments to this repo (forked from Texera in November 2025).
The experiments work on macOS, Windows, or Linux, but they require a local machine (desktop or laptop), because you need to set up a Texera dev environment and access Texera's frontend locally in a browser. A server environment is not recommended. Some Linux distributions may not be supported if they lack PGroonga support.
- Follow https://github.com/Texera/Pasta/wiki/Guide-for-Setting-Up-Texera-Dev-Environment to set up a dev environment for Texera and run both the backend microservices and the frontend in IntelliJ.
- Follow https://github.com/Texera/Pasta/wiki/Guide-for-how-to-use-Texera to get familiar with using Texera's frontend.
All the workflows we used for analysis and for the experiments on the optimization goal of minimizing total materialization size are in the following file:
/pasta_experiment_inputs/mat_size_experiment_workflows.zip
Extract this file in the same directory. It contains ~6K real-world workflows as Texera workflow source files. Note that these workflows are used only for analysis and for simulating scheduling optimization toward the goal of reducing materialization sizes; they are not executable in Texera.
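If you prefer to script the extraction, the following is a minimal sketch using Python's standard `zipfile` module. The helper name `extract_next_to_archive` is hypothetical, and the example path in the comment assumes you run it from the repository root.

```python
import zipfile
from pathlib import Path


def extract_next_to_archive(archive_path):
    """Extract a zip archive into the directory that contains it."""
    archive = Path(archive_path)
    with zipfile.ZipFile(archive) as zf:
        # Everything is unpacked alongside the archive itself.
        zf.extractall(archive.parent)
    return archive.parent


# Example call (assumed path, relative to the repository root):
# extract_next_to_archive("pasta_experiment_inputs/mat_size_experiment_workflows.zip")
```

The same helper can be used for the other archive in `pasta_experiment_inputs`.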
To run a complete set of experiments for Goal 1, navigate to this file in IntelliJ:
amber/src/main/scala/apache/texera/workflow/PastaMatSizeOptimizationExperimentRunner.scala
Execute PastaMatSizeOptimizationExperimentRunner in IntelliJ and pass it three command-line arguments:
<input_file> <output_directory> <results_file>
- <input_file>: The path to the source of a workflow, e.g.,
"pasta_experiment_inputs/mat_size_experiment_workflows/_01_ChemicalLibraryEnumeration.json"
- <output_directory>: Your desired path for the additional output files generated by the experiment runner. These will be images of the input physical plan and region plans of each method.
- <results_file>: Your desired path of the result CSV file. The complete statistics about the input workflow and the performance of each method on this workflow will be written to this CSV file.
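Since the runner takes one workflow at a time, it can help to generate the argument triple for every extracted workflow up front. The sketch below only builds the argument strings; it does not launch the runner. The helper name `build_runner_args`, the `experiment_outputs` and `experiment_results` directories, and the choice of one results CSV per workflow are all assumptions made for illustration.

```python
from pathlib import Path


def build_runner_args(workflow_files, output_root, results_dir):
    """Return one (input_file, output_directory, results_file) triple per workflow."""
    triples = []
    for wf in sorted(Path(p) for p in workflow_files):
        triples.append((
            str(wf),                                   # <input_file>
            str(Path(output_root) / wf.stem),          # <output_directory>
            str(Path(results_dir) / f"{wf.stem}.csv"), # <results_file>
        ))
    return triples


if __name__ == "__main__":
    # Assumed location of the extracted workflows, relative to the repo root.
    workflow_dir = Path("pasta_experiment_inputs/mat_size_experiment_workflows")
    for args in build_runner_args(
        workflow_dir.glob("*.json"), "experiment_outputs", "experiment_results"
    ):
        # Print one line per workflow, ready to paste into a run configuration.
        print(" ".join(f'"{a}"' for a in args))
```

Each printed line can be pasted into the program-arguments field of the IntelliJ run configuration for the runner.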
The workflows and their input files for this goal can be found in
/pasta_experiment_inputs/wallclock_runtime_experiment_workflows.zip
Extract this file in the same directory. There are two executable Texera workflows for this experiment.
This step requires running Texera. Once you have set up a dev Texera environment (including Community features) and have Texera running locally, follow these additional steps to run experiments for each workflow:
- In "Your Work" -> "Datasets", create a new dataset for the workflow.
- Navigate to the newly created dataset and upload the input file to it. Click "Submit" when done and leave the dataset version at its default.
- In "Your Work" -> "Workflows", upload the workflow JSON to Texera.
- Open the uploaded workflow:
- Click the "CSV File Scan" operator and replace its "File" property with your uploaded file (click "Reselect File" and choose from the newly created dataset).
- Again on the "CSV File Scan" operator, change the "Limit" to the desired input data size (e.g., 1000).
- Click "Connect" on the top-right of the canvas and click "+ Computing Unit", and in the pop-up window, click "Create".
- The "+Connect" button on the top-right should become a blue "Run" button now. Hover on the button and input the execution name to be one of the following:
ALL_MAT
BASELINE
TOP_DOWN_GLOBAL
BOTTOM_UP_GLOBAL
TOP_DOWN_GREEDY
BOTTOM_UP_GREEDY
The execution name is used to indicate to the scheduler which method to use.
- Click on "Run" to execute the workflow.
- Note: Before running any experiment, always run "ALL_MAT" once and let the execution finish so that there are past statistics for the scheduler to use when calculating costs.
- The details of the scheduling performance will be output as a console log, and the total wall-clock runtime of this workflow can be viewed on the frontend.

The console log will be from ComputingUnitMaster and look like the following:
[WARN] [CONTROLLER] [CostBasedScheduleGenerator] [Amber-akka.actor.default-dispatcher-15] - WID: WorkflowIdentity(236), EID: ExecutionIdentity(1637), Scheduling method: BOTTOM_UP_GLOBAL, Cost of schedule: 70.510048792, scheduler ran for: 580.087292
- When testing different input file sizes, it is recommended to create a new workflow for each input size so that their respective cost information can be measured accurately.
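When collecting results across several runs, the scheduler log line shown above can be parsed with a regular expression. This is a minimal sketch: the pattern is inferred from that single sample line, so the actual log format may differ between Texera versions, and the function name `parse_schedule_log` and the returned dictionary keys are hypothetical. The numeric values are returned as-is, without assuming units.

```python
import re

# Pattern inferred from the one sample log line in this README (an assumption).
LOG_PATTERN = re.compile(
    r"WID: WorkflowIdentity\((?P<wid>\d+)\), "
    r"EID: ExecutionIdentity\((?P<eid>\d+)\), "
    r"Scheduling method: (?P<method>\w+), "
    r"Cost of schedule: (?P<cost>[0-9.eE+-]+), "
    r"scheduler ran for: (?P<scheduler_time>[0-9.eE+-]+)"
)


def parse_schedule_log(line):
    """Return the scheduling fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "workflow_id": int(match["wid"]),
        "execution_id": int(match["eid"]),
        "method": match["method"],
        "cost": float(match["cost"]),
        "scheduler_time": float(match["scheduler_time"]),
    }
```

Lines that do not come from CostBasedScheduleGenerator simply return `None`, so the function can be applied to every line of a saved console log.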
This project is supported by the National Science Foundation under the award IIS-2107150.
This project is supported by an NIH NIDDK award.