Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

This repo contains the source code and data for Pasta (SIGMOD'25 Paper).

Pasta is built on top of Apache Texera (Incubating), a collaborative data analytics workflow system.

The experiments in the paper were performed on a branch of Texera in July 2024. Since then, the Pasta scheduler has been fully integrated into Texera's master branch, and we have moved the additional source code for running Pasta's experiments to this repo (forked from Texera in Nov 2025).

Running Experiments

Prerequisites

OS Environment

The experiments work on macOS, Windows, or Linux, but you need a local environment (desktop or laptop) to set up a Texera dev environment, because you will access Texera's frontend locally in a browser. A server environment is not recommended. Some Linux distros may also be unsupported if they lack PGroonga support.

Setting up Pasta (Texera)

Follow https://github.com/Texera/Pasta/wiki/Guide-for-Setting-Up-Texera-Dev-Environment to set up a dev environment for Texera and run both the backend microservices and the frontend in IntelliJ.
Follow https://github.com/Texera/Pasta/wiki/Guide-for-how-to-use-Texera to get familiar with using Texera's frontend.

Running Experiments Related to Goal 1: Minimizing Total Sizes of Materialization

All the workflows we used for analyses and running experiments on the optimization goal of minimizing total sizes of materialization are in the following file:

/pasta_experiment_inputs/mat_size_experiment_workflows.zip

Extract this file in the same directory. It contains ~6K real-world workflows as Texera workflow source files. Note that these workflows are used only for analysis and for simulating scheduling optimization toward the goal of reducing materialization sizes; they are not executable in Texera.
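If you prefer to script the extraction step, the following is a minimal Python sketch. The function name `extract_workflows` is hypothetical and not part of this repo, and the sketch assumes the workflow sources inside the archive end in `.json` (as the example path further below suggests).

```python
import zipfile
from pathlib import Path


def extract_workflows(archive: Path) -> list[Path]:
    """Extract `archive` into its own directory and return the extracted .json files.

    Hypothetical helper: it assumes workflow sources are the .json members
    of the archive and ignores every other member when building the result.
    """
    target_dir = archive.parent
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        names = zf.namelist()
    return sorted(target_dir / name for name in names if name.endswith(".json"))


if __name__ == "__main__":
    # Path taken from the README; run this from the repository root.
    files = extract_workflows(
        Path("pasta_experiment_inputs/mat_size_experiment_workflows.zip")
    )
    print(f"extracted {len(files)} workflow files")
```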

To run a complete set of experiments for Goal 1, navigate to this file in IntelliJ:

amber/src/main/scala/apache/texera/workflow/PastaMatSizeOptimizationExperimentRunner.scala

Execute PastaMatSizeOptimizationExperimentRunner in IntelliJ and pass it three CLI arguments:

<input_file> <output_directory> <results_file>

<input_file>: The path to the source of a workflow, e.g., "pasta_experiment_inputs/mat_size_experiment_workflows/_01_ChemicalLibraryEnumeration.json"
<output_directory>: The directory where the experiment runner writes its additional output files. These are images of the input physical plan and of the region plan produced by each method.
<results_file>: The path of the result CSV file. The complete statistics about the input workflow and the performance of each method on this workflow are written to this CSV file.
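Since the runner takes one workflow per invocation, it can help to generate the argument triples for a whole directory of workflows up front. The sketch below is an illustration only: `build_runner_args` and the per-workflow output layout are assumptions, not part of this repo. Because this README does not say whether the runner appends to or overwrites the results file, the sketch gives every workflow its own output directory and CSV file.

```python
from pathlib import Path


def build_runner_args(workflows_dir: Path, out_root: Path) -> list[tuple[str, str, str]]:
    """Return one (input_file, output_directory, results_file) triple per workflow.

    Hypothetical helper: each workflow gets its own output directory and
    results CSV under `out_root`, named after the workflow file's stem.
    Nothing is created on disk; the function only computes paths.
    """
    triples = []
    for workflow in sorted(workflows_dir.glob("*.json")):
        output_dir = out_root / workflow.stem
        results_file = out_root / f"{workflow.stem}.csv"
        triples.append((str(workflow), str(output_dir), str(results_file)))
    return triples


if __name__ == "__main__":
    args = build_runner_args(
        Path("pasta_experiment_inputs/mat_size_experiment_workflows"),
        Path("pasta_experiment_outputs"),  # assumed output location
    )
    for input_file, output_dir, results_file in args:
        # Each printed line can be pasted as the program arguments of one run.
        print(f'"{input_file}" "{output_dir}" "{results_file}"')
```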

Running Experiments Related to Goal 2: Minimizing Workflow Wall-clock Runtime

The workflows and input files for this goal can be found in:

/pasta_experiment_inputs/wallclock_runtime_experiment_workflows.zip

Extract this file in the same directory. There are two executable Texera workflows for this experiment.

This step requires running Texera. Once you have set up a dev Texera environment (including Community features) and have Texera running locally, follow these additional steps to run experiments for each workflow:

In "Your Work" -> "Datasets", create a new dataset for the workflow.

Navigate to the newly created dataset and upload the input file to it. Click "Submit" at the end and leave the dataset version at its default.

In "Your Work" -> "Workflows", upload the workflow JSON to Texera.

Open the uploaded workflow:

Click the "CSV File Scan" operator and replace its "File" property with your uploaded file (click "Reselect File" and choose from the newly created dataset).

Again on the "CSV File Scan" operator, change the "Limit" to the desired input data size (e.g., 1000).
Click "Connect" on the top-right of the canvas, click "+ Computing Unit", and in the pop-up window, click "Create".

The "+Connect" button on the top-right should now become a blue "Run" button. Hover over the button and enter one of the following as the execution name:

ALL_MAT
BASELINE
TOP_DOWN_GLOBAL
BOTTOM_UP_GLOBAL
TOP_DOWN_GREEDY
BOTTOM_UP_GREEDY

The execution name is used to indicate to the scheduler which method to use.

Click on "Run" to execute the workflow.
Note: Before running any experiment, always run "ALL_MAT" once and let the execution finish so that there are past statistics for the scheduler to use when calculating costs.
The details of the scheduling performance are output as a console log, and the total wall-clock runtime of the workflow can be viewed on the frontend.

The console log will be from ComputingUnitMaster and look like the following:

[WARN] [CONTROLLER] [CostBasedScheduleGenerator] [Amber-akka.actor.default-dispatcher-15] - WID: WorkflowIdentity(236), EID: ExecutionIdentity(1637), Scheduling method: BOTTOM_UP_GLOBAL, Cost of schedule: 70.510048792, scheduler ran for: 580.087292
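To collect these values across runs, the log line can be parsed with a regular expression. The following is a minimal sketch based only on the sample line above; `parse_schedule_log` and the field names in the returned dictionary are hypothetical, and the sketch deliberately does not interpret the units of the two numeric values, since this README does not state them.

```python
import re
from typing import Optional

# Pattern derived from the single sample log line in this README.
# Other log formats, if any, are not covered.
LOG_PATTERN = re.compile(
    r"WID: WorkflowIdentity\((?P<wid>\d+)\), "
    r"EID: ExecutionIdentity\((?P<eid>\d+)\), "
    r"Scheduling method: (?P<method>\w+), "
    r"Cost of schedule: (?P<cost>\d+(?:\.\d+)?), "
    r"scheduler ran for: (?P<ran_for>\d+(?:\.\d+)?)"
)


def parse_schedule_log(line: str) -> Optional[dict]:
    """Extract the scheduling fields from one console log line.

    Returns None when the line does not match the expected format.
    """
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "workflow_id": int(match.group("wid")),
        "execution_id": int(match.group("eid")),
        "method": match.group("method"),
        "cost": float(match.group("cost")),
        "scheduler_ran_for": float(match.group("ran_for")),
    }


if __name__ == "__main__":
    sample = (
        "[WARN] [CONTROLLER] [CostBasedScheduleGenerator] "
        "[Amber-akka.actor.default-dispatcher-15] - WID: WorkflowIdentity(236), "
        "EID: ExecutionIdentity(1637), Scheduling method: BOTTOM_UP_GLOBAL, "
        "Cost of schedule: 70.510048792, scheduler ran for: 580.087292"
    )
    print(parse_schedule_log(sample))
```

Lines that do not match return None, so the function can be applied to every line of a captured console log and the None results filtered out.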

When testing different input file sizes, it is recommended to create a new workflow for each input size so that their respective cost information can be measured accurately.

Acknowledgements

This project is supported by the National Science Foundation under the award IIS-2107150.

This project is supported by an NIH NIDDK award.
