Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

This repo contains the source code and data for Pasta (SIGMOD'25 Paper).

Pasta is built on top of Apache Texera (Incubating), a collaborative data analytics workflow system.

The experiments in the paper were performed on a branch of Texera in July 2024. Since then, the Pasta scheduler has been fully integrated into Texera's master branch, and we have moved the additional source code for running Pasta's experiments to this repo (forked from Texera in Nov 2025).

Running Experiments

Prerequisites

OS Environment

The experiments work on macOS, Windows, or Linux. A local environment (desktop or laptop) is needed to set up a Texera dev environment, because you will need to access Texera's frontend locally in a browser; a server environment is not recommended. Some Linux distros may not be supported if they lack PGroonga support.

Setting up Pasta (Texera)

Running Experiments Related to Goal 1: Minimizing Total Sizes of Materialization

All the workflows we used for analyses and running experiments on the optimization goal of minimizing total sizes of materialization are in the following file:

/pasta_experiment_inputs/mat_size_experiment_workflows.zip

Extract this file in the same directory. It contains ~6K real-world workflows as Texera workflow source files. Note that these workflows are used only for analysis and for simulating scheduling optimization toward the goal of reducing materialization sizes; they are not executable in Texera.
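The extraction step can also be scripted. The following is a minimal sketch, not part of this repo: the helper name `extract_next_to_archive` is hypothetical, and the internal layout of the archive (for example, whether it holds a top-level folder) is an assumption to verify after extracting.

```python
import zipfile
from pathlib import Path


def extract_next_to_archive(archive_path: str) -> list[Path]:
    """Extract a zip archive into the directory that contains it.

    Returns the paths of the extracted .json files, sorted by path.
    (Hypothetical helper for illustration; not part of the Pasta code base.)
    """
    archive = Path(archive_path)
    target_dir = archive.parent  # "the same directory" as the zip file
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(target_dir)
    # Keep only the workflow source files; other members are ignored.
    return sorted(
        target_dir / name for name in names if name.lower().endswith(".json")
    )
```

Calling `extract_next_to_archive("pasta_experiment_inputs/mat_size_experiment_workflows.zip")` from the repo root and checking the length of the returned list is a quick way to confirm that the workflows were extracted.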

To run a complete set of experiments for Goal 1, navigate to this file in IntelliJ:

amber/src/main/scala/apache/texera/workflow/PastaMatSizeOptimizationExperimentRunner.scala

Execute PastaMatSizeOptimizationExperimentRunner in IntelliJ and provide three CLI arguments:

<input_file> <output_directory> <results_file>

  • <input_file>: The path to the source of a workflow, e.g., "pasta_experiment_inputs/mat_size_experiment_workflows/_01_ChemicalLibraryEnumeration.json"
  • <output_directory>: Your desired directory for the additional output files generated by the experiment runner. These are images of the input physical plan and of the region plan produced by each method.
  • <results_file>: Your desired path for the result CSV file. The complete statistics about the input workflow and the performance of each method on this workflow will be written to this CSV file.
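Because the runner takes one workflow per invocation, covering all extracted workflows means preparing one argument triple per JSON file. The sketch below only builds those triples; it does not launch the runner, since this README describes running it from IntelliJ. The function name and the per-workflow output layout (one sub-directory and one CSV per workflow) are assumptions chosen for illustration.

```python
from pathlib import Path


def build_runner_args(
    workflow_dir: str, output_root: str, results_dir: str
) -> list[list[str]]:
    """Return one [input_file, output_directory, results_file] triple per workflow.

    Hypothetical helper: the per-workflow naming scheme is an assumption,
    not something the experiment runner requires.
    """
    arg_sets = []
    for workflow in sorted(Path(workflow_dir).glob("*.json")):
        # One output directory per workflow keeps the generated plan images apart.
        output_dir = Path(output_root) / workflow.stem
        # One CSV per workflow avoids relying on how the runner treats existing files.
        results_file = Path(results_dir) / f"{workflow.stem}.csv"
        arg_sets.append([str(workflow), str(output_dir), str(results_file)])
    return arg_sets
```

Each returned triple can be pasted into the "Program arguments" field of an IntelliJ run configuration in the order `<input_file> <output_directory> <results_file>`.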

Running Experiments Related to Goal 2: Minimizing Workflow Wall-clock Runtime

The workflows and input files for this goal can be found in

/pasta_experiment_inputs/wallclock_runtime_experiment_workflows.zip

Extract this file in the same directory. There are two executable Texera workflows for this experiment.

This step requires running Texera. Once you have set up a dev Texera environment (including Community features) and have Texera running locally, follow these additional steps to run experiments for each workflow:

  • In "Your Work" -> "Datasets", create a new dataset for the workflow.
  • Navigate to the newly created dataset and upload the input file to it. Click "Submit" at the end and leave the dataset version as the default.
  • In "Your Work" -> "Workflows", upload the workflow JSON to Texera.
  • Open the uploaded workflow.
  • Click the "CSV File Scan" operator and replace its "File" property with your uploaded file (click "Reselect File" and choose from the newly created dataset).
  • Again on the "CSV File Scan" operator, change the "Limit" to the desired input data size (e.g., 1000).
  • Click "Connect" on the top-right of the canvas, click "+ Computing Unit", and in the pop-up window, click "Create".
  • The "+Connect" button on the top-right should now become a blue "Run" button. Hover over the button and set the execution name to one of the following:
ALL_MAT
BASELINE
TOP_DOWN_GLOBAL
BOTTOM_UP_GLOBAL
TOP_DOWN_GREEDY
BOTTOM_UP_GREEDY

The execution name is used to indicate to the scheduler which method to use.

  • Click on "Run" to execute the workflow.

  • Note: Before running any experiment, always run "ALL_MAT" once and let the execution finish so that the scheduler has past statistics to use when calculating costs.

  • The details of the scheduling performance are written to the console log, and the total wall-clock runtime of the workflow can be viewed on the frontend.

The console log comes from ComputingUnitMaster and looks like the following:

[WARN] [CONTROLLER] [CostBasedScheduleGenerator] [Amber-akka.actor.default-dispatcher-15] - WID: WorkflowIdentity(236), EID: ExecutionIdentity(1637), Scheduling method: BOTTOM_UP_GLOBAL, Cost of schedule: 70.510048792, scheduler ran for: 580.087292
  • When testing different input file sizes, it is recommended to create a new workflow for each input size so that their respective cost information can be measured accurately.
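When comparing methods across many runs, it can help to pull the numbers out of the console log automatically. The sketch below parses a line in the format of the sample shown above. The regular expression is written against that single sample, so it is an assumption that other log lines follow the same layout. The units of the cost and of the scheduler time are not stated in this README, so the parser returns them as plain floats.

```python
import re

# Pattern derived from the single sample log line in this README (an assumption).
LOG_PATTERN = re.compile(
    r"WID: WorkflowIdentity\((?P<wid>\d+)\), "
    r"EID: ExecutionIdentity\((?P<eid>\d+)\), "
    r"Scheduling method: (?P<method>\w+), "
    r"Cost of schedule: (?P<cost>[0-9.eE+-]+), "
    r"scheduler ran for: (?P<scheduler_time>[0-9.eE+-]+)"
)


def parse_scheduler_log(line: str):
    """Extract scheduling statistics from one console log line.

    Returns a dict, or None if the line does not match the expected layout.
    """
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "workflow_id": int(match["wid"]),
        "execution_id": int(match["eid"]),
        "method": match["method"],
        "cost": float(match["cost"]),  # unit not stated in the README
        "scheduler_time": float(match["scheduler_time"]),  # unit not stated
    }
```

Applying `parse_scheduler_log` to every line of a saved ComputingUnitMaster log and dropping the `None` results gives one record per execution, which can then be grouped by `method`.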

Acknowledgements

This project is supported by the National Science Foundation under the award IIS-2107150.

This project is supported by an NIH NIDDK award.
