Modin Configuration Settings
To adjust Modin’s default behavior, you can set Modin config values either through environment variables or through the modin.config API. To list all available configs with their descriptions, run python -m modin.config.
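As a minimal sketch of the environment-variable route, the helpers below build an environment with MODIN_* overrides and pass it to a child process that runs python -m modin.config. The helper names build_modin_env and list_modin_configs are hypothetical (they are not part of Modin), and list_modin_configs assumes Modin is installed in the interpreter that runs it.

```python
import os
import subprocess
import sys


def build_modin_env(overrides, base=None):
    """Return a copy of `base` (default: os.environ) with the overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        # Environment variable values are always strings.
        env[name] = str(value)
    return env


def list_modin_configs(env):
    """Run `python -m modin.config` in a child process and return its output.

    Assumes Modin is installed; raises CalledProcessError otherwise.
    """
    result = subprocess.run(
        [sys.executable, "-m", "modin.config"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Build an environment that selects the Dask engine and 16 partitions.
env = build_modin_env({"MODIN_ENGINE": "Dask", "MODIN_NPARTITIONS": 16}, base={})
print(env)
```

Building the environment separately from launching the process keeps the current process untouched, which can be convenient when several configurations need to be compared.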
Public API
In principle, config values could come from any source, but for now only environment variables are implemented. Every environment-variable config derives from EnvironmentVariable, which contains most of the config API implementation.
- class modin.config.envvars.EnvironmentVariable
  Base class for environment-variable-based configuration.
  - classmethod get() -> Any
    Get config value.
    - Returns: Decoded and verified config value.
    - Return type: Any
  - classmethod get_help() -> str
    Generate user-presentable help for the config.
    - Return type: str
  - classmethod get_value_source() -> ValueSource
    Get value source of the config.
    - Return type: ValueSource
  - classmethod once(onvalue: Any, callback: Callable) -> None
    Execute callback if the config value matches onvalue. Otherwise, store the callback in the _once container under the given onvalue.
    - Parameters:
      - onvalue (Any) – Config value that triggers the callback.
      - callback (callable) – Callable that should be executed if the config value matches onvalue.
  - classmethod put(value: Any) -> None
    Set config value.
    - Parameters:
      - value (Any) – Config value to set.
  - classmethod subscribe(callback: Callable) -> None
    Add callback to the _subs list and then execute it.
    - Parameters:
      - callback (callable) – Callable to execute.
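The callback semantics described above can be illustrated with a small stdlib-only stand-in. This is a toy model, not Modin's implementation: the class ToyEnvConfig and the variable TOY_ENGINE are hypothetical, and two details are assumptions rather than facts stated here, namely that callbacks receive the config class as their argument and that stored callbacks fire when put changes the value.

```python
import os
from collections import defaultdict


class ToyEnvConfig:
    """Hypothetical stand-in for an environment-variable-backed config."""

    varname = "TOY_ENGINE"  # hypothetical environment variable
    default = "Ray"
    _value = None  # value set through put(); None means "not set via the API"
    _subs = []  # callbacks registered with subscribe()
    _once = defaultdict(list)  # onvalue -> callbacks waiting for that value

    @classmethod
    def get(cls):
        """Return the API-set value, else the environment value, else the default."""
        if cls._value is not None:
            return cls._value
        return os.environ.get(cls.varname, cls.default)

    @classmethod
    def put(cls, value):
        """Set the value and notify callbacks if it actually changed (assumption)."""
        old = cls.get()
        cls._value = value
        if value == old:
            return
        for callback in list(cls._subs):
            callback(cls)
        # Callbacks waiting for this value run once and are then discarded.
        for callback in cls._once.pop(value, []):
            callback(cls)

    @classmethod
    def subscribe(cls, callback):
        """Register the callback and execute it immediately."""
        cls._subs.append(callback)
        callback(cls)

    @classmethod
    def once(cls, onvalue, callback):
        """Run the callback now if the value matches, otherwise store it."""
        if cls.get() == onvalue:
            callback(cls)
        else:
            cls._once[onvalue].append(callback)


os.environ.pop("TOY_ENGINE", None)  # make sure the default applies
events = []
ToyEnvConfig.subscribe(lambda cfg: events.append(("sub", cfg.get())))
ToyEnvConfig.once("Dask", lambda cfg: events.append(("once", cfg.get())))
ToyEnvConfig.put("Dask")
print(events)
```

In this model the subscriber runs once at registration and again on the change, while the once callback runs a single time when the value first becomes "Dask".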
Modin Configs List
Config Name | Env. Variable Name | Default Value | Description | Options |
---|---|---|---|---|
AsvDataSizeConfig | MODIN_ASV_DATASIZE_CONFIG |  | Allows overriding the default size of data (shapes). |  |
AsvImplementation | MODIN_ASV_USE_IMPL | modin | Selects the library used for performance testing. | (‘modin’, ‘pandas’) |
AsyncReadMode | MODIN_ASYNC_READ_MODE | False | Do not wait for reading from the source to finish. The reading function only launches tasks for the dataframe to be read/created and does not ensure that construction is finalized by the time it returns a dataframe. This option was introduced to improve the performance of reading/constructing Modin DataFrames, but it may also: 1. Increase peak memory consumption, since garbage collection of the temporary objects created during reading is also lazy and is only performed once reading/construction has actually finished. 2. Break situations where the source is manually deleted after the reading function returns a result, for example, when reading inside a context block that deletes the file on exit. |  |
AutoSwitchBackend | MODIN_AUTO_SWITCH_BACKENDS | True | Whether automatic backend switching is allowed. When this flag is set, a Modin backend can attempt to automatically choose an appropriate backend for different operations based on features of the input data. When disabled, backends should avoid implicit backend switching outside of explicit operations like to_pandas and to_ray. |  |
Backend | MODIN_BACKEND | Ray | An alias for execution, i.e. the combination of StorageFormat and Engine. Setting Backend may change StorageFormat and/or Engine to the corresponding values, and setting Engine or StorageFormat may change Backend. | (‘Ray’, ‘Dask’, ‘Python_Test’, ‘Unidist’, ‘Pandas’) |
BenchmarkMode | MODIN_BENCHMARK_MODE | False | Whether or not to perform computations synchronously. |  |
CIAWSAccessKeyID | AWS_ACCESS_KEY_ID | foobar_key | Set to AWS_ACCESS_KEY_ID when running mock S3 tests for Modin in GitHub CI. |  |
CIAWSSecretAccessKey | AWS_SECRET_ACCESS_KEY | foobar_secret | Set to AWS_SECRET_ACCESS_KEY when running mock S3 tests for Modin in GitHub CI. |  |
CpuCount | MODIN_CPUS | multiprocessing.cpu_count() | How many CPU cores to use during initialization of the Modin engine. |  |
DaskThreadsPerWorker | MODIN_DASK_THREADS_PER_WORKER | 1 | Number of threads per Dask worker. |  |
DocModule | MODIN_DOC_MODULE | pandas | The module to use for docstrings. The value set here must be a valid, importable module. It should have a DataFrame, Series, and/or several APIs directly (e.g. read_csv). |  |
DynamicPartitioning | MODIN_DYNAMIC_PARTITIONING | False | Set to true to use Modin’s dynamic-partitioning implementation where possible. Please refer to the documentation for cases where enabling this option would be beneficial: https://modin.readthedocs.io/en/stable/usage_guide/optimization_notes/index.html#dynamic-partitioning-in-modin |  |
Engine | MODIN_ENGINE | Ray | Distribution engine to run queries by. | (‘Ray’, ‘Dask’, ‘Python’, ‘Unidist’, ‘Native’) |
GithubCI | MODIN_GITHUB_CI | False | Set to true when running Modin in GitHub CI. |  |
GpuCount | MODIN_GPUS |  | How many GPU devices to utilize across the whole distribution. |  |
IsDebug | MODIN_DEBUG |  | Force Modin engine to be “Python” unless specified by $MODIN_ENGINE. |  |
IsExperimental | MODIN_EXPERIMENTAL |  | Whether to turn on experimental features. |  |
IsRayCluster | MODIN_RAY_CLUSTER |  | Whether Modin is running on a pre-initialized Ray cluster. |  |
LazyExecution | MODIN_LAZY_EXECUTION | Auto | Lazy execution mode. | (‘Auto’, ‘On’, ‘Off’) |
LogFileSize | MODIN_LOG_FILE_SIZE | 10 | Max size of logs (in MBs) to store per Modin job. |  |
LogMemoryInterval | MODIN_LOG_MEMORY_INTERVAL | 5 | Interval (in seconds) to profile memory utilization for logging. |  |
LogMode | MODIN_LOG_MODE | disable | Set to enable or disable logging. | (‘enable’, ‘disable’) |
Memory | MODIN_MEMORY |  | How much memory (in bytes) to give to an execution engine. |  |
MetricsMode | MODIN_METRICS_MODE | enable | Set to enable or disable metrics collection. Metric handlers are registered through add_metric_handler and can be used to record graphite-style timings or values. It is the responsibility of the handler to define how those emitted metrics are handled. | (‘enable’, ‘disable’) |
MinColumnPartitionSize | MODIN_MIN_COLUMN_PARTITION_SIZE | 32 | Minimum number of columns in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition. |  |
MinPartitionSize | MODIN_MIN_PARTITION_SIZE | 32 | Minimum number of rows/columns in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition. |  |
MinRowPartitionSize | MODIN_MIN_ROW_PARTITION_SIZE | 32 | Minimum number of rows in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition. |  |
ModinNumpy | MODIN_NUMPY | False | Set to true to use Modin’s implementation of the NumPy API. |  |
NPartitions | MODIN_NPARTITIONS | equal to the MODIN_CPUS value | How many partitions to use for a Modin DataFrame (along each axis). |  |
NativePandasMaxRows | MODIN_NATIVE_MAX_ROWS | 10000000 | Maximum number of rows that can be processed using local, native pandas. |  |
NativePandasTransferThreshold | MODIN_NATIVE_MAX_XFER_ROWS | 10000000 | Targeted maximum number of dataframe rows that should be transferred between engines. This is often the same value as MODIN_NATIVE_MAX_ROWS, but it can be set independently to change how transfer costs are considered. |  |
PersistentPickle | MODIN_PERSISTENT_PICKLE | False | Whether serialization should be persistent. |  |
ProgressBar | MODIN_PROGRESS_BAR | False | Whether or not to show the progress bar. |  |
RangePartitioning | MODIN_RANGE_PARTITIONING | False | Set to true to use Modin’s range-partitioning implementation where possible. Please refer to the documentation for cases where enabling this option would be beneficial: https://modin.readthedocs.io/en/stable/flow/modin/experimental/range_partitioning_groupby.html |  |
RayInitCustomResources | MODIN_RAY_INIT_CUSTOM_RESOURCES |  | Ray node’s custom resources to initialize with. Visit the Ray documentation for more details: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources Notes: If you rely on Modin to initialize Ray, set this config so that Ray is initialized with the custom resources. |  |
RayRedisAddress | MODIN_REDIS_ADDRESS |  | Redis address to connect to when running in a Ray cluster. |  |
RayRedisPassword | MODIN_REDIS_PASSWORD | random string | What password to use for connecting to Redis. |  |
RayTaskCustomResources | MODIN_RAY_TASK_CUSTOM_RESOURCES |  | Ray node’s custom resources to request in tasks or actors. Visit the Ray documentation for more details: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources Notes: You can use this config to limit the parallelism for the entire workflow by setting the config at the very beginning. >>> import modin.config as cfg >>> cfg.RayTaskCustomResources.put({"special_hardware": 0.001}) This way each single remote task or actor will require 0.001 of “special_hardware” to run. You can also use this config to limit the parallelism for a certain operation by setting the config with context. >>> with context(RayTaskCustomResources={"special_hardware": 0.001}): ... df.<op> This way each single remote task or actor will require 0.001 of “special_hardware” to run within the context only. |  |
ReadSqlEngine | MODIN_READ_SQL_ENGINE | Pandas | Engine to run read_sql. | (‘Pandas’, ‘Connectorx’) |
StorageFormat | MODIN_STORAGE_FORMAT | Pandas | Engine to run on a single node of distribution. | (‘Pandas’, ‘Native’) |
TestDatasetSize | MODIN_TEST_DATASET_SIZE |  | Dataset size for running some tests. | (‘Small’, ‘Normal’, ‘Big’) |
TestReadFromPostgres | MODIN_TEST_READ_FROM_POSTGRES | False | Set to true to test reading from Postgres. |  |
TestReadFromSqlServer | MODIN_TEST_READ_FROM_SQL_SERVER | False | Set to true to test reading from SQL Server. |  |
TrackFileLeaks | MODIN_TEST_TRACK_FILE_LEAKS | True | Whether to track open file handle leaks during testing. |  |
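The Options column can be read as the set of accepted values for a config. As a rough sketch, the hypothetical helper check_options below flags environment values that are not among the options listed for two of the configs in the table. It uses exact string matching, which is an assumption: how Modin itself normalizes or rejects values is not described here.

```python
import os

# Options copied from the table for two configs.
ALLOWED = {
    "MODIN_ENGINE": ("Ray", "Dask", "Python", "Unidist", "Native"),
    "MODIN_LAZY_EXECUTION": ("Auto", "On", "Off"),
}


def check_options(environ=None, allowed=ALLOWED):
    """Return {name: value} for variables whose value is not a listed option."""
    environ = os.environ if environ is None else environ
    problems = {}
    for name, options in allowed.items():
        value = environ.get(name)
        # Unset variables are fine: the default from the table applies.
        if value is not None and value not in options:
            problems[name] = value
    return problems


bad = check_options({"MODIN_ENGINE": "Dask", "MODIN_LAZY_EXECUTION": "Sometimes"})
print(bad)
```

A check like this could run before a job is launched, so that a mistyped value is caught early rather than at import time.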
Usage Guide
The example below shows how to interact with Modin configs. A config value can be set either through its environment variable or through the config API.
import os

# Setting the `MODIN_ENGINE` environment variable.
# It can also be set outside the script.
os.environ["MODIN_ENGINE"] = "Dask"

import modin.config
import modin.pandas as pd

# Checking the initially set `Engine` config,
# which corresponds to the `MODIN_ENGINE` environment variable
print(modin.config.Engine.get())  # prints 'Dask'

# Checking the default value of `NPartitions`
# (equal to the CPU count, so the output is machine-dependent, e.g. 8)
print(modin.config.NPartitions.get())

# Changing the value of `NPartitions`
modin.config.NPartitions.put(16)
print(modin.config.NPartitions.get())  # prints 16
One can also use config variables with a context manager in order to use some config only for a certain part of the code:
import modin.config as cfg

# The default value for this config is False
print(cfg.RangePartitioning.get())  # False

# Set the config to True inside the context manager
with cfg.context(RangePartitioning=True):
    print(cfg.RangePartitioning.get())  # True
    df.merge(...)  # `df` is an existing Modin DataFrame; uses the range-partitioning impl

# Once the context is over, the config returns to its previous value
print(cfg.RangePartitioning.get())  # False

# You can also set multiple configs at once by passing several keyword arguments to `cfg.context`
print(cfg.AsyncReadMode.get())  # False
with cfg.context(RangePartitioning=True, AsyncReadMode=True):
    print(cfg.RangePartitioning.get())  # True
    print(cfg.AsyncReadMode.get())  # True
print(cfg.RangePartitioning.get())  # False
print(cfg.AsyncReadMode.get())  # False
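The save-and-restore behavior shown above can be sketched with a stdlib-only stand-in. This is not how cfg.context is implemented. It is a toy model in which the settings live in a plain dictionary, and the names settings and override are hypothetical.

```python
from contextlib import contextmanager

# Toy settings store standing in for the real configs.
settings = {"RangePartitioning": False, "AsyncReadMode": False}


@contextmanager
def override(store, **changes):
    """Temporarily apply `changes` to `store`, restoring old values on exit."""
    # Remember the previous values first; an unknown name raises KeyError
    # before anything has been modified.
    previous = {name: store[name] for name in changes}
    store.update(changes)
    try:
        yield store
    finally:
        # Restore even if the body of the `with` block raised.
        store.update(previous)


with override(settings, RangePartitioning=True, AsyncReadMode=True):
    inside = dict(settings)  # snapshot taken while the overrides are active
after = dict(settings)  # snapshot taken once the block has exited

print(inside)
print(after)
```

Restoring in a finally clause means the previous values come back even when the code inside the block fails, which is the property that makes scoped overrides safer than paired put calls.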