ParallelRunner
kedro.runner.ParallelRunner ¶
ParallelRunner(max_workers=None, is_async=False)
Bases: AbstractRunner
ParallelRunner
is an AbstractRunner
implementation. It can
be used to run the Pipeline
in parallel groups formed by toposort.
Please note that this runner
implementation validates dataset using the
_validate_catalog
method, which checks if any of the datasets are
single process only using the _SINGLE_PROCESS
dataset attribute.
Parameters:
-
max_workers
(int | None
, default:None
) –Number of worker processes to spawn. If not set, calculated automatically based on the pipeline configuration and CPU core count. On windows machines, the max_workers value cannot be larger than 61 and will be set to min(61, max_workers).
-
is_async
(bool
, default:False
) –If True, the node inputs and outputs are loaded and saved asynchronously with threads. Defaults to False.
Raises: ValueError: bad parameters passed
Source code in kedro/runner/parallel_runner.py
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 |
|
__del__ ¶
__del__()
Source code in kedro/runner/parallel_runner.py
71 72 |
|
_get_executor ¶
_get_executor(max_workers)
Source code in kedro/runner/parallel_runner.py
119 120 121 122 123 124 |
|
_get_required_workers_count ¶
_get_required_workers_count(pipeline)
Calculate the max number of processes required for the pipeline, limit to the number of CPU cores.
Source code in kedro/runner/parallel_runner.py
105 106 107 108 109 110 111 112 113 114 115 116 117 |
|
_run ¶
_run(pipeline, catalog, hook_manager=None, run_id=None)
The method implementing parallel pipeline running.
Parameters:
-
pipeline
(Pipeline
) –The
Pipeline
to run. -
catalog
(SharedMemoryCatalogProtocol
) –An implemented instance of
SharedMemoryCatalogProtocol
from which to fetch data. -
hook_manager
(PluginManager | None
, default:None
) –The
PluginManager
to activate hooks. -
run_id
(str | None
, default:None
) –The id of the run.
Raises:
-
AttributeError
–When the provided pipeline is not suitable for parallel execution.
-
RuntimeError
–If the runner is unable to schedule the execution of all pipeline nodes.
-
Exception
–In case of any downstream node failure.
Source code in kedro/runner/parallel_runner.py
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 |
|
_set_manager_datasets ¶
_set_manager_datasets(catalog)
Source code in kedro/runner/parallel_runner.py
102 103 |
|
_validate_catalog
classmethod
¶
_validate_catalog(catalog)
Ensure that all datasets are serialisable and that we do not have any non proxied memory datasets being used as outputs as their content will not be synchronized across threads.
Source code in kedro/runner/parallel_runner.py
94 95 96 97 98 99 100 |
|
_validate_nodes
classmethod
¶
_validate_nodes(nodes)
Ensure all tasks are serialisable.
Source code in kedro/runner/parallel_runner.py
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
|