Running the Pipeline
Starting a pipeline run
The Pre-Processing Pipeline can be run from the command line using a CWL runner, e.g., cwltool or toil. Below we describe how the pipeline can be run with these runners.
$ cwltool --no-container $PREPROCESS_ROOT/workflows/pipeline.cwl input.json
$ toil-cwl-runner --no-container $PREPROCESS_ROOT/workflows/pipeline.cwl input.json
Here, $PREPROCESS_ROOT refers to the location where the CWL files have been installed; this environment variable must be set before starting a run. The pipeline parameters are provided via a JSON file, described at the bottom of this page. Both cwltool and toil also accept a number of useful command-line arguments, some of which are listed below. Please refer to their respective documentation for a full overview.
Note
Do not forget to add the --no-container option when the dependencies have been installed locally on your system. Besides running the pipeline fully inside a container (as explained in the next section), this is the only supported way to execute the Pre-Processing Pipeline.
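The two invocations above differ only in the runner executable, so a launcher script can assemble either one. The following is a minimal Python sketch and not part of the pipeline itself: the helper name build_run_command and the example install location /opt/preprocess are hypothetical, and the workflow path is assumed to be workflows/pipeline.cwl under $PREPROCESS_ROOT, as in the commands above. It fails early when the environment variable is unset.

```python
import os
import shlex
from pathlib import PurePosixPath


def build_run_command(runner, input_json, env=None):
    """Return the argument list for a local (non-container) pipeline run.

    `runner` is the CWL runner executable, e.g. "cwltool" or "toil-cwl-runner".
    Illustrative helper only; it is not shipped with the pipeline.
    """
    env = os.environ if env is None else env
    root = env.get("PREPROCESS_ROOT")
    if not root:
        # The workflow path cannot be resolved without this variable.
        raise RuntimeError("PREPROCESS_ROOT is not set")
    workflow = PurePosixPath(root) / "workflows" / "pipeline.cwl"
    return [runner, "--no-container", str(workflow), str(input_json)]


# Hypothetical install location, passed explicitly for illustration.
example = build_run_command(
    "cwltool", "input.json", env={"PREPROCESS_ROOT": "/opt/preprocess"}
)
print(shlex.join(example))
```

The resulting list can be handed to subprocess.run(example, check=True) to start the run; leaving out the env argument makes the helper read the real environment instead.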
Starting a run from within a container
If you followed the Docker installation instructions on the Downloading and Installation page, you can run the container using Docker as follows:
$ docker run --rm -v <source_directory>:<mount_point> -w <mount_point> preprocess cwltool --preserve-entire-environment --no-container /usr/local/share/prep/workflows/pipeline.cwl input.json
Since the Pre-Processing Pipeline is running inside a container, do not forget to add the --no-container option to your CWL runner.
cwltool options
There are a number of command-line options you might want to consider adding when running cwltool:
--outdir: specifies the (relative) path to the directory in which the output of the pipeline is placed (make sure to mount this directory when running the pipeline in a container)
--log-dir: specifies the location of the log files produced from the stdout and stderr of a CommandLineTool (make sure to mount this directory when running the pipeline in a container)
--preserve-entire-environment: pass your system's environment variables on to the jobs; use this when the dependencies have been installed manually (or when running the pipeline inside a container)
--no-container: do not execute jobs in a container (add this when the dependencies have been installed manually or when running fully inside a container)
--singularity: use the Apptainer (previously Singularity) runtime for running containers instead of Docker
--debug: more verbose output, useful when debugging
Make sure to mount the output and log directories, specified by --outdir and --log-dir, when running the pipeline inside a container to ensure the files are not lost after execution.
A full overview of the CLI arguments is available in the cwltool documentation.
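One way to satisfy the mounting requirement is to give --outdir and --log-dir relative paths, so that they resolve inside the container working directory, which is itself the mounted host directory. The sketch below assembles such a docker run invocation from the command shown earlier. The function name, the default directory names results and logs, and the example host and container paths are assumptions made for illustration.

```python
import shlex

# Workflow location inside the container image, taken from the docker command above.
WORKFLOW = "/usr/local/share/prep/workflows/pipeline.cwl"


def build_docker_command(source_dir, mount_point, outdir="results",
                         log_dir="logs", image="preprocess",
                         input_json="input.json"):
    """Assemble a `docker run` call that keeps outputs and logs on the host.

    `outdir` and `log_dir` must be relative: they are resolved against the
    container working directory (`-w`), which is the bind-mounted host folder.
    """
    for name in (outdir, log_dir):
        if name.startswith("/"):
            raise ValueError(f"{name!r} must be relative to the mount point")
    return [
        "docker", "run", "--rm",
        "-v", f"{source_dir}:{mount_point}",
        "-w", mount_point,
        image,
        "cwltool", "--preserve-entire-environment", "--no-container",
        "--outdir", outdir,
        "--log-dir", log_dir,
        WORKFLOW, input_json,
    ]


# Hypothetical host and container paths, for illustration only.
print(shlex.join(build_docker_command("/data/run1", "/work")))
```

With these arguments the output and log files end up under /data/run1/results and /data/run1/logs on the host and remain available after the container has been removed.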
toil options
Similarly, these options might be of interest when using toil:
--outdir: specifies the path to the directory in which the output of the pipeline is placed
--workDir: specifies the path to the directory where the temporary files generated by Toil should be placed
--log-dir: specifies the location of the log files produced from the stdout and stderr of a CommandLineTool
--logFile: path to the main log file
--jobStore: path to the Toil job store (must not exist yet)
--batchSystem: use a specific batch system of an HPC cluster (e.g., slurm or single_machine)
--preserve-entire-environment: pass your system's environment variables on to the jobs; use this when the dependencies have been installed manually
--no-container: do not execute jobs in a container (add this when the dependencies have been installed manually or when running fully inside a container)
--singularity: use the Apptainer (previously Singularity) runtime for running containers instead of Docker
--stats: with this option Toil collects runtime statistics (they can be used by toil stats)
Make sure to mount the output, log, and working directories when running the pipeline inside a container to ensure the files are not lost after execution.
A full overview of the CLI arguments is available in the Toil documentation.
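As an illustration of how these options combine, the sketch below builds a toil-cwl-runner argument list and refuses to continue when the job store path already exists, mirroring the requirement stated above. The function name and its defaults are hypothetical, and only options listed on this page are used.

```python
import os
import tempfile


def build_toil_command(workflow, input_json, job_store, work_dir, outdir,
                       batch_system="single_machine", stats=False):
    """Assemble a toil-cwl-runner invocation from the options listed above."""
    if os.path.exists(job_store):
        # The job store must not exist yet when a run is started.
        raise FileExistsError(f"job store already exists: {job_store}")
    cmd = [
        "toil-cwl-runner",
        "--no-container",
        "--jobStore", job_store,
        "--workDir", work_dir,
        "--outdir", outdir,
        "--batchSystem", batch_system,
    ]
    if stats:
        cmd.append("--stats")
    # The workflow and the input file come last, after all options.
    return cmd + [workflow, input_json]


# Demonstration in a scratch directory so no real data is touched.
scratch = tempfile.mkdtemp()
demo = build_toil_command(
    "pipeline.cwl", "input.json",
    job_store=os.path.join(scratch, "jobstore"),
    work_dir=scratch,
    outdir=os.path.join(scratch, "results"),
    stats=True,
)
print(" ".join(demo))
```

Checking the job store up front gives a clearer error than letting the runner fail later; switching batch_system to slurm changes only the value passed to --batchSystem.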
Configuring the pipeline
The parameters of the pipeline are provided as a JSON file. As an example, a minimal input could be a list of MeasurementSets (MSs) that you would like to process:
{
    "msin": [
        {
            "class": "Directory",
            "path": "/data/L888536_SAP000_SB026_uv.MS"
        },
        {
            "class": "Directory",
            "path": "/data/L888536_SAP000_SB027_uv.MS"
        }
    ]
}
Refer to the Overview of the Pipeline section for a full overview of all pipeline parameters and their default values.
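For observations with many subbands, writing the msin list by hand becomes tedious. The following sketch generates a minimal input file from a directory of MeasurementSets. The function names and the *.MS glob pattern are assumptions made for illustration; the output follows the structure of the minimal example above.

```python
import json
from pathlib import Path


def build_input(ms_dir, pattern="*.MS"):
    """Return a pipeline input dict listing every MeasurementSet in `ms_dir`."""
    # MeasurementSets are directories, so plain files matching the pattern are skipped.
    ms_paths = sorted(p for p in Path(ms_dir).glob(pattern) if p.is_dir())
    if not ms_paths:
        raise FileNotFoundError(
            f"no MeasurementSets matching {pattern!r} in {ms_dir}"
        )
    return {
        "msin": [
            {"class": "Directory", "path": str(p.resolve())} for p in ms_paths
        ]
    }


def write_input(ms_dir, output="input.json"):
    """Write the generated input to `output` as indented JSON."""
    Path(output).write_text(json.dumps(build_input(ms_dir), indent=2) + "\n")
```

Calling write_input("/data") would produce an input.json listing every MeasurementSet found directly under /data, sorted by name, which can then be passed to the CWL runner as shown at the top of this page.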