Distributed computing using Beam on GCP

Since the input datasets can be large, distributed computing is key for efficient evaluation. Here is a guide for running the command-line scripts on Google Cloud Dataflow.

Local execution

To run the scripts locally, choose the DirectRunner as your Beam runner. The direct runner also gives you an option for how many local workers to use, e.g.:

python evaluate.py \
  --forecast_path=gs://weatherbench2/datasets/hres/2016-2022-0012-64x32_equiangular_with_poles_conservative.zarr \
  --obs_path=gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_with_poles_conservative.zarr \
  --climatology_path=gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_64x32_equiangular_with_poles_conservative.zarr \
  --output_dir=./ \
  --output_file_prefix=hres_vs_era_2020_ \
  --by_init=True \
  --input_chunks=init_time=1 \
  --eval_configs=deterministic \
  --time_start=2020-01-01 \
  --time_stop=2020-12-31 \
  --variables=geopotential \
  --use_beam=True \
  --beam_runner=DirectRunner \
  -- \
  --direct_num_workers 2
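
In addition to --direct_num_workers, the direct runner can be switched from its default in-memory mode to multi-process execution with --direct_running_mode, which may help with memory-heavy evaluations. Both are standard Beam direct runner options rather than WeatherBench 2 flags, and the values below are only illustrative:

python evaluate.py \
  --forecast_path=gs://weatherbench2/datasets/hres/2016-2022-0012-64x32_equiangular_with_poles_conservative.zarr \
  --obs_path=gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_with_poles_conservative.zarr \
  --climatology_path=gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_64x32_equiangular_with_poles_conservative.zarr \
  --output_dir=./ \
  --output_file_prefix=hres_vs_era_2020_ \
  --by_init=True \
  --input_chunks=init_time=1 \
  --eval_configs=deterministic \
  --time_start=2020-01-01 \
  --time_stop=2020-12-31 \
  --variables=geopotential \
  --use_beam=True \
  --beam_runner=DirectRunner \
  -- \
  --direct_running_mode multi_processing \
  --direct_num_workers 4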

For a full list of direct runner configuration options, please review the Beam direct runner documentation: beam.apache.org/documentation/runners/direct/.

Dataflow execution

Make sure to install the package including the GCP requirements:

pip install -e .[gcp]
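
A minimal setup sketch, assuming you install from a local checkout of the repository and have the gcloud CLI available; the bucket name and region are placeholders to replace with your own:

# Authenticate so that Beam can access Cloud Storage and Dataflow on your behalf.
gcloud auth login
gcloud auth application-default login

# Install WeatherBench 2 with the GCP extras from a local checkout.
git clone https://github.com/google-research/weatherbench2.git
cd weatherbench2
pip install -e .[gcp]

# Create a bucket for outputs and temporary files (name and region are examples).
gsutil mb -l us-central1 gs://my-weatherbench2-outputs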

To run on Google Cloud Dataflow, use the DataflowRunner. Additionally, a few parameters need to be specified; they are listed below, and a short sketch of setting the shell variables used by the example run follows the list.

  • --runner: The PipelineRunner to use. This field can be either DirectRunner or DataflowRunner. Default: DirectRunner (local mode)

  • --project: The project ID for your Google Cloud Project. This is required if you want to run your pipeline using the Dataflow managed service (i.e. DataflowRunner).

  • --temp_location: Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.

  • --region: Specifies a regional endpoint for deploying your Dataflow jobs. Default: us-central1.

  • --job_name: The name of the Dataflow job being executed as it appears in Dataflow’s jobs list and job details.

  • --setup_file: To make sure all the required packages are installed on the workers, pass the repository’s setup.py file to Dataflow.
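
The example run below refers to a few shell variables; how you set them is up to you, and the values here are purely illustrative:

# Example values only -- substitute your own project, region, and bucket.
export PROJECT=my-gcp-project
export REGION=us-central1
export BUCKET=my-weatherbench2-outputs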

Example run:

python scripts/evaluate.py \
  --forecast_path=gs://weatherbench2/datasets/hres/2016-2022-0012-64x32_equiangular_with_poles_conservative.zarr \
  --obs_path=gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_with_poles_conservative.zarr \
  --climatology_path=gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_64x32_equiangular_with_poles_conservative.zarr \
  --output_dir=gs://$BUCKET \
  --output_file_prefix=hres_vs_era_2020_ \
  --by_init=True \
  --input_chunks=init_time=10 \
  --eval_configs=deterministic \
  --time_start=2020-01-01 \
  --time_stop=2020-12-31 \
  --variables=geopotential \
  --use_beam=True \
  --runner=DataflowRunner \
  -- \
  --project $PROJECT \
  --region $REGION \
  --temp_location gs://$BUCKET/tmp/ \
  --setup_file=./setup.py
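
For example, a job name and worker resources can be set by appending further pipeline options after the -- separator. The four options added at the end are standard Beam/Dataflow pipeline options rather than WeatherBench 2 flags, and the specific job name, worker cap, machine type, and disk size are only illustrative:

python scripts/evaluate.py \
  --forecast_path=gs://weatherbench2/datasets/hres/2016-2022-0012-64x32_equiangular_with_poles_conservative.zarr \
  --obs_path=gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_with_poles_conservative.zarr \
  --climatology_path=gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_64x32_equiangular_with_poles_conservative.zarr \
  --output_dir=gs://$BUCKET \
  --output_file_prefix=hres_vs_era_2020_ \
  --by_init=True \
  --input_chunks=init_time=10 \
  --eval_configs=deterministic \
  --time_start=2020-01-01 \
  --time_stop=2020-12-31 \
  --variables=geopotential \
  --use_beam=True \
  --runner=DataflowRunner \
  -- \
  --project $PROJECT \
  --region $REGION \
  --temp_location gs://$BUCKET/tmp/ \
  --setup_file=./setup.py \
  --job_name hres-vs-era5-2020 \
  --max_num_workers 32 \
  --machine_type n2-highmem-4 \
  --disk_size_gb 50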

For a full list of how to configure the Dataflow pipeline, please review the Dataflow pipeline options reference: cloud.google.com/dataflow/docs/reference/pipeline-options.

Monitoring

When running on Dataflow, you can monitor jobs through the Dataflow UI in the Google Cloud console, or via Dataflow’s CLI commands.

For example, to see all outstanding Dataflow jobs, simply run:

gcloud dataflow jobs list
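
The listing can be narrowed with standard gcloud flags, for example to show only currently active jobs in a single region (the region value is illustrative):

gcloud dataflow jobs list --status=active --region=$REGION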

To describe stats about a particular Dataflow job, run:

gcloud dataflow jobs describe $JOBID
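
If you only need the job’s current state, it can be extracted with a gcloud output format; currentState is one of the fields in the returned job description, and the region flag mirrors the one used to launch the job:

gcloud dataflow jobs describe $JOBID --region=$REGION --format="value(currentState)"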

In addition, Dataflow provides a series of Beta CLI commands.

These can be used to keep track of job metrics, like so:

JOBID=<enter job id here>
gcloud beta dataflow metrics list $JOBID --source=user

You can even view logs via the beta commands:

gcloud beta dataflow logs list $JOBID
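
The listed logs can also be filtered by severity, for example to show only the more detailed worker messages:

gcloud beta dataflow logs list $JOBID --importance=detailed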

Spark execution

Users may want to run WeatherBench 2 scripts on clouds other than GCP. To accomplish this, we recommend using a managed or self-hosted Apache Spark runner: beam.apache.org/documentation/runners/spark/.

Once you have a Spark cluster set up (specifically, one running on the JVM), please follow the above documentation to configure the PortableRunner on the cluster.
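
As a rough sketch, once a Beam job service for the cluster is running and reachable (here assumed at localhost:8099), the scripts can be pointed at it with Beam’s portable runner pipeline options. The endpoint, environment type, and reuse of the local example’s flags below are assumptions to adapt to your setup, not a tested configuration:

python evaluate.py \
  --forecast_path=gs://weatherbench2/datasets/hres/2016-2022-0012-64x32_equiangular_with_poles_conservative.zarr \
  --obs_path=gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_with_poles_conservative.zarr \
  --climatology_path=gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_64x32_equiangular_with_poles_conservative.zarr \
  --output_dir=./ \
  --output_file_prefix=hres_vs_era_2020_ \
  --by_init=True \
  --input_chunks=init_time=1 \
  --eval_configs=deterministic \
  --time_start=2020-01-01 \
  --time_stop=2020-12-31 \
  --variables=geopotential \
  --use_beam=True \
  -- \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK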