class weatherbench2.metrics.EnergyScore(ensemble_dim='realization')

The Energy Score, together with its spread and skill components.

Given a ground-truth random vector Y and two iid predictions X, X’, the Energy Score is defined as

ES = E‖X - Y‖ - 0.5 * E‖X - X’‖

where E is mathematical expectation, and ‖⋅‖ is a weighted L2 norm. ES has a unique minimum when X is distributed the same as Y.
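As an illustrative sketch (not weatherbench2's implementation), the definition above can be estimated by Monte Carlo for a single scalar variable, where the weighted L2 norm reduces to an absolute value:

```python
import numpy as np

def energy_score(ensemble: np.ndarray, truth: float) -> float:
    """Monte Carlo estimate of ES = E|X - Y| - 0.5 * E|X - X'| for one
    scalar variable. Illustrative only; the library works on gridded
    fields with an area-weighted norm."""
    # Skill term: E|X - Y|, averaged over ensemble members.
    skill = np.mean(np.abs(ensemble - truth))
    # Spread term: E|X - X'|, from all ordered pairs of distinct members
    # (the zero diagonal is excluded from the count).
    n = len(ensemble)
    diffs = np.abs(ensemble[:, None] - ensemble[None, :])
    spread = diffs.sum() / (n * (n - 1))
    return skill - 0.5 * spread
```

For the two-member ensemble [0, 2] and truth 1, the skill and half-spread terms cancel exactly, giving a score of 0.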

The associated spread/skill ratio is

SS(ES) = E‖X - X’‖ / E‖X - Y‖.

Assuming Y is non-constant, SS(ES) = 0 only when X is constant. By the triangle inequality, ‖X - X’‖ ≤ ‖X - Y‖ + ‖Y - X’‖, and since X, X’ are iid, taking expectations gives E‖X - X’‖ ≤ 2 E‖X - Y‖; thus 0 ≤ SS(ES) ≤ 2. If X has the same distribution as Y, SS(ES) = 1. Caution: it is possible for SS(ES) = 1 even when X and Y have different distributions.
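A quick Monte Carlo check of the SS(ES) = 1 case, again for a scalar variable with an unweighted absolute-value norm (an illustrative stand-in for the library's area-weighted norm):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
ens = rng.normal(size=n)     # ensemble draws X ~ N(0, 1)
truths = rng.normal(size=n)  # truth draws Y from the same distribution

# Monte Carlo estimates of E|X - X'| (spread) and E|X - Y| (skill).
spread = np.abs(ens[:, None] - ens[None, :]).sum() / (n * (n - 1))
skill = np.mean(np.abs(ens[:, None] - truths[None, :]))

ss = spread / skill  # with matching distributions this should be near 1
```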

In our case, each prediction is conditioned on the start time t. Given T different start times, this class estimates time and ensemble averaged quantities for each tendency “V”, producing entries

V_spread := (1 / T) Σₜ ‖Xₜ - Xₜ’‖

V_skill := (1 / T) Σₜ ‖Xₜ - Yₜ‖

V_score := V_skill - 0.5 * V_spread

‖⋅‖ is the area-averaged L2 norm. Estimation is done separately for each tendency, level, and lag time. So correlations between tendency/level/lag are ignored.

If N ensemble members are available, we estimate the spread with N-1 adjacent differences. This strikes a balance between memory usage and variance reduction.

E‖Xₜ - Xₜ’‖ ≈ (1 / (N-1)) Σₙ ‖Xₜ[n] - Xₜ[n+1]‖

So long as 2 or more ensemble members are used, the estimates of spread, skill and ES are unbiased at each time. Therefore, assuming some ergodicity, one can average over many time points and obtain highly accurate estimates.
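The adjacent-differences estimator above can be sketched as follows. This is an assumption-laden illustration: it uses a plain (unweighted) spatial L2 norm in place of the area-averaged norm, and the function name `adjacent_spread` is hypothetical, not part of the library:

```python
import numpy as np

def adjacent_spread(members: np.ndarray) -> float:
    """Estimate E||X - X'|| from N-1 adjacent member differences.

    members: array of shape (N, *grid) with the ensemble along axis 0.
    Uses an unweighted spatial L2 norm as a stand-in for the
    area-averaged norm."""
    diffs = members[1:] - members[:-1]  # the N-1 adjacent differences
    spatial_axes = tuple(range(1, diffs.ndim))
    norms = np.sqrt(np.mean(diffs**2, axis=spatial_axes))
    return float(norms.mean())
```

For three members spaced two apart at every grid point, each adjacent difference has norm 2, so the estimate is exactly 2.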

NaN values propagate through and result in NaN in the corresponding output position.

References: Gneiting, T. and Raftery, A. E. (2007), “Strictly Proper Scoring Rules, Prediction, and Estimation”, Journal of the American Statistical Association.



Parameters:

ensemble_dim (str) – Name of the dimension indexing ensemble members (default: 'realization').




compute(forecast, truth[, region])

Evaluate this metric on datasets with full temporal coverage.

compute_chunk(forecast, truth[, region])

Energy score, averaged over space, for a time chunk of data.