Fetching and Summarizing NWM Gridded Data#

Overview#

In this guide we’ll demonstrate fetching National Water Model (NWM) operational and retrospective gridded data from cloud storage. This example makes use of a pre-generated Evaluation dataset stored in TEEHR’s examples data module.

Note: For demonstration purposes several cells below are shown in markdown form. If you want to download this notebook and run them yourself, you will need to convert them to code cells.

For a refresher on loading location and location crosswalk data into a new Evaluation refer back to the Loading Local Data and Setting-up a Simple Example user guide pages.

Set up the example Evaluation#

from datetime import datetime
from pathlib import Path
import shutil
import hvplot.pandas  # noqa
import pandas as pd

from teehr.examples.setup_nwm_grid_example import setup_nwm_example
import teehr
from teehr.evaluation.utils import print_tree

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...

We’ll start with the Evaluation from the previous User Guide, which contains observed (USGS) and simulation (NWM) streamflow for a gage at Radford, VA during hurricane Helene.

# Define the directory where the Evaluation will be created.
test_eval_dir = Path(Path().home(), "temp", "11_fetch_gridded_nwm_data")

# Setup the example evaluation using data from the TEEHR repository.
shutil.rmtree(test_eval_dir, ignore_errors=True)
setup_nwm_example(tmpdir=test_eval_dir)

# Initialize the evaluation.
ev = teehr.Evaluation(dir_path=test_eval_dir)
Hide code cell output
ev.locations.to_geopandas()
id name geometry
0 usgs-03171000 NEW RIVER AT RADFORD, VA POINT (-80.56922 37.14179)

We can add polygons to summarize rainfall to, to see the impact of the forcing data on the NWM streamflow prediction

location_data_path = Path(test_eval_dir, "three_huc10s_radford.parquet")

ev.locations.load_spatial(
    location_data_path,
    field_mapping={
        "huc10": "id"
    },
    location_id_prefix="wbd",
    write_mode="append"  # this is the default
)

Now we have the USGS gage point, and three WBD HUC10 polygons in the locations table.

gdf = ev.locations.to_geopandas()
gdf
id name geometry
0 wbd-0505000115 Peak Creek-New River MULTIPOLYGON (((-80.81589 37.13150, -80.81594 ...
1 wbd-0505000117 Lower Little River MULTIPOLYGON (((-80.43397 37.11944, -80.43478 ...
2 wbd-0505000118 Back Creek-New River MULTIPOLYGON (((-80.40439 37.30049, -80.40460 ...
3 usgs-03171000 NEW RIVER AT RADFORD, VA POINT (-80.56922 37.14179)
gdf[gdf.id != "usgs-03171000"].hvplot(geo=True, tiles="OSM", alpha=0.5) * \
    gdf[gdf.id == "usgs-03171000"].hvplot(geo=True, color="black", size=50)

Now we’ll fetch NWM gridded analysis and assimilation rainfall data and calculate mean areal precipitation (MAP) for each of the HUC10 watersheds. We’ll consider this “observed” primary timeseries.

The location ID prefix is used to filter the locations table for WBD polygons during fetching. The calculated MAP timeseries will be appended (the default) to the primary_timeseries table along with the streamflow data (the primary timeseries table will contain both streamflow and rainfall).

# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
    nwm_configuration="forcing_analysis_assim",
    output_type="forcing",
    variable_name="RAINRATE",
    start_date=datetime(2024, 9, 26),
    ingest_days=1,
    nwm_version="nwm30",
    prioritize_analysis_valid_time=True,
    t_minus_hours=[0],
    calculate_zonal_weights=True,
    location_id_prefix="wbd",
    timeseries_type="primary"  # To be considered "observations". This is the default.
)
primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
value_time value unit_name location_id configuration_name variable_name reference_time
0 2024-09-26 00:00:00 286.000153 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
1 2024-09-26 01:00:00 274.956573 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
2 2024-09-26 02:00:00 270.425873 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
3 2024-09-26 03:00:00 273.257568 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
4 2024-09-26 04:00:00 286.000153 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT

Let’s take a look at the cache directory. After fetching the NWM analysis forcing, we have a directory for the weights file.

print_tree(Path(ev.cache_dir, "fetching", "weights"))
└── nwm30_forcing_analysis_assim
    └── nwm30_forcing_analysis_assim_pixel_weights.parquet

The weights file contains the fractional coverage for each pixel (row/col pair) that intersects with each polygon location (WBD HUC10 watershed).

cached_weights_filepath = Path(
        ev.cache_dir,
        "fetching",
        "weights",
        "nwm30_forcing_analysis_assim",
        "nwm30_forcing_analysis_assim_pixel_weights.parquet"
)
pd.read_parquet(cached_weights_filepath)
row col weight location_id
0 1730 3704 0.045272 wbd-0505000115
1 1730 3705 0.246893 wbd-0505000115
2 1730 3706 0.090052 wbd-0505000115
3 1730 3725 0.012040 wbd-0505000117
4 1730 3726 0.001101 wbd-0505000117
... ... ... ... ...
1599 1774 3726 0.801455 wbd-0505000118
1600 1774 3727 0.028692 wbd-0505000118
1601 1775 3724 0.377646 wbd-0505000118
1602 1775 3725 0.784341 wbd-0505000118
1603 1775 3726 0.109935 wbd-0505000118

1604 rows × 4 columns

Now we can fetch the NWM forecast rainfall and calculate MAP, to compare it to the “observed” rainfall. First we need to create and import the location_crosswalks table, which maps the secondary locations (“model”) to the primary (“observations”).

ev.locations.to_pandas()["id"].unique()
array(['wbd-0505000115', 'wbd-0505000117', 'wbd-0505000118',
       'usgs-03171000'], dtype=object)
xwalk_df = pd.DataFrame(
    {
        "primary_location_id": [
            "wbd-0505000115",
            "wbd-0505000117",
            "wbd-0505000118"
        ],
        "secondary_location_id": [
            "forecast-0505000115",
            "forecast-0505000117",
            "forecast-0505000118"
        ]
    }
)
temp_xwalk_path = Path(ev.cache_dir, "forcing_xwalk.parquet")
xwalk_df.to_parquet(temp_xwalk_path)

Now we can load it into the Evaluation, appending to the existing table.

ev.location_crosswalks.load_parquet(temp_xwalk_path)
ev.location_crosswalks.to_pandas()
primary_location_id secondary_location_id
0 usgs-03171000 nwm30-6884666
1 wbd-0505000115 forecast-0505000115
2 wbd-0505000117 forecast-0505000117
3 wbd-0505000118 forecast-0505000118

Now we can fetch the NWM forecast rainfall, which we’ll consider the “secondary” timeseries. Let’s grab a single short range rainfall forecast

# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
    nwm_configuration="forcing_short_range",
    output_type="forcing",
    variable_name="RAINRATE",
    start_date=datetime(2024, 9, 26),
    ingest_days=1,
    nwm_version="nwm30",
    location_id_prefix="forecast",
    calculate_zonal_weights=False,  # re-use the weights file in the cache
    zonal_weights_filepath=cached_weights_filepath,
    starting_z_hour=1,
    ending_z_hour=1,
    timeseries_type="secondary"  # now considered forecast
)

And our secondary_timeseries table consists of NWM medium range streamflow forecasts (member 1), plus NWM short range rainfall forecast summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).

secondary_df = ev.secondary_timeseries.to_pandas()
secondary_df.head()
value_time value unit_name location_id member configuration_name variable_name reference_time
0 2024-09-26 07:00:00 304.829987 m^3/s nwm30-6884666 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26 06:00:00
1 2024-09-26 08:00:00 291.509979 m^3/s nwm30-6884666 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26 06:00:00
2 2024-09-26 09:00:00 273.970001 m^3/s nwm30-6884666 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26 06:00:00
3 2024-09-26 10:00:00 252.580002 m^3/s nwm30-6884666 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26 06:00:00
4 2024-09-26 11:00:00 228.429993 m^3/s nwm30-6884666 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26 06:00:00

And our primary_timeseries table consists of USGS streamflow data at the Radford gage, plus NWM analysis rainfall summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).

primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
value_time value unit_name location_id configuration_name variable_name reference_time
0 2024-09-26 00:00:00 286.000153 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
1 2024-09-26 01:00:00 274.956573 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
2 2024-09-26 02:00:00 270.425873 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
3 2024-09-26 03:00:00 273.257568 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT
4 2024-09-26 04:00:00 286.000153 m^3/s usgs-03171000 usgs_observations streamflow_hourly_inst NaT

Now we can create the joined_timeseries table allowing for efficient metric calculation and further exploration.

ev.joined_timeseries.create()

joined_df = ev.joined_timeseries.to_pandas()
joined_df.head()
value_time primary_location_id secondary_location_id primary_value secondary_value unit_name member configuration_name variable_name reference_time
0 2024-09-26 01:00:00 usgs-03171000 nwm30-6884666 274.956573 274.630005 m^3/s 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26
1 2024-09-26 02:00:00 usgs-03171000 nwm30-6884666 270.425873 259.910004 m^3/s 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26
2 2024-09-26 03:00:00 usgs-03171000 nwm30-6884666 273.257568 241.559998 m^3/s 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26
3 2024-09-26 04:00:00 usgs-03171000 nwm30-6884666 286.000153 220.289993 m^3/s 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26
4 2024-09-26 05:00:00 usgs-03171000 nwm30-6884666 302.990265 197.220001 m^3/s 1 nwm30_medium_range streamflow_hourly_inst 2024-09-26
joined_df.configuration_name.unique()
array(['nwm30_medium_range', 'nwm30_forcing_short_range'], dtype=object)
# We'll use this helper function for plotting timeseries.
def plot_timeseries(
    configuration_name: str,
    secondary_by: list[str],
    title: str
):
    sim_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
        hvplot(
            x="value_time",
            y="secondary_value",
            by=secondary_by,
            legend=False,
            )
    obs_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
        hvplot(
            x="value_time",
            y="primary_value",
            by=["primary_location_id", "configuration_name"],
            legend=False,
            color="black"
            )
    return (sim_plot * obs_plot).options(
                width=700,
                height=400,
                show_grid=True,
                title=title
            )

We can plot the primary (USGS) vs. secondary (NWM) streamflow and notice that this particular NWM forecast underestimated the flow.

plot_timeseries(
    configuration_name="nwm30_medium_range",
    secondary_by=["reference_time"],
    title="Observed streamflow (black) and forecast streamflow at the Radford gage"
)

In plotting the primary vs. secondary rainfall (MAP) in each watershed, we can see the this NWM forecast underestimated the observed rainfall. This is likely one reason why the NWM streamflow forecasts also underestimated the flow.

plot_timeseries(
    configuration_name="nwm30_forcing_short_range",
    secondary_by=["primary_location_id"],
    title="Observed MAP (black) and forecast MAP at the three HUC10s"
)

Now we can also calculate performance metrics for the streamflow and rainfall for this event.

bias = teehr.DeterministicMetrics.RelativeBias()
kge = teehr.DeterministicMetrics.KlingGuptaEfficiency()

ev.metrics.query(
    include_metrics=[bias, kge],
    group_by=["primary_location_id", "configuration_name"],
    order_by=["primary_location_id", "configuration_name"],
).to_pandas()
primary_location_id configuration_name relative_bias kling_gupta_efficiency
0 usgs-03171000 nwm30_medium_range -0.605284 -0.700095
1 wbd-0505000115 nwm30_forcing_short_range -0.624619 -0.539740
2 wbd-0505000117 nwm30_forcing_short_range 0.283773 0.228351
3 wbd-0505000118 nwm30_forcing_short_range -0.479357 -0.396829
ev.spark.stop()