Fetching and Summarizing NWM Gridded Data
Overview
In this guide we’ll demonstrate fetching National Water Model (NWM) operational and retrospective gridded data from cloud storage. This example makes use of a pre-generated Evaluation dataset stored in TEEHR’s examples data module.
Note: For demonstration purposes several cells below are shown in markdown form. If you want to download this notebook and run them yourself, you will need to convert them to code cells.
For a refresher on loading location and location crosswalk data into a new Evaluation refer back to the Loading Local Data and Setting-up a Simple Example user guide pages.
Set up the example Evaluation
from datetime import datetime
from pathlib import Path
import shutil
import hvplot.pandas # noqa
import pandas as pd
from teehr.examples.setup_nwm_grid_example import setup_nwm_example
import teehr
from teehr.evaluation.utils import print_tree
# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
We’ll start with the Evaluation from the previous User Guide, which contains observed (USGS) and simulated (NWM) streamflow for a gage at Radford, VA during Hurricane Helene.
# Define the directory where the Evaluation will be created.
test_eval_dir = Path(Path().home(), "temp", "11_fetch_gridded_nwm_data")
# Setup the example evaluation using data from the TEEHR repository.
shutil.rmtree(test_eval_dir, ignore_errors=True)
setup_nwm_example(tmpdir=test_eval_dir)
# Initialize the evaluation.
ev = teehr.Evaluation(dir_path=test_eval_dir)
ev.locations.to_geopandas()
|  | id | name | geometry |
---|---|---|---|
0 | usgs-03171000 | NEW RIVER AT RADFORD, VA | POINT (-80.56922 37.14179) |
We can add polygons over which to summarize rainfall, letting us see the impact of the forcing data on the NWM streamflow prediction.
location_data_path = Path(test_eval_dir, "three_huc10s_radford.parquet")
ev.locations.load_spatial(
location_data_path,
field_mapping={
"huc10": "id"
},
location_id_prefix="wbd",
write_mode="append" # this is the default
)
Now we have the USGS gage point, and three WBD HUC10 polygons in the locations table.
gdf = ev.locations.to_geopandas()
gdf
|  | id | name | geometry |
---|---|---|---|
0 | wbd-0505000115 | Peak Creek-New River | MULTIPOLYGON (((-80.81589 37.13150, -80.81594 ... |
1 | wbd-0505000117 | Lower Little River | MULTIPOLYGON (((-80.43397 37.11944, -80.43478 ... |
2 | wbd-0505000118 | Back Creek-New River | MULTIPOLYGON (((-80.40439 37.30049, -80.40460 ... |
3 | usgs-03171000 | NEW RIVER AT RADFORD, VA | POINT (-80.56922 37.14179) |
gdf[gdf.id != "usgs-03171000"].hvplot(geo=True, tiles="OSM", alpha=0.5) * \
gdf[gdf.id == "usgs-03171000"].hvplot(geo=True, color="black", size=50)
Now we’ll fetch NWM gridded analysis and assimilation rainfall data and calculate mean areal precipitation (MAP) for each of the HUC10 watersheds. We’ll treat this as the “observed” primary timeseries.
The location ID prefix is used to filter the locations table for WBD polygons during fetching. The calculated MAP timeseries will be appended (the default) to the primary_timeseries table along with the streamflow data, so the primary timeseries table will contain both streamflow and rainfall.
# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
nwm_configuration="forcing_analysis_assim",
output_type="forcing",
variable_name="RAINRATE",
start_date=datetime(2024, 9, 26),
ingest_days=1,
nwm_version="nwm30",
prioritize_analysis_valid_time=True,
t_minus_hours=[0],
calculate_zonal_weights=True,
location_id_prefix="wbd",
timeseries_type="primary" # To be considered "observations". This is the default.
)
primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
|  | value_time | value | unit_name | location_id | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|
0 | 2024-09-26 00:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
1 | 2024-09-26 01:00:00 | 274.956573 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
2 | 2024-09-26 02:00:00 | 270.425873 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
3 | 2024-09-26 03:00:00 | 273.257568 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
4 | 2024-09-26 04:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
Let’s take a look at the cache directory. After fetching the NWM analysis forcing, we have a directory for the weights file.
print_tree(Path(ev.cache_dir, "fetching", "weights"))
└── nwm30_forcing_analysis_assim
└── nwm30_forcing_analysis_assim_pixel_weights.parquet
The weights file contains the fractional coverage for each pixel (row/col pair) that intersects with each polygon location (WBD HUC10 watershed).
cached_weights_filepath = Path(
ev.cache_dir,
"fetching",
"weights",
"nwm30_forcing_analysis_assim",
"nwm30_forcing_analysis_assim_pixel_weights.parquet"
)
pd.read_parquet(cached_weights_filepath)
|  | row | col | weight | location_id |
---|---|---|---|---|
0 | 1730 | 3704 | 0.045272 | wbd-0505000115 |
1 | 1730 | 3705 | 0.246893 | wbd-0505000115 |
2 | 1730 | 3706 | 0.090052 | wbd-0505000115 |
3 | 1730 | 3725 | 0.012040 | wbd-0505000117 |
4 | 1730 | 3726 | 0.001101 | wbd-0505000117 |
... | ... | ... | ... | ... |
1599 | 1774 | 3726 | 0.801455 | wbd-0505000118 |
1600 | 1774 | 3727 | 0.028692 | wbd-0505000118 |
1601 | 1775 | 3724 | 0.377646 | wbd-0505000118 |
1602 | 1775 | 3725 | 0.784341 | wbd-0505000118 |
1603 | 1775 | 3726 | 0.109935 | wbd-0505000118 |
1604 rows × 4 columns
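To see how these weights turn gridded values into MAP, here is a minimal pandas sketch of the zonal-weighting step. The pixel values and weights below are made up for illustration, and TEEHR’s actual implementation may differ in details:

```python
import pandas as pd

# Hypothetical pixel values for a single timestep (rainfall rate per grid cell).
pixels = pd.DataFrame({
    "row": [1730, 1730, 1730],
    "col": [3704, 3705, 3706],
    "value": [2.0, 4.0, 6.0],
})

# Weights table mirroring the cached parquet schema: the fractional coverage
# of each pixel within each polygon.
weights = pd.DataFrame({
    "row": [1730, 1730, 1730],
    "col": [3704, 3705, 3706],
    "weight": [0.25, 0.50, 0.25],
    "location_id": ["wbd-0505000115"] * 3,
})

# Join pixel values to weights, then compute the weighted mean per polygon.
merged = weights.merge(pixels, on=["row", "col"])
merged["weighted"] = merged["value"] * merged["weight"]
map_df = merged.groupby("location_id").agg(
    weighted_sum=("weighted", "sum"),
    weight_sum=("weight", "sum"),
)
map_df["map"] = map_df["weighted_sum"] / map_df["weight_sum"]
print(map_df["map"])  # mean areal precipitation per location
```

With these example numbers, the MAP for the single polygon is simply the coverage-weighted average of the three pixel values.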
Now we can fetch the NWM forecast rainfall and calculate MAP, to compare it to the “observed” rainfall. First we need to create and import the location_crosswalks table, which maps the secondary (“model”) locations to the primary (“observations”) locations.
ev.locations.to_pandas()["id"].unique()
array(['wbd-0505000115', 'wbd-0505000117', 'wbd-0505000118',
'usgs-03171000'], dtype=object)
xwalk_df = pd.DataFrame(
{
"primary_location_id": [
"wbd-0505000115",
"wbd-0505000117",
"wbd-0505000118"
],
"secondary_location_id": [
"forecast-0505000115",
"forecast-0505000117",
"forecast-0505000118"
]
}
)
temp_xwalk_path = Path(ev.cache_dir, "forcing_xwalk.parquet")
xwalk_df.to_parquet(temp_xwalk_path)
Now we can load it into the Evaluation, appending to the existing table.
ev.location_crosswalks.load_parquet(temp_xwalk_path)
ev.location_crosswalks.to_pandas()
|  | primary_location_id | secondary_location_id |
---|---|---|
0 | usgs-03171000 | nwm30-6884666 |
1 | wbd-0505000115 | forecast-0505000115 |
2 | wbd-0505000117 | forecast-0505000117 |
3 | wbd-0505000118 | forecast-0505000118 |
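Conceptually, this crosswalk is what lets primary and secondary values be paired at the same location and time when the tables are joined. A minimal pandas sketch of that pairing, with made-up values (TEEHR performs this join internally, likely with different mechanics):

```python
import pandas as pd

# A one-row crosswalk mapping a secondary location to its primary location.
xwalk = pd.DataFrame({
    "primary_location_id": ["wbd-0505000115"],
    "secondary_location_id": ["forecast-0505000115"],
})

# Hypothetical observed (primary) and forecast (secondary) values.
primary = pd.DataFrame({
    "value_time": pd.to_datetime(["2024-09-26 01:00"]),
    "location_id": ["wbd-0505000115"],
    "value": [1.2],
})
secondary = pd.DataFrame({
    "value_time": pd.to_datetime(["2024-09-26 01:00"]),
    "location_id": ["forecast-0505000115"],
    "value": [0.8],
})

# Map secondary locations to primary via the crosswalk, then join on time.
joined = (
    secondary
    .rename(columns={"location_id": "secondary_location_id", "value": "secondary_value"})
    .merge(xwalk, on="secondary_location_id")
    .merge(
        primary.rename(columns={"location_id": "primary_location_id", "value": "primary_value"}),
        on=["primary_location_id", "value_time"],
    )
)
print(joined[["value_time", "primary_location_id", "primary_value", "secondary_value"]])
```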
Now we can fetch the NWM forecast rainfall, which we’ll consider the “secondary” timeseries. Let’s grab a single short range rainfall forecast.
# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
nwm_configuration="forcing_short_range",
output_type="forcing",
variable_name="RAINRATE",
start_date=datetime(2024, 9, 26),
ingest_days=1,
nwm_version="nwm30",
location_id_prefix="forecast",
calculate_zonal_weights=False, # re-use the weights file in the cache
zonal_weights_filepath=cached_weights_filepath,
starting_z_hour=1,
ending_z_hour=1,
timeseries_type="secondary" # now considered forecast
)
And our secondary_timeseries table consists of NWM medium range streamflow forecasts (member 1), plus the NWM short range rainfall forecast summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).
secondary_df = ev.secondary_timeseries.to_pandas()
secondary_df.head()
|  | value_time | value | unit_name | location_id | member | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|---|
0 | 2024-09-26 07:00:00 | 304.829987 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
1 | 2024-09-26 08:00:00 | 291.509979 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
2 | 2024-09-26 09:00:00 | 273.970001 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
3 | 2024-09-26 10:00:00 | 252.580002 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
4 | 2024-09-26 11:00:00 | 228.429993 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
And our primary_timeseries table consists of USGS streamflow data at the Radford gage, plus NWM analysis rainfall summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).
primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
|  | value_time | value | unit_name | location_id | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|
0 | 2024-09-26 00:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
1 | 2024-09-26 01:00:00 | 274.956573 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
2 | 2024-09-26 02:00:00 | 270.425873 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
3 | 2024-09-26 03:00:00 | 273.257568 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
4 | 2024-09-26 04:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
Now we can create the joined_timeseries table, allowing for efficient metric calculation and further exploration.
ev.joined_timeseries.create()
joined_df = ev.joined_timeseries.to_pandas()
joined_df.head()
|  | value_time | primary_location_id | secondary_location_id | primary_value | secondary_value | unit_name | member | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2024-09-26 01:00:00 | usgs-03171000 | nwm30-6884666 | 274.956573 | 274.630005 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
1 | 2024-09-26 02:00:00 | usgs-03171000 | nwm30-6884666 | 270.425873 | 259.910004 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
2 | 2024-09-26 03:00:00 | usgs-03171000 | nwm30-6884666 | 273.257568 | 241.559998 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
3 | 2024-09-26 04:00:00 | usgs-03171000 | nwm30-6884666 | 286.000153 | 220.289993 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
4 | 2024-09-26 05:00:00 | usgs-03171000 | nwm30-6884666 | 302.990265 | 197.220001 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
joined_df.configuration_name.unique()
array(['nwm30_medium_range', 'nwm30_forcing_short_range'], dtype=object)
# We'll use this helper function for plotting timeseries.
def plot_timeseries(
configuration_name: str,
secondary_by: list[str],
title: str
):
sim_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
hvplot(
x="value_time",
y="secondary_value",
by=secondary_by,
legend=False,
)
obs_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
hvplot(
x="value_time",
y="primary_value",
by=["primary_location_id", "configuration_name"],
legend=False,
color="black"
)
return (sim_plot * obs_plot).options(
width=700,
height=400,
show_grid=True,
title=title
)
We can plot the primary (USGS) vs. secondary (NWM) streamflow and notice that this particular NWM forecast underestimated the flow.
plot_timeseries(
configuration_name="nwm30_medium_range",
secondary_by=["reference_time"],
title="Observed streamflow (black) and forecast streamflow at the Radford gage"
)
Plotting the primary vs. secondary rainfall (MAP) in each watershed, we can see that this NWM forecast underestimated the observed rainfall. This is likely one reason why the NWM streamflow forecast also underestimated the flow.
plot_timeseries(
configuration_name="nwm30_forcing_short_range",
secondary_by=["primary_location_id"],
title="Observed MAP (black) and forecast MAP at the three HUC10s"
)
Now we can also calculate performance metrics for the streamflow and rainfall for this event.
bias = teehr.DeterministicMetrics.RelativeBias()
kge = teehr.DeterministicMetrics.KlingGuptaEfficiency()
ev.metrics.query(
include_metrics=[bias, kge],
group_by=["primary_location_id", "configuration_name"],
order_by=["primary_location_id", "configuration_name"],
).to_pandas()
|  | primary_location_id | configuration_name | relative_bias | kling_gupta_efficiency |
---|---|---|---|---|
0 | usgs-03171000 | nwm30_medium_range | -0.605284 | -0.700095 |
1 | wbd-0505000115 | nwm30_forcing_short_range | -0.624619 | -0.539740 |
2 | wbd-0505000117 | nwm30_forcing_short_range | 0.283773 | 0.228351 |
3 | wbd-0505000118 | nwm30_forcing_short_range | -0.479357 | -0.396829 |
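As a sanity check on what these metrics mean, relative bias and the Kling-Gupta efficiency can be computed directly from paired values using their standard definitions (KGE per Gupta et al., 2009). The sketch below uses a few illustrative value pairs, not the actual evaluation data, and TEEHR’s implementation details may differ slightly:

```python
import numpy as np

def relative_bias(obs, sim):
    # Relative bias: total error normalized by total observed volume.
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return (sim - obs).sum() / obs.sum()

def kling_gupta_efficiency(obs, sim):
    # KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Illustrative paired values where the simulation underestimates the flow.
obs = [286.0, 275.0, 270.4, 273.3, 286.0]
sim = [274.6, 259.9, 241.6, 220.3, 197.2]
print(relative_bias(obs, sim))          # negative -> underestimation
print(kling_gupta_efficiency(obs, sim))
```

A negative relative bias matches what we saw in the plots above: the forecasts underestimated both streamflow and rainfall.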
ev.spark.stop()