Fetching and Summarizing NWM Gridded Data
Overview
In this guide we’ll demonstrate fetching National Water Model (NWM) operational and retrospective gridded data from cloud storage. This example makes use of a pre-generated Evaluation dataset stored in TEEHR’s examples data module.
Note: For demonstration purposes several cells below are shown in markdown form. If you want to download this notebook and run them yourself, you will need to convert them to code cells.
For a refresher on loading location and location crosswalk data into a new Evaluation refer back to the Loading Local Data and Setting-up a Simple Example user guide pages.
Set up the example Evaluation
from datetime import datetime
from pathlib import Path
import shutil
import hvplot.pandas # noqa
import pandas as pd
from teehr.examples.setup_nwm_grid_example import setup_nwm_example
import teehr
from teehr.evaluation.utils import print_tree
# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
We’ll start with the Evaluation from the previous User Guide, which contains observed (USGS) and simulated (NWM) streamflow for a gage at Radford, VA during Hurricane Helene.
# Define the directory where the Evaluation will be created.
test_eval_dir = Path(Path().home(), "temp", "11_fetch_gridded_nwm_data")
# Setup the example evaluation using data from the TEEHR repository.
shutil.rmtree(test_eval_dir, ignore_errors=True)
setup_nwm_example(tmpdir=test_eval_dir)
# Initialize the evaluation.
ev = teehr.Evaluation(dir_path=test_eval_dir)
ev.locations.to_geopandas()
|  | id | name | geometry |
---|---|---|---|
0 | usgs-03171000 | NEW RIVER AT RADFORD, VA | POINT (-80.56922 37.14179) |
We can add polygons over which to summarize rainfall, letting us see the impact of the forcing data on the NWM streamflow prediction.
location_data_path = Path(test_eval_dir, "three_huc10s_radford.parquet")
ev.locations.load_spatial(
location_data_path,
field_mapping={
"huc10": "id"
},
location_id_prefix="wbd",
write_mode="append" # this is the default
)
Now we have the USGS gage point, and three WBD HUC10 polygons in the locations table.
gdf = ev.locations.to_geopandas()
gdf
|  | id | name | geometry |
---|---|---|---|
0 | wbd-0505000115 | Peak Creek-New River | MULTIPOLYGON (((-80.81589 37.13150, -80.81594 ... |
1 | wbd-0505000117 | Lower Little River | MULTIPOLYGON (((-80.43397 37.11944, -80.43478 ... |
2 | wbd-0505000118 | Back Creek-New River | MULTIPOLYGON (((-80.40439 37.30049, -80.40460 ... |
3 | usgs-03171000 | NEW RIVER AT RADFORD, VA | POINT (-80.56922 37.14179) |
gdf[gdf.id != "usgs-03171000"].hvplot(geo=True, tiles="OSM", alpha=0.5) * \
gdf[gdf.id == "usgs-03171000"].hvplot(geo=True, color="black", size=50)
Now we’ll fetch NWM gridded analysis and assimilation rainfall data and calculate mean areal precipitation (MAP) for each of the HUC10 watersheds. We’ll treat this as the “observed” primary timeseries.
The location ID prefix is used to filter the locations table for WBD polygons during fetching. The calculated MAP timeseries will be appended (the default) to the primary_timeseries table along with the streamflow data, so the primary timeseries table will contain both streamflow and rainfall.
# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
nwm_configuration="forcing_analysis_assim",
output_type="forcing",
variable_name="RAINRATE",
start_date=datetime(2024, 9, 26),
ingest_days=1,
nwm_version="nwm30",
prioritize_analysis_valid_time=True,
t_minus_hours=[0],
calculate_zonal_weights=True,
location_id_prefix="wbd",
timeseries_type="primary" # To be considered "observations". This is the default.
)
primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
|  | value_time | value | unit_name | location_id | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|
0 | 2024-09-26 00:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
1 | 2024-09-26 01:00:00 | 274.956573 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
2 | 2024-09-26 02:00:00 | 270.425873 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
3 | 2024-09-26 03:00:00 | 273.257568 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
4 | 2024-09-26 04:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
Let’s take a look at the cache directory. After fetching the NWM analysis forcing, we have a directory for the weights file.
print_tree(Path(ev.cache_dir, "fetching", "weights"))
└── nwm30_forcing_analysis_assim
└── nwm30_forcing_analysis_assim_pixel_weights.parquet
The weights file contains the fractional coverage for each pixel (row/col pair) that intersects with each polygon location (WBD HUC10 watershed).
cached_weights_filepath = Path(
ev.cache_dir,
"fetching",
"weights",
"nwm30_forcing_analysis_assim",
"nwm30_forcing_analysis_assim_pixel_weights.parquet"
)
pd.read_parquet(cached_weights_filepath)
|  | row | col | weight | location_id |
---|---|---|---|---|
0 | 1730 | 3704 | 0.045272 | wbd-0505000115 |
1 | 1730 | 3705 | 0.246893 | wbd-0505000115 |
2 | 1730 | 3706 | 0.090052 | wbd-0505000115 |
3 | 1730 | 3725 | 0.012040 | wbd-0505000117 |
4 | 1730 | 3726 | 0.001101 | wbd-0505000117 |
... | ... | ... | ... | ... |
1599 | 1774 | 3726 | 0.801455 | wbd-0505000118 |
1600 | 1774 | 3727 | 0.028692 | wbd-0505000118 |
1601 | 1775 | 3724 | 0.377646 | wbd-0505000118 |
1602 | 1775 | 3725 | 0.784341 | wbd-0505000118 |
1603 | 1775 | 3726 | 0.109935 | wbd-0505000118 |
1604 rows × 4 columns
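To see how these weights turn gridded values into MAP, here is a minimal pandas sketch of the zonal-weighting step. The pixel values and weights below are made up for illustration, and TEEHR’s actual implementation may differ in details:

```python
import pandas as pd

# Hypothetical pixel values for a single timestep (rainfall rate per grid cell).
pixels = pd.DataFrame({
    "row": [1730, 1730, 1730],
    "col": [3704, 3705, 3706],
    "value": [2.0, 4.0, 6.0],
})

# Weights table mirroring the cached parquet schema: the fractional coverage
# of each pixel within each polygon.
weights = pd.DataFrame({
    "row": [1730, 1730, 1730],
    "col": [3704, 3705, 3706],
    "weight": [0.25, 0.50, 0.25],
    "location_id": ["wbd-0505000115"] * 3,
})

# Join pixel values to weights, then compute the weighted mean per polygon.
merged = weights.merge(pixels, on=["row", "col"])
merged["weighted"] = merged["value"] * merged["weight"]
map_df = merged.groupby("location_id").agg(
    weighted_sum=("weighted", "sum"),
    weight_sum=("weight", "sum"),
)
map_df["map"] = map_df["weighted_sum"] / map_df["weight_sum"]
print(map_df["map"])  # mean areal precipitation per location
```

With these example numbers, the MAP for the single polygon is simply the coverage-weighted average of the three pixel values.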
Now we can fetch the NWM forecast rainfall and calculate MAP, to compare it to the “observed” rainfall. First we need to create and import the location_crosswalks table, which maps the secondary (“model”) locations to the primary (“observations”) locations.
ev.locations.to_pandas()["id"].unique()
array(['wbd-0505000115', 'wbd-0505000117', 'wbd-0505000118',
'usgs-03171000'], dtype=object)
xwalk_df = pd.DataFrame(
{
"primary_location_id": [
"wbd-0505000115",
"wbd-0505000117",
"wbd-0505000118"
],
"secondary_location_id": [
"forecast-0505000115",
"forecast-0505000117",
"forecast-0505000118"
]
}
)
temp_xwalk_path = Path(ev.cache_dir, "forcing_xwalk.parquet")
xwalk_df.to_parquet(temp_xwalk_path)
Now we can load it into the Evaluation, appending to the existing table.
ev.location_crosswalks.load_parquet(temp_xwalk_path)
ev.location_crosswalks.to_pandas()
|  | primary_location_id | secondary_location_id |
---|---|---|
0 | usgs-03171000 | nwm30-6884666 |
1 | wbd-0505000115 | forecast-0505000115 |
2 | wbd-0505000117 | forecast-0505000117 |
3 | wbd-0505000118 | forecast-0505000118 |
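Conceptually, this crosswalk is what lets primary and secondary values be paired at the same location and time when the tables are joined. A minimal pandas sketch of that pairing, with made-up values (TEEHR performs this join internally, likely with different mechanics):

```python
import pandas as pd

# A one-row crosswalk mapping a secondary location to its primary location.
xwalk = pd.DataFrame({
    "primary_location_id": ["wbd-0505000115"],
    "secondary_location_id": ["forecast-0505000115"],
})

# Hypothetical observed (primary) and forecast (secondary) values.
primary = pd.DataFrame({
    "value_time": pd.to_datetime(["2024-09-26 01:00"]),
    "location_id": ["wbd-0505000115"],
    "value": [1.2],
})
secondary = pd.DataFrame({
    "value_time": pd.to_datetime(["2024-09-26 01:00"]),
    "location_id": ["forecast-0505000115"],
    "value": [0.8],
})

# Map secondary locations to primary via the crosswalk, then join on time.
joined = (
    secondary
    .rename(columns={"location_id": "secondary_location_id", "value": "secondary_value"})
    .merge(xwalk, on="secondary_location_id")
    .merge(
        primary.rename(columns={"location_id": "primary_location_id", "value": "primary_value"}),
        on=["primary_location_id", "value_time"],
    )
)
print(joined[["value_time", "primary_location_id", "primary_value", "secondary_value"]])
```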
Now we can fetch the NWM forecast rainfall, which we’ll consider the “secondary” timeseries. Let’s grab a single short range rainfall forecast.
# To run this cell locally first convert it to a python cell
ev.fetch.nwm_operational_grids(
nwm_configuration="forcing_short_range",
output_type="forcing",
variable_name="RAINRATE",
start_date=datetime(2024, 9, 26),
ingest_days=1,
nwm_version="nwm30",
location_id_prefix="forecast",
calculate_zonal_weights=False, # re-use the weights file in the cache
zonal_weights_filepath=cached_weights_filepath,
starting_z_hour=1,
ending_z_hour=1,
timeseries_type="secondary" # now considered forecast
)
And our secondary_timeseries table consists of NWM medium range streamflow forecasts (member 1), plus the NWM short range rainfall forecast summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).
secondary_df = ev.secondary_timeseries.to_pandas()
secondary_df.head()
|  | value_time | value | unit_name | location_id | member | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|---|
0 | 2024-09-26 07:00:00 | 304.829987 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
1 | 2024-09-26 08:00:00 | 291.509979 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
2 | 2024-09-26 09:00:00 | 273.970001 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
3 | 2024-09-26 10:00:00 | 252.580002 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
4 | 2024-09-26 11:00:00 | 228.429993 | m^3/s | nwm30-6884666 | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 06:00:00 |
And our primary_timeseries table consists of USGS streamflow data at the Radford gage, plus NWM analysis rainfall summarized to the three WBD HUC10 watersheds as mean areal precipitation (MAP).
primary_df = ev.primary_timeseries.to_pandas()
primary_df.head()
|  | value_time | value | unit_name | location_id | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|
0 | 2024-09-26 00:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
1 | 2024-09-26 01:00:00 | 274.956573 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
2 | 2024-09-26 02:00:00 | 270.425873 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
3 | 2024-09-26 03:00:00 | 273.257568 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
4 | 2024-09-26 04:00:00 | 286.000153 | m^3/s | usgs-03171000 | usgs_observations | streamflow_hourly_inst | NaT |
Now we can create the joined_timeseries table, allowing for efficient metric calculation and further exploration.
ev.joined_timeseries.create()
joined_df = ev.joined_timeseries.to_pandas()
joined_df.head()
|  | value_time | primary_location_id | secondary_location_id | primary_value | secondary_value | unit_name | member | configuration_name | variable_name | reference_time |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2024-09-26 01:00:00 | usgs-03171000 | nwm30-6884666 | 274.956573 | 274.630005 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
1 | 2024-09-26 02:00:00 | usgs-03171000 | nwm30-6884666 | 270.425873 | 259.910004 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
2 | 2024-09-26 03:00:00 | usgs-03171000 | nwm30-6884666 | 273.257568 | 241.559998 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
3 | 2024-09-26 04:00:00 | usgs-03171000 | nwm30-6884666 | 286.000153 | 220.289993 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
4 | 2024-09-26 05:00:00 | usgs-03171000 | nwm30-6884666 | 302.990265 | 197.220001 | m^3/s | 1 | nwm30_medium_range | streamflow_hourly_inst | 2024-09-26 |
joined_df.configuration_name.unique()
array(['nwm30_medium_range', 'nwm30_forcing_short_range'], dtype=object)
# We'll use this helper function for plotting timeseries.
def plot_timeseries(
configuration_name: str,
secondary_by: list[str],
title: str
):
sim_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
hvplot(
x="value_time",
y="secondary_value",
by=secondary_by,
legend=False,
)
obs_plot = joined_df[joined_df["configuration_name"] == configuration_name].\
hvplot(
x="value_time",
y="primary_value",
by=["primary_location_id", "configuration_name"],
legend=False,
color="black"
)
return (sim_plot * obs_plot).options(
width=700,
height=400,
show_grid=True,
title=title
)
We can plot the primary (USGS) vs. secondary (NWM) streamflow and notice that this particular NWM forecast underestimated the flow.
plot_timeseries(
configuration_name="nwm30_medium_range",
secondary_by=["reference_time"],
title="Observed streamflow (black) and forecast streamflow at the Radford gage"
)
Plotting the primary vs. secondary rainfall (MAP) in each watershed, we can see that this NWM forecast underestimated the observed rainfall. This is likely one reason why the NWM streamflow forecast also underestimated the flow.
plot_timeseries(
configuration_name="nwm30_forcing_short_range",
secondary_by=["primary_location_id"],
title="Observed MAP (black) and forecast MAP at the three HUC10s"
)
Now we can also calculate performance metrics for the streamflow and rainfall for this event.
bias = teehr.DeterministicMetrics.RelativeBias()
kge = teehr.DeterministicMetrics.KlingGuptaEfficiency()
ev.metrics.query(
include_metrics=[bias, kge],
group_by=["primary_location_id", "configuration_name"],
order_by=["primary_location_id", "configuration_name"],
).to_pandas()
|  | primary_location_id | configuration_name | relative_bias | kling_gupta_efficiency |
---|---|---|---|---|
0 | usgs-03171000 | nwm30_medium_range | -0.605284 | -0.700095 |
1 | wbd-0505000115 | nwm30_forcing_short_range | -0.624619 | -0.539740 |
2 | wbd-0505000117 | nwm30_forcing_short_range | 0.283773 | 0.228351 |
3 | wbd-0505000118 | nwm30_forcing_short_range | -0.479357 | -0.396829 |
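As a sanity check on what these metrics mean, relative bias and the Kling-Gupta efficiency can be computed directly from paired values using their standard definitions (KGE per Gupta et al., 2009). The sketch below uses a few illustrative value pairs, not the actual evaluation data, and TEEHR’s implementation details may differ slightly:

```python
import numpy as np

def relative_bias(obs, sim):
    # Relative bias: total error normalized by total observed volume.
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return (sim - obs).sum() / obs.sum()

def kling_gupta_efficiency(obs, sim):
    # KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Illustrative paired values where the simulation underestimates the flow.
obs = [286.0, 275.0, 270.4, 273.3, 286.0]
sim = [274.6, 259.9, 241.6, 220.3, 197.2]
print(relative_bias(obs, sim))          # negative -> underestimation
print(kling_gupta_efficiency(obs, sim))
```

A negative relative bias matches what we saw in the plots above: the forecasts underestimated both streamflow and rainfall.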
ev.spark.stop()