Clone an Example from S3#

Overview#

In this lesson we will clone a small example TEEHR Evaluation from S3 and run through some simple example metrics calculations, demonstrating filter, grouping and chaining of query methods.

Create a new Evaluation#

First we will import TEEHR along with some other required libraries for this example. Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored.

import teehr
from pathlib import Path
import shutil

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()

Loading BokehJS ...

# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "05_clone_from_s3")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

Clone Evaluation Data from S3#

As mentioned above, for this exercise we will be cloning a complete Evaluation dataset from the TEEHR S3 bucket. First we will list the available Evaluations and then we will clone the e0_2_location_example evaluation which is a small example Evaluation that only contains 2 gages.

# List the evaluations in the S3 bucket
ev.list_s3_evaluations()

	name	description	url
0	e0_2_location_example	Example evaluation datsets with 2 USGS gages	s3a://ciroh-rti-public-data/teehr-data-warehou...
1	e1_camels_daily_streamflow	Daily average streamflow at ther Camels basins	s3a://ciroh-rti-public-data/teehr-data-warehou...
2	e2_camels_hourly_streamflow	Hourly instantaneous streamflow at ther Camels...	s3a://ciroh-rti-public-data/teehr-data-warehou...
3	e3_usgs_hourly_streamflow	Hourly instantaneous streamflow at USGS CONUS ...	s3a://ciroh-rti-public-data/teehr-data-warehou...
4	e4_nwm_operational	Empty template to load and evaluate NWM operat...	s3a://ciroh-rti-public-data/teehr-data-warehou...

# Clone the e0_2_location_example evaluation from the S3 bucket
ev.clone_from_s3("e0_2_location_example")

Now that we have cloned all the data for this evaluation from the TEEHR S3 bucket, lets query the locations table as a GeoPandas GeoDataFrame and then plot the gages on a map using the TEEHR plotting.

locations_gdf = ev.locations.to_geopandas()
locations_gdf.teehr.locations_map()

Lets also query the primary_timeseries and plot the timeseries data using the df.teehr.timeseries_plot() method.

pt_df = ev. primary_timeseries.to_pandas()
pt_df.head()

	reference_time	value_time	value	unit_name	location_id	configuration_name	variable_name
0	NaT	2000-10-01 00:00:00	3.341388	m^3/s	usgs-14138800	usgs_observations	streamflow_hourly_inst
1	NaT	2000-10-01 01:00:00	3.992675	m^3/s	usgs-14138800	usgs_observations	streamflow_hourly_inst
2	NaT	2000-10-01 02:00:00	4.445745	m^3/s	usgs-14138800	usgs_observations	streamflow_hourly_inst
3	NaT	2000-10-01 03:00:00	5.408518	m^3/s	usgs-14138800	usgs_observations	streamflow_hourly_inst
4	NaT	2000-10-01 04:00:00	5.606736	m^3/s	usgs-14138800	usgs_observations	streamflow_hourly_inst

pt_df.teehr.timeseries_plot()

And the location_crosswalks table.

lc_df = ev.location_crosswalks.to_pandas()
lc_df.head()

	primary_location_id	secondary_location_id
0	usgs-14316700	nwm30-23894572
1	usgs-14138800	nwm30-23736071

And the secondary_timeseries table.

st_df = ev.secondary_timeseries.to_pandas()
st_df.head()

	value_time	value	unit_name	location_id	member	configuration_name	variable_name	reference_time
0	2000-10-01 00:00:00	0.38	m^3/s	nwm30-23894572	None	nwm30_retrospective	streamflow_hourly_inst	NaT
1	2000-10-01 00:00:00	0.06	m^3/s	nwm30-23736071	None	nwm30_retrospective	streamflow_hourly_inst	NaT
2	2000-10-01 01:00:00	0.38	m^3/s	nwm30-23894572	None	nwm30_retrospective	streamflow_hourly_inst	NaT
3	2000-10-01 01:00:00	0.06	m^3/s	nwm30-23736071	None	nwm30_retrospective	streamflow_hourly_inst	NaT
4	2000-10-01 02:00:00	0.38	m^3/s	nwm30-23894572	None	nwm30_retrospective	streamflow_hourly_inst	NaT

st_df.teehr.timeseries_plot()

And lastly, the joined_timeseries table.

jt_df = ev.joined_timeseries.to_pandas()
jt_df.head()

	value_time	primary_location_id	secondary_location_id	primary_value	secondary_value	unit_name	member	location_id	frac_snow	...	baseflow_index	river_forecast_center	month	year	water_year	primary_normalized_flow	secondary_normalized_flow	configuration_name	variable_name	reference_time
0	2000-10-01 00:00:00	usgs-14316700	nwm30-23894572	1.132674	0.38	m^3/s	None	usgs-14316700	0.176336580742005	...	0.508616082222394	NWRFC	10	2000	2001	0.001927	0.000646	nwm30_retrospective	streamflow_hourly_inst	None
1	2000-10-01 00:00:00	usgs-14138800	nwm30-23736071	3.341388	0.06	m^3/s	None	usgs-14138800	0.317266212149897	...	0.457869583655904	NWRFC	10	2000	2001	0.157613	0.002830	nwm30_retrospective	streamflow_hourly_inst	None
2	2000-10-01 01:00:00	usgs-14316700	nwm30-23894572	1.132674	0.38	m^3/s	None	usgs-14316700	0.176336580742005	...	0.508616082222394	NWRFC	10	2000	2001	0.001927	0.000646	nwm30_retrospective	streamflow_hourly_inst	None
3	2000-10-01 01:00:00	usgs-14138800	nwm30-23736071	3.992675	0.06	m^3/s	None	usgs-14138800	0.317266212149897	...	0.457869583655904	NWRFC	10	2000	2001	0.188334	0.002830	nwm30_retrospective	streamflow_hourly_inst	None
4	2000-10-01 02:00:00	usgs-14316700	nwm30-23894572	1.132674	0.38	m^3/s	None	usgs-14316700	0.176336580742005	...	0.508616082222394	NWRFC	10	2000	2001	0.001927	0.000646	nwm30_retrospective	streamflow_hourly_inst	None

5 rows × 41 columns

Metrics#

Now that we have confirmed that we have all the data tables and the joined_timeseries table, we can move on to analyzing the data. The user is encouraged to check out the documentation pages relating to filtering and grouping in the context of generating metrics. The short explanation is that filters can be used to select what values are used when calculating metrics, while the group_by determines how the values are grouped into populations before calculating metrics.

The most basic way to evaluate simulation performance is to group_by configuration_name and primary_location_id, and generate some basic metrics. In this case it will be Nash-Sutcliffe Efficiency, Kling-Gupta Efficiency and Relative Bias, calculated at each location for each configuration. As we saw there are 2 locations and 1 configuration, so the total number of rows that are output is just 2. If there were more locations or more configurations, there would be more rows in the output for this query. TEEHR contains many more metrics that can be calculated by simply including them in the list of include_metrics, and there are also many other ways to look at performance besides the basic metrics.

ev.metrics.query(
    group_by=["configuration_name", "primary_location_id"],
    include_metrics=[
        teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
        teehr.DeterministicMetrics.KlingGuptaEfficiency(),
        teehr.DeterministicMetrics.RelativeBias()
    ]
).to_pandas()

	configuration_name	primary_location_id	nash_sutcliffe_efficiency	kling_gupta_efficiency	relative_bias
0	nwm30_retrospective	usgs-14138800	0.560528	0.778547	-0.050056
1	nwm30_retrospective	usgs-14316700	0.568308	0.739879	0.029096

Now to demonstrate how filters work, if we add a filter to only select values where the primary_location_id is usgs-14138800. Accordingly, it will only include rows from the join_timeseries table where primary_location_id is usgs-14138800 in the metrics calculations, and since we are grouping by primary_location_id, that means we can expect one row in the output. And that is what we see below.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
            teehr.DeterministicMetrics.KlingGuptaEfficiency(),
            teehr.DeterministicMetrics.RelativeBias()
        ]
    )
    .to_pandas()
)

	configuration_name	primary_location_id	nash_sutcliffe_efficiency	kling_gupta_efficiency	relative_bias
0	nwm30_retrospective	usgs-14138800	0.560528	0.778547	-0.050056

As another example, because the joined_timeseries table contains a year column which was added as a user defined field, we can also group by year. In this case we will get the metrics calculated for each configuration_name, primary_location_id, and year.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
            teehr.DeterministicMetrics.KlingGuptaEfficiency(),
            teehr.DeterministicMetrics.RelativeBias()
        ]
    )
    .to_pandas()
)

	configuration_name	primary_location_id	year	nash_sutcliffe_efficiency	kling_gupta_efficiency	relative_bias
0	nwm30_retrospective	usgs-14138800	2000	-0.117555	-0.217691	-0.464838
1	nwm30_retrospective	usgs-14138800	2001	0.567095	0.753355	-0.034241
2	nwm30_retrospective	usgs-14138800	2002	0.180258	0.602793	-0.037807
3	nwm30_retrospective	usgs-14138800	2003	0.642918	0.788425	0.015950
4	nwm30_retrospective	usgs-14138800	2004	0.417384	0.718745	-0.031999
5	nwm30_retrospective	usgs-14138800	2005	0.537438	0.709779	-0.049876
6	nwm30_retrospective	usgs-14138800	2006	0.770892	0.883926	-0.000343
7	nwm30_retrospective	usgs-14138800	2007	0.271840	0.568460	-0.068697
8	nwm30_retrospective	usgs-14138800	2008	0.470112	0.602048	-0.196977
9	nwm30_retrospective	usgs-14138800	2009	0.584974	0.768143	-0.078182
10	nwm30_retrospective	usgs-14138800	2010	0.516353	0.676750	-0.057300
11	nwm30_retrospective	usgs-14138800	2011	0.778329	0.879296	-0.040333
12	nwm30_retrospective	usgs-14138800	2012	0.442015	0.704763	0.120819

There are many ways that TEEHR can be used to “slice and dice” the data in the TEEHR dataset. One last example here before wrapping up this lesson. Lets say we wanted the “annual peak relative bias”, so that is the relative bias of the annual peak values. Well, TEEHR can do this too by chaining the query methods together and overriding the input_field_names and the output_field_name as shown below. We will do this step by step to understand it. First run the following query where the second query is commented out then in the next cell run it with the second query uncommented. As you can see first we calculate the peak primary value (max_primary_value) and peak secondary value (max_secondary_value) for each year, then we calculate the relative bias across the yearly peaks (annual_max_relative_bias).

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    # .query(
    #     group_by=["configuration_name", "primary_location_id"],
    #     include_metrics=[
    #         teehr.DeterministicMetrics.RelativeBias(
    #             input_field_names=["max_primary_value", "max_secondary_value"],
    #             output_field_name="monthly_max_relative_bias"
    #         )
    #     ]
    # )
    .to_pandas()
)

	configuration_name	primary_location_id	year	max_primary_value	max_secondary_value
0	nwm30_retrospective	usgs-14138800	2000	11.015253	1.430000
1	nwm30_retrospective	usgs-14138800	2001	16.452087	24.150000
2	nwm30_retrospective	usgs-14138800	2002	31.148531	29.109999
3	nwm30_retrospective	usgs-14138800	2003	21.719021	30.869999
4	nwm30_retrospective	usgs-14138800	2004	33.130711	30.099998
5	nwm30_retrospective	usgs-14138800	2005	22.030506	27.519999
6	nwm30_retrospective	usgs-14138800	2006	43.607944	53.669998
7	nwm30_retrospective	usgs-14138800	2007	28.316847	65.059998
8	nwm30_retrospective	usgs-14138800	2008	43.041607	29.689999
9	nwm30_retrospective	usgs-14138800	2009	37.378239	24.930000
10	nwm30_retrospective	usgs-14138800	2010	25.853281	23.939999
11	nwm30_retrospective	usgs-14138800	2011	50.687153	45.039997
12	nwm30_retrospective	usgs-14138800	2012	24.437439	24.260000

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.DeterministicMetrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="annual_max_relative_bias"
            )
        ]
    )
    .to_pandas()
)

	configuration_name	primary_location_id	annual_max_relative_bias
0	nwm30_retrospective	usgs-14138800	0.053885

ev.spark.stop()

Clone an Example from S3#

Overview#

Create a new Evaluation#

Clone Evaluation Data from S3#

Metrics#

This Page