Grouping and Filtering

Note: The images shown below are slightly out of date. We are in the process of updating them.

Once the data has been joined into a single table, we can group and filter it based on the table attributes and calculate metrics for specific subsets of the data. This is the exploratory power of TEEHR, which allows us to better understand model performance. For example, if the joined table contained several model simulations (“configurations”), we could group by the configuration_name field to calculate performance metrics for each model configuration.

We could then add filters to further narrow the population, such as only considering first-order stream locations or locations below a certain mean slope value. This allows us to gain more insight into model performance through targeted quantitative analysis.

Together, the grouping and filtering capabilities in TEEHR let us evaluate models across different subsets of the data, helping us understand where and why a model performs well or poorly.
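As a rough illustration of that idea, the sketch below shows the general shape of such a query. It assumes an Evaluation object ev whose joined timeseries table includes a stream_order attribute; that field is hypothetical and used only for illustration here, while the working examples later in this tutorial use fields that actually exist in the sample data.

from teehr import Metrics as m

# Sketch only: relative bias per model configuration, limited to
# first-order stream locations (assumes a hypothetical "stream_order"
# attribute has been joined into the table).
metrics_df = ev.metrics.query(
    group_by=["configuration_name"],
    include_metrics=[m.RelativeBias()],
    filters=[
        {
            "column": "stream_order",
            "operator": "=",
            "value": "1"
        }
    ]
).to_pandas()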

We’ll look at an example to help illustrate the grouping and filtering concepts.

https://github.com/RTIInternational/teehr/blob/main/docs/images/tutorials/grouping_filtering/grouping_example_table.png?raw=true

Consider this joined timeseries table containing:

  • 2 USGS locations

  • 3 Model configurations

  • 4 Daily timesteps spanning two months

  • 1 Location attribute (q95_cms)

  • 1 User-defined attribute (month)

When calculating metrics in TEEHR, we can use the data in this table to compute values over specific subsets, or populations, of the data. For example, we could calculate the relative bias for each model configuration for each month.
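A query for that month-by-month comparison might look like the following sketch. It assumes an Evaluation object ev with this joined table loaded and uses the month user-defined attribute shown above; the fully worked examples follow below.

from teehr import Metrics as m

# Sketch: relative bias for each model configuration in each month
# (the "month" field is the user-defined attribute from the table above).
metrics_df = ev.metrics.query(
    group_by=["configuration_name", "month"],
    include_metrics=[m.RelativeBias()],
).to_pandas()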

Grouping

Let’s use this table of joined timeseries values to demonstrate how grouping by selected fields affects the results.

First, we’ll calculate the relative bias for each model configuration at each location:

https://github.com/RTIInternational/teehr/blob/main/docs/images/tutorials/grouping_filtering/grouping_example_1.png?raw=true

We can demonstrate how this calculation is performed in TEEHR using sample data. First, we’ll set up a local directory to contain our Evaluation, then we’ll clone a subset of an existing Evaluation from S3 storage.

from pathlib import Path
import shutil

import teehr

# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "grouping_tutorial")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)
# List the evaluations in the S3 bucket
ev.list_s3_evaluations()
   name                          description                                         url
0  e0_2_location_example         Example evaluation datsets with 2 USGS gages       s3a://ciroh-rti-public-data/teehr-data-warehou...
1  e1_camels_daily_streamflow    Daily average streamflow at ther Camels basins      s3a://ciroh-rti-public-data/teehr-data-warehou...
2  e2_camels_hourly_streamflow   Hourly instantaneous streamflow at ther Camels...   s3a://ciroh-rti-public-data/teehr-data-warehou...
3  e3_usgs_hourly_streamflow     Hourly instantaneous streamflow at USGS CONUS ...   s3a://ciroh-rti-public-data/teehr-data-warehou...
ev.clone_from_s3(
    evaluation_name="e1_camels_daily_streamflow",
    primary_location_ids=["usgs-01013500", "usgs-01022500"],
    start_date="1990-10-30 00:00",
    end_date="1990-11-02 23:00"
)

Here we calculate relative bias, grouping by primary_location_id and configuration_name:

from teehr import Metrics as m

metrics_df = ev.metrics.query(
    group_by=["primary_location_id", "configuration_name"],
    include_metrics=[
        m.RelativeBias(),
    ]
).to_pandas()
metrics_df

Note that if you want a field to appear in the query results, it must be included in the group_by list even if it is not needed for the grouping operation.

For example, if we wanted to include q95 in the query result, we would need to include it in the group_by list:

https://github.com/RTIInternational/teehr/blob/main/docs/images/tutorials/grouping_filtering/grouping_example_2.png?raw=true
# Adding q95 to the group_by list to include it in the results.
metrics_df = ev.metrics.query(
    group_by=["primary_location_id", "configuration_name", "q95"],
    include_metrics=[
        m.RelativeBias(),
    ]
).to_pandas()
metrics_df

Filtering

Next, we’ll add filtering to further narrow the population for our metric calculations. Let’s say we only want to consider NWM v3.0 and Marrmot model configurations:

https://github.com/RTIInternational/teehr/blob/main/docs/images/tutorials/grouping_filtering/grouping_example_3.png?raw=true

We need to specify a filter in the query method to only include the desired model configurations:

# Adding a filter to further limit the population for metrics calculations.
metrics_df = ev.metrics.query(
    group_by=["primary_location_id", "configuration_name", "q95"],
    include_metrics=[
        m.RelativeBias(),
    ],
    filters=[
        {
            "column": "configuration_name",
            "operator": "in",
            "value": ["nwm30_retro", "marrmot_37_hbv_obj1"]
        }
    ]
).to_pandas()
metrics_df

Summary

Grouping and filtering are powerful tools in TEEHR that allow us to explore the data in more detail and calculate metrics for specific subsets of the data.

See the User Guide for more in-depth examples using the code base.

# Stop the Spark session.
ev.spark.stop()