Read an Example from S3#

Overview#

As in 05 Clone from S3, we will work with an Evaluation dataset located in an S3 bucket. This time, however, rather than cloning the dataset locally, we will read the data directly from S3 and run the same commands against it as in the previous lesson.

Create an Evaluation#

First we will import TEEHR along with some other required libraries for this example. Then we will create an Evaluation object that points to the S3 bucket.

import teehr

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
We can use the list_s3_evaluations utility to see which example Evaluations are available in the TEEHR S3 data warehouse.

from teehr.loading.s3.clone_from_s3 import list_s3_evaluations
list_s3_evaluations()["url"].values
array(['s3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e0_2_location_example',
       's3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e1_camels_daily_streamflow',
       's3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e2_camels_hourly_streamflow',
       's3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e3_usgs_hourly_streamflow',
       's3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e4_hefs_ensemble_example'],
      dtype=object)
# Create an Evaluation object that points to the S3 location
ev = teehr.Evaluation("s3a://ciroh-rti-public-data/teehr-data-warehouse/v0_4_evaluations/e0_2_location_example")

Now that we have created an Evaluation that points to the data in S3, let's query the locations table as a GeoPandas GeoDataFrame and then plot the gages on a map using TEEHR's plotting accessor.

locations_gdf = ev.locations.to_geopandas()
locations_gdf.teehr.locations_map()

Let's also query the primary_timeseries table and plot the timeseries data using the df.teehr.timeseries_plot() method.

pt_df = ev.primary_timeseries.to_pandas()
pt_df.head()
reference_time value_time value unit_name location_id configuration_name variable_name
0 NaT 2000-10-01 00:00:00 3.341388 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
1 NaT 2000-10-01 01:00:00 3.992675 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
2 NaT 2000-10-01 02:00:00 4.445745 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
3 NaT 2000-10-01 03:00:00 5.408518 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
4 NaT 2000-10-01 04:00:00 5.606736 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
pt_df.teehr.timeseries_plot()
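
The same data can also be plotted directly with pandas if you prefer to bypass the TEEHR accessor. The snippet below is a minimal sketch using only the pt_df DataFrame queried above; it assumes matplotlib is installed for pandas plotting.

# A minimal sketch: plot the primary timeseries with pandas' built-in
# plotting instead of the TEEHR Bokeh accessor. Pivoting gives each
# location its own line.
ax = (
    pt_df
    .pivot_table(index="value_time", columns="location_id", values="value")
    .plot(figsize=(10, 4))
)
ax.set_ylabel("streamflow (m^3/s)")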

And the location_crosswalks table.

lc_df = ev.location_crosswalks.to_pandas()
lc_df.head()
primary_location_id secondary_location_id
0 usgs-14316700 nwm30-23894572
1 usgs-14138800 nwm30-23736071

And the secondary_timeseries table.

st_df = ev.secondary_timeseries.to_pandas()
st_df.head()
reference_time value_time value unit_name location_id member configuration_name variable_name
0 NaT 2000-10-01 00:00:00 0.38 m^3/s nwm30-23894572 None nwm30_retrospective streamflow_hourly_inst
1 NaT 2000-10-01 00:00:00 0.06 m^3/s nwm30-23736071 None nwm30_retrospective streamflow_hourly_inst
2 NaT 2000-10-01 01:00:00 0.38 m^3/s nwm30-23894572 None nwm30_retrospective streamflow_hourly_inst
3 NaT 2000-10-01 01:00:00 0.06 m^3/s nwm30-23736071 None nwm30_retrospective streamflow_hourly_inst
4 NaT 2000-10-01 02:00:00 0.38 m^3/s nwm30-23894572 None nwm30_retrospective streamflow_hourly_inst
st_df.teehr.timeseries_plot()
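
The crosswalk can also be used outside of TEEHR. For example, a plain pandas merge relates each secondary timeseries row back to its primary gage. This is a minimal sketch using the lc_df and st_df DataFrames queried above; merged_df is our own name, not part of the TEEHR API.

# A minimal sketch: attach the primary location ID to each secondary
# timeseries row via the crosswalk table using a standard pandas merge.
merged_df = st_df.merge(
    lc_df,
    left_on="location_id",
    right_on="secondary_location_id",
    how="left"
)
merged_df[["location_id", "primary_location_id"]].drop_duplicates()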

And lastly, the joined_timeseries table.

jt_df = ev.joined_timeseries.to_pandas()
jt_df.head()
reference_time value_time primary_location_id secondary_location_id primary_value secondary_value unit_name member location_id frac_snow ... q_mean baseflow_index river_forecast_center month year water_year primary_normalized_flow secondary_normalized_flow configuration_name variable_name
0 NaT 2000-10-01 00:00:00 usgs-14316700 nwm30-23894572 1.132674 0.38 m^3/s None usgs-14316700 0.176336580742005 ... 19.909239533259516 0.508616082222394 NWRFC 10 2000 2001 0.001927 0.000646 nwm30_retrospective streamflow_hourly_inst
1 NaT 2000-10-01 00:00:00 usgs-14138800 nwm30-23736071 3.341388 0.06 m^3/s None usgs-14138800 0.317266212149897 ... 1.5975415858787265 0.457869583655904 NWRFC 10 2000 2001 0.157613 0.002830 nwm30_retrospective streamflow_hourly_inst
2 NaT 2000-10-01 01:00:00 usgs-14316700 nwm30-23894572 1.132674 0.38 m^3/s None usgs-14316700 0.176336580742005 ... 19.909239533259516 0.508616082222394 NWRFC 10 2000 2001 0.001927 0.000646 nwm30_retrospective streamflow_hourly_inst
3 NaT 2000-10-01 01:00:00 usgs-14138800 nwm30-23736071 3.992675 0.06 m^3/s None usgs-14138800 0.317266212149897 ... 1.5975415858787265 0.457869583655904 NWRFC 10 2000 2001 0.188334 0.002830 nwm30_retrospective streamflow_hourly_inst
4 NaT 2000-10-01 02:00:00 usgs-14316700 nwm30-23894572 1.132674 0.38 m^3/s None usgs-14316700 0.176336580742005 ... 19.909239533259516 0.508616082222394 NWRFC 10 2000 2001 0.001927 0.000646 nwm30_retrospective streamflow_hourly_inst

5 rows × 41 columns
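
Note that head() truncates the 41 columns. To see every field available for filtering and grouping, including the user-defined fields used later in this lesson (month, year, water_year), we can list the columns directly:

# List all 41 columns of the joined_timeseries table, since head() elides
# most of them behind "...".
print(jt_df.columns.tolist())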

Metrics#

Now that we have confirmed the Evaluation contains all the data tables, including the joined_timeseries table, we can move on to analyzing the data. The user is encouraged to check out the documentation pages on filtering and grouping in the context of generating metrics. In short, filters select which values are used when calculating metrics, while group_by determines how those values are grouped into populations before the metrics are calculated.

The most basic way to evaluate simulation performance is to group_by configuration_name and primary_location_id and generate some basic metrics: in this case, Nash-Sutcliffe Efficiency, Kling-Gupta Efficiency, and Relative Bias, calculated at each location for each configuration. As we saw, there are two locations and one configuration, so the output contains just two rows. If there were more locations or configurations, there would be more rows in the output of this query. TEEHR contains many more metrics that can be calculated by simply adding them to include_metrics, and there are many other ways to look at performance beyond these basic metrics.

ev.metrics.query(
    group_by=["configuration_name", "primary_location_id"],
    include_metrics=[
        teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
        teehr.DeterministicMetrics.KlingGuptaEfficiency(),
        teehr.DeterministicMetrics.RelativeBias()
    ]
).to_pandas()
configuration_name primary_location_id nash_sutcliffe_efficiency kling_gupta_efficiency relative_bias
0 nwm30_retrospective usgs-14138800 0.560528 0.778547 -0.050056
1 nwm30_retrospective usgs-14316700 0.568308 0.739879 0.029096

Now to demonstrate how filters work, we add a filter that selects only values where primary_location_id is usgs-14138800. Accordingly, only rows from the joined_timeseries table where primary_location_id is usgs-14138800 are included in the metrics calculations, and since we are grouping by primary_location_id, we can expect a single row in the output. That is what we see below.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
            teehr.DeterministicMetrics.KlingGuptaEfficiency(),
            teehr.DeterministicMetrics.RelativeBias()
        ]
    )
    .to_pandas()
)
configuration_name primary_location_id nash_sutcliffe_efficiency kling_gupta_efficiency relative_bias
0 nwm30_retrospective usgs-14138800 0.560528 0.778547 -0.050056
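
As a quick sanity check, the relative bias for this location can also be reproduced by hand from the joined table. The sketch below assumes relative bias is defined as the sum of (secondary − primary) divided by the sum of primary values; if TEEHR uses a different formulation, the numbers may differ slightly.

# Hedged sanity check: recompute relative bias from jt_df with pandas.
# Assumes relative_bias = sum(secondary - primary) / sum(primary).
sub = jt_df[jt_df["primary_location_id"] == "usgs-14138800"]
rb = (sub["secondary_value"] - sub["primary_value"]).sum() / sub["primary_value"].sum()
print(rb)  # expect a value near -0.050056 if the definitions match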

As another example, because the joined_timeseries table contains a year column, which was added as a user-defined field, we can also group by year. In this case we get the metrics calculated for each configuration_name, primary_location_id, and year.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
            teehr.DeterministicMetrics.KlingGuptaEfficiency(),
            teehr.DeterministicMetrics.RelativeBias()
        ]
    )
    .to_pandas()
)
configuration_name primary_location_id year nash_sutcliffe_efficiency kling_gupta_efficiency relative_bias
0 nwm30_retrospective usgs-14138800 2000 -0.117555 -0.217691 -0.464838
1 nwm30_retrospective usgs-14138800 2001 0.567095 0.753355 -0.034241
2 nwm30_retrospective usgs-14138800 2002 0.180258 0.602793 -0.037807
3 nwm30_retrospective usgs-14138800 2003 0.642918 0.788425 0.015950
4 nwm30_retrospective usgs-14138800 2004 0.417384 0.718745 -0.031999
5 nwm30_retrospective usgs-14138800 2005 0.537438 0.709779 -0.049876
6 nwm30_retrospective usgs-14138800 2006 0.770892 0.883926 -0.000343
7 nwm30_retrospective usgs-14138800 2007 0.271840 0.568460 -0.068697
8 nwm30_retrospective usgs-14138800 2008 0.470112 0.602048 -0.196977
9 nwm30_retrospective usgs-14138800 2009 0.584974 0.768143 -0.078182
10 nwm30_retrospective usgs-14138800 2010 0.516353 0.676750 -0.057300
11 nwm30_retrospective usgs-14138800 2011 0.778329 0.879296 -0.040333
12 nwm30_retrospective usgs-14138800 2012 0.442015 0.704763 0.120819
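
It can be helpful to visualize how performance varies from year to year. The sketch below saves the yearly metrics to a DataFrame (yearly_df is our own name) and plots two of them with pandas; it reuses exactly the query shown above.

# A minimal sketch: persist the yearly metrics and plot them with pandas.
yearly_df = (
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.DeterministicMetrics.NashSutcliffeEfficiency(),
            teehr.DeterministicMetrics.KlingGuptaEfficiency()
        ]
    )
    .to_pandas()
)
ax = yearly_df.plot(
    x="year",
    y=["nash_sutcliffe_efficiency", "kling_gupta_efficiency"],
    kind="bar",
    figsize=(10, 4)
)
ax.set_ylabel("metric value")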

There are many ways TEEHR can be used to “slice and dice” the data in a TEEHR dataset. Here is one last example before wrapping up this lesson. Suppose we want the “annual peak relative bias”, that is, the relative bias of the annual peak values. TEEHR can do this by chaining query methods together and overriding the input_field_names and output_field_name arguments, as shown below. We will do this step by step. First run the following query with the second query commented out, then in the next cell run it with the second query uncommented. As you can see, we first calculate the peak primary value (max_primary_value) and peak secondary value (max_secondary_value) for each year, and then calculate the relative bias across those yearly peaks (annual_max_relative_bias).

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    # .query(
    #     group_by=["configuration_name", "primary_location_id"],
    #     include_metrics=[
    #         teehr.DeterministicMetrics.RelativeBias(
    #             input_field_names=["max_primary_value", "max_secondary_value"],
    #             output_field_name="annual_max_relative_bias"
    #         )
    #     ]
    # )
    .to_pandas()
)
configuration_name primary_location_id year max_primary_value max_secondary_value
0 nwm30_retrospective usgs-14138800 2000 11.015253 1.430000
1 nwm30_retrospective usgs-14138800 2001 16.452087 24.150000
2 nwm30_retrospective usgs-14138800 2002 31.148531 29.109999
3 nwm30_retrospective usgs-14138800 2003 21.719021 30.869999
4 nwm30_retrospective usgs-14138800 2004 33.130711 30.099998
5 nwm30_retrospective usgs-14138800 2005 22.030506 27.519999
6 nwm30_retrospective usgs-14138800 2006 43.607944 53.669998
7 nwm30_retrospective usgs-14138800 2007 28.316847 65.059998
8 nwm30_retrospective usgs-14138800 2008 43.041607 29.689999
9 nwm30_retrospective usgs-14138800 2009 37.378239 24.930000
10 nwm30_retrospective usgs-14138800 2010 25.853281 23.939999
11 nwm30_retrospective usgs-14138800 2011 50.687153 45.039997
12 nwm30_retrospective usgs-14138800 2012 24.437439 24.260000
(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "year"],
        filters=[
            {
                "column": "primary_location_id",
                "operator": "=",
                "value": "usgs-14138800"
            }],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.DeterministicMetrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="annual_max_relative_bias"
            )
        ]
    )
    .to_pandas()
)
configuration_name primary_location_id annual_max_relative_bias
0 nwm30_retrospective usgs-14138800 0.053885
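
The chained result can likewise be verified by hand from the joined table. This sketch computes the yearly maxima with a pandas groupby and then the relative bias across those maxima, under the same sum-based definition of relative bias assumed earlier.

# Hedged sanity check: yearly maxima via a pandas groupby, then relative
# bias across those maxima. Assumes relative_bias = sum(sim - obs) / sum(obs).
sub = jt_df[jt_df["primary_location_id"] == "usgs-14138800"]
annual_max = sub.groupby("year")[["primary_value", "secondary_value"]].max()
rb = (
    (annual_max["secondary_value"] - annual_max["primary_value"]).sum()
    / annual_max["primary_value"].sum()
)
print(rb)  # expect a value near 0.053885 if the definitions match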
Finally, we stop the Spark session.

ev.spark.stop()