Add Calculated Fields#

Overview#

In this lesson we will clone a small example TEEHR Evaluation from S3 and work through the different ways to add calculated fields to the joined_timeseries table, save them to the joined_timeseries table on disk to persist them, or add them temporarily before calculating metrics. The first few steps for creating and cloning the TEEHR Evaluation should look familiar if you worked through the previous examples.

Create a new Evaluation#

First we will import TEEHR along with some other required libraries for this example. Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored.

import teehr
from pathlib import Path
import shutil

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()

Loading BokehJS ...

# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "08_calculated_fields")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

Clone Evaluation Data form S3#

As mentioned above, for this exercise we will be cloning a complete Evaluation dataset from the TEEHR S3 bucket. First we will list the available Evaluations and then we will clone the e0_2_location_example evaluation which is a small example Evaluation that only contains 2 gages.

# List the evaluations in the S3 bucket
ev.list_s3_evaluations()

	name	description	url
0	e0_2_location_example	Example evaluation datsets with 2 USGS gages	s3a://ciroh-rti-public-data/teehr-data-warehou...
1	e1_camels_daily_streamflow	Daily average streamflow at ther Camels basins	s3a://ciroh-rti-public-data/teehr-data-warehou...
2	e2_camels_hourly_streamflow	Hourly instantaneous streamflow at ther Camels...	s3a://ciroh-rti-public-data/teehr-data-warehou...
3	e3_usgs_hourly_streamflow	Hourly instantaneous streamflow at USGS CONUS ...	s3a://ciroh-rti-public-data/teehr-data-warehou...

# Clone the e0_2_location_example evaluation from the S3 bucket
ev.clone_from_s3("e0_2_location_example")

Add Calculated Fields to Joined Timeseries#

Now we will get right to it. Lets start by re-creating the joined_timeseries table and setting the add_attrs=False and execute_udf=False so that we get just the basic joined_timeseries table that joins the primary and secondary timeseries but does not add attributes or user defined fields. This step is not strictly necessary but makes it a bit easier to follow what we are doing in the subsequent steps. Lets take a look at the joined_timeseries table and see what fields are included.

ev.joined_timeseries.create(add_attrs=False, execute_scripts=False)
ev.joined_timeseries.to_sdf().show()

+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member| configuration_name|       variable_name|reference_time|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
|2000-10-01 00:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 00:00:00|      usgs-14138800|       nwm30-23736071|    3.3413877|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14138800|       nwm30-23736071|    3.9926753|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14138800|       nwm30-23736071|     4.445745|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14138800|       nwm30-23736071|     5.408518|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14138800|       nwm30-23736071|    5.6067357|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14138800|       nwm30-23736071|     5.153666|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14138800|       nwm30-23736071|    4.5590124|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14138800|       nwm30-23736071|    5.2952504|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14138800|       nwm30-23736071|     7.730499|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14138800|       nwm30-23736071|     9.825946|           0.07|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
only showing top 20 rows

Ok, so, as shown, the joined_timeseries table now only contains the basic fields required without any attributes or user defined fields. Now lets add some fields to the joined_timeseries table using the TEEHR calculated fields. Adding fields can be useful for grouping and filtering operations. The TEEHR calculated fields (CF) allow for row level and timeseries aware CFs:

Row level CFs are applied to each row in the table based on data that is in one or more existing fields. These are applied per row and are not aware of the data in any other row (e.g., are not aware of any other timeseries values in a “timeseries”). This can be used for adding fields such as a field based on the data/time (e.g., month, year, season, etc.) or based on the value field (e.g., normalized flow, log flow, etc.) and many other uses.
Timeseries aware CFs are aware of ordered groups of data (e.g., a timeseries). This is useful for things such as event detection, base flow separation, and other fields that need to be calculated based on a entire collection of timeseries values. The definition of what creates a unique set of timeseries (i.e., a timeseries) can be specified.

There are two ways that these CFs can be used:

First, they can be used to add the CF to the joined_timeseries which can then be persisted by writing to disk. This is useful if the calculation is expected to be needed for multiple different metric calculations.
Second, they can be used as a pre-processing step in the calculation of metrics.

These use cases will be demonstrated below. First we will import the CF classes. Normally this would be done at the top of the page, but is done here for demonstration purposes.

# first we need to import the UDF classes
from teehr import RowLevelCalculatedFields as rcf
from teehr import TimeseriesAwareCalculatedFields as tcf

Available Calculated Fields#

There are a number of calculated field classes that are included in the TEEHR package. They are:

Row Level Calculated Fields:

Month
Year
WaterYear
NormalizedFlow
Seasons
ForecastLeadTime
ThresholdValueExceeded
DayOfYear

Timeseries Aware Calculated Fields.

PercentileEventDetection

There will be more added over time. If there is one you are particularly interested in, please reach out and let us know.

Add Calculated Field in Memory#

Now we will use the Row level CFs to add year, month, water year and season to the joined_timeseries table in memory, using the add_udf_columns() method on the joined_timeseries table, but will not save it to disk yet. Adding the UDFs and displaying the table shows that the new fields were added.

sdf = ev.joined_timeseries.add_calculated_fields([
    rcf.Month(),
    rcf.Year(),
    rcf.WaterYear(),
    rcf.Seasons()
]).to_sdf()
sdf.show()

+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+-----+----+----------+------+
|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member| configuration_name|       variable_name|reference_time|month|year|water_year|season|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+-----+----+----------+------+
|2000-10-01 00:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 00:00:00|      usgs-14138800|       nwm30-23736071|    3.3413877|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 01:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 01:00:00|      usgs-14138800|       nwm30-23736071|    3.9926753|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 02:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 02:00:00|      usgs-14138800|       nwm30-23736071|     4.445745|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 03:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 03:00:00|      usgs-14138800|       nwm30-23736071|     5.408518|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 04:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 04:00:00|      usgs-14138800|       nwm30-23736071|    5.6067357|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 05:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 05:00:00|      usgs-14138800|       nwm30-23736071|     5.153666|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 06:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 06:00:00|      usgs-14138800|       nwm30-23736071|    4.5590124|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 07:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 07:00:00|      usgs-14138800|       nwm30-23736071|    5.2952504|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 08:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 08:00:00|      usgs-14138800|       nwm30-23736071|     7.730499|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 09:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
|2000-10-01 09:00:00|      usgs-14138800|       nwm30-23736071|     9.825946|           0.07|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|   10|2000|      2001|  fall|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+-----+----+----------+------+
only showing top 20 rows

However, if we query the table again, we can see that the additional fields are not there. This is because we did not write the table with the additional fields to disk.

ev.joined_timeseries.to_sdf().show()

+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member| configuration_name|       variable_name|reference_time|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
|2000-10-01 00:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 00:00:00|      usgs-14138800|       nwm30-23736071|    3.3413877|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14138800|       nwm30-23736071|    3.9926753|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14138800|       nwm30-23736071|     4.445745|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14138800|       nwm30-23736071|     5.408518|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14138800|       nwm30-23736071|    5.6067357|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14138800|       nwm30-23736071|     5.153666|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14138800|       nwm30-23736071|    4.5590124|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14138800|       nwm30-23736071|    5.2952504|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14138800|       nwm30-23736071|     7.730499|           0.06|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14138800|       nwm30-23736071|     9.825946|           0.07|    m^3/s|  NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------------+--------------------+--------------+
only showing top 20 rows

Add Calculated Field and Persist#

Now we will add the UDFs again, but this time we will write the table with the additional fields to disk so they are persisted.

ev.joined_timeseries.add_calculated_fields([
    rcf.Month(),
    rcf.Year(),
    rcf.WaterYear(),
    rcf.Seasons()
]).write()

And query the table again. This time we can see that the new fields we added are there as they were written to disk and are now part of the joined_timeseries table on disk.

ev.joined_timeseries.to_sdf().show()

+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-------------------+--------------------+--------------+
|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member|month|year|water_year|season| configuration_name|       variable_name|reference_time|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-------------------+--------------------+--------------+
|2000-10-01 00:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 00:00:00|      usgs-14138800|       nwm30-23736071|    3.3413877|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14138800|       nwm30-23736071|    3.9926753|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14138800|       nwm30-23736071|     4.445745|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14138800|       nwm30-23736071|     5.408518|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14138800|       nwm30-23736071|    5.6067357|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14138800|       nwm30-23736071|     5.153666|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14138800|       nwm30-23736071|    4.5590124|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14138800|       nwm30-23736071|    5.2952504|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14138800|       nwm30-23736071|     7.730499|           0.06|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14138800|       nwm30-23736071|     9.825946|           0.07|    m^3/s|  NULL|   10|2000|      2001|  fall|nwm30_retrospective|streamflow_hourly...|          NULL|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-------------------+--------------------+--------------+
only showing top 20 rows

Timeseries Aware Calculated Fields#

The timeseries aware calculated fields behave the same way from the users perspective, but behind the scenes are performing some extra grouping and sorting to ensure that the field is calculated based on an ordered group of timeseries values (i.e., a “timeseries”). This is necessary for doing things like event detection, but comes at a computational cost, so use with care, especially on large datasets. Lets try it. This time we will jump right to writing the resulting data frame back to disk to persist it, but you could add the field and display the results without persisting as we did above.

ev.joined_timeseries.add_calculated_fields([
    tcf.PercentileEventDetection()
]).write()

And query the joined_timeseries table to see the new event and event_id fields.

ev.joined_timeseries.to_sdf().show()

+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-----+--------+-------------------+--------------------+--------------+
|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member|month|year|water_year|season|event|event_id| configuration_name|       variable_name|reference_time|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-----+--------+-------------------+--------------------+--------------+
|2000-10-01 00:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 01:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 02:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 03:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 04:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 05:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 06:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 07:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 08:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 09:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 10:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 11:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 12:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 13:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 14:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 15:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 16:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 17:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 18:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
|2000-10-01 19:00:00|      usgs-14316700|       nwm30-23894572|    1.1326739|           0.38|    m^3/s|  NULL|   10|2000|      2001|  fall|false|    NULL|nwm30_retrospective|streamflow_hourly...|          NULL|
+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-----+----+----------+------+-----+--------+-------------------+--------------------+--------------+
only showing top 20 rows

In the table above you can see the event and event_id fields, but it is a bit difficult to see what was done based on the table alone. Lets create a plot to see what the new fields mean. In the next few cells we will query the joined_timeseries table and filter the data to a single location (‘usgs-14138800’) and only values that were identified as event=True, then create a plot where we color the points by the new event_id field.

import hvplot.pandas  # noqa

pdf = ev.joined_timeseries.filter([
    "primary_location_id = 'usgs-14138800'",
    "event = true",
]).to_pandas()

primary_plot = pdf.hvplot.points(x="value_time", y="primary_value", color="event_id") #.opts(width=1200, height=400)

primary_plot.opts(width=1200, height=400)

If you zoom in on the plot above, you can see that each unique event has been given a unique event_id that was used to color the individual data points. This was done by first identifying the 85 percentile value at each location and identifying each value as either above or below that value for its location. Then the values were grouped by continuous sets of values that were identified as event=True and each continuous group was given an event_id. This is just one example of what can be done using TimeseriesAwareCalculatedFields.

Use Calculated Fields in Metrics#

Beyond just identifying high flow events (in this example), the event and event_id fields can be used in subsequent metrics calculations. In the next few cells we will demonstrate one way that the event and event_id fields can be used by working through the steps to calculate the event_max_relative_bias. The event_max_relative_bias is the relative bias between the primary and secondary timeseries maximum values within each event. We calculate this in twos steps using chained queries. We will do it in two steps to demonstrate what we are doing. Note, if you have not worked though the grouping and filter notebooks you may want to go back and do that first as it is an important concept to understanding what is being done here. First we run a metrics query where we filter to a single location and only values that were identified as being event=True, group by configuration_name, primary_location_id and event_id, and calculate the maximum primary and secondary values which we call max_primary_value and max_secondary_value but could give them any name we wanted.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "event_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event = true",
        ],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .to_sdf().show()
)

+-------------------+-------------------+--------------------+-----------------+-------------------+
| configuration_name|primary_location_id|            event_id|max_primary_value|max_secondary_value|
+-------------------+-------------------+--------------------+-----------------+-------------------+
|nwm30_retrospective|      usgs-14138800|2000-10-01 00:00:...|        11.015253|               0.07|
|nwm30_retrospective|      usgs-14138800|2000-10-21 03:00:...|         5.153666|               0.96|
|nwm30_retrospective|      usgs-14138800|2000-11-08 15:00:...|          3.68119|               1.43|
|nwm30_retrospective|      usgs-14138800|2000-11-27 09:00:...|        7.5039644|               1.24|
|nwm30_retrospective|      usgs-14138800|2000-12-15 03:00:...|        2.8600016|               0.64|
|nwm30_retrospective|      usgs-14138800|2000-12-15 07:00:...|          3.68119|               0.64|
|nwm30_retrospective|      usgs-14138800|2000-12-17 04:00:...|         4.049309|               1.24|
|nwm30_retrospective|      usgs-14138800|2000-12-22 23:00:...|        5.5501018|                1.1|
|nwm30_retrospective|      usgs-14138800|2001-01-05 15:00:...|        3.3413877|               2.47|
|nwm30_retrospective|      usgs-14138800|2001-02-05 01:00:...|        7.2207956|               8.19|
|nwm30_retrospective|      usgs-14138800|2001-03-18 14:00:...|        12.884165|              11.99|
|nwm30_retrospective|      usgs-14138800|2001-03-25 12:00:...|        3.3413877|               4.33|
|nwm30_retrospective|      usgs-14138800|2001-03-25 22:00:...|        3.1148531|               5.66|
|nwm30_retrospective|      usgs-14138800|2001-03-26 22:00:...|        2.8600016|                4.0|
|nwm30_retrospective|      usgs-14138800|2001-03-28 01:00:...|         7.305746|          7.3399997|
|nwm30_retrospective|      usgs-14138800|2001-03-31 17:00:...|         8.070301|               4.89|
|nwm30_retrospective|      usgs-14138800|2001-04-25 23:00:...|         3.879408|          1.4699999|
|nwm30_retrospective|      usgs-14138800|2001-04-30 12:00:...|        14.016839|               8.44|
|nwm30_retrospective|      usgs-14138800|2001-05-14 22:00:...|          9.42951|          1.4399999|
|nwm30_retrospective|      usgs-14138800|2001-06-03 15:00:...|        2.8600016|         0.71999997|
+-------------------+-------------------+--------------------+-----------------+-------------------+
only showing top 20 rows

You can see that this gives us the max_primary_value and max_secondary_value for each unique group of configuration_name, primary_location_id and event_id. But, that is not what we are actually trying to calculate. We are really trying to calculate the event_max_relative_bias. To do that we have to add one more step by chaining together queries. In the following query we add an additional chained query where we only group by configuration_name and primary_location_id which causes the query to aggregate the values from the different events, and then as an aggregation method (include_metrics) we choose relative bias.

(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "event_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event = true",
        ],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.DeterministicMetrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="event_max_relative_bias"
            )
        ]
    )
    .to_sdf().show()
)

+-------------------+-------------------+-----------------------+
| configuration_name|primary_location_id|event_max_relative_bias|
+-------------------+-------------------+-----------------------+
|nwm30_retrospective|      usgs-14138800|            -0.08458245|
+-------------------+-------------------+-----------------------+

One last thing to cover here. So far we have added the calculated fields on the joined_timeseries table, written them to disk, and then queried the joined_timeseries table to calculate metrics. This works well and allows the calculated fields to be calculated once and used in many subsequent metrics, plots, etc. However, you may wish to add temporary fields to the joined_timeseries table as part of the metrics calculation. This can be done too. Building on the pervious example, where we calculated the “event_max_relative_bias”, lets now assume we want to calculate the same metric but for the 90th percentile instead of the default 85th percentile that we used when we added the added the event and event_id fields to the joined_timeseries table. We could add the new “90th percentile event” to the joined_timeseries table and save to disk and then proceed as we did before, or we can add new event90 and event90_id fields to the data frame temporarily before calculating the maximum event values and ultimately the “event_90th_max_relative_bias”.

(
    ev.metrics
    # Add the PercentileEventDetection calculated field to identify events greater than the 90th percentile.
    # Note the output_event_field_name and output_event_id_field_name are set to "event90" and "event90_id" respectively.
    .add_calculated_fields([
        tcf.PercentileEventDetection(
            quantile=0.90,
            output_event_field_name="event90",
            output_event_id_field_name="event90_id"
        )
    ])
    # First query to calculate the maximum primary and secondary values for each event.
    # Note the filters are set to only include events where event90 is true and the group_by includes event90_id.
    .query(
        group_by=["configuration_name", "primary_location_id", "event90_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event90 = true",
        ],
        include_metrics=[
            teehr.SignatureMetrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.SignatureMetrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    # Second query to calculate the relative bias between the maximum primary and secondary values.
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.DeterministicMetrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="event_90th_max_relative_bias"
            )
        ]
    )
    # Convert the metrics to a pandas DataFrame
    .to_sdf().show()
)

+-------------------+-------------------+----------------------------+
| configuration_name|primary_location_id|event_90th_max_relative_bias|
+-------------------+-------------------+----------------------------+
|nwm30_retrospective|      usgs-14138800|                 -0.08735727|
+-------------------+-------------------+----------------------------+

ev.spark.stop()