Setting-up a Real Example#

Overview#

In this lesson we will set up a more realistic, but still very small, example using real gage locations. In this case we will use a few data files that we have on hand in the repository for the locations and location_crosswalks and fetch the USGS gage data and the NWM v3.0 streamflow simulation data from NWIS and AWS, respectively.

Create a new Evaluation#

First we will import TEEHR along with some other required libraries for this example. Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored and clone the basic evaluation template.

import teehr
import teehr.example_data.two_locations as two_locations_data
import pandas as pd
import geopandas as gpd
from pathlib import Path
import shutil

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...
# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "04_setup_real_example")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

# Clone the template
ev.clone_template()

Now that we have the template cloned, let's fetch the locations and location_crosswalks files from the repository. When setting up a brand-new evaluation you may need to develop this information yourself; however, as we will show in subsequent examples, the TEEHR team maintains several baseline datasets that can be used as a starting point. For this example, we will download some small files from the repository.

location_data_path = Path(test_eval_dir, "two_locations.parquet")
two_locations_data.fetch_file("two_locations.parquet", location_data_path)

crosswalk_data_path = Path(test_eval_dir, "two_crosswalks.parquet")
two_locations_data.fetch_file("two_crosswalks.parquet", crosswalk_data_path)

Location Data#

As we have done in previous examples, let's open the spatial data file and examine its contents before loading it into the TEEHR dataset.

gdf = gpd.read_parquet(location_data_path)
gdf
id name geometry
0 usgs-14316700 STEAMBOAT CREEK NEAR GLIDE, OR POINT (-122.72894 43.34984)
1 usgs-14138800 BLAZED ALDER CREEK NEAR RHODODENDRON, OR POINT (-121.89147 45.45262)

As you can see, the file contains 2 USGS gages located in Oregon. Let's load them into the TEEHR dataset using the load_spatial() method. Note that the id column contains two hyphen-separated values: to the left of the hyphen is a set of characters that indicates the "source", while to the right is the identifier within that source. While not strictly necessary, this is a standard used within TEEHR. It is required, however, when using the fetching functionality to fetch data from remote sources, as demonstrated below.
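The source-prefix convention can be handled with ordinary string operations; a minimal sketch (illustrative only, not part of the TEEHR API):

```python
# Split a TEEHR-style location id into its source prefix and identifier.
# Split on the first hyphen only, since identifiers may themselves
# contain hyphens.
def split_location_id(location_id: str) -> tuple[str, str]:
    source, _, identifier = location_id.partition("-")
    return source, identifier

print(split_location_id("usgs-14316700"))  # ('usgs', '14316700')
```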

ev.locations.load_spatial(location_data_path)

And, as we have done in previous examples, let's query the locations table as a GeoPandas GeoDataFrame and then plot the gages on a map using TEEHR's plotting methods.

locations_gdf = ev.locations.to_geopandas()
locations_gdf.teehr.location_map()

USGS Primary Timeseries#

In previous examples we loaded the primary timeseries from files that we already had. In the following cells we will use the ev.fetch.usgs_streamflow() method to fetch streamflow data from NWIS. This data is automatically formatted and stored in the TEEHR dataset.

ev.fetch.usgs_streamflow(
    start_date="2020-01-01",
    end_date="2020-12-31"
)

OK, now that we have fetched the USGS gage data and stored it in the TEEHR dataset as primary_timeseries, we can query the primary_timeseries table and plot the data using the teehr.timeseries_plot() method.

pt_df = ev.primary_timeseries.to_pandas()
pt_df.head()
reference_time value_time value unit_name location_id configuration_name variable_name
0 NaT 2020-01-01 00:00:00 0.710753 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
1 NaT 2020-01-01 01:00:00 0.807030 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
2 NaT 2020-01-01 02:00:00 1.081704 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
3 NaT 2020-01-01 03:00:00 1.531941 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
4 NaT 2020-01-01 04:00:00 2.211546 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
pt_df.teehr.timeseries_plot()

Location Crosswalk#

As we saw above, the IDs used in the locations table reference the USGS gages (e.g., usgs-14138800). Before we load any secondary timeseries into the TEEHR dataset, we first need to load location crosswalk data into the location_crosswalks table to relate the primary_location_id values to the secondary_location_id values. In this case, because we are going to fetch NWM v3.0 data, we need to load a crosswalk that relates the USGS gage IDs to the NWM IDs. We downloaded this data from the repository earlier. As with the other datasets, we will first examine the data and then load it into the TEEHR dataset.
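If you do not already have a crosswalk file, one can be assembled by hand with pandas and saved to Parquet; a hypothetical sketch using the two gages from this lesson:

```python
import pandas as pd

# Build a minimal crosswalk relating USGS gage IDs (primary) to
# NWM v3.0 reach IDs (secondary).
crosswalk = pd.DataFrame(
    {
        "primary_location_id": ["usgs-14316700", "usgs-14138800"],
        "secondary_location_id": ["nwm30-23894572", "nwm30-23736071"],
    }
)

# Save for loading via ev.location_crosswalks.load_parquet():
# crosswalk.to_parquet("two_crosswalks.parquet")
```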

location_crosswalk_path = Path(test_eval_dir, "two_crosswalks.parquet")
crosswalk_df = pd.read_parquet(location_crosswalk_path)
crosswalk_df
secondary_location_id primary_location_id __index_level_0__
0 nwm30-23894572 usgs-14316700 7851
1 nwm30-23736071 usgs-14138800 7709
ev.location_crosswalks.load_parquet(
    location_crosswalk_path
)
ev.location_crosswalks.to_pandas()
primary_location_id secondary_location_id
0 usgs-14316700 nwm30-23894572
1 usgs-14138800 nwm30-23736071

NWM v3.0 Secondary Timeseries#

Now that we have the USGS to NWM v3.0 crosswalk data loaded into the TEEHR dataset, we can fetch the NWM retrospective data into the TEEHR dataset as secondary_timeseries.

ev.fetch.nwm_retrospective_points(
    nwm_version="nwm30",
    variable_name="streamflow",
    start_date="2020-01-01",
    end_date="2020-12-31",
)

As we did with the primary_timeseries, once the data is in the TEEHR dataset, we can query the timeseries data and visualize it as a table and then as a plot.

st_df = ev.secondary_timeseries.to_pandas()
st_df.head()
reference_time value_time value unit_name location_id configuration_name variable_name
0 NaT 2020-01-01 00:00:00 10.440000 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
1 NaT 2020-01-01 00:00:00 1.260000 m^3/s nwm30-23736071 nwm30_retrospective streamflow_hourly_inst
2 NaT 2020-01-01 01:00:00 10.429999 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
3 NaT 2020-01-01 01:00:00 1.840000 m^3/s nwm30-23736071 nwm30_retrospective streamflow_hourly_inst
4 NaT 2020-01-01 02:00:00 10.460000 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
st_df.teehr.timeseries_plot()

Joined Timeseries#

As we did in previous examples, once the TEEHR dataset tables are populated, we can create the joined_timeseries view and populate the joined_timeseries table. By default the method joins the primary_timeseries to the secondary_timeseries and also joins the location_attributes, but the user can control whether the user_defined_functions.py script is executed. In this case, we execute it.

ev.joined_timeseries.create(execute_udf=True)

Now that we have created the joined_timeseries table, let's take a look at what it contains.

jt_df = ev.joined_timeseries.to_pandas()
jt_df.head()
reference_time value_time primary_location_id secondary_location_id primary_value secondary_value unit_name month year water_year configuration_name variable_name
0 NaT 2020-01-01 00:00:00 usgs-14316700 nwm30-23894572 3.313071 10.440000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
1 NaT 2020-01-01 00:00:00 usgs-14138800 nwm30-23736071 0.710753 1.260000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
2 NaT 2020-01-01 01:00:00 usgs-14316700 nwm30-23894572 3.313071 10.429999 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
3 NaT 2020-01-01 01:00:00 usgs-14138800 nwm30-23736071 0.807030 1.840000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
4 NaT 2020-01-01 02:00:00 usgs-14316700 nwm30-23894572 3.313071 10.460000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
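Because the joined table places primary and secondary values side by side, simple diagnostics can be computed directly with pandas before running the full TEEHR metrics. A sketch using a toy frame with the same column names as shown above (toy values are assumptions, not the full dataset):

```python
import pandas as pd

# Toy stand-in for jt_df with a few of the columns shown above.
jt = pd.DataFrame(
    {
        "primary_location_id": ["usgs-14316700", "usgs-14316700", "usgs-14138800"],
        "primary_value": [3.31, 3.31, 0.71],
        "secondary_value": [10.44, 10.43, 1.26],
    }
)

# Mean bias (simulated minus observed) per location, in m^3/s.
bias = (
    (jt["secondary_value"] - jt["primary_value"])
    .groupby(jt["primary_location_id"])
    .mean()
)
print(bias)
```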
ev.spark.stop()

That concludes 04 Setting-up a Real Example. In this example, we downloaded some location data from the repository, inspected it, and loaded it into the TEEHR dataset, and then fetched USGS and NWM v3.0 data from NWIS and AWS, respectively. Finally, we created the joined_timeseries table, which serves as the main data table for generating metrics and conducting an evaluation.

In the next lesson, we will clone an Evaluation dataset from S3 and run some metrics on it.