Setting-up a Real Example#

Overview#

In this lesson we will set up a more realistic, but still very small, example using real gage locations. In this case we will use a few data files that we have on hand in the repository for the locations and location_crosswalks and fetch the USGS gage data and the NWM v3.0 streamflow simulation data from NWIS and AWS, respectively.

Create a new Evaluation#

First we will import TEEHR along with some other required libraries for this example. Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored and clone the basic evaluation template.

import teehr
import teehr.example_data.two_locations as two_locations_data
import pandas as pd
import geopandas as gpd
from pathlib import Path
import shutil

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...
# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "04_setup_real_example")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

# Clone the template
ev.clone_template()

Now that we have the template cloned, let's fetch the locations and location_crosswalks files from the repository. When setting up a brand-new evaluation you may need to develop this information yourself; however, as we will show in subsequent examples, the TEEHR team maintains several baseline datasets that can be used as a starting point. For this example, we will download some small files from the repository.

location_data_path = Path(test_eval_dir, "two_locations.parquet")
two_locations_data.fetch_file("two_locations.parquet", location_data_path)

crosswalk_data_path = Path(test_eval_dir, "two_crosswalks.parquet")
two_locations_data.fetch_file("two_crosswalks.parquet", crosswalk_data_path)

Location Data#

As we have done in previous examples, let's open the spatial data file and examine its contents before loading it into the TEEHR dataset.

gdf = gpd.read_parquet(location_data_path)
gdf
id name geometry
0 usgs-14316700 STEAMBOAT CREEK NEAR GLIDE, OR POINT (-122.72894 43.34984)
1 usgs-14138800 BLAZED ALDER CREEK NEAR RHODODENDRON, OR POINT (-121.89147 45.45262)

As you can see, the file contains 2 USGS gages located in Oregon. Let's load them into the TEEHR dataset using the load_spatial() method. Note that the id column contains two hyphen-separated values: to the left of the hyphen is a set of characters that indicates the "source", while to the right is the identifier within that source. While not strictly necessary, this is a standard used within TEEHR. It is required, however, when using the fetching functionality to fetch data from remote sources, as demonstrated below.
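The source-prefix convention can be handled with ordinary string operations; a minimal sketch (illustrative only, not part of the TEEHR API):

```python
# Split a TEEHR-style location id into its source prefix and identifier.
# Split on the first hyphen only, since identifiers may themselves
# contain hyphens.
def split_location_id(location_id: str) -> tuple[str, str]:
    source, _, identifier = location_id.partition("-")
    return source, identifier

print(split_location_id("usgs-14316700"))  # ('usgs', '14316700')
```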

ev.locations.load_spatial(location_data_path)

And, as we have done in previous examples, let's query the locations table as a GeoPandas GeoDataFrame and then plot the gages on a map using TEEHR's plotting methods.

locations_gdf = ev.locations.to_geopandas()
locations_gdf.teehr.location_map()

USGS Primary Timeseries#

In previous examples we loaded the primary timeseries from files that we already had. In the following cells we will use the ev.fetch.usgs_streamflow() method to fetch streamflow data from NWIS. This data is automatically formatted and stored in the TEEHR dataset.

ev.fetch.usgs_streamflow(
    start_date="2020-01-01",
    end_date="2020-12-31"
)

OK, now that we have fetched the USGS gage data and stored it in the TEEHR dataset as primary_timeseries, we can query the primary_timeseries table and plot the data using the teehr.timeseries_plot() method.

pt_df = ev.primary_timeseries.to_pandas()
pt_df.head()
reference_time value_time value unit_name location_id configuration_name variable_name
0 NaT 2020-01-01 00:00:00 0.710753 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
1 NaT 2020-01-01 01:00:00 0.807030 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
2 NaT 2020-01-01 02:00:00 1.081704 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
3 NaT 2020-01-01 03:00:00 1.531941 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
4 NaT 2020-01-01 04:00:00 2.211546 m^3/s usgs-14138800 usgs_observations streamflow_hourly_inst
pt_df.teehr.timeseries_plot()

Location Crosswalk#

As we saw above, the IDs used in the locations table reference the USGS gages (e.g., usgs-14138800). Before we load any secondary timeseries into the TEEHR dataset, we first need to load location crosswalk data into the location_crosswalks table to relate the primary_location_id values to the secondary_location_id values. In this case, because we are going to fetch NWM v3.0 data, we need to load a crosswalk that relates the USGS gage IDs to the NWM IDs. We downloaded this data from the repository earlier. As with the other datasets, we will first examine the data and then load it into the TEEHR dataset.
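If you do not already have a crosswalk file, one can be assembled by hand with pandas and saved to Parquet; a hypothetical sketch using the two gages from this lesson:

```python
import pandas as pd

# Build a minimal crosswalk relating USGS gage IDs (primary) to
# NWM v3.0 reach IDs (secondary).
crosswalk = pd.DataFrame(
    {
        "primary_location_id": ["usgs-14316700", "usgs-14138800"],
        "secondary_location_id": ["nwm30-23894572", "nwm30-23736071"],
    }
)

# Save for loading via ev.location_crosswalks.load_parquet():
# crosswalk.to_parquet("two_crosswalks.parquet")
```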

location_crosswalk_path = Path(test_eval_dir, "two_crosswalks.parquet")
crosswalk_df = pd.read_parquet(location_crosswalk_path)
crosswalk_df
secondary_location_id primary_location_id __index_level_0__
0 nwm30-23894572 usgs-14316700 7851
1 nwm30-23736071 usgs-14138800 7709
ev.location_crosswalks.load_parquet(
    location_crosswalk_path
)
ev.location_crosswalks.to_pandas()
primary_location_id secondary_location_id
0 usgs-14316700 nwm30-23894572
1 usgs-14138800 nwm30-23736071

NWM v3.0 Secondary Timeseries#

Now that we have the USGS to NWM v3.0 crosswalk data loaded into the TEEHR dataset, we can fetch the NWM retrospective data into the TEEHR dataset as secondary_timeseries.

ev.fetch.nwm_retrospective_points(
    nwm_version="nwm30",
    variable_name="streamflow",
    start_date="2020-01-01",
    end_date="2020-12-31",
)

As we did with the primary_timeseries, once the data is in the TEEHR dataset, we can query the timeseries data and visualize it as a table and then as a plot.

st_df = ev.secondary_timeseries.to_pandas()
st_df.head()
reference_time value_time value unit_name location_id configuration_name variable_name
0 NaT 2020-01-01 00:00:00 10.440000 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
1 NaT 2020-01-01 00:00:00 1.260000 m^3/s nwm30-23736071 nwm30_retrospective streamflow_hourly_inst
2 NaT 2020-01-01 01:00:00 10.429999 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
3 NaT 2020-01-01 01:00:00 1.840000 m^3/s nwm30-23736071 nwm30_retrospective streamflow_hourly_inst
4 NaT 2020-01-01 02:00:00 10.460000 m^3/s nwm30-23894572 nwm30_retrospective streamflow_hourly_inst
st_df.teehr.timeseries_plot()

Joined Timeseries#

As we did in previous examples, once the TEEHR dataset tables are populated, we can create the joined_timeseries view and populate the joined_timeseries table. By default the method joins the primary_timeseries to the secondary_timeseries and also joins the location_attributes, but the user can control whether the user_defined_functions.py script is executed. In this case, we execute it.

ev.joined_timeseries.create(execute_udf=True)

Now that we have created the joined_timeseries table, let's take a look at what it contains.

jt_df = ev.joined_timeseries.to_pandas()
jt_df.head()
reference_time value_time primary_location_id secondary_location_id primary_value secondary_value unit_name month year water_year configuration_name variable_name
0 NaT 2020-01-01 00:00:00 usgs-14316700 nwm30-23894572 3.313071 10.440000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
1 NaT 2020-01-01 00:00:00 usgs-14138800 nwm30-23736071 0.710753 1.260000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
2 NaT 2020-01-01 01:00:00 usgs-14316700 nwm30-23894572 3.313071 10.429999 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
3 NaT 2020-01-01 01:00:00 usgs-14138800 nwm30-23736071 0.807030 1.840000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
4 NaT 2020-01-01 02:00:00 usgs-14316700 nwm30-23894572 3.313071 10.460000 m^3/s 1 2020 2020 nwm30_retrospective streamflow_hourly_inst
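Because the joined table places primary and secondary values side by side, simple diagnostics can be computed directly with pandas before running the full TEEHR metrics. A sketch using a toy frame with the same column names as shown above (toy values are assumptions, not the full dataset):

```python
import pandas as pd

# Toy stand-in for jt_df with a few of the columns shown above.
jt = pd.DataFrame(
    {
        "primary_location_id": ["usgs-14316700", "usgs-14316700", "usgs-14138800"],
        "primary_value": [3.31, 3.31, 0.71],
        "secondary_value": [10.44, 10.43, 1.26],
    }
)

# Mean bias (simulated minus observed) per location, in m^3/s.
bias = (
    (jt["secondary_value"] - jt["primary_value"])
    .groupby(jt["primary_location_id"])
    .mean()
)
print(bias)
```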
ev.spark.stop()

That concludes 04 Setting-up a Real Example. In this example, we downloaded some location data from the repository, inspected it, and loaded it into the TEEHR dataset, and then fetched USGS and NWM v3.0 data from NWIS and AWS, respectively. Finally, we created the joined_timeseries table, which serves as the main data table for generating metrics and conducting an evaluation.

In the next lesson, we will clone an Evaluation dataset from S3 and run some metrics on it.