Setting up a Real Example#
Overview#
In this lesson we will set up a more realistic, but still very small, example using real gage locations. We will use a few data files that we have on hand in the repository for the locations and location_crosswalks tables, and fetch the USGS gage data and the NWM v3.0 streamflow simulation data from NWIS and AWS, respectively.
Create a new Evaluation#
First we will import TEEHR along with some other required libraries for this example. Then we create a new Evaluation instance that points to a directory where the evaluation data will be stored, and clone the basic evaluation template.
import teehr
import teehr.example_data.two_locations as two_locations_data
import pandas as pd
import geopandas as gpd
from pathlib import Path
import shutil
# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()
# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "04_setup_real_example")
shutil.rmtree(test_eval_dir, ignore_errors=True)
# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)
# Clone the template
ev.clone_template()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/19 18:51:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
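Before moving on, it can be helpful to peek at what clone_template() actually wrote to the evaluation directory. The following is just standard pathlib (not a TEEHR API), and the exact contents will vary by TEEHR version:
# List the files and folders created by clone_template()
# (exact contents depend on the TEEHR version).
for path in sorted(test_eval_dir.rglob("*")):
    print(path.relative_to(test_eval_dir))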
Now that we have the template cloned, let's fetch the locations and location_crosswalks files from the repository. When setting up a brand new evaluation you may need to develop this information yourself; however, as we will show in subsequent examples, the TEEHR team maintains several baseline datasets that can be used as a starting point. For this example, we will download some small files from the repository.
location_data_path = Path(test_eval_dir, "two_locations.parquet")
two_locations_data.fetch_file("two_locations.parquet", location_data_path)
crosswalk_data_path = Path(test_eval_dir, "two_crosswalks.parquet")
two_locations_data.fetch_file("two_crosswalks.parquet", crosswalk_data_path)
Location Data#
As we have done in previous examples, let's open the spatial data file and examine its contents before loading it into the TEEHR dataset.
gdf = gpd.read_parquet(location_data_path)
gdf
|   | id | name | geometry |
|---|----|------|----------|
| 0 | usgs-14316700 | STEAMBOAT CREEK NEAR GLIDE, OR | POINT (-122.72894 43.34984) |
| 1 | usgs-14138800 | BLAZED ALDER CREEK NEAR RHODODENDRON, OR | POINT (-121.89147 45.45262) |
As you can see, the file contains 2 USGS gages located in Oregon. Let's load them into the TEEHR dataset using the load_spatial() method. Note that the id column contains two hyphen-separated values: to the left of the hyphen is a set of characters that indicates the "source", while to the right is the identifier within that source. While not strictly required in general, this is a standard convention within TEEHR, and it is required when using the fetching functionality to fetch data from remote sources, as demonstrated below.
ev.locations.load_spatial(location_data_path)
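As an aside, because the id values follow the source-identifier convention described above, the source prefix and the raw gage number are easy to recover with plain string operations. A minimal sketch (plain Python, not a TEEHR helper):
# Split a TEEHR location id into its "source" prefix and raw identifier.
source, identifier = "usgs-14316700".split("-", 1)
print(source)      # usgs
print(identifier)  # 14316700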
And, as we have done in previous examples, let's query the locations table as a GeoPandas GeoDataFrame and then plot the gages on a map using the TEEHR location_map() plotting method.
locations_gdf = ev.locations.to_geopandas()
locations_gdf.teehr.location_map()
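Note that locations_gdf is an ordinary GeoPandas GeoDataFrame, so if you prefer a static figure over the interactive TEEHR map, standard GeoPandas plotting also works (this sketch assumes matplotlib is installed):
# Static alternative to the interactive TEEHR map.
ax = locations_gdf.plot(marker="o", color="tab:blue")
ax.set_title("USGS gage locations")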
USGS Primary Timeseries#
In previous examples we loaded the primary timeseries from files that we already had. In the following cells we will use the ev.fetch.usgs_streamflow() method to fetch streamflow data from NWIS. The data is automatically formatted and stored in the TEEHR dataset.
ev.fetch.usgs_streamflow(
start_date="2020-01-01",
end_date="2020-12-31"
)
OK, now that we have fetched the USGS gage data and stored it in the TEEHR dataset as primary_timeseries, we will query the primary_timeseries table and plot the timeseries data using the teehr.timeseries_plot() method.
pt_df = ev.primary_timeseries.to_pandas()
pt_df.head()
|   | reference_time | value_time | value | unit_name | location_id | configuration_name | variable_name |
|---|----------------|------------|-------|-----------|-------------|--------------------|---------------|
| 0 | NaT | 2020-01-01 00:00:00 | 0.710753 | m^3/s | usgs-14138800 | usgs_observations | streamflow_hourly_inst |
| 1 | NaT | 2020-01-01 01:00:00 | 0.807030 | m^3/s | usgs-14138800 | usgs_observations | streamflow_hourly_inst |
| 2 | NaT | 2020-01-01 02:00:00 | 1.081704 | m^3/s | usgs-14138800 | usgs_observations | streamflow_hourly_inst |
| 3 | NaT | 2020-01-01 03:00:00 | 1.531941 | m^3/s | usgs-14138800 | usgs_observations | streamflow_hourly_inst |
| 4 | NaT | 2020-01-01 04:00:00 | 2.211546 | m^3/s | usgs-14138800 | usgs_observations | streamflow_hourly_inst |
pt_df.teehr.timeseries_plot()
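Since pt_df is a plain pandas DataFrame, you can also summarize it with standard pandas before plotting, for example to sanity-check record counts and flow ranges per gage:
# Quick sanity check: record counts and basic flow statistics per gage.
pt_df.groupby("location_id")["value"].describe()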
Location Crosswalk#
As we saw above, the IDs used for the location data in the locations table reference the USGS gages (e.g., usgs-14138800). Before we load any secondary timeseries into the TEEHR dataset, we first need to load location crosswalk data into the location_crosswalks table to relate the primary_location_id values to the secondary_location_id values. In this case, because we are going to be fetching NWM v3.0 data, we need to load a crosswalk that relates the USGS gage IDs to the NWM IDs. We downloaded this data from the repository earlier. As with the other datasets, first we will examine the data and then load it into the TEEHR dataset.
location_crosswalk_path = Path(test_eval_dir, "two_crosswalks.parquet")
crosswalk_df = pd.read_parquet(location_crosswalk_path)
crosswalk_df
|   | secondary_location_id | primary_location_id | __index_level_0__ |
|---|-----------------------|---------------------|-------------------|
| 0 | nwm30-23894572 | usgs-14316700 | 7851 |
| 1 | nwm30-23736071 | usgs-14138800 | 7709 |
ev.location_crosswalks.load_parquet(
location_crosswalk_path
)
ev.location_crosswalks.to_pandas()
|   | primary_location_id | secondary_location_id |
|---|---------------------|-----------------------|
| 0 | usgs-14316700 | nwm30-23894572 |
| 1 | usgs-14138800 | nwm30-23736071 |
NWM v3.0 Secondary Timeseries#
Now that we have the USGS-to-NWM v3.0 crosswalk data loaded into the TEEHR dataset, we can fetch the NWM v3.0 retrospective data and store it in the TEEHR dataset as secondary_timeseries.
ev.fetch.nwm_retrospective_points(
nwm_version="nwm30",
variable_name="streamflow",
start_date="2020-01-01",
end_date="2020-12-31",
)
As we did with the primary_timeseries, once the data is in the TEEHR dataset, we can query the timeseries data and visualize it as a table and then as a plot.
st_df = ev.secondary_timeseries.to_pandas()
st_df.head()
|   | reference_time | value_time | value | unit_name | location_id | configuration_name | variable_name |
|---|----------------|------------|-------|-----------|-------------|--------------------|---------------|
| 0 | NaT | 2020-01-01 00:00:00 | 10.440000 | m^3/s | nwm30-23894572 | nwm30_retrospective | streamflow_hourly_inst |
| 1 | NaT | 2020-01-01 00:00:00 | 1.260000 | m^3/s | nwm30-23736071 | nwm30_retrospective | streamflow_hourly_inst |
| 2 | NaT | 2020-01-01 01:00:00 | 10.429999 | m^3/s | nwm30-23894572 | nwm30_retrospective | streamflow_hourly_inst |
| 3 | NaT | 2020-01-01 01:00:00 | 1.840000 | m^3/s | nwm30-23736071 | nwm30_retrospective | streamflow_hourly_inst |
| 4 | NaT | 2020-01-01 02:00:00 | 10.460000 | m^3/s | nwm30-23894572 | nwm30_retrospective | streamflow_hourly_inst |
st_df.teehr.timeseries_plot()
Joined Timeseries#
Like we did in previous examples, once we have the TEEHR dataset tables populated, we can create the joined_timeseries view and populate the joined_timeseries table. By default, the method joins the primary_timeseries to the secondary_timeseries and also joins the location_attributes, and the user can control whether the user_defined_functions.py script is executed. In this case, we do execute it.
ev.joined_timeseries.create(execute_udf=True)
Now that we have created the joined_timeseries table, let's take a look at what it contains.
jt_df = ev.joined_timeseries.to_pandas()
jt_df.head()
|   | reference_time | value_time | primary_location_id | secondary_location_id | primary_value | secondary_value | unit_name | month | year | water_year | configuration_name | variable_name |
|---|----------------|------------|---------------------|-----------------------|---------------|-----------------|-----------|-------|------|------------|--------------------|---------------|
| 0 | NaT | 2020-01-01 00:00:00 | usgs-14316700 | nwm30-23894572 | 3.313071 | 10.440000 | m^3/s | 1 | 2020 | 2020 | nwm30_retrospective | streamflow_hourly_inst |
| 1 | NaT | 2020-01-01 00:00:00 | usgs-14138800 | nwm30-23736071 | 0.710753 | 1.260000 | m^3/s | 1 | 2020 | 2020 | nwm30_retrospective | streamflow_hourly_inst |
| 2 | NaT | 2020-01-01 01:00:00 | usgs-14316700 | nwm30-23894572 | 3.313071 | 10.429999 | m^3/s | 1 | 2020 | 2020 | nwm30_retrospective | streamflow_hourly_inst |
| 3 | NaT | 2020-01-01 01:00:00 | usgs-14138800 | nwm30-23736071 | 0.807030 | 1.840000 | m^3/s | 1 | 2020 | 2020 | nwm30_retrospective | streamflow_hourly_inst |
| 4 | NaT | 2020-01-01 02:00:00 | usgs-14316700 | nwm30-23894572 | 3.313071 | 10.460000 | m^3/s | 1 | 2020 | 2020 | nwm30_retrospective | streamflow_hourly_inst |
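Conceptually, the join pairs primary and secondary values at the same location and time, using the crosswalk to map between the two ID schemes. The following is a simplified pandas sketch of the same idea, using the DataFrames queried above (the real join runs in Spark inside ev.joined_timeseries.create() and also handles reference times, attributes, and user-defined fields):
# Map secondary IDs to primary IDs via the crosswalk, then match on
# value_time. A conceptual sketch only, not how TEEHR does it internally.
manual_join = (
    st_df.rename(columns={"location_id": "secondary_location_id",
                          "value": "secondary_value"})
    .merge(crosswalk_df[["secondary_location_id", "primary_location_id"]],
           on="secondary_location_id")
    .merge(pt_df.rename(columns={"location_id": "primary_location_id",
                                 "value": "primary_value"}),
           on=["primary_location_id", "value_time"])
)
manual_join[["value_time", "primary_location_id", "primary_value",
             "secondary_value"]].head()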
ev.spark.stop()
That concludes 04 Setting up a Real Example. In this example we loaded some location data that we had on hand (downloaded from the repository), inspected it, loaded it into the TEEHR dataset, and then fetched USGS and NWM v3.0 data from NWIS and AWS, respectively. Finally, we created the joined_timeseries table that serves as the main data table for generating metrics and conducting an evaluation.
In the next lesson, we will clone an Evaluation dataset from S3 and run some metrics on it.