{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting-up a Simple Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "In this lesson we will set up a more realistic, but still very small, example using real gage locations. In this case we will use a few data files that we have on hand in the repository for the `locations` and `location_crosswalks` and fetch the USGS gage data and the NWM v3.0 streamflow simulation data from NWIS and AWS, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a new Evaluation\n", "First we will import TEEHR along with some other required libraries for this example. Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored and clone the basic evaluation template." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teehr\n", "import teehr.example_data.two_locations as two_locations_data\n", "import pandas as pd\n", "import geopandas as gpd\n", "from pathlib import Path\n", "import shutil\n", "\n", "# Tell Bokeh to output plots in the notebook\n", "from bokeh.io import output_notebook\n", "output_notebook()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "# Define the directory where the Evaluation will be created\n", "test_eval_dir = Path(Path().home(), \"temp\", \"04_setup_real_example\")\n", "shutil.rmtree(test_eval_dir, ignore_errors=True)\n", "\n", "# Create an Evaluation object and create the directory\n", "ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)\n", "\n", "# Clone the template\n", "ev.clone_template()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the template cloned, Lets fetch `locations` and `location_crosswalks` files from the repository. When setting up a brand new evaluation you may need to develop this information yourself, however, as we will show in some subsequent examples, the TEEHR team has several baseline datasets that can be used as a starting point. For this example, we will download some small files from the repository to use." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "location_data_path = Path(test_eval_dir, \"two_locations.parquet\")\n", "two_locations_data.fetch_file(\"two_locations.parquet\", location_data_path)\n", "\n", "crosswalk_data_path = Path(test_eval_dir, \"two_crosswalks.parquet\")\n", "two_locations_data.fetch_file(\"two_crosswalks.parquet\", crosswalk_data_path)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Location Data\n", "As we have done in previous examples, lets open the spatial data file and examine the contents before loading it into the TEEHR dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gpd.read_parquet(location_data_path)\n", "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the file contains 2 USGS gages located in Oregon. Lets load them into the TEEHR dataset using the `load_spatial()` method. Note that the `id` column contains two hyphen separated values. To the left of the hyphen we have a a set of characters the indicate the \"source\" while to the right we have the identifier for that source. While not strictly necessary, this is standard used within TEEHR. This is also necessary when using the fetching functionality to fetch data from remote sources, as demonstrated below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-ouput" ] }, "outputs": [], "source": [ "ev.locations.load_spatial(location_data_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, as we have done in previous examples, lets query the `locations` table as a GeoPandas GeoDataFrame and then plot the gages on a map using the TEEHR plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "locations_gdf = ev.locations.to_geopandas()\n", "locations_gdf.teehr.location_map()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### USGS Primary Timeseries\n", "In previous examples we loaded the primary timeseries from files that we already had. In the following cells we will utilize the `ev.fetching.usgs_streamflow()` method to fetch streamflow data from NWIS. This data is automatically formatted and stored in the TEEHR dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.fetch.usgs_streamflow(\n", " start_date=\"2020-01-01\",\n", " end_date=\"2020-12-31\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, now that we have fetched the USGS gage data and stored it in the TEEHR dataset as `primary_timeseries`, we will query the `primary_timeseries` and plot the timeseries data using the teehr.timeseries_plot()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pt_df = ev. primary_timeseries.to_pandas()\n", "pt_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pt_df.teehr.timeseries_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Location Crosswalk\n", "As we saw above the IDs used for the location data in the `locations` table reference the USGS gages (e.g., usgs-14138800). Before we load any secondary timeseries into the TEEHR dataset, we first need to load location crosswalk data into the `location_crosswalks` table to relate the `primary_location_id` values to the `secondary_location_id` values. In this case because we are going to be fetching NWM v3.0 data, we need to insert crosswalk that the relates the USGS gage IDs to the NWM IDs. We downloaded this data from the repository. As with other datasets, first we will examine the data and the load it into the TEEHR dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "location_crosswalk_path = Path(test_eval_dir, \"two_crosswalks.parquet\")\n", "crosswalk_df = pd.read_parquet(location_crosswalk_path)\n", "crosswalk_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.location_crosswalks.load_parquet(\n", " location_crosswalk_path\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.location_crosswalks.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NWM v3.0 Secondary Timeseries\n", "Now that we have the USGS to NWM v3.0 crosswalk data loaded into the TEEHR dataset, we can fetch the NWM retrospective data into the TEEHR dataset as `secondary_timeseries`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.fetch.nwm_retrospective_points(\n", " nwm_version=\"nwm30\",\n", " variable_name=\"streamflow\",\n", " start_date=\"2020-01-01\",\n", " end_date=\"2020-12-31\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did with the `primary_timeseries`, once the data is in the TEEHR dataset, we can query the timeseries data and visualize it as a table and then as a plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "st_df = ev.secondary_timeseries.to_pandas()\n", "st_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "st_df.teehr.timeseries_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joined Timeseries\n", "Like we did in previous examples, once we have the TEEHR dataset tables populated, we can create the `joined_timeseries` view and populate the `joined_timeseries` table. By default the method joins the `primary_timeseries` to the `secondary_timeseries` and also joins the `location_attributes` but the user can control wether the `user_defined_functions.py` script is executed. In this case, we do." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.joined_timeseries.create(execute_udf=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have created the `joined_timeseries` table, lets take a look at what it contains." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jt_df = ev.joined_timeseries.to_pandas()\n", "jt_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev.spark.stop()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That concludes `04 Setting-up a Real Example`. In this example we loaded some location data that we had (downloaded from the repository), inspected it and loaded it into the TEEHR dataset, and then fetched USGS and NWM v3.0 data from NWIS and AWS, respectively. Finally we created the `joined_timeseries` table that serves at the main data table for generating metrics and conducting an evaluation.\n", "\n", "In the next lesson, we will clone a Evaluation dataset from S3 and run some metrics on it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 2 }