Introduction to the Schema#

Overview#

TEEHR enables conducting an exploratory evaluation through the concept of an Evaluation. In this context an Evaluation is a class in Python and a directory with a defined set of subdirectories to hold all the relevant data that work together to make up an Evaluation, while the Evaluation class provides users with methods to interact with the data that is in the Evaluation directory. These interactions include fetching data from external sources, loading data that you already have, and querying and visualizing data that is in the Evaluation.

In this workbook we will walk though the structure of an Evaluation directory by using the Evaluation class to create a new Evaluation and explore its contents, then we will move on to the dataset Schema.

First we will import the TEEHR class and use it to create a new Evaluation in the user’s Home directory in a temp folder. If the directory path you provided doesn’t exist it will be created because we passed create_dir=True. We will dive into the Evaluation class methods later, but for now we are just going to use it to create the Evaluation directory structure including some standard domain tables (i.e., units, variables, etc.).

Note, if you want to follow along you will need to either have access to TEEHR-HUB or create a virtual environment and install TEEHR locally as described in the Getting Started section of the TEEHR documentation.

For now we will just run the following command to setup an Evaluation directory so we can explore the directory structure and schema. We will dig into what is actually being done in later notebooks.

import teehr
from teehr.evaluation.utils import print_tree
import shutil
from pathlib import Path

# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path().home(), "temp", "01_introduction_schema")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

# Clone the evaluation template
ev.clone_template()
Hide code cell output
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/19 18:50:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Directory Structure#

To begin exploring the TEEHR directory structure lets use the tree command to print out the directory contents. We will only go one level deep for now (-L 1).

Note, if you don’t have tree installed and don’t want to install it, you can uncomment the two lines that are commented and to use a Python function roughly does the same thing, but not as cleanly.

print_tree(ev.dir_path, max_depth=1)
├── scripts
│   ├── __init__.py
│   └── user_defined_fields.py
├── .gitignore
├── dataset
│   ├── configurations
│   ├── primary_timeseries
│   ├── locations
│   ├── joined_timeseries
│   ├── units
│   ├── location_attributes
│   ├── location_crosswalks
│   ├── secondary_timeseries
│   ├── attributes
│   └── variables
├── __init__.py
├── cache
│   └── readme.md
└── readme.md

As shown, the TEEHR directory structure contains the following subdirectories:

  • cache - the cache directory is used internally by TEEHR to store temporary files during processing of fetching and loading methods. The user should not need to interact with this directory except if troubleshooting an issue.

  • dataset - the files and data within the dataset directory make up the TEEHR dataset. Each subfolder within the dataset directory can be thought of as a database table and contains one or more files that contain the data for that table. All the files within a “table folder” conform to the specified schema and file format for that table. We will explore this more next in the schema section.

  • scripts - the scripts directory contains processing scripts that are specific to a given evaluation. These are typically data setup and preprocessing scripts, not the evaluation code used to explore the data or calculate metrics.

In addition to the default directories, there is a readme.md file that contains some basic information about the directory contents and a teehr.log file where TEEHR writes log messages if logging as been enabled.

Dataset Schema#

Now lets explore the dataset subdirectory in the TEEHR Evaluation by going a level deeper into the dataset directory.

print_tree(ev.dataset_dir, max_depth=0)
├── configurations
├── primary_timeseries
├── locations
├── joined_timeseries
├── units
├── location_attributes
├── location_crosswalks
├── secondary_timeseries
├── attributes
└── variables

As shown, there are 9 tables in the standard TEEHR dataset that users can enter data into and one table (joined_timeseries) that acts as a materialized view to speed up analytic queries, for a total of 10 tables. They fall into three categories, domain tables, location data tables, and timeseries tables. The domain tables are populated with values automatically and the data is stored in plain text as CSV files so that they are more easily edited by the user. These tables tend to have a relatively small amount of data compared to say the timeseries and location data tables which tend to contain much more data and utilize the Apache Parquet format. This is managed for the user, but is provided here for reference.

Domain Tables#

The domain tables contain “lookup” data that is referenced in the data and location tables. These tables serve as a way to keep the data consistent across an Evaluation.

  • units - the units table contains the valid measurement units that can be referenced in other tables, such as the timeseries table (e.g., ft^3/s or mm/hr).

  • variables - the variables table contains the valid variables that can be referenced in other tables, such as the timeseries table. These take the format of “variable_interval_statistic” (e.g., streamflow_hourly_inst or streamflow_daily_mean, etc.)

  • configurations - the configurations table contains the valid configurations that can be referenced in other tables, such as the timeseries table. Configurations are a way to label collections of timeseries or simulation outputs (e.g., nwm30_retrospective, nwm22_medium_range_mem1, etc.)

  • attributes - the attributes table contains the valid attributes that can be referenced in other tables, such as the location_attribute table.

Location Data Tables#

The location data tables store the spatial data related to the observations and simulated data as well as location attributes and location crosswalk data.

  • locations - the locations table is a spatial data table that stores data associated with the “where” of the data. For example, this could be gage locations, basin boundaries, river reaches, etc.

  • location_attributes - the location_attributes table stores attribute data per location. This could be any location specific data such as drainage area, ecoregion, state, river forecast center, etc.

  • location_crosswalks - the location_crosswalks table stores crosswalk data needed to relate a primary location ID (e.g., a USGS gage ID) to the corresponding secondary location ID (e.g., NWM ID).

Timeseries Tables#

  • primary_timeseries - the primary_timeseries table stores the primary timeseries data (“observations”) used in the evaluation. This represents the “truth” or the data that is considered correct when evaluating simulations.

  • secondary_timeseries - the secondary_timeseries table stores the secondary timeseries data (“model simulations”) that is used in the evaluation.

  • joined_timeseries - the joined_timeseries table is effectively a materialized view that joins the primary_timeseries to the secondary_timeseries as well as location_attributes and other user defined fields and is stored to speed up analytic queries.

Schema Diagram#

The following diagram shows the Domain Tables, Location Data Tables and the Timeseries Tables (i.e., table groups) associated with a TEEHR Evaluation. The table groups are delineated with labeled dashed lines. The tables, relationships (lines) and the “foreign key” references are color coded to help explain the relationships. Foreign key is in quotes because these relationships represent something like a foreign key without actually being a foreign key as the data storage format used does not support foreign keys. The “foreign keys” are enforced when data is loaded, but not beyond that.

https://github.com/RTIInternational/teehr/raw/refs/heads/main/docs/images/getting_started/data_model.drawio.svg?raw=true
ev.spark.stop()

Conclusion#

That concludes Introduction to the Schema. In the next lesson we will cover Loading Local Data.