The Evaluation#
Creating an Evaluation is the first step to working with TEEHR. An Evaluation represents both a Python class and a directory structure that contains your data, configuration, and analysis results.
Evaluation Classes#
TEEHR provides several Evaluation classes for different use cases:
Class |
Description |
|---|---|
Full read-write access to a local evaluation directory. Use this when working
locally or managing your own data. |
|
Read-only access to a remote TEEHR catalog. Use this for querying data in TEEHR-Hub without local storage. |
|
Read-write access to a remote TEEHR catalog. This will primarily be used by remote catalog administrators. Requires appropriate AWS credentials. |
.
Creating a Local Evaluation#
If you are working with TEEHR locally and or want to create your own evaluation dataset, the most common way to work with TEEHR is to create a local evaluation:
import teehr
# Create a new evaluation (creates directory if it doesn't exist)
ev = teehr.LocalReadWriteEvaluation(
dir_path="/path/to/my_evaluation",
create_dir=True
)
# Or open an existing evaluation
ev = teehr.LocalReadWriteEvaluation(dir_path="/path/to/existing_evaluation")
The Evaluation class is an alias for LocalReadWriteEvaluation. Both work identically
but you should use LocalReadWriteEvaluation moving forward since Evaluation will be deprecated
in a future release.:
# These are equivalent
ev = teehr.Evaluation(dir_path="./my_eval", create_dir=True)
ev = teehr.LocalReadWriteEvaluation(dir_path="./my_eval", create_dir=True)
Parameters#
dir_pathstr or PathThe directory path for the evaluation. This is where the Iceberg warehouse and cache directories will be created.
create_dirbool, default FalseWhether to create the directory if it doesn’t exist. Set to
Truewhen creating a new evaluation.check_evaluation_versionbool, default TrueWhether to verify the evaluation version is compatible with the current TEEHR version. Set to
Falseto bypass version checking (use with caution).sparkSparkSession, optionalAn existing SparkSession to use. If not provided, TEEHR creates a new session with default configuration.
Directory Structure#
When you create an evaluation, TEEHR sets up the following directory structure:
my_evaluation/
└── local/ # Iceberg warehouse directory
├── teehr/ # Namespace containing tables
│ ├── attributes/
│ ├── configurations/
│ ├── locations/
│ ├── location_attributes/
│ ├── location_crosswalks/
│ ├── primary_timeseries/
│ ├── secondary_timeseries/
│ ├── units/
│ └── variables/
├── cache/ # Temporary files
├── schema_evolution/ # Tracks schema migrations
├── local_catalog.db # Iceberg catalog (SQLite)
└── version # Version file
Remote Evaluations (TEEHR-HUB)#
When running in the TEEHR-HUB environment, you can access remote catalogs directly to access the data that is in the TEEHR warehouse without needing to download it. This evaluation type is only available in TEEHR-HUB where the necessary permissions and configurations are in place.
Read-Only Access#
import teehr
# Read-only access to remote catalog
ev = teehr.RemoteReadOnlyEvaluation()
# Query data without local storage
ev.primary_timeseries.filter("location_id = 'usgs-02424000'").to_sdf().show()
This evaluation provides read-only access to the TEEHR Data Warehouse. It is ideal for users who want to query and analyze data in TEEHR-HUB without needing to manage local storage. It does not require AWS credentials when used in TEEHR-HUB since it only accesses the remote catalog in read-only mode. For users who need temporary storage for caching and logging, it creates a temporary directory and sets the active catalog to remote.
Read-Write Access#
import teehr
# Read-write access (requires AWS credentials with write permissions)
ev = teehr.RemoteReadWriteEvaluation()
Note
This evaluation creates a read-write remote evaluation. This is primarily intended for TEEHR-HUB
administrators who need to manage the remote catalogs. It requires appropriate AWS credentials with
write permissions to the remote catalogs. Use with caution, as changes will affect all users accessing
the remote catalogs. For typical TEEHR-HUB users, the RemoteReadOnlyEvaluation is recommended to
safely access data without risking unintended modifications. For local access to TEEHR warehouse data,
use the Download class with a local evaluation.
Upgrading Existing Evaluations#
If you have an evaluation created with an earlier version of TEEHR (pre-v0.6), you’ll need to migrate it to the new Iceberg-based format.
When opening an older evaluation, TEEHR will detect the version mismatch and provide instructions for migrating your data to the new format. The migration process involves converting your existing evaluation in-place to the new structure. You can use the provided utility function to perform the conversion.:
ev = teehr.LocalReadWriteEvaluation(dir_path="./old_evaluation")
# Raises ValueError with migration instructions
To migrate:
from teehr.utilities.convert_to_iceberg import convert_evaluation
# Convert the evaluation in-place
convert_evaluation("/path/to/old_evaluation")
# Now you can open it
ev = teehr.LocalReadWriteEvaluation(dir_path="/path/to/old_evaluation")
Note this will leave your existing data intact and create the necessary Iceberg t ables and structure. However, it’s always a good idea to back up your data before performing the conversion, just in case. Once you are convinced the new format is working well for you, you can delete the backup of the old evaluation and the old version files in the evaluation directory.
Apache Spark Configuration#
TEEHR uses Apache Spark with Apache Iceberg for scalable data processing. By default, TEEHR automatically creates an optimized Spark session for local usage on your machine or within TEEHR-HUB.
Default Configuration#
The default configuration:
Runs in local mode using all available cores
Sets driver memory to 75% of available system memory
Configures Iceberg catalogs (local JDBC + remote REST)
Includes required packages (Iceberg, Sedona for geospatial, AWS S3)
Custom Spark Session#
For advanced use cases, you can create a custom Spark session that will allow your analyics
to scale beyond your local machine or to customize Spark settings. You can create a Spark
session using the create_spark_session utility function and pass it to your evaluation:
from teehr.evaluation.spark_session_utils import create_spark_session
import teehr
# Create a custom Spark session
spark = create_spark_session(
app_name="My TEEHR Analysis",
driver_memory="16g",
driver_max_result_size="8g"
)
# Use it with your evaluation
ev = teehr.LocalReadWriteEvaluation(
dir_path="./my_eval",
create_dir=True,
spark=spark
)
Spark Cluster Mode (Kubernetes)#
In TEEHR-HUB or other Kubernetes environments, you can start executor pods:
spark = create_spark_session(
start_spark_cluster=True,
executor_instances=4,
executor_memory="4g",
executor_cores=2
)
Note
When running in cluster mode, the Spark UI will be accessible through the TEEHR-HUB interface, and logs will be written to the temporary evaluation directory for easier debugging.
Warning
It is very important to stop your Spark session when running in cluster mode to avoid
leaving orphaned executor pods running in your Kubernetes cluster. Always call spark.stop()
when you are done with your analysis, especially in scripts or notebooks that may be left open.
Logging and Debugging#
Enable logging to troubleshoot issues:
ev = teehr.LocalReadWriteEvaluation(dir_path="./my_eval")
ev.enable_logging() # Writes to teehr.log in evaluation directory
# View current Spark configuration
ev.log_spark_config()
Cleaning Up#
Clean temporary cache files:
ev.clean_cache()
Always stop the Spark session when done (especially in scripts):
ev.spark.stop()
Common Operations#
Once you have an evaluation, you can access its components. The following are common operations you can perform with an Evaluation instance:
import teehr
ev = teehr.LocalReadWriteEvaluation(dir_path="./my_eval", create_dir=True)
# Access tables
ev.locations # LocationTable
ev.primary_timeseries # PrimaryTimeseriesTable
ev.secondary_timeseries # SecondaryTimeseriesTable
# ... and more
# Access views
ev.joined_timeseries_view() # Computed join of primary/secondary
# Access I/O components
ev.fetch # Fetch from external sources (USGS, NWM)
ev.download # Download from TEEHR warehouse
ev.load # Load local files
ev.write # Write query results
# Run SQL queries directly
df = ev.sql("SELECT * FROM primary_timeseries LIMIT 10")
# List available tables
ev.list_tables()
# Drop a user-created table
ev.drop_table("my_saved_results")