RemoteReadWriteEvaluation#

class RemoteReadWriteEvaluation(spark: SparkSession | None = None, temp_dir_path: Path | str | None = None)[source]#

A read-write Evaluation class for access to remote catalogs.

This class provides a convenient way to access a remote TEEHR catalog without needing to manage local directories. It automatically creates a temporary directory and sets the active catalog to the remote catalog.

Note: This is intended for read-write access to remote data. Write operations to the remote catalog are supported through this class; however, an AWS profile with write permissions is required in the Spark session.

Currently, only users in the TEEHR-Hub environment have access to the remote catalog, so this class is intended for use within that environment until remote access becomes more broadly available.

Methods

drop_table

Drop a user-created table from the catalog.

enable_logging

Enable logging.

joined_timeseries_view

Create a computed view that joins primary and secondary timeseries.

list_tables

List the tables in the catalog, returning a pandas DataFrame.

list_views

List the views in the catalog, returning a pandas DataFrame.

location_attributes_view

Create a computed view of pivoted location attributes.

log_spark_config

Log the current Spark session configuration.

primary_timeseries_view

Create a computed view of primary timeseries with optional attrs.

secondary_timeseries_view

Create a computed view of secondary timeseries with crosswalk.

sql

Execute a SQL query using the active catalog and namespace.

table

Get a table instance by name.

Attributes

active_catalog

Alias for catalog property (backwards compatibility).

attributes

Access the attributes table.

catalog

The remote catalog for this evaluation.

configurations

Access the configurations table.

download

The download component class for managing data downloads.

extract

The extract component class for extracting data.

fetch

The fetch component class for accessing external data.

generate

The generate component class for generating synthetic data.

load

The load component class for loading data.

location_attributes

Access the location attributes table.

location_crosswalks

Access the location crosswalks table.

locations

Access the locations table.

metrics

The metrics component class for calculating performance metrics.

primary_timeseries

Access the primary timeseries table.

read

The read component class for reading data.

remote_catalog

Alias for catalog property (backwards compatibility).

secondary_timeseries

Access the secondary timeseries table.

units

Access the units table.

validate

The validate component class for validating data.

variables

Access the variables table.

write

The write component class for writing data.

property active_catalog#

Alias for catalog property (backwards compatibility).

Deprecated: Use the catalog property instead. This alias will be removed in a future version.

Returns:

LocalCatalog or RemoteCatalog – The catalog configuration for this evaluation.

property attributes: AttributeTable#

Access the attributes table.

property catalog#

The remote catalog for this evaluation.

Returns:

RemoteCatalog – The remote catalog configuration.

property configurations: ConfigurationTable#

Access the configurations table.

property download: Download#

The download component class for managing data downloads.

drop_table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#

Drop a user-created table from the catalog.

Only non-core tables (user-created tables, materialized views, saved query results) can be dropped. Attempting to drop a core table (e.g., primary_timeseries, locations, units) will raise a ValueError.

Parameters:
  • table_name (str) – The name of the table to drop.

  • namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.

  • catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.

Raises:

ValueError – If the table is a core TEEHR table.

Examples

Write and then drop a user-created table:

>>> ev.joined_timeseries_view().write("my_results")
>>> ev.drop_table("my_results")
enable_logging()#

Enable logging.

property extract: Extract#

The extract component class for extracting data.

property fetch: Fetch#

The fetch component class for accessing external data.

property generate: GeneratedTimeseries#

The generate component class for generating synthetic data.

joined_timeseries_view(primary_filters: str | dict | List[str | dict] | None = None, secondary_filters: str | dict | List[str | dict] | None = None, add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) JoinedTimeseriesView#

Create a computed view that joins primary and secondary timeseries.

This returns a lazy view that computes the join on the fly when accessed. The view can be filtered, transformed, and optionally materialized to an Iceberg table via write().

Parameters:
  • primary_filters (Union[str, dict, List[...]], optional) – Filters to apply to primary timeseries before joining.

  • secondary_filters (Union[str, dict, List[...]], optional) – Filters to apply to secondary timeseries before joining.

  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add (if add_attrs=True).

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

JoinedTimeseriesView – A lazy view of the joined timeseries.

Examples

Create different join views:

>>> winter = ev.joined_timeseries_view(primary_filters=["month IN (12, 1, 2)"])
>>> summer = ev.joined_timeseries_view(primary_filters=["month IN (6, 7, 8)"])

Use directly (computes on the fly):

>>> ev.joined_timeseries_view().to_pandas()

Chain operations:

>>> ev.joined_timeseries_view().filter("primary_location_id LIKE 'usgs%'").to_pandas()

Compute metrics and materialize:

>>> ev.joined_timeseries_view().aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")

Materialize joined data:

>>> ev.joined_timeseries_view(add_attrs=True).write("joined_timeseries")

Read from a remote catalog and namespace:

>>> ev.joined_timeseries_view(
...     catalog_name="some_catalog",
...     namespace_name="some_namespace"
... ).aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
list_tables(catalog_name: str | None = None, namespace: str | None = None) DataFrame#

List the tables in the catalog, returning a pandas DataFrame.

Parameters:
  • catalog_name (str, optional) – The catalog to list tables from. If None, the catalog_name of the active catalog is used.

  • namespace (str, optional) – The namespace to list tables from. If None, the namespace_name of the active catalog is used.

list_views() DataFrame#

List the views in the catalog, returning a pandas DataFrame.

property load: Load#

The load component class for loading data.

property location_attributes: LocationAttributeTable#

Access the location attributes table.

location_attributes_view(attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) LocationAttributesView#

Create a computed view of pivoted location attributes.

Transforms the location_attributes table from long format (location_id, attribute_name, value) to wide format where each attribute becomes a column.

Parameters:
  • attr_list (List[str], optional) – Specific attributes to include. If None, includes all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

LocationAttributesView – A lazy view of the pivoted attributes.

Examples

Pivot all attributes:

>>> ev.location_attributes_view().to_pandas()

Pivot specific attributes:

>>> ev.location_attributes_view(
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()

With filters (chained):

>>> ev.location_attributes_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()

Materialize for later use:

>>> ev.location_attributes_view().write("pivoted_attrs")
property location_crosswalks: LocationCrosswalkTable#

Access the location crosswalks table.

property locations: LocationTable#

Access the locations table.

log_spark_config()#

Log the current Spark session configuration.

property metrics: Metrics#

The metrics component class for calculating performance metrics.

Deprecated since version 0.6.0: The metrics property is deprecated and will be removed in a future version. Use the aggregate method on the table directly with the metrics argument instead. For example:

ev.table("joined_timeseries").aggregate(
    metrics=[...],
    group_by=[...]
)
property primary_timeseries: PrimaryTimeseriesTable#

Access the primary timeseries table.

primary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) PrimaryTimeseriesView#

Create a computed view of primary timeseries with optional attrs.

Parameters:
  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

PrimaryTimeseriesView – A lazy view of the primary timeseries.

Examples

Basic usage:

>>> ev.primary_timeseries_view().to_pandas()

With filters (chained):

>>> ev.primary_timeseries_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()

With location attributes:

>>> ev.primary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
property read: Read#

The read component class for reading data.

property remote_catalog#

Alias for catalog property (backwards compatibility).

Deprecated: Use the catalog property instead.

Returns:

RemoteCatalog – The remote catalog configuration.

property secondary_timeseries: SecondaryTimeseriesTable#

Access the secondary timeseries table.

secondary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) SecondaryTimeseriesView#

Create a computed view of secondary timeseries with crosswalk.

Joins secondary timeseries with location_crosswalks to add primary_location_id, and optionally joins location attributes.

Parameters:
  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

SecondaryTimeseriesView – A lazy view of the secondary timeseries with primary_location_id.

Examples

Basic usage (adds primary_location_id via crosswalk):

>>> ev.secondary_timeseries_view().to_pandas()

With filters (chained):

>>> ev.secondary_timeseries_view().filter(
...     "configuration_name = 'nwm30_retrospective'"
... ).to_pandas()

With location attributes:

>>> ev.secondary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
sql(query: str)#

Execute a SQL query using the active catalog and namespace.

This is a thin wrapper around spark.sql() that automatically sets the active catalog and namespace so the user does not have to qualify table names in their queries.

Parameters:

query (str) – The SQL query to execute. Table names can be unqualified (e.g. primary_timeseries) or partially qualified (e.g. teehr.primary_timeseries). The active catalog (ev.active_catalog.catalog_name) and active namespace (ev.active_catalog.namespace_name) are set automatically before the query runs.

Returns:

pyspark.sql.DataFrame – The result of the SQL query as a Spark DataFrame.

Examples

Query a table without specifying the catalog or namespace:

>>> df = ev.sql("SELECT * FROM primary_timeseries LIMIT 10")
>>> df.toPandas()

Use aggregate functions:

>>> df = ev.sql(
...     "SELECT location_id, COUNT(*) as n "
...     "FROM primary_timeseries GROUP BY location_id"
... )
table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#

Get a table instance by name.

This is a factory method that returns the appropriate table class for the given table name. For known table names (like ‘primary_timeseries’), returns the specialized table class. For unknown names, returns a generic BaseTable instance.

Parameters:
  • table_name (str) – The name of the table to access.

  • namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.

  • catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.

Returns:

BaseTable – The appropriate table instance.

Examples

>>> # Access a known table
>>> ev.table("primary_timeseries").aggregate(...)
>>> # Access a custom/user-defined table
>>> ev.table("my_custom_table").to_pandas()
property units: UnitTable#

Access the units table.

property validate: Validate#

The validate component class for validating data.

property variables: VariableTable#

Access the variables table.

property write: Write#

The write component class for writing data.