RemoteReadOnlyEvaluation#
- class RemoteReadOnlyEvaluation(spark: SparkSession | None = None, temp_dir_path: Path | str | None = None)[source]#
A read-only Evaluation class for accessing remote catalogs.
This class provides a convenient way to access a remote TEEHR catalog without needing to manage local directories. It automatically creates a temporary directory and sets the active catalog to remote.
Note: This is intended for read-only access to remote data. Write operations to the remote catalog are not supported through this class.
Currently only users in the TEEHR-Hub environment have access to the remote catalog, so this class is intended for use within that environment until remote access is more broadly available.
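The temporary-directory bookkeeping described above can be sketched with the standard library. This is a minimal illustration only; `ReadOnlySession` is a hypothetical stand-in, not TEEHR's actual implementation:

```python
import tempfile
from pathlib import Path

class ReadOnlySession:
    """Hypothetical sketch: create a throwaway working directory
    when no temp_dir_path is supplied, mirroring the behavior
    described for RemoteReadOnlyEvaluation."""

    def __init__(self, temp_dir_path=None):
        if temp_dir_path is None:
            # Auto-create a temporary directory for the session.
            self._tmp = tempfile.TemporaryDirectory()
            self.temp_dir_path = Path(self._tmp.name)
        else:
            self._tmp = None
            self.temp_dir_path = Path(temp_dir_path)

session = ReadOnlySession()
print(session.temp_dir_path.exists())  # True
```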
Methods

drop_table(table_name[, namespace_name, catalog_name]) – Drop a user-created table from the catalog.
enable_logging() – Enable logging.
joined_timeseries_view([primary_filters, ...]) – Create a computed view that joins primary and secondary timeseries.
list_tables([catalog_name, namespace]) – List the tables in the catalog returning a Pandas DataFrame.
list_views() – List the views in the catalog returning a Pandas DataFrame.
location_attributes_view([attr_list, ...]) – Create a computed view of pivoted location attributes.
log_spark_config() – Log the current Spark session configuration.
primary_timeseries_view([add_attrs, attr_list, ...]) – Create a computed view of primary timeseries with optional attrs.
secondary_timeseries_view([add_attrs, attr_list, ...]) – Create a computed view of secondary timeseries with crosswalk.
sql(query) – Execute a SQL query using the active catalog and namespace.
table(table_name[, namespace_name, catalog_name]) – Get a table instance by name.
Attributes

active_catalog – Alias for catalog property (backwards compatibility).
attributes – Access the attributes table.
catalog – The remote catalog for this evaluation.
configurations – Access the configurations table.
download – The download component class for managing data downloads.
extract – The extract component class for extracting data.
fetch – The fetch component class for accessing external data.
generate – The generate component class for generating synthetic data.
load – The load component class for loading data.
location_attributes – Access the location attributes table.
location_crosswalks – Access the location crosswalks table.
locations – Access the locations table.
metrics – The metrics component class for calculating performance metrics.
primary_timeseries – Access the primary timeseries table.
read – The read component class for reading data.
remote_catalog – Alias for catalog property (backwards compatibility).
secondary_timeseries – Access the secondary timeseries table.
units – Access the units table.
validate – The validate component class for validating data.
variables – Access the variables table.
write – The write component class for writing data.
- property active_catalog#
Alias for catalog property (backwards compatibility).
Deprecated: Use the catalog property instead. This alias will be removed in a future version.
- Returns:
LocalCatalog or RemoteCatalog – The catalog configuration for this evaluation.
- property attributes: AttributeTable#
Access the attributes table.
- property catalog#
The remote catalog for this evaluation.
- Returns:
RemoteCatalog – The remote catalog configuration.
- property configurations: ConfigurationTable#
Access the configurations table.
- drop_table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#
Drop a user-created table from the catalog.
Only non-core tables (user-created tables, materialized views, saved query results) can be dropped. Attempting to drop a core table (e.g., primary_timeseries, locations, units) will raise a ValueError.
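The core-table protection can be sketched as follows. This is a hypothetical illustration; the actual set of protected core tables is defined by TEEHR itself:

```python
# Illustrative subset of core table names (the real list lives in TEEHR).
CORE_TABLES = {"primary_timeseries", "secondary_timeseries", "locations", "units"}

def drop_table(table_name: str) -> str:
    """Sketch of the guard: refuse to drop core tables, otherwise
    return the SQL statement that would be executed."""
    if table_name in CORE_TABLES:
        raise ValueError(f"Cannot drop core TEEHR table: {table_name}")
    return f"DROP TABLE {table_name}"

print(drop_table("my_results"))  # DROP TABLE my_results
```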
- Parameters:
table_name (str) – The name of the table to drop.
namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.
catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.
- Raises:
ValueError – If the table is a core TEEHR table.
Examples
Write and then drop a user-created table:
>>> ev.joined_timeseries_view().write("my_results")
>>> ev.drop_table("my_results")
- enable_logging()#
Enable logging.
- property generate: GeneratedTimeseries#
The generate component class for generating synthetic data.
- joined_timeseries_view(primary_filters: str | dict | List[str | dict] | None = None, secondary_filters: str | dict | List[str | dict] | None = None, add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → JoinedTimeseriesView#
Create a computed view that joins primary and secondary timeseries.
This returns a lazy view that computes the join on-the-fly when accessed. The view can be filtered, transformed, and optionally materialized to an iceberg table via write().
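Conceptually, the join pairs primary (observed) and secondary (simulated) values on location and time. A minimal pandas sketch of the idea, using illustrative column names rather than TEEHR's exact schema:

```python
import pandas as pd

# Illustrative primary (observed) timeseries.
primary = pd.DataFrame({
    "location_id": ["usgs-01", "usgs-01"],
    "value_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "primary_value": [1.0, 2.0],
})
# Illustrative secondary (simulated) timeseries, already keyed by
# primary_location_id.
secondary = pd.DataFrame({
    "primary_location_id": ["usgs-01", "usgs-01"],
    "value_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "secondary_value": [1.1, 1.9],
})
# Pair observed and simulated values on location and time.
joined = primary.merge(
    secondary,
    left_on=["location_id", "value_time"],
    right_on=["primary_location_id", "value_time"],
)
print(len(joined))  # 2
```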
- Parameters:
primary_filters (Union[str, dict, List[...]], optional) – Filters to apply to primary timeseries before joining.
secondary_filters (Union[str, dict, List[...]], optional) – Filters to apply to secondary timeseries before joining.
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add (if add_attrs=True).
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
JoinedTimeseriesView – A lazy view of the joined timeseries.
Examples
Create different join views:
>>> winter = ev.joined_timeseries_view(primary_filters=["month IN (12, 1, 2)"])
>>> summer = ev.joined_timeseries_view(primary_filters=["month IN (6, 7, 8)"])
Use directly (computes on-the-fly):
>>> ev.joined_timeseries_view().to_pandas()
Chain operations:
>>> ev.joined_timeseries_view().filter("primary_location_id LIKE 'usgs%'").to_pandas()
Compute metrics and materialize:
>>> ev.joined_timeseries_view().aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
Materialize joined data:
>>> ev.joined_timeseries_view(add_attrs=True).write("joined_timeseries")
Read from a remote catalog and namespace:
>>> ev.joined_timeseries_view(
...     catalog_name="some_catalog",
...     namespace_name="some_namespace"
... ).aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
- list_tables(catalog_name: str | None = None, namespace: str | None = None) → DataFrame#
List the tables in the catalog returning a Pandas DataFrame.
- Parameters:
catalog_name (str, optional) – The catalog name to list tables from, by default None, which means the catalog_name of the active catalog is used.
namespace (str, optional) – The namespace name to list tables from, by default None, which means the namespace_name of the active catalog is used.
- list_views() → DataFrame#
List the views in the catalog returning a Pandas DataFrame.
- property location_attributes: LocationAttributeTable#
Access the location attributes table.
- location_attributes_view(attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → LocationAttributesView#
Create a computed view of pivoted location attributes.
Transforms the location_attributes table from long format (location_id, attribute_name, value) to wide format where each attribute becomes a column.
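The long-to-wide transformation described above is equivalent to a pandas pivot. A minimal sketch with illustrative data (not TEEHR's exact schema):

```python
import pandas as pd

# Long format: one row per (location, attribute) pair.
long_df = pd.DataFrame({
    "location_id": ["usgs-01", "usgs-01", "usgs-02", "usgs-02"],
    "attribute_name": ["drainage_area", "ecoregion"] * 2,
    "value": ["10.5", "plains", "22.0", "mountains"],
})
# Wide format: one row per location, one column per attribute.
wide_df = long_df.pivot(
    index="location_id", columns="attribute_name", values="value"
).reset_index()
print(wide_df.shape)  # (2, 3)
```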
- Parameters:
attr_list (List[str], optional) – Specific attributes to include. If None, includes all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
LocationAttributesView – A lazy view of the pivoted attributes.
Examples
Pivot all attributes:
>>> ev.location_attributes_view().to_pandas()
Pivot specific attributes:
>>> ev.location_attributes_view(
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
With filters (chained):
>>> ev.location_attributes_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()
Materialize for later use:
>>> ev.location_attributes_view().write("pivoted_attrs")
- property location_crosswalks: LocationCrosswalkTable#
Access the location crosswalks table.
- property locations: LocationTable#
Access the locations table.
- log_spark_config()#
Log the current Spark session configuration.
- property metrics: Metrics#
The metrics component class for calculating performance metrics.
Deprecated since version 0.6.0: The metrics property is deprecated and will be removed in a future version. Use the aggregate method on the table directly with the metrics argument instead. For example:

ev.table("joined_timeseries").aggregate(
    metrics=[...],
    group_by=[...]
)
- property primary_timeseries: PrimaryTimeseriesTable#
Access the primary timeseries table.
- primary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → PrimaryTimeseriesView#
Create a computed view of primary timeseries with optional attrs.
- Parameters:
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
PrimaryTimeseriesView – A lazy view of the primary timeseries.
Examples
Basic usage:
>>> ev.primary_timeseries_view().to_pandas()
With filters (chained):
>>> ev.primary_timeseries_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()
With location attributes:
>>> ev.primary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
- property remote_catalog#
Alias for catalog property (backwards compatibility).
Deprecated: Use the catalog property instead.
- Returns:
RemoteCatalog – The remote catalog configuration.
- property secondary_timeseries: SecondaryTimeseriesTable#
Access the secondary timeseries table.
- secondary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → SecondaryTimeseriesView#
Create a computed view of secondary timeseries with crosswalk.
Joins secondary timeseries with location_crosswalks to add primary_location_id, and optionally joins location attributes.
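The crosswalk join described above can be sketched in pandas. Column names here are illustrative, not TEEHR's exact schema:

```python
import pandas as pd

# Illustrative secondary timeseries keyed by its own location ids.
secondary = pd.DataFrame({
    "location_id": ["nwm-1001", "nwm-1002"],
    "value": [3.2, 4.1],
})
# Crosswalk mapping secondary location ids to primary location ids.
crosswalk = pd.DataFrame({
    "secondary_location_id": ["nwm-1001", "nwm-1002"],
    "primary_location_id": ["usgs-01", "usgs-02"],
})
# The join attaches primary_location_id to each secondary row.
with_primary = secondary.merge(
    crosswalk, left_on="location_id", right_on="secondary_location_id"
)
print(sorted(with_primary["primary_location_id"]))  # ['usgs-01', 'usgs-02']
```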
- Parameters:
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
SecondaryTimeseriesView – A lazy view of the secondary timeseries with primary_location_id.
Examples
Basic usage (adds primary_location_id via crosswalk):
>>> ev.secondary_timeseries_view().to_pandas()
With filters (chained):
>>> ev.secondary_timeseries_view().filter(
...     "configuration_name = 'nwm30_retrospective'"
... ).to_pandas()
With location attributes:
>>> ev.secondary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
- sql(query: str)#
Execute a SQL query using the active catalog and namespace.
This is a thin wrapper around spark.sql() that automatically sets the active catalog and namespace so the user does not have to qualify table names in their queries.
- Parameters:
query (str) – The SQL query to execute. Table names can be unqualified (e.g. primary_timeseries) or partially qualified (e.g. teehr.primary_timeseries). The active catalog (ev.active_catalog.catalog_name) and active namespace (ev.active_catalog.namespace_name) are set automatically before the query runs.
- Returns:
pyspark.sql.DataFrame – The result of the SQL query as a Spark DataFrame.
Examples
Query a table without specifying the catalog or namespace:
>>> df = ev.sql("SELECT * FROM primary_timeseries LIMIT 10")
>>> df.toPandas()
Use aggregate functions:
>>> df = ev.sql(
...     "SELECT location_id, COUNT(*) as n "
...     "FROM primary_timeseries GROUP BY location_id"
... )
- table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#
Get a table instance by name.
This is a factory method that returns the appropriate table class for the given table name. For known table names (like ‘primary_timeseries’), returns the specialized table class. For unknown names, returns a generic BaseTable instance.
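The factory dispatch can be sketched as follows. Class and registry names here are hypothetical; the real ones live in TEEHR:

```python
class BaseTable:
    """Generic fallback table wrapper."""
    def __init__(self, name: str):
        self.name = name

class PrimaryTimeseriesTable(BaseTable):
    """Specialized wrapper for a known core table."""

# Registry of known table names -> specialized classes.
KNOWN_TABLES = {"primary_timeseries": PrimaryTimeseriesTable}

def table(table_name: str) -> BaseTable:
    # Known names get their specialized class; everything else
    # falls back to the generic BaseTable.
    cls = KNOWN_TABLES.get(table_name, BaseTable)
    return cls(table_name)

print(type(table("primary_timeseries")).__name__)  # PrimaryTimeseriesTable
print(type(table("my_custom_table")).__name__)     # BaseTable
```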
- Parameters:
table_name (str) – The name of the table to access.
namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.
catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.
- Returns:
BaseTable – The appropriate table instance.
Examples
>>> # Access a known table
>>> ev.table("primary_timeseries").aggregate(...)
>>> # Access a custom/user-defined table
>>> ev.table("my_custom_table").to_pandas()
- property variables: VariableTable#
Access the variables table.