RemoteReadWriteEvaluation#

class RemoteReadWriteEvaluation(spark: SparkSession | None = None, temp_dir_path: Path | str | None = None)[source]#

A read-write Evaluation class for access to remote catalogs.

This class provides a convenient way to access a remote TEEHR catalog without needing to manage local directories. It automatically creates a temporary directory and sets the active catalog to the remote catalog.

Note: This is intended for read-write access to remote data. Write operations to the remote catalog are supported through this class; however, an AWS profile with write permissions is required in the Spark session.

Currently, only users in the TEEHR-Hub environment have access to the remote catalog, so this class is intended for use within that environment until remote access becomes more broadly available.

Methods

drop_table

Drop a user-created table from the catalog.

enable_logging

Enable logging.

joined_timeseries_view

Create a computed view that joins primary and secondary timeseries.

list_tables

List the tables in the catalog, returning a pandas DataFrame.

list_views

List the views in the catalog, returning a pandas DataFrame.

location_attributes_view

Create a computed view of pivoted location attributes.

log_spark_config

Log the current Spark session configuration.

primary_timeseries_view

Create a computed view of primary timeseries with optional attrs.

secondary_timeseries_view

Create a computed view of secondary timeseries with crosswalk.

sql

Execute a SQL query using the active catalog and namespace.

table

Get a table instance by name.

Attributes

active_catalog

Alias for catalog property (backwards compatibility).

attributes

Access the attributes table.

catalog

The remote catalog for this evaluation.

configurations

Access the configurations table.

download

The download component class for managing data downloads.

extract

The extract component class for extracting data.

fetch

The fetch component class for accessing external data.

generate

The generate component class for generating synthetic data.

load

The load component class for loading data.

location_attributes

Access the location attributes table.

location_crosswalks

Access the location crosswalks table.

locations

Access the locations table.

metrics

The metrics component class for calculating performance metrics.

primary_timeseries

Access the primary timeseries table.

read

The read component class for reading data.

remote_catalog

Alias for catalog property (backwards compatibility).

secondary_timeseries

Access the secondary timeseries table.

units

Access the units table.

validate

The validate component class for validating data.

variables

Access the variables table.

write

The write component class for writing data.

property active_catalog#

Alias for catalog property (backwards compatibility).

Deprecated: Use the catalog property instead. This alias will be removed in a future version.

Returns:

LocalCatalog or RemoteCatalog – The catalog configuration for this evaluation.

property attributes: AttributeTable#

Access the attributes table.

property catalog#

The remote catalog for this evaluation.

Returns:

RemoteCatalog – The remote catalog configuration.

property configurations: ConfigurationTable#

Access the configurations table.

property download: Download#

The download component class for managing data downloads.

drop_table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#

Drop a user-created table from the catalog.

Only non-core tables (user-created tables, materialized views, saved query results) can be dropped. Attempting to drop a core table (e.g., primary_timeseries, locations, units) will raise a ValueError.

Parameters:
  • table_name (str) – The name of the table to drop.

  • namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.

  • catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.

Raises:

ValueError – If the table is a core TEEHR table.

Examples

Write and then drop a user-created table:

>>> ev.joined_timeseries_view().write("my_results")
>>> ev.drop_table("my_results")
enable_logging()#

Enable logging.

property extract: Extract#

The extract component class for extracting data.

property fetch: Fetch#

The fetch component class for accessing external data.

property generate: GeneratedTimeseries#

The generate component class for generating synthetic data.

joined_timeseries_view(primary_filters: str | dict | List[str | dict] | None = None, secondary_filters: str | dict | List[str | dict] | None = None, add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) JoinedTimeseriesView#

Create a computed view that joins primary and secondary timeseries.

This returns a lazy view that computes the join on the fly when accessed. The view can be filtered, transformed, and optionally materialized to an Iceberg table via write().

Parameters:
  • primary_filters (Union[str, dict, List[...]], optional) – Filters to apply to primary timeseries before joining.

  • secondary_filters (Union[str, dict, List[...]], optional) – Filters to apply to secondary timeseries before joining.

  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add (if add_attrs=True).

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

JoinedTimeseriesView – A lazy view of the joined timeseries.

Examples

Create different join views:

>>> winter = ev.joined_timeseries_view(primary_filters=["month IN (12, 1, 2)"])
>>> summer = ev.joined_timeseries_view(primary_filters=["month IN (6, 7, 8)"])

Use directly (computes on the fly):

>>> ev.joined_timeseries_view().to_pandas()

Chain operations:

>>> ev.joined_timeseries_view().filter("primary_location_id LIKE 'usgs%'").to_pandas()

Compute metrics and materialize:

>>> ev.joined_timeseries_view().aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")

Materialize joined data:

>>> ev.joined_timeseries_view(add_attrs=True).write("joined_timeseries")

Read from a remote catalog and namespace:

>>> ev.joined_timeseries_view(
...     catalog_name="some_catalog",
...     namespace_name="some_namespace"
... ).aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
list_tables(catalog_name: str | None = None, namespace: str | None = None) DataFrame#

List the tables in the catalog, returning a pandas DataFrame.

Parameters:
  • catalog_name (str, optional) – The catalog to list tables from. If None, the catalog_name of the active catalog is used.

  • namespace (str, optional) – The namespace to list tables from. If None, the namespace_name of the active catalog is used.

list_views() DataFrame#

List the views in the catalog, returning a pandas DataFrame.

property load: Load#

The load component class for loading data.

property location_attributes: LocationAttributeTable#

Access the location attributes table.

location_attributes_view(attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) LocationAttributesView#

Create a computed view of pivoted location attributes.

Transforms the location_attributes table from long format (location_id, attribute_name, value) to wide format where each attribute becomes a column.

Parameters:
  • attr_list (List[str], optional) – Specific attributes to include. If None, includes all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

LocationAttributesView – A lazy view of the pivoted attributes.

Examples

Pivot all attributes:

>>> ev.location_attributes_view().to_pandas()

Pivot specific attributes:

>>> ev.location_attributes_view(
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()

With filters (chained):

>>> ev.location_attributes_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()

Materialize for later use:

>>> ev.location_attributes_view().write("pivoted_attrs")
property location_crosswalks: LocationCrosswalkTable#

Access the location crosswalks table.

property locations: LocationTable#

Access the locations table.

log_spark_config()#

Log the current Spark session configuration.

property metrics: Metrics#

The metrics component class for calculating performance metrics.

Deprecated since version 0.6.0: The metrics property is deprecated and will be removed in a future version. Use the aggregate method on the table directly with the metrics argument instead. For example:

ev.table("joined_timeseries").aggregate(
    metrics=[...],
    group_by=[...]
)
property primary_timeseries: PrimaryTimeseriesTable#

Access the primary timeseries table.

primary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) PrimaryTimeseriesView#

Create a computed view of primary timeseries with optional attrs.

Parameters:
  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

PrimaryTimeseriesView – A lazy view of the primary timeseries.

Examples

Basic usage:

>>> ev.primary_timeseries_view().to_pandas()

With filters (chained):

>>> ev.primary_timeseries_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()

With location attributes:

>>> ev.primary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
property read: Read#

The read component class for reading data.

property remote_catalog#

Alias for catalog property (backwards compatibility).

Deprecated: Use the catalog property instead.

Returns:

RemoteCatalog – The remote catalog configuration.

property secondary_timeseries: SecondaryTimeseriesTable#

Access the secondary timeseries table.

secondary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) SecondaryTimeseriesView#

Create a computed view of secondary timeseries with crosswalk.

Joins secondary timeseries with location_crosswalks to add primary_location_id, and optionally joins location attributes.

Parameters:
  • add_attrs (bool, optional) – Whether to add location attributes. Default False.

  • attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.

  • catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.

  • namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.

Returns:

SecondaryTimeseriesView – A lazy view of the secondary timeseries with primary_location_id.

Examples

Basic usage (adds primary_location_id via crosswalk):

>>> ev.secondary_timeseries_view().to_pandas()

With filters (chained):

>>> ev.secondary_timeseries_view().filter(
...     "configuration_name = 'nwm30_retrospective'"
... ).to_pandas()

With location attributes:

>>> ev.secondary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
sql(query: str)#

Execute a SQL query using the active catalog and namespace.

This is a thin wrapper around spark.sql() that automatically sets the active catalog and namespace so the user does not have to qualify table names in their queries.

Parameters:

query (str) – The SQL query to execute. Table names can be unqualified (e.g. primary_timeseries) or partially qualified (e.g. teehr.primary_timeseries). The active catalog (ev.active_catalog.catalog_name) and active namespace (ev.active_catalog.namespace_name) are set automatically before the query runs.

Returns:

pyspark.sql.DataFrame – The result of the SQL query as a Spark DataFrame.

Examples

Query a table without specifying the catalog or namespace:

>>> df = ev.sql("SELECT * FROM primary_timeseries LIMIT 10")
>>> df.toPandas()

Use aggregate functions:

>>> df = ev.sql(
...     "SELECT location_id, COUNT(*) as n "
...     "FROM primary_timeseries GROUP BY location_id"
... )
table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#

Get a table instance by name.

This is a factory method that returns the appropriate table class for the given table name. For known table names (like ‘primary_timeseries’), returns the specialized table class. For unknown names, returns a generic BaseTable instance.

Parameters:
  • table_name (str) – The name of the table to access.

  • namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.

  • catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.

Returns:

BaseTable – The appropriate table instance.

Examples

>>> # Access a known table
>>> ev.table("primary_timeseries").aggregate(...)
>>> # Access a custom/user-defined table
>>> ev.table("my_custom_table").to_pandas()
property units: UnitTable#

Access the units table.

property validate: Validate#

The validate component class for validating data.

property variables: VariableTable#

Access the variables table.

property write: Write#

The write component class for writing data.