RemoteReadOnlyEvaluation#
- class RemoteReadOnlyEvaluation(spark: SparkSession | None = None, temp_dir_path: Path | str | None = None)[source]#
A read-only Evaluation class for accessing remote catalogs.
This class provides a convenient way to access a remote TEEHR catalog without needing to manage local directories. It automatically creates a temporary directory and sets the active catalog to remote.
Note: This is intended for read-only access to remote data. Write operations to the remote catalog are not supported through this class.
Currently only users in the TEEHR-Hub environment have access to the remote catalog, so this class is intended for use within that environment until remote access is more broadly available.
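The temporary-directory bookkeeping described above can be sketched with the standard library. This is a minimal illustration only; `ReadOnlySession` is a hypothetical stand-in, not TEEHR's actual implementation:

```python
import tempfile
from pathlib import Path

class ReadOnlySession:
    """Hypothetical sketch: create a throwaway working directory
    when no temp_dir_path is supplied, mirroring the behavior
    described for RemoteReadOnlyEvaluation."""

    def __init__(self, temp_dir_path=None):
        if temp_dir_path is None:
            # Auto-create a temporary directory for the session.
            self._tmp = tempfile.TemporaryDirectory()
            self.temp_dir_path = Path(self._tmp.name)
        else:
            self._tmp = None
            self.temp_dir_path = Path(temp_dir_path)

session = ReadOnlySession()
print(session.temp_dir_path.exists())  # True
```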
Methods

drop_table(table_name[, namespace_name, catalog_name]) – Drop a user-created table from the catalog.
enable_logging() – Enable logging.
joined_timeseries_view([primary_filters, ...]) – Create a computed view that joins primary and secondary timeseries.
list_tables([catalog_name, namespace]) – List the tables in the catalog returning a Pandas DataFrame.
list_views() – List the views in the catalog returning a Pandas DataFrame.
location_attributes_view([attr_list, ...]) – Create a computed view of pivoted location attributes.
log_spark_config() – Log the current Spark session configuration.
primary_timeseries_view([add_attrs, attr_list, ...]) – Create a computed view of primary timeseries with optional attrs.
secondary_timeseries_view([add_attrs, attr_list, ...]) – Create a computed view of secondary timeseries with crosswalk.
sql(query) – Execute a SQL query using the active catalog and namespace.
table(table_name[, namespace_name, catalog_name]) – Get a table instance by name.
Attributes

active_catalog – Alias for catalog property (backwards compatibility).
attributes – Access the attributes table.
catalog – The remote catalog for this evaluation.
configurations – Access the configurations table.
download – The download component class for managing data downloads.
extract – The extract component class for extracting data.
fetch – The fetch component class for accessing external data.
generate – The generate component class for generating synthetic data.
load – The load component class for loading data.
location_attributes – Access the location attributes table.
location_crosswalks – Access the location crosswalks table.
locations – Access the locations table.
metrics – The metrics component class for calculating performance metrics.
primary_timeseries – Access the primary timeseries table.
read – The read component class for reading data.
remote_catalog – Alias for catalog property (backwards compatibility).
secondary_timeseries – Access the secondary timeseries table.
units – Access the units table.
validate – The validate component class for validating data.
variables – Access the variables table.
write – The write component class for writing data.
- property active_catalog#
Alias for catalog property (backwards compatibility).
Deprecated: Use the catalog property instead. This alias will be removed in a future version.
- Returns:
LocalCatalog or RemoteCatalog – The catalog configuration for this evaluation.
- property attributes: AttributeTable#
Access the attributes table.
- property catalog#
The remote catalog for this evaluation.
- Returns:
RemoteCatalog – The remote catalog configuration.
- property configurations: ConfigurationTable#
Access the configurations table.
- drop_table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#
Drop a user-created table from the catalog.
Only non-core tables (user-created tables, materialized views, saved query results) can be dropped. Attempting to drop a core table (e.g., primary_timeseries, locations, units) will raise a ValueError.
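The core-table protection can be sketched as follows. This is a hypothetical illustration; the actual set of protected core tables is defined by TEEHR itself:

```python
# Illustrative subset of core table names (the real list lives in TEEHR).
CORE_TABLES = {"primary_timeseries", "secondary_timeseries", "locations", "units"}

def drop_table(table_name: str) -> str:
    """Sketch of the guard: refuse to drop core tables, otherwise
    return the SQL statement that would be executed."""
    if table_name in CORE_TABLES:
        raise ValueError(f"Cannot drop core TEEHR table: {table_name}")
    return f"DROP TABLE {table_name}"

print(drop_table("my_results"))  # DROP TABLE my_results
```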
- Parameters:
table_name (str) – The name of the table to drop.
namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.
catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.
- Raises:
ValueError – If the table is a core TEEHR table.
Examples
Write and then drop a user-created table:
>>> ev.joined_timeseries_view().write("my_results")
>>> ev.drop_table("my_results")
- enable_logging()#
Enable logging.
- property generate: GeneratedTimeseries#
The generate component class for generating synthetic data.
- joined_timeseries_view(primary_filters: str | dict | List[str | dict] | None = None, secondary_filters: str | dict | List[str | dict] | None = None, add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → JoinedTimeseriesView#
Create a computed view that joins primary and secondary timeseries.
This returns a lazy view that computes the join on-the-fly when accessed. The view can be filtered, transformed, and optionally materialized to an iceberg table via write().
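Conceptually, the join pairs primary (observed) and secondary (simulated) values on location and time. A minimal pandas sketch of the idea, using illustrative column names rather than TEEHR's exact schema:

```python
import pandas as pd

# Illustrative primary (observed) timeseries.
primary = pd.DataFrame({
    "location_id": ["usgs-01", "usgs-01"],
    "value_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "primary_value": [1.0, 2.0],
})
# Illustrative secondary (simulated) timeseries, already keyed by
# primary_location_id.
secondary = pd.DataFrame({
    "primary_location_id": ["usgs-01", "usgs-01"],
    "value_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "secondary_value": [1.1, 1.9],
})
# Pair observed and simulated values on location and time.
joined = primary.merge(
    secondary,
    left_on=["location_id", "value_time"],
    right_on=["primary_location_id", "value_time"],
)
print(len(joined))  # 2
```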
- Parameters:
primary_filters (Union[str, dict, List[...]], optional) – Filters to apply to primary timeseries before joining.
secondary_filters (Union[str, dict, List[...]], optional) – Filters to apply to secondary timeseries before joining.
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add (if add_attrs=True).
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
JoinedTimeseriesView – A lazy view of the joined timeseries.
Examples
Create different join views:
>>> winter = ev.joined_timeseries_view(primary_filters=["month IN (12, 1, 2)"])
>>> summer = ev.joined_timeseries_view(primary_filters=["month IN (6, 7, 8)"])
Use directly (computes on-the-fly):
>>> ev.joined_timeseries_view().to_pandas()
Chain operations:
>>> ev.joined_timeseries_view().filter("primary_location_id LIKE 'usgs%'").to_pandas()
Compute metrics and materialize:
>>> ev.joined_timeseries_view().aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
Materialize joined data:
>>> ev.joined_timeseries_view(add_attrs=True).write("joined_timeseries")
Read from a remote catalog and namespace:
>>> ev.joined_timeseries_view(
...     catalog_name="some_catalog",
...     namespace_name="some_namespace"
... ).aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write("location_kge")
- list_tables(catalog_name: str | None = None, namespace: str | None = None) → DataFrame#
List the tables in the catalog returning a Pandas DataFrame.
- Parameters:
catalog_name (str, optional) – The catalog name to list tables from, by default None, which means the catalog_name of the active catalog is used.
namespace (str, optional) – The namespace name to list tables from, by default None, which means the namespace_name of the active catalog is used.
- list_views() → DataFrame#
List the views in the catalog returning a Pandas DataFrame.
- property location_attributes: LocationAttributeTable#
Access the location attributes table.
- location_attributes_view(attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → LocationAttributesView#
Create a computed view of pivoted location attributes.
Transforms the location_attributes table from long format (location_id, attribute_name, value) to wide format where each attribute becomes a column.
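The long-to-wide transformation described above is equivalent to a pandas pivot. A minimal sketch with illustrative data (not TEEHR's exact schema):

```python
import pandas as pd

# Long format: one row per (location, attribute) pair.
long_df = pd.DataFrame({
    "location_id": ["usgs-01", "usgs-01", "usgs-02", "usgs-02"],
    "attribute_name": ["drainage_area", "ecoregion"] * 2,
    "value": ["10.5", "plains", "22.0", "mountains"],
})
# Wide format: one row per location, one column per attribute.
wide_df = long_df.pivot(
    index="location_id", columns="attribute_name", values="value"
).reset_index()
print(wide_df.shape)  # (2, 3)
```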
- Parameters:
attr_list (List[str], optional) – Specific attributes to include. If None, includes all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
LocationAttributesView – A lazy view of the pivoted attributes.
Examples
Pivot all attributes:
>>> ev.location_attributes_view().to_pandas()
Pivot specific attributes:
>>> ev.location_attributes_view(
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
With filters (chained):
>>> ev.location_attributes_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()
Materialize for later use:
>>> ev.location_attributes_view().write("pivoted_attrs")
- property location_crosswalks: LocationCrosswalkTable#
Access the location crosswalks table.
- property locations: LocationTable#
Access the locations table.
- log_spark_config()#
Log the current Spark session configuration.
- property metrics: Metrics#
The metrics component class for calculating performance metrics.
Deprecated since version 0.6.0: The metrics property is deprecated and will be removed in a future version. Use the aggregate method on the table directly with the metrics argument instead. For example:

ev.table("joined_timeseries").aggregate(
    metrics=[...],
    group_by=[...]
)
- property primary_timeseries: PrimaryTimeseriesTable#
Access the primary timeseries table.
- primary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → PrimaryTimeseriesView#
Create a computed view of primary timeseries with optional attrs.
- Parameters:
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
PrimaryTimeseriesView – A lazy view of the primary timeseries.
Examples
Basic usage:
>>> ev.primary_timeseries_view().to_pandas()
With filters (chained):
>>> ev.primary_timeseries_view().filter(
...     "location_id LIKE 'usgs%'"
... ).to_pandas()
With location attributes:
>>> ev.primary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
- property remote_catalog#
Alias for catalog property (backwards compatibility).
Deprecated: Use the catalog property instead.
- Returns:
RemoteCatalog – The remote catalog configuration.
- property secondary_timeseries: SecondaryTimeseriesTable#
Access the secondary timeseries table.
- secondary_timeseries_view(add_attrs: bool = False, attr_list: List[str] | None = None, catalog_name: str | None = None, namespace_name: str | None = None) → SecondaryTimeseriesView#
Create a computed view of secondary timeseries with crosswalk.
Joins secondary timeseries with location_crosswalks to add primary_location_id, and optionally joins location attributes.
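The crosswalk join described above can be sketched in pandas. Column names here are illustrative, not TEEHR's exact schema:

```python
import pandas as pd

# Illustrative secondary timeseries keyed by its own location ids.
secondary = pd.DataFrame({
    "location_id": ["nwm-1001", "nwm-1002"],
    "value": [3.2, 4.1],
})
# Crosswalk mapping secondary location ids to primary location ids.
crosswalk = pd.DataFrame({
    "secondary_location_id": ["nwm-1001", "nwm-1002"],
    "primary_location_id": ["usgs-01", "usgs-02"],
})
# The join attaches primary_location_id to each secondary row.
with_primary = secondary.merge(
    crosswalk, left_on="location_id", right_on="secondary_location_id"
)
print(sorted(with_primary["primary_location_id"]))  # ['usgs-01', 'usgs-02']
```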
- Parameters:
add_attrs (bool, optional) – Whether to add location attributes. Default False.
attr_list (List[str], optional) – Specific attributes to add. If None and add_attrs=True, adds all.
catalog_name (Union[str, None], optional) – The catalog containing the source tables. If None, uses the active catalog.
namespace_name (Union[str, None], optional) – The namespace containing the source tables. If None, uses the active catalog’s namespace.
- Returns:
SecondaryTimeseriesView – A lazy view of the secondary timeseries with primary_location_id.
Examples
Basic usage (adds primary_location_id via crosswalk):
>>> ev.secondary_timeseries_view().to_pandas()
With filters (chained):
>>> ev.secondary_timeseries_view().filter(
...     "configuration_name = 'nwm30_retrospective'"
... ).to_pandas()
With location attributes:
>>> ev.secondary_timeseries_view(
...     add_attrs=True,
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()
- sql(query: str)#
Execute a SQL query using the active catalog and namespace.
This is a thin wrapper around spark.sql() that automatically sets the active catalog and namespace so the user does not have to qualify table names in their queries.
- Parameters:
query (str) – The SQL query to execute. Table names can be unqualified (e.g. primary_timeseries) or partially qualified (e.g. teehr.primary_timeseries). The active catalog (ev.active_catalog.catalog_name) and active namespace (ev.active_catalog.namespace_name) are set automatically before the query runs.
- Returns:
pyspark.sql.DataFrame – The result of the SQL query as a Spark DataFrame.
Examples
Query a table without specifying the catalog or namespace:
>>> df = ev.sql("SELECT * FROM primary_timeseries LIMIT 10")
>>> df.toPandas()
Use aggregate functions:
>>> df = ev.sql(
...     "SELECT location_id, COUNT(*) as n "
...     "FROM primary_timeseries GROUP BY location_id"
... )
- table(table_name: str, namespace_name: str | None = None, catalog_name: str | None = None)#
Get a table instance by name.
This is a factory method that returns the appropriate table class for the given table name. For known table names (like ‘primary_timeseries’), returns the specialized table class. For unknown names, returns a generic BaseTable instance.
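The factory dispatch can be sketched as follows. Class and registry names here are hypothetical; the real ones live in TEEHR:

```python
class BaseTable:
    """Generic fallback table wrapper."""
    def __init__(self, name: str):
        self.name = name

class PrimaryTimeseriesTable(BaseTable):
    """Specialized wrapper for a known core table."""

# Registry of known table names -> specialized classes.
KNOWN_TABLES = {"primary_timeseries": PrimaryTimeseriesTable}

def table(table_name: str) -> BaseTable:
    # Known names get their specialized class; everything else
    # falls back to the generic BaseTable.
    cls = KNOWN_TABLES.get(table_name, BaseTable)
    return cls(table_name)

print(type(table("primary_timeseries")).__name__)  # PrimaryTimeseriesTable
print(type(table("my_custom_table")).__name__)     # BaseTable
```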
- Parameters:
table_name (str) – The name of the table to access.
namespace_name (Union[str, None], optional) – The namespace containing the table. If None, uses the active catalog’s namespace.
catalog_name (Union[str, None], optional) – The catalog containing the table. If None, uses the active catalog name.
- Returns:
BaseTable – The appropriate table instance.
Examples
>>> # Access a known table
>>> ev.table("primary_timeseries").aggregate(...)
>>> # Access a custom/user-defined table
>>> ev.table("my_custom_table").to_pandas()
- property variables: VariableTable#
Access the variables table.