DomainTable#

class DomainTable(ev, table_name: str = None, namespace_name: str | None = None, catalog_name: str | None = None)[source]#

Domain table class.

Domain tables store reference data (units, variables, configurations, attributes) that other tables reference via foreign keys.

Methods

`add_attributes`	Add location attributes to the DataFrame.
`add_calculated_fields`	Add calculated fields to the DataFrame.
`add_geometry`	Add geometry to the DataFrame.
`aggregate`	Aggregate data with grouping and metrics.
`delete`	Delete rows from this table based on filter conditions.
`distinct_values`	Return distinct values for a column.
`drop`	Drop this table from the catalog.
`filter`	Apply filters to the DataFrame.
`load_csv`	Import location attributes from CSV file format.
`load_dataframe`	Import data from an in-memory dataframe.
`load_parquet`	Import location attributes from parquet file format.
`order_by`	Apply ordering to the DataFrame.
`to_geopandas`	Return GeoPandas DataFrame.
`to_pandas`	Return Pandas DataFrame.
`to_sdf`	Return the PySpark DataFrame.
`validate`	Validate the dataset table against the schema.
`write`	Write the DataFrame to an iceberg table.
`write_to`	Write the DataFrame to an iceberg table.

Attributes

`foreign_keys`
`is_core_table`	Return True if this table is a core (built-in) TEEHR table.
`primary_location_id_field`
`schema_func`
`secondary_location_id_field`
`strict_validation`
`table_name`
`uniqueness_fields`
`validate_filter_field_types`
`extraction_func`(in_filepath[, ...])	Convert a single file to a pandas DataFrame.

add_attributes(attr_list: List[str] = None, location_id_col: str = None)#

Add location attributes to the DataFrame.

Joins pivoted location attributes to the DataFrame. The join column is auto-detected from common location ID field names (‘location_id’, ‘primary_location_id’) unless specified.

This is especially useful when called after a aggregate() with GROUP BY and aggregation metrics, so that attributes do not need to be included in the group_by clause in order to pass through to the result.

Parameters:

attr_list (List[str], optional) – Specific attributes to add. If None, all attributes are added.
location_id_col (str, optional) – The column name in the DataFrame to join on. If None, checks for ‘location_id’ then ‘primary_location_id’.

Returns:

TeehrDataFrameBase – A new accessor instance with attributes joined.

Examples

Add all attributes:

>>> df = accessor.add_attributes().to_pandas()

Add specific attributes:

>>> df = accessor.add_attributes(
...     attr_list=["drainage_area", "ecoregion"]
... ).to_pandas()

Specify join column explicitly:

>>> df = accessor.add_attributes(
...     location_id_col="primary_location_id"
... ).to_pandas()

Add attributes after metric aggregation — avoids including them in group_by:

>>> from teehr.metrics import KGE
>>> df = (
...     ev.joined_timeseries_view()
...     .aggregate(
...         group_by=["primary_location_id"],
...         metrics=[KGE()]
...     )
...     .add_attributes(attr_list=["drainage_area", "ecoregion"])
...     .to_pandas()
... )

add_calculated_fields(cfs: CalculatedFieldBaseModel | List[CalculatedFieldBaseModel], engine: str = 'auto')#

Add calculated fields to the DataFrame.

Parameters:

cfs (Union[CalculatedFieldBaseModel, List[...]]) – The calculated fields to add.
engine (str, optional) – Execution engine for calculated fields. Options are "auto", "python", or "spark". Default is "auto".

Returns:

TeehrDataFrameBase – A new accessor instance with calculated fields added.

Examples

>>> import teehr
>>> from teehr import RowLevelCalculatedFields as rcf
>>>
>>> df = accessor.add_calculated_fields([
>>>     rcf.Month()
>>> ]).to_pandas()

add_geometry()[source]#: Add geometry to the DataFrame.

aggregate(group_by: str | List[str], metrics: List[MetricsBasemodel], engine: str = 'auto')#

Aggregate data with grouping and metrics.

Parameters:

group_by (Union[str, List[str]]) – Fields to group by for metric calculation.
metrics (List[MetricsBasemodel]) – Metrics to calculate.
engine (str, optional) – Aggregation engine to use. Options are "auto", "python", or "spark". Default is "auto".

Returns:

TeehrDataFrameBase – A new accessor instance with aggregation results.

Examples

>>> df = accessor.aggregate(
>>>     metrics=[KGE()],
>>>     group_by=["primary_location_id"]
>>> ).to_pandas()

Chain with filter and order_by:

>>> from teehr import DeterministicMetrics as dm
>>>
>>> df = (
>>>     accessor
>>>     .filter("primary_location_id LIKE 'usgs%'")
>>>     .aggregate(
>>>         group_by=["primary_location_id", "configuration_name"],
>>>         metrics=[dm.KlingGuptaEfficiency(), dm.RelativeBias()]
>>>     )
>>>     .order_by(["primary_location_id", "configuration_name"])
>>>     .to_pandas()
>>> )

Delete rows from this table based on filter conditions.

Delegates to Write.delete_from().

Parameters:

filters (Union[str, dict, TableFilter, List[...]], optional) – Filter conditions specifying which rows to delete. Supports SQL strings, dictionaries, or TableFilter objects. If None, all rows in the table will be deleted.
dry_run (bool, optional) – If True, returns a Spark DataFrame of rows that would be deleted without performing the actual deletion. Default is False.

Returns:

int or ps.DataFrame – If dry_run=False, returns the number of rows deleted (int). If dry_run=True, returns a Spark DataFrame of rows that would be deleted.

Examples

Preview rows that would be deleted (dry run):

>>> sdf = ev.table("primary_timeseries").delete(
>>>     filters=["location_id = 'usgs-01234567'"],
>>>     dry_run=True,
>>> )
>>> print(f"Rows to delete: {sdf.count()}")

Delete rows and get the count:

>>> count = ev.table("primary_timeseries").delete(
>>>     filters=["location_id = 'usgs-01234567'"],
>>> )
>>> print(f"Deleted {count} rows.")

Delete all rows from this table:

>>> count = ev.primary_timeseries.delete()

distinct_values(column: str, location_prefixes: bool = False) → List[str]#

Return distinct values for a column.

Parameters:

column (str) – The column to get distinct values for.
location_prefixes (bool) – Whether to return location prefixes. If True, only the unique prefixes of the locations will be returned. Default: False

Returns:

List[str] – The distinct values for the column.

Examples

Get distinct location IDs from the primary timeseries table:

>>> ev.table(table_name="primary_timeseries").distinct_values(
>>>     column='location_id',
>>>     location_prefixes=False
>>> )

Get distinct location prefixes from the joined timeseries table:

>>> ev.table(table_name="joined_timeseries").distinct_values(
>>>     column='primary_location_id',
>>>     location_prefixes=True
>>> )

drop()#

Drop this table from the catalog.

Only non-core tables (user-created tables, materialized views, saved query results) can be dropped. Attempting to drop a core table (e.g., primary_timeseries, locations, units) will raise a ValueError.

Raises:: ValueError – If the table is a core TEEHR table.

Examples

Write and then drop a user-created table:

>>> ev.joined_timeseries_view().write("my_results")
>>> ev.table("my_results").drop()

static extraction_func(in_filepath: str | Path, field_mapping: dict = None, table_name: str = 'data', **kwargs) → DataFrame#

Convert a single file to a pandas DataFrame.

Supports parquet, csv, and netcdf file formats.

Parameters:

in_filepath (Union[str, Path]) – The input file path.
field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}. The default is None.
table_name (str, optional) – Name used for logging purposes. The default is “data”.
**kwargs – Additional keyword arguments are passed to pd.read_csv(), pd.read_parquet(), or xr.open_dataset().

Returns:

pd.DataFrame – The converted DataFrame with renamed columns if field_mapping provided.

Raises:

ValueError – If the file type is not supported.

Apply filters to the DataFrame.

Parameters:: filters (Union[str, dict, TableFilter, List[...]]) – The filters to apply. Can be SQL strings, dictionaries, or TableFilter objects.
Returns:: TeehrDataFrameBase – A new accessor instance with filters applied.

Examples

Filters as dictionary:

>>> df = accessor.filter(
>>>     filters=[
>>>         {
>>>             "column": "value_time",
>>>             "operator": ">",
>>>             "value": "2022-01-01",
>>>         },
>>>     ]
>>> ).to_pandas()

Filters as string:

>>> df = accessor.filter(
>>>     filters=["value_time > '2022-01-01'"]
>>> ).to_pandas()

property is_core_table: bool#

Return True if this table is a core (built-in) TEEHR table.

Core tables (e.g., primary_timeseries, locations, units) are part of the standard TEEHR schema and cannot be dropped. User-created tables (e.g., materialized views or saved query results) are not core tables and can be dropped.

Returns:: bool – True if the table is a core TEEHR table, False otherwise.

load_csv(in_path: Path | str, namespace_name: str = None, catalog_name: str = None, extraction_function: callable = None, pattern: str = '**/*.csv', field_mapping: dict = None, write_mode: str = 'append', drop_duplicates: bool = True, **kwargs)[source]#

Import location attributes from CSV file format.

Parameters:

in_path (Union[Path, str]) – The input file or directory path. CSV file format.
namespace_name (str, optional) – The namespace name to write to, by default None, which means the namespace_name of the active catalog is used.
catalog_name (str, optional) – The catalog name to write to, by default None, which means the catalog_name of the active catalog is used.
extraction_function (callable, optional) – A custom function to extract and transform the data from the input files to the TEEHR data model. If None (default), uses the table’s default extraction function.
pattern (str, optional) – The glob pattern to use when searching for files in a directory. Default is ‘**/*.csv’ to search for all CSV files recursively.
field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}
write_mode (str, optional (default: "append")) – The write mode for the table. Options include:
- “insert”: Insert new data without checking for duplicates.
- “append”: Insert new data, skipping rows that already exist.
- “upsert”: Update existing data, insert new data.
- “overwrite”: Update table with new snapshot version preserving historical versions.
- “create_or_replace”: Drop and recreate the table with new data.
drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
**kwargs – Additional keyword arguments are passed to pd.read_csv() or pd.read_parquet().

Notes

The TEEHR Location Crosswalk table schema includes fields:

primary_location_id
secondary_location_id

load_dataframe(df: DataFrame | DataFrame, namespace_name: str = None, catalog_name: str = None, field_mapping: dict = None, constant_field_values: dict = None, write_mode: str = 'append', drop_duplicates: bool = True)#

Import data from an in-memory dataframe.

Parameters:

df (Union[pd.DataFrame, ps.DataFrame]) – DataFrame to load into the table.
namespace_name (str, optional) – The namespace name to write to, by default None, which means the namespace_name of the active catalog is used.
catalog_name (str, optional) – The catalog name to write to, by default None, which means the catalog_name of the active catalog is used.
field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}
constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.
write_mode (str, optional (default: "append")) – The write mode for the table. Options:
- "insert": Insert all rows directly without duplicate checking.
- "append": Insert new rows; skip rows matching uniqueness fields.
- "upsert": Insert new rows; update existing rows matching uniqueness fields.
- "overwrite": Replace all data, preserving table history.
- "create_or_replace": Drop and recreate table. Loses history.
drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.

load_parquet(in_path: Path | str, namespace_name: str = None, catalog_name: str = None, extraction_function: callable = None, pattern: str = '**/*.parquet', field_mapping: dict = None, write_mode: str = 'append', drop_duplicates: bool = True, **kwargs)[source]#

Import location attributes from parquet file format.

Parameters:

in_path (Union[Path, str]) – The input file or directory path. Parquet file format.
namespace_name (str, optional) – The namespace name to write to, by default None, which means the namespace_name of the active catalog is used.
catalog_name (str, optional) – The catalog name to write to, by default None, which means the catalog_name of the active catalog is used.
extraction_function (callable, optional) – A custom function to extract and transform the data from the input files to the TEEHR data model. If None (default), uses the table’s default extraction function.
pattern (str, optional) – The glob pattern to use when searching for files in a directory. Default is ‘**/*.parquet’ to search for all parquet files recursively.
field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}
write_mode (str, optional (default: "append")) – The write mode for the table. Options include:
- “insert”: Insert new data without checking for duplicates.
- “append”: Insert new data, skipping rows that already exist.
- “upsert”: Update existing data, insert new data.
- “overwrite”: Update table with new snapshot version preserving historical versions.
- “create_or_replace”: Drop and recreate the table with new data.
drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
**kwargs – Additional keyword arguments are passed to pd.read_csv() or pd.read_parquet().

Notes

The TEEHR Location Crosswalk table schema includes fields:

primary_location_id
secondary_location_id

order_by(fields: str | StrEnum | List[str | StrEnum])#

Apply ordering to the DataFrame.

Parameters:: fields (Union[str, StrEnum, List[...]]) – The fields to order by.
Returns:: TeehrDataFrameBase – A new accessor instance with ordering applied.

Examples

>>> df = accessor.order_by("value_time").to_pandas()

to_geopandas()[source]#: Return GeoPandas DataFrame.

to_pandas()#

Return Pandas DataFrame.

Returns:: pd.DataFrame – The data as a Pandas DataFrame.

to_sdf() → DataFrame#

Return the PySpark DataFrame.

The PySpark DataFrame can be further processed using PySpark. Note, PySpark DataFrames are lazy and will not be executed until an action is called (e.g., show(), collect(), toPandas()).

Returns:: ps.DataFrame – The Spark DataFrame.

validate(drop_duplicates: bool = True)#

Validate the dataset table against the schema.

Parameters:: drop_duplicates (bool, optional) – Whether to drop duplicates based on the uniqueness fields. Default is True.

Examples

Validate a table:

>>> ev.table(
>>>     table_name="primary_timeseries"
>>> ).validate(drop_duplicates=True)

write(table_name: str, write_mode: str = 'create_or_replace', uniqueness_fields: list[str] | None = None, nullable_fields: list[str] | None = None, partition_by: list[str] | None = None, write_ordered_by: list[str] | None = None, use_partition_filters: bool = True)#

Write the DataFrame to an iceberg table.

Deprecated since version Use: write_to() instead. This method will be removed in a future release.

Parameters:

table_name (str) – The name of the table to write to.
write_mode (str, optional) – The write mode. Options:
- "insert": Insert all rows directly without duplicate checking.
- "append": Insert new rows; skip rows matching uniqueness fields.
- "upsert": Insert new rows; update existing rows matching uniqueness fields.
- "overwrite": Replace all data, preserving table history.
- "create_or_replace": Drop and recreate table. Loses history.
Default is “create_or_replace”.
uniqueness_fields (list[str], optional) – Explicit uniqueness fields to use for custom-table append or upsert writes. If omitted, uses the target table metadata.
nullable_fields (list[str], optional) – Explicit nullable uniqueness fields to compare with null-safe equality during append or upsert writes. If omitted, uses the target table schema when available.
partition_by (list[str], optional) – Partition expressions to use when creating a custom table with write_mode="create_or_replace".
write_ordered_by (list[str], optional) – Field names to use for Iceberg table write order via ALTER TABLE ... WRITE ORDERED BY. Each field is written as ASC NULLS LAST.
use_partition_filters (bool, optional) – Whether to add partition-based predicates for MERGE partition pruning. Default is True.

Returns:

self – Returns self for method chaining.

write_to(table_name: str, write_mode: str = 'create_or_replace', uniqueness_fields: list[str] | None = None, nullable_fields: list[str] | None = None, partition_by: list[str] | None = None, write_ordered_by: list[str] | None = None, use_partition_filters: bool = True)#

Write the DataFrame to an iceberg table.

Parameters:

table_name (str) – The name of the table to write to.
write_mode (str, optional) – The write mode. Options:
- "insert": Insert all rows directly without duplicate checking.
- "append": Insert new rows; skip rows matching uniqueness fields.
- "upsert": Insert new rows; update existing rows matching uniqueness fields.
- "overwrite": Replace all data, preserving table history.
- "create_or_replace": Drop and recreate table. Loses history.
Default is “create_or_replace”.
uniqueness_fields (list[str], optional) – Explicit uniqueness fields to use for custom-table append or upsert writes. If omitted, uses the target table metadata.
nullable_fields (list[str], optional) – Explicit nullable uniqueness fields to compare with null-safe equality during append or upsert writes. If omitted, uses the target table schema when available.
partition_by (list[str], optional) – Partition expressions to use when creating a custom table with write_mode="create_or_replace".
write_ordered_by (list[str], optional) – Field names to use for Iceberg table write order via ALTER TABLE ... WRITE ORDERED BY. Each field is written as ASC NULLS LAST.
use_partition_filters (bool, optional) – Whether to add partition-based predicates for MERGE partition pruning. Default is True.

Returns:

self – Returns self for method chaining.

Examples

>>> accessor.aggregate(
...     metrics=[KGE()],
...     group_by=["primary_location_id"]
... ).write_to("location_metrics")

DomainTable#

This Page