Load#
- class Load(ev=None)[source]#
Class to handle loading data into the warehouse.
Methods
- dataframe – Load data from an in-memory dataframe.
- file – Load data from a file on local storage.
- from_cache – Load data from the cache.
- dataframe(df: pd.DataFrame | ps.DataFrame, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True)[source]#
Load data from an in-memory dataframe.
- Parameters:
  - df (pd.DataFrame | ps.DataFrame) – The input dataframe to load into the warehouse.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.
  - constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.
  - primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.
  - primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.
  - secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.
  - secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
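The mapping and prefixing parameters compose in a fixed order: rename fields, add constants, then prefix location IDs. The following plain-Python sketch illustrates that composition; it is not TEEHR's implementation, and the prefix separator (`-`) is an assumption:

```python
# Illustrative stand-in for the column transformations described by
# field_mapping, constant_field_values, and primary_location_id_prefix.
# Not TEEHR's implementation; the "-" separator is an assumption.
def apply_mapping(rows, field_mapping=None, constant_field_values=None,
                  primary_location_id_prefix=None,
                  primary_location_id_field="location_id"):
    """Rename fields, add constant fields, and prefix location IDs."""
    out = []
    for row in rows:
        # Rename input fields to output fields: {input_field: output_field}.
        new = {(field_mapping or {}).get(k, k): v for k, v in row.items()}
        # Add constant-valued fields: {field_name: value}.
        new.update(constant_field_values or {})
        # Prefix the primary location ID so IDs stay unique across configurations.
        if primary_location_id_prefix:
            new[primary_location_id_field] = (
                f"{primary_location_id_prefix}-{new[primary_location_id_field]}"
            )
        out.append(new)
    return out

rows = [{"site": "gage-01", "flow": 1.2}]
result = apply_mapping(
    rows,
    field_mapping={"site": "location_id", "flow": "value"},
    constant_field_values={"unit_name": "m^3/s"},
    primary_location_id_prefix="usgs",
)
# result[0] == {"location_id": "usgs-gage-01", "value": 1.2, "unit_name": "m^3/s"}
```

Note that the rename runs before prefixing, so `primary_location_id_field` refers to the field name after `field_mapping` is applied.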
- file(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, extraction_function: callable | None = None, pattern: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True, parallel: bool = False, max_workers: int = 1, **kwargs)[source]#
Load data from a file on local storage.
- Parameters:
  - in_path (Path | str) – The input file or directory path.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - extraction_function (callable, optional) – The function to extract data from the input files into TEEHR’s data model. If None, the table’s default extraction function is used.
  - pattern (str, optional) – The glob pattern to match files in the input directory.
  - field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.
  - constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.
  - primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.
  - primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.
  - secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.
  - secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
  - update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.
  - parallel (bool, optional) – Whether to process timeseries files in parallel. Default is False.
  - max_workers (int, optional) – The maximum number of worker processes to use if parallel is True. Default is 1. If set to -1, uses the number of CPUs available.
  - **kwargs – Additional keyword arguments passed to the extraction function.
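The interaction between pattern and extraction_function can be sketched as follows. This is a hypothetical example: TEEHR supplies its own default extractors per table, and the real extraction-function signature may differ from the one assumed here (path in, rows out):

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical extraction function, for illustration only. TEEHR defines
# default extractors per table; the real signature may differ.
def extract_csv(in_path, **kwargs):
    """Read one input file and yield rows keyed by the target table's fields."""
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            row["value"] = float(row["value"])  # coerce numeric fields
            yield row

# Exercise it on a throwaway directory, using a glob like the `pattern`
# argument would to select input files.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "obs.csv"
    p.write_text("location_id,value\ngage-01,1.5\n")
    files = sorted(Path(d).glob("*.csv"))  # what pattern="*.csv" would match
    rows = [r for f in files for r in extract_csv(f)]
```

The `**kwargs` accepted by file() are forwarded to the extraction function, which is why a custom extractor should accept them even if it ignores them.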
- from_cache(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True)[source]#
Load data from the cache.
- Parameters:
  - in_path (Path | str) – The input cache directory path.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
  - update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.
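The three write_mode options shared by all loading methods can be sketched over an in-memory table. This is illustrative only (the function and field names here are hypothetical, not TEEHR internals); the warehouse applies the same rules to real tables:

```python
# Sketch of write_mode semantics, modeling a table as a list of dict rows.
# Hypothetical helper, not part of TEEHR's API.
def write(table, new_rows, key, write_mode="append"):
    """table, new_rows: lists of dict rows; key: primary-key field name."""
    if write_mode == "create_or_replace":
        return list(new_rows)             # replace the whole table
    if write_mode == "append":
        return table + list(new_rows)     # no existence check; duplicates possible
    if write_mode == "upsert":
        incoming = {r[key]: r for r in new_rows}
        kept = [r for r in table if r[key] not in incoming]  # drop rows being replaced
        return kept + list(new_rows)      # replaced rows plus genuinely new ones
    raise ValueError(f"unknown write_mode: {write_mode}")

existing = [{"location_id": "a", "value": 1.0}]
new = [{"location_id": "a", "value": 2.0}, {"location_id": "b", "value": 3.0}]
appended = write(existing, new, "location_id", "append")             # 3 rows, "a" twice
upserted = write(existing, new, "location_id", "upsert")             # 2 rows, "a" now 2.0
replaced = write(existing, new, "location_id", "create_or_replace")  # only the new rows
```

"append" is the cheapest mode but assumes the incoming data does not overlap existing rows; "upsert" is the safe default when re-loading overlapping data.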