Load#

class Load(ev=None)[source]#

Class to handle loading data into the warehouse.

Methods

dataframe

Load data from an in-memory dataframe.

file

Load data from a file on local storage.

from_cache

Load data from the cache.

dataframe(df: pd.DataFrame | ps.DataFrame, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True)[source]#

Load data from an in-memory dataframe.

Parameters:
  • df (pd.DataFrame | ps.DataFrame) – The input dataframe to load into the warehouse.

  • table_name (str) – The name of the table to load the data into.

  • namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.

  • catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.

  • field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.

  • constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.

  • primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.

  • primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.

  • secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.

  • secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.

  • write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, data is appended to the table without checking existing data. If “upsert”, rows matching existing data replace it, and rows that do not already exist are appended. If “create_or_replace”, a new table is created or an existing table is replaced.

  • drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.

file(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, extraction_function: callable | None = None, pattern: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True, parallel: bool = False, max_workers: int = 1, **kwargs)[source]#

Load data from a file on local storage.

Parameters:
  • in_path (Path | str) – The input file or directory path.

  • table_name (str) – The name of the table to load the data into.

  • namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.

  • catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.

  • extraction_function (callable, optional) – The function to extract data from the input files into TEEHR’s data model. If None, the table’s default extraction function is used.

  • pattern (str, optional) – The glob pattern to match files in the input directory.

  • field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.

  • constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.

  • primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.

  • primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.

  • secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.

  • secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.

  • write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, data is appended to the table without checking existing data. If “upsert”, rows matching existing data replace it, and rows that do not already exist are appended. If “create_or_replace”, a new table is created or an existing table is replaced.

  • drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.

  • update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.

  • parallel (bool, optional) – Whether to process timeseries files in parallel. Default is False.

  • max_workers (int, optional) – The maximum number of worker processes to use if parallel is True. Default is 1. If set to -1, uses the number of CPUs available.

  • **kwargs – Additional keyword arguments passed to the extraction function.
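
The three write_mode options differ in how they treat existing rows. A minimal sketch, treating the table as a list of (key, row) pairs; this is a simplification, since the real table is keyed on its primary-key columns:

```python
def write(table: list, new_rows: list, write_mode: str = "append") -> list:
    """Approximate the documented write_mode semantics over (key, row) pairs."""
    if write_mode == "append":
        # Rows are added without checking existing data.
        return table + new_rows
    if write_mode == "upsert":
        # Rows whose key already exists are replaced; new keys are appended.
        incoming_keys = {key for key, _ in new_rows}
        return [(k, r) for k, r in table if k not in incoming_keys] + new_rows
    if write_mode == "create_or_replace":
        # The table is created fresh, replacing any existing table.
        return list(new_rows)
    raise ValueError(f"Unknown write_mode: {write_mode!r}")


existing = [("loc-1", 1.0), ("loc-2", 2.0)]
incoming = [("loc-2", 9.0), ("loc-3", 3.0)]
print(write(existing, incoming, "upsert"))
# [('loc-1', 1.0), ('loc-2', 9.0), ('loc-3', 3.0)]
```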

from_cache(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True)[source]#

Load data from the cache.

Parameters:
  • in_path (Path | str) – The input cache directory path.

  • table_name (str) – The name of the table to load the data into.

  • namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.

  • catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.

  • write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, data is appended to the table without checking existing data. If “upsert”, rows matching existing data replace it, and rows that do not already exist are appended. If “create_or_replace”, a new table is created or an existing table is replaced.

  • drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.

  • update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.
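
When loading location attributes, update_attrs_table registers any attribute names present in the input data but not yet in the attributes table. The check can be sketched as follows; this is a hypothetical simplification, and the attribute_name field name is an assumption:

```python
def new_attribute_names(incoming_rows: list, known_attributes: set) -> list:
    """Return attribute names seen in the input data but not yet registered."""
    # Collect the distinct attribute names from the incoming rows, then
    # subtract those already present in the attributes table.
    seen = {row["attribute_name"] for row in incoming_rows}
    return sorted(seen - known_attributes)


rows = [
    {"location_id": "loc-1", "attribute_name": "drainage_area", "value": "10.5"},
    {"location_id": "loc-1", "attribute_name": "ecoregion", "value": "plains"},
]
print(new_attribute_names(rows, known_attributes={"drainage_area"}))  # ['ecoregion']
```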