Load#
- class Load(ev=None)[source]#
Class to handle loading data into the warehouse.
Methods
- dataframe – Load data from an in-memory dataframe.
- file – Load data from a file on local storage.
- from_cache – Load data from the cache.
- dataframe(df: pd.DataFrame | ps.DataFrame, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True)[source]#
Load data from an in-memory dataframe.
- Parameters:
  - df (pd.DataFrame | ps.DataFrame) – The input dataframe to load into the warehouse.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.
  - constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.
  - primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.
  - primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.
  - secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.
  - secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
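The mapping and prefixing parameters compose in a fixed order: rename fields, add constants, then prefix location IDs. The following plain-Python sketch illustrates that composition; it is not TEEHR's implementation, and the prefix separator (`-`) is an assumption:

```python
# Illustrative stand-in for the column transformations described by
# field_mapping, constant_field_values, and primary_location_id_prefix.
# Not TEEHR's implementation; the "-" separator is an assumption.
def apply_mapping(rows, field_mapping=None, constant_field_values=None,
                  primary_location_id_prefix=None,
                  primary_location_id_field="location_id"):
    """Rename fields, add constant fields, and prefix location IDs."""
    out = []
    for row in rows:
        # Rename input fields to output fields: {input_field: output_field}.
        new = {(field_mapping or {}).get(k, k): v for k, v in row.items()}
        # Add constant-valued fields: {field_name: value}.
        new.update(constant_field_values or {})
        # Prefix the primary location ID so IDs stay unique across configurations.
        if primary_location_id_prefix:
            new[primary_location_id_field] = (
                f"{primary_location_id_prefix}-{new[primary_location_id_field]}"
            )
        out.append(new)
    return out

rows = [{"site": "gage-01", "flow": 1.2}]
result = apply_mapping(
    rows,
    field_mapping={"site": "location_id", "flow": "value"},
    constant_field_values={"unit_name": "m^3/s"},
    primary_location_id_prefix="usgs",
)
# result[0] == {"location_id": "usgs-gage-01", "value": 1.2, "unit_name": "m^3/s"}
```

Note that the rename runs before prefixing, so `primary_location_id_field` refers to the field name after `field_mapping` is applied.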
- file(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, extraction_function: callable | None = None, pattern: str | None = None, field_mapping: dict | None = None, constant_field_values: dict | None = None, primary_location_id_prefix: str | None = None, primary_location_id_field: str = 'location_id', secondary_location_id_prefix: str | None = None, secondary_location_id_field: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True, parallel: bool = False, max_workers: int = 1, **kwargs)[source]#
Load data from a file on local storage.
- Parameters:
  - in_path (Path | str) – The input file or directory path.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - extraction_function (callable, optional) – The function to extract data from the input files into TEEHR’s data model. If None, the table’s default extraction function is used.
  - pattern (str, optional) – The glob pattern to match files in the input directory.
  - field_mapping (dict, optional) – A dictionary mapping input fields to output fields. Format: {input_field: output_field}.
  - constant_field_values (dict, optional) – A dictionary mapping field names to constant values. Format: {field_name: value}.
  - primary_location_id_prefix (str, optional) – The prefix to add to primary location IDs. Used to ensure unique location IDs across configurations.
  - primary_location_id_field (str, optional) – The name of the primary location ID field in the dataframe. The default is “location_id”.
  - secondary_location_id_prefix (str, optional) – The prefix to add to secondary location IDs. Used to ensure unique location IDs across configurations.
  - secondary_location_id_field (str, optional) – The name of the secondary location ID field in the dataframe.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
  - update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.
  - parallel (bool, optional) – Whether to process timeseries files in parallel. Default is False.
  - max_workers (int, optional) – The maximum number of worker processes to use if parallel is True. Default is 1. If set to -1, uses the number of CPUs available.
  - **kwargs – Additional keyword arguments passed to the extraction function.
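The interaction between pattern and extraction_function can be sketched as follows. This is a hypothetical example: TEEHR supplies its own default extractors per table, and the real extraction-function signature may differ from the one assumed here (path in, rows out):

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical extraction function, for illustration only. TEEHR defines
# default extractors per table; the real signature may differ.
def extract_csv(in_path, **kwargs):
    """Read one input file and yield rows keyed by the target table's fields."""
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            row["value"] = float(row["value"])  # coerce numeric fields
            yield row

# Exercise it on a throwaway directory, using a glob like the `pattern`
# argument would to select input files.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "obs.csv"
    p.write_text("location_id,value\ngage-01,1.5\n")
    files = sorted(Path(d).glob("*.csv"))  # what pattern="*.csv" would match
    rows = [r for f in files for r in extract_csv(f)]
```

The `**kwargs` accepted by file() are forwarded to the extraction function, which is why a custom extractor should accept them even if it ignores them.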
- from_cache(in_path: Path | str, table_name: str, namespace_name: str | None = None, catalog_name: str | None = None, write_mode: str = 'append', drop_duplicates: bool = True, update_attrs_table: bool = True)[source]#
Load data from the cache.
- Parameters:
  - in_path (Path | str) – The input cache directory path.
  - table_name (str) – The name of the table to load the data into.
  - namespace_name (str, optional) – The namespace name to load the data into. The default is None, which uses the active namespace of the Evaluation.
  - catalog_name (str, optional) – The catalog name to load the data into. The default is None, which uses the active catalog of the Evaluation.
  - write_mode (str, optional (default: "append")) – The write mode for the table. Options are “append”, “upsert”, and “create_or_replace”. If “append”, the table will be appended without checking existing data. If “upsert”, existing data will be replaced and new data that does not exist will be appended. If “create_or_replace”, a new table will be created or an existing table will be replaced.
  - drop_duplicates (bool, optional (default: True)) – Whether to drop duplicates from the DataFrame during validation.
  - update_attrs_table (bool, optional (default: True)) – Whether to update the location attributes table with any new attribute names found in the input data. Only applicable when table_name is “location_attributes”.
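The three write_mode options shared by all loading methods can be sketched over an in-memory table. This is illustrative only (the function and field names here are hypothetical, not TEEHR internals); the warehouse applies the same rules to real tables:

```python
# Sketch of write_mode semantics, modeling a table as a list of dict rows.
# Hypothetical helper, not part of TEEHR's API.
def write(table, new_rows, key, write_mode="append"):
    """table, new_rows: lists of dict rows; key: primary-key field name."""
    if write_mode == "create_or_replace":
        return list(new_rows)             # replace the whole table
    if write_mode == "append":
        return table + list(new_rows)     # no existence check; duplicates possible
    if write_mode == "upsert":
        incoming = {r[key]: r for r in new_rows}
        kept = [r for r in table if r[key] not in incoming]  # drop rows being replaced
        return kept + list(new_rows)      # replaced rows plus genuinely new ones
    raise ValueError(f"unknown write_mode: {write_mode}")

existing = [{"location_id": "a", "value": 1.0}]
new = [{"location_id": "a", "value": 2.0}, {"location_id": "b", "value": 3.0}]
appended = write(existing, new, "location_id", "append")             # 3 rows, "a" twice
upserted = write(existing, new, "location_id", "upsert")             # 2 rows, "a" now 2.0
replaced = write(existing, new, "location_id", "create_or_replace")  # only the new rows
```

"append" is the cheapest mode but assumes the incoming data does not overlap existing rows; "upsert" is the safe default when re-loading overlapping data.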