GeneratedTimeseries#

class GeneratedTimeseries(ev)[source]#

Component class for generating synthetic data.

Methods

benchmark_forecast

Generate a benchmark forecast from two timeseries.

signature_timeseries

Generate synthetic summary from a single timeseries.

to_pandas

Return Pandas DataFrame.

to_sdf

Return PySpark DataFrame.

write

Write the generated DataFrame to a specified table.

benchmark_forecast(method: BenchmarkGeneratorBaseModel, reference_table_name: str, template_table_name: str, output_configuration_name: str, reference_table_filters: str | dict | TableFilter | List[str | dict | TableFilter] | None = None, template_table_filters: str | dict | TableFilter | List[str | dict | TableFilter] | None = None)[source]#

Generate a benchmark forecast from two timeseries.

Parameters:
  • method (BenchmarkGeneratorBaseModel) – The method to use for generating the benchmark forecast.

  • reference_table_name (str) – The name of the reference table to query the timeseries from.

  • template_table_name (str) – The name of the template table to query the timeseries from.

  • output_configuration_name (str) – The configuration name for the generated benchmark forecast.

  • reference_table_filters (Union[str, dict, TableFilter, List[Union[str, dict, TableFilter]]], optional) – The reference table filter(s) defining the timeseries containing the values to assign to the template timeseries.

  • template_table_filters (Union[str, dict, TableFilter, List[Union[str, dict, TableFilter]]], optional) – The template table filter(s) defining the timeseries containing the forecast structure (lead time, time interval, issuance frequency, etc.) to use for the benchmark.

Returns:

GeneratedTimeseries – The generated timeseries class object.

Example

Generate a Climatology benchmark forecast using a previously generated climatology timeseries as the reference and the secondary_timeseries as the template forecast.

>>> from teehr import BenchmarkForecastGenerators as bmf

Define the benchmark forecast method.

>>> ref_fcst = bmf.ReferenceForecast()
>>> ref_fcst.aggregate_reference_timeseries = True

Specify the tables and optional filters that define the reference and template timeseries.

>>> reference_table_name = "primary_timeseries"
>>> reference_filters = [
...     "variable_name = 'streamflow_hour_of_year_mean'",
...     "unit_name = 'ft^3/s'"
... ]
>>> template_table_name = "secondary_timeseries"
>>> template_filters = [
...     "variable_name = 'streamflow_hourly_inst'",
...     "unit_name = 'ft^3/s'",
...     "member = '1993'"
... ]

Generate the benchmark forecast timeseries and write to secondary_timeseries, with the configuration name ‘benchmark_forecast_hourly_normals’.

>>> ev.generate.benchmark_forecast(
...     method=ref_fcst,
...     reference_table_name=reference_table_name,
...     reference_table_filters=reference_filters,
...     template_table_name=template_table_name,
...     template_table_filters=template_filters,
...     output_configuration_name="benchmark_forecast_hourly_normals"
... ).write(destination_table="secondary_timeseries")
signature_timeseries(method: SignatureGeneratorBaseModel, input_table_name: str, start_datetime: str | datetime, end_datetime: str | datetime, input_table_filters: str | dict | TableFilter | List[str | dict | TableFilter] | None = None, timestep: str | timedelta = '1 hour', update_variable_table: bool = True, fillna: bool = False, dropna: bool = True)[source]#

Generate synthetic summary from a single timeseries.

Parameters:
  • method (SignatureGeneratorBaseModel) – The method to use for generating the signature timeseries.

  • input_table_name (str) – The name of the input table to query the timeseries from.

  • start_datetime (Union[str, datetime]) – The start datetime for the generated timeseries. If provided as a str, the format must be supported by PySpark’s to_timestamp function, such as “yyyy-MM-dd HH:mm:ss”.

  • end_datetime (Union[str, datetime]) – The end datetime for the generated timeseries. If provided as a str, the format must be supported by PySpark’s to_timestamp function, such as “yyyy-MM-dd HH:mm:ss”.

  • input_table_filters (Union[str, dict, TableFilter, List[Union[str, dict, TableFilter]]], optional) – The input table filter(s) defining the timeseries to use as the input dataframe.

  • timestep (Union[str, timedelta], optional) – The timestep for the generated timeseries. Defaults to “1 hour”.

  • update_variable_table (bool, optional) – Whether to update the variable table. Defaults to True.

  • fillna (bool, optional) – Whether to forward and back-fill NaN values. Defaults to False.

  • dropna (bool, optional) – Whether to drop rows with NaN values. Defaults to True.

Returns:

GeneratedTimeseries – The generated timeseries class object.

Notes

This method operates on a single timeseries (e.g., Climatology, Normals, Detrending).

The output variable name is derived automatically from the input variable name and is added to the Evaluation if it does not already exist.

The variable naming convention follows the pattern: <variable>_<temporal_resolution>_<summary_statistic>
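As a sketch, the naming convention above amounts to joining the three parts with underscores. The helper below is illustrative only, not part of the TEEHR API:

```python
# Hypothetical helper illustrating the naming convention
# <variable>_<temporal_resolution>_<summary_statistic>;
# not part of the TEEHR API.
def derive_variable_name(
    variable: str, temporal_resolution: str, summary_statistic: str
) -> str:
    """Build the derived output variable name."""
    return f"{variable}_{temporal_resolution}_{summary_statistic}"

# A daily-mean climatology of streamflow, for example:
print(derive_variable_name("streamflow", "day_of_year", "mean"))
# streamflow_day_of_year_mean
```

This matches the filter used in the benchmark_forecast example above, where an hourly-mean climatology is named 'streamflow_hour_of_year_mean'.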

Example

Generate a daily climatology timeseries from the primary_timeseries.

>>> from teehr import SignatureTimeseriesGenerators as sts

Define the signature timeseries method.

>>> ts_normals = sts.Normals()
>>> ts_normals.temporal_resolution = "day_of_year"
>>> ts_normals.summary_statistic = "mean"

Generate the signature timeseries, operating on the primary_timeseries.

>>> ev.generate.signature_timeseries(
...     method=ts_normals,
...     input_table_name="primary_timeseries",
...     start_datetime="1924-11-19 12:00:00",
...     end_datetime="2024-11-21 13:00:00",
... ).write()
to_pandas()#

Return Pandas DataFrame.

to_sdf()#

Return PySpark DataFrame.

The PySpark DataFrame can be further processed using PySpark. Note that PySpark DataFrames are evaluated lazily: transformations are not executed until an action such as show(), collect(), or toPandas() is called. This can be useful for further processing or analysis.

write(destination_table: str = 'primary_timeseries', write_mode: str = 'append', drop_duplicates: bool = True)#

Write the generated DataFrame to a specified table.

Parameters:
  • destination_table (str, optional) – The name of the destination table to write to. Defaults to "primary_timeseries".

  • write_mode (str, optional) – The write mode for the DataFrame (e.g., "append", "overwrite"). Defaults to "append".

  • drop_duplicates (bool, optional) – Whether to drop duplicate rows during validation before writing to the destination table. Defaults to True.