teehr.Fetch.usgs_streamflow#

Fetch.usgs_streamflow(start_date: str | datetime | Timestamp, end_date: str | datetime | Timestamp, service: USGSServiceEnum = 'iv', chunk_by: USGSChunkByEnum | None = None, filter_to_hourly: bool = True, filter_no_data: bool = True, convert_to_si: bool = True, overwrite_output: bool | None = False, timeseries_type: TimeseriesTypeEnum = 'primary')[source]#

Fetch USGS gage data and load into the TEEHR dataset.

Data is fetched for all location IDs in the locations table. All dates and times within the fetched files, and in the cached file names, are in UTC.

Parameters:
  • start_date (Union[str, datetime, pd.Timestamp]) – Start time of data to fetch.

  • end_date (Union[str, datetime, pd.Timestamp]) – End time of data to fetch. Note, since start_date is inclusive for the USGS service, we subtract 1 minute from this time so we don’t get overlap between consecutive calls.

  • service (USGSServiceEnum, default = "iv") – The USGS service to use for fetching data (‘iv’ for hourly instantaneous values or ‘dv’ for daily mean values).

  • chunk_by (Union[str, None], default = None) – A string specifying how much data is fetched and read into memory at once. The default is to fetch all locations and all dates at once. Valid options = [“location_id”, “day”, “week”, “month”, “year”, None].

  • filter_to_hourly (bool = True) – Return only values that fall on the hour (i.e. drop 15-minute data).

  • filter_no_data (bool = True) – Filter out -999 values.

  • convert_to_si (bool = True) – Multiplies values by 0.3048**3 (converting ft3/s to m3/s) and sets measurement_units to “m3/s”.

  • overwrite_output (bool) – Flag specifying whether or not to overwrite output files if they already exist. True = overwrite; False = fail.

  • timeseries_type (str) – Whether to load the data as the “primary” or “secondary” timeseries. Default is “primary”.
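The filter_to_hourly, filter_no_data, and convert_to_si flags amount to simple post-processing of the raw gage values. A minimal stdlib sketch of what they do, using toy data (this is an illustration, not TEEHR's actual implementation):

```python
from datetime import datetime

# Hypothetical raw 15-minute readings in ft3/s, including a -999 no-data flag.
raw = [
    (datetime(2021, 1, 1, 0, 0), 100.0),
    (datetime(2021, 1, 1, 0, 15), 101.0),
    (datetime(2021, 1, 1, 0, 30), -999.0),
    (datetime(2021, 1, 1, 1, 0), 102.0),
]

FT3_TO_M3 = 0.3048 ** 3  # 1 ft = 0.3048 m, so 1 ft3/s ≈ 0.0283168 m3/s

# filter_to_hourly: keep only on-the-hour values;
# filter_no_data: drop -999 sentinels;
# convert_to_si: scale ft3/s to m3/s.
cleaned = [
    (t, v * FT3_TO_M3)
    for t, v in raw
    if t.minute == 0 and v != -999.0
]

for t, v in cleaned:
    print(t.isoformat(), round(v, 4))
```

Only the two on-the-hour, valid readings survive, rescaled to m3/s (100 ft3/s ≈ 2.8317 m3/s).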

Examples

Here we fetch over a year of USGS hourly streamflow data. The data is first saved to the cache directory, then validated and loaded into the TEEHR dataset.

>>> from datetime import datetime
>>> import teehr
>>> ev = teehr.Evaluation()

Fetch the data for locations in the locations table.

>>> ev.fetch.usgs_streamflow(
...     start_date=datetime(2021, 1, 1),
...     end_date=datetime(2022, 2, 28)
... )

Note

USGS data can also be fetched outside of a TEEHR Evaluation by calling the method directly.

>>> from datetime import datetime
>>> from pathlib import Path
>>> from teehr.fetching.usgs.usgs import usgs_to_parquet

This requires specifying a list of USGS gage IDs and an output directory in addition to the above arguments.

>>> usgs_to_parquet(
...     sites=["02449838", "02450825"],
...     start_date=datetime(2023, 2, 20),
...     end_date=datetime(2023, 2, 25),
...     output_parquet_dir=Path(Path().home(), "temp", "usgs"),
...     chunk_by="day",
...     overwrite_output=True
... )