Extract#

class Extract(ev)[source]#

Class for extracting data from raw files.

Methods

to_cache

Convert raw data to parquet format enforcing the provided table schema.

to_cache(in_datapath: str | Path, cache_dir: str | Path, table_fields: list[str], table_schema: DataFrameSchema | DataFrameSchema, write_schema: Schema, extraction_func: Callable[[str | Path], dict[str]], field_mapping: dict | None = None, constant_field_values: dict | None = None, pattern: str = '**/*.parquet', parallel: bool = False, max_workers: int = 1, **kwargs)[source]#

Convert raw data to parquet format enforcing the provided table schema.

It seems to be doing a lot of things related to reading raw data, validating it, and writing it to cache.

Parameters:
  • in_datapath (str | Path) – The input file or directory path.

  • cache_dir (str | Path) – The directory to write the cached parquet files to.

  • table_fields (list[str]) – The list of fields in the target table.

  • table_schema (SparkDataFrameSchema | PandasDataFrameSchema) – The schema for validating the data.

  • write_schema (ArrowSchema) – A pyarrow schema for writing the parquet file to the cache.

  • extraction_func (Callable) – A function that extracts a DataFrame from a raw data file. The function must have the signature: func(in_filepath: str | Path, field_mapping: dict, **kwargs) -> DataFrame

  • field_mapping (dict, optional) – A dictionary mapping input fields to output fields. format: {input_field: output_field}

  • constant_field_values (dict, optional) – A dictionary of constant field values to add to the DataFrame. format: {field: value}

  • pattern (str, optional) – The glob pattern to use when searching for files in a directory. Default is ‘**/*.parquet’ to search for all parquet files recursively.

  • parallel (bool, optional) – Whether to process files in parallel. Default is False. Note: Parallel processing is not yet implemented.

  • max_workers (int, optional) – The maximum number of worker processes to use if parallel is True. Default is 1. If set to -1, uses the number of CPUs available.

  • **kwargs – Additional keyword arguments are passed to the extraction function related to reading the raw data files.