{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to the Evaluation Class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "In the previous lesson we loaded some test data from the user's local drive (although we first downloaded it from the repository). In this example we will continue to explore the Evaluation schema through the Evaluation class interface. \n",
    "\n",
    "Note: this lesson builds off of the dataset that we created in the last lesson `Loading Local Data`.  If you have not run through the Loading Local Data lesson, then go back and first work though that notebook to generate the required dataset for this lesson."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a new Evaluation\n",
    "First we will import the the TEEHR Evaluation class and create a new instance that points to a directory where the data loaded in lesson `02_loading_data` is stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import teehr\n",
    "from teehr.evaluation.utils import print_tree\n",
    "from pathlib import Path\n",
    "\n",
    "# Define the directory where the Evaluation will be created\n",
    "test_eval_dir = Path(Path().home(), \"temp\", \"02_loading_data\")\n",
    "\n",
    "# Create an Evaluation object and create the directory\n",
    "ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have created a new evaluation that points to the dataset we created in `02_loading _data`, lets take a look at the data, specifically the `dataset` directory.  You can see that the three different data groups are stored in slightly different ways.  \n",
    "- The domain tables (units, variables, configurations, attributes) are stored as *.csv files.  While in this case the files happen to have the same name as the table, there is no requirement that they do.\n",
    "- The location tables (locations, location_attributes, location_crosswalks) are stored as parquet files without hive partitioning.  The file names are managed by Spark.\n",
    "- The timeseries tables (primary_timeseries, secondary_timeseries, joined_timeseries) are stored as parquet files with hive partitioning. The file names are managed by Spark.\n",
    "\n",
    "Note, if you don't have tree installed and don't want to install it, you can uncomment the comment lines to use a Python function roughly does the same thing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print_tree(ev.dataset_dir, exclude_patterns=[\".*\", \"_*\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Table Classes\n",
    "The TEEHR Evaluation class contains different sub-classes that are used to organize class methods into logical groups.  One of these types of sub-classes is the \"table\" sub-classes which contain methods for interacting with the data tables. Each of the tables in the Evaluation dataset has a respective sub-class with the table name.\n",
    "```\n",
    "ev.units\n",
    "ev.attributes\n",
    "ev.variables\n",
    "ev.configurations\n",
    "ev.locations\n",
    "ev.location_attributes\n",
    "ev.location_crosswalks\n",
    "ev.primary_timeseries\n",
    "ev.secondary_timeseries\n",
    "ev.joined_timeseries\n",
    "```\n",
    "Each of the table sub-classes then has methods to add and/or load new data as well as methods to query the table to get data out.  These are documented in the API documentation. For now, because all the tables are relatively small, we will just use the `to_pandas()` method and then the `head()` method on the Pandas DataFrame to see an example of the data that is returned.  In an actual evaluation setting, with lots of data in the TEEHR dataset, you would likely want to include a `filter()` method to reduce the amount of data you are querying and putting into memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.units.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.attributes.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.variables.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.configurations.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.locations.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.location_attributes.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.primary_timeseries.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.location_crosswalks.to_pandas().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.secondary_timeseries.to_pandas().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Querying\n",
    "Above, we just used the `to_pandas()` method on each table in the dataset to see an example of the data that is in each table. The underlying query engine for TEEHR is PySpark.  As a result, each of the table sub-classes can return data as either a Spark DataFrame (using the `to_sdf()` method) or as a Pandas DataFrame (using the `to_pandas()` method).  The location data tables have an additional method that returns a GeoPandas DataFrame (using the `to_geopandas()` method) where the geometry bytes column has been converted to a proper WKT geometry column.\n",
    "\n",
    "Note: PySpark itself is \"lazy loaded\" meaning that it does not actually run the query until the data is needed for display, plotting, etc.  Therefore, if you just use the `to_sdf()` method, you do not get the data but rather a lazy Spark DataFrame that can be used with subsequent Spark operations that will all be evaluated when the results are requested.  Here we show how to get the Spark DataFrame and show the data but there are many other ways that the lazy Spark DataFrame can be used in subsequent operations that are beyond the scope of this document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query the locations and return as a lazy Spark DataFrame.\n",
    "ev.locations.to_sdf()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query the locations and return as a Spark DataFrame but tell Spark to show the data.\n",
    "ev.locations.to_sdf().show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query the locations and return as a Pandas DataFrame.\n",
    "# Note that the geometry column is shown as a byte string.\n",
    "ev.locations.to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query the locations and return as a GeoPandas DataFrame.\n",
    "# Note that the geometry column is now a proper WKT geometry column.\n",
    "ev.locations.to_geopandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One very quick example of how the Spark DataFrame's lazy loading can be beneficial, would be to get the number of rows in a query result.  If you did `len(ev.primary_timeseries.to_pandas())`, first the entire data frame would have to be loaded in memory as a Pandas DataFrame and then the length calculated.  On the otherhand, if you were to `ev.primary_timeseries.to_sdf().count()` the Spark engine would calculate the number of rows without loading the entire dataset into memory first.  For larger datsets this could be very important."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(\n",
    "    len(ev.primary_timeseries.to_pandas())\n",
    ")\n",
    "display(\n",
    "    ev.primary_timeseries.to_sdf().count()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering and Ordering\n",
    "As noted above, because the tables are a lazy loaded Spark DataFrames, we can filter and order the data before returning it as a Pandas or GeoPandas DataFrame. The filter methods take either a raw SQL string, a filter dictionary or a FilterObject, Operator and field enumeration. Using an FilterObject, Operator and field enumeration is probably not a common pattern for most users, but it is used internally to validate filter arguments and is available to users if they would like to use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter using a raw SQL string\n",
    "ev.locations.filter(\"id = 'usgs-A'\").to_geopandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter using a dictionary\n",
    "ev.locations.filter({\n",
    "    \"column\": \"id\",\n",
    "    \"operator\": \"=\",\n",
    "    \"value\": \"usgs-A\"\n",
    "}).to_geopandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the LocationFilter and Operators classes\n",
    "from teehr import LocationFilter, Operators\n",
    "\n",
    "# Get the field enumeration\n",
    "fields = ev.locations.field_enum()\n",
    "\n",
    "# Filter using the LocationFilter class\n",
    "lf = LocationFilter(\n",
    "    column=fields.id,\n",
    "    operator=Operators.eq,\n",
    "    value=\"usgs-A\"\n",
    ")\n",
    "ev.locations.filter(lf).to_geopandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ev.spark.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This same approach can be used to query the other tables in the evaluation dataset.  There are also other methods that we did not explore and users are encouraged to checkout the TEEHR API documentation as well as the PySpark documentation for a more in-depth understanding of what happens in the background."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}