{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "65f79301-6b74-4df9-91a3-d8ab060bbe9e",
   "metadata": {},
   "source": [
    "# TEEHR Evaluation Example 3  \n",
    "## Hourly NWM Retrospective 3.0, CAMELS Subset (648)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26909a06-6c1f-4abf-bf34-861c1dae5ead",
   "metadata": {},
   "source": [
    "### 1. Get the data from S3\n",
    "For the sake of time, we prepared the individual datasets in advance and are simply copying to your 2i2c home directory. After running the cell below to copy the example_2 data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26dfc5fb-d4cb-4744-8716-58925b20b330",
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf ~/teehr/example-3/*\n",
    "!aws s3 cp --recursive --no-sign-request s3://ciroh-rti-public-data/teehr-workshop-devcon-2024/workshop-data/example-3 ~/teehr/example-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb119a6c-197e-4c88-823b-a4d12b2aa0b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!tree ~/teehr/example-3/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91f899b9-5de5-45d2-8697-cd42825fc89b",
   "metadata": {},
   "source": [
    "### Evaluate Model Output\n",
    "This notebook we will demonstrate how to use TEEHR to calculate metrics from a previously created joined TEEHR database containing hourly NWM3.0 Retrospective simulations and USGS observations from 1981-2022, using a range of different options for grouping and filtering.  We will then create some common graphics based on the results (the same as Example 2)\n",
    "\n",
    "\n",
    "#### In this notebook we will perform the following steps:\n",
    "<ol>\n",
    "    <li> Review the contents of our joined parquet file </li>\n",
    "    <li> Calculate metrics with different group_by options </li>\n",
    "    <li> Calculate metrics with different filters options </li>\n",
    "    <li> Example visualizations of TEEHR results</li> \n",
    "</ol>\n",
    "\n",
    "#### First setup the TEEHR class and review the contents of the joined parquet file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3b6e4c1-f86e-4fa9-ab32-58b479240669",
   "metadata": {},
   "outputs": [],
   "source": [
    "from teehr.classes.duckdb_joined_parquet import DuckDBJoinedParquet\n",
    "from pathlib import Path\n",
    "\n",
    "# Define the paths to the joined parquet file and the geometry files\n",
    "TEEHR_BASE = Path(Path.home(), 'teehr/example-3')\n",
    "JOINED_PARQUET_FILEPATH = f\"{TEEHR_BASE}/joined/configuration=nwm30_retro/variable_name=streamflow_hourly_inst/*.parquet\"\n",
    "GEOMETRY_FILEPATH = f\"{TEEHR_BASE}/geometry/**/*.parquet\"\n",
    "\n",
    "# Initialize a teehr joined parquet class with our parquet file and geometry\n",
    "joined_data = DuckDBJoinedParquet(\n",
    "    joined_parquet_filepath = JOINED_PARQUET_FILEPATH,\n",
    "    geometry_filepath = GEOMETRY_FILEPATH\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fd5873b-f112-4d7e-beaa-2471d8440c3b",
   "metadata": {},
   "source": [
    "### 1. Review the contents of the joined parquet files\n",
    "\n",
    "In practice, you may want to review the fields of data in the parquet file to plan your evaluation strategy.  If the dataset is large, reading it into a dataframe may be cumbersome or even infeasible in some cases. TEEHR provides the ```get_joined_timeseries_schema``` method to quickly review the fields of the joined parquet file and the ```get_unique_field_values``` method to review the unique values contained in a specified field.  The latter is particularly helpful for building dashboards for evaluation (e.g., to populate a drop down menu of possible filter or group_by values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffe147d4-d016-4cfa-bf56-de99d77caed1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remind ourselves what fields were included\n",
    "joined_data.get_joined_timeseries_schema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24f86eed-d5f5-4cca-b6fe-94f6692e2b4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Review what configuration datasets were included\n",
    "joined_data.get_unique_field_values('configuration')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf82147a-fabf-49fc-99cb-a688c56894a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ...number of locations\n",
    "len(joined_data.get_unique_field_values('primary_location_id'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27b57c53-9d64-4a09-bbf0-ac82119f6f76",
   "metadata": {},
   "source": [
    "### 2. Calculate metrics "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "328663d1-6124-48e6-9ae5-6877df693152",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "gdf_all = joined_data.get_metrics(\n",
    "    group_by=[\"primary_location_id\", \"configuration\"],\n",
    "    order_by=[\"primary_location_id\", \"configuration\"],\n",
    "    include_metrics=[\n",
    "        'kling_gupta_efficiency_mod2',\n",
    "        'relative_bias',\n",
    "        'pearson_correlation',                  \n",
    "        'nash_sutcliffe_efficiency_normalized',  \n",
    "        'mean_absolute_relative_error',\n",
    "        'primary_count' \n",
    "    ],\n",
    "    include_geometry=True,\n",
    ")\n",
    "# view the dataframe\n",
    "gdf_all"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df66bc5d-f050-448c-81d0-cde0c07c6142",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use pandas magic to create a nice summary table of the metrics by model configuration across locations\n",
    "gdf_all.groupby('configuration').describe(percentiles=[.5]).unstack(1).reset_index().rename(\n",
    "    columns={'level_0':'metric','level_1':'summary'}).pivot(\n",
    "    index=['metric','configuration'], values=0, columns='summary')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c881e37e-4d38-46d3-a628-402a18d0e0af",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "'''\n",
    "Calculate metrics separately for low flows and high flows based on the \n",
    "calculated field \"obs_flow_category_q_mean\" -> add the field to the group_by list.  \n",
    "'''\n",
    "\n",
    "gdf_flowcat = joined_data.get_metrics(\n",
    "    group_by=[\"primary_location_id\", \"configuration\", \"obs_flow_category_q_mean\"],\n",
    "    order_by=[\"primary_location_id\", \"configuration\"],\n",
    "    include_metrics=[\n",
    "        'kling_gupta_efficiency_mod2',\n",
    "        'pearson_correlation',                  \n",
    "        'mean_absolute_relative_error',\n",
    "        'primary_count' \n",
    "    ],\n",
    ")\n",
    "display(gdf_flowcat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdf38d3e-36ba-4d76-87a1-5b76fc2f7ec4",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf_flowcat.groupby(['obs_flow_category_q_mean']).describe(percentiles=[.5]).unstack().reset_index().rename(\n",
    "    columns={'level_0':'metric','level_1':'summary'}).pivot(\n",
    "    index=['metric','obs_flow_category_q_mean'], values=0, columns='summary')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba1c82fb-333d-4dc1-add5-ea3c1f03431b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "'''\n",
    "Now add the location characteristics you want included in the metrics table\n",
    "(for output tables and visualization)\n",
    "\n",
    "To include location-specific attributes in the metrics table, those attributes \n",
    "must be added to the group_by list.  If grouping across locations (.e.g., all locations \n",
    "within an RFC region), you should only add attributes that are already aggregated by that \n",
    "same region (TEEHR does not check for this). An example of including location characteristic \n",
    "attributes is included below.\n",
    "\n",
    "'''\n",
    "# list the attributes that are location characteristics that you want to include \n",
    "# in the metrics results tables\n",
    "# in the metrics results tables\n",
    "include_location_characteristics = [\n",
    "    'aridity_none',\n",
    "    'runoff_ratio_none',\n",
    "    'baseflow_index_none',\n",
    "    'stream_order_none',  \n",
    "    'q_mean_cms',\n",
    "    'slope_fdc_none',  \n",
    "    'frac_urban_none',\n",
    "    'frac_snow_none',\n",
    "    'forest_frac_none',\n",
    "    'ecoregion_L2_none',\n",
    "    'river_forecast_center_none',\n",
    "]\n",
    "df_atts = joined_data.get_metrics(\n",
    "    group_by=[\"primary_location_id\", \"configuration\"] + include_location_characteristics,\n",
    "    order_by=[\"primary_location_id\", \"configuration\"],\n",
    "    include_metrics=[\n",
    "        'kling_gupta_efficiency_mod2',\n",
    "        'pearson_correlation',                  \n",
    "        'mean_absolute_relative_error',\n",
    "        'relative_bias',\n",
    "        'primary_count' \n",
    "    ],\n",
    "    include_geometry=False,\n",
    ")\n",
    "\n",
    "# view the dataframe\n",
    "display(df_atts)\n",
    "\n",
    "# summarize just the median results across locations by attribute (river forecast center)\n",
    "df_atts_summary = df_atts.groupby(['configuration','river_forecast_center_none'])\\\n",
    "    .describe(percentiles=[.5]).unstack().unstack().reset_index()\\\n",
    "    .rename(columns={'level_0':'metric','level_1':'summary'})\n",
    "df_atts_summary[df_atts_summary['summary'].isin(['50%'])].pivot(\n",
    "    index=['river_forecast_center_none','configuration'],values=0, columns=['metric','summary'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0000fcc9-d3aa-4ec7-b14b-c6414475f2b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "import geoviews as gv\n",
    "import holoviews as hv\n",
    "import colorcet as cc\n",
    "hv.extension('bokeh', logo=False)\n",
    "gv.extension('bokeh', logo=False)\n",
    "basemap = hv.element.tiles.CartoLight()\n",
    "\n",
    "gdf_filters = joined_data.get_metrics(\n",
    "    group_by=[\"primary_location_id\", \"configuration\", \"stream_order_none\"],\n",
    "    order_by=[\"primary_location_id\", \"configuration\"],\n",
    "    include_metrics=[\n",
    "        'kling_gupta_efficiency_mod2',\n",
    "        'relative_bias',\n",
    "        'pearson_correlation',                  \n",
    "        'nash_sutcliffe_efficiency_normalized',  \n",
    "        'mean_absolute_relative_error',\n",
    "        'primary_count' \n",
    "    ],\n",
    "    filters = [\n",
    "          {\n",
    "              \"column\": \"stream_order_none\",\n",
    "              \"operator\": \"in\",\n",
    "              \"value\": ['1','2','3','4']\n",
    "              #\"value\": ['5','6','7','8']\n",
    "          },\n",
    "         # {\n",
    "         #     \"column\": \"month\",\n",
    "         #     \"operator\": \"in\",\n",
    "         #     \"value\": ['5','6','7','8','9']\n",
    "         # },\n",
    "         # {\n",
    "         #     \"column\": \"river_forecast_center_none\",\n",
    "         #     \"operator\": \"=\",\n",
    "         #     \"value\": \"SERFC\"\n",
    "         # },\n",
    "    ],\n",
    "    include_geometry=True,\n",
    ")\n",
    "#display(gdf_filters.head())\n",
    "\n",
    "# make a quick map of locations - see how it changes as you make different filter selections\n",
    "basemap * gv.Points(gdf_filters, vdims=['kling_gupta_efficiency_mod2','configuration']).select(\n",
    "    configuration='nwm30_retro').opts(\n",
    "    color='kling_gupta_efficiency_mod2', \n",
    "    height=400, width=600, size=7, \n",
    "    cmap=cc.rainbow[::-1], colorbar=True, clim=(0,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09399b9a-dba7-402d-982e-e06869048d45",
   "metadata": {},
   "source": [
    "### 4. More visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50dc35bd-8af3-4e95-9851-0278b4610b39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up color and abbrevation settings to use across multiple plots\n",
    "\n",
    "metric_abbrev=dict(\n",
    "    kling_gupta_efficiency_mod2 = \"KGE''\",\n",
    "    mean_absolute_relative_error = \"MAE\",\n",
    "    pearson_correlation = \"Corr\",\n",
    "    relative_bias  = \"Rel.Bias\",\n",
    "    nash_sutcliffe_efficiency_normalized = \"NNSE\",\n",
    ")\n",
    "cmap_lin = cc.rainbow[::-1]\n",
    "cmap_div = cc.CET_D1A[::-1]\n",
    "metric_colors=dict(\n",
    "    kling_gupta_efficiency_mod2          = {'cmap': cmap_lin, 'clim': (0,1)},  \n",
    "    relative_bias                        = {'cmap': cmap_div, 'clim': (-1,1)},   \n",
    "    pearson_correlation                  = {'cmap': cmap_lin, 'clim': (0,1)},     \n",
    "    nash_sutcliffe_efficiency_normalized = {'cmap': cmap_lin, 'clim': (0,1)}, \n",
    "    mean_absolute_relative_error         = {'cmap': cmap_lin, 'clim': (0,2)},\n",
    ")\n",
    "metrics = list(metric_colors.keys())\n",
    "configs = ['nwm30_retro']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9081b4ca-2971-47d6-aa0e-8bf7cdb34c94",
   "metadata": {},
   "source": [
    "#### 4a. Side by side metric maps\n",
    "First we will create side-by-side maps of the first query results above (all locations and configurations, no filters), showing metric values at each location, where dots are colored by metric value and sized by sample size.  See how the comparison changes for each metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a54023ff-afeb-455f-8107-cd5753b2b560",
   "metadata": {},
   "outputs": [],
   "source": [
    "# map_metric = 'kling_gupta_efficiency_mod2'\n",
    "# map_metric = 'pearson_correlation'                  \n",
    "# map_metric = 'nash_sutcliffe_efficiency_normalized'\n",
    "# map_metric = 'mean_absolute_relative_error' \n",
    "map_metric = 'relative_bias'\n",
    "\n",
    "# factor to size dots based on sample size \n",
    "size_factor = 15/max(gdf_filters[('primary_count')])\n",
    "\n",
    "polys = gv.Points(\n",
    "    gdf_all, \n",
    "    vdims = metrics + ['primary_location_id','configuration','primary_count'],\n",
    "    label = 'metric value (color), sample size (size)',\n",
    ").opts(\n",
    "    height = 400,\n",
    "    width = 600,\n",
    "    line_color = 'gray',\n",
    "    colorbar = True,\n",
    "    size = hv.dim('primary_count') * 15/max(gdf_filters[('primary_count')]),\n",
    "    tools = ['hover'],\n",
    "    xaxis = 'bare',\n",
    "    yaxis = 'bare',\n",
    "    show_legend = True\n",
    ")\n",
    "maps = []\n",
    "config = configs[0]\n",
    "for map_metric in ['kling_gupta_efficiency_mod2','relative_bias']:\n",
    "    maps.append(basemap * polys.select(configuration=config).opts(\n",
    "            title=f\"{config} | {metric_abbrev[map_metric]}\",\n",
    "            color = map_metric,\n",
    "            clim = metric_colors[map_metric]['clim'],\n",
    "            cmap = metric_colors[map_metric]['cmap']\n",
    "        )\n",
    "    )\n",
    "maps[0] + maps[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f491530-434d-49a0-9e0e-d69316bc1bd7",
   "metadata": {},
   "source": [
    "#### 4b. Dataframe table and bar chart side by side\n",
    "Next we will summarize results across locations by creating a summary table with pandas (as we did above) and juxtapose it with a bar chart using holoviews and panel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b7cc231-e1e6-4199-95ff-d94f5e69c2f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display dataframes and simple plots side by side using Panel\n",
    "import panel as pn\n",
    "\n",
    "gdf_summary = gdf_all.groupby('configuration').describe(percentiles=[.5]).unstack(1).reset_index().rename(\n",
    "    columns={'level_0':'metric','level_1':'summary'}).pivot(\n",
    "    index=['metric','configuration'], values=0, columns='summary')\n",
    "\n",
    "gdf_bars = gdf_summary.drop('primary_count', axis=0)['50%'].reset_index().replace({'metric':metric_abbrev})\n",
    "bars = hv.Bars(gdf_bars, kdims=['metric', 'configuration']).opts(\n",
    "    xrotation=90, height=400, width=300, ylabel='median',xlabel='')\n",
    "\n",
    "pn.Row(pn.pane.DataFrame(gdf_summary, width=800), bars)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3e7c84b-f8b7-4c27-8d8d-b02b267e528f",
   "metadata": {},
   "source": [
    "#### 4c. Box-whisker plots of results by metric and model\n",
    "\n",
    "Next we'll create box-whisker plots to see the distribution of metrics across locations for each metric and configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54fd5813-ee5b-4517-8b76-02cfdebe6d5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove geometry so holoviews knows this is not a map.\n",
    "df = gdf_all.drop('geometry', axis=1)\n",
    "\n",
    "opts = dict(\n",
    "    show_legend=False, \n",
    "    width=100, \n",
    "    cmap='Set1', \n",
    "    xrotation=45,\n",
    "    labelled=[]\n",
    ")\n",
    "boxplots = []\n",
    "for metric in metrics:\n",
    "    boxplots.append(\n",
    "        hv.BoxWhisker(df, 'configuration', metric, label=metric_abbrev[metric]).opts(\n",
    "            **opts,\n",
    "            box_fill_color=hv.dim('configuration')\n",
    "        )\n",
    "    )\n",
    "hv.Layout(boxplots).cols(len(metrics))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c77b52b-e27f-4264-9ecf-68e288b3c05e",
   "metadata": {},
   "source": [
    "#### 4d. Histograms by metric and model\n",
    "Every good scientist loves a histogram.  The below example creates a layout of histograms by configuration and metric, which gives us a more complete understanding of the metric distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed63c73f-c3d9-4877-983b-e309606f2e84",
   "metadata": {},
   "outputs": [],
   "source": [
    "import hvplot.pandas\n",
    "histograms =[]\n",
    "for metric in metrics:\n",
    "    histograms.append(\n",
    "        df[df['configuration']==config].hvplot.hist(\n",
    "            y=metric, \n",
    "            ylim=(0,200),\n",
    "            bin_range=metric_colors[metric]['clim'], \n",
    "            xlabel=metric_abbrev[metric],\n",
    "        ).opts(height = 200, width=250, title = config)\n",
    "    )\n",
    "hv.Layout(histograms).cols(len(metrics))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d886a6e-1aa4-4c33-9715-b8af91e92bf3",
   "metadata": {},
   "source": [
    "#### 4e. CDFs overlays by metric\n",
    "Every good scientist loves a CDF even more.  The below example creates a layout of histograms by configuration and metric, which gives us a more complete understanding of the metric distributions.  We include metrics here with (mostly) the same range (0,1) and 'good' value (1).  \n",
    "\n",
    "We kept this graphic for consistiency, but it is not that interesting with 1 model scenario."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "451b7be9-45da-4e4e-a20a-ea3ea39ef77a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "layout = []\n",
    "for metric in [\n",
    "    'kling_gupta_efficiency_mod2',\n",
    "    'pearson_correlation',                  \n",
    "    'nash_sutcliffe_efficiency_normalized',\n",
    "]:\n",
    "    xlim = metric_colors[metric]['clim']\n",
    "    xlabel = metric_abbrev[metric]\n",
    "    \n",
    "    cdfs = hv.Curve([])\n",
    "    for config in ['nwm30_retro']:\n",
    "        data = df[df['configuration']==config]\n",
    "        data[xlabel] = np.sort(data[metric])\n",
    "        n = len(data[xlabel])\n",
    "        data['y'] = 1. * np.arange(n) / (n - 1)    \n",
    "        cdfs = cdfs * hv.Curve(data, xlabel, 'y', label=config)\n",
    "        \n",
    "    layout.append(\n",
    "        cdfs.opts(\n",
    "            width = 300,\n",
    "            legend_position='top_left',\n",
    "            xlim=xlim, \n",
    "            xlabel=xlabel,\n",
    "            title=metric_abbrev[metric],\n",
    "            shared_axes=False,\n",
    "        )\n",
    "    )\n",
    "    \n",
    "hv.Layout(layout).cols(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2ca00ac-29ad-4be8-ba16-27e3c45efed9",
   "metadata": {},
   "source": [
    "#### 4f. Bar charts by attribute\n",
    "In one of the queries above, we demonstrate how to add attributes to the resulting dataframe for summary and visualization purposes.  In that example we generated a summary table to RFC region.  The below example uses those result to build bar charts of the median performance metric across locations within each RFC region."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3fdcd05-78f5-44a5-bbd3-53440956893c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_bars = df_atts_summary.set_index('metric').drop('primary_count', axis=0).reset_index().set_index('summary').loc['50%']\n",
    "df_bars = df_bars.replace({'metric': metric_abbrev}) \\\n",
    "    .rename(columns={'river_forecast_center_none':'rfc',0:'median'}) \\\n",
    "    .reset_index().drop('summary', axis=1)\n",
    "df_bars.loc[df_bars['metric'] == 'MAE', 'median'] = 1 - df_bars.loc[df_bars['metric'] == 'MAE', 'median']\n",
    "df_bars = df_bars.replace('MAE','1-MAE')\n",
    "\n",
    "bars = hv.Bars(df_bars, kdims=['metric','configuration','rfc'], vdims=['median']).opts(\n",
    "    xrotation=90, height=300, width=200, ylabel='median',xlabel='')\n",
    "\n",
    "layout = []\n",
    "for rfc in df_bars['rfc'].unique():\n",
    "    layout.append(bars.select(rfc=rfc).opts(title=rfc))\n",
    "hv.Layout(layout).cols(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0c3abfc-5357-4083-a1bb-a672f36d18ca",
   "metadata": {},
   "source": [
    "#### 4g Scatter plots by attribute\n",
    "\n",
    "Scatter plots of location metric values and location characteristics can provide insight about the relationship between the two - i.e., does model performance have a clear relationship with any of the characteristics?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0364a005-d26e-4c91-b2c0-85da79181e74",
   "metadata": {},
   "outputs": [],
   "source": [
    "# As examples, let's create scatter plots of KGE with each of the numeric attributes\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import geoviews as gv\n",
    "import holoviews as hv\n",
    "import colorcet as cc\n",
    "hv.extension('bokeh', logo=False)\n",
    "gv.extension('bokeh', logo=False)\n",
    "basemap = hv.element.tiles.CartoLight()\n",
    "\n",
    "location_chars = [\n",
    "    'aridity',\n",
    "    'runoff_ratio',\n",
    "    'baseflow_index',\n",
    "    'stream_order',  \n",
    "    'q_mean_cms',\n",
    "    'slope_fdc',  \n",
    "    'frac_urban',\n",
    "    'frac_snow',\n",
    "    'forest_frac'\n",
    "]\n",
    "df_atts.columns = df_atts.columns.str.replace('_none', '')\n",
    "df_atts[location_chars] = df_atts[location_chars].apply(pd.to_numeric)\n",
    "df_atts['config_num'] = np.where(df_atts['configuration']=='nwm30_retro',1,2)\n",
    "\n",
    "metrics = [\n",
    "    'kling_gupta_efficiency_mod2',\n",
    "    'pearson_correlation',                  \n",
    "    'mean_absolute_relative_error',\n",
    "    'relative_bias',\n",
    "]\n",
    "from bokeh.models import FixedTicker\n",
    "\n",
    "scatter_layout = []\n",
    "for char in location_chars:\n",
    "    scatter_layout.append(\n",
    "        hv.Scatter(\n",
    "            df_atts, \n",
    "            kdims=[char],\n",
    "            vdims=['kling_gupta_efficiency_mod2', 'relative_bias', 'primary_location_id','config_num'],\n",
    "            label=\"nwm3.0\"\n",
    "        ).opts(\n",
    "            width = 400, height = 300,\n",
    "            #color = 'relative_bias',\n",
    "            #color = 'config_num',\n",
    "            #cmap = ['#377EB8', '#E41A1C'],\n",
    "            #colorbar = True,\n",
    "            clim=(0.5,2.5),\n",
    "            ylabel = \"KGE''\",\n",
    "            tools=['hover'],\n",
    "            ylim=(-1,1),\n",
    "            size=4,\n",
    "            alpha=0.8,\n",
    "            show_legend = True,\n",
    "            # colorbar_opts={\n",
    "            #     'ticker': FixedTicker(ticks=[1,2]),\n",
    "            #     'major_label_overrides': {\n",
    "            #         1: 'nwm30_retro',  \n",
    "            #     },\n",
    "            #     'major_label_text_align': 'left',\n",
    "            # },\n",
    "        ))\n",
    "hv.Layout(scatter_layout).opts(show_legends = True).cols(3)        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}