Source code for RES.cluster


"""
Spatial clustering module for renewable energy resource assessment.

This module provides K-means clustering functionality for aggregating grid cells
with similar renewable energy characteristics into representative clusters. The
clustering is based on techno-economic metrics such as Levelized Cost of Electricity
(LCOE) and potential capacity, enabling spatial aggregation for energy system
modeling and optimization.

The module implements an automated workflow for determining optimal cluster numbers,
performing spatial clustering, and creating representative cluster geometries that
maintain spatial relationships while reducing computational complexity for large-scale
renewable energy assessments.

Key Features
------------
- Automated optimal cluster number determination using elbow method
- Spatial clustering based on LCOE and capacity metrics
- Grid cell identifier generation for data linking
- Cluster geometry creation through spatial union operations
- Regional boundary clipping for precise spatial extent
- Visualization of clustering analysis results

Functions
---------
assign_cluster_id(cells, source_column=None, index_name='cell')
    Generate unique identifiers for grid cells based on region and coordinates
    
find_optimal_K(resource_type, data_for_clustering, region, wcss_tolerance, max_k)
    Determine optimal number of clusters using elbow method and WCSS tolerance
    
pre_process_cluster_mapping(cells_scored, vis_directory, wcss_tolerance, sub_national_unit_tag, resource_type)
    Preprocess data and determine optimal cluster numbers for each region
    
cells_to_cluster_mapping(cells_scored, vis_directory, wcss_tolerance, sub_national_unit_tag, resource_type, sort_columns)
    Map grid cells to clusters based on similarity metrics and optimal cluster numbers
    
create_cells_Union_in_clusters(cluster_map_gdf, region_optimal_k_df, sub_national_unit_tag, resource_type)
    Create unified cluster geometries by dissolving individual cell boundaries
    
clip_cluster_boundaries_upto_regions(cell_cluster_gdf, gadm_regions_gdf, resource_type)
    Clip cluster boundaries to precise regional administrative boundaries

Clustering Methodology
----------------------
The clustering approach follows a multi-step process:

1. **Data Preparation**: Grid cells with calculated LCOE and capacity metrics
2. **Optimal K Determination**: Uses elbow method with Within-Cluster Sum of Squares (WCSS)
3. **Regional Clustering**: Processes each region separately, assigning LCOE-sorted cells to that region's optimal number of clusters
4. **Spatial Aggregation**: Creates unified cluster geometries through spatial union
5. **Boundary Refinement**: Clips results to precise administrative boundaries

The LCOE-based clustering ensures that cells with similar techno-economic 
characteristics are grouped together, creating representative clusters suitable
for energy system optimization while maintaining spatial coherence.

Algorithm Details
-----------------
- **K-means Clustering**: Uses scikit-learn implementation with multiple initializations
- **Elbow Method**: Automatically determines optimal cluster count based on WCSS tolerance
- **Missing Data Handling**: Imputes missing values using mean strategy
- **Spatial Preservation**: Maintains geographic relationships through geometry operations
- **Regional Processing**: Handles each administrative region independently
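
The WCSS tolerance rule can be illustrated without scikit-learn. The snippet
below is a minimal sketch, assuming a precomputed list of WCSS values (one per
candidate k, starting at k=1). The helper name `pick_k_by_tolerance` and the
sample numbers are hypothetical and not part of this module.

```python
def pick_k_by_tolerance(wcss_values, wcss_tolerance):
    # Return the first k (1-based) whose WCSS is at or below the threshold.
    # The threshold is wcss_tolerance times the sum of all WCSS values,
    # mirroring the rule used in find_optimal_K.
    threshold = wcss_tolerance * sum(wcss_values)
    for k, wcss in enumerate(wcss_values, start=1):
        if wcss <= threshold:
            return k
    return None  # no candidate k met the tolerance


# Hypothetical WCSS values for k = 1..5 (made-up numbers).
sample_wcss = [100.0, 40.0, 20.0, 10.0, 5.0]
chosen_k = pick_k_by_tolerance(sample_wcss, wcss_tolerance=0.15)
print(chosen_k)
```

With these numbers the threshold is 0.15 * 175 = 26.25, so the first WCSS at or
below it belongs to k = 3. A single candidate can never satisfy a tolerance
below 1.0 unless its WCSS is zero, which is why very small regions may end up
without an optimal k.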

Usage Examples
--------------
Basic clustering workflow:

>>> import pandas as pd
>>> import geopandas as gpd
>>> from RES.cluster import cells_to_cluster_mapping, create_cells_Union_in_clusters
>>>
>>> # Perform clustering analysis
>>> cluster_map_gdf, optimal_k_df = cells_to_cluster_mapping(
...     cells_scored=scored_cells,
...     vis_directory="vis/BC",
...     wcss_tolerance=0.15,
...     sub_national_unit_tag="Region",
...     resource_type="solar",
...     sort_columns=["lcoe_solar"]
... )
>>>
>>> # Create unified cluster geometries
>>> clusters_gdf, cluster_indices = create_cells_Union_in_clusters(
...     cluster_map_gdf=cluster_map_gdf,
...     region_optimal_k_df=optimal_k_df,
...     sub_national_unit_tag="Region",
...     resource_type="solar"
... )

Cell identification:

>>> from RES.cluster import assign_cluster_id
>>>
>>> # Generate unique cell identifiers
>>> cells_with_ids = assign_cluster_id(
...     cells=grid_cells,
...     source_column="Province",
...     index_name="cell_id"
... )

Input Data Requirements
-----------------------
The clustering functions expect GeoDataFrames with specific columns:

Required Columns:
- 'x', 'y': Grid cell centroid coordinates
- sub_national_unit_tag: Administrative region classification
- 'lcoe_{resource_type}': Levelized cost of electricity
- 'potential_capacity_{resource_type}': Maximum potential capacity
- 'geometry': Spatial geometry (Polygon or Point)

Optional Columns:
- 'capex_{resource_type}': Capital expenditure costs
- 'fom_{resource_type}': Fixed operation and maintenance costs
- 'vom_{resource_type}': Variable operation and maintenance costs
- '{resource_type}_CF_mean': Average capacity factor
- 'nearest_station': Nearest grid connection point
- 'nearest_station_distance_km': Distance to grid connection
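
Because several required column names depend on the resource type, it can help
to check them before clustering. The snippet below is a minimal sketch of such
a check. The helper `missing_required_columns` and the sample column list are
hypothetical and not part of this module.

```python
def missing_required_columns(available_columns, resource_type, region_column):
    # Build the required column names from the resource type and report
    # which of them are absent from the available columns.
    required = [
        'x',
        'y',
        region_column,
        f'lcoe_{resource_type}',
        f'potential_capacity_{resource_type}',
        'geometry',
    ]
    return [col for col in required if col not in available_columns]


# Hypothetical column list for a solar dataset with one column missing.
columns = ['x', 'y', 'Region', 'lcoe_solar', 'geometry']
missing = missing_required_columns(columns, 'solar', region_column='Region')
print(missing)
```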

Output Data Structure
---------------------
Clustering results include:

Cluster Map GeoDataFrame:
- Individual cells with assigned cluster numbers
- Original cell attributes preserved
- Cluster_No: Integer cluster identifier
- Optimal_k: Optimal number of clusters for region

Unified Clusters GeoDataFrame:
- Dissolved cluster geometries
- Aggregated techno-economic parameters
- Representative cluster characteristics
- Spatial extent covering all member cells

Cluster Indices Dictionary:
- Mapping of original cell indices to clusters
- Structure: {region: {cluster_no: [cell_indices]}}
- Enables traceability from clusters back to individual cells
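
The nested dictionary layout can be illustrated with plain Python. This is a
minimal sketch of the structure only, assuming cell records are available as
(cell_id, region, cluster_no) tuples. The helper `build_cluster_indices` and the
sample records are hypothetical and do not reproduce how the module builds it.

```python
from collections import defaultdict


def build_cluster_indices(cell_records):
    # cell_records: iterable of (cell_id, region, cluster_no) tuples.
    # Returns {region: {cluster_no: [cell_ids]}} as plain dicts.
    nested = defaultdict(lambda: defaultdict(list))
    for cell_id, region, cluster_no in cell_records:
        nested[region][cluster_no].append(cell_id)
    return {region: dict(by_cluster) for region, by_cluster in nested.items()}


# Hypothetical cell IDs in the '{region}_{x}_{y}' format.
records = [
    ('BC_-123.5_49.2', 'BC', 1),
    ('BC_-123.25_49.2', 'BC', 1),
    ('BC_-122.0_50.0', 'BC', 2),
    ('AB_-114.0_51.0', 'AB', 1),
]
cluster_indices = build_cluster_indices(records)
print(cluster_indices['BC'][1])
```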

Visualization Outputs
---------------------
The module generates several visualization products:

Elbow Plots:
- WCSS vs. number of clusters for each region
- Optimal cluster number identification
- Saved to vis_directory/Regional_cluster_Elbow_Plots/

Performance Considerations
--------------------------
- Memory usage scales with number of grid cells and clusters
- Processing time increases with higher max_k values
- Imputation handles missing data but may affect clustering quality
- Large regions may benefit from hierarchical clustering approaches
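
The effect of mean imputation can be seen on a single column. The snippet below
is a minimal pure-Python sketch of the idea (infinite and missing values are
replaced by the mean of the finite ones). The module itself relies on
scikit-learn's SimpleImputer; the helper `impute_with_mean` and the sample
values are hypothetical.

```python
import math


def impute_with_mean(values):
    # Treat None, NaN and +/-inf as missing and replace them with the
    # mean of the remaining finite values.
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        raise ValueError('no finite values to compute a mean from')
    mean = sum(finite) / len(finite)
    return [v if v is not None and math.isfinite(v) else mean for v in values]


# Hypothetical LCOE column with one infinite and one missing entry.
lcoe_column = [40.0, float('inf'), 60.0, float('nan'), 50.0]
imputed = impute_with_mean(lcoe_column)
print(imputed)
```

Every imputed cell lands exactly on the column mean, so a region with many
missing values gains an artificial concentration of points there, which can
pull cluster centroids toward the mean.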

Dependencies
------------
- pandas: Data manipulation and analysis
- geopandas: Spatial data operations
- numpy: Numerical computations
- matplotlib.pyplot: Visualization
- sklearn.cluster.KMeans: K-means clustering algorithm
- sklearn.impute.SimpleImputer: Missing value imputation
- pathlib: File path operations
- logging: Progress and error reporting
- RES.utility: Custom utility functions for spatial operations

Notes
-----
- Clustering is performed separately for each administrative region
- WCSS tolerance controls the trade-off between cluster number and representation
- Missing or infinite values are automatically handled through imputation
- Cluster ranking is based on ascending LCOE values (lowest cost first)
- Spatial relationships are preserved through geometry operations
- Results are suitable for energy system optimization models
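
The ascending-LCOE ranking and the assignment of cells to clusters can be
sketched in plain Python. This is a simplified illustration of the strategy
described for cells_to_cluster_mapping (sort, cut into k contiguous groups,
last group absorbs the remainder), assuming k does not exceed the number of
cells. The helper `split_sorted_into_clusters` and the sample values are
hypothetical.

```python
def split_sorted_into_clusters(lcoe_by_cell, k):
    # Sort cell IDs by ascending LCOE, then cut the ranking into k
    # contiguous groups; the last group absorbs any remainder.
    ranked = sorted(lcoe_by_cell, key=lcoe_by_cell.get)
    step = len(ranked) // k
    assignment = {}
    for cluster_no in range(1, k + 1):
        start = (cluster_no - 1) * step
        stop = cluster_no * step if cluster_no < k else len(ranked)
        for cell_id in ranked[start:stop]:
            assignment[cell_id] = cluster_no
    return assignment


# Hypothetical LCOE values for seven cells.
lcoe = {'c1': 55.0, 'c2': 42.0, 'c3': 61.0, 'c4': 48.0,
        'c5': 70.0, 'c6': 45.0, 'c7': 66.0}
clusters = split_sorted_into_clusters(lcoe, k=3)
print(clusters)
```

Here the two cheapest cells ('c2', 'c6') form cluster 1, and the three most
expensive cells end up in cluster 3 because it takes the remainder.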

See Also
--------
- RES.CellCapacityProcessor: For generating input data with LCOE calculations
- RES.utility: For additional spatial operations and cell ID management
- sklearn.cluster: For alternative clustering algorithms
"""

import logging as log
from pathlib import Path

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

import RES.utility as utils

imputer = SimpleImputer(strategy="mean")  # Other strategies: "median", "most_frequent"


def assign_cluster_id(cells: gpd.GeoDataFrame, 
                  source_column: str = None, 
                  index_name: str = 'cell') -> gpd.GeoDataFrame:
    """
    Generate unique identifiers for grid cells based on region and coordinates.
    
    Creates standardized cell identifiers that combine regional information
    with spatial coordinates to ensure uniqueness across the entire assessment
    domain. These identifiers serve as primary keys for data linking and
    result tracking throughout the assessment workflow.
    
    Parameters
    ----------
    cells : gpd.GeoDataFrame
        Input GeoDataFrame containing spatial data with 'x', 'y' coordinates
        and regional classification information
    source_column : str, default None
        Column name containing regional classification (e.g., province, state)
    index_name : str, default 'cell'
        Name for the new unique identifier column
        
    Returns
    -------
    gpd.GeoDataFrame
        GeoDataFrame with new unique cell identifier column set as index
        
    Examples
    --------
    Basic cell ID assignment:
    
    >>> cells_with_ids = assign_cluster_id(
    ...     cells=grid_cells,
    ...     source_column='Province',
    ...     index_name='cell_id'
    ... )
    >>> print(cells_with_ids.index.name)  # 'cell_id'
    
    Custom identifier format:
    
    >>> # Creates IDs like: "BC_-123.5_49.2"
    >>> cells = assign_cluster_id(cells, 'Province', 'unique_cell')
    
    Raises
    ------
    ValueError
        If source_column doesn't exist in the GeoDataFrame
    ValueError  
        If required coordinate columns 'x', 'y' are missing
        
    Notes
    -----
    - Removes spaces from region names for consistent formatting
    - ID format: "{region}_{x_coord}_{y_coord}"
    - Coordinates maintain original decimal precision
    - Sets generated IDs as DataFrame index for efficient lookups
    - Essential for linking spatial analysis results across workflow steps
    """
    
    if source_column is None:
        raise ValueError("'source_column' is not defined for indexing. Please provide a valid source column name.")
    # Ensure the source column exists
    if source_column not in cells.columns:
        raise ValueError(f"'{source_column}' does not exist in the GeoDataFrame.")

    # Remove spaces in the region names for consistency
    cells[source_column] = cells[source_column].str.replace(" ", "", regex=False)

    # Check if 'x' and 'y' coordinates exist
    if 'x' not in cells.columns or 'y' not in cells.columns:
        raise ValueError("Columns 'x' and 'y' must exist in the GeoDataFrame.")

    # Generate unique cell IDs using a combination of the region name and coordinates
    cells[index_name] = (
        cells.apply(
            lambda row: f"{row[source_column]}_{row['x']}_{row['y']}",
            axis=1
        )
    )

    # Set the index to the newly created column
    cells.set_index(index_name, inplace=True)

    return cells



def find_optimal_K(
        resource_type: str,
        data_for_clustering: pd.DataFrame,
        region: str,
        wcss_tolerance: float,
        max_k: int
        ) -> "int | None":
    
    """
    Determine optimal number of clusters using elbow method and WCSS tolerance.
    
    Analyzes grid cells with renewable energy characteristics to find the optimal
    number of K-means clusters using the elbow method. The Within-Cluster Sum of
    Squares (WCSS) tolerance parameter controls the trade-off between cluster
    representation accuracy and computational complexity.
    
    The function iteratively tests different cluster numbers (k) and calculates
    WCSS for each configuration. The optimal k is determined when WCSS falls
    below the specified tolerance threshold, indicating diminishing returns for
    additional clusters.
    
    Parameters
    ----------
    resource_type : str
        Type of renewable energy resource ('solar', 'wind', 'bess')
        Used for labeling and file naming
    data_for_clustering : pd.DataFrame
        Preprocessed data containing clustering features (LCOE, capacity)
        Must have no missing values or infinite values
    region : str
        Name of the administrative region being processed
        Used for plot titles and output messages
    wcss_tolerance : float
        Tolerance threshold as fraction of total WCSS (0.0 to 1.0)
        Lower values = more clusters, higher values = fewer clusters
    max_k : int
        Maximum number of clusters to test
        Limited by data size and computational constraints
        
    Returns
    -------
    int or None
        Optimal number of clusters for the region
        Returns None if no optimal k found within tolerance
        
    Examples
    --------
    Find optimal clusters for solar data:
    
    >>> optimal_k = find_optimal_K(
    ...     resource_type="solar",
    ...     data_for_clustering=clean_data,
    ...     region="British Columbia",
    ...     wcss_tolerance=0.15,
    ...     max_k=20
    ... )
    >>> print(f"Optimal clusters: {optimal_k}")
    
    Notes
    -----
    - WCSS measures squared distances from cluster centroids
    - Higher WCSS tolerance leads to fewer, more aggregated clusters
    - Lower WCSS tolerance leads to more, finer-grained clusters
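    - The elbow plot is drawn on the current matplotlib figure; the caller
      (pre_process_cluster_mapping) saves and closes it
    - Function uses K-means with 10 random initializations for stability
    - Processing time grows with max_k, because one K-means model is fitted
      per candidate k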
    
    Algorithm Details
    -----------------
    1. Test k from 1 to min(max_k, data_size) - 1 (the upper bound is exclusive)
    2. Calculate WCSS (inertia) for each k using K-means
    3. Compute tolerance threshold as wcss_tolerance times the sum of the WCSS
       values over all tested k
    4. Find first k where WCSS ≤ tolerance threshold
    5. Generate elbow plot with optimal k marked
    
    The WCSS measures the sum of squared distances between each data point
    and its assigned cluster centroid. Lower WCSS indicates tighter, more
    homogeneous clusters but may lead to over-segmentation.
    
    Raises
    ------
    ValueError
        Propagated from scikit-learn's KMeans if data_for_clustering
        contains NaN or infinite values
        
    See Also
    --------
    sklearn.cluster.KMeans : K-means clustering implementation
    pre_process_cluster_mapping : Preprocessing function that calls this method
    """
    
    utils.print_update(level=2,message="Estimating optimal number of Clusters for each region based on the Score for each Cell ...")

    # Initialize empty list to store the within-cluster sum of squares (WCSS)
    wcss_data = []

    # Try different values of k (number of clusters)
    for k in range(1, min(max_k, len(data_for_clustering))):
        # Input is expected to be imputed already (see pre_process_cluster_mapping)
        kmeans_data = KMeans(n_clusters=k, random_state=0, n_init=10).fit(data_for_clustering)
        # Inertia is the within-cluster sum of squares
        wcss_data.append(kmeans_data.inertia_)

    # Calculate the total WCSS
    total_wcss_data = sum(wcss_data)

    # Calculate the tolerance as a fraction of the total WCSS
    tolerance_data = wcss_tolerance * total_wcss_data

    # Pick the first k whose WCSS is at or below the tolerance (None if no k qualifies)
    optimal_k_data = next((k for k, wcss_value in enumerate(wcss_data, start=1) if wcss_value <= tolerance_data), None)

    # Plot the elbow curve on the current figure (the caller saves it)
    plt.plot(range(1, min(max_k, len(data_for_clustering))), wcss_data, marker='o', linestyle='-', label=f'lcoe_{resource_type}')
    if optimal_k_data is not None:
        plt.axvline(x=optimal_k_data, color='r', linestyle='--',
                    label=f"Optimal k = {optimal_k_data}; K-means with {round(wcss_tolerance*100,3)}% of WCSS")

    plt.title(f"Elbow plot of K-means Clustering with 'LCOE_{resource_type}' for Region-{region}")
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
    plt.grid(True)
    plt.legend()

    # Ensure x-axis ticks are integers
    plt.xticks(range(1, min(max_k, len(data_for_clustering))))

    # plt.tight_layout()

    # Print the optimal k
    print(f"Zone {region} - Optimal k for LCOE_{resource_type} based clustering: {optimal_k_data}\n")

    return optimal_k_data



def pre_process_cluster_mapping(
        cells_scored:pd.DataFrame,
        vis_directory:str,
        wcss_tolerance:float,
        sub_national_unit_tag:str,
        resource_type:str)->tuple[pd.DataFrame, pd.DataFrame]:
    
    """
    Preprocess data and determine optimal cluster numbers for each region.
    
    Performs comprehensive preprocessing of scored grid cells to prepare them
    for K-means clustering analysis. The function handles missing data, determines
    optimal cluster numbers for each administrative region, and generates
    visualization outputs for clustering analysis.
    
    This function serves as the preprocessing pipeline that prepares raw scored
    cell data for the main clustering workflow, ensuring data quality and
    generating region-specific clustering parameters.
    
    Parameters
    ----------
    cells_scored : pd.DataFrame
        DataFrame or GeoDataFrame containing scored grid cells with LCOE and capacity data
        Must include the sub_national_unit_tag column, 'x', 'y',
        'lcoe_{resource_type}' and 'potential_capacity_{resource_type}'
    vis_directory : str
        Base directory path for saving visualization outputs
        Elbow plots will be saved in subdirectory 'Regional_cluster_Elbow_Plots'
    wcss_tolerance : float
        WCSS tolerance threshold for optimal cluster determination (0.0 to 1.0)
        Controls trade-off between cluster number and representation accuracy
    sub_national_unit_tag : str
        Name of the column holding the administrative region of each cell
        Used to split the cells into regions and to build cell identifiers
    resource_type : str
        Resource type identifier ('solar', 'wind', 'bess')
        Used for column name construction and labeling
        
    Returns
    -------
    tuple[pd.DataFrame, pd.DataFrame]
        - cells_scored_cluster_mapped: Enhanced cell data with optimal k values and cell IDs
        - region_optimal_k_df: Summary of optimal cluster numbers by region
        
    Examples
    --------
    Preprocess solar cell data:
    
    >>> cells_mapped, optimal_k_summary = pre_process_cluster_mapping(
    ...     cells_scored=scored_solar_cells,
    ...     vis_directory="vis/BC",
    ...     wcss_tolerance=0.15,
    ...     sub_national_unit_tag="Region",
    ...     resource_type="solar"
    ... )
    >>> print(f"Processed {len(cells_mapped)} cells across {len(optimal_k_summary)} regions")
    
    Processing Workflow
    -------------------
    1. **Region Iteration**: Process each unique administrative region separately
    2. **Data Validation**: Check for required columns and sufficient data
    3. **Data Cleaning**: Handle infinite values and missing data through imputation
    4. **Optimal K Finding**: Apply elbow method to determine cluster numbers
    5. **Visualization**: Generate and save elbow plots for each region
    6. **Data Integration**: Merge optimal k values back to cell data
    7. **ID Assignment**: Generate unique cell identifiers for data linking
    
    Data Quality Handling
    ---------------------
    - **Missing Columns**: Regions without required columns are skipped
    - **Infinite Values**: Replaced with NaN for proper imputation
    - **Empty Data**: Regions with insufficient data are excluded
    - **Imputation**: Uses mean strategy for missing value replacement
    - **Zero Clusters**: Regions with optimal_k=0 are filtered out
    
    Output Structure
    ----------------
    cells_scored_cluster_mapped contains:
    - All original cell attributes
    - 'Optimal_k': Optimal cluster number for the cell's region
    - 'cell': Unique cell identifier (set as index)
    
    region_optimal_k_df contains:
    - sub_national_unit_tag column: Administrative region name
    - 'Optimal_k': Optimal number of clusters for the region
    
    Visualization Outputs
    ---------------------
    Generates elbow plots saved to:
    `{vis_directory}/Regional_cluster_Elbow_Plots/elbow_plot_region_{region}.png`
    
    Each plot shows:
    - WCSS vs. number of clusters
    - Optimal k marked with vertical line
    - Region-specific title and labels
    
    Notes
    -----
    - Processing is performed region-by-region for spatial coherence
    - Imputation strategy can affect clustering quality
    - Visualization directory is created if it doesn't exist
    - Regions with insufficient data (< 2 cells) may be skipped
    - Memory usage scales with number of regions and cells per region
    
    Raises
    ------
    ValueError
        If vis_directory path is invalid or cannot be created
    KeyError
        If required columns are missing from cells_scored
    RuntimeError
        If imputation or clustering fails for critical regions
        
    See Also
    --------
    find_optimal_K : Core optimal cluster determination function
    assign_cluster_id : Cell identifier generation function
    cells_to_cluster_mapping : Main clustering workflow function
    """

    unique_regions = cells_scored[sub_national_unit_tag].unique()
    try:
        elbow_plot_directory = Path(vis_directory, 'Regional_cluster_Elbow_Plots')
        elbow_plot_directory.mkdir(parents=True, exist_ok=True)
    except Exception as e:
        raise ValueError(f"Failed to create the elbow plot directory under '{vis_directory}'. Ensure 'vis_directory' is valid. Error: {e}") from e
    region_optimal_k_list = []

    # Loop over unique regions
    for region in unique_regions:
        print(f"\n=== Processing region: {region} ===")
        expected_cols = [f'lcoe_{resource_type}', f'potential_capacity_{resource_type}']
        
        available_cols = cells_scored.columns.tolist()
        print("Available columns in cells_scored:", available_cols)
        
        # Check if all required columns exist
        if not all(col in available_cols for col in expected_cols):
            print(f"Missing columns for clustering in region {region}. Skipping.")
            continue

        # Copy the slice so the in-place cleaning below does not act on a view of cells_scored
        data_for_clustering = cells_scored[cells_scored[sub_national_unit_tag] == region][expected_cols].copy()
        
        # Replace inf/-inf with NaN so they can be imputed
        data_for_clustering.replace([np.inf, -np.inf], np.nan, inplace=True)
        
        print("Data before imputation:")
        print(data_for_clustering.describe())

        # Drop columns that are entirely NaN
        data_for_clustering.dropna(axis=1, how='all', inplace=True)

        if data_for_clustering.empty or data_for_clustering.shape[1] == 0:
            print(f"Data for clustering is empty or invalid for region {region}. Skipping.")
            continue

        try:
            imputed_array = imputer.fit_transform(data_for_clustering)
        except Exception as e:
            print(f"Imputer failed for region {region} with error: {e}")
            continue

        data_for_clustering_cleaned = pd.DataFrame(imputed_array, columns=data_for_clustering.columns)
        
        
        # Call the function for K-means clustering and elbow plot
        optimal_k = find_optimal_K(resource_type,data_for_clustering_cleaned, region, wcss_tolerance, max_k=15)
        
        # Append values to the list
        region_optimal_k_list.append({sub_national_unit_tag: region, 'Optimal_k': optimal_k})

        # Save the elbow plot
        plot_name = f'elbow_plot_region_{region}.png'
        plot_save_to=elbow_plot_directory/plot_name
        plt.savefig(plot_save_to)
        plt.close()  # Close the plot to avoid overlapping
    ##################################################################
    print(">>> K-means clustering Elbow plots generated for each region based on the Score for each Cell ...")

    # Create a DataFrame from the list
    region_optimal_k_df = pd.DataFrame(region_optimal_k_list)
    # Regions without an optimal k get 0 and are filtered out below
    region_optimal_k_df['Optimal_k'] = region_optimal_k_df['Optimal_k'].fillna(0).astype(int)
    
    NonZeroClustersmask=region_optimal_k_df['Optimal_k']!=0
    region_optimal_k_df=region_optimal_k_df[NonZeroClustersmask]

    _x = cells_scored.merge(region_optimal_k_df, on=sub_national_unit_tag, how='left')
    cells_scored = assign_cluster_id(_x, sub_national_unit_tag, 'cell')
    

    print(f"Optimal-k based on 'LCOE' clustering calculated for {len(unique_regions)} zones and saved to cell dataframe.\n")
    cells_scored_cluster_mapped=cells_scored.copy()

    return cells_scored_cluster_mapped,region_optimal_k_df



def cells_to_cluster_mapping(
        cells_scored:pd.DataFrame,
        vis_directory:str,
        wcss_tolerance:float,
        sub_national_unit_tag:str,
        resource_type:str,
        sort_columns:list)-> tuple[pd.DataFrame,pd.DataFrame]:
    """
    Map grid cells to clusters based on similarity metrics and optimal cluster numbers.
    
    Performs spatial clustering of renewable energy grid cells by grouping cells with
    similar techno-economic characteristics (primarily LCOE) into representative clusters.
    The function implements a systematic approach to divide each region's cells into
    the optimal number of clusters determined through elbow method analysis.
    
    This is the main clustering workflow function that transforms individual grid cells
    into clustered representations suitable for energy system optimization models,
    reducing computational complexity while preserving spatial and economic relationships.
    
    Parameters
    ----------
    cells_scored : pd.DataFrame
        Scored grid cells with techno-economic attributes
        Must contain LCOE, capacity, and regional classification data
    vis_directory : str
        Directory path for saving clustering visualization outputs
        Used for elbow plots and clustering analysis results
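    wcss_tolerance : float
        Within-Cluster Sum of Squares tolerance (0.0 to 1.0)
        Controls cluster granularity vs. computational efficiency trade-off
    sub_national_unit_tag : str
        Name of the column holding the administrative region of each cell
        Clustering is performed separately for each value in this column
    resource_type : str
        Renewable energy resource type ('solar', 'wind', 'bess')
        Determines which columns to use for clustering analysis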
    sort_columns : list
        Column names for sorting cells before cluster assignment
        Typically includes LCOE or other ranking metrics
        
    Returns
    -------
    tuple[pd.DataFrame, pd.DataFrame]
        - cells_cluster_map_df: Individual cells with assigned cluster numbers
        - optimal_k_df: Summary of optimal cluster counts by region
        
    Examples
    --------
    Perform clustering for wind resources:
    
    >>> cluster_map, optimal_k = cells_to_cluster_mapping(
    ...     cells_scored=wind_cells_scored,
    ...     vis_directory="vis/Alberta",
    ...     wcss_tolerance=0.20,
    ...     sub_national_unit_tag="Region",
    ...     resource_type="wind",
    ...     sort_columns=["lcoe_wind", "potential_capacity_wind"]
    ... )
    >>> print(f"Created {optimal_k['Optimal_k'].sum()} clusters across regions")
    
    Clustering Methodology
    ----------------------
    The clustering approach follows several key principles:
    
    1. **Regional Separation**: Clustering is performed independently for each
       administrative region to maintain spatial coherence and respect political
       boundaries that affect renewable energy development.
    
    2. **LCOE-Based Similarity**: Cells are grouped based on Levelized Cost of
       Electricity (LCOE) as the primary similarity metric, ensuring clusters
       represent similar economic viability.
    
    3. **Sorted Assignment**: Within each region, cells are sorted by specified
       metrics (typically LCOE) before being assigned to clusters, so that each
       cluster holds a contiguous range of the ranking and the lowest-cost
       cells fall into cluster 1.
    
    4. **Equal Distribution**: Cells are divided as evenly as possible across
       the optimal number of clusters for each region, preventing cluster
       size imbalances.
    
    Algorithm Workflow
    ------------------
    1. **Preprocessing**: Call pre_process_cluster_mapping to determine optimal k
    2. **Region Filtering**: Focus on regions with valid optimal cluster numbers
    3. **Cell Sorting**: Sort cells within each region by specified criteria
    4. **Cluster Assignment**: Divide sorted cells into optimal number of groups
    5. **Remainder Handling**: Merge small remainder groups into larger clusters
    6. **Numbering**: Assign sequential cluster numbers within each region
    
    Cluster Assignment Strategy
    ---------------------------
    For each region with n cells and k optimal clusters:
    - Calculate step_size = n ÷ k
    - Assign cells [0:step_size] to cluster 1
    - Assign cells [step_size:2*step_size] to cluster 2
    - Continue until all cells are assigned
    - Merge any remainder cells into the last cluster
    
    This ensures balanced cluster sizes while maintaining economic similarity
    through the pre-sorting step.
    
    Output Data Structure
    ---------------------
    cells_cluster_map_df contains:
    - All original cell attributes (LCOE, capacity, coordinates, etc.)
    - 'Cluster_No': Integer cluster identifier within region
    - 'Optimal_k': Total number of clusters for the cell's region
    - 'cell': Unique cell identifier (as index)
    
    optimal_k_df contains:
    - sub_national_unit_tag : Administrative region unit (e.g. Region or Municipality etc.)
    - 'Optimal_k': Optimal number of clusters determined for region
    
    Performance Considerations
    --------------------------
    - Memory usage scales linearly with number of cells
    - Processing time increases with number of regions and complexity
    - Sorting operations may be memory-intensive for large datasets
    - Cluster assignment is an O(n) slicing step per region once cells are sorted
    
    Quality Assurance
    -----------------
    - Validates that all cells receive cluster assignments
    - Ensures cluster numbers are sequential within regions
    - Maintains data integrity through concatenation operations
    - Preserves spatial relationships through regional processing
    
    Notes
    -----
    - Clustering preserves regional boundaries for political/administrative coherence
    - LCOE-based sorting ensures economic similarity within clusters
    - Balanced cluster sizes improve downstream optimization performance
    - Results are suitable for capacity expansion and dispatch optimization models
    - Cluster numbering resets for each region (regional scope)
    
    Raises
    ------
    ValueError
        If required columns are missing or data validation fails
    KeyError
        If region names don't match between datasets
    RuntimeError
        If clustering assignment produces invalid results
        
    See Also
    --------
    pre_process_cluster_mapping : Preprocessing and optimal k determination
    create_cells_Union_in_clusters : Spatial union of clustered cells
    find_optimal_K : Core optimal cluster number determination
    """
    dataframe,optimal_k_df=pre_process_cluster_mapping(cells_scored,vis_directory,wcss_tolerance,sub_national_unit_tag,resource_type)

    utils.print_update(level=2,message="Mapping the Optimal Number of Clusters for Each region ...")

    clusters = []
    dataframe_filtered=dataframe[dataframe[sub_national_unit_tag].isin(list(optimal_k_df[sub_national_unit_tag]))]
    
    for region, group in dataframe_filtered.groupby(sub_national_unit_tag):
        group = group.sort_values(by=sort_columns, ascending=True)
        region_rows = len(group)
        
        optimal_k = optimal_k_df[optimal_k_df[sub_national_unit_tag] == region]['Optimal_k'].iloc[0]
        region_step_size = region_rows // optimal_k
        
        # Cut the sorted cells into exactly optimal_k contiguous groups. The last
        # group absorbs the remainder, so every cell receives a cluster number
        # even when the remainder is larger than the step size.
        for cluster_no in range(1, optimal_k + 1):  # numbering restarts per region
            start = (cluster_no - 1) * region_step_size
            stop = cluster_no * region_step_size if cluster_no < optimal_k else region_rows
            cluster_df = group.iloc[start:stop].copy()
            cluster_df['Cluster_No'] = cluster_no
            clusters.append(cluster_df)
    cells_cluster_map_df=pd.concat(clusters, ignore_index=False)

    return cells_cluster_map_df,optimal_k_df



def create_cells_Union_in_clusters(
        cluster_map_gdf:gpd.GeoDataFrame, 
        region_optimal_k_df:pd.DataFrame,
        sub_national_unit_tag:str,
        resource_type:str
            )->tuple[pd.DataFrame,dict]:

    """
    Create unified cluster geometries by dissolving individual cell boundaries.
    
    Transforms individual grid cells assigned to clusters into unified cluster
    geometries through spatial union operations. This process aggregates both
    geometric boundaries and techno-economic attributes to create representative
    cluster entities suitable for energy system optimization models.
    
    The function performs spatial dissolve operations grouped by cluster number
    within each region, creating cohesive cluster polygons while maintaining
    traceability back to original cells through detailed index mapping.
    
    Parameters
    ----------
    cluster_map_gdf : gpd.GeoDataFrame
        Grid cells with cluster assignments from cells_to_cluster_mapping.
        Must contain the column named by sub_national_unit_tag, 'Cluster_No',
        a geometry column, and the attribute columns listed under Aggregation Strategy.
    region_optimal_k_df : pd.DataFrame
        Summary of optimal cluster numbers by region.
        Contains the column named by sub_national_unit_tag and an 'Optimal_k' column.
    sub_national_unit_tag : str
        Name of the column that holds the administrative unit of each cell
        (e.g., Region or Municipality).
    resource_type : str
        Resource type identifier ('solar', 'wind', 'bess').
        Used for column naming and aggregation rules.
        
    Returns
    -------
    tuple[pd.DataFrame, dict]
        - dissolved_gdf: Unified cluster geometries with aggregated attributes
        - dissolved_indices: Mapping of cluster to original cell indices
        
    Examples
    --------
    Create unified solar clusters:
    
    >>> clusters_gdf, cell_mapping = create_cells_Union_in_clusters(
    ...     cluster_map_gdf=mapped_cells,
    ...     region_optimal_k_df=optimal_k_summary,
    ...     sub_national_unit_tag="Region",
    ...     resource_type="solar"
    ... )
    >>> print(f"Created {len(clusters_gdf)} unified clusters")
    >>> print(f"Cluster 1 contains {len(cell_mapping['BC'][1])} original cells")
    
    Aggregation Strategy
    --------------------
    Different attributes are aggregated using specific strategies:
    
    **Economic Metrics**:
    - LCOE: Value of the middle cell by position (approximates the median when cells are sorted by LCOE)
    - CAPEX, FOM, VOM: First value (uniform within region/technology)
    
    **Performance Metrics**:
    - Capacity Factor: Mean value (average performance)
    - Potential Capacity: Sum (total cluster capacity)
    
    **Infrastructure Metrics**:
    - Nearest Station: First value (primary connection point)
    - Distance to Grid: First value (representative distance)
    
    **Classification**:
    - Region, Cluster_No: First value (preserved identity)
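
    The rules above can be sketched in plain Python for a single cluster.
    This is an illustration only: the function itself passes an aggregation
    dictionary to ``GeoDataFrame.dissolve``, and the cell values and the
    helper ``aggregate_cluster`` below are hypothetical:

    ```python
    # Three hypothetical cells that belong to the same cluster, sorted by LCOE
    cells = [
        {"Cluster_No": 1, "lcoe": 40.0, "capex": 1.2, "CF_mean": 0.20, "potential_capacity": 10.0},
        {"Cluster_No": 1, "lcoe": 42.0, "capex": 1.2, "CF_mean": 0.22, "potential_capacity": 12.0},
        {"Cluster_No": 1, "lcoe": 47.0, "capex": 1.2, "CF_mean": 0.24, "potential_capacity": 8.0},
    ]

    def aggregate_cluster(rows):
        # Collapse the cells of one cluster into a single record.
        lcoe = [row["lcoe"] for row in rows]
        return {
            "Cluster_No": rows[0]["Cluster_No"],                          # first value
            "lcoe": lcoe[len(lcoe) // 2],                                 # middle cell by position
            "capex": rows[0]["capex"],                                    # first value
            "CF_mean": sum(row["CF_mean"] for row in rows) / len(rows),   # mean
            "potential_capacity": sum(row["potential_capacity"] for row in rows),  # sum
        }

    cluster = aggregate_cluster(cells)
    print(cluster)
    ```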
    
    Geometric Operations
    --------------------
    1. **Spatial Dissolve**: Union of cell geometries within each cluster
    2. **Topology Preservation**: Maintains valid polygon geometry
    3. **Attribute Aggregation**: Combines cell attributes per aggregation rules
    4. **Index Tracking**: Records original cell indices for each cluster
    
    Output Structure
    ----------------
    dissolved_gdf contains unified clusters with:
    - 'cluster_id': Unique cluster identifier (as index)
    - sub_national_unit_tag column: Administrative unit (e.g., Region or Municipality)
    - 'Cluster_No': Sequential cluster number within region
    - 'Rank': Cluster ranking based on LCOE (ascending)
    - Economic attributes: Aggregated costs and performance metrics
    - 'geometry': Unified cluster polygon geometry
    
    dissolved_indices structure:
    ```
    {
        'region_name': {
            cluster_no: [list_of_original_cell_indices],
            ...
        },
        ...
    }
    ```
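
    A short sketch of how such a nested mapping can be built and queried,
    assuming a flat list of (cell index, region, cluster number) records.
    The records and the region name 'AB' are hypothetical:

    ```python
    # Hypothetical cell assignments: (original cell index, region, Cluster_No)
    assignments = [(0, "BC", 1), (1, "BC", 1), (2, "BC", 2), (3, "AB", 1)]

    dissolved_indices = {}
    for cell_index, region, cluster_no in assignments:
        # Create the region and cluster entries on first use, then record the cell
        dissolved_indices.setdefault(region, {}).setdefault(cluster_no, []).append(cell_index)

    # Look up the cells behind one cluster
    print(dissolved_indices["BC"][1])

    # Count all cells across regions and clusters, e.g. to check nothing was lost
    total_cells = sum(
        len(cell_list)
        for region_clusters in dissolved_indices.values()
        for cell_list in region_clusters.values()
    )
    ```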
    
    Processing Workflow
    -------------------
    1. **Region Iteration**: Process each region independently
    2. **Cluster Grouping**: Group cells by cluster number within region
    3. **Index Recording**: Store original cell indices before dissolving
    4. **Spatial Dissolve**: Union geometries and aggregate attributes
    5. **Result Compilation**: Concatenate all dissolved clusters
    6. **ID Assignment**: Generate unique cluster identifiers
    7. **Ranking**: Sort and rank clusters by economic metrics
    8. **Column Cleanup**: Standardize column names for downstream use
    
    Traceability Features
    ---------------------
    The dissolved_indices dictionary enables:
    - Mapping clusters back to constituent cells
    - Detailed analysis of cluster composition
    - Validation of aggregation results
    - Disaggregation for detailed reporting
    
    Quality Assurance
    -----------------
    - Validates that all cells are included in clusters
    - Ensures geometric validity after spatial operations
    - Maintains attribute consistency through aggregation
    - Preserves regional and cluster identity information
    
    Performance Considerations
    --------------------------
    - Memory usage scales with cluster complexity and number
    - Spatial operations may be computationally intensive
    - Large clusters with many cells require more processing time
    - Geometric simplification may be beneficial for very detailed cells
    
    Notes
    -----
    - Cluster ranking facilitates economic dispatch optimization
    - Column name standardization removes resource type suffixes
    - Median LCOE provides robust cluster economic representation
    - Spatial union preserves geographic relationships
    - Results are optimized for energy system modeling workflows
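
    The column name standardization can be reproduced with the standard
    ``re`` module. The sketch below mirrors the pattern used in this
    function; the column list is illustrative:

    ```python
    import re

    resource_type = "solar"
    # Case-insensitive match of a trailing '_solar' or a leading 'solar_' token
    pattern = re.compile(rf"(?i)(_{resource_type}|{resource_type}_)")

    columns = ["lcoe_solar", "capex_solar", "solar_CF_mean",
               "potential_capacity_solar", "Cluster_No"]
    cleaned = [pattern.sub("", column) for column in columns]
    print(cleaned)
    ```

    The pattern is not anchored, so the token is also removed when it
    appears in the middle of a column name.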
    
    Raises
    ------
    ValueError
        If cluster assignments are invalid or missing
    GeometryError
        If spatial dissolve operations fail
    KeyError
        If required columns are missing from input data
        
    See Also
    --------
    cells_to_cluster_mapping : Preceding cluster assignment function
    clip_cluster_boundaries_upto_regions : Boundary refinement function
    gpd.GeoDataFrame.dissolve : Core spatial dissolve operation
    """
    utils.print_update(level=1,message=" Preparing Clusters...")
    node_distance_col = utils.get_available_column(cluster_map_gdf, ['nearest_station_distance_km', 'nearest_distance'])
    grid_node_col = utils.get_available_column(cluster_map_gdf, ['nearest_station', 'nearest_connection_point'])
        
    # Initialize an aggregation dictionary
    agg_dict = {#f'LCOE_{resource_type}': lambda x: x.iloc[len(x) // 2], 
                f'lcoe_{resource_type}': lambda x: x.iloc[len(x) // 2], 
                f'capex_{resource_type}':'first',
                f'fom_{resource_type}':'first',
                f'vom_{resource_type}':'first',
                f'{resource_type}_CF_mean':'mean',
                'Cluster_No':'first',
                f'potential_capacity_{resource_type}': 'sum',
                sub_national_unit_tag: 'first',
                grid_node_col:'first',
                node_distance_col:'first'}

    # Initialize an empty list to store the dissolved results
    dissolved_gdf_list = []
    
    # Initialize an empty dictionary to store dissolved indices for each region and each Cluster_No
    dissolved_indices = {}
    i=0
    # Loop through each region
    for region in region_optimal_k_df[sub_national_unit_tag]:
        i+=1
        log.info(f" Creating cluster for {region} {i}/{len(region_optimal_k_df[sub_national_unit_tag])}")
        region_mask = cluster_map_gdf[sub_national_unit_tag] == region
        region_cells = cluster_map_gdf[region_mask]

        # Initialize dictionary for the current region
        dissolved_indices[region] = {}

        # Loop through each Cluster_No in the current region
        for cluster_no, group in region_cells.groupby('Cluster_No'):
            # Store the indices of the rows before dissolving
            dissolved_indices[region][cluster_no] = group.index.tolist()

            # Dissolve by 'Cluster_No' and aggregate using the agg_dict
            region_dissolved = group.dissolve(by='Cluster_No', aggfunc=agg_dict)

            # Append the dissolved GeoDataFrame to the list
            dissolved_gdf_list.append(region_dissolved)

    # Post-process once, after all regions are dissolved (not on every loop iteration)
    # Concatenate all GeoDataFrames in the list
    dissolved_gdf = pd.concat(dissolved_gdf_list, ignore_index=True)

    dissolved_gdf=utils.assign_regional_cell_ids(dissolved_gdf,sub_national_unit_tag,'cluster_id')

    dissolved_gdf['Cluster_No'] = dissolved_gdf['Cluster_No'].astype(int)
    dissolved_gdf.sort_values(by=f'lcoe_{resource_type}', ascending=True, inplace=True)
    # dissolved_gdf.sort_values(by=f'LCOE_{resource_type}', ascending=True, inplace=True)
    dissolved_gdf['Rank'] = range(1, len(dissolved_gdf)+1)

    # Strip the resource type prefix/suffix from column names (e.g. 'lcoe_solar' -> 'lcoe')
    dissolved_gdf.columns=dissolved_gdf.columns.str.replace(fr"(?i)(_{resource_type}|{resource_type}_)", "", regex=True)
        
    utils.print_update(level=2,message="Clusters Created and a list generated to map the Cells inside each Cluster...")
    return dissolved_gdf, dissolved_indices



def clip_cluster_boundaries_upto_regions(
        cell_cluster_gdf:gpd.GeoDataFrame,
        gadm_regions_gdf:gpd.GeoDataFrame,
        resource_type:str)->gpd.GeoDataFrame:
    """
    Clip cluster boundaries to precise regional administrative boundaries.
    
    Refines cluster geometries by clipping them to exact administrative
    boundaries, ensuring that cluster extents respect political and
    administrative divisions. This final processing step removes any
    geometric artifacts from the clustering process and aligns results
    with official regional boundaries.
    
    The function performs spatial clipping operations to trim cluster
    polygons to the precise extent of administrative regions, maintaining
    data integrity while ensuring geographic accuracy for policy and
    planning applications.
    
    Parameters
    ----------
    cell_cluster_gdf : gpd.GeoDataFrame
        Unified cluster geometries from create_cells_Union_in_clusters
        Contains cluster polygons that may extend beyond regional boundaries
    gadm_regions_gdf : gpd.GeoDataFrame
        Official administrative boundary geometries from GADM dataset
        Defines precise regional extents for clipping operations
    resource_type : str
        Resource type identifier ('solar', 'wind', 'bess')
        Used for column identification and sorting operations
        
    Returns
    -------
    gpd.GeoDataFrame
        Clipped cluster geometries with boundaries precisely aligned
        to administrative regions, sorted by LCOE in ascending order
        
    Examples
    --------
    Clip wind clusters to provincial boundaries:
    
    >>> clipped_clusters = clip_cluster_boundaries_upto_regions(
    ...     cell_cluster_gdf=unified_clusters,
    ...     gadm_regions_gdf=provincial_boundaries,
    ...     resource_type="wind"
    ... )
    >>> print(f"Clipped {len(clipped_clusters)} clusters to regional boundaries")
    
    Clipping Operations
    -------------------
    1. **Spatial Intersection**: Clips cluster geometries using administrative boundaries
    2. **Topology Preservation**: Maintains valid polygon geometry after clipping
    3. **Attribute Retention**: Preserves all cluster attributes through clipping
    4. **Multi-geometry Handling**: Manages potential multi-polygon results
    
    Boundary Alignment Benefits
    ---------------------------
    - **Policy Compliance**: Ensures clusters respect administrative jurisdictions
    - **Planning Accuracy**: Aligns with regional energy planning boundaries
    - **Data Integrity**: Removes geometric inconsistencies from processing
    - **Visualization Quality**: Improves map accuracy for stakeholder communication
    
    Geometric Considerations
    ------------------------
    - Handles edge cases where clusters span multiple regions
    - Preserves cluster identity even after boundary clipping
    - Maintains geometric validity through robust clipping algorithms
    - May create multi-polygon geometries for clusters crossing boundaries
    
    Sorting and Organization
    ------------------------
    Results are sorted by LCOE in ascending order to facilitate:
    - Economic dispatch optimization
    - Merit order analysis
    - Least-cost development planning
    - Investment prioritization
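
    As a rough illustration of why ascending LCOE order is convenient, the
    plain-Python sketch below builds a simple merit order with cumulative
    capacity. The cluster identifiers, LCOE values, and capacities are
    hypothetical:

    ```python
    from itertools import accumulate

    # Hypothetical clusters: (cluster id, LCOE, potential capacity)
    clusters = [("BC_2", 55.0, 120.0), ("BC_1", 41.5, 80.0), ("AB_1", 48.0, 200.0)]

    # Cheapest clusters first
    merit_order = sorted(clusters, key=lambda cluster: cluster[1])

    # Running total of capacity available up to each LCOE level
    cumulative_capacity = list(accumulate(cluster[2] for cluster in merit_order))

    for (cluster_id, lcoe, _), total in zip(merit_order, cumulative_capacity):
        print(cluster_id, lcoe, total)
    ```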
    
    Quality Assurance
    -----------------
    - Validates geometric integrity after clipping operations
    - Ensures all clusters remain within administrative boundaries
    - Maintains attribute consistency through spatial operations
    - Preserves cluster ranking and identification
    
    Performance Notes
    -----------------
    - Clipping operations scale with geometric complexity
    - Large regions or detailed boundaries increase processing time
    - Memory usage depends on cluster and boundary detail level
    - Results are optimized for downstream energy modeling applications
    
    Use Cases
    ---------
    - **Regulatory Compliance**: Ensuring development respects jurisdictions
    - **Policy Analysis**: Aligning renewable development with administrative units
    - **Planning Integration**: Connecting energy models with regional planning
    - **Stakeholder Communication**: Accurate maps for decision-maker engagement
    
    Notes
    -----
    - Final step in the clustering workflow before energy system modeling
    - Essential for maintaining political and administrative coherence
    - Improves visual quality of cluster maps and analysis results
    - Ensures compatibility with regional energy planning frameworks
    - Results are ready for capacity expansion and dispatch optimization
    
    Raises
    ------
    GeometryError
        If clipping operations produce invalid geometries
    ValueError
        If input datasets have incompatible coordinate systems
    KeyError
        If the LCOE column used for sorting is missing from the input data
        
    See Also
    --------
    create_cells_Union_in_clusters : Preceding cluster creation function
    gpd.GeoDataFrame.clip : Core spatial clipping operation
    RES.boundaries.GADMBoundaries : Administrative boundary data source
    """
    cell_cluster_gdf_clipped=cell_cluster_gdf.clip(gadm_regions_gdf,keep_geom_type=False)
    # cell_cluster_gdf_clipped.sort_values(by=[f'LCOE_{resource_type}'], ascending=True, inplace=True) 
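    # create_cells_Union_in_clusters strips the resource type suffix from column names,
    # so fall back to the plain 'lcoe' column when the suffixed one is absent.
    lcoe_col = f'lcoe_{resource_type}' if f'lcoe_{resource_type}' in cell_cluster_gdf_clipped.columns else 'lcoe'
    cell_cluster_gdf_clipped = cell_cluster_gdf_clipped.sort_values(by=lcoe_col, ascending=True)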

    return cell_cluster_gdf_clipped