"""
Chapter-05-Matplotlib-Basics - Matplotlib: Line, Scatter, Histogram, Box Plot
===============================================================================
File    : Session5-Matplotlib-V01.py
Version : V01
Date    : 2026-05-13

Objective
---------
Load a people CSV from the shared A-Data folder, create four classic
Matplotlib chart types (line, scatter, histogram, box plot), and save each
chart as a 300-DPI PNG file to the C-Results/Session5 output folder.

What Is Matplotlib?
--------------------
Matplotlib is Python's most widely used plotting library.  It follows a
"figure -> axes -> elements" model:

    plt.figure()         Create a new blank canvas (figure).
    plt.plot() etc.      Draw chart elements onto the current figure.
    plt.title/xlabel/... Add labels and formatting.
    plt.tight_layout()   Auto-adjust spacing so labels do not overlap.
    plt.savefig(path)    Write the figure to a PNG (or PDF, SVG, ...) file.
    plt.close()          Release the figure from memory.

Always call plt.close() after saving so memory is freed; otherwise each
figure stays open and can accumulate into a memory leak in long scripts.
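
The save-and-close sequence above can be tried in isolation.  The following
is a minimal sketch, not part of this script: save_demo_chart is a
hypothetical helper, the three data points are made up, and the Agg backend
is selected only so that the example runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use('Agg')               # Non-interactive backend: no window needed
import matplotlib.pyplot as plt


def save_demo_chart(out_path):
    # Hypothetical helper: draw three made-up points, save, then close.
    plt.figure()                                 # New blank canvas
    plt.plot([1, 2, 3], [2, 4, 8], marker='o')   # Draw onto the current figure
    plt.title('Demo')
    plt.tight_layout()
    plt.savefig(out_path, dpi=100)
    plt.close()                                  # Release the figure
    return plt.get_fignums()                     # Numbers of figures still open


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.png'
    open_figures = save_demo_chart(demo_path)
    file_was_written = demo_path.exists()

print(f'File written: {file_was_written}, figures still open: {open_figures}')
```

If the plt.close() line is removed, plt.get_fignums() reports the figure as
still open, which is the accumulation described above.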

Four Chart Types Introduced in This Session
--------------------------------------------
    Chart type      Function            Best used for
    --------------- ------------------- ------------------------------------
    Line plot       plt.plot()          Trends over a continuous axis (time,
                                        sorted numeric variable like Age).
    Scatter plot    plt.scatter()       Relationship / correlation between
                                        two numeric variables.
    Histogram       plt.hist()          Distribution shape of one numeric
                                        variable; how often each range occurs.
    Box plot        plt.boxplot()       Spread, median, and outliers of one
                                        or more numeric variables side-by-side.
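
To make "how often each range occurs" concrete, the sketch below bins the
five Income values from the sample People.csv shown later in this docstring.
It calls np.histogram directly, which applies the same equal-width binning
that plt.hist() performs before drawing; using NumPy here instead of
plotting is a choice made for this illustration only.

```python
import numpy as np

# Income values from the sample People.csv in this docstring.
incomes = [45000, 62000, 38000, 55000, 71000]

# Split the range min..max into 5 equal-width bins and count the values.
counts, edges = np.histogram(incomes, bins=5)

bin_width = edges[1] - edges[0]     # (max - min) / number of bins
print(f'Bin width: {bin_width:.0f}')

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:>8.0f} to {right:>8.0f}: {count} value(s)')
```

With only five rows, each income lands in its own bin, so every bar has
height 1; a larger file gives a more informative shape.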

Key Parameters Explained
--------------------------
    Parameter               Used in         Meaning
    ----------------------  --------------- ---------------------------------
    marker='o'              plt.plot()      Draw a circle at each data point.
    linewidth=0.3           plt.grid()      Keep the grid lines thin and faint.
    bins=5                  plt.hist()      Number of equal-width buckets.
    edgecolor='black'       plt.hist()      Draw a border around each bar.
    axis='y'                plt.grid()      Horizontal grid lines only (drawn
                                            at the y-axis ticks).
    labels=[...]            plt.boxplot()   Label under each box (renamed to
                                            tick_labels in Matplotlib 3.9).
    dpi=300                 plt.savefig()   High-resolution output (300 dots
                                            per inch; suitable for print).
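
The effect of dpi on file dimensions is simple arithmetic: figure size in
inches multiplied by dpi.  The sketch below assumes Matplotlib's default
figure size of 6.4 x 4.8 inches and no bbox_inches='tight' cropping;
pixel_size is a hypothetical helper, and Matplotlib's own rounding may
differ by a pixel when the product is not a whole number.

```python
def pixel_size(width_in, height_in, dpi):
    # Hypothetical helper: inches times dots-per-inch gives pixels.
    return round(width_in * dpi), round(height_in * dpi)


default_size = (6.4, 4.8)           # Matplotlib's default figsize in inches

print_px = pixel_size(*default_size, dpi=300)    # Print-quality setting
screen_px = pixel_size(*default_size, dpi=96)    # Typical screen setting

print(f'dpi=300 -> about {print_px[0]} x {print_px[1]} pixels')
print(f'dpi=96  -> about {screen_px[0]} x {screen_px[1]} pixels')
```

Going from 96 to 300 dpi multiplies each side by roughly three, so the
pixel count (and usually the file size) grows by close to a factor of ten.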

Box Plot Anatomy
-----------------
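      o     <- individual outlier point (beyond the whiskers)
     ---    <- upper whisker: largest value within 1.5 * IQR above Q3
      |
    +---+   <- Q3 (75%), top of the box
    |   |
    |---|   <- median line inside the box (50%)
    |   |
    +---+   <- Q1 (25%), bottom of the box
      |
     ---    <- lower whisker: smallest value within 1.5 * IQR below Q1
      o     <- individual outlier point (beyond the whiskers)

    The height of the box, Q3 - Q1, is the interquartile range (IQR).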
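
The numbers behind such a diagram can be computed by hand.  The sketch
below follows the 1.5 * IQR rule described above using NumPy percentiles;
it uses the sample heights from this docstring plus one made-up value (210)
added only so that an outlier exists.  It is an illustration of the rule,
not a copy of Matplotlib's internal code.

```python
import numpy as np

# Sample heights from People.csv plus one made-up extreme value (210).
heights = np.array([175, 160, 180, 165, 170, 210])

q1, median, q3 = np.percentile(heights, [25, 50, 75])
iqr = q3 - q1                       # Height of the box

# Fences: the furthest a whisker is allowed to reach.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = heights[(heights >= low_fence) & (heights <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = heights[(heights < low_fence) | (heights > high_fence)]

print(f'Q1={q1}, median={median}, Q3={q3}, IQR={iqr}')
print(f'Whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}')
```

Note that the upper whisker ends at 180, the largest real data point inside
the fence, not at the fence value itself.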

Expected Folder Layout
----------------------
    (root)/
    |
    +-- A-Data/
    |   +-- People.csv                      <- Input data (read-only)
    |
    +-- Chapter-05-Matplotlib-Basics/
    |   +-- Session5-Matplotlib-V01.py      <- THIS script
    |
    +-- C-Results/
        +-- Session5/
            +-- line_income_vs_age.png      <- Line chart output
            +-- scatter_height_weight.png   <- Scatter chart output
            +-- hist_income.png             <- Histogram output
            +-- box_height_weight.png       <- Box plot output
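
How the script locates these folders can be illustrated with pathlib alone.
The sketch below assumes a hypothetical root of /projects/course and uses
PurePosixPath so that it behaves the same on every operating system; the
real script starts from Path(__file__).resolve() instead.

```python
from pathlib import PurePosixPath

# Hypothetical location of this script; only the folder names below
# '/projects/course' come from the layout above.
script_path = PurePosixPath(
    '/projects/course/Chapter-05-Matplotlib-Basics/Session5-Matplotlib-V01.py'
)

chapter_folder = script_path.parents[0]   # Folder that contains the script
project_root = script_path.parents[1]     # One level above the chapter folder

data_path = project_root / 'A-Data' / 'People.csv'
out_folder = project_root / 'C-Results' / 'Session5'

print(data_path)
print(out_folder)
```

Because both paths are derived from the script's own location, they do not
depend on the terminal's current working directory.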

Required Input File  (People.csv)
-----------------------------------
The CSV must contain at least these four numeric columns:
    Age         Age in years.
    Income      Annual income in USD.
    Height_cm   Height in centimetres.
    Weight_kg   Weight in kilograms.

Sample file content  (People.csv):
    Name,Age,Income,Height_cm,Weight_kg
    Ali,25,45000,175,70
    Sara,30,62000,160,55
    John,22,38000,180,75
    Mei,28,55000,165,60
    Luis,35,71000,170,68
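
A quick way to confirm that a file meets these requirements is sketched
below.  The script itself does not perform this check; REQUIRED_COLUMNS and
find_column_problems are hypothetical names introduced for the
illustration, and the CSV text is the sample content above read from memory
rather than from disk.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ['Age', 'Income', 'Height_cm', 'Weight_kg']

csv_text = '''Name,Age,Income,Height_cm,Weight_kg
Ali,25,45000,175,70
Sara,30,62000,160,55
John,22,38000,180,75
Mei,28,55000,165,60
Luis,35,71000,170,68
'''


def find_column_problems(frame):
    # Hypothetical helper: list required columns that are missing, and
    # required columns that are present but not numeric.
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    non_numeric = [
        c for c in REQUIRED_COLUMNS
        if c in frame.columns and not pd.api.types.is_numeric_dtype(frame[c])
    ]
    return missing, non_numeric


people = pd.read_csv(io.StringIO(csv_text))
missing, non_numeric = find_column_problems(people)
print(f'Missing: {missing}, non-numeric: {non_numeric}')

# Simulate a broken file by dropping one required column.
broken_missing, _ = find_column_problems(people.drop(columns=['Income']))
print(f'Missing after dropping Income: {broken_missing}')
```

Running a check like this right after loading gives a clearer message than
a KeyError raised later from inside a plotting call.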

Quick Start
-----------
1. Open VS Code and open the Chapter-05-Matplotlib-Basics folder.
2. Open Session5-Matplotlib-V01.py.
3. Run (Ctrl+F5, or right-click -> Run Python File in Terminal).
4. Four PNG files appear in C-Results/Session5/.
5. Open each PNG to inspect the charts.

Step-by-Step Flow
-----------------
Step 1  : Reconfigure stdout/stderr to UTF-8 for cross-platform printing.
Step 2  : Import required libraries: sys, pathlib.Path, pandas, matplotlib.
Step 3  : Resolve the input CSV path (two levels up, then into A-Data/).
Step 4  : Resolve the project root and build the Session5 output folder path.
Step 5  : Create the output folder if it does not already exist.
Step 6  : Verify that the input CSV exists; exit with a clear message if not.
Step 7  : Load the CSV into a pandas DataFrame.
Step 8  : Sort the DataFrame by Age for the line plot.
Step 9  : Create and save Chart 1 -- Line plot: Income vs Age.
Step 10 : Create and save Chart 2 -- Scatter plot: Height vs Weight.
Step 11 : Create and save Chart 3 -- Histogram: Income distribution.
Step 12 : Create and save Chart 4 -- Box plot: Height and Weight.
Step 13 : Print a success message listing all four saved PNG file paths.

Output Files
-------------
    line_income_vs_age.png
        Line chart showing Income on Y-axis vs Age on X-axis, sorted by Age.
        Markers at each data point.  300 DPI.

    scatter_height_weight.png
        Scatter plot of Height (X) vs Weight (Y).  Each dot is one person.
        Good for spotting a positive correlation between the two variables.

    hist_income.png
        Histogram of Income divided into 5 equal-width bins.
        Shows whether incomes cluster in a range or are spread out.

    box_height_weight.png
        Side-by-side box plots of Height_cm and Weight_kg.
        Shows median, interquartile range, and any outliers.

Notes
-----
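- .values on a pandas Series converts it to a plain NumPy array before
  passing to plt.plot().  sort_values() reorders the rows but keeps the
  original index labels, so stripping the index guarantees that Matplotlib
  sees the data purely by position.  Recent Matplotlib versions also plot a
  Series by position, so this is a defensive habit rather than a strict
  requirement; .to_numpy() is the equivalent that pandas now recommends.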
- plt.close() must be called after each savefig() to free figure memory.
  Skipping it causes figures to accumulate -- harmless for 4 charts but
  a real problem in loops that create hundreds of figures.
- dpi=300 produces print-quality images.  Use dpi=72 or dpi=96 for
  smaller files suited to web or screen display.
- grid(True, linewidth=0.3) uses a very thin line so the grid is visible
  but does not compete visually with the data.
- Output paths are built with pathlib's / operator (out_folder / 'name.png').
  savefig() accepts a pathlib.Path directly, so no conversion to str or
  os.path.join() call is needed -- either style works.
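
The first note above, about sorting and the index, can be seen in a small
sketch.  It rebuilds the Age and Income columns of the sample file in
memory and uses .to_numpy(), which gives the same array as .values; the
variable names are chosen for this illustration only.

```python
import pandas as pd

# Age and Income columns from the sample People.csv, in file order.
people = pd.DataFrame({
    'Age': [25, 30, 22, 28, 35],
    'Income': [45000, 62000, 38000, 55000, 71000],
})
people_sorted = people.sort_values('Age')

# The rows move, but each row keeps its original index label.
index_after_sort = people_sorted.index.tolist()

# Position-based view: what Matplotlib receives from .values / .to_numpy().
income_by_position = people_sorted['Income'].to_numpy().tolist()

# Label-based vs position-based lookup now give different rows.
income_label_0 = int(people_sorted['Income'].loc[0])      # Original first row
income_position_0 = int(people_sorted['Income'].iloc[0])  # Youngest person

print(index_after_sort)
print(income_by_position)
print(income_label_0, income_position_0)
```

After sorting, label 0 still refers to the first row of the file (income
45000) while position 0 is the youngest person (income 38000).  Calling
reset_index(drop=True) on the sorted DataFrame is another way to bring
labels and positions back in line.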
"""

# ===========================================================================
# Step 1 - Reconfigure stdout and stderr to UTF-8
#          Prevents garbled or missing characters on Windows terminals.
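#          The hasattr() guard skips this step on Python versions older than
#          3.7, or when stdout/stderr has been replaced by a stream object
#          that has no reconfigure() method.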
# ===========================================================================
import sys

if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')
if hasattr(sys.stderr, 'reconfigure'):
    sys.stderr.reconfigure(encoding='utf-8')

# ===========================================================================
# Step 2 - Import required libraries
# ===========================================================================
from pathlib import Path                # Modern object-oriented path handling
import pandas as pd                     # DataFrame for data loading
import matplotlib.pyplot as plt         # Core Matplotlib plotting interface

# ===========================================================================
# Step 3 - Resolve the input CSV path
#
#  Path(__file__).resolve()       -> absolute path of THIS script file
#  .parents[1]                    -> two levels up from the script file:
#                                    parents[0] is the chapter folder,
#                                    parents[1] is the project root
#  / 'A-Data' / 'People.csv'      -> navigate into the shared data folder
#
#  This means the CSV is located at:
#      (root)/A-Data/People.csv
#  regardless of which directory the terminal is pointed at when you run.
# ===========================================================================
data_path = Path(__file__).resolve().parents[1] / 'A-Data' / 'People.csv'
print(f'[DEBUG] data_path    : {data_path}')

# ===========================================================================
# Step 4 - Resolve the project root and build the output folder path
#
#  A dedicated Session5 subfolder keeps this session's PNG files separate
#  from outputs produced by other sessions.
# ===========================================================================
project_root = Path(__file__).resolve().parents[1]      # (root)/
out_folder   = project_root / 'C-Results' / 'Session5'  # (root)/C-Results/Session5/

print(f'[DEBUG] project_root : {project_root}')
print(f'[DEBUG] out_folder   : {out_folder}')

# ===========================================================================
# Step 5 - Create the output folder if it does not already exist
#          parents=True  -> also create C-Results/ if it is missing.
#          exist_ok=True -> do nothing silently if the folder is already there.
# ===========================================================================
out_folder.mkdir(parents=True, exist_ok=True)

# ===========================================================================
# Step 6 - Verify the input file exists before trying to load it
#          Giving a clear message here is more helpful than letting pandas
#          raise a raw FileNotFoundError with a long traceback.
# ===========================================================================
if not data_path.exists():
    print(f'[ERROR] Input file not found: {data_path}')
    print('[ERROR] Check that People.csv is in the A-Data folder. Exiting.')
    raise SystemExit(1)

print(f'[INFO] Input file found: {data_path}')

# ===========================================================================
# Step 7 - Load the CSV into a pandas DataFrame
#          encoding='utf-8' handles accented or special characters in names.
# ===========================================================================
print('\n[INFO] Loading data from CSV...')
df = pd.read_csv(data_path, encoding='utf-8')
print(f'[DEBUG] Shape loaded  : {df.shape[0]} rows x {df.shape[1]} cols')
print(f'[DEBUG] Columns found : {list(df.columns)}')
print(df.head(), '\n')

# ===========================================================================
# Step 8 - Sort the DataFrame by Age for the line plot
#
#  df.sort_values('Age')
#      Returns a new DataFrame with rows ordered from youngest to oldest.
#      Without sorting, plt.plot() connects the dots in CSV row order, which
#      produces a jagged zigzag line rather than a smooth left-to-right trend.
#
#  .values on a sorted Series converts it to a plain NumPy array.
#  sort_values() reorders the rows but keeps the original index labels
#  (e.g. [2, 0, 3, 1, 4] for the sample data) without resetting them.
#  Using .values strips the index entirely, so Matplotlib receives clean
#  arrays in sorted order.  Recent Matplotlib versions also plot a Series
#  by position, so this is a defensive habit rather than a requirement.
# ===========================================================================
df_sorted = df.sort_values('Age')
print('[DEBUG] DataFrame sorted by Age.')

# ===========================================================================
# Step 9 - Chart 1: Line plot -- Income vs Age
#
#  plt.figure()
#      Creates a new blank figure canvas.  Every chart starts with this call
#      to ensure each chart is independent and does not draw on a previous one.
#
#  plt.plot(x, y, marker='o')
#      Draws a line connecting the data points in order.
#      marker='o' adds a filled circle at each data point so individual
#      values are visible as well as the connecting line.
#
#  plt.grid(True, linewidth=0.3)
#      Draws faint horizontal and vertical grid lines to help read values.
#      linewidth=0.3 keeps them subtle so they do not overpower the data.
#
#  plt.tight_layout()
#      Automatically adjusts margins so axis labels, tick labels, and the
#      title are not clipped or overlapping.  Always call this before savefig.
#
#  plt.savefig(path, dpi=300)
#      Writes the figure to a PNG file at 300 dots per inch (print quality).
#
#  plt.close()
#      Releases the figure from memory.  Must be called after every savefig()
#      to prevent memory accumulation when generating many charts.
# ===========================================================================
print('[INFO] Creating Chart 1: Line plot - Income vs Age...')

plt.figure()
plt.plot(
    df_sorted['Age'].values,        # X-axis: Age values as plain NumPy array
    df_sorted['Income'].values,     # Y-axis: Income values as plain NumPy array
    marker='o'                      # Circle marker at each data point
)
plt.title('Income vs Age (sorted by Age)')
plt.xlabel('Age')
plt.ylabel('Income (USD)')
plt.grid(True, linewidth=0.3)
plt.tight_layout()

out_line = out_folder / 'line_income_vs_age.png'
plt.savefig(out_line, dpi=300)
plt.close()                         # Free figure memory after saving
print(f'[DEBUG] Saved: {out_line}')

# ===========================================================================
# Step 10 - Chart 2: Scatter plot -- Height vs Weight
#
#  plt.scatter(x, y)
#      Draws one dot per row at position (Height_cm, Weight_kg).
#      Unlike plt.plot(), scatter does NOT connect the dots with lines,
#      making it ideal for showing the relationship between two variables
#      without implying any ordering.
#
#  A cluster of dots trending from lower-left to upper-right indicates a
#  positive correlation: taller people tend to weigh more.
# ===========================================================================
print('[INFO] Creating Chart 2: Scatter plot - Height vs Weight...')

plt.figure()
plt.scatter(
    df['Height_cm'],                # X-axis: Height in centimetres
    df['Weight_kg']                 # Y-axis: Weight in kilograms
)
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True, linewidth=0.3)
plt.tight_layout()

out_scatter = out_folder / 'scatter_height_weight.png'
plt.savefig(out_scatter, dpi=300)
plt.close()
print(f'[DEBUG] Saved: {out_scatter}')

# ===========================================================================
# Step 11 - Chart 3: Histogram -- Income distribution
#
#  plt.hist(series, bins=5, edgecolor='black')
#      Divides the range of Income into 5 equal-width buckets (bins) and
#      counts how many values fall into each bucket.
#      edgecolor='black' draws a visible border around each bar, making
#      adjacent bars easier to distinguish.
#
#  plt.grid(axis='y', linewidth=0.3)
#      Grid lines on the Y-axis only (horizontal lines) so you can read
#      the count values.  Vertical grid lines are suppressed because the
#      bars already span the X-axis range.
#
#  A histogram answers: "Are incomes clustered around a typical value,
#  or spread across a wide range?"
# ===========================================================================
print('[INFO] Creating Chart 3: Histogram - Income distribution...')

plt.figure()
plt.hist(
    df['Income'],                   # Data: Income values to be binned
    bins=5,                         # Divide the range into 5 equal buckets
    edgecolor='black'               # Black border around each bar
)
plt.title('Income Distribution')
plt.xlabel('Income (USD)')
plt.ylabel('Count')
plt.grid(axis='y', linewidth=0.3)   # Horizontal grid lines only
plt.tight_layout()

out_hist = out_folder / 'hist_income.png'
plt.savefig(out_hist, dpi=300)
plt.close()
print(f'[DEBUG] Saved: {out_hist}')

# ===========================================================================
# Step 12 - Chart 4: Box plot -- Height and Weight side-by-side
#
#  plt.boxplot([list_of_series], labels=[list_of_names])
#      Draws one box for each Series in the list.
#      Both boxes share a single y-axis even though the units differ
#      (cm vs kg), so compare the spread and symmetry of each box rather
#      than their absolute positions.
#
#  Each box shows five summary statistics, plus any outliers:
#      - Bottom whisker : minimum non-outlier value
#      - Bottom of box  : Q1 (25th percentile)
#      - Line in box    : median (50th percentile)
#      - Top of box     : Q3 (75th percentile)
#      - Top whisker    : maximum non-outlier value
#      - Circles (o)    : outliers, i.e. points more than 1.5 * IQR below
#                         Q1 or above Q3
#
#  labels=['Height_cm', 'Weight_kg'] places a readable label under each box.
#  Matplotlib 3.9 renamed this parameter to tick_labels; the old name still
#  works there but emits a deprecation warning.
# ===========================================================================
print('[INFO] Creating Chart 4: Box plot - Height and Weight...')

plt.figure()
plt.boxplot(
    [df['Height_cm'], df['Weight_kg']],     # Two boxes: one per column
    labels=['Height_cm', 'Weight_kg']       # Labels displayed below each box
)
plt.title('Box Plots: Height and Weight')
plt.tight_layout()

out_box = out_folder / 'box_height_weight.png'
plt.savefig(out_box, dpi=300)
plt.close()
print(f'[DEBUG] Saved: {out_box}')

# ===========================================================================
# Step 13 - Print a success message listing all four saved PNG file paths
# ===========================================================================
print('\n[DONE] All chart files saved:')
print(f'  1. {out_line}')
print(f'  2. {out_scatter}')
print(f'  3. {out_hist}')
print(f'  4. {out_box}')
