"""
Chapter-01-First-Analysis - First Analysis in VS Code
======================================================
File    : Session1-First-Analysis-V02.py
Version : V02
Date    : 2026-05-13

Objective
---------
Load a CSV file from a shared data folder that sits TWO levels above this
script, compute basic descriptive statistics and the Pearson correlation
between Height and Weight, then save the results to an output folder that
also sits TWO levels above this script.

This session introduces relative path navigation:

    Symbol      Meaning
    --------    -------------------------------------------------------
    .           Current working directory (where Python was launched).
    ./          Inside the current directory (./x is the same as x).
    ../         One folder above the current directory.
    ../../      Two folders above the current directory.

See R05-Relative-Paths-Cheat-Sheet.pdf in the references folder for a
full visual guide.

Expected Folder Layout
----------------------
    (root)/
    |
    +-- A-Data/
    |   +-- Height-Weight.csv               <- Input data (read-only)
    |
    +-- Chapter-01-First-Analysis/
    |   +-- S01-First-Analysis/
    |       +-- Session1-First-Analysis-V02.py   <- THIS script
    |
    +-- C-Results/
        +-- Summary-Session-V02.csv         <- Output produced here

Relative Paths Used in This Script
------------------------------------
    Input  : ../../A-Data/Height-Weight.csv
             (two levels up from script, then into A-Data)
    Output : ../../C-Results/Summary-Session-V02.csv
             (two levels up from script, then into C-Results)

Required Input File  (Height-Weight.csv)
-----------------------------------------
    Name,Height_cm,Weight_kg
    Ali,175,70
    Sara,160,55
    John,180,75
    Mei,165,60
    Luis,170,68

Column names must match exactly:
    Height_cm   (numeric, centimetres)
    Weight_kg   (numeric, kilograms)

Quick Start
-----------
1. Open VS Code and open the S01-First-Analysis folder.
2. Open Session1-First-Analysis-V02.py.
3. Run (Ctrl+F5, or right-click -> Run Python File in Terminal).
4. Read the console output for a data preview and statistics.
5. Open C-Results/Summary-Session-V02.csv to inspect saved results.

Step-by-Step Flow
-----------------
Step 1  : Reconfigure stdout/stderr to UTF-8 for cross-platform printing.
Step 2  : Import required libraries: sys, os, pandas.
Step 3  : Resolve the absolute path of this script to anchor all relative paths.
Step 4  : Build the input path  (../../A-Data/Height-Weight.csv).
Step 5  : Build the output path (../../C-Results/) and create folder if missing.
Step 6  : Load the CSV into a pandas DataFrame.
Step 7  : Print the first five rows as a quick data preview.
Step 8  : Compute mean and standard deviation for Height_cm and Weight_kg.
Step 9  : Print the summary statistics to the console.
Step 10 : Compute the Pearson correlation between the two numeric columns.
Step 11 : Print the correlation value and a plain-language interpretation.
Step 12 : Assemble all results into a small summary DataFrame.
Step 13 : Save the summary DataFrame as Summary-Session-V02.csv.
Step 14 : Print a success message showing the full output path.

Output File  (Summary-Session-V02.csv)
----------------------------------------
    Metric,Value
    Mean Height (cm),<value>
    Std Dev Height (cm),<value>
    Mean Weight (kg),<value>
    Std Dev Weight (kg),<value>
    Pearson Correlation,<value>

Notes
-----
- Column names Height_cm and Weight_kg are case-sensitive.
  Update COLUMN_HEIGHT and COLUMN_WEIGHT below if your CSV uses different names.
- Pearson correlation ranges from -1.0 (perfect negative) to +1.0 (perfect positive).
  A value near 1.0 means taller people tend to weigh more in this dataset.
- os.makedirs(..., exist_ok=True) creates C-Results automatically on first run.
- All paths are built with os.path.join so the script works identically on
  Windows (backslashes) and macOS/Linux (forward slashes).
- The script auto-installs pandas if it is not already present.
"""

# ===========================================================================
# Step 1 - Reconfigure stdout and stderr to UTF-8
#          Prevents garbled or missing characters on Windows terminals.
#          The hasattr() guard keeps this safe on older Python builds.
# ===========================================================================
import sys

if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')
if hasattr(sys.stderr, 'reconfigure'):
    sys.stderr.reconfigure(encoding='utf-8')

# ===========================================================================
# Step 2 - Import standard libraries
# ===========================================================================
import os                               # File and folder path operations

# ---------------------------------------------------------------------------
# Step 2b - Import pandas (auto-install if not already present)
#           pandas provides the DataFrame used for data loading and analysis.
# ---------------------------------------------------------------------------
try:
    import pandas as pd
except ImportError:
    print('[INFO] pandas not found. Attempting automatic install...')
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas'])
    import pandas as pd
    print('[INFO] pandas installed successfully.')

# ===========================================================================
# Step 3 - Resolve the absolute path of this script (the path anchor)
#
#          WHY: Relative paths like "../data/file.csv" are always relative to
#          the current working directory -- which is wherever the terminal
#          was opened, NOT necessarily where this .py file lives.
#          Using __file__ as the anchor instead means the paths work correctly
#          no matter where the terminal is pointed when you run the script.
#
#          __file__            -> path Python assigned to this .py file
#          os.path.abspath()   -> converts it to a full absolute path
#          os.path.dirname()   -> strips the filename, leaving the folder only
# ===========================================================================
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
# e.g. C:\Users\Yahya\...\Chapter-01-First-Analysis\S01-First-Analysis
print(f'[DEBUG] Script folder  : {SCRIPT_DIR}')

# ===========================================================================
# Step 4 - Build the input file path
#
#          Folder layout from SCRIPT_DIR upward:
#              SCRIPT_DIR               = .../S01-First-Analysis/
#              one level up   (..)      = .../Chapter-01-First-Analysis/
#              two levels up  (../..)   = (root)/
#              then into A-Data/        = (root)/A-Data/
#              then the file            = (root)/A-Data/Height-Weight.csv
#
#          os.path.normpath() collapses the ../ tokens into a clean path.
# ===========================================================================
COLUMN_HEIGHT  = 'Height_cm'            # Exact column name for height in the CSV
COLUMN_WEIGHT  = 'Weight_kg'            # Exact column name for weight in the CSV
INPUT_FILENAME = 'Height-Weight.csv'    # Input CSV filename

input_path = os.path.normpath(
    os.path.join(SCRIPT_DIR, '..', '..', 'A-Data', INPUT_FILENAME)   # two levels up
)
print(f'[DEBUG] Input file     : {input_path}')
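
# ---------------------------------------------------------------------------
# Step 4b - Illustrative sketch (an addition, not part of the original flow)
#           Shows how each '..' token climbs one folder and how
#           os.path.normpath() collapses the tokens into a clean path.
#           The helper name levels_up() is hypothetical and is used only
#           for this illustration.
#
#           Possible use with this script's names:
#               root_folder = levels_up(SCRIPT_DIR, 2)
#               print(f'[DEBUG] Root folder    : {root_folder}')
# ---------------------------------------------------------------------------
import os                               # Already imported above; repeated so the sketch stands alone


def levels_up(start_folder, levels):
    """Return start_folder after climbing `levels` folders towards the root."""
    climb = ['..'] * levels             # One '..' token per level to climb
    return os.path.normpath(os.path.join(start_folder, *climb))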

# ===========================================================================
# Step 5 - Build the output folder path and create it if it does not exist
#
#          Folder layout from SCRIPT_DIR upward:
#              two levels up  (../..)   = (root)/
#              then into C-Results/     = (root)/C-Results/
#
#          os.makedirs(..., exist_ok=True):
#              - Creates the folder (and any missing parents) if needed.
#              - Does nothing silently if the folder already exists.
# ===========================================================================
OUTPUT_FILENAME = 'Summary-Session-V02.csv'   # Output CSV filename

output_folder = os.path.normpath(
    os.path.join(SCRIPT_DIR, '..', '..', 'C-Results')   # two levels up
)
output_path = os.path.join(output_folder, OUTPUT_FILENAME)

os.makedirs(output_folder, exist_ok=True)     # Create folder if missing
print(f'[DEBUG] Output folder  : {output_folder}')
print(f'[DEBUG] Output file    : {output_path}')

# ===========================================================================
# Step 6 - Load the CSV into a pandas DataFrame
#          encoding='utf-8' handles accented or special characters in names.
#          A missing file raises FileNotFoundError with a clear path in the message.
# ===========================================================================
print('\n[INFO] Loading data from CSV...')
df = pd.read_csv(input_path, encoding='utf-8')    # df = main data table
print(f'[DEBUG] Rows loaded    : {len(df)}')
print(f'[DEBUG] Columns found  : {list(df.columns)}')
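
# ---------------------------------------------------------------------------
# Step 6b - Illustrative sketch (an addition, not part of the original flow)
#           Column names are case-sensitive, so a CSV that uses 'height_cm'
#           would fail later with a KeyError. One possible early check is
#           sketched below. The helper name missing_columns() is hypothetical.
#
#           Possible use with this script's names:
#               missing = missing_columns(df.columns, [COLUMN_HEIGHT, COLUMN_WEIGHT])
#               if missing:
#                   raise KeyError(f'Columns missing from CSV: {missing}')
# ---------------------------------------------------------------------------
def missing_columns(found_columns, required_columns):
    """Return the required names that do not appear in found_columns."""
    found = set(found_columns)          # Exact, case-sensitive comparison
    return [name for name in required_columns if name not in found]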

# ===========================================================================
# Step 7 - Print a quick data preview (first 5 rows)
#          Confirms the file loaded and the column names match expectations.
# ===========================================================================
print('\n=== Data Preview (first 5 rows) ===')
print(df.head())

# ===========================================================================
# Step 8 - Compute descriptive statistics for Height and Weight
#
#          .mean()  -> arithmetic average of all values in the column
#          .std()   -> sample standard deviation (pandas default: ddof=1)
#                      ddof=1 divides by (n-1) rather than n, giving an
#                      unbiased estimate when working with a sample.
# ===========================================================================
mean_height = df[COLUMN_HEIGHT].mean()     # Average height across all rows
mean_weight = df[COLUMN_WEIGHT].mean()     # Average weight across all rows
std_height  = df[COLUMN_HEIGHT].std()      # Height spread / variability
std_weight  = df[COLUMN_WEIGHT].std()      # Weight spread / variability
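
# ---------------------------------------------------------------------------
# Step 8b - Illustrative sketch (an addition, not part of the original flow)
#           Spells out what ddof=1 means by computing the sample standard
#           deviation by hand in plain Python. The helper name sample_std()
#           is hypothetical. It assumes at least two values.
#
#           Possible use with this script's names:
#               sample_std(df[COLUMN_HEIGHT].tolist())
#           which should agree with std_height up to floating-point rounding.
# ---------------------------------------------------------------------------
import math


def sample_std(values):
    """Sample standard deviation: divide the squared deviations by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - 1))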

# ===========================================================================
# Step 9 - Print the summary statistics
# ===========================================================================
print('\n=== Summary Statistics ===')
print(f'  Average Height : {mean_height:.1f} cm')
print(f'  Std Dev Height : {std_height:.1f} cm')
print(f'  Average Weight : {mean_weight:.1f} kg')
print(f'  Std Dev Weight : {std_weight:.1f} kg')

# ===========================================================================
# Step 10 - Compute the Pearson correlation coefficient
#
#           .corr() measures the linear relationship between two numeric Series.
#           Formula:  r = cov(X, Y) / (std(X) * std(Y))
#
#           Result range and meaning:
#               r >= 0.8   -> strong positive correlation
#               r >= 0.5   -> moderate positive correlation
#               r >= 0.0   -> weak or no positive correlation
#               r <  0.0   -> negative correlation (taller -> lighter)
# ===========================================================================
corr = df[COLUMN_HEIGHT].corr(df[COLUMN_WEIGHT])   # Pearson r (default method)
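
# ---------------------------------------------------------------------------
# Step 10b - Illustrative sketch (an addition, not part of the original flow)
#            Applies the formula r = cov(X, Y) / (std(X) * std(Y)) by hand.
#            The (n - 1) factors in the covariance and in both standard
#            deviations cancel, so only sums of deviations are needed.
#            The helper name pearson_r() is hypothetical. It assumes two
#            lists of equal length and that neither list is constant.
#
#            Possible use with this script's names:
#                pearson_r(df[COLUMN_HEIGHT].tolist(), df[COLUMN_WEIGHT].tolist())
#            which should agree with corr up to floating-point rounding.
# ---------------------------------------------------------------------------
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)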

# ===========================================================================
# Step 11 - Print the correlation and a plain-language interpretation
# ===========================================================================
print(f'\n  Pearson Correlation (Height vs Weight) : {corr:.2f}')

if corr >= 0.8:
    print('  [INFO] Strong positive correlation.')
elif corr >= 0.5:
    print('  [INFO] Moderate positive correlation.')
elif corr >= 0.0:
    print('  [INFO] Weak or no positive correlation.')
else:
    print('  [INFO] Negative correlation detected.')

# ===========================================================================
# Step 12 - Assemble all results into a compact summary DataFrame
#           Two columns: Metric (label) and Value (rounded number).
#           round(..., 4) keeps the CSV tidy without losing meaningful precision.
# ===========================================================================
summary = pd.DataFrame({
    'Metric': [
        'Mean Height (cm)',
        'Std Dev Height (cm)',
        'Mean Weight (kg)',
        'Std Dev Weight (kg)',
        'Pearson Correlation',
    ],
    'Value': [
        round(mean_height, 4),
        round(std_height,  4),
        round(mean_weight, 4),
        round(std_weight,  4),
        round(corr,        4),
    ]
})

print('\n=== Results Table ===')
print(summary.to_string(index=False))          # Print cleanly without row numbers

# ===========================================================================
# Step 13 - Save the summary DataFrame to CSV
#           index=False omits the automatic integer row-number column.
#           encoding='utf-8' ensures the file opens correctly everywhere.
# ===========================================================================
summary.to_csv(output_path, index=False, encoding='utf-8')

# ===========================================================================
# Step 14 - Print a success message with the full output file path
# ===========================================================================


print('\n[NOTE] The input filename is hard-coded in INPUT_FILENAME (Step 4).')
print('A-Data    - Where the input data files are stored')
print('B-Engines - Where the Python code is stored')
print('C-Results - Where the results of the program are stored')
print(f'\n[DONE] Results saved to: {output_path}')