""" Duplicate File Detector - Cryptographic Hash-Based Duplicate Scanner ====================================================================== File : 02-Find-Duplicates-V03.py Version : 2.3.0 Author : Yahya Nazer Copyright : (c) 2025 Chatbizdb.com - Yahya Nazer License : Proprietary Email : contact@chatbizdb.com Status : Production Date : 2026-05-13 Objective --------- Recursively scan a user-selected folder, compute a cryptographic hash (MD5 by default) for every file, group files that share the same hash as confirmed duplicates, and export two reports: 1. CSV -- Excel-compatible with status column, duplicate group numbers, =HYPERLINK() formulas, full hash strings, and a terminal open-command fallback column. 2. HTML -- Browser-friendly styled report with a statistics dashboard, colour-coded duplicate / unique rows, clickable file links, and truncated hashes for readability. Both output files are written to a TIMESTAMPED subfolder created PARALLEL to the scanned folder (not next to the script). How Duplicate Detection Works ------------------------------ Two files are confirmed duplicates if and only if their cryptographic hashes are identical. The process: Step A Read every file in binary mode ('rb') in fixed-size chunks. Step B Feed each chunk into a hash object (hashlib.md5 or sha256). Step C After reading the whole file, call .hexdigest() to get a hex string. Step D Group all file paths by their hash using a defaultdict(list). Step E Any hash with >= 2 file paths is a duplicate group. Why MD5? MD5 produces a 128-bit (32 hex character) hash. It is fast and sufficient for file deduplication (distinguishing accidental copies). It is NOT cryptographically secure for security purposes (collision attacks exist), but collisions between real files with different content are practically impossible in normal use. Why chunk reading (chunk_size=8192)? Reading the whole file at once with f.read() would load large files (e.g. a 4 GB video) entirely into RAM. 
  Reading in 8 KB chunks keeps memory usage constant regardless of file size.

Walrus Operator (:=) Used in get_file_hash()
----------------------------------------------
    while chunk := f.read(chunk_size):
        hash_obj.update(chunk)

  The walrus operator (:=, Python 3.8+) assigns the result of f.read() to
  'chunk' AND evaluates it as the while condition in one expression. When
  f.read() returns b'' (empty bytes -- end of file), the assignment produces
  a falsy value and the loop exits.
  This is more concise than the explicit read-then-check alternative:

    chunk = f.read(chunk_size)
    while chunk:
        hash_obj.update(chunk)
        chunk = f.read(chunk_size)

defaultdict(list) in scan_for_duplicates()
--------------------------------------------
    hash_dict = defaultdict(list)
    hash_dict[file_hash].append(file_data)

  A regular dict would raise KeyError on the first access to a new hash key.
  defaultdict(list) automatically creates an empty list [] for any new key,
  so .append() always works without a prior check or setdefault() call.

Output Folder Layout (relative to the scanned folder's PARENT)
----------------------------------------------------------------
  <parent of scanned folder>/
  |
  +-- C-Reports/
      +-- duplicate-report-YYYY-MM-DD--HH-MM/
          +-- 02-duplicate-files-YYYY-MM-DD--HH-MM.csv
          +-- 02-duplicate-files-YYYY-MM-DD--HH-MM.html

  Example:
    Script location:  /home/user/scripts/02-Find-Duplicates-V03.py
    Selected folder:  /home/user/Documents/Photos/
    Report created:   /home/user/Documents/C-Reports/duplicate-report-2026-05-13--14-30/

Features
---------
- Cross-platform: Windows, macOS, Linux.
- Recursive scan: finds duplicates across all subfolders.
- Chunk-based file hashing: constant memory usage for any file size.
- Duplicate grouping: files sharing a hash are assigned the same Group number.
- Sort order in both CSV and HTML: duplicates listed first, then by group number.
- utf-8-sig encoding for CSV: adds a BOM so Excel auto-detects UTF-8 on open.
- HTML statistics dashboard: four stat boxes (total, duplicates, unique, groups).
- Colour-coded HTML rows: duplicate rows highlighted in red.
- Sticky table header in HTML: stays visible when scrolling long reports.

Dependencies
-------------
All standard library -- no pip install required:
  os, csv, hashlib, tkinter, datetime, collections.defaultdict,
  platform, subprocess, sys

Function Summary
-----------------
  Function                Purpose
  ----------------------  --------------------------------------------------
  print_script_info()     Print version banner at startup.
  select_folder()         GUI folder-picker dialog; returns selected path.
  make_report_dir()       Build the timestamped report folder and return
                          (report_dir, timestamp).
  get_csv_file_path()     Derive and return the full CSV output path.
  get_html_file_path()    Derive and return the full HTML output path.
  get_file_hash()         Read a file in chunks and return its hex hash.
  create_file_url()       Convert an absolute path to an OS-correct file:// URL.
  get_file_type()         Extract the file extension as a lowercase string.
  get_file_info()         Read os.stat() metadata: size, modified, created.
  scan_for_duplicates()   Walk the folder tree, hash every file, group
                          duplicates; return an annotated file list.
  save_to_csv()           Write the annotated file list to CSV.
  save_to_html()          Write the annotated file list to a styled HTML report.
  open_file()             Open a file with the OS default application.
  main()                  Orchestrate all steps end-to-end.

CSV Columns
------------
  Selector         : 'DUPLICATE' or 'UNIQUE' -- easy to filter in Excel.
  Index            : 1-based row number (after duplicate-first sort).
  File Name        : =HYPERLINK() formula -- clickable in Excel/Numbers.
  File Path        : Full absolute path string.
  Size (bytes)     : Integer byte count from os.stat().
  Date Modified    : Last-modified timestamp (YYYY-MM-DD HH:MM:SS).
  Date Created     : Creation time (Windows) or metadata-change time (Unix).
  File Type        : Lowercase extension without the dot.
  Duplicate Group  : Integer group number for duplicates; blank for unique.
  Hash             : Full MD5 hex string (32 characters).
  Duplicate Count  : Number of copies sharing this hash (1 for unique files).
  Open Command     : Terminal command to open the file on the current OS.

Notes
-----
- print_flag = True controls all console output. Set to False for silent
  operation when the script is imported as a module.
- encoding='utf-8-sig' writes a UTF-8 BOM (byte order mark) at the start of
  the CSV file. Excel on Windows uses this marker to auto-detect UTF-8
  encoding and display accented characters correctly. Without the BOM, older
  Excel versions often show non-ASCII characters as garbled text.
- The hash is truncated to 16 characters in the HTML display column to keep
  the table readable, but the full 32-character hash is stored in the CSV.
- Files that cannot be read (permission denied, locked) are skipped with a
  [WARN] message; get_file_hash() returns None and they are excluded from
  hash_dict.
"""

__version__ = "2.3.0"
__author__ = "Yahya Nazer"
__copyright__ = "Copyright (c) 2025 Chatbizdb.com - Yahya Nazer"
__license__ = "Proprietary"
__maintainer__ = "Yahya Nazer"
__email__ = "contact@chatbizdb.com"
__status__ = "Production"

# ===========================================================================
# Imports -- all standard library, no pip install required
# ===========================================================================
import os          # File system operations and path handling
import csv         # CSV writer for tabular output
import sys         # stdout reconfiguration
import hashlib     # Cryptographic hashing (MD5, SHA-256, ...)
import platform    # OS detection (Windows / Darwin / Linux)
import subprocess  # Launch OS-native file openers
from datetime import datetime        # Timestamp generation and formatting
from collections import defaultdict  # Auto-initialising dictionary (list variant)
import tkinter as tk                 # GUI toolkit for the folder-picker dialog
from tkinter import filedialog       # Folder dialog widget

# Reconfigure stdout to UTF-8 so file names with special characters print
# correctly on Windows terminals.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

# ===========================================================================
# Script-level constants
# ===========================================================================
SCRIPT_NAME = "Duplicate File Detector"
SCRIPT_VERSION = __version__
SCRIPT_FILE = os.path.basename(__file__)   # Filename only
SCRIPT_PATH = os.path.abspath(__file__)    # Full absolute path of this script
SCRIPT_DIR = os.path.dirname(SCRIPT_PATH)  # Folder containing this script

# Global print flag -- set False for silent / headless operation.
print_flag = True


# ===========================================================================
# Function: print_script_info()
# ===========================================================================
def print_script_info() -> None:
    """
    Print a version and environment banner to the console at startup.

    Displays: script name, version, copyright, filename, absolute path,
    Python version, OS platform, and current timestamp.
    Controlled by the global print_flag.
    """
    if not print_flag:
        return
    print('=' * 70)
    print(f'{SCRIPT_NAME}')
    print('=' * 70)
    print(f'Version   : {SCRIPT_VERSION}')
    print(f'Copyright : {__copyright__}')
    print(f'File      : {SCRIPT_FILE}')
    print(f'Path      : {SCRIPT_PATH}')
    print(f'Python    : {sys.version.split()[0]}')
    print(f'Platform  : {platform.system()} {platform.release()}')
    print(f'Date      : {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
    print('=' * 70)


# ===========================================================================
# Function: select_folder()
# ===========================================================================
def select_folder() -> str:
    """
    Open a GUI folder-picker dialog and return the selected folder path.

    Uses tkinter's askdirectory() which opens the OS-native folder browser.
    Returns an empty string '' if the user cancels the dialog.

    Steps:
        1. Create an invisible tkinter root window (required before dialogs).
        2. Hide it immediately with withdraw().
        3. Open the askdirectory() dialog and block until user acts.
        4. Destroy the hidden root window.
        5. Return the chosen path, or '' if cancelled.

    Returns:
        str: Absolute path of the selected folder, or '' if cancelled.
    """
    if print_flag:
        print('[INFO] Opening folder selection dialog...')

    root = tk.Tk()
    root.withdraw()  # Hide the blank root window immediately
    folder_path = filedialog.askdirectory(
        title='Select folder to scan for duplicates'
    )
    root.destroy()   # Release the hidden root window once the dialog closes

    if print_flag:
        print(f'[INFO] Selected folder: {folder_path}')
    return folder_path


# ===========================================================================
# Function: make_report_dir()
# ===========================================================================
def make_report_dir(folder_path: str) -> tuple[str, str]:
    """
    Build the timestamped report folder path, create it, and return it.

    The report folder is created PARALLEL to the selected folder (inside
    the selected folder's parent, as a sibling of the selected folder), so
    the scan never writes into the folder being examined.

    Folder structure produced:
        <parent of selected folder>/
        +-- C-Reports/
            +-- duplicate-report-YYYY-MM-DD--HH-MM/   <- returned as report_dir

    Steps:
        1. Resolve the parent directory of the selected folder.
        2. Build the C-Reports path and create it if missing.
        3. Generate a timestamp string (YYYY-MM-DD--HH-MM).
        4. Build and create the timestamped report subfolder.
        5. Return (report_dir, timestamp).

    Args:
        folder_path (str): Absolute path of the folder being scanned.

    Returns:
        tuple[str, str]: (report_dir, timestamp)
            report_dir -- full path of the timestamped output folder.
            timestamp  -- 'YYYY-MM-DD--HH-MM' string used in filenames.
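
Example (a minimal sketch, not part of this script): the path derivation
described in the steps above can be reproduced without touching the file
system. The helper name preview_report_dir and the fixed datetime are
assumptions made purely for illustration.

```python
import os
from datetime import datetime


def preview_report_dir(folder_path, now):
    # Hypothetical helper: mirrors the path logic of make_report_dir()
    # but creates nothing on disk.
    parent_dir = os.path.dirname(os.path.abspath(folder_path))
    timestamp = now.strftime('%Y-%m-%d--%H-%M')
    report_dir = os.path.join(parent_dir, 'C-Reports',
                              f'duplicate-report-{timestamp}')
    return report_dir, timestamp


# Sample folder (assumed); os.sep keeps the sketch portable.
selected = os.path.join(os.sep, 'home', 'user', 'Documents', 'Photos')
report_dir, stamp = preview_report_dir(selected, datetime(2026, 5, 13, 14, 30))
print(stamp)       # 2026-05-13--14-30
print(report_dir)  # path ending in C-Reports/duplicate-report-2026-05-13--14-30
```

Passing the clock in as an argument keeps the sketch deterministic; the
script itself calls datetime.now() inside make_report_dir().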
""" # Parent of the SELECTED folder (not the script folder) parent_dir = os.path.dirname(os.path.abspath(folder_path)) # C-Reports folder sits as a sibling of the selected folder main_report_dir = os.path.join(parent_dir, 'C-Reports') if not os.path.exists(main_report_dir): os.makedirs(main_report_dir) if print_flag: print(f'[INFO] Created C-Reports directory: {main_report_dir}') # Timestamp used in both the subfolder name and the output filenames timestamp = datetime.now().strftime('%Y-%m-%d--%H-%M') # Timestamped subfolder -- each run gets its own folder report_dir = os.path.join(main_report_dir, f'duplicate-report-{timestamp}') if not os.path.exists(report_dir): os.makedirs(report_dir) if print_flag: print(f'[INFO] Created report directory: {report_dir}') else: if print_flag: print(f'[INFO] Report directory already exists: {report_dir}') return report_dir, timestamp # =========================================================================== # Function: get_csv_file_path() # =========================================================================== def get_csv_file_path(folder_path: str) -> str: """ Derive the full output path for the CSV report file. Delegates folder creation to make_report_dir() and appends the CSV filename using the same timestamp. Filename pattern: 02-duplicate-files-YYYY-MM-DD--HH-MM.csv Args: folder_path (str): Absolute path of the folder being scanned. Returns: str: Full absolute path where the CSV file will be written. 
""" if print_flag: print('[INFO] Deriving CSV output path...') report_dir, timestamp = make_report_dir(folder_path) csv_filename = f'02-duplicate-files-{timestamp}.csv' csv_file_path = os.path.join(report_dir, csv_filename) if print_flag: print(f'[INFO] CSV will be saved as: {csv_file_path}') return csv_file_path # =========================================================================== # Function: get_html_file_path() # =========================================================================== def get_html_file_path(folder_path: str) -> str: """ Derive the full output path for the HTML report file. Delegates folder creation to make_report_dir() and appends the HTML filename using the same timestamp. Calling make_report_dir() twice within the same minute produces the same folder path because the exist_ok logic silently skips re-creation. Filename pattern: 02-duplicate-files-YYYY-MM-DD--HH-MM.html Args: folder_path (str): Absolute path of the folder being scanned. Returns: str: Full absolute path where the HTML file will be written. """ if print_flag: print('[INFO] Deriving HTML output path...') report_dir, timestamp = make_report_dir(folder_path) html_filename = f'02-duplicate-files-{timestamp}.html' html_file_path = os.path.join(report_dir, html_filename) if print_flag: print(f'[INFO] HTML will be saved as: {html_file_path}') return html_file_path # =========================================================================== # Function: get_file_hash() # =========================================================================== def get_file_hash(file_path: str, hash_algo: str = 'md5', chunk_size: int = 8192) -> str | None: """ Compute and return the cryptographic hash of a file as a hex string. The file is read in fixed-size chunks rather than all at once, keeping RAM usage constant regardless of file size. Algorithm choice: 'md5' -> 32 hex characters; fast; suitable for deduplication. 'sha256' -> 64 hex characters; slower; higher collision resistance. 
        Any fixed-length algorithm supported by hashlib.new() is accepted.

    Walrus operator (:=) used in the read loop:
        while chunk := f.read(chunk_size):
            hash_obj.update(chunk)
        := assigns f.read(chunk_size) to 'chunk' AND evaluates the result
        as the while condition in one expression. When f.read() returns b''
        (empty bytes = end of file), the loop exits automatically. This is
        more concise than the explicit read-then-check loop.

    Steps:
        1. Create a new hash object for the chosen algorithm.
        2. Open the file in binary mode ('rb').
        3. Read and feed chunks into the hash object until EOF.
        4. Return the hex digest string.
        5. On any read error (permission denied, locked), return None.

    Args:
        file_path  (str): Full absolute path to the file.
        hash_algo  (str): Hash algorithm name. Default: 'md5'.
        chunk_size (int): Bytes per read call. Default: 8192 (8 KB).

    Returns:
        str | None: Hex digest string (32 characters for MD5, 64 for
                    SHA-256), or None on read error.
    """
    hash_obj = hashlib.new(hash_algo)  # Create a fresh hash accumulator

    try:
        with open(file_path, 'rb') as f:  # 'rb' = binary read (required for hashing)
            # Walrus operator: read a chunk, assign to 'chunk', check if non-empty
            while chunk := f.read(chunk_size):
                hash_obj.update(chunk)  # Feed chunk bytes into the hash

        file_hash = hash_obj.hexdigest()  # Finalise and get the hex string

        if print_flag:
            # Print only first 16 chars to keep console output concise
            print(f'[DEBUG] Hash ({hash_algo}): {file_hash[:16]}... {file_path}')
        return file_hash

    except OSError as e:  # IOError is an alias of OSError in Python 3
        # Unreadable files (permission denied, broken symlink, locked) are
        # skipped with a warning -- they will not appear in hash_dict at all.
        if print_flag:
            print(f'[WARN] Cannot hash {file_path}: {e}')
        return None


# ===========================================================================
# Function: create_file_url()
# ===========================================================================
def create_file_url(file_path: str) -> str:
    """
    Convert an absolute file path to an OS-correct file:// URL.

    File URL format by OS:
        Windows : file:///C:/Users/...  (three slashes; backslashes -> forward)
        macOS   : file:///Users/...     (two slashes + leading slash of abs_path)
        Linux   : file:///home/...      (same as macOS)

    chr(92) is the backslash character; used instead of a literal backslash
    to avoid escape-sequence ambiguity in a regular (non-raw) string.

    Special characters in the path (spaces, '#', '%') are not
    percent-encoded; the path is embedded in the URL as-is.

    Args:
        file_path (str): Path to the file (relative or absolute).

    Returns:
        str: file:// URL suitable as an HTML href or CSV =HYPERLINK() argument.
    """
    abs_path = os.path.abspath(file_path)

    if platform.system() == 'Windows':
        file_url = f'file:///{abs_path.replace(chr(92), "/")}'
    else:
        # macOS and Linux: abs_path already starts with '/', giving file:///path
        file_url = f'file://{abs_path}'
    return file_url


# ===========================================================================
# Function: get_file_type()
# ===========================================================================
def get_file_type(filename: str) -> str:
    """
    Extract the file extension and return it as a lowercase string.

    Steps:
        1. os.path.splitext('report.PDF') -> ('report', '.PDF')
        2. .lower()                       -> '.pdf'
        3. .lstrip('.')                   -> 'pdf'
        4. Return 'No Extension' if the result is empty.

    Args:
        filename (str): File name (with or without directory prefix).

    Returns:
        str: Lowercase extension without the dot, or 'No Extension'.
    """
    _, ext = os.path.splitext(filename)
    file_type = ext.lower().lstrip('.')
    return file_type if file_type else 'No Extension'


# ===========================================================================
# Function: get_file_info()
# ===========================================================================
def get_file_info(file_path: str) -> dict:
    """
    Read file system metadata for a single file using os.stat().

    os.stat() attributes used:
        st_size  : File size in bytes.
        st_mtime : Last modification time (Unix timestamp float).
        st_ctime : Windows -- true creation time.
                   Unix    -- last METADATA CHANGE time (not creation time).

    On OSError the function returns safe fallback values (0 for the size,
    'Unknown' for both dates) so one unreadable file does not abort the
    entire scan.

    Args:
        file_path (str): Full absolute path to the file.

    Returns:
        dict with keys:
            'size'     (int): File size in bytes (0 on error).
            'modified' (str): 'YYYY-MM-DD HH:MM:SS' or 'Unknown'.
            'created'  (str): 'YYYY-MM-DD HH:MM:SS' or 'Unknown'.
    """
    try:
        stat_info = os.stat(file_path)
        file_size = stat_info.st_size
        mod_time = datetime.fromtimestamp(stat_info.st_mtime)
        create_time = datetime.fromtimestamp(stat_info.st_ctime)
        return {
            'size'    : file_size,
            'modified': mod_time.strftime('%Y-%m-%d %H:%M:%S'),
            'created' : create_time.strftime('%Y-%m-%d %H:%M:%S'),
        }
    except OSError as e:
        if print_flag:
            print(f'[WARN] Could not read metadata for {file_path}: {e}')
        return {'size': 0, 'modified': 'Unknown', 'created': 'Unknown'}


# ===========================================================================
# Function: scan_for_duplicates()
# ===========================================================================
def scan_for_duplicates(folder_path: str, hash_algo: str = 'md5') -> list[dict]:
    """
    Recursively scan a folder, hash every file, and group duplicates.

    Algorithm overview:
        1. Walk the entire folder tree with os.walk().
        2. For every file, compute its hash with get_file_hash().
        3. Store file dicts in hash_dict keyed by hash using defaultdict(list).
           Files with the same hash are automatically grouped in the same list.
        4. After all files are processed, iterate over hash_dict:
             - Hashes with >= 2 files -> mark all as DUPLICATE, assign a
               shared group number.
             - Hashes with exactly 1 file -> mark as UNIQUE, group = ''.
        5. Return the flat file_list with status, group, and duplicate_count
           added to every dict.

    defaultdict(list) explained:
        A regular dict raises KeyError on the first access to a new key.
        defaultdict(list) automatically creates an empty list [] for any
        new key on first access, so hash_dict[file_hash].append(...)
        always works without a prior existence check or setdefault() call.

    Progress reporting:
        A progress counter is printed every 10 files so the user can see
        the scan is running during large folder scans.

    Sort order in the returned list:
        The list is NOT sorted here; sorting is applied in save_to_csv()
        and save_to_html() so each output can sort independently if needed.

    Args:
        folder_path (str): Absolute path of the folder to scan.
        hash_algo   (str): Hash algorithm passed to get_file_hash().
                           Default 'md5'.

    Returns:
        list[dict]: One dict per file. Each dict contains:
            filename, path, url, size, modified, created, type, hash,
            status ('DUPLICATE' or 'UNIQUE'), group (int group number or ''),
            duplicate_count (int).
    """
    if print_flag:
        print(f'[INFO] Scanning for duplicates: {folder_path}')
        print(f'[INFO] Hash algorithm: {hash_algo}')

    # defaultdict(list) groups all file dicts under their hash key
    hash_dict = defaultdict(list)
    file_count = 0

    for root, dirs, files in os.walk(folder_path):
        if print_flag:
            print(f'[DEBUG] Scanning directory: {root} ({len(files)} files)')

        for filename in files:
            file_path = os.path.join(root, filename)
            file_count += 1

            # Progress update every 10 files during large scans
            if print_flag and file_count % 10 == 0:
                print(f'[INFO] Processed {file_count} files...')

            # Hash the file; None means the file was unreadable -- skip it
            file_hash = get_file_hash(file_path, hash_algo)
            if file_hash is None:
                continue

            file_info = get_file_info(file_path)

            # Append this file's data to the list for its hash
            # defaultdict creates a new empty list automatically on first access
            hash_dict[file_hash].append({
                'filename': filename,
                'path'    : file_path,
                'url'     : create_file_url(file_path),
                'size'    : file_info['size'],
                'modified': file_info['modified'],
                'created' : file_info['created'],
                'type'    : get_file_type(filename),
                'hash'    : file_hash,
            })

    if print_flag:
        print(f'[INFO] Total files scanned : {file_count}')
        print(f'[INFO] Unique hashes found : {len(hash_dict)}')

    # ------------------------------------------------------------------
    # Build the flat output list, assigning status and group numbers
    # ------------------------------------------------------------------
    file_list = []
    duplicate_count = 0
    duplicate_groups = 0
    group_number = 1  # Group numbers start at 1; incremented per duplicate group

    for file_hash, files in hash_dict.items():
        if len(files) > 1:
            # All files in this hash bucket are confirmed duplicates
            duplicate_groups += 1
            duplicate_count += len(files)
            for file_data in files:
                file_data['status'] = 'DUPLICATE'
                file_data['group'] = group_number
                file_data['duplicate_count'] = len(files)
                file_list.append(file_data)
            group_number += 1  # Increment so the next group gets a new number
        else:
            # Only one file has this hash -- it is unique
            file_data = files[0]
            file_data['status'] = 'UNIQUE'
            file_data['group'] = ''  # Blank group for unique files
            file_data['duplicate_count'] = 1
            file_list.append(file_data)

    if print_flag:
        unique_count = len(file_list) - duplicate_count
        print(f'[INFO] Duplicate files  : {duplicate_count}')
        print(f'[INFO] Duplicate groups : {duplicate_groups}')
        print(f'[INFO] Unique files     : {unique_count}')

    return file_list


# ===========================================================================
# Function: save_to_csv()
# ===========================================================================
def save_to_csv(file_list: list[dict], csv_file_path: str) -> None:
    """
    Write the annotated file list to a CSV file with Excel hyperlinks.

    Sort order applied before writing:
        Primary key:   status -- DUPLICATE rows first (sort key 0), UNIQUE last (1).
        Secondary key: group number -- duplicate groups appear together in order.
                       Unique files (group='') use 999999 as a sentinel to sort last.

    Sort key lambda explained:
        key=lambda x: (
            0 if x['status'] == 'DUPLICATE' else 1,  # DUPLICATE=0, UNIQUE=1
            x['group'] if x['group'] else 999999     # numeric group or large sentinel
        )
        Python sorts tuples element by element, so primary sort is by status,
        ties broken by group number.

    encoding='utf-8-sig':
        Writes a UTF-8 BOM (Byte Order Mark: EF BB BF) at the start of the
        file. Excel on Windows uses this marker to auto-detect UTF-8 encoding,
        ensuring accented characters display correctly without manual import
        settings. Regular 'utf-8' (no BOM) often renders garbled in older
        Excel versions.

    CSV Columns:
        Selector       : 'DUPLICATE' or 'UNIQUE' -- filter this in Excel.
        Index          : 1-based row number after sorting.
        File Name      : =HYPERLINK("file://...", "name") -- clickable in Excel.
        File Path      : Plain absolute path.
        Size (bytes)   : Integer file size.
        Date Modified  : Last-modified timestamp.
        Date Created   : Creation / metadata-change timestamp.
        File Type      : Lowercase extension.
        Duplicate Group: Integer group number or blank.
        Hash           : Full 32-character MD5 hex string.
        Duplicate Count: Number of copies sharing this hash.
        Open Command   : OS-specific terminal command to open the file.

    Args:
        file_list     (list[dict]): Output of scan_for_duplicates().
        csv_file_path (str): Full path where the CSV will be written.

    Raises:
        Exception: Re-raised after logging if the file cannot be written.
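
Example (a small sketch with made-up rows, not output of a real scan): the
duplicate-first sort key and the BOM behaviour of 'utf-8-sig' described
above can be checked in isolation. The file names and group numbers are
sample data assumed for the demo.

```python
import codecs

rows = [
    {'filename': 'c.txt', 'status': 'UNIQUE', 'group': ''},
    {'filename': 'b.txt', 'status': 'DUPLICATE', 'group': 2},
    {'filename': 'a.txt', 'status': 'DUPLICATE', 'group': 1},
    {'filename': 'd.txt', 'status': 'DUPLICATE', 'group': 1},
]

# Same key shape as save_to_csv(): status first, then group number,
# with a large sentinel so blank groups sort last.
ordered = sorted(
    rows,
    key=lambda x: (
        0 if x['status'] == 'DUPLICATE' else 1,
        x['group'] if x['group'] else 999_999,
    ),
)
names = [r['filename'] for r in ordered]
print(names)  # ['a.txt', 'd.txt', 'b.txt', 'c.txt']

# 'utf-8-sig' prepends the UTF-8 byte order mark when encoding.
has_bom = 'x'.encode('utf-8-sig').startswith(codecs.BOM_UTF8)
print(has_bom)  # True
```

Because sorted() is stable, files inside one group keep the order in which
the scan found them.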
""" if print_flag: print(f'[INFO] Saving CSV: {csv_file_path}') headers = [ 'Selector', 'Index', 'File Name', 'File Path', 'Size (bytes)', 'Date Modified', 'Date Created', 'File Type', 'Duplicate Group', 'Hash', 'Duplicate Count', 'Open Command', ] try: # utf-8-sig writes a BOM so Excel auto-detects UTF-8 on open with open(csv_file_path, 'w', newline='', encoding='utf-8-sig') as csvfile: writer = csv.writer(csvfile) writer.writerow(headers) # Sort: DUPLICATE rows first, then by group number sorted_list = sorted( file_list, key=lambda x: ( 0 if x['status'] == 'DUPLICATE' else 1, x['group'] if x['group'] else 999_999, ) ) for i, file_data in enumerate(sorted_list, 1): # OS-specific terminal command (fallback when hyperlinks fail) if platform.system() == 'Windows': open_command = f'start "" "{file_data["path"]}"' elif platform.system() == 'Darwin': open_command = f'open "{file_data["path"]}"' else: open_command = f'xdg-open "{file_data["path"]}"' row = [ file_data['status'], i, # =HYPERLINK(url, display_text) -- Excel formula f'=HYPERLINK("{file_data["url"]}", "{file_data["filename"]}")', file_data['path'], file_data['size'], file_data['modified'], file_data['created'], file_data['type'], file_data['group'] if file_data['group'] else '', file_data['hash'], # Full 32-char MD5 string file_data['duplicate_count'], open_command, ] writer.writerow(row) if print_flag: print(f'[INFO] CSV saved: {len(file_list)} rows written.') print('[INFO] Tip: if Excel hyperlinks do not work, ' 'use the Open Command in column M.') except Exception as e: if print_flag: print(f'[ERROR] Could not save CSV: {e}') raise # =========================================================================== # Function: save_to_html() # =========================================================================== def save_to_html(file_list: list[dict], html_file_path: str) -> None: """ Write the annotated file list to a styled HTML report. 

    HTML features:
        - Statistics dashboard: four stat boxes (total, duplicates, unique,
          duplicate groups) using CSS Grid layout.
        - Report information box: generation timestamp and script name.
        - Sortable-ready table with sticky header (position: sticky; top: 0)
          so the header row stays visible when scrolling long reports.
        - Colour-coded rows: duplicate rows highlighted red (#ffe6e6);
          even unique rows use alternating light grey (#f2f2f2).
        - Status badges: coloured inline elements showing DUPLICATE (red)
          or UNIQUE (green).
        - Group badges: blue elements showing 'Group N' for duplicates.
        - Hash column: truncated to first 16 characters for readability.

    Statistics computed before building the HTML:
        total_files      : len(file_list)
        duplicate_files  : count of rows where status == 'DUPLICATE'
        unique_files     : total_files - duplicate_files
        duplicate_groups : count of distinct non-empty group values

    Sort order: same duplicate-first, group-number sort as in save_to_csv().

    Args:
        file_list      (list[dict]): Output of scan_for_duplicates().
        html_file_path (str): Full path where the HTML will be written.

    Raises:
        Exception: Re-raised after logging if the file cannot be written.
    """
    if print_flag:
        print(f'[INFO] Saving HTML: {html_file_path}')

    # Pre-compute statistics for the dashboard boxes
    total_files = len(file_list)
    duplicate_files = sum(1 for f in file_list if f['status'] == 'DUPLICATE')
    unique_files = total_files - duplicate_files
    # len(set(...)) counts distinct group numbers (ignoring blank strings)
    duplicate_groups = len(set(f['group'] for f in file_list if f['group']))

    # -----------------------------------------------------------------------
    # HTML header, CSS, and statistics dashboard.
    # Double braces {{ }} produce literal { } characters in CSS rules
    # inside an f-string.
    # -----------------------------------------------------------------------
    html_content = f""" Duplicate Files Report

Duplicate Files Report

{total_files}

Total Files

{duplicate_files}

Duplicate Files

{unique_files}

Unique Files

{duplicate_groups}

Duplicate Groups

Report Information

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Script: {SCRIPT_NAME} v{SCRIPT_VERSION}
Note: Files with the same hash are identical duplicates. Click on filenames to open them.

""" # Sort duplicates first, then by group number (same key as CSV) sorted_list = sorted( file_list, key=lambda x: ( 0 if x['status'] == 'DUPLICATE' else 1, x['group'] if x['group'] else 999_999, ) ) for i, file_data in enumerate(sorted_list, 1): # CSS class for the -- 'duplicate' triggers the red background row_class = 'duplicate' if file_data['status'] == 'DUPLICATE' else '' # Inline status badge: red for DUPLICATE, green for UNIQUE status_css = ('status-duplicate' if file_data['status'] == 'DUPLICATE' else 'status-unique') status_badge = (f'' f'{file_data["status"]}') # Group badge only for duplicate rows group_badge = (f'Group {file_data["group"]}' if file_data['group'] else '') html_content += f""" """ html_content += """

Status	Index	Group	File Name	File Path	Size (bytes)	Date Modified	File Type	Hash (first 16 chars)
{status_badge}	{i}	{group_badge}	{file_data['filename']}	{file_data['path']}	{file_data['size']:,}	{file_data['modified']}	{file_data['type']}	{file_data['hash'][:16]}...

""" try: with open(html_file_path, 'w', encoding='utf-8') as htmlfile: htmlfile.write(html_content) if print_flag: print(f'[INFO] HTML saved: {len(file_list)} rows written.') except Exception as e: if print_flag: print(f'[ERROR] Could not save HTML: {e}') raise # =========================================================================== # Function: open_file() # =========================================================================== def open_file(file_path: str) -> None: """ Open a file using the OS default application. OS-specific launchers: Windows : os.startfile() -- Windows ShellExecute API. macOS : subprocess 'open' -- macOS built-in launcher. Linux : subprocess 'xdg-open' -- freedesktop.org standard. Args: file_path (str): Full absolute path of the file to open. """ if print_flag: print(f'[INFO] Opening file: {file_path}') try: if platform.system() == 'Windows': os.startfile(file_path) elif platform.system() == 'Darwin': subprocess.run(['open', file_path]) else: subprocess.run(['xdg-open', file_path]) if print_flag: print('[INFO] File opened successfully.') except Exception as e: if print_flag: print(f'[WARN] Could not open file automatically: {e}') print('[INFO] Please navigate to the report folder and open manually.') # =========================================================================== # Function: main() # =========================================================================== def main() -> None: """ Orchestrate the full duplicate-detection workflow end-to-end. Steps: 1. Print the startup version banner. 2. Open the GUI folder-picker; exit gracefully if cancelled. 3. Show the user where the report will be created. 4. Derive CSV and HTML output file paths. 5. Recursively scan the folder and hash every file. 6. Exit gracefully if the folder contains no readable files. 7. Save results to both CSV and HTML. 8. Auto-open the HTML report in the default browser. 9. Print a completion summary and next-steps guidance. 
    """
    print_script_info()
    if print_flag:
        print('\n[INFO] Duplicate File Detector started.')
        print('=' * 50)

    try:
        # ------------------------------------------------------------------
        # Step 2 - Folder selection
        # ------------------------------------------------------------------
        if print_flag:
            print('\n[STEP 2] Select the folder to scan...')
        folder_path = select_folder()
        if not folder_path:
            print('[INFO] No folder selected. Exiting.')
            return

        # ------------------------------------------------------------------
        # Step 3 - Show the anticipated report location
        # ------------------------------------------------------------------
        if print_flag:
            parent_dir = os.path.dirname(os.path.abspath(folder_path))
            ts_preview = datetime.now().strftime('%Y-%m-%d--%H-%M')
            report_preview = os.path.join(parent_dir, 'C-Reports',
                                          f'duplicate-report-{ts_preview}')
            print(f'\n[INFO] Scanned folder : {folder_path}')
            print(f'[INFO] Report will be at: {report_preview}')
            print('[INFO] (Report is parallel to the selected folder, '
                  'NOT in the script location)')

        # ------------------------------------------------------------------
        # Step 4 - Derive output file paths
        # ------------------------------------------------------------------
        if print_flag:
            print('\n[STEP 4] Setting up output file paths...')
        csv_file_path = get_csv_file_path(folder_path)
        html_file_path = get_html_file_path(folder_path)

        # ------------------------------------------------------------------
        # Step 5 - Scan for duplicates
        # ------------------------------------------------------------------
        if print_flag:
            print('\n[STEP 5] Scanning for duplicate files...')
        file_list = scan_for_duplicates(folder_path)

        # ------------------------------------------------------------------
        # Step 6 - Exit gracefully if nothing was found
        # ------------------------------------------------------------------
        if not file_list:
            print('[INFO] No files found in the selected folder. Exiting.')
            return

        # ------------------------------------------------------------------
        # Step 7 - Save to CSV and HTML
        # ------------------------------------------------------------------
        if print_flag:
            print('\n[STEP 7] Saving reports...')
        save_to_csv(file_list, csv_file_path)
        save_to_html(file_list, html_file_path)

        # ------------------------------------------------------------------
        # Step 8 - Auto-open the HTML report
        # ------------------------------------------------------------------
        if print_flag:
            print('\n[STEP 8] Opening HTML report...')
        open_file(html_file_path)

        # ------------------------------------------------------------------
        # Step 9 - Completion summary and guidance
        # ------------------------------------------------------------------
        if print_flag:
            dup_count = sum(1 for f in file_list if f['status'] == 'DUPLICATE')
            grp_count = len(set(f['group'] for f in file_list if f['group']))
            print('\n' + '=' * 50)
            print('[DONE] Script completed successfully.')
            print(f'[INFO] CSV  : {csv_file_path}')
            print(f'[INFO] HTML : {html_file_path}')
            print(f'[INFO] Total files processed : {len(file_list)}')
            print(f'[INFO] Duplicate files found : {dup_count} '
                  f'in {grp_count} groups')
            print('=' * 50)
            print('\nNext steps:')
            print('  1. Open the HTML report to review duplicates visually.')
            print('  2. Files sharing the same Group number are identical.')
            print('  3. Click filenames in HTML to preview each file.')
            print('  4. Decide which copy to keep and delete the others.')

    except Exception as e:
        if print_flag:
            print(f'[ERROR] Unexpected error: {e}')
        raise


# ===========================================================================
# Entry-point guard -- main() runs only when executed directly
# ===========================================================================
if __name__ == '__main__':
    main()