Automatisches HTML Update mit Python

🇬 English Version

The Challenge:

I recently inherited a large archive of old HTML files where every single page had different styles and inconsistent structures.

Manually fixing hundreds of files was impossible. So, instead of struggling with complex CMS tools, I decided to write a simple Python script to handle the updates automatically.

This tool does exactly one thing well: It finds a global stylesheet and injects it into thousands of HTML pages instantly.

Why Python?

It allows me to process files programmatically without installing heavy frameworks like WordPress or React. It's lightweight, fast, and runs directly on my desktop.

Key Features of this Script:

Iterates through all files in a target folder recursively
Reads external CSS definitions dynamically
Inserts missing stylesheets safely into <head>
Auto-backup mechanism prevents data loss
Includes a dry-run mode (Test before applying changes)

A few things were important to me:

Backup Safety: Every modified file gets a backup first.
Dry-Run Mode: See what would change before changing anything.
Clean Code: No external dependencies, just standard library modules.

This approach is highly efficient for migrating old websites or standardizing internal documentation sites.

import os
import re
from pathlib import Path
from shutil import copy2

# Configuration - Change everything centrally here
ARCHIV_ORDNER = "archiv"
BACKUP_ORDNER = "archiv_backup"
STYLE_DATEI = "global-style.css" 

try:
    with open(STYLE_DATEI, encoding="utf-8") as f:
        NEW_STYLE_CONTENT = f.read().strip()
except FileNotFoundError:
    print(f"ERROR: "{STYLE_DATEI} not found!")
    exit(1)

def fix_archive(dry_run=True):
    count = 0
    folder = Path(ARCHIV_ORDNER)
    
    if not folder.exists():
        print(f"Folder '"{folder}"' not found!")
        return

    backup_dir = Path(BACKUP_ORDNER)
    backup_dir.mkdir(exist_ok=True)

    for filepath in folder.glob("*.html"):
        if filepath.name == "30nov1.html":
            continue

        backup_path = backup_dir / filepath.name
        copy2(filepath, backup_path)

        with filepath.open(encoding="utf-8") as f:
            content = f.read()

        if "<style>" in content:
            fixed = re.sub(
                r"<style\b[^>]*>.*?</style>",
                f"<style>\n"{NEW_STYLE_CONTENT}"\n</style>",
                content, flags=re.DOTALL | re.IGNORECASE
            )
        else:
            fixed = re.sub(
                r"(?i)</head>",
                f"<style>\n"{NEW_STYLE_CONTENT}"\n</head>",
                content
            )

        if dry_run:
            print(f"[DRY RUN] Would change:" {filepath})
        else:
            with filepath.open("w", encoding="utf-8") as f:
                f.write(fixed)
            count += 1

    print(f"\n--- DONE ---\n"{count} files changed (dry_run={dry_run}))

if __name__ == "__main__":
    fix_archive(dry_run=True)
    # fix_archive(dry_run=False)

This makes maintaining a functional website much more relaxed, without having to fight your way through every single file.

🇩🇪 Deutsche Version

Herausforderung:

Ich habe kürzlich Zugriff auf ein großes Archiv alter HTML-Dateien bekommen, bei dem jede einzelne Seite unterschiedliche Styles und inkonsistente Strukturen hatte.

Das manuelle Reparieren von hunderten Dateien war unmöglich. Also beschloss ich, statt komplexe CMS-Tools zu verwenden, ein einfaches Python-Skript zu schreiben, um Updates automatisch durchzuführen.

Dieses Werkzeug erledigt exakt eine Aufgabe sehr gut: Es findet ein globales Stylesheet und injiziert es blitzschnell in tausende HTML-Seiten.

Warum Python?

Es ermöglicht mir, Dateien programmatisch zu verarbeiten, ohne schwere Frameworks wie WordPress oder React zu installieren. Es ist leichtgewichtig, schnell und läuft direkt auf meinem Desktop.

Schlüsselfunktionen dieses Scripts:

Durchläuft alle Dateien in einem Zielordner rekursiv
Liest externe CSS-Definitionen dynamisch aus
Fügt fehlende Stylesheets sicher in <head> ein
Automatische Backup-Mechanismus verhindert Datenverlust
Entspricht einem Dry-Run-Modus (Erst testen, dann ändern)

Etwas Wichtiges vorab:

Sicherheits-Backups: Jede modifizierte Datei wird zuerst gesichert.
Dry-Run Modus: Erst sehen, was geändert würde, bevor man Änderungen vornimmt.
Bereinigter Code: Keine externen Abhängigkeiten, nur Standardbibliotheken.

Dieser Ansatz ist extrem effizient für das Migrieren alter Webseiten oder das Standardisieren interner Dokumentationsseiten.

import os
import re
from pathlib import Path
from shutil import copy2

# Konfiguration – Alles zentral hier ändern
ARCHIV_ORDNER = "archiv"
BACKUP_ORDNER = "archiv_backup"
STYLE_DATEI = "global-style.css" 

try:
    mit open(STYLE_DATEI, encoding="utf-8") als f:
        NEW_STYLE_CONTENT = f.read().strip()
außer FileNotFoundError:
    print(f"FEHLER: "{STYLE_DATEI} nicht gefunden!")
    exit(1)

def fix_archive(dry_run=True):
    count = 0
    folder = Path(ARCHIV_ORDNER)
    
    wenn nicht folder.exists():
        print(f"Ordner '"{folder}"' nicht gefunden!")
        zurück

    backup_dir = Path(BACKUP_ORDNER)
    backup_dir.mkdir(exist_ok=True)

    für filepath in folder.glob("*.html"):
        wenn filepath.name == "30nov1.html":
            weiter

        backup_path = backup_dir / filepath.name
        copy2(filepath, backup_path)

        mit filepath.open(encoding="utf-8") als f:
            content = f.read()

        wenn "<style>" in content:
            fixed = re.sub(
                r"<style\b[^>]*>.*?</style>",
                f"<style>\n"{NEW_STYLE_CONTENT}"\n</style>",
                content, flags=re.DOTALL | re.IGNORECASE
            )
        sonst:
            fixed = re.sub(
                r"(?i)</head>",
                f"<style>\n"{NEW_STYLE_CONTENT}"\n</head>",
                content
            )

        wenn dry_run:
            print(f"[DRY RUN] Würde ändern:" {filepath})
        sonst:
            mit filepath.open("w", encoding="utf-8") als f:
                f.write(fixed)
            count += 1

    print(f"\n--- FERTIG ---\n"{count} Dateien verändert (dry_run={dry_run}))

wenn __name__ == "__main__":
    fix_archive(dry_run=True)
    # fix_archive(dry_run=False)

So klappt es auch deutlich entspannter mit einer funktionierenden Webseite, ohne sich durch jede einzelne Datei kämpfen zu müssen.

Dieser Python-Code ist mein geistiges Eigentum. Ich stelle ihn all jenen zur Verfügung, die den HTML-Code ihrer Webseite bereinigen möchten.

🤓 Python Day – ich werde ab und zu solche kleinen Tools teilen. 🤓
Ist besser als das Chaos auf X – und angenehmer einfach drauf loszubauen.

This Python code is my intellectual property. I am making it available to anyone who wants to clean up the HTML on their website.
🤓 Python Day – I will share such small tools from time to time. 🤓
It's better than the chaos on X – and more pleasant to just start building.

← Vorheriger Fall / Previous case Zurück zum Archiv / Back to archive →

Automation Project: Cleaning HTML Archives

🇬 English Version

🇩🇪 Deutsche Version