Fernab von X – 04. April: Automatisches HTML Cleanup via Python

🇬 Far away from X – HTML and Python are cooler

I had a problem:

Lots of HTML files, different styles everywhere and no motivation to touch every single file manually.

So I built a small Python tool that does exactly this automatically.
And Python is supposed to help, right? Whether using VS Code or PyCharm – I like both programs.

No framework, no overkill – just a script that works and does its job.

The script:

goes through all HTML files in a folder
replaces existing style blocks
automatically inserts missing styles
creates a backup beforehand
has a dry-run (test first, then apply)

Especially with larger archives, this saves an extreme amount of time.

A few things were important to me:

Backup: Every file is backed up beforehand. If something goes wrong, I can restore everything.
Dry-Run: First see what would happen – only then make changes. This prevents nasty surprises.
External CSS file: The style comes from a central file. This keeps everything consistent.

Especially with larger HTML archives or old websites, a Python script like this saves a lot of time.

Technically, the script uses:

Python
Regular expressions (re)
File operations (pathlib, shutil)

This allows automated editing of HTML without touching every file manually.

import os
import re
from pathlib import Path
from shutil import copy2

# =============================================
# Configuration - change everything centrally here
# =============================================
ARCHIV_ORDNER = "archiv"
BACKUP_ORDNER = "archiv_backup"
STYLE_DATEI = "global-style.css"  # <- external file!

try:
    with open(STYLE_DATEI, encoding="utf-8") as f:
        NEW_STYLE_CONTENT = f.read().strip()
except FileNotFoundError:
    print(f"ERROR: {STYLE_DATEI} not found!")
    exit(1)

def fix_archive(dry_run=True):
    count = 0
    folder = Path(ARCHIV_ORDNER)
    if not folder.exists():
        print(f"Folder '{folder}' not found!")
        return

    backup_dir = Path(BACKUP_ORDNER)
    backup_dir.mkdir(exist_ok=True)

    for filepath in folder.glob("*.html"):
        if filepath.name == "30nov1.html":
            continue

        backup_path = backup_dir / filepath.name
        copy2(filepath, backup_path)

        with filepath.open(encoding="utf-8") as f:
            content = f.read()

        if "<style>" in content:
            fixed = re.sub(
                r"<style\b[^>]*>.*?</style>",
                f"<style>\n{NEW_STYLE_CONTENT}\n</style>",
                content, flags=re.DOTALL | re.IGNORECASE
            )
        else:
            fixed = re.sub(
                r"(?i)</head>",
                f"<style>\n{NEW_STYLE_CONTENT}\n</style>\n</head>",
                content
            )

        if dry_run:
            print(f"[DRY RUN] Would change: {filepath}")
        else:
            with filepath.open("w", encoding="utf-8") as f:
                f.write(fixed)
            count += 1

    print(f"\n--- DONE ---\n{count} files changed (dry_run={dry_run})")

if __name__ == "__main__":
    fix_archive(dry_run=True)
    # fix_archive(dry_run=False)

This makes maintaining a functional website much more relaxed, without having to fight your way through every single file.

🇩🇪 Fernab von X - html und Python sind cooler!

Ich hatte ein Problem:

Viele HTML-Dateien, überall unterschiedliche Styles und keine Lust, jede Datei einzeln anzufassen.

Also habe ich mir ein kleines Python-Tool gebaut, das genau das automatisch erledigt.
Und Python soll ja helfen. Egal ob mit VS Code oder PyCharm – ich mag beide Programme.

Kein Framework, kein Overkill – einfach ein Script, das funktioniert und seinen Job macht.

Das Script:

geht alle HTML-Dateien in einem Ordner durch
ersetzt vorhandene Style-Blöcke
fügt fehlende Styles automatisch ein
erstellt vorher ein Backup
hat einen Dry-Run (erst testen, dann anwenden)

Gerade bei größeren Archiven spart das extrem Zeit.

Ein paar Dinge sind mir dabei wichtig gewesen:

Backup: Jede Datei wird vorher gesichert. Wenn etwas schiefgeht, kann ich alles zurückholen.
Dry-Run: Erst schauen, was passieren würde – dann erst ändern. Das verhindert böse Überraschungen.
Externe CSS-Datei: Der Style kommt aus einer zentralen Datei. So bleibt alles konsistent.

Gerade bei größeren HTML-Archiven oder alten Webseiten spart so ein Python Script extrem viel Zeit.

Technisch nutzt das Script:

Python
reguläre Ausdrücke (re)
Dateioperationen (pathlib, shutil)

Damit lässt sich HTML automatisiert bearbeiten, ohne jede Datei manuell anzufassen.

import os
import re
from pathlib import Path
from shutil import copy2

# =============================================
# Konfiguration – alles zentral hier ändern
# =============================================
ARCHIV_ORDNER = "archiv"
BACKUP_ORDNER = "archiv_backup"
STYLE_DATEI = "global-style.css"  # <- externe Datei!

try:
    with open(STYLE_DATEI, encoding="utf-8") as f:
        NEW_STYLE_CONTENT = f.read().strip()
except FileNotFoundError:
    print(f"FEHLER: {STYLE_DATEI} nicht gefunden!")
    exit(1)

def fix_archive(dry_run=True):
    count = 0
    folder = Path(ARCHIV_ORDNER)
    if not folder.exists():
        print(f"Ordner '{folder}' nicht gefunden!")
        return

    backup_dir = Path(BACKUP_ORDNER)
    backup_dir.mkdir(exist_ok=True)

    for filepath in folder.glob("*.html"):
        if filepath.name == "30nov1.html":
            continue

        backup_path = backup_dir / filepath.name
        copy2(filepath, backup_path)

        with filepath.open(encoding="utf-8") as f:
            content = f.read()

        if "<style>" in content:
            fixed = re.sub(
                r"<style\b[^>]*>.*?</style>",
                f"<style>\n{NEW_STYLE_CONTENT}\n</style>",
                content, flags=re.DOTALL | re.IGNORECASE
            )
        else:
            fixed = re.sub(
                r"(?i)</head>",
                f"<style>\n{NEW_STYLE_CONTENT}\n</style>\n</head>",
                content
            )

        if dry_run:
            print(f"[DRY RUN] Würde ändern: {filepath}")
        else:
            with filepath.open("w", encoding="utf-8") as f:
                f.write(fixed)
            count += 1

    print(f"\n--- FERTIG ---\n{count} Dateien verändert (dry_run={dry_run})")

if __name__ == "__main__":
    fix_archive(dry_run=True)
    # fix_archive(dry_run=False)

So klappt es auch deutlich entspannter mit einer funktionierenden Webseite, ohne sich durch jede einzelne Datei kämpfen zu müssen.

Dieser Python-Code ist mein geistiges Eigentum. Ich stelle ihn all jenen zur Verfügung, die den HTML-Code ihrer Webseite bereinigen möchten.

🤓 Python Day – ich werde ab und zu solche kleinen Tools teilen. 🤓
Ist besser als das Chaos auf X – und angenehmer einfach drauf loszubauen.

This Python code is my intellectual property. I am making it available to anyone who wants to clean up the HTML on their website.
🤓 Python Day – I will share such small tools from time to time. 🤓
It's better than the chaos on X – and more pleasant to just start building.

← Vorheriger Fall / Previous case Zurück zum Archiv / Back to archive →

Projekttag: Automatisches Update via Python Skript

🇬 Far away from X – HTML and Python are cooler

🇩🇪 Fernab von X - html und Python sind cooler!