PY-L8-11 · Hashing 101: hashlib

Learning Goals

3 min

By the end of this lesson you can:

Compute hashes with hashlib (SHA-256 and friends).
State the three security properties of a good hash function.
Explain why MD5 and SHA-1 are broken and must not be used for security.
Pick the right hash for the job — and know hashing is not encryption.

Warm-Up · One-Way Streets

5 min

You met file hashing for backups in Level 7. Here's the security view: a hash turns any input into a fixed-size fingerprint, and you cannot go backwards.

"hello"        → 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
"hello."       → a different, unpredictable 64-hex-char value
a 4GB movie    → also exactly 64 hex chars

Same input → same hash (always).   Different input → different hash (in practice).
Hash → input?  Computationally infeasible to reverse.

Today's big idea

Encryption is two-way (encrypt then decrypt with a key). Hashing is one-way: no key, no "un-hash." That one-wayness is the whole point — it lets you verify data (does the fingerprint match?) without storing the data itself. But it only works if the hash function is secure, and several famous ones no longer are.

New Concept · Secure Hashing

14 min

Computing hashes with hashlib

import hashlib

# hash a string (encode to bytes first — hashes work on bytes)
digest = hashlib.sha256("hello".encode("utf-8")).hexdigest()
print(digest)        # 2cf24dba...938b9824  (64 hex chars = 256 bits)

# update incrementally — for large data without loading it all
h = hashlib.sha256()
h.update(b"hello ")
h.update(b"world")
print(h.hexdigest())   # same as sha256(b"hello world")

# hash a file in chunks (Level 7 pattern)
def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
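
The claim that incremental updates give the same result as a one-shot hash can be checked directly. The sketch below is self-contained and writes a throwaway file through tempfile; the file name sample.bin and the small chunk size are illustrative assumptions, not part of the lesson's code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Hash a file in fixed-size chunks (same pattern as the lesson's helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# One-shot and incremental hashing of the same bytes agree.
one_shot = hashlib.sha256(b"hello world").hexdigest()
h = hashlib.sha256()
h.update(b"hello ")
h.update(b"world")
incremental = h.hexdigest()
print(one_shot == incremental)  # True

# Chunked file hashing agrees with hashing the whole content at once.
payload = b"hello world" * 10_000
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"  # hypothetical file name
    sample.write_bytes(payload)
    chunked = file_sha256(sample, chunk_size=1024)
whole = hashlib.sha256(payload).hexdigest()
print(chunked == whole)  # True
```

How the data is split across update() calls does not matter; only the concatenated bytes do.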

The three properties of a good hash

Deterministic — same input always gives the same output.
Pre-image resistant — given a hash, you can't find an input that produces it (one-way).
Collision resistant — you can't find two different inputs with the same hash.

Plus the avalanche effect: changing one bit of input flips about half the output bits — so similar inputs look totally unrelated. These properties make a hash a trustworthy fingerprint.
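
One way to see determinism and the avalanche effect is to count differing bits between two digests. This is a minimal sketch; differing_bits is a hypothetical helper name, and the exact count depends on the inputs, so no specific number is promised here.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


flipped = differing_bits(b"hello", b"hello.")
print(f"{flipped} of 256 bits differ")  # expect a value somewhere near 128

# Deterministic: hashing the same input twice changes nothing.
print(differing_bits(b"hello", b"hello"))  # 0
```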

⚠️ MD5 and SHA-1 are broken

Do not use MD5 or SHA-1 for security

Researchers can deliberately create collisions for MD5 (trivially, on ordinary hardware) and SHA-1 (the 2017 "SHAttered" attack produced two different PDFs with the same SHA-1). Once collision resistance is broken, an attacker can prepare two files that share a hash, get the harmless one approved, and later swap in the malicious one without the "trusted" hash changing. MD5 and SHA-1 are acceptable only for non-security uses (a quick cache key, detecting accidental corruption). For anything security-relevant (integrity you must trust, signatures, certificates), use SHA-256 or stronger.
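
For the non-security uses mentioned above, the intent can be made explicit in code. The sketch below assumes Python 3.9 or newer, where hashlib constructors accept usedforsecurity=False; cache_key and trusted_fingerprint are hypothetical helper names used only for illustration.

```python
import hashlib


def cache_key(url: str) -> str:
    """Hypothetical helper: a short, non-security cache key for a URL."""
    # usedforsecurity=False (Python 3.9+) documents that MD5 is not being
    # relied on for any security guarantee here.
    return hashlib.md5(url.encode("utf-8"), usedforsecurity=False).hexdigest()


def trusted_fingerprint(data: bytes) -> str:
    """Anything security-relevant goes through SHA-256 instead."""
    return hashlib.sha256(data).hexdigest()


print(len(cache_key("https://example.com/a.png")))   # 32 hex chars (128 bits)
print(len(trusted_fingerprint(b"release-1.0.tar")))  # 64 hex chars (256 bits)
```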

Which hash to use

SHA-256 / SHA-512   ✓ secure default for integrity & fingerprints
SHA-3 family        ✓ modern alternative
BLAKE2 (hashlib)    ✓ fast and secure
MD5, SHA-1          ✗ broken — never for security

Passwords?          ✗ NONE of the above! Plain hashes are too FAST.
                    Passwords need bcrypt/argon2 — next lesson.
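
All the recommended algorithms in the list above are available from hashlib with the same interface. The following sketch hashes one input with each and prints the digest length; the dictionary keys are just labels chosen for this example.

```python
import hashlib

data = b"hello"

digests = {
    "sha256": hashlib.sha256(data).hexdigest(),
    "sha512": hashlib.sha512(data).hexdigest(),
    "sha3_256": hashlib.sha3_256(data).hexdigest(),
    "blake2b": hashlib.blake2b(data).hexdigest(),
    # BLAKE2 lets you choose the digest size; 32 bytes matches SHA-256's length.
    "blake2b-32": hashlib.blake2b(data, digest_size=32).hexdigest(),
}

for name, hexdigest in digests.items():
    # Each hex character carries 4 bits.
    print(f"{name:10} {len(hexdigest) * 4:4} bits  {hexdigest[:16]}...")
```

Swapping algorithms is usually a one-line change, which makes it cheap to start with a secure default.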

Hashing is NOT encryption (the common confusion)

HASHING            one-way, no key       → verify, fingerprint, integrity
ENCRYPTION         two-way, with a key   → confidentiality (read it back later)
ENCODING (base64)  reversible, no secret → format change, NOT security

If someone says "we encrypted the passwords" they probably mean hashed (and you can't decrypt a hash). And base64 is not encryption — it's just an encoding anyone can reverse. Keep these three straight.
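
The difference between encoding and hashing shows up in a few lines. This sketch uses only the standard library, which is why encryption is not demonstrated (the standard library has no general-purpose encryption module); the value hunter2 is an arbitrary illustrative input.

```python
import base64
import hashlib

secret = b"hunter2"  # illustrative value only

# Encoding: anyone can reverse it, no key or secret required.
encoded = base64.b64encode(secret)
decoded = base64.b64decode(encoded)
print(decoded == secret)  # True

# Hashing: hashlib offers no inverse operation. All you can do is hash a
# candidate input and compare digests.
digest = hashlib.sha256(secret).hexdigest()
print(hashlib.sha256(b"hunter2").hexdigest() == digest)  # True
print(hashlib.sha256(b"hunter3").hexdigest() == digest)  # False
```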

Worked Example · An Integrity Verifier

12 min

Goal: a tool that records a file's SHA-256 and later detects if even one byte changed — the basis of file-integrity monitoring (Lesson 21) and verifying downloads.

import hashlib, json
from pathlib import Path

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def record(folder: str, manifest: str = "hashes.json") -> None:
    """Snapshot the SHA-256 of every file in a folder."""
    hashes = {str(p): sha256_file(p)
              for p in Path(folder).rglob("*") if p.is_file()}
    Path(manifest).write_text(json.dumps(hashes, indent=2))
    print(f"recorded {len(hashes)} file hashes")

def verify(manifest: str = "hashes.json") -> None:
    """Compare current hashes to the recorded snapshot."""
    recorded = json.loads(Path(manifest).read_text())
    changed, missing = [], []
    for path, old_hash in recorded.items():
        p = Path(path)
        if not p.exists():
            missing.append(path)
        elif sha256_file(p) != old_hash:
            changed.append(path)
    if changed:  print("⚠️ MODIFIED:", *changed, sep="\n  ")
    if missing:  print("⚠️ MISSING:", *missing, sep="\n  ")
    if not (changed or missing):
        print("✓ all files intact")

# record("config"); ... later ... verify()

recorded 12 file hashes
# (someone edits config/app.conf, even by one character) ...
⚠️ MODIFIED:
  config/app.conf
# the avalanche effect means a one-byte change → totally different hash

Read the code

Because SHA-256 is deterministic and collision-resistant, a stored hash is a trustworthy fingerprint: if the recomputed hash differs, the file changed — period, even a single byte (avalanche effect). This is exactly how you verify a downloaded ISO matches its published checksum, and how file-integrity monitors detect tampering. We use SHA-256, not MD5: if an attacker could forge a collision, they could swap a malicious file in undetected. The right hash makes the guarantee real.
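
The core of that guarantee can be reproduced in isolation without touching real files. The sketch below is an assumption-laden miniature, not the verifier itself: it uses a temporary directory, a made-up app.conf, and a hypothetical helper sha256_of that reads the whole file at once (fine for a tiny demo file).

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a small file in one go (chunk large files instead)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "app.conf"  # hypothetical file
    conf.write_text("port=8080\n", encoding="utf-8")
    before = sha256_of(conf)

    unchanged = sha256_of(conf) == before  # nothing touched yet

    conf.write_text("port=8081\n", encoding="utf-8")  # one character edited
    after = sha256_of(conf)

print("intact before edit:", unchanged)         # True
print("modified after edit:", after != before)  # True
```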

Try It Yourself

13 min

01 🟢 See the avalanche

Hash "password" and "Password" with SHA-256 and compare. Count how many of the 64 hex chars differ: expect nearly all of them (about 15 in 16), because roughly half of the 256 underlying bits flip despite the one-letter change.

02 🟡 Verify a download

Download any file that publishes a SHA-256 checksum (many open-source projects do). Compute its hash and confirm it matches the published one. This is how you verify software hasn't been tampered with.

Hint

actual = sha256_file(Path("downloaded.iso"))
expected = "abc123..."   # from the project's site
print("MATCH ✓" if actual == expected else "MISMATCH ✗ — do not trust!")
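
Published checksums are not always formatted the way hexdigest() prints them: some sites use uppercase hex or ship a line of the form "hash, spaces, filename". A small normalising comparison can avoid false mismatches. This is a sketch under those assumptions; matches_published and the sample checksum line are hypothetical.

```python
import hashlib


def matches_published(actual_hex: str, published: str) -> bool:
    """Hypothetical helper: compare a computed digest to a published one."""
    # Keep only the first whitespace-separated field and ignore case, so
    # "ABC...  file.iso" and "abc..." are treated as the same digest.
    fields = published.split()
    expected_hex = fields[0].lower() if fields else ""
    return actual_hex.lower() == expected_hex


actual = hashlib.sha256(b"hello").hexdigest()
published_line = actual.upper() + "  hello.txt\n"  # made-up checksum line

print(matches_published(actual, published_line))  # True
print(matches_published(actual, "0" * 64))        # False
```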

03 🔴 Compare hash speeds

Time how long it takes to hash a large file with MD5, SHA-256, and BLAKE2. Note how quickly all three get through the data (which one wins depends on your CPU and Python build), and explain why "fast" is exactly the wrong property for password hashing (a teaser for next lesson).

Hint

import hashlib, time
data = b"x" * (50 * 1024 * 1024)   # 50MB
for algo in ("md5", "sha256", "blake2b"):
    t = time.perf_counter()
    hashlib.new(algo, data).hexdigest()
    print(f"{algo}: {time.perf_counter()-t:.3f}s")
# Fast = great for files, TERRIBLE for passwords: an attacker can try
# billions of guesses per second. Password hashes are SLOW on purpose.

Mini-Challenge · A Collision-Aware Deduplicator

8 min

Build a duplicate-file finder that groups files by SHA-256 (exact, content-based — better than Level 6's size-based version). Report sets of identical files and bytes wasted. Add a comment explaining why SHA-256 (not MD5) makes the "these are truly identical" claim trustworthy.

Show a sample solution

import hashlib
from pathlib import Path
from collections import defaultdict

def sha256_file(p):
    h = hashlib.sha256()
    with open(p, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_dupes(folder: str) -> None:
    groups = defaultdict(list)
    for p in Path(folder).rglob("*"):
        if p.is_file():
            groups[sha256_file(p)].append(p)
    for digest, files in groups.items():
        if len(files) > 1:
            wasted = files[0].stat().st_size * (len(files) - 1)
            print(f"{len(files)} identical ({wasted} bytes wasted):")
            for f in files:
                print("   ", f)
    # SHA-256's collision resistance means equal hashes ⇒ equal content.
    # With MD5 an attacker could craft two DIFFERENT files sharing a hash,
    # so "identical hash" would no longer guarantee "identical file".

find_dupes("downloads")

Non-negotiables: content-based grouping via SHA-256, wasted-bytes report, and the collision-resistance justification.

Recap

3 min

A hash is a one-way, fixed-size fingerprint computed with hashlib (always on bytes). A secure hash is deterministic, pre-image resistant, and collision resistant, with the avalanche effect. MD5 and SHA-1 are broken — never use them for security; use SHA-256/SHA-3/BLAKE2. Hashing verifies integrity and fingerprints data without storing it — but it is not encryption (one-way, no key) and not encoding (base64 is reversible). Crucially, fast general-purpose hashes are the wrong tool for passwords — those need deliberately slow functions, which is exactly next lesson.

Vocabulary Card

hash function: A one-way function mapping any input to a fixed-size fingerprint.
collision resistance: Infeasibility of finding two inputs with the same hash.
avalanche effect: A tiny input change flips ~half the output bits.
SHA-256 vs MD5: Secure default vs. broken — never use MD5/SHA-1 for security.

Homework

4 min

Build a small integrity.py CLI with record <folder> and verify subcommands using SHA-256 (you'll extend it into a real monitor in Lesson 21). Test it by recording, then modifying one file, then verifying. Write a short note: three legitimate uses of hashing, and one thing hashing is not good for (with the correct alternative).

Sample · uses of hashing

Legitimate uses:
  1. Integrity: verify a download/file matches a known SHA-256.
  2. Fingerprinting: dedupe files, or detect changes (FIM, L8-21).
  3. Data structures: hash tables / git object IDs (content addressing).

NOT good for: storing passwords. A plain hash (even SHA-256) is far
too FAST — an attacker with the hash file can try billions of guesses
per second, and identical passwords share a hash. Correct alternative:
a slow, salted password hash — bcrypt or argon2 (next lesson).

Non-negotiables: working record/verify with SHA-256, three real uses, and the password caveat with the bcrypt/argon2 fix.
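
The "identical passwords share a hash" point in the sample note is easy to demonstrate. The sketch below is an illustration only and is not a password-storage scheme: even with a salt, a single SHA-256 call remains far too fast, which is why the note points to bcrypt or argon2. The password and variable names are made up.

```python
import hashlib
import os

# ILLUSTRATION ONLY: do not store passwords this way.
alice = hashlib.sha256(b"letmein").hexdigest()
bob = hashlib.sha256(b"letmein").hexdigest()
print(alice == bob)  # True: same password, same unsalted hash

# A per-user random salt makes the stored values differ, which hides
# the fact that two users chose the same password.
salt_a, salt_b = os.urandom(16), os.urandom(16)
alice_salted = hashlib.sha256(salt_a + b"letmein").hexdigest()
bob_salted = hashlib.sha256(salt_b + b"letmein").hexdigest()
print(alice_salted == bob_salted)  # False

# Each guess still costs only one fast SHA-256 call, so salting alone
# does not slow down an attacker. A slow password hash is still needed.
```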

