Learning Goals
3 min
By the end of this lesson you can:
- Compute hashes with hashlib (SHA-256 and friends).
- State the three security properties of a good hash function.
- Explain why MD5 and SHA-1 are broken and must not be used for security.
- Pick the right hash for the job — and know hashing is not encryption.
Warm-Up · One-Way Streets
5 min
You met file hashing for backups in Level 7. Here's the security view: a hash turns any input into a fixed-size fingerprint, and you cannot go backwards.

    "hello"      → 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
    "hello."     → a different, unpredictable 64-hex-char value
    a 4GB movie  → also exactly 64 hex chars

    Same input      → same hash (always).
    Different input → different hash (in practice: collisions must exist, but finding one is infeasible).
    Hash → input?   No practical way to reverse.
Encryption is two-way (encrypt then decrypt with a key). Hashing is one-way: no key, no "un-hash." That one-wayness is the whole point — it lets you verify data (does the fingerprint match?) without storing the data itself. But it only works if the hash function is secure, and several famous ones no longer are.
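A minimal sketch of "verify without storing the data", using the standard-library hashlib module that the next section introduces. The `fingerprint` helper and the sample messages are illustrative names chosen here, not part of the lesson's own code.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 fingerprint of some bytes as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()

# Keep only the fingerprint, not the original message.
stored = fingerprint(b"hello")

# Later: check whether a candidate matches, without ever storing "hello".
print(fingerprint(b"hello") == stored)   # True: same input, same hash
print(fingerprint(b"hello.") == stored)  # False: one extra character changes it

# The fingerprint size does not depend on the input size.
print(len(fingerprint(b"")), len(fingerprint(b"x" * 1_000_000)))  # 64 64
```

Nothing in this sketch can turn `stored` back into the message; all it can do is compare fingerprints.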
New Concept · Secure Hashing
14 min
Computing hashes with hashlib

    import hashlib

    # hash a string (encode to bytes first: hashes work on bytes)
    digest = hashlib.sha256("hello".encode("utf-8")).hexdigest()
    print(digest)  # 2cf24dba...938b9824 (64 hex chars = 256 bits)

    # update incrementally, for large data without loading it all
    h = hashlib.sha256()
    h.update(b"hello ")
    h.update(b"world")
    print(h.hexdigest())  # same as sha256(b"hello world")

    # hash a file in chunks (Level 7 pattern)
    def file_sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()
The three properties of a good hash
- Deterministic — same input always gives the same output.
- Pre-image resistant: given a hash, it is computationally infeasible to find an input that produces it (one-way).
- Collision resistant: it is computationally infeasible to find two different inputs with the same hash.
Plus the avalanche effect: changing one bit of input flips about half the output bits — so similar inputs look totally unrelated. These properties make a hash a trustworthy fingerprint.
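The avalanche effect can be measured directly. The sketch below counts how many of the 256 output bits differ between the SHA-256 digests of two inputs that differ in a single bit. The `differing_bits` helper and the sample inputs are illustrative assumptions, and the exact count depends on the inputs chosen.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")

# 'o' (0x6F) and 'n' (0x6E) differ in exactly one bit,
# so these two inputs are one bit apart.
count = differing_bits(b"hello", b"helln")
print(f"{count} of 256 output bits differ")

# Identical inputs give identical digests (deterministic).
print(differing_bits(b"hello", b"hello"))  # 0
```

Expect a count somewhere near 128, not exactly 128; try a few other one-bit changes to see how it scatters.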
⚠️ MD5 and SHA-1 are broken
Researchers can deliberately create collisions for MD5 (trivially) and SHA-1 (the 2017 "SHAttered" attack produced two different PDFs with the same SHA-1). Broken collision resistance means an attacker can craft two different files that share a hash, get the harmless one approved, and later swap in the malicious one without the "trusted" hash changing. MD5 and SHA-1 are acceptable only for non-security uses (a quick cache key, detecting accidental corruption). For anything security-relevant (integrity you must trust, signatures, certificates), use SHA-256 or stronger.
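One way to enforce this rule in code is a small wrapper that refuses the broken algorithms. This is only a sketch: `secure_hexdigest` and `BROKEN_FOR_SECURITY` are hypothetical names invented here, not part of hashlib, and the deny-list covers just the two algorithms this lesson names.

```python
import hashlib

# Algorithms this sketch treats as unsuitable for security-relevant hashing.
BROKEN_FOR_SECURITY = {"md5", "sha1"}

def secure_hexdigest(data: bytes, algo: str = "sha256") -> str:
    """Hash data, refusing algorithms with known practical collision attacks."""
    name = algo.lower()
    if name in BROKEN_FOR_SECURITY:
        raise ValueError(f"{algo} is broken for security use; choose sha256 or stronger")
    return hashlib.new(name, data).hexdigest()

print(secure_hexdigest(b"hello"))  # SHA-256 by default

try:
    secure_hexdigest(b"hello", "md5")
except ValueError as err:
    print("refused:", err)
```

Non-security uses such as cache keys can still call `hashlib.md5` directly; the wrapper only guards the code paths where trust matters.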
Which hash to use
SHA-256 / SHA-512 ✓ secure default for integrity & fingerprints
SHA-3 family ✓ modern alternative
BLAKE2 (hashlib) ✓ fast and secure
MD5, SHA-1 ✗ broken — never for security
Passwords? ✗ NONE of the above! Plain hashes are too FAST.
Passwords need bcrypt/argon2: see the next lesson.
Hashing is NOT encryption (the common confusion)

    HASHING            one-way, no key        → verify, fingerprint, integrity
    ENCRYPTION         two-way, with a key    → confidentiality (read it back later)
    ENCODING (base64)  reversible, no secret  → format change, NOT security
If someone says "we encrypted the passwords" they probably mean hashed (and you can't decrypt a hash). And base64 is not encryption — it's just an encoding anyone can reverse. Keep these three straight.
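The difference between encoding and hashing is easy to see in a few lines. The sketch below uses only the standard library; the sample value `secret` is made up for illustration. Encryption is left out because Python's standard library does not ship a cipher.

```python
import base64
import hashlib

secret = b"hunter2"  # illustrative value only

# Encoding: anyone can reverse it, no key or secret required.
encoded = base64.b64encode(secret)
print(encoded)                    # b'aHVudGVyMg=='
print(base64.b64decode(encoded))  # b'hunter2'

# Hashing: a fixed-size fingerprint with no inverse operation.
digest = hashlib.sha256(secret).hexdigest()
print(len(digest))                # 64

# The only thing you can do with a hash is recompute and compare.
print(hashlib.sha256(b"hunter2").hexdigest() == digest)  # True
print(hashlib.sha256(b"hunter3").hexdigest() == digest)  # False
```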
Worked Example · An Integrity Verifier
12 min
Goal: a tool that records a file's SHA-256 and later detects if even one byte changed, the basis of file-integrity monitoring (Lesson 21) and verifying downloads.

    import hashlib, json
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def record(folder: str, manifest: str = "hashes.json") -> None:
        """Snapshot the SHA-256 of every file in a folder."""
        hashes = {str(p): sha256_file(p)
                  for p in Path(folder).rglob("*") if p.is_file()}
        Path(manifest).write_text(json.dumps(hashes, indent=2))
        print(f"recorded {len(hashes)} file hashes")

    def verify(manifest: str = "hashes.json") -> None:
        """Compare current hashes to the recorded snapshot."""
        recorded = json.loads(Path(manifest).read_text())
        changed, missing = [], []
        for path, old_hash in recorded.items():
            p = Path(path)
            if not p.exists():
                missing.append(path)
            elif sha256_file(p) != old_hash:
                changed.append(path)
        if changed:
            print("⚠️ MODIFIED:", *changed, sep="\n  ")
        if missing:
            print("⚠️ MISSING:", *missing, sep="\n  ")
        if not (changed or missing):
            print("✓ all files intact")

    # record("config"); ... later ... verify()

    recorded 12 file hashes
    # (someone edits config/app.conf, even by one character) ...
    ⚠️ MODIFIED:
      config/app.conf
    # the avalanche effect means a one-byte change → totally different hash
Read the code
Because SHA-256 is deterministic and collision-resistant, a stored hash is a trustworthy fingerprint: if the recomputed hash differs, the file changed — period, even a single byte (avalanche effect). This is exactly how you verify a downloaded ISO matches its published checksum, and how file-integrity monitors detect tampering. We use SHA-256, not MD5: if an attacker could forge a collision, they could swap a malicious file in undetected. The right hash makes the guarantee real.
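To see the single-byte claim without touching real files, here is a self-contained sketch that works in a temporary directory. It repeats the chunked `sha256_file` helper so the block runs on its own. The file name `app.conf` and its contents are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hash a file in 64 KiB chunks (same pattern as the lesson's helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "app.conf"      # hypothetical config file
    conf.write_bytes(b"port=8080\n")
    before = sha256_file(conf)

    conf.write_bytes(b"port=8081\n")   # a single byte changed
    after = sha256_file(conf)

print("changed" if before != after else "intact")  # changed
```

The file keeps the same size and name across the edit; only the content hash reveals the change.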
Try It Yourself
13 min
Hash "password" and "Password" with SHA-256 and compare. Count how many of the 64 hex chars differ: expect nearly all of them (around 60 of 64), because roughly half of the 256 output bits flip despite a one-letter change.
Download any file that publishes a SHA-256 checksum (many open-source projects do). Compute its hash and confirm it matches the published one. This is how you verify software hasn't been tampered with.
Hint
    actual = sha256_file(Path("downloaded.iso"))
    expected = "abc123..."  # from the project's site
    print("MATCH ✓" if actual == expected else "MISMATCH ✗: do not trust!")
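Time how long it takes to hash a large file with MD5, SHA-256, and BLAKE2. Note which is fastest (MD5 often beats SHA-256, though results vary by CPU, especially on chips with SHA hardware acceleration), and explain why "faster" is exactly the wrong property for password hashing (a teaser for next lesson).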
Hint
    import hashlib, time

    data = b"x" * (50 * 1024 * 1024)  # 50MB
    for algo in ("md5", "sha256", "blake2b"):
        t = time.perf_counter()
        hashlib.new(algo, data).hexdigest()
        print(f"{algo}: {time.perf_counter()-t:.3f}s")

    # Fast = great for files, TERRIBLE for passwords: an attacker can try
    # billions of guesses per second. Password hashes are SLOW on purpose.
Mini-Challenge · A Collision-Aware Deduplicator
8 min
Build a duplicate-file finder that groups files by SHA-256 (exact and content-based, better than Level 6's size-based version). Report sets of identical files and bytes wasted. Add a comment explaining why SHA-256 (not MD5) makes the "these are truly identical" claim trustworthy.
Show a sample solution
    import hashlib
    from pathlib import Path
    from collections import defaultdict

    def sha256_file(p):
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_dupes(folder: str) -> None:
        groups = defaultdict(list)
        for p in Path(folder).rglob("*"):
            if p.is_file():
                groups[sha256_file(p)].append(p)
        for digest, files in groups.items():
            if len(files) > 1:
                wasted = files[0].stat().st_size * (len(files) - 1)
                print(f"{len(files)} identical ({wasted} bytes wasted):")
                for f in files:
                    print("  ", f)

    # SHA-256's collision resistance means equal hashes ⇒ equal content.
    # With MD5 an attacker could craft two DIFFERENT files sharing a hash,
    # so "identical hash" would no longer guarantee "identical file".
    find_dupes("downloads")
Non-negotiables: content-based grouping via SHA-256, wasted-bytes report, and the collision-resistance justification.
Recap
3 min
A hash is a one-way, fixed-size fingerprint computed with hashlib (always on bytes). A secure hash is deterministic, pre-image resistant, and collision resistant, with the avalanche effect. MD5 and SHA-1 are broken, so never use them for security; use SHA-256, SHA-3, or BLAKE2. Hashing verifies integrity and fingerprints data without storing it, but it is not encryption (one-way, no key) and not encoding (base64 is reversible). Crucially, fast general-purpose hashes are the wrong tool for passwords: those need deliberately slow functions, which is the topic of the next lesson.
Vocabulary Card
- hash function: A one-way function mapping any input to a fixed-size fingerprint.
- collision resistance: Infeasibility of finding two inputs with the same hash.
- avalanche effect: A tiny input change flips ~half the output bits.
- SHA-256 vs MD5: Secure default vs. broken; never use MD5/SHA-1 for security.
Homework
4 min
Build a small integrity.py CLI with record <folder> and verify subcommands using SHA-256 (you'll extend it into a real monitor in Lesson 21). Test it by recording, then modifying one file, then verifying. Write a short note: three legitimate uses of hashing, and one thing hashing is not good for (with the correct alternative).
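If the subcommand layout is unfamiliar, the command-line skeleton might look like the sketch below, built with the standard-library argparse module. It only parses arguments; the hashing logic is left for you to fill in. The `--manifest` option and the `build_parser` name are assumptions added here for illustration, not requirements of the homework.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `record <folder>` and `verify` subcommands."""
    parser = argparse.ArgumentParser(prog="integrity.py")
    # Assumed extra option: where the snapshot of hashes is stored.
    parser.add_argument("--manifest", default="hashes.json",
                        help="path of the JSON file holding recorded hashes")
    sub = parser.add_subparsers(dest="command", required=True)

    rec = sub.add_parser("record", help="snapshot SHA-256 hashes of a folder")
    rec.add_argument("folder", help="folder to scan recursively")

    sub.add_parser("verify", help="compare current hashes to the snapshot")
    return parser

# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that it reads sys.argv.
args = build_parser().parse_args(["record", "config"])
print(args.command, args.folder, args.manifest)  # record config hashes.json
```

From there, dispatch on `args.command` to your own record and verify functions.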
Sample · uses of hashing
    Legitimate uses:
    1. Integrity: verify a download/file matches a known SHA-256.
    2. Fingerprinting: dedupe files, or detect changes (FIM, L8-21).
    3. Data structures: hash tables / git object IDs (content addressing).

    NOT good for: storing passwords. A plain hash (even SHA-256) is far too
    FAST: an attacker with the hash file can try billions of guesses per
    second, and identical passwords share a hash.
    Correct alternative: a slow, salted password hash such as bcrypt or
    argon2 (next lesson).
Non-negotiables: working record/verify with SHA-256, three real uses, and the password caveat with the bcrypt/argon2 fix.