Coding, Programming & Software Development

Industry standard for parsing out IP addresses from a text document regex

Adam

29 Jan 2026 • 6 min read

can you give me some kind of formula that is an industry standard to parse out ip addresses of a txt document. it can be complete or partial obfuscation/redaction

ChatGPT said:

Thought for 21s

Yep — the “industry standard” way to do this is:

Use a regex to find candidates (fast, works in grep/sed/ripgrep/etc.)
Optionally validate candidates (best practice) with a real IP parser (e.g., Python’s ipaddress) to avoid weird false-positives.

Below are solid, widely-used patterns and redaction variants.

IPv4 extraction (strict 0–255 per octet)

Regex (PCRE / most engines):

\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b

If you also want optional :port:

\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b(?::\d{1,5})?

IPv6 extraction (practical + common forms)

IPv6 regex can get very long if you try to perfectly cover every compressed form. The common industry approach is “good-enough extraction + validation”.

Good extraction regex (covers most real-world IPv6 you’ll see, including :: compression):

\b(?:[0-9A-Fa-f]{1,4}:){1,7}[0-9A-Fa-f]{1,4}\b|\b(?:[0-9A-Fa-f]{1,4}:){1,7}:\b|\b:(?::[0-9A-Fa-f]{1,4}){1,7}\b

If you also want bracketed IPv6 with optional port (common in URLs/logs: [2001:db8::1]:443):

\[(?:[0-9A-Fa-f:]+)\](?::\d{1,5})?

(Then validate the inside of the brackets.)

Redaction / obfuscation patterns

A) Full redaction (replace any IPv4 with a token)

Match (IPv4 strict):

\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b

Replace with:

[REDACTED_IP]

B) Partial IPv4 redaction (keep /24, hide last octet)

Match + capture first 3 octets:

\b((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d))\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b

Replace with:

$1.xxx

(Some tools use \1 instead of $1.)

C) Partial IPv6 redaction (keep prefix, hide the rest)

For IPv6, regex-only redaction gets messy because of compression (::). The clean way is:

extract candidate IPv6 tokens (regex),
normalize/validate (e.g., expand/compress),
then redact by prefix length (like keeping first 4 hextets).

If you must do regex-only, keep it simple and accept imperfections:

Replace anything that looks IPv6-ish with [REDACTED_IPV6] using a broad pattern like:

\b[0-9A-Fa-f:]{2,}\b

…but this will overmatch (timestamps, hashes, etc.), so validation is recommended.

Best-practice note (why validation matters)

Even “strict” regex can still pick up things you don’t want in edge cases (IDs, version strings, etc.). The standard robust workflow is:

Regex → candidate tokens
Validate → confirm it’s a real IP
Redact/transform → consistent output

If you tell me what tool you’re using (grep/sed, Python, Node, Go, etc.), I’ll give you the exact one-liner / function you can drop in, including IPv4+IPv6 and your preferred redaction style.

You said:

Python

ChatGPT said:

Thought for 1m 43s

Here’s a solid “industry standard” Python approach: regex for candidate matches + ipaddress for real validation, with optional redaction/obfuscation (full token, partial, or deterministic hashing).

#!/usr/bin/env python3 import re import sys import argparse import ipaddress import hashlib from collections import Counter # --- Candidate patterns (broad enough to catch real-world logs, then validated) --- # [IPv6]:port (common in URLs/logs) RE_V6_BRACKET = r"\[(?P<v6b>[0-9A-Fa-f:]{2,})\](?P<pb>:\d{1,5})?" # IPv6 with embedded IPv4 (e.g., ::ffff:192.0.2.1) — no port here RE_V6V4 = r"(?<![0-9A-Fa-f:])(?P<v6v4>(?:[0-9A-Fa-f]{0,4}:){1,6}\d{1,3}(?:\.\d{1,3}){3})(?![0-9A-Fa-f:])" # IPv4 with optional :port RE_V4 = r"(?P<v4>\b\d{1,3}(?:\.\d{1,3}){3}\b)(?P<p4>:\d{1,5})?" # Practical IPv6 (most forms incl :: compression) — no port here RE_V6 = ( r"(?<![0-9A-Fa-f:])" r"(?P<v6>" r"(?:[0-9A-Fa-f]{1,4}:){1,7}[0-9A-Fa-f]{1,4}|" r"(?:[0-9A-Fa-f]{1,4}:){1,7}:|" r":(?::[0-9A-Fa-f]{1,4}){1,7}" r")" r"(?![0-9A-Fa-f:])" ) CANDIDATE_RE = re.compile( f"{RE_V6_BRACKET}|{RE_V6V4}|{RE_V4}|{RE_V6}" ) def _validate_ip(s: str): try: return ipaddress.ip_address(s) except ValueError: return None def iter_ips_from_line(line: str, ipv4_only=False, ipv6_only=False): for m in CANDIDATE_RE.finditer(line): addr = None if m.group("v6b"): addr = m.group("v6b") elif m.group("v6v4"): addr = m.group("v6v4") elif m.group("v4"): addr = m.group("v4") elif m.group("v6"): addr = m.group("v6") ip = _validate_ip(addr) if addr else None if not ip: continue if ipv4_only and ip.version != 4: continue if ipv6_only and ip.version != 6: continue yield ip def obfuscate_ip(ip, style: str, salt: str | None, ipv6_hextets: int): s = ip.compressed if style == "token": return "[IP]" if style in ("keep24", "keep16"): if ip.version == 4: a, b, c, d = s.split(".") if style == "keep24": return f"{a}.{b}.{c}.xxx" # invalid on purpose (redacted) else: return f"{a}.{b}.xxx.xxx" # invalid on purpose (redacted) # IPv6: keep N hextets, redact the rest as "xxxx" (invalid on purpose) parts = ip.exploded.split(":") # 8 hextets n = max(1, min(8, ipv6_hextets)) return ":".join(parts[:n] + ["xxxx"] * (8 - n)) if style == "hash": if not salt: raise ValueError("--salt is required for style=hash") h = hashlib.sha256((salt + "|" + s).encode("utf-8")).hexdigest() return f"[IP:{h[:12]}]" # short deterministic pseudonym raise ValueError(f"Unknown style: {style}") def redact_line(line: str, style: str, salt: str | None, ipv6_hextets: int, ipv4_only=False, ipv6_only=False): def repl(m: re.Match): full = m.group(0) # figure out address + preserve port/brackets if present if m.group("v6b"): addr = m.group("v6b") port = m.group("pb") or "" ip = _validate_ip(addr) if not ip: return full if ipv4_only and ip.version != 4: return full if ipv6_only and ip.version != 6: return full obf = obfuscate_ip(ip, style, salt, ipv6_hextets) return f"[{obf}]{port}" if m.group("v4"): addr = m.group("v4") port = m.group("p4") or "" ip = _validate_ip(addr) if not ip: return full if ipv6_only and ip.version != 4: return full obf = obfuscate_ip(ip, style, salt, ipv6_hextets) return f"{obf}{port}" addr = m.group("v6v4") or m.group("v6") ip = _validate_ip(addr) if not ip: return full if ipv4_only and ip.version != 4: return full if ipv6_only and ip.version != 6: return full return obfuscate_ip(ip, style, salt, ipv6_hextets) return CANDIDATE_RE.sub(repl, line) def main(): ap = argparse.ArgumentParser(description="Extract / count / redact IPs in a text file (IPv4 + IPv6).") ap.add_argument("input", help="Input file path, or '-' for stdin") ap.add_argument("--mode", choices=["extract", "count", "redact"], default="extract") ap.add_argument("--out", help="Output file for redact mode (default: stdout)") ap.add_argument("--ipv4-only", action="store_true") ap.add_argument("--ipv6-only", action="store_true") ap.add_argument("--style", choices=["token", "keep24", "keep16", "hash"], default="token", help="Redaction style (for --mode redact).") ap.add_argument("--ipv6-hextets", type=int, default=4, help="How many IPv6 hextets to keep for keep24/keep16 (default: 4).") ap.add_argument("--salt", help="Salt for deterministic hashing (required for --style hash).") ap.add_argument("--top", type=int, default=50, help="Top N for count mode (default: 50).") args = ap.parse_args() src = sys.stdin if args.input == "-" else open(args.input, "r", encoding="utf-8", errors="replace") if args.mode == "extract": seen = set() for line in src: for ip in iter_ips_from_line(line, args.ipv4_only, args.ipv6_only): seen.add(ip.compressed) for ip in sorted(seen): print(ip) elif args.mode == "count": c = Counter() for line in src: for ip in iter_ips_from_line(line, args.ipv4_only, args.ipv6_only): c[ip.compressed] += 1 for ip, n in c.most_common(args.top): print(f"{n}\t{ip}") else: # redact out = open(args.out, "w", encoding="utf-8") if args.out else sys.stdout for line in src: out.write( redact_line( line, style=args.style, salt=args.salt, ipv6_hextets=args.ipv6_hextets, ipv4_only=args.ipv4_only, ipv6_only=args.ipv6_only, ) ) if args.out: out.close() if args.input != "-": src.close() if __name__ == "__main__": main()

Examples

Extract unique IPs:

python3 ip_parse.py /path/to/log.txt

Count top IPs:

python3 ip_parse.py /path/to/log.txt --mode count --top 25

Redact everything to a token:

python3 ip_parse.py /path/to/log.txt --mode redact --style token --out redacted.txt

Redact IPv4 to /24-ish (keep first 3 octets), IPv6 keep first 4 hextets:

python3 ip_parse.py /path/to/log.txt --mode redact --style keep24 --out redacted.txt

Deterministic “same IP → same pseudonym” hashing:

python3 ip_parse.py /path/to/log.txt --mode redact --style hash --salt "my_secret_salt" --out redacted.txt

If you tell me what your input looks like (plain logs? UFW blocks? Nginx access logs?), I can tune the patterns to catch edge cases (like weird delimiters, CSV fields, quoted JSON, etc.) while staying strict on false positives.

You said:

that's awesome! make an infographic for my blog my question of this subject matter and your solutions