Shipping LanguageFilter v1.5: 12 New Filters, 5 New Languages, and a Wordlist Importer

The last couple of days have been a sprint on LanguageFilter, my Unity package for cleaning up user-generated text. What started as a quick “let’s add a few patches” turned into a full v1.5 release: a security hardening pass, twelve new pattern filters, five new languages plus a wordlist refresh, and a brand-new editor importer. Here’s a tour of what landed.

Security & correctness first

Before adding anything new, I went through the existing code with a paranoid hat on:

  • ReDoS hardening in WebsitesFilter, RegexFilter, PunctuationFilter, LanguageFilter, and LanguageData – every regex now has a match timeout, and the URL-tail pattern was tightened to eliminate catastrophic backtracking.
  • Thread safety – LanguageData’s regex rebuild is now lock-protected and properly invalidated whenever the underlying word list mutates.
  • Honest naming – the base64 word-list storage is now documented as obfuscation, not security. Don’t lie to your users.
  • Allocation diet – LanguageFilter skips Regex.Replace when there’s no match, and PunctuationFilter no longer slices substrings it doesn’t need.
  • Caret preservation – InPlaceFilter (both uGUI and TMP variants) now keeps the caret in place across SetTextWithoutNotify. Tiny thing, big UX win.
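
For anyone curious what a match timeout looks like in practice, here is a minimal sketch using the standard .NET Regex constructor that takes a TimeSpan. It is not the package's actual code: the SafeRegex class, the ReplaceOrPassThrough method, and the 50 ms budget are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the LanguageFilter API.
public static class SafeRegex
{
    // Illustrative budget only; the package's real timeout value is not stated here.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromMilliseconds(50);

    public static string ReplaceOrPassThrough(string input, string pattern, string replacement)
    {
        if (string.IsNullOrEmpty(input)) return input;

        // The third constructor argument caps how long a single match attempt may run.
        var regex = new Regex(pattern, RegexOptions.CultureInvariant, MatchTimeout);
        try
        {
            // Only call Replace when something actually matches.
            return regex.IsMatch(input) ? regex.Replace(input, replacement) : input;
        }
        catch (RegexMatchTimeoutException)
        {
            // Policy choice for this sketch: give up and return the input unchanged.
            // A stricter filter might reject the text instead.
            return input;
        }
    }
}
```

What to do on a timeout is a design decision: passing the text through keeps the UI responsive, while rejecting it is safer for moderation. The sketch picks the former only to keep the example short.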

API improvements

  • A new FilterChain ScriptableObject lets you compose any number of BaseTextFilter instances into a single pipeline asset – no more bespoke MonoBehaviour glue.
  • OnDemandFilter and its TMP twin now expose a UnityEvent<string> onFiltered and a public FilterText() method, so you can drive filtering from any button or game event without touching the input field directly.
  • The original “LanguageFilter” filter was renamed to WordListFilter to free up the namespace and stop the type-name collision with the package itself.
  • DataFileLoader is now data-driven: the menu items and language lists are generated from a single source-of-truth dictionary instead of dozens of copy-pasted [MenuItem] stubs.
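
The composition idea behind FilterChain can be sketched in plain C#, leaving out the ScriptableObject plumbing. Everything named here (ITextFilter, Apply, DelegateFilter, FilterPipeline) is a hypothetical stand-in; the real BaseTextFilter signature may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the package's BaseTextFilter.
public interface ITextFilter
{
    string Apply(string input);
}

// Wraps a lambda so small filters can be written inline for this example.
public sealed class DelegateFilter : ITextFilter
{
    private readonly Func<string, string> _transform;

    public DelegateFilter(Func<string, string> transform)
    {
        _transform = transform ?? throw new ArgumentNullException(nameof(transform));
    }

    public string Apply(string input) => _transform(input);
}

// Runs each filter in order, feeding the output of one into the next.
public sealed class FilterPipeline : ITextFilter
{
    private readonly List<ITextFilter> _filters = new List<ITextFilter>();

    public FilterPipeline Add(ITextFilter filter)
    {
        _filters.Add(filter ?? throw new ArgumentNullException(nameof(filter)));
        return this;
    }

    public string Apply(string input)
    {
        var text = input ?? string.Empty;
        foreach (var filter in _filters)
        {
            text = filter.Apply(text);
        }
        return text;
    }
}
```

Because the pipeline is itself a filter in this sketch, chains can nest: a chain that trims and then upper-cases turns "  hi " into "HI", and that chain can be added to another one as a single step.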

12 new pattern filters

This is the headline feature. The package now ships dedicated filters for the patterns that show up in real moderation queues:

Filter                        What it catches
HomoglyphNormalizationFilter  Cyrillic/Greek lookalikes that bypass word lists (раypal, αpple)
ZeroWidthFilter               Invisible Unicode used to break tokenizers
PhoneNumberFilter             E.164, NANP, UK 3-4-4, and other groupings
CreditCardFilter              Luhn-validated, so it doesn’t false-positive on order numbers
OffPlatformLinkFilter         Discord, Telegram, WhatsApp invites
SocialHandleFilter            @handles and platform-specific patterns
IPAddressFilter               IPv4 and IPv6
CryptoWalletFilter            BTC, ETH, and other common address formats
RepeatedCharacterFilter       aaaaaaaaa spam
AllCapsFilter                 SHOUTING
ZalgoFilter                   C̷̢̛̮ó̴͜ḿ̷̪b̵̜͝i̶̭͝n̵̦͒i̶̮̇n̷̩̄g̶̜̈ marks
ExcessiveWhitespaceFilter     Layout-breaking padding

Each one ships with a [CreateAssetMenu] entry, a custom inspector, XML docs, and a focused test suite.
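
To show why Luhn validation cuts down on false positives, here is a minimal sketch of the checksum itself: walk the digits from the right, double every second one, subtract 9 from any result above 9, and require the total to be divisible by 10. This is the standard algorithm, not CreditCardFilter's actual code. The Luhn class name, the 13 to 19 digit length bounds, and the decision to skip spaces and hyphens are assumptions.

```csharp
// Hypothetical helper illustrating the Luhn checksum; not the package's implementation.
public static class Luhn
{
    public static bool IsValid(string candidate)
    {
        if (candidate == null) return false;

        int sum = 0;
        int digitCount = 0;
        bool doubleNext = false; // the rightmost digit is not doubled

        for (int i = candidate.Length - 1; i >= 0; i--)
        {
            char c = candidate[i];
            if (c == ' ' || c == '-') continue;   // assumed separators
            if (c < '0' || c > '9') return false; // anything else disqualifies the candidate

            int digit = c - '0';
            if (doubleNext)
            {
                digit *= 2;
                if (digit > 9) digit -= 9;
            }

            sum += digit;
            doubleNext = !doubleNext;
            digitCount++;
        }

        // Assumed length bounds for payment card numbers.
        return digitCount >= 13 && digitCount <= 19 && sum % 10 == 0;
    }
}
```

The widely used test number 4111 1111 1111 1111 passes this check, while changing its last digit to 2 makes it fail. A random 16-digit order number has only about a one-in-ten chance of passing, which is where the reduction in false positives comes from.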

5 new languages, plus a wordlist refresh

The bundled language data was a mix of crowdsourced lists from various vintages.

The new data set covers 29 languages and ~4.7k normalized entries, all freshly sourced and deduplicated. Most languages roughly doubled in coverage, though the gains vary: Danish jumped from 20 entries to 151, Finnish from 130 to 253, and English from 403 to 490.

Vietnamese in particular was overdue. It’s a major mobile-gaming market, and “we don’t support it” was a recurring piece of feedback.

The wordlist importer

Hand-encoding base64 blobs into a 3000-line C# file every time someone wanted to update a list was, predictably, a nightmare. v1.5 ships a real importer:

Tools → Language Filter → Import Wordlists…

  • File picker accepts .txt (one word per line, # comments) or .json (top-level array of strings).
  • Pick a target language from a sorted dropdown.
  • Side-by-side diff shows added vs removed words before you commit.
  • Confirm, and the importer rewrites the relevant block in LanguageDataFactory.cs in place – preserving indentation, brace matching, and all the other blocks.
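
The parse and diff steps above can be sketched roughly as follows. This is not the real WordlistImporter: the WordlistSketch class and its method names are hypothetical, and the normalization shown (trim, lower-case, de-duplicate, whole-line # comments only) is an assumption about what “normalized” means here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the .txt parse and diff steps; not the package's WordlistImporter.
public static class WordlistSketch
{
    // One word per line; blank lines and lines starting with '#' are skipped.
    public static List<string> ParseText(string contents)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var words = new List<string>();

        foreach (var rawLine in (contents ?? string.Empty).Split('\n'))
        {
            var line = rawLine.Trim(); // also drops a trailing '\r' from Windows line endings
            if (line.Length == 0 || line[0] == '#') continue;

            var word = line.ToLowerInvariant(); // assumed normalization
            if (seen.Add(word)) words.Add(word); // keep the first occurrence only
        }

        return words;
    }

    // Words present only in the incoming list are "added";
    // words present only in the current list are "removed".
    public static (List<string> Added, List<string> Removed) Diff(
        IEnumerable<string> current, IEnumerable<string> incoming)
    {
        var currentSet = new HashSet<string>(current, StringComparer.Ordinal);
        var incomingSet = new HashSet<string>(incoming, StringComparer.Ordinal);

        var added = incomingSet.Except(currentSet).OrderBy(w => w, StringComparer.Ordinal).ToList();
        var removed = currentSet.Except(incomingSet).OrderBy(w => w, StringComparer.Ordinal).ToList();
        return (added, removed);
    }
}
```

Keeping both steps as pure functions over strings and lists is what makes them easy to unit-test without opening the editor.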

Under the hood it’s split into a pure-static WordlistImporter helper class (fully unit-tested, 30+ tests covering parse/normalize/diff/encode/rewrite) and a thin EditorWindow shell. The static layer means the same code that powers the GUI can also drive a CI script.

Editor polish

A lot of small editor-side cleanup also landed:

  • All asset-creation menu items now sit under a single Language Filter submenu with proper priorities and separator gaps (no more duplicate menus from cross-assembly registration).
  • OnValidate clamps invalid inspector values instead of letting them propagate.
  • Programmatic asset creation uses unique paths so multiple “Create” clicks no longer overwrite each other.