The last couple of days have been a sprint on LanguageFilter, my Unity package for cleaning up user-generated text. What started as a quick “let’s add a few patches” turned into a full v1.5 release: a security hardening pass, twelve new pattern filters, five new languages and a wordlist refresh, and a brand-new editor importer. Here’s a tour of what landed.
Security & correctness first
Before adding anything new, I went through the existing code with a paranoid hat on:
- ReDoS hardening in WebsitesFilter, RegexFilter, PunctuationFilter, LanguageFilter, and LanguageData – every regex now has a match timeout, and the URL-tail pattern was tightened to eliminate catastrophic backtracking.
- Thread safety – LanguageData’s regex rebuild is now lock-protected and properly invalidated whenever the underlying word list mutates.
- Honest naming – the base64 word-list storage is now documented as obfuscation, not security. Don’t lie to your users.
- Allocation diet – LanguageFilter skips Regex.Replace when there’s no match, and PunctuationFilter no longer slices substrings it doesn’t need.
- Caret preservation – InPlaceFilter (both uGUI and TMP variants) now keeps the caret in place across SetTextWithoutNotify. Tiny thing, big UX win.
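To make the first, second, and fourth bullets concrete, here is a minimal sketch in plain C#. It is not the package's code: the class name WordListMatcherSketch, the 100 ms timeout, and the choice to mask the whole input on a timeout are all assumptions made for illustration. It only shows the general shape: a regex built with a match timeout, a lock around the rebuild, invalidation on mutation, and an IsMatch check before Replace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a word-list matcher; all names are illustrative.
public sealed class WordListMatcherSketch
{
    // Assumed budget; the actual timeout used by the package is not stated.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromMilliseconds(100);

    private readonly object gate = new object();
    private readonly HashSet<string> words =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    private Regex cached; // null means "stale, rebuild on next use"

    public void AddWord(string word)
    {
        if (string.IsNullOrWhiteSpace(word)) return;
        lock (gate)
        {
            // Any successful mutation invalidates the cached regex.
            if (words.Add(word.Trim())) cached = null;
        }
    }

    private Regex GetRegex()
    {
        lock (gate)
        {
            if (cached != null) return cached;
            if (words.Count == 0) return null;

            // Escape every entry so list words are matched literally.
            string alternation = string.Join("|", words.Select(Regex.Escape));
            cached = new Regex(
                @"\b(?:" + alternation + @")\b",
                RegexOptions.IgnoreCase | RegexOptions.CultureInvariant,
                MatchTimeout);
            return cached;
        }
    }

    public string Filter(string input, string replacement = "***")
    {
        if (string.IsNullOrEmpty(input)) return input;

        Regex regex = GetRegex();
        if (regex == null) return input;

        try
        {
            // Only call Replace when there is something to replace.
            return regex.IsMatch(input) ? regex.Replace(input, replacement) : input;
        }
        catch (RegexMatchTimeoutException)
        {
            // Fail closed on pathological input (a design choice of this sketch).
            return replacement;
        }
    }
}
```

Whether to fail closed (mask everything) or fail open (pass the text through) on a timeout is a policy decision; for a moderation tool, failing closed seems the safer default.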
API improvements
- A new FilterChain ScriptableObject lets you compose any number of BaseTextFilter instances into a single pipeline asset – no more bespoke MonoBehaviour glue.
- OnDemandFilter and its TMP twin now expose a UnityEvent<string> onFiltered and a public FilterText() method, so you can drive filtering from any button or game event without touching the input field directly.
- The original “LanguageFilter” filter was renamed to WordListFilter to free up the namespace and stop the type-name collision with the package itself.
- DataFileLoader is now data-driven: the menu items and language lists are generated from a single source-of-truth dictionary instead of dozens of copy-pasted [MenuItem] stubs.
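The chain idea from the first two bullets can be sketched without Unity at all. Everything below is a plain-C# stand-in: TextFilterSketch, DelegateFilterSketch, and FilterChainSketch are hypothetical names, a C# event stands in for the UnityEvent<string>, and the real FilterChain is a ScriptableObject whose API may look quite different.

```csharp
using System;
using System.Collections.Generic;

// Plain-C# stand-in for a filter base class (hypothetical, not the package's BaseTextFilter).
public abstract class TextFilterSketch
{
    public abstract string Apply(string input);
}

// Wraps a lambda so small filters can be declared inline.
public sealed class DelegateFilterSketch : TextFilterSketch
{
    private readonly Func<string, string> transform;

    public DelegateFilterSketch(Func<string, string> transform)
    {
        this.transform = transform ?? throw new ArgumentNullException(nameof(transform));
    }

    public override string Apply(string input) => transform(input);
}

// Runs filters in insertion order; the chain is itself a filter, so chains can nest.
public sealed class FilterChainSketch : TextFilterSketch
{
    private readonly List<TextFilterSketch> filters = new List<TextFilterSketch>();

    // Raised once per Apply call with the final text, loosely mirroring onFiltered.
    public event Action<string> Filtered;

    public FilterChainSketch Add(TextFilterSketch filter)
    {
        if (filter != null) filters.Add(filter);
        return this;
    }

    public override string Apply(string input)
    {
        string current = input ?? string.Empty;
        foreach (TextFilterSketch filter in filters)
        {
            current = filter.Apply(current);
        }
        Filtered?.Invoke(current);
        return current;
    }
}
```

Order matters in a pipeline like this: a normalization step (homoglyphs, zero-width characters) generally needs to run before a word-list step, or the word list never sees the normalized text.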
12 new pattern filters
This is the headline feature. The package now ships dedicated filters for the patterns that show up in real moderation queues:
| Filter | What it catches |
|---|---|
| HomoglyphNormalizationFilter | Cyrillic/Greek lookalikes that bypass word lists (раypal, αpple) |
| ZeroWidthFilter | Invisible Unicode used to break tokenizers |
| PhoneNumberFilter | E.164, NANP, UK 3-4-4, and other groupings |
| CreditCardFilter | Luhn-validated, so it doesn’t false-positive on order numbers |
| OffPlatformLinkFilter | Discord, Telegram, WhatsApp invites |
| SocialHandleFilter | @handles and platform-specific patterns |
| IPAddressFilter | IPv4 and IPv6 |
| CryptoWalletFilter | BTC, ETH, and other common address formats |
| RepeatedCharacterFilter | aaaaaaaaa spam |
| AllCapsFilter | SHOUTING |
| ZalgoFilter | C̷̢̛̮ó̴͜ḿ̷̪b̵̜͝i̶̭͝n̵̦͒i̶̮̇n̷̩̄g̶̜̈ marks |
| ExcessiveWhitespaceFilter | Layout-breaking padding |
Each one ships with a [CreateAssetMenu] entry, a custom inspector, XML docs, and a focused test suite.
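For readers who haven't met the Luhn check mentioned in the CreditCardFilter row, here is a minimal standalone sketch of the checksum. LuhnSketch is a hypothetical name, and the 13 to 19 digit length window and the tolerance for spaces and hyphens are assumptions of this example, not documented behavior of the filter.

```csharp
using System;

public static class LuhnSketch
{
    // Returns true when the digits in `candidate` pass the Luhn checksum.
    // Spaces and hyphens are skipped; any other non-digit rejects the input.
    public static bool IsValid(string candidate)
    {
        if (string.IsNullOrEmpty(candidate)) return false;

        int sum = 0;
        int digitCount = 0;
        bool doubleIt = false; // the rightmost digit is never doubled

        // Walk right to left, doubling every second digit.
        for (int i = candidate.Length - 1; i >= 0; i--)
        {
            char c = candidate[i];
            if (c == ' ' || c == '-') continue;
            if (c < '0' || c > '9') return false;

            int digit = c - '0';
            if (doubleIt)
            {
                digit *= 2;
                if (digit > 9) digit -= 9; // same as summing the two digits
            }

            sum += digit;
            doubleIt = !doubleIt;
            digitCount++;
        }

        // Assumed length window for payment card numbers.
        return digitCount >= 13 && digitCount <= 19 && sum % 10 == 0;
    }
}
```

The well-known test number 4111 1111 1111 1111 passes this check, while an arbitrary 16-digit string such as 1234567890123456 does not, which is what keeps most order numbers from being flagged.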
5 new languages, plus a wordlist refresh
The bundled language data was a mix of crowdsourced lists from various vintages.
The new data set covers 29 languages and ~4.7k normalized entries, all freshly sourced and dedup’d. Most languages roughly doubled in coverage, though the gains vary: Danish jumped from 20 entries to 151, Finnish from 130 to 253, and English from 403 to 490.
Vietnamese in particular was overdue. It’s a major mobile-gaming market, and “we don’t support it” was a recurring piece of feedback.
The wordlist importer
Hand-encoding base64 blobs into a 3000-line C# file every time someone wanted to update a list was, predictably, a nightmare. v1.5 ships a real importer:
Tools → Language Filter → Import Wordlists…
- File picker accepts .txt (one word per line, # comments) or .json (top-level array of strings).
- Pick a target language from a sorted dropdown.
- Side-by-side diff shows added vs removed words before you commit.
- Confirm, and the importer rewrites the relevant block in LanguageDataFactory.cs in place, preserving indentation, brace matching, and all the other blocks.
Under the hood it’s split into a pure-static WordlistImporter helper class (fully unit-tested, 30+ tests covering parse/normalize/diff/encode/rewrite) and a thin EditorWindow shell. The static layer means the same code that powers the GUI can also drive a CI script.
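As a rough illustration of the parse, normalize, and diff steps, here is a small static helper in the same spirit. It is a sketch, not the actual WordlistImporter: the WordlistSketch name and both method signatures are made up for this example, and the normalization rule (trim plus invariant lowercase) and the treatment of # as a whole-line comment marker are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordlistSketch
{
    // Parses .txt content: one word per line; blank lines and lines whose
    // first non-space character is '#' are skipped. Duplicates collapse.
    public static SortedSet<string> ParseTxt(string content)
    {
        var result = new SortedSet<string>(StringComparer.Ordinal);
        if (string.IsNullOrEmpty(content)) return result;

        foreach (string rawLine in content.Split('\n'))
        {
            string line = rawLine.Trim(); // also drops a trailing '\r'
            if (line.Length == 0 || line[0] == '#') continue;

            // Assumed normalization: invariant lowercase.
            result.Add(line.ToLowerInvariant());
        }
        return result;
    }

    // Computes the words an import would add and remove, each sorted ordinally.
    public static void Diff(
        ISet<string> current,
        ISet<string> incoming,
        out List<string> added,
        out List<string> removed)
    {
        added = incoming
            .Where(w => !current.Contains(w))
            .OrderBy(w => w, StringComparer.Ordinal)
            .ToList();
        removed = current
            .Where(w => !incoming.Contains(w))
            .OrderBy(w => w, StringComparer.Ordinal)
            .ToList();
    }
}
```

Because nothing here touches UnityEditor, the same functions can run from an editor window, a unit test, or a command-line CI step, which is the point of keeping the static layer separate from the GUI shell.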
Editor polish
A lot of small editor-side cleanup also landed:
- All asset-creation menu items now sit under a single Language Filter submenu with proper priorities and separator gaps (no more duplicate menus from cross-assembly registration).
- OnValidate clamps invalid inspector values instead of letting them propagate.
- Programmatic asset creation uses unique paths so multiple “Create” clicks no longer overwrite each other.