Cyrillic Homoglyph Evasion — Invisible Characters That Bypass Page Analysis
To a human reader, a ClickFix page appeared to say "Press Windows Button + R". To ClickArmor's regex engine, it said nothing recognizable. The text was written with Cyrillic lookalike characters mixed with invisible and non-standard space characters.
The Technique
The visible page text appeared to say "Press Windows Button + R" and "Press CTRL + V" — standard ClickFix instruction patterns. But the actual characters were a mix of Cyrillic and Latin: Р (U+0420 Cyrillic) instead of P, е (U+0435 Cyrillic) instead of e, ѕ (U+0455 Cyrillic) instead of s, С (U+0421 Cyrillic) instead of C, о (U+043E Cyrillic) instead of o.
Invisible and non-standard space characters (U+2005 four-per-em space; U+FEFF zero-width no-break space, also used as the byte order mark) were also inserted between words. The result looked the same as normal English text to a reader but did not match regex patterns written for ASCII.
// Human sees: Press Windows Button + R
// Regex sees: Рrеѕѕ Wіndоwѕ Вuttоn + R
// Lookalikes include: Р = U+0420, е = U+0435, ѕ = U+0455
// Pattern /press\s+win(dows)?/i → no match
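The mismatch is easier to see once the code points are printed. The following is a minimal sketch in JavaScript; the helper name codePoints and the sample string are illustrative assumptions, not part of ClickArmor, and the lookalikes are written as escapes so they cannot be mistaken for Latin letters.

```javascript
// Illustrative sample, not taken from a real page.
// \u0420 = Cyrillic Р, \u0435 = Cyrillic е, \u0455 = Cyrillic ѕ
const spoofed = "\u0420r\u0435\u0455\u0455 Windows";

// Hypothetical helper: list each character's code point as U+XXXX.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// The lure pattern quoted above.
const lurePattern = /press\s+win(dows)?/i;

console.log(codePoints(spoofed.slice(0, 5)).join(" "));
// U+0420 U+0072 U+0435 U+0455 U+0455
console.log(lurePattern.test(spoofed));         // false
console.log(lurePattern.test("Press Windows")); // true
```

Only the Latin r (U+0072) in the first word is a character the pattern expects, so the literal "press" never matches.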
How We Added Detection
A normalizeText() function was added that runs before every regex scan in the page analyzer. It performs four operations in sequence:
1. Zero-width character stripping — removes U+200B through U+200F, U+2060, U+FEFF, and other invisible formatting characters that break word boundaries.
2. Exotic whitespace normalization — converts U+2005 (four-per-em space), U+2003 (em space), U+00A0 (non-breaking space), and other Unicode space variants to standard ASCII spaces.
3. NFKD decomposition, which applies Unicode Normalization Form KD to split precomposed characters into their base + combining mark form, followed by stripping of the combining diacritical marks (U+0300–U+036F).
4. Cyrillic and Greek homoglyph mapping — 40+ character substitutions mapping lookalike Cyrillic and Greek characters to their ASCII equivalents. Covers uppercase and lowercase variants, including less common mappings like Ѕ (U+0405→S), І (U+0406→I), Ї (U+0407→I), and Υ (U+03A5→Y).
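The four steps above can be sketched as follows. This is a minimal illustration, not ClickArmor's actual implementation: the HOMOGLYPHS table is a small assumed subset of the 40+ mappings, and the exact character ranges in each regex are assumptions based on the description.

```javascript
// Assumed subset of the homoglyph table; the real table is larger.
const HOMOGLYPHS = new Map([
  ["\u0410", "A"], ["\u0412", "B"], ["\u0421", "C"], ["\u0415", "E"],
  ["\u041E", "O"], ["\u0420", "P"], ["\u0405", "S"], ["\u0406", "I"],
  ["\u0430", "a"], ["\u0441", "c"], ["\u0435", "e"], ["\u043E", "o"],
  ["\u0440", "p"], ["\u0455", "s"], ["\u0456", "i"],
  ["\u03A5", "Y"], // Greek capital upsilon
]);

function normalizeText(text) {
  return text
    // 1. Strip zero-width and invisible formatting characters.
    .replace(/[\u200B-\u200F\u2060\uFEFF]/g, "")
    // 2. Convert Unicode space variants to an ASCII space.
    .replace(/[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/g, " ")
    // 3. Decompose (NFKD), then drop combining diacritical marks.
    //    This also reduces U+0407 (Ї) to U+0406 (І) before step 4.
    .normalize("NFKD")
    .replace(/[\u0300-\u036F]/g, "")
    // 4. Map Greek and Cyrillic lookalikes to ASCII; leave unmapped ones as-is.
    .replace(/[\u0370-\u03FF\u0400-\u04FF]/g, (ch) => HOMOGLYPHS.get(ch) ?? ch);
}

// Sample built from escapes: lookalikes plus U+200B and U+2005 between words.
const sample =
  "\u0420r\u0435\u0455\u0455\u200B\u2005W\u0456nd\u043Ew\u0455 \u0412utt\u043En + R";

console.log(normalizeText(sample)); // Press Windows Button + R
console.log(/press\s+win(dows)?/i.test(normalizeText(sample))); // true
```

The order matters in this sketch: stripping invisible characters first keeps them from splitting words, and running NFKD before the mapping lets accented lookalikes fall through to their unaccented table entries.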
After normalization, the page text becomes standard ASCII — "Press Windows Button + R" — and the existing lure phrase regexes match immediately.
The normalization layer runs on all page text before every scan: lure phrases, fake CAPTCHA detection, fake error detection, fake browser update detection, and instruction image analysis. Because every detector reads the same normalized text, this single addition closed off cross-script homoglyph evasion, for the mapped Cyrillic and Greek characters, across all detection layers at once.
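One way to picture "normalize once, scan everywhere" is a small pipeline that cleans the text and hands the same string to every detector. The sketch below is an assumption about the shape of such a pipeline: the detector names, the second pattern, scanPage, and demoNormalize are all hypothetical, and demoNormalize is a stand-in that handles only the characters in this sample.

```javascript
// Hypothetical detector table; names and the second pattern are illustrative.
const DETECTORS = [
  { name: "lure-phrase", pattern: /press\s+win(dows)?/i },
  { name: "paste-instruction", pattern: /press\s+ctrl\s*\+\s*v/i },
];

// Normalize the raw text once, then run every detector on the result.
function scanPage(rawText, normalize, detectors = DETECTORS) {
  const text = normalize(rawText);
  return detectors.filter((d) => d.pattern.test(text)).map((d) => d.name);
}

// Stand-in normalizer for this demo only (not a full homoglyph table).
const demoNormalize = (t) =>
  t
    .replace(/\uFEFF/g, "")
    .replace(/\u0420/g, "P")
    .replace(/\u0421/g, "C")
    .replace(/\u0435/g, "e")
    .replace(/\u0455/g, "s");

const raw =
  "\u0420r\u0435\u0455\u0455 Windows + R, then \u0420r\u0435\u0455\u0455 \u0421TRL + V";

console.log(scanPage(raw, (t) => t));      // []
console.log(scanPage(raw, demoNormalize)); // [ 'lure-phrase', 'paste-instruction' ]
```

Keeping normalization in one place means a newly discovered lookalike needs only a table entry, not a change to each detector's patterns.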