ReMastering Regular Expressions

Let’s have a look at the source of the perl module Email::Valid and search for „Mastering Regular Expressions“. You will find Jeffrey Friedl’s regex, which matches all valid email adresses – and nothing more. It seems so complicated, that no one could validate it by just looking at the regex. I will show you, how to disenchant it.

The disenchanting is based on the following rule:
Let us have $a=qr/foobar/, $b=qr/foo/, $c=qr/bar/ and $d=qr/$b$c/. $string=~$d behaves exactly as $string=~$a.

Now open Jeffrey Friedl’s regex in an editor of your choice and start to replace some strings. Be careful to replace them literally „15“ means \ and 0 and 1 and 5 instead of „\r“ or chr(0) or whatever.

Search for

Replace with

Count

\\\x80-\xff\n15 $c 128
[^(40)@,;:“.\\\[\]00-37\x80-\xff] $d 26
\\[^\x80-\xff] $e 68
\([^$c()]*(?:(?:$e|\([^$c()]*(?:$e[^$c()]*)*\))[^$c()]*)*\) $f 27
[40\t]*(?:$f[40\t]*)* $g 26
„[^$c“]*(?:$e[^$c“]*)*“ $h 6
\[(?:[^$c\[\]]|$e)*\] $k 8
[^()@,;:“.\\\[\]\x80-\xff00-1012-37] $m 2
(?:$d+(?!$d)|$h) $n 5
$g(?:$d+(?!$d)|$k)$g $o 8
@$o(?:\.$o)* $p 4
$n$g(?:\.$g$n$g)*$p $q 2
$g(?:$q|$n$m*(?:(?:$f|$h)$m*)*) $s 1

After this replace-marathon your editor should show only „$s“.
The original regex contains 6599 characters. If you save the content of the variables this way:

$p = qr/\@$o(?:\.$o)*/;

you will end with $s containing 8129 characters of which are around 900 qr//-overhead (as „-xism“) and a script with total size of 514 bytes for exactly the same thing.
Those, who have already greater experience with regular expression, should have already seen the possibilities to optimize most of the regexes in size. I will show my optimizations. But the main goal is already reached: You have a small set of small regular expressions. And each small expression is understandable for itself.

Optimized Expression

my $c = ‚[:^ascii:]\\n\\r\\\\‘;
my $d = qr([\w!#\$%&’*+-/=?^`{|}~\x7F]);
my $e = qr/\\[[:ascii:]]/;
my $g = qr/(?:$f|[\x20\t])*/;
my $h = qr/\“(?:$e|[^$c\“])*\“/;
my $k = =qr/\[(?:$e|[^$c\[\]])*\]/
my $f = qr/\((?:$e|[^$c()]|\((?:$e|[^$c()])*\))*\)/;
my $m = qr/$d|[\x20\t]/;
my $n = qr/$d+(?!$d)|$h/;
my $o = qr/$g(?:$d+(?!$d)|$k)$g/;
my $p = qr/\@$o(?:\.$o)*/;
my $q = qr/$n$g(?:\.$g$n$g)*$p/;
my $s = qr/$g$q|$g$n(?:$f|$h|$m)*/;

There are of course more optimizations possible – try it yourself. The length of my $s is 5144 („-xism“s deleted). The code uses 457 bytes without comments. (Remember: We originally dealt with 6599 characters pure regex.)

Of course I was supported by a little script. Feel free to ask me, if you are interested.

Comments are closed.