Problem: Regular Expressions should have more core types
Regular expression recipes live on websites. When you want that U.S. telephone number regular expression, like “((?P<areacode>:\d{3})?\s{0,2}….”, you paste this huge hash of characters into your own regular expressions.
What happens is that regular expressions become de facto lexers. We want them to recognize the various obvious forms of telephone numbers, like “(415) 555 – 5555” or “415.555.5555” and so on. We want the right answer, without rebuilding a recipe library from scratch each time.
Solution: Build an extended regular expression engine that knows the basic recipes.
It would be good to have a regular expression subclass that recognizes additional special sequences and functionality. Previous extensions to regular expressions, notably in Perl, have added core building blocks such as non-greedy patterns, named sub-expressions, repetition matching, and so on. The extensions proposed here would provide easy solutions to common parsing problems. I expect a good set of candidates would include:
- Dates, like “02-03-2007” or “03-Feb-07” or “February 3”.
- Times, like “2:45 a.m. GST” or “10:04:23GMT+3”
- Credit card numbers, with or without spaces.
- Floats, like “+2.234E20”, “3.1415”, and “42”.
- U.S. Phone numbers, like “(301) 342-3222 ext. 2432”
- U.S. Address Lines, like “City of Industry, MN 23423-1322”
- Names, like “Dr. Phillip P. R. Radnov IV, MD”
- Overseas Phone Numbers, like “+23 234 12333”
- Quoted Strings in CSV formats
So, you can see that I’m looking at a lot of common, odd, and exception-prone text processing that straddles the line between lexing and parsing. Recognizing the half dozen ways of writing a number and then returning a number is typically done in a lexer by providing multiple rules. Alternatively, it is done by the parser in an annoyingly repetitive manner. It should be done in a library further down the stack, such as the regular expression library. Too many applications use different recipes, which causes both incompatibilities and bugs.
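As a rough illustration of "recognize the forms, then return a number", here is a minimal sketch using Python's standard `re` module. The `FLOAT` pattern and the `parse_float` helper are names I made up for this example, not part of any existing library, and the pattern only covers the plain decimal and exponent forms listed above.

```python
import re

# Hypothetical recipe: one pattern for the float forms in the list above
# (optional sign, digits with optional fraction, optional exponent).
FLOAT = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")

def parse_float(text):
    """Return the float that `text` spells out, or None if it is not one."""
    m = FLOAT.fullmatch(text.strip())
    return float(m.group(0)) if m else None

# Three accepted spellings and one string that is not a float.
results = [parse_float(s) for s in ("+2.234E20", "3.1415", "42", "4x2")]
```

The point is that the caller gets a number back and never sees the recipe, which is what a shared library of core types would offer.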
One method would be to have these as macros in a regular expression class, and provide a canonical form for post-parsing. Convenience functions would provide access by field, e.g.,
>>> x = re.match(r"(?Date)\s+(?USPhone)", "23-Feb-07 415.234.9902")
>>> print(x.group(1)) # What string matched the Date?
02-23-2007 # See, we substitute in the easy-to-parse date.
>>> print(x.areacode(2)) # Area code from the match of group 2
415
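The macro idea can be approximated today by expanding the macros into ordinary patterns before handing them to Python's `re` module. The sketch below is only that, a sketch: the `RECIPES` table, the `expand` and `match` helpers, and the two deliberately simple patterns are all hypothetical. It does not normalize the date, and it exposes the area code through a named group rather than a convenience method.

```python
import re

# Hypothetical recipe table; names and patterns are illustrative only.
RECIPES = {
    "Date": r"\d{1,2}-(?:[A-Za-z]{3}|\d{1,2})-\d{2,4}",
    "USPhone": r"\(?(?P<areacode>\d{3})\)?[\s.-]*\d{3}[\s.-]*\d{4}",
}

# Matches a macro such as (?Date) or (?USPhone).
MACRO = re.compile(r"\(\?([A-Z][A-Za-z]*)\)")

def expand(pattern):
    """Replace each known (?Name) macro with a capturing group."""
    def substitute(m):
        name = m.group(1)
        if name not in RECIPES:
            return m.group(0)  # leave real syntax such as (?L) untouched
        return "(" + RECIPES[name] + ")"
    return MACRO.sub(substitute, pattern)

def match(pattern, text):
    """Like re.match, but with the macros expanded first."""
    return re.match(expand(pattern), text)

x = match(r"(?Date)\s+(?USPhone)", "23-Feb-07 415.234.9902")
date_text = x.group(1)            # the text matched by (?Date)
phone_text = x.group(2)           # the text matched by (?USPhone)
area_code = x.group("areacode")   # field access via a named group
```

Plain text substitution has obvious limits. Using the same macro twice would define the `areacode` group twice, which `re` rejects, and any capturing group inside a recipe shifts the numbering of the groups after it. A real engine would need to handle both.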
Making the convenience functions run quickly feels like work, but doable. The hard part would be correctly reporting when a regular expression using extensions might give unexpected results. For example, a zip code followed by a number is ambiguous for “20423-1234.234”. Is that (“20423-1234”, “0.234”) or (“20423”, “-1234.234”)? That problem is hard in regular expressions now.
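The ambiguity is easy to reproduce with Python's `re` module as it stands. In this small sketch the `ZIP`, `LAZY_ZIP`, and `FLOAT` patterns are simplified stand-ins I wrote for the example. The only difference between the two runs is whether the optional ZIP+4 suffix is greedy or lazy, and neither run gives any warning that a second reading exists.

```python
import re

# Simplified, hypothetical recipes for a zip code and a float.
ZIP = r"\d{5}(?:-\d{4})?"        # greedy optional ZIP+4 suffix
LAZY_ZIP = r"\d{5}(?:-\d{4})??"  # same, but the suffix is lazy
FLOAT = r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)"

text = "20423-1234.234"

# Greedy: the zip code takes "-1234", leaving ".234" for the float.
greedy = re.fullmatch("(" + ZIP + ")(" + FLOAT + ")", text).groups()

# Lazy: the zip code stops at five digits, so the float gets "-1234.234".
lazy = re.fullmatch("(" + LAZY_ZIP + ")(" + FLOAT + ")", text).groups()
```

Both matches succeed, so the engine silently picks whichever reading the quantifiers favor.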