January 2008

Extended Regular Expressions


Problem:   Regular Expressions should have more core types

Regular expression recipes live on websites.   When you want that U.S. telephone number regular expression, like “((?P<areacode>:\d{3})?\s{0,2}….”, you end up pasting a huge hash of symbols into your own regular expressions.

What happens is that regular expressions become de facto lexers.  We want them to recognize the various obvious forms of telephone numbers, like “(415) 555 – 5555” or “415.555.5555” and so on.   We want the right answer, without rebuilding a recipe library from scratch each time.

Solution:  Build an extended regular expression engine that knows the basic recipes.

It would be good to have a regular expression subclass that recognizes additional special characters and functionality.   Previous extensions to regular expressions, notably in Perl, added core building blocks such as non-greedy patterns, named sub-expressions, repetition matching, and so on.  The extensions proposed here would provide easy solutions to common parsing problems.  I expect a good set of candidates would include:

  • Dates, like “02-03-2007” or “03-Feb-07” or “February 3”.
  • Times, like “2:45 a.m. GST” or “10:04:23GMT+3”
  • Credit card numbers, with or without spaces.
  • Floats, like “+2.234E20”, “3.1415”, and “42”.
  • U.S. Phone numbers, like “(301) 342-3222 ext. 2432”
  • U.S. Address Lines, like “City of Industry, MN 23423-1322”
  • Names, like “Dr. Phillip P. R. Radnov IV, MD”
  • Overseas Phone Numbers, like “+23 234 12333”
  • Quoted Strings in CSV formats

So, you can see that I’m looking at a lot of common, odd, and exception-prone text processing that straddles the line between lexing and parsing.  Recognizing the half dozen ways of writing a number and then returning a number is typically done in a lexer by providing multiple rules.   Alternatively, it is done by the parser in an annoyingly repetitive manner.   It should be done in a library further down the stack, such as the regular expression library.  Too many applications use different recipes, which causes both incompatibilities and bugs.
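To make the "recognize a number, return a number" idea concrete, here is a minimal sketch using Python's standard re module. The recipe pattern and the name parse_float are my own assumptions for illustration, not part of any existing library, and the pattern covers only the float forms listed above.

```python
import re

# Hypothetical "Float" recipe: optional sign, digits with an optional
# fractional part (or a bare fraction such as ".5"), optional exponent.
FLOAT_RE = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")

def parse_float(text):
    """Return text as a float if all of it matches the recipe, else None."""
    if FLOAT_RE.fullmatch(text) is None:
        return None
    return float(text)

print(parse_float("+2.234E20"))  # 2.234e+20
print(parse_float("42"))         # 42.0
print(parse_float("4 2"))        # None
```

The point is that the caller gets a number back and never sees the recipe, so every application would share the same definition of what a float looks like.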

One method would be to have these as macros in a regular expression class, and provide a canonical form of each match for post-parse processing.   Convenience functions would provide access by field, e.g.,

>>> x = re.match(r"(?Date)\s+(?USPhone)", "23-Feb-07  415.234.9902")
>>> print x.group(1)   # What string matched the Date?
02-23-2007         # See, we substitute in the easy-to-parse date.
>>> print x.areacode(2)   # Area code from the match of group 2
415
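The macro idea can be prototyped on top of the standard re module by expanding each macro into a named group before compiling. The sketch below is only that: the RECIPES table, the expand helper, and the two deliberately narrow recipe patterns are assumptions of mine, and it returns the raw matched text rather than a canonical form.

```python
import re

# Hypothetical recipe table; each entry covers only a few written forms.
RECIPES = {
    # Dates such as 23-Feb-07 only.
    "Date": r"\d{1,2}-[A-Za-z]{3}-\d{2}",
    # Phones such as (415) 234-9902, 415.234.9902, 415 234 9902.
    "USPhone": r"\(?(?P<areacode>\d{3})\)?[\s.-]*\d{3}[\s.-]*\d{4}",
}

# A macro looks like (?Name), where Name starts with a capital letter.
MACRO = re.compile(r"\(\?([A-Z][A-Za-z]*)\)")

def expand(pattern):
    """Rewrite each known (?Name) macro as a named group holding its recipe."""
    def substitute(m):
        name = m.group(1)
        if name not in RECIPES:
            return m.group(0)  # leave real syntax such as (?L) alone
        return "(?P<%s>%s)" % (name, RECIPES[name])
    return MACRO.sub(substitute, pattern)

x = re.match(expand(r"(?Date)\s+(?USPhone)"), "23-Feb-07  415.234.9902")
print(x.group(1))           # 23-Feb-07
print(x.group(2))           # 415.234.9902
print(x.group("areacode"))  # 415
```

One limitation of this approach: because each macro becomes a named group, using the same macro twice in one pattern would raise an error for the duplicate group name. A real engine would need its own bookkeeping for fields.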

Making this run quickly, at least for the convenience functions, feels like work but doable.   The hard part would be correctly reporting when a regular expression using extensions might give unexpected results.   For example, a zip code followed by a number is ambiguous for “20423-1234.234”.  Is that (“20423-1234”, “0.234”) or (“20423”, “-1234.234”)?   That problem is already hard in regular expressions today.
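The ambiguity is easy to reproduce with plain re. In this sketch the zip and float patterns are simplified stand-ins of my own (the float has no exponent), and the only difference between the two runs is whether the optional "+4" part of the zip code is greedy or lazy.

```python
import re

FLOAT = r"[+-]?\d*\.?\d+"           # simplified float recipe (assumption)
ZIP_GREEDY = r"\d{5}(?:-\d{4})?"    # prefer to consume the +4 extension
ZIP_LAZY = r"\d{5}(?:-\d{4})??"     # prefer to skip the +4 extension

def split(zip_recipe, text):
    """Match a zip code immediately followed by a float; return both pieces."""
    m = re.fullmatch("(%s)(%s)" % (zip_recipe, FLOAT), text)
    return m.groups() if m else None

text = "20423-1234.234"
print(split(ZIP_GREEDY, text))  # ('20423-1234', '.234')
print(split(ZIP_LAZY, text))    # ('20423', '-1234.234')
```

Both readings are valid matches of the whole string, so the engine silently picks one based on greediness. An extended engine would have to notice that a second parse exists in order to warn about it.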


OLPC Scrabble with Definitions


This is a cool feature I have yet to see implemented.

Problem: Scrabble is a good way to learn spelling, but people make words they don’t understand.

Scrabble-style games encourage spelling by having people put together letters in odd combinations. For less strenuous play, most computer versions of the game dispense with the ‘challenge a misspelled word’ option. This leads players to explore new words by trial and error. When a young player tries to put an ‘H’ after ‘PIT’, they discover a new word.

Solution: Provide definitions of the word inside the game.

Tool tips provide one simple solution. Hovering the cursor over a word would pop up a tool tip containing the definition of the word, as helpfully provided by Wiktionary. This provides another level of passive learning in the same experience.
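As a rough sketch of the lookup behind such a tool tip, the function below maps a played word to its tool-tip text. The in-memory DEFINITIONS table is a stand-in I made up for definitions that would really come from Wiktionary, and tooltip_for is a hypothetical name.

```python
# Stand-in for definitions fetched from Wiktionary (illustrative entries).
DEFINITIONS = {
    "PIT": "a hole in the ground",
    "PITH": "the soft, spongy tissue inside a plant stem",
}

def tooltip_for(word, definitions=DEFINITIONS):
    """Return the tool-tip text for a played word, or None if it is unknown."""
    key = word.upper()
    if key not in definitions:
        return None
    return "%s: %s" % (key, definitions[key])

print(tooltip_for("pith"))  # PITH: the soft, spongy tissue inside a plant stem
```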

For advanced players, one could play against someone speaking a different language. Words from both players’ languages would be accepted, and the tool tips would provide spelling and pronunciation hints. While one could imagine triggering videos, such as matching Colingo videos, most additional help would come from chatting with the other player.

Lots of possibilities.
