Entries Tagged 'Coding' ↓
January 3rd, 2008 — Coding, Idea, Ideas, Invention, Libraries

Problem: Regular Expressions should have more core types
Regular expression recipes live on websites. When you want that U.S. telephone number regular expression, like “((?P<areacode>:\d{3})?\s{0,2}….”, you paste this huge hash of symbols into your regular expressions.
What happens is that regular expressions become de facto lexers. We want them to recognize the various obvious forms of telephone numbers, like “(415) 555 – 5555” or “415.555.5555” and so on. We want the right answer, without rebuilding a recipe library from scratch each time.
Solution: Build an extended regular expression engine that knows the basic recipes.
It would be good to have a regular expression subclass that recognizes additional special characters and functionality. Previous extensions in regular expressions, notably in Perl, have added core building blocks such as less greedy patterns, named sub-expressions, repetition matching, and so on. These extensions would provide easy access to common parsing problems. I expect a good set of candidates would include:
- Dates, like “02-03-2007” or “03-Feb-07” or “February 3”.
- Times, like “2:45 a.m. GST” or “10:04:23GMT+3”
- Credit card numbers, with or without spaces.
- Floats, like “+2.234E20”, “3.1415”, and “42”.
- U.S. Phone numbers, like “(301) 342-3222 ext. 2432”
- U.S. Address Lines, like “City of Industry, MN 23423-1322”
- Names, like “Dr. Phillip P. R. Radnov IV, MD”
- Overseas Phone Numbers, like “+23 234 12333”
- Quoted Strings in CSV formats
- Quoted Strings in CSV formats
So, you can see that I’m looking at a lot of common, odd, and exception prone text processing that straddles the line between lexing and parsing. Recognizing the half dozen forms of writing a number and then returning a number is typically done in a lexer by providing multiple rules. Alternately, it is done by the parser in an annoyingly repetitive manner. It should be done in a library further down, such as regular expressions. Too many applications use different recipes and cause both incompatibilities and bugs.
One method would be to have these as macros in a regular expression class, and provide a canonical form for post-parsing. Convenience functions would provide access by field, e.g.,
>>> x = re.match("(?Date)\s+(?USPhone)", "23-Feb-07 415.234.9902")
>>> print x.group(1) # What string matched the Date?
02-23-2007 # See, we substitute in the easy to parse date.
>>> print x.areacode(2) # Areacode from match of group 2
415
It feels like work, but doable, to make this run quickly. That is, for the convenience functions to run quickly. The hard part would be correctly reporting when a regular expression using extensions might give unexpected results. For example, a zip code followed by a number is ambiguous for “20423-1234.234”. Is that (“20423-1234”, “0.234”) or (“20423”, “-1234.234”)? That problem is hard in regular expressions now.
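A minimal sketch of the macro idea, assuming nothing beyond Python's standard re module. The RECIPES table, the expand() and match() helpers, and both recipe patterns are hypothetical illustrations, and the patterns are far simpler than real dates and phone numbers require.

```python
import re

# Hypothetical recipe table: macro name -> ordinary regular expression.
# These two patterns are deliberately simplified for illustration.
RECIPES = {
    "Date": r"\d{1,2}-(?:\d{1,2}|[A-Za-z]{3})-\d{2,4}",
    "USPhone": r"\(?(?P<areacode>\d{3})\)?[\s.-]{0,2}\d{3}[\s.-]{0,3}\d{4}",
}

# Matches a macro such as (?Date); (?P<name>...) is not matched because
# the macro name must be followed immediately by the closing parenthesis.
_MACRO = re.compile(r"\(\?([A-Z][A-Za-z]*)\)")


def expand(pattern):
    """Replace each (?Name) macro with a capturing group holding its recipe."""
    def substitute(found):
        return "(" + RECIPES[found.group(1)] + ")"
    return _MACRO.sub(substitute, pattern)


def match(pattern, text):
    """Like re.match, but the pattern may contain recipe macros."""
    return re.match(expand(pattern), text)


m = match(r"(?Date)\s+(?USPhone)", "23-Feb-07 415.234.9902")
print(m.group(1))           # the text matched by the Date macro
print(m.group("areacode"))  # a field exposed by the USPhone recipe
```

One limit of this sketch: a recipe containing a named group can appear only once per pattern, since Python rejects duplicate group names. A real engine would need to rename fields per occurrence.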
January 3rd, 2008 — Coding, Idea, Ideas


This is a cool feature I have yet to see implemented.
Problem: Scrabble is a good way to learn spelling, but people make words they don’t understand.
Scrabble-style games encourage spelling by having people put together letters in odd combinations. For less strenuous play, most computer versions of the game dispense with the ‘challenge a misspelled word’ option. This leads players to explore new words by trial and error. When a young player tries to put an ‘H’ after ‘PIT’, they discover a new word.
Solution: Provide definitions of the word inside the game.
Tool tips provide one simple solution. Hovering the cursor over a word would pop up a tool tip containing the definition of the word, as helpfully provided by Wiktionary. This provides another level of passive learning in the same experience.
For advanced players, one could play against someone speaking a different language. Words of both players’ languages would be accepted, and the tool tips would provide spelling and pronunciation hints. While one could imagine triggering off videos, such as matching Colingo videos, most additional help would come from chatting with the other player.
Lots of possibilities.
December 28th, 2007 — Coding, Idea, Ideas

Problem: Installing and getting going on OLPC’s XO development takes time.
Getting the environment to work, getting the libraries installed and linked, and all the other cruft of systems administration takes time on the XO. Each application is written in Python and then ‘sugarized’. Then you move it over to the XO, then you run it.
This can create a real hurdle for people who want to play with the OLPC.
Solution: Provide a premade virtual XO development environment.
What if we ran a server that provides the environment, backed by an open Subversion version control system? When you check in a new file to your project, a build bot compiles it in the correct environment, runs pychecker and other tools, sugarizes, runs unit tests, and then makes available a binary, a report, and a screen-movie of the run.
So, good idea, bad idea?
December 27th, 2007 — Coding, Idea, Ideas, Invention, Products, Web

Problem: The One Laptop Per Child (OLPC) XO Laptop could get viruses
The XO laptop will be a perfect breeding ground for viruses. The system is monolithic, with an expectation of identical hardware and operating systems being deployed across entire countries. IT support is expected to be non-existent. The machine networks heavily and promiscuously.
The primary defense against the spread of viruses is the built-in BitFrost security system. This assemblage uses virtual machines to restrict certain combinations of rights and attempts to prevent some of the higher harm viruses from occurring. One can expect many installations to disable any security measures found burdensome, as occurred with the rights based security in J2EE.
Solution: Make a bootable, read only, USB key to validate that all installed software is open source.
On the other hand, the OLPC has a few truly special features that allow effective anti-virus software to exist. All, or almost all, software on an OLPC is expected to be open source. It becomes reasonable to validate that all changes to configurations, installed applications, and binaries come from a known list of ‘good’ applications. While “enumerating badness” is a doomed strategy, enumerating allowed systems is possible.
Consider a repository that keeps track of all known contributed applications and language packs in source form. An automated script builds these into sugarized files and keeps track of the file system effects from installing these onto virtual machines. The results are placed onto a self-booting USB key that boots up, examines the file system of the XO, and declares it to be clean or infected.
This does require some tricky programming, mostly in reducing the combinatorially exploding set of possible configurations into a linearly growing set of rules for allowable configurations. The problem is one of engineering and does not require a perfect implementation.
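The "examine the file system" step could look roughly like the sketch below, which checks every file against an allowlist of content hashes. The allowlist format (relative path mapped to a set of SHA-256 digests) and the names file_digest() and scan() are assumptions made for illustration; how the build script would produce the allowlist is not shown.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of one file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root, allowed):
    """Return the relative paths under root that are not on the allowlist.

    allowed maps a relative path to the set of digests that known-good
    builds are permitted to leave at that path (a hypothetical format).
    """
    suspect = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            relative = os.path.relpath(full, root)
            if file_digest(full) not in allowed.get(relative, set()):
                suspect.append(relative)
    return sorted(suspect)
```

An empty result means every file matched something a known-good build produces; anything else is either an infection or software the repository has not seen yet.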
This feels like a good PhD Thesis for someone.
December 21st, 2007 — Coding, Idea, Ideas, Products, Web

Problem: Ubuntu troubleshooting is scattered about useless message boards
Solution: Rate messages for useful information
Ubuntu, the popular Linux distribution, discusses problems in message boards. Someone poses a problem and various people pose possible solutions. The total volume means that great posts detailing the actual way to set environment variables or handle XYZ Mark 32.42 video cards are lost. Users spend way too much time digging through chaff.
One simple solution is to rate the value of posts, à la Slashdot and others. As a user is browsing through the discussion forum, she can tag an answer as especially clear and well written. Alternatively, she could tag a message as a question or otherwise content-free. Later users might set a minimum threshold to search for well-written explanations.
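As a rough sketch of the threshold idea, assuming each reader vote is stored as an integer score: the Post class, its fields, and useful_posts() are all hypothetical names, and averaging is only one of several reasonable ways to combine votes.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    author: str
    text: str
    ratings: list = field(default_factory=list)  # one integer score per vote

    def score(self):
        """Average rating, or 0.0 for a post nobody has rated yet."""
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


def useful_posts(posts, minimum):
    """Keep posts at or above the reader's threshold, best first."""
    kept = [post for post in posts if post.score() >= minimum]
    return sorted(kept, key=Post.score, reverse=True)


thread = [
    Post("asker", "How do I set an environment variable?", [-1, 0]),
    Post("helper", "Add an export line to ~/.profile and log in again.", [5, 4]),
    Post("passerby", "Me too.", []),
]
for post in useful_posts(thread, minimum=3):
    print(post.author, post.score())
```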
November 28th, 2007 — Coding
Here’s the summary of PEP 8, a badly formatted guide to Python style which also imparts random programming advice.
Capitalization
_private, keyword_, __mangled_private, module, _extension, ClassName, _PrivateClassName, ExceptionError, function_name, global_variables, instance_methods, _private_instance_methods, instance_variables, _private_instance_variables
* Follow the existing module convention over these rules
* _private Leading underscore to make a weak internal or private indication.
* keyword_ Trailing underscore to avoid conflict with keyword.
* __mangled_private Double leading underscore invokes name mangling privacy
* module Module and package names are all lower case and short; module names may use underscores.
* _extension C and C++ extension modules have a leading underscore.
* ClassName Class names use CapWords.
* _PrivateClassName Class names can have an underscore to indicate internal use.
* ExceptionError Exceptions are CapWords classes and end with Error if they are errors.
* function_name Function names are lowercase with underscore separators.
* global_variables Globals look like functions with lowercase and underscores.
* instance_methods Instance methods also look like function names.
* _private_instance_methods Private instance methods may use an underscore.
* instance_variables Everything looks like a function name
* _private_instance_variables Everything private looks like a private function name
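To see the names in the list above side by side, here is a small made-up module. Every identifier in it (shape tools, Polygon, ShapeError, and so on) is invented for illustration; only the naming patterns come from PEP 8.

```python
"""Example module; its file would be named something like shape_tools.py."""

MAX_SIDES = 12          # constants are upper case (PEP 8 rule not listed above)
default_units = "cm"    # global_variables


class ShapeError(Exception):
    """ExceptionError: a CapWords class whose name ends in Error."""


class _ShapeCache:
    """_PrivateClassName: the leading underscore marks internal use."""


class Polygon:
    """ClassName: CapWords."""

    def __init__(self, sides, class_="regular"):  # keyword_ avoids 'class'
        self.sides = sides          # instance_variables
        self._class = class_        # _private_instance_variables
        self.__secret = 0           # __mangled_private -> _Polygon__secret

    def side_count(self):           # instance_methods
        return self._checked(self.sides)

    def _checked(self, count):      # _private_instance_methods
        if not 3 <= count <= MAX_SIDES:
            raise ShapeError("bad side count: %r" % count)
        return count


def describe(polygon):              # function_name
    return "%d-sided polygon" % polygon.side_count()


print(describe(Polygon(4)))
```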
Style Advice
* Indent 4 spaces per indentation level.
* Limit lines to 79 characters.
* Separate top-level functions and classes with two blank lines.
* Separate methods within a class with one blank line.
* Imports are one per line.
* Avoid spaces immediately inside parentheses, brackets or braces.
* Avoid spaces immediately before a comma, semicolon, or colon.
* Comments should be complete sentences in English.
* Write good docstrings.
* Version bookkeeping goes after the module’s docstring.
* self is the first argument to all instance methods.
* cls is the first argument to all class methods.
Sorry, not an Idea For Free, but I got sick of trying to read through PEP 8 again to check a style question.
November 8th, 2007 — Coding, Idea, Ideas, Products, Web

Problem: Good or Cheap Internet Hosting Companies
Solution: Good and Cheap Internet Hosting Companies
The cost of casual web hosting has dropped through the floor and one can expect it to remain there. If you need to host a dozen domains, downloading about 500 gigabytes of bandwidth each month, storing some hundreds of gigabytes of graphics, handling email accounts, and programming with a full Unix shell including tools like Subversion, Python, and MySQL, then that will set you back about $8 per month. DreamHost’s feature list would have been unheard of five or ten years ago. With minor variations, everyone else offers the same. The catch? These servers are not 100% reliable; more like 98% reliable. I noticed DreamHost had about two or three days of outages or severely degraded performance last year.
Hold it, you think, we handle nearly 100% reliability with unreliable hard drives, unreliable communications channels, and even unreliable employees. Why can’t we handle nearly 100% reliability with unreliable web hosting? What can we do?
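The arithmetic behind that hope is short. The sketch below assumes providers fail independently of one another, which is a strong assumption (shared upstream networks or a common attack would break it), and uses the 98% figure from above. The function name is made up.

```python
def combined_availability(availabilities):
    """Chance that at least one provider is up, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


HOURS_PER_YEAR = 365 * 24

for count in (1, 2, 3):
    up = combined_availability([0.98] * count)
    print(count, "provider(s):", round(up * 100, 4), "% up,",
          round((1 - up) * HOURS_PER_YEAR, 2), "hours down per year")
```

Two independent 98% providers are both down only 0.02 × 0.02 = 0.04% of the time, so the pair is up 99.96% of the time.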
Redundant Arrays of Independent Service Providers
Tactic 0 — Sharing the Static Load
Web hosting companies slow down or run out of bandwidth quota far more often than they actually crash. Also, graphic files and other static data account for the bulk of the data served. One common tactic is to serve the static files from several different domains. One can use an Apache plug-in to rewrite outgoing HTML, a periodic (cron) program to occasionally rewrite your images directory location, a client-side JavaScript routine to display from whichever server provided the picture fastest, mirroring downloads onto a new server, or several other methods. The effect is that loading a web page may load data from more than one domain.
Only the smaller main page needs to be served from the primary ISP when it is slow. All other data comes from faster, inexpensive ISPs. Similarly, a graphics intensive site, such as Girl Genius, could use this method to serve many terabytes of data for about a hundred dollars per month.
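One way the HTML rewriting could work is sketched below. The mirror host names are placeholders, the regular expression only handles double-quoted, root-relative img tags, and the functions pick_mirror() and rewrite_static() are invented here; an Apache plug-in or cron job would wrap something like this.

```python
import re
import zlib

# Hypothetical mirror hosts; substitute whatever inexpensive accounts you hold.
MIRRORS = ["static1.example.com", "static2.example.net", "static3.example.org"]

# Matches <img ... src="/path"> but not absolute or protocol-relative URLs.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(/(?!/)[^"]+)(")', re.IGNORECASE)


def pick_mirror(path):
    """Map a path to one mirror, always the same one so browser caches stay useful."""
    return MIRRORS[zlib.crc32(path.encode("utf-8")) % len(MIRRORS)]


def rewrite_static(html):
    """Point every root-relative <img src="/..."> at a mirror host."""
    def substitute(found):
        path = found.group(2)
        return found.group(1) + "http://" + pick_mirror(path) + path + found.group(3)
    return _IMG_SRC.sub(substitute, html)


page = '<p><img alt="x" src="/images/a.png"> <img src="http://other/b.png"></p>'
print(rewrite_static(page))
```

Hashing the path, rather than choosing a mirror at random, keeps each image on a stable URL across page loads.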
Tactic 1 — Dumb DNS Failover or Load Splitting
Another tactic that commonly works is to fail over to another site. Using various tricks with DNS or BGP, traffic to a downed site can route to a backup, or traffic from different geographies can route to different servers. The site is mirrored, but the URL remains the same.
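The decision half of a dumb failover can be sketched as a monitor that probes hosts in priority order. Actually repointing DNS depends on the DNS provider's own interface and is not shown; tcp_probe() and choose_target() are illustrative names, and a bare TCP connect is only a crude health check.

```python
import socket


def tcp_probe(host, port=80, timeout=2.0):
    """Report whether a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_target(hosts, probe=tcp_probe):
    """Return the first host that answers, in priority order, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None


# Demonstration with a fake probe so no network access is needed:
# pretend the primary is down and the first backup is healthy.
fake_status = {"primary.example.com": False, "backup.example.net": True}
print(choose_target(list(fake_status), probe=fake_status.get))
```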
Tactic 2 — Failover with Transactions Intact. Distributed ACID Transactions.
The hard part of failing over is making sure the important state is preserved when a site goes down. There are ways to do this, with varying degrees of bandwidth efficiency and latency. Each service would need to notify other servers in some scheme in order to keep databases intact. Lots of trade-offs present themselves. It is doable, but requires a fair amount of custom work.
Summary
Getting the high availability out of these redundant ISPs is just work. More complex tactics require more engineering. The business model, as always, is to mask all of the complexity of these tactics so that the customer’s problem just “goes away”.
One could provide a service that set up all the failover options, signed up for the ISPs, and wrote the scripts to maintain transaction state, all for a flat fee. Alternatively, one could simply sell oneself as a “high availability, high transaction volume” service provider while actually running the service off a number of inexpensive ISPs.
There is money sitting there, people. Go pick it up!
October 4th, 2007 — Coding, EngEdu, Reviews, SelfEd, Write-up
Google Test Automation Conference Lightning Talks
Good and excellent short talks from the 2006 conference covering many testing topics. Slides are available.
01:20 Dan North from Thoughtworks on Getting Lean
- Automated Testing has the secondary effect of enabling two of the Lean Principles: Lean Product Design (maximize discovery by faster releases) and Lean Manufacturing (low variance repeatable testing).
- Sort of supply chain management because design is just before coding is just before release.
- Automated Testing allows for frequent shipping of high quality product.
06:00 Steve Freeman JMock I Done By Five Minutes
- JMock is a library for using mock objects in the test-driven development of Java code.
- JMock’s types allow the IDE to expose just the right spots. Good error reporting.
10:50 Nat Pryce, The JMock II update
- Update is Java 5, JMock library is framework independent
- Cleaner testing code, better IDE autocomplete
- New refactoring support
15:01 Christine Newman from Progressive Insurance, Getting More Funding for Automated Testing
- Quantify the value of automated testing with numbers like “Production Problems Prevented; Risks Mitigated (by severity); or Hours Saved”
18:20 Andrin von Rechenberg, Google Intern, Improving WURFL data
- A database of mobile phone capabilities is hard to test across 8500 user agents.
- Check your logs to see how they navigate, e.g., to find HTTPS capability
23:30 Ade Oshineye from ThoughtWorks, Five Heresies in Five Minutes
- Convert your logs of real people into test cases
- Mutation Testing works (change code, if test passes then test is bad) for high risk, about to be refactored, or bug-prone sections.
- Run actual tests on production hardware with production systems for new release or when making a dummy is hard.
- Test your vendor’s code lest their upgrades break you or you are afraid to upgrade.
- Static analysis works now. Use FindBugs or PyChecker or IntelliJ’s stack analysis server to find bugs people won’t find or broken language idioms.
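The mutation testing heresy can be tried in a few lines with Python's standard ast module. In this sketch the function under test, the single mutation (turning >= into >), and both test suites are invented for illustration; real mutation tools apply many kinds of change.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipComparison(ast.NodeTransformer):
    """Mutate the program by turning every >= into >."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op
                    for op in node.ops]
        return node


def load(tree):
    """Compile a syntax tree and return the is_adult function it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    """Checks only values far from the boundary."""
    return fn(30) and not fn(5)


def strong_suite(fn):
    """Also checks the boundary value itself."""
    return weak_suite(fn) and fn(18)


original = load(ast.parse(SOURCE))
mutated_tree = FlipComparison().visit(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(mutated_tree))

print("weak suite passes on mutant:", weak_suite(mutant))      # bad test
print("strong suite passes on mutant:", strong_suite(mutant))  # good test
```

A suite that still passes on the mutant has not really tested the changed line.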
29:00 Timur Hairullin for Yandex (Russian Search Engine)
- Had unexplained performance problems on an update.
- Measure performance from time between first byte of request to something happening inside browser of older computers; not just to server sending last byte of response.
32:00 James Richardson on Automated Testing: Why Bother?
- How do I quantify Automated Testing? I missed Christine Newman’s lecture.
34:00 James Lyndsay for Workroom Productions, Automated Tricks for Manual Testers
- Some tests must be manual and observed by good testers; they look for surprises and emergent behaviors.
- Use snippets of code (your manual testers should code and use Unix)
- Use virtual machine images to hand over bugs.
38:26 Jordon Dea-Mattson, some radiation treatment machine firm
- Working on client/server in life-critical, highly regulated systems.
- Most testing strategies have single points of failure; try thinking of testing defense in depth.
- Collect more data in the (wild) field rather than just crashes.
- Think of software ERP, integrating the whole requirement to delivery chain.
- Strong recommendation for TestQuest; UI testing tool.
42:10 Curtis “Ovid” Poe with the Perl Foundation, TAP
- TAP (Test Anything Protocol) is a human and machine readable protocol for test output.
- TAP is simple, line-oriented, and implemented in C/Perl/Python/Ruby/JavaScript/PostgreSQL
- Driven by the PerlQA team, and a new language-agnostic list is forming.
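To show how simple the TAP format is, here is a tiny hand-rolled producer. It emits only the core of the protocol (a plan line followed by "ok" or "not ok" lines); run_tap() and the two sample checks are made up for this sketch, and real projects would normally use an existing TAP library.

```python
def run_tap(checks):
    """Run (description, callable) pairs and return TAP lines for the results."""
    lines = ["1..%d" % len(checks)]  # the plan: how many tests to expect
    for number, (description, check) in enumerate(checks, start=1):
        try:
            passed = bool(check())
        except Exception:
            passed = False  # an exception counts as a failed test
        status = "ok" if passed else "not ok"
        lines.append("%s %d - %s" % (status, number, description))
    return lines


checks = [
    ("addition works", lambda: 1 + 1 == 2),
    ("division by zero is caught", lambda: 1 / 0),
]
print("\n".join(run_tap(checks)))
# 1..2
# ok 1 - addition works
# not ok 2 - division by zero is caught
```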
Overall, a great video.
September 14th, 2007 — Coding, Event, Write-up
Mike Pittaro, a founder at SnapLogic, spoke on “An Open Source Data Integration Project using Python” at BayPiggies. Slides are available as a PDF. Mike spoke primarily about the issues building the team, infrastructure, and installer for a large Python project.
The SnapLogic product is a data munging and translation application aimed at developers. While the source code is under a GPL license, professional support and service are available for a fee. It has taken about one year to develop. Hiring Python developers was difficult, and eventually good programmers were hired that then learned Python. They are still hiring.
The development infrastructure came together quickly using primarily open source tools written in Python: mailman, subversion, trac, and moinmoin. The team uses both Linux and Windows (using Samba partitions) and both Eclipse/Pydev and vi/emacs. The build process uses buildbot to kick off builds, epydoc and epytext for documentation, figleaf for code coverage, unittest for testing, and (closed source) bitrock for installation. This led to an 11.5 Joel Score.
They learned some rules about coding standards, keeping control of import statements, and wrapping really complex libraries with getattr() tricks. Mostly, they fought with the problems of installing an open source project that requires Python, C, databases, lots of interactions, and also needs to deploy on Windows. They are still fighting with parallelization.
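The talk summary does not say which getattr() tricks were used, so the following is only a guess at the general shape: a thin wrapper that delegates attribute lookups to a big library, imports it lazily, and exposes only an approved subset. LazyLibrary and its allowed parameter are hypothetical, and json merely stands in for a complex library.

```python
import importlib


class LazyLibrary:
    """Stand in for a module and import it only when an attribute is first used."""

    def __init__(self, module_name, allowed=None):
        self._module_name = module_name
        self._allowed = allowed   # optional set of names the team may use
        self._module = None

    def __getattr__(self, name):
        # Called only when normal lookup fails, so the fields set in
        # __init__ never reach this method.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._allowed is not None and name not in self._allowed:
            raise AttributeError("%s.%s is not part of the approved surface"
                                 % (self._module_name, name))
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


json_lib = LazyLibrary("json", allowed={"dumps", "loads"})
print(json_lib.loads(json_lib.dumps({"a": 1})))
```

Keeping the wrapper in one module also gives a single place to control what the rest of the code imports.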
Mike suggested several sources of information. First, keep reading the library reference over and over. He also recommended Python Is Not Java and Code Like a Pythonista.
Thanks again to Mike for a great talk!
July 27th, 2007 — Coding, Idea, Ideas
Problem: The world of programming is increasingly multi-platform, multi-paradigm, multi-this, and multi-that. The environments do not reflect this reality.
Solution: I am still appreciating the scope of the problem.
So last night, I went to the Home Brew Robotics Club to listen to three presentations:
* Some CMU students presenting their Master’s project of getting a CMUCam working by putting a Lua interpreter on it, building a custom Eclipse plug-in, and getting rudimentary interactivity over a serial line. The CMUCam has many nifty features and programming characteristics, and some horrid serial line command protocol. This was a rough attempt at providing reasonable access to the nifty features.
* A hobbyist got up to show the tricks he had done with the Propeller chip. He played with it for some time until he learned to hack just the right instructions into the middle of included servo control software. He managed to track all the minutiae of the quadrature encoder signals, measured and desired motor speed, voltages, corrections, etc. The magic four-line code hack required a deft understanding of the architecture, the built-in SPIN high-level language, and the on-chip assembly code. This then integrated with free oscilloscope simulation software.
* A third hobbyist showed off the Propeller chip, with its various form factors, interface options, cool architecture points, and the like. Coding could be done on the free-of-cost, proprietary, Windows-only IDE and compiler. A little hacking of various types let various interfaces work.
All of these showed how hard it is to make a good, general environment. The chips are all messy constructs with levels of architecture details and layers of languages. Each language has a different IDE, loading system, and required support. The IDEs don’t usually understand the limits of the chip. There are minimal standards. There is no closed loop between the three pillars of mathematical analysis, simulation against a model, and working with a physical reality. Moving code onto and off the device is always a custom, tricky thing. And, finally, there is no thought as to how to capture progress and testing.
Outside of robotics, one is still making systems to sweet-talk graphics cards, special authentication hardware, bridge languages, operating systems, and cooperating program stacks. It’s a mess. It’s a puzzle.