Sunday, March 01, 2009

Building a National Pet Identification System in Java

This week I dealt with a rather irritating little problem that wound up having a surprising source. Rather than try to explain it to my wife in technical terms, I went for the more prosaic form, but for those of you that are familiar with Java, it's all about this.



Congratulations! President Obama created a new Cabinet post, and you became the first Secretary of Pets! Your first task: managing and categorizing every pet in the United States: dogs, cats, budgies, snakes, it's your domain. You decide to identify every pet by giving them a unique nationally-assigned number. References to that number are the same as references to the pet and references to the pet are the same as references to the pet's identifying number. You create a huge central station just outside Lebanon, Kansas and process requests as they come in. Someone requests an identifying number for a Maltese cat in Michigan, they get 104558628. The next request is for a wire-haired terrier in Pasadena, CA and they get the next one: 104558629.

You've also been tasked with creating an abstract broad classification system. It's an odd classification requirement: you don't need to classify by species, number of paws, color, gender, geography or fangs, just something that lets you batch them up. So you create a classification based on the identifying number already given to the animals by declaring an animal's classification as the last two digits of the identifying number. So if your wire-haired terrier's identifying number is 104558629, its classification is 29.

So far so good, your central station is humming away, doling out identification numbers. One day your sister's cat goes missing, and she goes to the Lost Pet Center. She doesn't describe the cat, she just gives her cat's identification number 104558628, and a Lost Animal Ticket is stored in the National Lost Pet Database.

Remember when I said earlier that referencing a pet is virtually the same as referencing a pet by its identification number? Thanks to this you were able to build a Pet Identification Station in every town: put the animal up to a station scanner and, presto! It displays the animal's identification number.

Also in every town is a Lost Pet station. It's awesome: if you find an animal you don't even have to bring it to the station, you just supply its identification number. When you do, cogs turn, phones dial and the Lost Pet station connects to the National Lost Pet Database, and asks it: "Is this identification number in your Database?" If the response is no then it's not a lost pet, and if the answer is yes the cogs turn and phones dial. The Correct Authorities are notified and before you know it, your sister's cat is safely returned. In fact, she just told me so about an hour ago, and her cat's making all sorts of happy purrs. Do you like happy endings? I love 'em.

I bet you're a little sad that there's a National Lost Pet Database. Don't be sad - it's not a terribly large list, and it's certainly smaller than before you ever built your pet identification system.

But it's large enough that everyone wants it to respond as quick as possible. Nobody wants the National Lost Pet Database to be the source of delaying a pet returning to its home, particularly not the politicians, at least, not the savvy ones. But right now, how it works is like this: every entry in the database is a Lost Animal Ticket. The Lost Animal Ticket contains the identifying number of the pet as well as a piece of information about the owner to facilitate fast contact. When trying to decide if any pet is missing given its identification number, the system creates a prototype Lost Animal Ticket, and attempt to match the prototype ticket against every entry in the lost pet database. If the identifying number of the first entry matches the identifying number on the prototype ticket, it's considered a match. If not, it moves to the second Lost Animal Ticket entry in the database, and performs the same test. This repeats until a match is found or until the entire database is exhausted. It's a bit like trying to find a playing card in a deck by taking a prototype card from yet another deck and attempting to match it against the first deck through visual inspection.

So to make it faster, you come up with a Great Idea: split the database into several miniature databases. In fact, thanks to the classification system you designed, you can just split the database into a hundred mini databases, and in each mini database you only put Lost Animal Tickets for pets with the same classification number. So now you have one mini database for all Lost Animal Tickets for pets classified as '00', one for all pets classified as '01', one for all pets classified as '02', and so on up to '99'. This really works well because every mini database is about as large as every other, and even better, instead of search through every Lost Animal Ticket, you just search one of the mini databases in 1% of the time. Instead of searching through all the possible missing pets, you only have to search through the mini database for pets classified as '28', which as we know are the last two digits of your sister's cat's identifying number. In some ways this is like trying to quickly search a deck of cards by separating them based on their suit first.




All of this works and is predicated on three important things which may seem obvious: first, everything that tries to classify a pet must consistently perform the same step to get the classification number, which is to create a Lost Animal Ticket, read the identification number from it and use its last two digits; second, that whenever attempting to match one Lost Animal Ticket against another, this must be performed by matching the identification number on each of the two tickets; and third, those first two things agree that the identification number is what matters.

And here's the thing: what happens if any of those assumptions are wrong? For instance: one day you decide to categorize pets by the first two digits of the animal's identification number instead of the last. While your sister's cat's Lost Animal Ticket is sitting in mini database '29', everyone may look in mini database '10'. All of a sudden nobody can find your cat, or it's registered in two separate mini databases.

Here's something else that can go wrong: instead of comparing Lost Animal Tickets using the identification number of the animal, the system uses the identification number of the pet's owner. If your sister lost three cats and one is found, it's quite possible for the wrong ticketto be matched against the prototype ticket. Also consider instead a possibility where the criteria for deciding if tickets matched changed to being the same physical piece of paper. Doesn't make sense? If you and I both have a four of clubs, we could possibly agree they were the same card. (That's basically how the National Lost Pet Database works.) But if instead of playing cards, we had airline boarding passes, it would be safe to assume that if you were standing on an airplane with our own boarding passes, if they referred to the same ticket, someone may call The Correct Authorities, and if we applied that algorithm to the National Lost Pet Database, then pets would practically never be found.

Finally, your system will fail from a poor combination of the first two requirements: if your system's categorization calculation is inconsistent with its test for equality, the system may lose tickets. For example, while the first example of possible failure still used the pet's identifying number, what if classification was something more arbitrary, like the time the ticket was issued? The prototype ticket will have been issued at a different time from the lost pet ticket, and as such will likely be classified in different mini databases, and never be found, or only found due to something repeating the same error (Like the time nobody at the bank could locate my account until someone mistyped my name the same way as the teller who registered it on my behalf. That's another story.)

Luckily for your sister, you didn't make these mistakes. Your National Pet Identification System is a huge success, and President Obama has now tapped you to address the problems with our banking system, just like Moist Von Lipwig.

Keep in mind: if anyone plans to write a National Pet Identification System in Java, read Items 8 and 9 of this book. You may save someone both several hours of work and their wire-haired terrier.

2 comments:

Anonymous said...

bOy, i couldn't get past the fifth paragraph, is that bad?

[sigh]

konberg said...

Well it's not good that it's hard to read.