
Monday, November 09, 2009

Final Thoughts: Java Puzzler: Splitting Hairs

This is the last in a series of posts about a puzzler [ post containing the question ] [ post containing the answer ]. In those two posts I highlighted some surprising behavior in String.split().

Why the surprising behavior?

Well, and here's why my job is damn cool: after discovering this issue, I dropped a note to Josh Bloch, who quickly replied: (edited summary)
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl. The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
I have no real issues with the way Madbot implemented regular expressions in Java, nor with the goal of Perl compatibility. Perl's regular expression language was very popular, and derivatives of it were implemented not only in Java, but also in JavaScript, PCRE, Python, Ruby, Microsoft's .NET Framework, and the W3C's XML Schema.[ref]

But I do take issue with String getting saddled with a method that quietly explodes the API's complexity.

So why does Perl work this way? I don't know, dude. This isn't a Perl Puzzler.

However, I tried to recreate the original puzzler in Perl as a way to validate its behavior. Unfortunately, I couldn't reproduce it using perl v5.8.8 on my OS X machine. This script:

@first = split(/:/, "");
@second = split(/:/, ":");

print scalar @first . " [@first]\n";
print scalar @second . " [@second]\n";

prints:

0 []
0 []

I'm not claiming there's a bug or a behavior change in either the Java or the Perl implementation, but I sure am curious.
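For comparison, a Java counterpart of that script might look like the sketch below. The class name SplitLengths and the helper describe are mine, made up for illustration. The counts in the comments follow from the documented rules for String.split(regex): trailing empty strings are dropped, but an input with no match at all comes back as a single element.

```java
import java.util.Arrays;

// Hypothetical demo class; only String.split itself comes from the JDK.
class SplitLengths {
  // Formats the result like the Perl script does: "<count> [<elements>]".
  static String describe(String input) {
    String[] parts = input.split(":");
    return parts.length + " " + Arrays.toString(parts);
  }

  public static void main(String[] args) {
    System.out.println(describe(""));   // 1 []  (one element: the empty string)
    System.out.println(describe(":"));  // 0 []  (both empty fields are trailing, so both are dropped)
  }
}
```

So Perl reports zero elements for both inputs, while Java reports one for the empty string.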

Possible Solutions

1. Hacking String.split

I can't vouch for this, but it seems that you can ensure consistent behavior by appending a copy of the delimiter. So, if you plan to split a string by its colons, you can do:

String[] result = (string + ":").split(":");

I'm sure you can find all sorts of issues with this example. Go for it. Point them out in the comments.

Besides, that's not much of a solution.
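To see what the trick actually buys you, here is a small sketch that runs it over a handful of edge cases. AppendDelimiter and splitWithSentinel are hypothetical names of mine; the only real API involved is String.split.

```java
import java.util.Arrays;

// Hypothetical demo class for the append-the-delimiter trick.
class AppendDelimiter {
  static String[] splitWithSentinel(String s) {
    // The appended ":" guarantees at least one match, so the
    // "no match returns the whole input" special case never triggers.
    return (s + ":").split(":");
  }

  public static void main(String[] args) {
    for (String s : new String[] {"", ":", "a", "a:", ":a", "a::b"}) {
      System.out.printf("\"%s\" -> %s%n", s, Arrays.toString(splitWithSentinel(s)));
    }
  }
}
```

Both "" and ":" now come back with zero elements, which is at least consistent. Trailing empty fields are still dropped ("a:" gives one element) while leading ones are kept (":a" gives two), so the asymmetry hasn't gone away.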

2. Get a String tokenizer

A second solution is to use StringTokenizer, which I had completely forgotten about until Wim Jongman mentioned it in a comment on the solution post.

Of course, even it has its own specific behavior.

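import java.util.*;

public class Main {
  public static void main(String[] args) {
    for (String s : new String[] {"", ":", "a:", ":a", "a:a", "::"}) {
      tokenize(s);
    }
  }

  static void tokenize(String s) {
    StringTokenizer t = new StringTokenizer(s, ":");
    List<String> l = new ArrayList<String>();
    while (t.hasMoreTokens()) {
      l.add(t.nextToken());  // empty tokens are never returned
    }
    System.out.printf("Tokenization of %s is %s\n", s, l);
  }
}

This prints: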

Tokenization of  is []
Tokenization of : is []
Tokenization of a: is [a]
Tokenization of :a is [a]
Tokenization of a:a is [a, a]
Tokenization of :: is []

3. Get your serving of Guava

Here's a nice one: Project Guava. Project Guava is a soon-to-be open-sourced library of some of Google's core Java code. I've worked with these libraries for five years and can attest that they're wonderful to use. The only problem: it's not out yet. Kevin Bourrillion tells me, though, that an initial release will be available before Thanksgiving.

Note: You may already be aware of the open source project for Google's collections library. When Guava is released, the Google Collections library will go away.

The Guava libraries have a class called Splitter. Splitter's purpose is to alleviate some of the confusion that comes with String.split.
By default, Splitter's behavior is very simple:

Splitter.on(',').split("foo,,bar,  quux")

This returns an iterable containing ["foo", "", "bar", "  quux"]. Notice that the splitter does not assume that you want empty strings removed, or that you wish to trim whitespace. If you want features like these, simply ask for them:

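private static final Splitter MY_SPLITTER = Splitter.on(',')
    .trimResults().omitEmptyStrings();

Now MY_SPLITTER.split("foo,,bar,  quux") returns just ["foo", "bar", "quux"].

You can read more about Guava in Kevin Bourrillion's presentation slide deck from September of this year. Splitter is covered in slides 13 to 17.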
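Until Guava ships, the same idea can be roughed out with the plain JDK: split on a single character, keep every field by default, and make trimming and dropping empties explicit choices. The sketch below is only that, a sketch. SimpleSplitter and its parameters are hypothetical names of mine, not Guava's API, and the real Splitter does a good deal more.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT Guava's Splitter: "keep everything unless asked otherwise".
class SimpleSplitter {
  static List<String> split(String input, char delimiter, boolean trim, boolean omitEmpty) {
    List<String> fields = new ArrayList<String>();
    int start = 0;
    while (true) {
      int end = input.indexOf(delimiter, start);
      // No more delimiters: the rest of the input is the last field.
      String field = (end < 0) ? input.substring(start) : input.substring(start, end);
      if (trim) {
        field = field.trim();
      }
      if (!(omitEmpty && field.isEmpty())) {
        fields.add(field);
      }
      if (end < 0) {
        return fields;
      }
      start = end + 1;
    }
  }

  public static void main(String[] args) {
    // Four fields: empties and leading whitespace are kept.
    System.out.println(split("foo,,bar,  quux", ',', false, false));
    System.out.println(split("foo,,bar,  quux", ',', true, true));  // [foo, bar, quux]
    System.out.println(split(":", ':', false, false).size());       // 2
    System.out.println(split("", ':', false, false).size());        // 1
  }
}
```

With nothing hidden, ":" is two empty fields and "" is one. If you want the classpath behavior, you pass omitEmpty and get zero for both.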

Avoiding the real issue

There are two ways of looking back on the variety of votes: either people assumed that String.split had confusing behavior, or they just expected it to work as they would hope. Some might want a parse of ":" to return two elements. Some might want it to return one. Or zero. Even something as seemingly simple as string tokenization can behave in ways that don't meet your expectations. I'd like to say that Guava's Splitter will do the trick for everyone (as it does for my case of parsing a classpath), but you need to evaluate it for yourself.
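A cheap way to find out which camp your code is in is to write the edge cases down and print what you actually get. The sketch below (EdgeCases and count are names I made up) compares the default split with the two-argument form and a negative limit, which per the String.split Javadoc keeps trailing empty strings.

```java
// Hypothetical demo class: counts the fields String.split returns for edge-case inputs.
class EdgeCases {
  // A limit of 0 is what split(regex) uses; a negative limit keeps trailing empty strings.
  static int count(String input, int limit) {
    return input.split(":", limit).length;
  }

  public static void main(String[] args) {
    for (String s : new String[] {"", ":", "a:", ":a", "::"}) {
      System.out.printf("\"%s\": split(\":\") -> %d, split(\":\", -1) -> %d%n",
          s, count(s, 0), count(s, -1));
    }
  }
}
```

For ":" that is 0 fields with the default limit and 2 with -1, while "" yields 1 either way.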

This has been a rather long way of saying: test your edge cases. Thanks for reading.


Dan said...

In Python:

>>> ''.split(':')
['']
>>> ':'.split(':')
['', '']

Looks like it does the right thing.

Foodberg said...

@Dan, thanks for the comment. See, it depends on the operation.

If it's for parsing a Unix path, I would want ''.split(':') to return []. For parsing delimited data with a fixed (or at least non-zero) number of fields, I'd want Python's behavior. I think users have to come around to realizing that xxx.split(":") might not do what they want, though it may be totally valid for another use case.

Piotr Maj said...

":".split(":", -1) does the trick (same as in perl).

Unknown said...

I would almost agree that Google's Splitter class is good enough for everyone. But, there is no mechanism that I could see for returning the delimiters, as does StringTokenizer. If it does, I missed it.

Foodberg said...

Hey, Mike. It might not. Good catch.
