Monday, November 09, 2009

Final Thoughts: Java Puzzler: Splitting Hairs

This is the final in a series of posts about a puzzler [ post containing the question ] [ post containing the answer ]. In those two posts I highlighted some surprising behavior in String.split().

Why the surprising behavior?

Well, and here's why my job is damn cool: after discovering this issue, I dropped a note to Josh Bloch, who quickly replied: (edited summary)

Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl. The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
I have no real issues with the way Madbot implemented regular expressions in Java, nor with the goal of Perl compatibility. Perl's regular expression language was very popular, and derivatives of it were implemented not only in Java, but , JavaScript, PCRE, Python, Ruby, Microsoft's .NET Framework, and the W3C's XML Schema.[ref]

But I do have issue with String getting saddled with a method that quietly explodes the API's complexity.

So why does Perl work this way? I don't know, dude. This isn't a Perl Puzzler.

However, I tried to recreate the original puzzler in Perl as a way to validate the original puzzler's behavior. Unfortunately, I failed using perl v5.8.8 on my OSX machine. This script:

@first = split(/:/, "");
@second = split(/:/, ":");

print scalar @first . " [@first]\n";
print scalar @second . " [@second]\n";

Yielded

0 []
0 []

I'm not claiming there's either a bug or implementation change in either the Java or Perl implementations, but I sure am curious.

Possible Solutions

1. Hacking String.split

I don't attest to this, but it seems that you can ensure consistent behavior by appending a copy of the delimiter. So, if you plan to split a string by its colons you can do:

String[] result = (string + ":").split(":")

I'm sure you can find all sorts of issues with this example. Go for it. Point them out in the comments.

Besides, that's not much of a solution.

2. Get a String tokenizer

A second solution is to use StringTokenizer, which I completely forgot about Wim Jongman made a comment in the solution post.

Of course, even it has its own specific behavior.

public class Main {
  public static void main(String[] args) {
    tokenize("");
    tokenize(":");
    tokenize("a:");
    tokenize(":a");
    tokenize("a:a");
    tokenize("::");
  }

  static void tokenize(String s) {
    StringTokenizer t = new StringTokenizer(s, ":");
    List l = new ArrayList();
    while (t.hasMoreTokens()) {
      l.add(t.nextToken());
    }
    System.out.printf("Tokenization of %s is %s\n", s, l);
  }
}

yields

Tokenization of  is []
Tokenization of : is []
Tokenization of a: is [a]
Tokenization of :a is [a]
Tokenization of a:a is [a, a]
Tokenization of :: is []

3. Get your serving of Guava.


Here's a nice one: Project Guava. Project Guava is a soon-to-be open sourced library some of Google's core Java code. I've worked with these libraries for five years and I attest that they're wonderful to use. The only problem: it's not out yet. Kevin Bourrillion tells me. though, that an initial release will be available before Thanksgiving.

Note: You may already be aware of the open source project for Google's collections library. When Guava is released, the Google Collections library will go away.

The Guava libraries have a class called Splitter. Splitter's purpose is to alleviate some the confusion that comes with String.split.
By default Splitter's behavior is very simplistic:

Splitter.on(',').split("foo,,bar,  quux")

This returns an iterable containing ["foo", "", "bar", " quux"]. Notice that the splitter does not assume that you want empty strings removed, or that you wish to trim whitespace. If you want features like these, simply ask for them:

private static final Splitter MY_SPLITTER = Splitter.on(',')
       .trimResults()
       .omitEmptyStrings();
You can read more about Guava in Kevin Bourrillion's presentation slide deck from September of this year. Splitter is covered in slides 13 to 17.

Avoiding the real issue

There are two ways of looking back on the variety of votes: either people were assuming that String.split had confusing behavior, or they just expected it to work as as they would hope. Some might want a parse of ":" to return two elements. Some might want it to return one. Or zero. Something as seemingly simple as string tokenization has behavior that just might not meet your expectations. I'd like to say that Guava's Splitter will do the trick for everyone (as it does for my case of parsing a classpath) but you need to evaluate it for yourself.

This has been a rather long way of saying: test your edge cases. Thanks for reading.

Sunday, November 08, 2009

Answer to: Java Puzzler: Splitting Hairs

Update: I fixed some small errors, and also updated the charts one last time.

This blog post contains the results and answer to the previous post, Splitting Hairs.

Given the nature of the possible answers, this is actually two puzzlers in one:

  1. How many elements come back from "".split(":"): Zero or one?
  2. How many elements come back from ":".split(":"): Zero, one or two?
Darn, I could have gotten two separate puzzlers out of this.

My Guess


Here's the important point: this is a real world case that occurred to me just the other day. Specifically, I was writing some UI code to edit and parse a colon-separated classpath, so in fact the line in question looked like this:

String[] classpathEntries = classpathField.getText().split(":");

In this case, if the UI field starts out empty, and classpathField.getText()returns an empty String. Effectively, my expectation was that with an empty classpath, I'd also get an empty array.

So my guess for "".split(":").length() is Zero.

My guess for splitting a separator would result in two empty strings on either side. So that guess is two. Put them together and you get c) 0/2.

Your Guesses

How did you folks do?

Let's look at the numbers:






An overwhelming preference for f) 1/2, and my guess, c) 0/2 in a distant second.

The Answer

If by now you haven't run the sample code, I'll tell you: the correct answer is d) 1/0. If you're shocked that you get more elements with less data, you're not alone. Hey, just look at those graphs.

Okay, let's start the analysis by reading the method signature for String.split():
public String[] split(String regex)
Ah, yes, split takes a regular expression as a delimiter, not a literal string.

Sidebar: I occasionally read about (or experience) puzzlers where someone tries splitting on a pipe character (|), which has a special meaning in regular expressions. To split on a pipe, you must use "\|" so the pattern compiler sees it as a literal character. This is not one of those puzzlers: the colon does not have a special meaning for the pattern compiler.

Part 1: "".split(":")


If you navigate through the source for String.split, you'll discover that "".split(":") is effectively the same as Pattern.compile(":").split("", 0).

Things get interesting when you read the javadoc for Pattern.split(Charsequence, int):
If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form.
Wha? So if the delimiter isn't found in the in the input, the original input is returned. In other words:

System.out.println(Arrays.deepToString("food".split(":")));
System.out.println(Arrays.deepToString("foo".split(":")));
System.out.println(Arrays.deepToString("fo".split(":")));
System.out.println(Arrays.deepToString("f".split(":")));
System.out.println(Arrays.deepToString("".split(":")));

yields

[food]
[foo]
[fo]
[f]
[]

OK, that explains the first one, what about the second one?

Part 2: ":".split(":")


To understand what's going on, let's look again at the javadoc for Pattern.split(). In this case, n by the nature of being called by String.split(String), is 0.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded. [ emphasis added. ]
This can be corroborated with this piece of code near the end of the method:

// Construct result
int resultSize = matchList.size();
if (limit == 0)
    while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
        resultSize--;

In other words, internally, it splits the input string ":" into [ "", "" ], and then removes the elements before returning to the caller, but "".split(":") doesn't get this treatment because the delimiter was never found in the input.

Does your head hurt? Mine sure hurts, and I've done the research. But here, if you're still hungry, or hate yourself, take a look at this little gem, almost worthy of its own puzzler.

System.out.println(Arrays.deepToString((String[]) ":".split(":", 2)));
System.out.println(Arrays.deepToString((String[]) ":".split(":", 1)));
System.out.println(Arrays.deepToString((String[]) ":".split(":", 0)));

Coming Soon

Like the last puzzler, this will be followed with an analysis of why this occurs, and some nice alternatives.

Friday, November 06, 2009

Java Puzzler: Splitting Hairs

Try to answer this question without running the code, or reading the class documentation.

Assume we're using Java 1.6, though it likely doesn't matter if you're using Java 1.5.

What does the following snippet print?

String[] nothing = "".split(":");
String[] bunchOfNothing = ":".split(":");

System.out.printf("%d/%d\n", nothing.length, bunchOfNothing.length);

I will post an answer on Sunday.

This puzzler comes with six possible answers. I would have liked to have given four choices, but then I realized this was just for fun.

Thursday, October 22, 2009

Final Thoughts On: A Symbolic Puzzler

This is the final in a series of posts about a puzzler [ post containing the question ] [ post containing the answer ]. In those two posts I highlighted some bizarre Java weirdness as it pertained to the java.io.File.getCanonicalPath method, that canonicalized paths are cached, which means if a symbolic link changes somewhere the cached value becomes invalid.

Blatherberg

This problem really only ever crops up when your code relies on calls to getCanonicalPath and the links change while the application is running. If you expect your application to run against a filesystem with shifting symbolic links, you have to do one of these three things:
  1. Disable the cache by setting the system property sun.io.useCanonCaches to false. (I would also like to briefly nod to the pedantic point that the string useCanonCaches is icky.)

    This seems like an easy out (particularly if your application is moderately complex, and deployed, and assuming your application doesn't rely on the default behavior,) but there's a reason Java comes with a canonicalization cache: performance. Reading symbolic links from disk can take time, especially if you do it a lot.

    Also, you might be running your application in a Java EE container along with other applications, in which case, you can't isolate the cache behavior to a single application.

  2. Stop using getCanonicalPath (and its sugary sibling, getCanonicalFile) in your application, and rely solely on the non-canonicalized path.

    Changing your infrastructure to rely on the symlink paths themselves and not their canonicalized values sounds good, but you might not have control over that code: your application may rely on an application infrastructure that already relies on getCanonicalPath, and then you're kind of screwed. It's easy to say that the cache should be disabled in the name of correctness, but if you're repeatedly resolving symbolic links files by the thousand, the time cost may be significant.

    This leads to the other way to look at this problem, which is the lack of accessibility and flexible control over the cache. You might want to cache calls in some circumstances yet not others. The use cases for cache control can be complex, and by hiding the complexity you get, well, surprises like this.

  3. Disallow an application's filesystem to redefine symbolic links.

    If you've got that power, go for it.

Help From JSR 203

There's actually some hope for the future, and that's JSR203: More New I/O APIs for the JavaTM Platform ("NIO.2") which is scheduled to be part of Java 7. Look back to the puzzler, which points out the use of Filesystem and UnixFilesystem classes. In JSR203, those ideas are explicit. The equivalent of java.io.File is java.nio.file.Path which exposes a method getFileSystem. That's right, the file system is no longer hidden from the user, and you can read all about java.nio.file.FileSystem here. You can have a file system that represents a thin layer on top of your disk, or one that caches all sorts of metadata from your disk, or, heck, create an in-memory implementation for high-speed storage! But the real benefit is that these filesystem implementations can be injected into your classes: no more need for a single static filesystem. Whereas java.io.File objects are created through a constructor, java.nio.file.Path objects are constructed through the FileSystem's getPath method.

This isn't disk i/o nirvana, unfortunately, because like the continuing transition from java.util.Date to java.util.Calendar to something more reasonable like org.joda.time.DateTime, there's still plenty of legacy code using the old and busted APIs. But it's a good start.

If you want some more information about JSR 203, here's a write-up by Alex Miller and a link to a JavaOne talk from 2008. The video is a bit out of date (for instance it highlights the notion of Path.get, which seems to be gone, thank goodness.) But it's got lots of great information about the JSR.

The Last Word

In the end, I want to highlight something underlying this entire journey: the choice to cache the values by default in the first place is just wrong. It reminds me of the saying (that seems to be attributed to Bill Harlan): "It's easier to optimize correct code than it is to correct optimized code.

Wednesday, October 21, 2009

Answer to: A Symbolic Puzzler

This blog post contains the results and answer to the previous post, A Symbolic Puzzler.

The answer will be covered here, and I'll follow this up with a fourth post that covers my thoughts on this issue.

What were your guesses?



Clearly, the most popular answer was that the test would fail.

What would have been my guess?

Look, nobody codes in puzzler fashion, so without details I'll explain what I expected to occur from my own production code, but in terms of this test. I wouldn't be confident that TESTDIR_SYMLINK.getCanonicalPath() returned the non-canonicalized location, but excepting that, I certainly would assume that once the symbolic link was created at the end, the second call to TESTDIR_SYMLINK.getCanonicalPath()would return the symlinked directory. So my guess would have been a. It passes.

The Answer

The correct answer is: d. It depends. More specifically, it depends on the VM's arguments.

The Explanation

If you ran this test straight up without any special VM arguments, the test would fail (which might lead you to think the answer is b. It fails.)
junit.framework.ComparisonFailure: expected:</testdir/[file]> but was:</testdir/[symlink]>
    at junit.framework.Assert.assertEquals(Assert.java:81)
    at junit.framework.Assert.assertEquals(Assert.java:87)
    at ATest.testSymlink(ATest.java:26)
    ...
Why wouldn't the canonicalization return the updated value? It's because the return value from getCanonicalPath was cached from the previous call. Yes, calls to getCanonicalPath are cached.

Let's look at the code underneath getCanonicalPath. The magic lies in some package-private classes in the java.io package, specfically Filesystem and UnixFilesystem. The key operation occurs in UnixFilesystem.canonicalize:

class UnixFileSystem extends FileSystem {
  public String canonicalize(String path) throws IOException {
     if (!useCanonCaches) {
       return canonicalize0(path);
     } else {
       String res = cache.get(path);
       ... 
     }
  }
  private native String canonicalize0(String path) throws IOException;
}

In other words, the path canonicalization computations are cached when useCanonCaches is true. So just when is useCanonCaches true? For that let's look at the static initialization block for Filesystem, the subclass of UnixFilesystem:

// Flags for enabling/disabling performance optimizations for file
// name canonicalization
static boolean useCanonCaches      = true;
static boolean useCanonPrefixCache = true;
... 

static {
    useCanonCaches      = getBooleanProperty("sun.io.useCanonCaches",
                                             useCanonCaches);
    useCanonPrefixCache = getBooleanProperty("sun.io.useCanonPrefixCache",
                                             useCanonPrefixCache);
} 

So by default, the cache canonicalization is on, but when you specify the VM arg -Dsun.io.useCanonCaches=false, the cache is never used.

Getting back to the puzzler, the first call to TESTDIR_SYMLINK.getCanonicalPath() always returns the path to the symlink, while the second call returns either the cached value (the path to the symlink) or the up-to-date resolved symlink, but only when -Dsun.io.useCanonCaches=false is specified.

If you don't beleive me now, go try running the test twice, once without specifying VM arguments, and once while disabling the canonicalization cache, and you'll see that it fails once, and passes another time. Hence, d. It depends.

    Replies to some of the comments

    Here are some of the comments that accompanied the survey:
    Guess: It fails.
    Comment:The target of the symlink doesn't exist so I suspect exists() will return false
    In fact, exists() will return true since the symlink exists. And in fact, the next call to getCanonicalPath returns itself, just as the comment suggested.
    Guess: It depends.
    Comment: Depending on where the root filesystem is mounted, the canonical path might be something else. For example, /etc on Mac is a symlink to /private/etc. In addition, the mountpoint for / might be a networked drive (e.g. netboot) which might have different semantics.

    The bottom line is that you can't necessarily assume that the file you use to access a file is the canonical path for that file, without knowing the filesystem.
    This is an interesting comment. I had hoped this was cleared up by this statement in the original post: "The path /testdir is not a symbolic link to another directory, and the user running the test also owns /testdir." If my explanation was insufficient, then so be it.

    There were other comments to that effect: "Does /bin/rm do what /bin/rm should do?" "Does the user have write permissions?" These are good points, and would be better suited for a UNIX puzzler. I hope people who worried about those cases looked past them to focus on Java's behavior.
    Guess: It throws an exception
    Comment: I want be the first one to check "it throws an exception"!
    Sorry, not the first.

    Meta: Thoughts on writing a puzzler

    There are at least two places where the puzzler's code could have been simplified without sacrificing its quality:
    1. Replace explicit calls that delete the files /testdir/file and /testdir/symlink, with a communicated precondition that /testdir had no files.
    2. This test has an interim call that computes the canonical path of a symlink that points to nothing, leading people to spend too much time worrying about that case. A better snippet would:
      1. point the symlink to a file that exists.
      2. compute the symlink's canonical location thereby (optionally) populating the internal cache.
      3. point the symlink to a second file that also exists.

    Tuesday, October 20, 2009

    Interim: A Symbolic Puzzler

    Last night I published a Java puzzler for your enjoyment. I will publish the answer tomorrow, but for now I thought you would enjoy an interim count of the guesses:



    There's still plenty of time to participate in the puzzler. Don't forget to use the text box if you want to back up your reasoning.