I've done this now for Windows, adding a 'java.nt' key to _failures.
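
For readers unfamiliar with the mechanism, a platform-keyed table of expected failures can be sketched as below. This is only an illustration: the test names are placeholders, the helper `expected_failures` is hypothetical, and the assumption that 'java.nt' inherits the entries under 'java' may not match how regrtest actually resolves its keys.

```python
# Hypothetical sketch of a platform-keyed expected-failures table.
# The test names are placeholders, not the real regrtest lists.
_failures = {
    'java': """
        test_example_a
        test_example_b
        """,
    'java.nt': """
        test_example_c
        """,
}


def expected_failures(platform_key, failures=_failures):
    """Return the set of tests expected to fail for a key such as 'java.nt'.

    Assumption: entries for every dotted prefix of the key are merged,
    so 'java.nt' also picks up everything listed under 'java'.
    """
    names = set()
    parts = platform_key.split('.')
    for i in range(1, len(parts) + 1):
        key = '.'.join(parts[:i])
        names.update(failures.get(key, '').split())
    return names
```

With this layout, a Windows-only failure is added under 'java.nt' without touching the list shared by all platforms.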

I don't know how we would "rigorously sort tests into stable vs. unstable". I saw test_glob fail yesterday, out of the blue, and I couldn't repeat it. Does that make it unstable? (It was just an unlink() failure, so I made that non-fatal.)
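
As an illustration of what "non-fatal" can mean for a clean-up step, here is a minimal sketch. The helper `remove_quietly` is hypothetical and is not the actual change made to test_glob; it only shows the general idea of reporting an unlink() failure instead of letting it abort the test.

```python
import os
import sys


def remove_quietly(path):
    """Try to delete path; warn instead of raising if the OS refuses.

    Returns True if the file was removed, False otherwise.
    """
    try:
        os.unlink(path)
        return True
    except OSError as e:
        # Report the problem so it is still visible in the test log.
        sys.stderr.write("warning: could not unlink %r: %s\n" % (path, e))
        return False
```

The trade-off is that a genuine clean-up bug now shows up only as a warning, so the message needs to stay visible in the test output.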

If we want repeatable tests, we should err on the side of treating the tests we find unreliable as expected failures (or skipping them in the module), but only if we also record each such choice as an issue.
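
Skipping in the module while keeping a pointer to the issue could look something like the sketch below. The names `IS_JYTHON_NT`, `ExampleTest` and `test_sometimes_fails` are made up for illustration, "#NNNN" is a placeholder for whatever issue gets filed, and the platform check is an assumption about how Jython on Windows might be detected.

```python
import os
import sys
import unittest

# Assumption: Jython reports a sys.platform starting with 'java' and
# exposes the underlying OS family as os._name ('nt' on Windows).
IS_JYTHON_NT = (sys.platform.startswith('java')
                and getattr(os, '_name', None) == 'nt')


class ExampleTest(unittest.TestCase):

    # The skip reason carries the issue reference, so the decision to
    # skip is not silently forgotten. "#NNNN" is a placeholder.
    @unittest.skipIf(IS_JYTHON_NT,
                     "unreliable on Jython/Windows; see issue #NNNN")
    def test_sometimes_fails(self):
        self.assertTrue(True)
```

The reason string is printed in verbose test runs, which keeps the link between the skipped test and the open issue in front of whoever reads the results.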
