Issue2663

classification
Title: Remove dependency on javax.xml.bind.DatatypeConverter
Type: crash Severity: normal
Components: Core Versions:
Milestone: Jython 2.7.2
process
Status: pending Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen, stefan.richthofer
Priority: high Keywords: Java Roadmap

Created on 2018-04-01.16:26:04 by jeff.allen, last changed 2019-01-08.22:46:28 by jeff.allen.

Messages
msg11867 (view) Author: Jeff Allen (jeff.allen) Date: 2018-04-01.16:26:04
javax.xml.bind.DatatypeConverter is not present in Java 9. This prevents use with Java 9.

We are using it in core.ByteCodeLoader and compiler.Module in order to convert to/from Base64. However, we have our own code for that (in the codecs module), which I think could be applied here.

There is a java.util.Base64 class in Java 8, but it is not available in 7, which we intend to support a little while longer.
msg12250 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-03.08:34:21
It is more accurate to say that javax.xml.bind.DatatypeConverter is not available in Java 9 SE, used normally. http://openjdk.java.net/jeps/320. There are workarounds (https://stackoverflow.com/questions/43574426/how-to-resolve-java-lang-noclassdeffounderror-javax-xml-bind-jaxbexception-in-j) but I think we would like this to work without.

So why, in practice, does Jython run happily on Java 9 SE?

Both occurrences of the class in question exist to support large modules (#527524). So execution is rare in normal use. The related test_large_method_bytecode_jy that might reveal the problem does not execute any test cases where it occurs in regression tests, since it lacks the test_main function. Run as a program it fails, like this:

PS jython-jvm9> dist\bin\jython -m test.test_large_method_bytecode_jy
...
  File "C:\Users\Jeff\Documents\Eclipse-O\jython-jvm9\dist\Lib\test\test_large_method_bytecode_jy.py", line 25, in setUpClass
    import large_methods as _large_methods
org.objectweb.asm.MethodTooLargeException: Method too large: test/large_methods$py.large_function$1 (Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;
        at org.objectweb.asm.MethodWriter.computeMethodInfoSize(MethodWriter.java:2089)
        at org.objectweb.asm.ClassWriter.toByteArray(ClassWriter.java:458)
        at org.python.compiler.ClassFile.write(ClassFile.java:263)
        at org.python.compiler.Module.write(Module.java:664)

So where is the NoClassDefFoundError we should expect?

This happens because the message in the exception is not exactly what was expected by the code. That's probably something I broke by upgrading ASM to 7.0 recently. Testing the message text is a yukky way to sort exceptions, of course. It must've been necessary once (see example in #1891), but in ASM 7.0's favour, it now defines the specific exception class seen above.

With improvements to address these two, we can now fail as we ought:

PS jython-jvm9> dist\bin\jython -m test.regrtest test_large_method_bytecode_jy
test_large_method_bytecode_jy
test test_large_method_bytecode_jy crashed -- <type 'java.lang.NoClassDefFoundError'>: java.lang.NoClassDefFoundError: javax/xml/bind/DatatypeConverter
...

I'll commit that as a distinct change, probably with a skip on Java >= 9 for now, then move on to the main event, to find a substitute for DatatypeConverter. It is helpful to know it is only there to solve one problem: it is used to embed Python byte code as a large constant. There may be another way to do that.
msg12256 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-03.17:39:37
Jeff, thanks for investigating. Indeed DatatypeConverter is only used to encode/decode between binary data and base64. Maybe we can use java.util.Base64 instead? (I think the version via DatatypeConverter was the recommended way in some stack overflow search at that time.)
However any Base64 encoder/decoder implementation should do. There is probably some plain Java open source implementation available we can bundle. Storing it in string constants is kind of provisional anyway. There is probably a better solution, e.g. bytecode attributes or something.


> Testing the message text is a yukky way to sort exceptions

Agreed. This was mainly necessary because there is apparently no way to determine beforehand whether a given function/method will exceed size limits: Size limits scope bytecode, not source code. And (at least those days) asm didn't offer a way to detect it in advance. Sorry if I overlooked something. Then, the exception type was too generic to triage it properly into oversized method handling. And the solution based on CPython bytecode should not be applied to method code that does not really require it. If you see a better way to triage the exception or if there is a better way now with current asm I am fully in support to improve this.
msg12258 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-03.17:42:33
Just noticed that java.util.Base64 was just added in Java 8. So this yields again the question of dropping Java 7 compatibility.
(Or can we catch the NoClassDefFoundError and fallback?)
Using an independent Base64 implementation is probably the best solution. Isn't there one in the Python stl?
msg12260 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-03.17:49:35
Indeed we can maybe use it from Python stl: https://docs.python.org/2/library/base64.html

There is also Apache Commons Codec: http://commons.apache.org/proper/commons-codec/

Also see https://stackoverflow.com/questions/13109588/base64-encoding-in-java

For sake of avoiding yet another dependency the Python stl version should be favorable. OTOH we are already shipping an Apache Commons module: Apache Commons Compress (https://github.com/jythontools/jython/blob/master/extlibs/commons-compress-1.14.jar). So Apache Commons seems to be fairly modular, it might not be a big deal.
msg12262 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-04.08:19:06
> > Testing the message text is a yukky way to sort exceptions

> Agreed. ... If you see a better way to triage the exception or if there is a better way now with current asm I am fully in support to improve this.

The version of ASM now in use throws a specific exception we can catch by type. I realised looking at #1891 that this had not been available when the code was written.

I'm a little surprised at Oracle removing this without warning. Java is usually good for continuity. (Ok, I found work-arounds for the user, but that feels like a removal.)

I think binascii.b2a_base64 is the thing (except for the annoying newline).
msg12264 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-04.23:23:04
I took a short look at the binascii.b2a_base64 implementation
https://github.com/jythontools/jython/blob/2dd11357668506cafd71697a7632ad4527c76b8a/src/org/python/modules/binascii.java#L433

It looks dubious to me that it attempts to store the binary data in a string. Does Java accept this? I'd suspect that Java might attempt to decode it via the platform default coding. However, it's certainly tested and as long as it retrieves what it encodes it should work fine.
The binary data is finally retrieved using String.getBytes...?
msg12270 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-05.06:33:00
We always store our binary data in a String. Dubiously. :/

Perhaps a local duplicate of the logic would be tolerable in the interest of using the proper types.

I'd like to make sure I understand, maybe factor into private methods, the processing in the catch of Module.compile. Why does "thresh" go down so slowly BTW? Why not halve it, say?
msg12280 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-05.15:53:38
> Why does "thresh" go down so slowly BTW?

From my experiments the starting value is already rather sharp, so in most cases it should work in the first attempt. The decrease value was just picked "by heart". Maybe another value works better or another approach at all.

One could perform the embedding in more fine-grained manner, e.g. store constants and locals in their own proper fields. Also re-unfolding the PyBytecode object could be better done in the loader rather than by post-processing the module after load. But that would have been a lot more work and just serializing the whole PyBytecode object in a self-contained manner was a directly workable solution to bridge the time until we would improve it.

Re binary data in string: I mainly wondered why I had to use base64 at all if strings can carry arbitrary binary data. I think the answer is that while string maybe able to do this (still concerned with javadoc of String's byte[]-constructor: "Constructs a new String by decoding the specified array of bytes using the platform's default charset."), String literals cannot. The bytecode is embedded into the classfile as a sequence of string literals (a single literal is too restrictive in its maximal size). This decomposition logic is contained in insert_code_str_to_classfile. This has a quite expressive javadoc.


> understand the processing in the catch of Module.compile

Roughly speaking, it figures out which individual methods are oversized, so it does not have to use PyBytecode for the whole module. The results are stored by name in oversized_methods.

Nested methods/functions/whateverBecomesPyBytecode are then handled via cpython bytecode in any case. It would be too tedious to sort this out code object wise, maybe an undecidable problem at all.

It contains some out-commented code to handle method nesting cycles. I don't see how cycles could occur in method nesting, but Python is so packed with dubious features that I initially felt it would be safer to handle possible cycles (e.g. think of auto-created methods like attrs does). I out-commented that code after careful testing, but I recommend to keep the comments so we can easily revive it if we discover a case where it's suddenly needed.

Finally it attempts a re-write with the oversized methods removed. These are re-inserted based on PyBytecode loaded from string literals in BytecodeLoader.fixPyBytecode. It would likely be possible to create the proper PyBytecode-based method objects during compilation, store them as constants and perform the loading in module initialization code. Then the module class files would be truely self-contained without the need for post-processing in fixPyBytecode. That would also eliminate the need for the marker interface ContainsPyBytecode. However, it's complicated work without direct benefit... You go first :)
(Seriously: Let's tackle this for Jython 3 if at all.)


loadPyBytecode takes care of loading the PyBytecode from a pyc-file. It also looks for the pyc-file, maybe attempts to create one via cpython, takes care of user messages. This might maybe be improved by some split-up as well.


serializePyBytecode takes care of code serialization. It's the only method in this class that strictly requires a fix due to DatatypeConverter. The approach via marshal might work as well if base64 is used afterwards. However when I came to that insight the Java serialization based version was already finished. If you rework this, please fix the comment in https://github.com/jythontools/jython/blob/bc8f61ac8256f2e9a1046f84d2b66a9deed037b5/src/org/python/compiler/Module.java#L860:
It was meant to be "// so we use Java-serialization..." rather than "reflection".
msg12282 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-05.20:22:53
Thanks for the explanations, Stefan.

> Re binary data in string: I mainly wondered why I had to use base64 at all if strings can carry arbitrary binary data. I think the answer is that while string maybe able to do this (still concerned with javadoc of String's byte[]-constructor: "Constructs a new String by decoding the specified array of bytes using the platform's default charset."), String literals cannot.

That's right. Literal strings in the class file are characters encoded using the JVM's variant of UTF-8. I believe arbitrary 16-bit values could be stored in a String and represented as a constant. You could pack a byte array into a String at almost 2 bytes/character, but with random data you would end up with nearly 3/2 bytes/byte in the UTF-8, compared to 4/3 now, so base64 is good.

> The bytecode is embedded into the classfile as a sequence of string literals (a single literal is too restrictive in its maximal size). This decomposition logic is contained in insert_code_str_to_classfile. This has a quite expressive javadoc.

The length limit applies to the encoded form, not the text, but for base64 these are the same. I've revised the javadoc to clarify this.

Module.compile: I have to re-indent it as we saved an if-statement, but I don't intend to revise the approach, just improve readability to make checking mny work easier. I thought of a nicer structure (same approach) but I'm not sure I can do it without breakage.

At present I'm going for my own, simplified, base64 encoder/decoder that the Java 8 one could replace directly.

> It was meant to be "// so we use Java-serialization..." rather than "reflection".
I wondered about that. Ok, that's corrected.
msg12284 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-06.09:38:03
Now in at https://hg.python.org/jython/rev/79cd4168b63d .

Outside the base64 bit, changes are white space. Except I have tweaked the decrementing of thresh to go by 1000s, because 120 attempts seems a lot to me. My instinct would be for thresh = thresh * 2 / 3 or something like that, but the tests we have don't seem to exercise this code so I've no ready tool for choosing (and it's not a priority).
msg12286 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-06.11:23:52
Note that installing sympy and importing it successfully is a substantial and non-trivial test for handling oversized methods. It contains at least 4 modules affected by this limitation. After editing the concerned code, that is definitely a good assertion to make.
msg12300 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-08.20:46:54
Installing sympy was a worthwhile exercise. It can only be tested on an installed Jython, so one has to make an installer, which I don't do very often. This revealed a regression in the Linux version (fixed).

However, the large file compiler in Module.loadPyBytecode has also broken (now #2732).
msg12304 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2019-01-08.21:17:35
Making an installer should be straight forward via
ant installer. Did that work?
build.xml documents the full-build target for a release. I never successfully completed that target due to missing closed-source dependencies. I once tried to get them but somehow gave up when I had to register etc. However ant installer usually went fine.
msg12308 (view) Author: Jeff Allen (jeff.allen) Date: 2019-01-08.22:46:28
> Making an installer should be straight forward via ant installer. Did that work?

Yes. Tht's how I was able to test pip on Windows and Linux, revealing #2732. ISTR full-build has not worked for me either, but will seek Frank's help as a separate issue/thread.
History
Date User Action Args
2019-01-08 22:46:28jeff.allensetmessages: + msg12308
2019-01-08 21:17:35stefan.richthofersetmessages: + msg12304
2019-01-08 20:46:54jeff.allensetmessages: + msg12300
2019-01-06 11:23:52stefan.richthofersetmessages: + msg12286
2019-01-06 09:38:03jeff.allensetstatus: open -> pending
resolution: accepted -> fixed
messages: + msg12284
2019-01-05 20:22:53jeff.allensetmessages: + msg12282
2019-01-05 15:53:38stefan.richthofersetmessages: + msg12280
2019-01-05 06:33:00jeff.allensetmessages: + msg12270
2019-01-04 23:23:04stefan.richthofersetmessages: + msg12264
2019-01-04 08:19:06jeff.allensetmessages: + msg12262
2019-01-03 17:49:35stefan.richthofersetmessages: + msg12260
2019-01-03 17:42:34stefan.richthofersetmessages: + msg12258
2019-01-03 17:39:38stefan.richthofersetmessages: + msg12256
2019-01-03 08:34:22jeff.allensetassignee: jeff.allen
resolution: accepted
messages: + msg12250
nosy: + stefan.richthofer
2018-04-01 16:26:04jeff.allencreate