Message12280

Author	stefan.richthofer
Recipients	jeff.allen, stefan.richthofer
Date	2019-01-05.15:53:38
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1546703618.69.0.542676986718.issue2663@roundup.psfhosted.org>
In-reply-to

Content
> Why does "thresh" go down so slowly BTW? From my experiments the starting value is already rather sharp, so in most cases it should work in the first attempt. The decrease value was just picked "by heart". Maybe another value works better or another approach at all. One could perform the embedding in more fine-grained manner, e.g. store constants and locals in their own proper fields. Also re-unfolding the PyBytecode object could be better done in the loader rather than by post-processing the module after load. But that would have been a lot more work and just serializing the whole PyBytecode object in a self-contained manner was a directly workable solution to bridge the time until we would improve it. Re binary data in string: I mainly wondered why I had to use base64 at all if strings can carry arbitrary binary data. I think the answer is that while string maybe able to do this (still concerned with javadoc of String's byte[]-constructor: "Constructs a new String by decoding the specified array of bytes using the platform's default charset."), String literals cannot. The bytecode is embedded into the classfile as a sequence of string literals (a single literal is too restrictive in its maximal size). This decomposition logic is contained in insert_code_str_to_classfile. This has a quite expressive javadoc. > understand the processing in the catch of Module.compile Roughly speaking, it figures out which individual methods are oversized, so it does not have to use PyBytecode for the whole module. The results are stored by name in oversized_methods. Nested methods/functions/whateverBecomesPyBytecode are then handled via cpython bytecode in any case. It would be too tedious to sort this out code object wise, maybe an undecidable problem at all. It contains some out-commented code to handle method nesting cycles. I don't see how cycles could occur in method nesting, but Python is so packed with dubious features that I initially felt it would be safer to handle possible cycles (e.g. think of auto-created methods like attrs does). I out-commented that code after careful testing, but I recommend to keep the comments so we can easily revive it if we discover a case where it's suddenly needed. Finally it attempts a re-write with the oversized methods removed. These are re-inserted based on PyBytecode loaded from string literals in BytecodeLoader.fixPyBytecode. It would likely be possible to create the proper PyBytecode-based method objects during compilation, store them as constants and perform the loading in module initialization code. Then the module class files would be truely self-contained without the need for post-processing in fixPyBytecode. That would also eliminate the need for the marker interface ContainsPyBytecode. However, it's complicated work without direct benefit... You go first :) (Seriously: Let's tackle this for Jython 3 if at all.) loadPyBytecode takes care of loading the PyBytecode from a pyc-file. It also looks for the pyc-file, maybe attempts to create one via cpython, takes care of user messages. This might maybe be improved by some split-up as well. serializePyBytecode takes care of code serialization. It's the only method in this class that strictly requires a fix due to DatatypeConverter. The approach via marshal might work as well if base64 is used afterwards. However when I came to that insight the Java serialization based version was already finished. If you rework this, please fix the comment in https://github.com/jythontools/jython/blob/bc8f61ac8256f2e9a1046f84d2b66a9deed037b5/src/org/python/compiler/Module.java#L860: It was meant to be "// so we use Java-serialization..." rather than "reflection".

> Why does "thresh" go down so slowly BTW?

From my experiments the starting value is already rather sharp, so in most cases it should work in the first attempt. The decrease value was just picked "by heart". Maybe another value works better or another approach at all.

One could perform the embedding in more fine-grained manner, e.g. store constants and locals in their own proper fields. Also re-unfolding the PyBytecode object could be better done in the loader rather than by post-processing the module after load. But that would have been a lot more work and just serializing the whole PyBytecode object in a self-contained manner was a directly workable solution to bridge the time until we would improve it.

Re binary data in string: I mainly wondered why I had to use base64 at all if strings can carry arbitrary binary data. I think the answer is that while string maybe able to do this (still concerned with javadoc of String's byte[]-constructor: "Constructs a new String by decoding the specified array of bytes using the platform's default charset."), String literals cannot. The bytecode is embedded into the classfile as a sequence of string literals (a single literal is too restrictive in its maximal size). This decomposition logic is contained in insert_code_str_to_classfile. This has a quite expressive javadoc.

> understand the processing in the catch of Module.compile

Roughly speaking, it figures out which individual methods are oversized, so it does not have to use PyBytecode for the whole module. The results are stored by name in oversized_methods.

Nested methods/functions/whateverBecomesPyBytecode are then handled via cpython bytecode in any case. It would be too tedious to sort this out code object wise, maybe an undecidable problem at all.

It contains some out-commented code to handle method nesting cycles. I don't see how cycles could occur in method nesting, but Python is so packed with dubious features that I initially felt it would be safer to handle possible cycles (e.g. think of auto-created methods like attrs does). I out-commented that code after careful testing, but I recommend to keep the comments so we can easily revive it if we discover a case where it's suddenly needed.

Finally it attempts a re-write with the oversized methods removed. These are re-inserted based on PyBytecode loaded from string literals in BytecodeLoader.fixPyBytecode. It would likely be possible to create the proper PyBytecode-based method objects during compilation, store them as constants and perform the loading in module initialization code. Then the module class files would be truely self-contained without the need for post-processing in fixPyBytecode. That would also eliminate the need for the marker interface ContainsPyBytecode. However, it's complicated work without direct benefit... You go first :)
(Seriously: Let's tackle this for Jython 3 if at all.)

loadPyBytecode takes care of loading the PyBytecode from a pyc-file. It also looks for the pyc-file, maybe attempts to create one via cpython, takes care of user messages. This might maybe be improved by some split-up as well.

serializePyBytecode takes care of code serialization. It's the only method in this class that strictly requires a fix due to DatatypeConverter. The approach via marshal might work as well if base64 is used afterwards. However when I came to that insight the Java serialization based version was already finished. If you rework this, please fix the comment in https://github.com/jythontools/jython/blob/bc8f61ac8256f2e9a1046f84d2b66a9deed037b5/src/org/python/compiler/Module.java#L860:
It was meant to be "// so we use Java-serialization..." rather than "reflection".

History
Date	User	Action	Args
2019-01-05 15:53:38	stefan.richthofer	set	messageid: <1546703618.69.0.542676986718.issue2663@roundup.psfhosted.org>
2019-01-05 15:53:38	stefan.richthofer	set	recipients: + stefan.richthofer, jeff.allen
2019-01-05 15:53:38	stefan.richthofer	link	issue2663 messages
2019-01-05 15:53:38	stefan.richthofer	create