Issue2672

classification
Title: Integer formatting emits two minus signs with -2^31
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone:
process
Status: pending Resolution: fixed
Dependencies: Superseder:
Assigned To: stefan.richthofer Nosy List: jamesmudd, jeff.allen, stefan.richthofer
Priority: normal Keywords: easy

Created on 2018-04-29.15:05:58 by jeff.allen, last changed 2018-07-09.13:14:43 by stefan.richthofer.

Files
File name Uploaded Description Edit Remove
IntegerBug.java jamesmudd, 2018-05-15.22:23:03
Messages
msg11925 (view) Author: Jeff Allen (jeff.allen) Date: 2018-04-29.15:05:57
First observed by Helmut Niemann (thanks):

x = int(-(1<<31))
print ('%d' % x) # emits --2147483648

One also observe it as:

>>> int(-(1<<31)).__format__('d')
'--2147483648'

It's important that x be an int, not a long, which is in fact a clue to the work-around. Clearly this is a thinking error somewhere in int.__format__, so we should add a regression test in case it recurs.
msg11982 (view) Author: James Mudd (jamesmudd) Date: 2018-05-15.22:23:03
I have had a quick look at this and it's kind of an edge case were i'm not sure if we should fix it.

The issue occurs in: org.python.core.stringlib.IntegerFormatter.format_d(int)

Where the minus sign is handled separate to the value itself. The value is then negated and passed to Java Integer.toString(-value). This is suposed to result in a positive sting representation. However in the edge case of passing in Integer.MIN_VALUE it results in "-2147483648". This can be demonstrated just in java see attached example.

The core of the problem is Integer.MIN_VALUE=-2^31 whereas Integer.MAX_VALUE=2^31-1 (https://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html) so -Integer.MIN_VALUE doesn't fit in an int.
msg11986 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-05-16.19:17:06
We should definitely fix it. It is invalid behavior, can cause arbitrary issues, unforeseeable damage, etc. It must be fixed.

That said, an easy fix should be to replace
number = Integer.toString(-value);
by
number = Long.toString(-((long) value));

This impacts performance of non-edge-case calls. I don't think that format_d is a performance critical function, but - for what it's worth -
we can do it like

number = value == Integer.MIN_VALUE ? "2147483648" : Integer.toString(-value);
msg11987 (view) Author: James Mudd (jamesmudd) Date: 2018-05-16.22:08:23
I'm happy for it to be fixed also. Think I would prefer

number = Long.toString(-((long) value));

It's a bit cleaner and like Stefan I don't think performance is a issue here.
msg11988 (view) Author: James Mudd (jamesmudd) Date: 2018-05-17.16:46:28
Pull request to fix this one https://github.com/jythontools/jython/pull/102
msg11989 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-05-17.19:11:27
I thought about it and would slightly prefer the second version.
- it is still more efficient
- it somewhat clearly documents the nature of an edge case

Your PR looks good, especially thanks for writing a test!
If you don't mind I would just switch the one-liner to the second version on merging. So you won't have to do anything.
Is that okay?
msg11990 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-05-17.19:21:49
Oh, and it occurred to me that we should assert that not a corresponding format_x function of another primitive type is affected by the same kind of issue.
msg11993 (view) Author: James Mudd (jamesmudd) Date: 2018-05-17.20:17:16
Ok I don't mind the one liner either your right its faster. Style wise I would prefer an if e.g

            if(value == Integer.MIN_VALUE) {
                number = Long.toString(-((long)value));
            } else {
                number = Integer.toString(value);
            }

Just had a quick look and I think this bug is present in other places too. Long should be ok as that will go though IntegerFormatter.format_d(BigInteger) but the bug is present in the other int formatting i.e to hex, octal, binary, char. These need tests and fixing to.

Actually no I think the nicest solution might be to call the BigInteger equivalent methods if value == Integer.MIN_VALUE. What do you think of this idea?

I will work on a pull request to fix all the cases I know of.
msg11994 (view) Author: James Mudd (jamesmudd) Date: 2018-05-17.21:26:34
I have updated the pull request https://github.com/jythontools/jython/pull/102

I have added more tests covering the other cases where this could fail. The fix I have implemented is to delegate all the int methods to the BigInteger version counterparts. This makes for quite a simplification as well as fixing this bug.

It is debatable if this will reduce performance, I haven't benchmarked it. I suspect it will be very close, the new way can make use of BigInteger optimisations.
msg11995 (view) Author: Jeff Allen (jeff.allen) Date: 2018-05-17.22:02:27
Thanks both for picking this up.

If we use a clause guarded by value == Integer.MIN_VALUE, in that special case, the rational thing is to return a string constant. Why do math? However, promoting everything to BigInteger is ok I think.
msg12017 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-24.13:23:04
This issue is somehow stuck it seems. It is unclear to me whether it is worth to risk BigInteger-induced performance regression. I'd be in favor of returning the string constant, like Jeff also pointed out.

How shall we proceed here?
msg12018 (view) Author: James Mudd (jamesmudd) Date: 2018-06-24.21:16:38
I don't really mind either way. Promotion to BigInteger seemed like the tidiest approach to me, removes quite a bit of code. It might be slower than returning a constant so if that would be preferred i'm willing to change it?
msg12019 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-24.21:24:58
> Promotion to BigInteger seemed like the tidiest approach to me

yes, I see this advantage too and decision is not easy.
I'd still value performance slightly higher.
At least as far as it's not overdone, otherwise we would end up with an inlining madness. So maybe it's a matter of taste what should be done here...
msg12022 (view) Author: James Mudd (jamesmudd) Date: 2018-06-27.18:34:46
Ok decided to try and benchmark this and here are the results.

Current master:
	Jython 2.7.2a1+ (, Jun 27 2018, 19:16:18) 
	[OpenJDK 64-Bit Server VM (Oracle Corporation)] on java1.8.0_171
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> def print_int():
	...   print('%d' % 6235676672)
	... 
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.309999942779541
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.3050000667572021
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.306999921798706
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.2990000247955322

With promotion to BigInteger the suggested fix on PR 102:
	Jython 2.7.2a1+ (, Jun 27 2018, 19:03:22) 
	[OpenJDK 64-Bit Server VM (Oracle Corporation)] on java1.8.0_171
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> execfile('bench.py')
	>>> def print_int():
	...   print('%d' % 6235676672)
	... 
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.312000036239624
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.3289999961853027
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.306999921778706
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	1.3040001392364502

Just for interest here is CPython:
	Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
	[GCC 5.4.0 20160609] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> def print_int():
	...   print('%d' % 6235676672)
	... 
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	9.2016921043396
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	9.216765880584717
	>>> timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
	9.206844091415405

That's running 1,000,000,000 (1 billion) print functions formatting an integer. Jython is fast here! Around 7 times faster than CPython.

The difference between promoting to a BigInteger and the existing system is very small <1%, and that is without adding the additional "if" required to return the string constant, so in my opinion the code simplification offered by promotion is the better solution.

Off topic: I'm kind of surprised how much Jython beats CPython by in this. JVM string handling is very fast but still there must be stuff CPython are doing badly. I would be interested if anyone know why, is this benchmark flawed somehow? The scaling with the number of repeats looks ok to me, so I don't think i'm looking at the JVM being clever somehow.
msg12023 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-27.18:50:43
Are you sure that the print command itself, i.e. the console displaying the output does not dominate the formatting cost?

Then, CPython had some improvements since 2.7. How does it look on CPython 3.7?

Still, it's good to hear that Jython can perform so well compared to CPython. Unfortunately I found some situations where it's kind of the other way round, e.g. try some SymPy examples. This will require some performance-scoped revision. Distinct issue however.
msg12026 (view) Author: James Mudd (jamesmudd) Date: 2018-06-27.19:47:43
Thanks Stefan I have redone it without the print with very different results which I don't understand. Here they are:

CPython 2
	Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
	[GCC 5.4.0 20160609] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.13201189041137695
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.0979149341583252
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.09714412689208984

CPython 3
	Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
	[GCC 5.4.0 20160609] on linux
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.08939030999954412
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.09239740000066377
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	0.12687400600043475


Jython current master
	Jython 2.7.2a1+ (, Jun 27 2018, 19:16:18) 
	[OpenJDK 64-Bit Server VM (Oracle Corporation)] on java1.8.0_171
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.0910000801086426
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.120000123977661
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.115000009536743

Jython with BigInteger promotion fix
	Jython 2.7.2a1+ (, Jun 27 2018, 20:36:23) 
	[OpenJDK 64-Bit Server VM (Oracle Corporation)] on java1.8.0_171
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import timeit
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.7939999103546143
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.6410000324249268
	>>> timeit.timeit('"%d" % 64726787', number=10000000)
	2.635999917984009

Some things to note. This is now without the print so that should be fixed, however its also with much fewer repeats "only" 10,000,000 (10 million) not 1 billion as before. The reason is Jython was hanging with 1 billion, maybe I just didn't wait long enough. So now CPython is much faster than Jython, and Jython is slower to do this 10m times than it was to do the print + this 1b times, so I really don't understand what this is doing! With this result the BigInteger performance is significantly worse (~25% slower) than without so I think that's a reason to go with the constant. Not much difference between python 2 and 3.

Just if someone can explain this I would like to know!
msg12027 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-27.20:04:18
I think this requires proper Java profiling to be answered. Such investigation is a long standing todo for me. Anyway, we should move performance discussion to a separate issue.
The new results are comparable to what I observed with SymPy and are the reason why I carefully question every bit of additional performance regression. At least until we understand this better. Jython should be faster than CPython as it as a JIT, no? I.e. something is wrong and we will find out what it is :)
msg12031 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-28.08:56:33
A possible explanation is that Jython, Java or the JIT (not sure right now) might trigger some memoization for repeated print calls with the same arg.
msg12034 (view) Author: James Mudd (jamesmudd) Date: 2018-06-30.12:04:52
I have changed https://github.com/jythontools/jython/pull/102 to add special casing for the min int case.

I have realised why the benchmark was wrong and it was stupid. I never actually called the function I had:
    timeit.timeit('print_int', setup="from __main__ import print_int", number=1000000000)
when it should have been:
   timeit.timeit('print_int()', setup="from __main__ import print_int", number=1000000000)

So I guess looking up a function in the globals is faster in Jython for what that's worth.
msg12035 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-30.12:24:16
I looked at the new version. There might be a configuration where the call to positiveSign actually does something, which might result now in something like +-2147483648. The version I suggested earlier, i.e.
number = value == Integer.MIN_VALUE ? "2147483648" : Integer.toString(-value);
appears safer to me (and more explicit).

> I guess looking up a function in the globals is faster in Jython for what that's worth.
I'm convinced that Jython can be potentially faster than CPython, obviously because of the JIT. It's a question of identifying the performance flaws in the current impementation.
msg12036 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-06-30.13:25:59
Oh, and the fix should be recovered for the other format_x functions.
msg12042 (view) Author: James Mudd (jamesmudd) Date: 2018-07-05.20:29:36
I have updated the pull request to use the string constant.

Also I investigated the other cases further and they actually worked ok. The reason is Integer.{toHexString, toOctalString, toBinaryString} work in unsigned. From the Javadoc:
	Returns a string representation of the integer argument as an unsigned integer in base 8.
	The unsigned integer value is the argument plus 232 if the argument is negative; otherwise, it is equal to the argument.

So those don't require special casing and there are now tests covering those as well.
msg12045 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-07-08.20:35:01
James, thanks for this work and for the tests.
I will accept and merge as soon as I find time.
msg12046 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-07-09.13:14:43
Fixed as of https://hg.python.org/jython/rev/123a2956db3f.
History
Date User Action Args
2018-07-09 13:14:43stefan.richthofersetstatus: open -> pending
resolution: accepted -> fixed
messages: + msg12046
2018-07-08 20:35:01stefan.richthofersetassignee: stefan.richthofer
messages: + msg12045
2018-07-05 20:29:37jamesmuddsetmessages: + msg12042
2018-06-30 13:25:59stefan.richthofersetmessages: + msg12036
2018-06-30 12:24:17stefan.richthofersetmessages: + msg12035
2018-06-30 12:04:53jamesmuddsetmessages: + msg12034
2018-06-28 08:56:33stefan.richthofersetmessages: + msg12031
2018-06-27 20:04:19stefan.richthofersetresolution: accepted
messages: + msg12027
2018-06-27 19:47:43jamesmuddsetmessages: + msg12026
2018-06-27 18:50:44stefan.richthofersetmessages: + msg12023
2018-06-27 18:34:47jamesmuddsetmessages: + msg12022
2018-06-24 21:24:58stefan.richthofersetmessages: + msg12019
2018-06-24 21:16:39jamesmuddsetmessages: + msg12018
2018-06-24 13:23:05stefan.richthofersetmessages: + msg12017
2018-05-17 22:02:27jeff.allensetmessages: + msg11995
2018-05-17 21:26:34jamesmuddsetmessages: + msg11994
2018-05-17 20:17:17jamesmuddsetmessages: + msg11993
2018-05-17 19:21:50stefan.richthofersetmessages: + msg11990
2018-05-17 19:11:27stefan.richthofersetmessages: + msg11989
2018-05-17 16:46:29jamesmuddsetmessages: + msg11988
2018-05-16 22:08:23jamesmuddsetmessages: + msg11987
2018-05-16 19:17:06stefan.richthofersetnosy: + stefan.richthofer
messages: + msg11986
2018-05-15 22:23:04jamesmuddsetfiles: + IntegerBug.java
nosy: + jamesmudd
messages: + msg11982
2018-04-29 15:05:58jeff.allencreate