Issue1702

classification

Title:	wrong byteorder / endianness detected
Type:	behaviour	Severity:	critical
Components:	Core	Versions:	2.5.2rc
		Milestone:

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	amak	Nosy List:	amak, fwierzbicki, lukas, pjenvey
Priority:		Keywords:	patch

Created on 2011-01-26.13:48:48 by lukas, last changed 2013-02-18.00:21:15 by amak.

Files
File name	Uploaded	Description
byteorder.patch	lukas, 2011-01-26.16:22:22	patch

Messages
msg6354 (view)	Author: Lukas (lukas)	Date: 2011-01-26.13:48:47
On my little-endian x86 Linux machine, jython reports the wrong byteorder. This also affects the behaviour of struct and array and introduces subtle errors when doing I/O, especially as array.tofile() uses the (wrongly) detected native format.

Test code:
>>> import sys
>>> sys.byteorder
'big'
--> should be 'little'

My jython version:
Jython 2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_20
msg6355 (view)	Author: Lukas (lukas)	Date: 2011-01-26.16:22:22
Added patch for sys.byteorder and struct. The changes in PyArray.java are a bit more involved. Probably LittleEndianDataInputStream/OutputStream from guava v8 could be used. Otherwise, the documentation regarding jython and byteorder should be updated (i.e. that it is fixed to 'big')
msg6361 (view)	Author: Philip Jenvey (pjenvey)	Date: 2011-01-29.00:35:51
I'm not sure about this. Java's byte order is big-endian, so we have always hardcoded sys.byteorder accordingly. nio now exposes the native byte order of the underlying platform, but I think it only does this for its own performance reasons.

Can you describe how this is an issue for you?
msg6374 (view)	Author: Lukas (lukas)	Date: 2011-01-31.08:13:29
Hi Philip,

Thank you for your reply. It's not a particular issue for me at the moment, but rather a very unexpected behaviour (and I think this holds for every developer who comes from CPython).

First of all, I would expect that two Python interpreters behave the same way if they're interpreting the same code on the same machine.

Second, if this is not the case, the docs should clearly say so (and even then this is really suboptimal). But neither the docs for sys nor those for struct do so. In the docs of struct, it's written "If the first character is not one of these, '@' is assumed. Native byte order is big-endian or little-endian, depending on the host system." and in the docs for sys.byteorder it says "An indicator of the native byte order. This will have the value 'big' on big-endian (most-significant byte first) platforms, and 'little' on little-endian (least-significant byte first) platforms.". I think it is obvious that this introduces a source of many subtle bugs, which are hard to find and may lead to serious data corruption.

Third, I don't see an obvious way to detect the endianness in pure Python (which is what I thought sys.byteorder was for).
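For reference, a minimal sketch of one way to probe, in pure Python, which byte order the struct module treats as native. The helper name struct_native_byteorder is hypothetical. It only reports what the interpreter's struct module does, so on a Jython that hardcodes big-endian it would presumably return 'big' whatever the underlying CPU is.

```python
import struct


def struct_native_byteorder():
    """Return 'little' or 'big' according to how struct packs in native order.

    This reflects the interpreter's idea of "native", which is not
    necessarily the byte order of the CPU the interpreter runs on.
    """
    # "=" means native byte order with standard sizes, so "H" is 2 bytes.
    probe = struct.pack("=H", 1)
    if probe == struct.pack("<H", 1):
        return "little"
    return "big"


print(struct_native_byteorder())
```

Comparing the result with sys.byteorder is a quick consistency check for a given interpreter.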
msg6573 (view)	Author: Lukas (lukas)	Date: 2011-07-15.08:35:20
Any news?
msg7377 (view)	Author: Frank Wierzbicki (fwierzbicki)	Date: 2012-08-10.20:58:50
I don't think we can apply this to 2.5; this really is a tough one to decide on. We should fix the documentation at least.
msg7690 (view)	Author: Alan Kennedy (amak)	Date: 2013-02-17.20:19:34
I am closing this issue as "wontfix". My reasoning is as follows.

Java endian-ness
================

Java endian-ness is big-endian. Rather than link the java specs, I will link this stackoverflow post, which links the relevant sections.

Java's Virtual Machine's Endianness
http://stackoverflow.com/questions/981549/javas-virtual-machines-endianness

A part of the argument stated is that "I would expect that two Python interpreters behave the same way, if they're interpreting the same code on the same machine.". But they're not on the same machine. Jython is running that code inside the java virtual machine, which abstracts away the underlying hardware. Java behaves identically across architectures, big- and little-endian.

If I were running two different operating systems on the same "machine", but in virtual machines under something like VMWare or Virtualbox, then I would not expect the operating systems to behave identically, even though they are running on the same machine.

java.nio
========

The reason why java.nio exposes the java.nio.ByteOrder class is that java.nio supports the concept of "direct buffers", where java layers a buffer abstraction on top of memory that has been allocated outside of the JVM and can be operated upon by native operating system services, where the endian-ness is a concern.

http://www.javamex.com/tutorials/io/nio_buffer_direct.shtml

Use cases for struct
====================

Struct is used in cases where it is required to serialise data held in python data structures in a format that is comprehensible to some form of peer. The primary situations where endian-ness is a concern are

1. In-process foreign function invocation.

When jython (or java) code wants to invoke functions written in another language, e.g. C or Fortran, stack frames need to be constructed and destructed so that data are transferred in the correct endian-ness. In java, this is done by Java Native Interface (JNI).
In jython, this is done with the (as yet incomplete) ctypes module. Both have endian-ness built in as a fundamental concept that does not rely on sys.byteorder or struct.(un)pack.

2. Network communication

When communicating between two network endpoints, which potentially have different endian-ness, byte order is important. However, all network communication must be in network byte order (i.e. big endian), therefore sys.byteorder and struct.(un)pack are not relevant in this context.

3. Other non-file uses.

An examination of the cpython 2.7 library shows the following uses of struct.(un)pack, along with notes about their explicit declaration of endian-ness.

Module           Notes
=========================================================
base64.py        All uses explicitly specify network endian (big-endian)
binhex.py        All uses explicitly specify network endian (big-endian)
pickle.py        All uses explicitly specify an endian (little or big)
xdrlib.py        All uses explicitly specify big-endian

So sys.byteorder and the behaviour of struct.(un)pack based upon it are not a concern here.

(There is also a usage in the jython socket module, for the SO_LINGER socket option, which takes a struct.pack'ed parameter, by virtue of its cpython heritage. But this is not a concern, because the packing and unpacking is always carried out inside the same execution instance of the interpreter).

4. File usage

Lastly, we reach the key area where endian-ness must be carefully (and I will argue, explicitly) handled. Examination of the cpython 2.7 library shows the following uses of struct.(un)pack, all relating to file format processing.

The following modules all explicitly declare the endian-ness of the data they are processing, and thus sys.byteorder is not involved in their de/serialization.
Module           Notes
=========================================================
aifc.py          All uses explicitly specify big-endian
                 Reads and writes AIFF or AIFC format
chunk.py         All uses explicitly specify endian-ness
                 Processes IFF chunks, as in AIFF, TIFF, RMFF
                 Endian specified as part of format
compileall.py    All uses explicitly specify little-endian
gzip.py          All uses explicitly specify little-endian
imputil.py       All uses explicitly specify little-endian
modulefinder.py  All uses explicitly specify little-endian
wave.py          All uses explicitly specify little-endian
zipfile.py       All uses explicitly specify little-endian

There are three files which contain struct.(un)pack calls which do not specify an endian-ness. They are

Module           Notes
=========================================================
tarfile.py       4 uses explicitly specify a number of bytes
                 - no endian-ness concerns
                 3 uses explicitly specify little endian-ness
                 1 use explicitly specifies a fixed-length string
                 - no endian-ness concerns
                 1 use relating to a GNU specific extension for large
                 files which does not specify endian-ness.
                 - Will give differing results across platforms
                 - Probably a bug that was never noticed because of
                   prevalence of little-endian archs
posixfile.py     DOES NOT EXPLICITLY DECLARE ENDIAN-NESS
                 Processes File-like objects with locking support
                 Unix-only - has special cases for many (but not all)
                 flavours of unix
                 Does not support java
                 Deprecated since cpython 1.5(!)
                 Removed in cpython 3
whichdb.py       All uses explicitly specify native byte order
                 Deprecated in python 3, subsumed into dbm module
                 The replacement in dbm module uses struct.unpack for
                 a single purpose, to read the magic number of a dbm
                 file, which is a dbm-specific concern.

Posixfile can, I think, be ignored, because it has been deprecated forever, and removed in python 3.

Whichdb can, I think, be ignored, because it has been deprecated, and its sole use of struct is specific to a single database format.
Which leaves a single module, tarfile, which has a single call to struct.(un)pack which does not specify an endian-ness, and which I believe is possibly an undiscovered bug, or a bug that no-one has tripped over and reported yet.

Which leads onto my final point.

Explicit vs. Implicit
=====================

We all know in python land that "explicit is better than implicit". But in this case, explicitness is more than something to be desired, it is a necessity.

As an analogy, take character encodings of text serialized to a file. If you serialize text to a file using one encoding, and deserialize using a different encoding, you will get garbage ("mojibake"). It is for this reason that all common and modern textual file formats (including python source code) permit you to explicitly declare a character encoding, using various human readable mechanisms. Moreover, the UTF character encodings go even further by explicitly placing a "Byte Order Mark" or "BOM" at the beginning of files, so that their endian-ness and encoding can be detected.

Similarly, any encoding of binary data, be it integers or floats, should be so explicitly treated. This means that

1. You should explicitly include endian-ness information in any data serialization that is persisted, so that the data can be correctly deserialized without making assumptions (also see python issue http://bugs.python.org/issue12848).

2. If you do NOT include such an explicit declaration, then

A: If you deserialize the data using the same library which you used to serialize the data, then you can reasonably expect that the data will deserialize correctly.
(note that for jython, this holds for deserialization of the same data across big- and little-endian architectures, because of java's constant big-endian-ness)

B: If you deserialize the data using a different library than that which you used to serialize the data, or the same library running on a different platform (think reading ASCII python text on an EBCDIC system), then you can have no guarantee that the data will be deserialized correctly.

To summarise: If you rely on implicit declarations of endian-ness, your data is going to get mangled when it crosses machine or architectural boundaries.

Legacy concerns
===============

As maintainers of a 15 year old platform, we also have to take into consideration that there may be jython users out there who fall into category 2A above: they may have persisted data using struct.pack, without specifying an endian-ness. If we apply the patch above, then we will corrupt this data on deserialization, if they are running on a little-endian platform. This is not an acceptable outcome.

Documentation
=============

The documentation states that sys.byteorder is "an indicator of the native byte order. This will have the value 'big' on big-endian (most-significant byte first) platforms, and 'little' on little-endian (least-significant byte first) platforms.".

This statement is correct. The only thing that might perhaps be added is that, on jython, because the JVM is big-endian, sys.byteorder will never contain any other value than "big": it will never contain the value "little".

Objections
==========

If the OP objects to this resolution, then please re-open the issue with either

A: A rebuttal of the points above
B: Concrete evidence of data corruption
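The recommendation to include endian-ness information explicitly in persisted data can be sketched as follows. This is only an illustration under stated assumptions: the function names dump_ints and load_ints and the record layout (marker, count, values) are hypothetical, and the two-byte markers "II" and "MM" are borrowed from the TIFF convention for little- and big-endian data.

```python
import struct

# Hypothetical markers, following the TIFF convention:
# "II" = little-endian, "MM" = big-endian.
MARKERS = {b"II": "<", b"MM": ">"}


def dump_ints(values, order="<"):
    """Serialize 32-bit signed ints with an explicit byte-order marker."""
    marker = b"II" if order == "<" else b"MM"
    # The count and the values are packed with the declared byte order.
    header = marker + struct.pack(order + "I", len(values))
    body = struct.pack(order + "%di" % len(values), *values)
    return header + body


def load_ints(data):
    """Deserialize data written by dump_ints, on any platform."""
    order = MARKERS[data[:2]]  # byte order is read from the data itself
    (count,) = struct.unpack(order + "I", data[2:6])
    return list(struct.unpack(order + "%di" % count, data[6:6 + 4 * count]))


blob = dump_ints([1, -2, 3], ">")
print(load_ints(blob))
```

Because the format string always carries "<" or ">", neither function depends on sys.byteorder, so the same bytes should round-trip between CPython and jython.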
msg7691 (view)	Author: Alan Kennedy (amak)	Date: 2013-02-17.23:54:09
Sorry, bad paste of that python bug link. This is the issue I meant to link.

Array pickling exposes internal memory representation of elements
http://bugs.python.org/issue2389
msg7692 (view)	Author: Alan Kennedy (amak)	Date: 2013-02-18.00:21:15
Lastly, in relation to array processing, the cpython array module documentation

http://docs.python.org/2/library/array.html

says this

"The actual representation of values is determined by the machine architecture (strictly speaking, by the C implementation)"

The machine architecture in this case is the JVM, not the architecture of the x86, Sparc or other CPU it is running on.

And the statement "strictly speaking, by the C implementation" can only be taken as a cpython specific statement, since python interpreters do not necessarily have to be written in C. Python interpreters can and have been written in Java, C#, Javascript, Haskell, Lisp, Ruby, and of course, Python.

http://wiki.python.org/moin/PythonImplementations
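For array data that has to cross interpreters or machines, one option is to fix the byte order at write time rather than rely on the native representation. A minimal sketch, assuming sys.byteorder matches the layout the array module actually uses in the running interpreter (which, per this thread, would be 'big' for both on jython); the helper name array_to_little_endian_bytes is hypothetical.

```python
import sys
from array import array


def array_to_little_endian_bytes(values, typecode="i"):
    """Return the items of an array as little-endian bytes."""
    arr = array(typecode, values)
    if sys.byteorder == "big":
        # Swap each item in place so the output is little-endian.
        arr.byteswap()
    # tobytes() exists on Python 3; older interpreters only have tostring().
    to_bytes = getattr(arr, "tobytes", None) or arr.tostring
    return to_bytes()


print(repr(array_to_little_endian_bytes([1, 258], "H")))
```

The reader has to apply the mirror-image swap after loading, and still needs to know the item size, which the typecode alone does not guarantee across platforms.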

History
Date	User	Action	Args
2013-02-18 00:21:15	amak	set	messages: + msg7692
2013-02-17 23:54:09	amak	set	messages: + msg7691
2013-02-17 20:19:35	amak	set	status: open -> closed assignee: amak resolution: wont fix messages: + msg7690 nosy: + amak
2012-08-10 20:58:50	fwierzbicki	set	nosy: + fwierzbicki messages: + msg7377
2011-07-15 08:35:20	lukas	set	messages: + msg6573
2011-01-31 08:13:29	lukas	set	messages: + msg6374
2011-01-29 00:35:51	pjenvey	set	nosy: + pjenvey messages: + msg6361
2011-01-26 16:22:23	lukas	set	files: + byteorder.patch keywords: + patch messages: + msg6355
2011-01-26 13:48:48	lukas	create