Issue1702

classification
Title: wrong byteorder / endianness detected
Type: behaviour Severity: critical
Components: Core Versions: 2.5.2rc
Milestone:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: amak Nosy List: amak, fwierzbicki, lukas, pjenvey
Priority: Keywords: patch

Created on 2011-01-26.13:48:48 by lukas, last changed 2013-02-18.00:21:15 by amak.

Files
File name Uploaded Description
byteorder.patch lukas, 2011-01-26.16:22:22 patch
Messages
msg6354 Author: Lukas (lukas) Date: 2011-01-26.13:48:47
On my little endian x86 linux machine, jython reports the wrong byteorder. This also affects the behavior of struct and array and introduces subtle errors when doing I/O, especially as array.tofile() uses the (wrongly) detected native format.

Testcode:
>>> import sys
>>> sys.byteorder
'big'

--> should be 'little'

my jython version:
Jython 2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57) 
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_20
msg6355 Author: Lukas (lukas) Date: 2011-01-26.16:22:22
Added patch for sys.byteorder and struct.
The changes in PyArray.java are a bit more involved. Probably LittleEndianDataInputStream/OutputStream from guava v8 could be used.

Otherwise, the documentation regarding jython and byteorder should be updated (i.e. that it is fixed to 'big')
msg6361 Author: Philip Jenvey (pjenvey) Date: 2011-01-29.00:35:51
I'm not sure about this. Java's byte order is big-endian, so we have always hardcoded sys.byteorder accordingly. nio now exposes the native byte order of the underlying platform, but I think it only does this for its own performance reasons.

Can you describe how this is an issue for you?
msg6374 Author: Lukas (lukas) Date: 2011-01-31.08:13:29
Hi Philip,
Thank you for your reply. It's not a particular issue for me at the moment, but rather a very unexpected behavior (and I think this holds for every developer who comes from CPython).
First of all, I would expect that two Python interpreters behave the same way, if they're interpreting the same code on the same machine. 
Second, if this is not the case, the doc should clearly say so (and even then this is really suboptimal).
But neither the docs for sys nor those for struct do so. The struct docs say "If the first character is not one of these, '@' is assumed. Native byte order is big-endian or little-endian, depending on the host system." and the docs for sys.byteorder say "An indicator of the native byte order. This will have the value 'big' on big-endian (most-significant byte first) platforms, and 'little' on little-endian (least-significant byte first) platforms.".
I think it is obvious, that this introduces a source for many subtle bugs, which are somewhat hard to find and may lead to serious data corruption.
Third, I don't see an obvious way to detect the endianness in pure Python (for which I thought sys.byteorder would be there).
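
For illustration, one common pure-Python probe packs a native-order integer with struct and inspects the first byte. This is only a sketch: the helper name native_byteorder is hypothetical, and the probe reveals the byte order the interpreter's struct module uses, not the CPU's. On an interpreter that fixes its native order, as discussed in this issue for jython, it would be expected to agree with sys.byteorder rather than with the hardware.

```python
import struct
import sys


def native_byteorder():
    """Return 'little' or 'big' based on how struct lays out a native integer."""
    # With no byte-order prefix, struct uses the interpreter's native order.
    packed = struct.pack('H', 1)
    # The least-significant byte comes first on a little-endian layout.
    return 'little' if packed[:1] == b'\x01' else 'big'


detected = native_byteorder()
print(detected)
print(detected == sys.byteorder)
```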
msg6573 Author: Lukas (lukas) Date: 2011-07-15.08:35:20
Any news?
msg7377 Author: Frank Wierzbicki (fwierzbicki) Date: 2012-08-10.20:58:50
I don't think we can apply this to 2.5; really, this is a tough one to decide on. We should fix the documentation at least.
msg7690 Author: Alan Kennedy (amak) Date: 2013-02-17.20:19:34
I am closing this issue as "wontfix".

My reasoning is as follows.

Java endian-ness
================

Java endian-ness is big-endian.

Rather than link the java specs, I will link this stackoverflow post, which links the relevant sections.

Java's Virtual Machine's Endianness
http://stackoverflow.com/questions/981549/javas-virtual-machines-endianness

A part of the argument stated is that "I would expect that two Python interpreters behave the same way, if they're interpreting the same code on the same machine.".

But they're not on the same machine. Jython is running that code inside the java virtual machine, which abstracts away the underlying hardware. Java behaves identically across architectures, big- and little-endian.

If I were running two different operating systems on the same "machine", but in virtual machines under something like VMWare or Virtualbox, then I would not expect the operating systems to behave identically, even though they are running on the same machine.

java.nio
========

The reason why java.nio exposes the java.nio.ByteOrder class is because java.nio supports the concept of "direct buffers", where java layers a buffer abstraction on top of memory that has been allocated outside of the JVM, and can be operated upon by native operating system services, where the endian-ness is a concern.

http://www.javamex.com/tutorials/io/nio_buffer_direct.shtml

Use cases for struct
====================

Struct is used in cases where it is required to serialise data held in python data structures into a format that is comprehensible to some form of peer.

The primary situations where endian-ness is a concern are

1. In-process foreign function invocation.

When jython (or java) code wants to invoke functions written in another language, e.g. C or Fortran, stack frames need to be constructed and destructed so that data are transferred in the correct endian-ness.

In java, this is done by Java Native Interface (JNI). In jython, this is done with the (as yet incomplete) ctypes module. 

Both have endian-ness builtin as a fundamental concept that does not rely on sys.byteorder or struct.(un)pack.

2. Network communication

When communicating between two network endpoints, which potentially have different endian-ness, byte order is important.

However, all network communication must be in network byte order (i.e. big endian), therefore, sys.byteorder and struct.(un)pack are not relevant in this context.
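
As a small sketch of the point above: the '!' prefix in a struct format selects network (big-endian) byte order with standard sizes, so the bytes produced do not depend on sys.byteorder. The header layout and the names HEADER_FORMAT, pack_header and unpack_header are hypothetical, chosen only for illustration.

```python
import struct

# Hypothetical wire header: 16-bit message type, 32-bit payload length.
# '!' selects network (big-endian) byte order with standard sizes.
HEADER_FORMAT = '!HI'


def pack_header(msg_type, length):
    """Serialize the header; the result is identical on every platform."""
    return struct.pack(HEADER_FORMAT, msg_type, length)


def unpack_header(data):
    """Parse a header produced by pack_header."""
    return struct.unpack(HEADER_FORMAT, data)


header = pack_header(1, 258)
print(unpack_header(header))  # (1, 258)
```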

3. Other non-file uses.

An examination of the cpython 2.7 library shows the following uses of struct.(un)pack, along with notes about their explicit declaration of endian-ness.

Module     Notes
=========================================================
base64.py  All uses explicitly specify network endian (big-endian)
binhex.py  All uses explicitly specify network endian (big-endian)
pickle.py  All uses explicitly specify an endian (little or big)
xdrlib.py  All uses explicitly specify big-endian

So sys.byteorder and the behaviour of struct.(un)pack based upon it are not a concern here.

(There is also a usage in the jython socket module, for the SO_LINGER socket option, which takes a struct.pack'ed parameter, by virtue of its cpython heritage. But this is not a concern, because the packing and unpacking is always carried out inside the same execution instance of the interpreter).
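
A minimal sketch of why the SO_LINGER case is harmless: the value is packed and unpacked with the same prefix-less format inside one interpreter, so whatever byte order that interpreter treats as native, the round trip is lossless. The values 1 and 5 are illustrative only.

```python
import struct

# l_onoff=1 (enable lingering), l_linger=5 seconds. No byte-order prefix,
# so the interpreter's native order is used for both pack and unpack.
linger = struct.pack('ii', 1, 5)

# In real code this value would typically be handed to
# socket.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger).
onoff, seconds = struct.unpack('ii', linger)
print((onoff, seconds))  # (1, 5)
```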

4. File usage

Lastly, we reach the key area where endian-ness must be carefully (and I will argue, explicitly) handled.

Examination of the cpython 2.7 library shows the following uses of struct.(un)pack, all relating to file format processing.

The following modules all explicitly declare the endian-ness of the data they are processing, and thus sys.byteorder is not involved in their de/serialization.

Module        Notes
=========================================================
aifc.py       All uses explicitly specify big-endian
              Reads and writes AIFF or AIFC format

chunk.py      All uses explicitly specify endian-ness
              Processes IFF chunks, as in AIFF, TIFF, RMFF
              Endian specified as part of format

compileall.py All uses explicitly specify little-endian

gzip.py       All uses explicitly specify little-endian

imputil.py    All uses explicitly specify little-endian

modulefinder.py All uses explicitly specify little-endian

wave.py       All uses explicitly specify little-endian

zipfile.py    All uses explicitly specify little-endian
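
To illustrate what "explicitly specify little-endian" means for a file format, the sketch below builds a gzip stream in memory and reads its trailer with an explicit '<' prefix. It relies on the gzip format definition (RFC 1952), in which the last eight bytes are the CRC32 and the uncompressed size, both little-endian; it is not a copy of the code in gzip.py.

```python
import gzip
import io
import struct
import zlib

payload = b'endianness matters' * 10

# Build a gzip stream in memory.
buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode='wb')
gz.write(payload)
gz.close()
raw = buf.getvalue()

# RFC 1952: the trailer is CRC32 followed by ISIZE, both little-endian.
# The explicit '<' makes the result independent of sys.byteorder.
crc, isize = struct.unpack('<II', raw[-8:])
print(isize == len(payload))
print(crc == (zlib.crc32(payload) & 0xffffffff))
```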

There are three files which contain struct.(un)pack calls which do not specify an endian-ness. They are

Module        Notes
=========================================================
tarfile.py    4 uses explicitly specify a number of bytes - no endian-ness concerns
              3 uses explicitly specify little endian-ness
              1 use explicitly specifies a fixed-length string - no endian-ness concerns
              1 use relating to a GNU specific extension for large files which does not specify endian-ness. 
               - Will give differing results across platforms
               - Probably a bug that was never noticed because of prevalence of little-endian archs

posixfile.py  DOES NOT EXPLICITLY DECLARE ENDIAN-NESS
              Processes File-like objects with locking support
              Unix-only - has special cases for many (but not all) flavours of unix
              Does not support java
              Deprecated since cpython 1.5(!)
              Removed in cpython 3

whichdb.py    All uses explicitly specify native byte order
              Deprecated in python 3, subsumed into dbm module
              The replacement in dbm module uses struct.unpack for a single purpose,
              to read the magic number of a dbm file, which is a dbm-specific concern.

Posixfile can, I think, be ignored, because it has been deprecated forever, and removed in python 3.
Whichdb can, I think, be ignored, because it has been deprecated, and its sole use of struct is specific to a single database format.

Which leaves a single module, tarfile, which has a single call to struct.(un)pack which does not specify an endian-ness, and which I believe is possibly an undiscovered bug, or a bug that no-one has tripped over and reported yet.

Which leads onto my final point.

Explicit vs. Implicit
=====================

We all know in python land that "explicit is better than implicit". But in this case, explicitness is more than something to be desired, it is a necessity.

As an analogy, take character encodings of text serialized to a file. If you serialize text to a file using one encoding, and deserialize using a different encoding, you will get garbage ("mojibake"). It is for this reason that all common and modern textual file formats (including python source code) permit you to explicitly declare a character encoding, using various human readable mechanisms.

Moreover, the UTF character encodings go even further by explicitly placing a "Byte Order Mark" or "BOM" at the beginning of files, so that their endian-ness and encoding can be detected.

Similarly, any encoding of binary data, be it integers or floats, should be so explicitly treated. This means that

1. You should explicitly include endian-ness information in any data serialization that is persisted, so that the data can be correctly deserialized without making assumptions (also see Python issue http://bugs.python.org/issue12848).

2. If you do NOT include such an explicit declaration, then
A: If you deserialize the data using the same library which you used to serialize the data, then you can reasonably expect that the data will deserialize correctly. (Note that for jython, this holds for deserialization of the same data across big- and little-endian architectures, because of Java's constant big-endian-ness.)
B: If you deserialize the data using a different library than that which you used to serialize the data, or the same library running on a different platform (think reading ASCII python text on an EBCDIC system), then you can have no guarantee that the data will be deserialized correctly.

To summarise: If you rely on implicit declarations of endian-ness, your data is going to get mangled when it crosses machine or architectural boundaries.
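
One way to make the declaration explicit is to store a byte-order marker alongside the data, in the spirit of a BOM. The sketch below is a made-up format for illustration only: the markers b'L' and b'B' and the functions dump_ints and load_ints are hypothetical and do not correspond to any existing standard or library.

```python
import struct

# Hypothetical one-byte markers mapping to struct byte-order prefixes.
MARKERS = {b'L': '<', b'B': '>'}


def dump_ints(values, marker=b'L'):
    """Serialize unsigned 32-bit ints, prefixed by an explicit order marker."""
    order = MARKERS[marker]
    # Layout: marker byte, then a count and the values in the declared order.
    fmt = '%sI%dI' % (order, len(values))
    return marker + struct.pack(fmt, len(values), *values)


def load_ints(data):
    """Read the marker first, then decode using the byte order it declares."""
    order = MARKERS[data[:1]]
    (count,) = struct.unpack_from(order + 'I', data, 1)
    return list(struct.unpack_from('%s%dI' % (order, count), data, 5))


values = [1, 258, 65536]
print(load_ints(dump_ints(values, b'L')) == values)
print(load_ints(dump_ints(values, b'B')) == values)
```

Because the reader never consults sys.byteorder, the same bytes decode to the same values on any interpreter and any architecture.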

Legacy concerns
===============

As maintainers of a 15 year old platform, we also have to take into consideration that there may be jython users out there who fall into category 2A above: they may have persisted data using struct.pack, without specifying an endian-ness. If we apply the patch above, then we will corrupt this data on deserialization, if they are running on a little-endian platform. This is not an acceptable outcome.
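
The corruption described here can be sketched with explicit prefixes standing in for the two meanings of "native": '>' for the big-endian behaviour before the patch, and '<' for what native order would become on a little-endian host after it.

```python
import struct

value = 1

# What a prefix-less pack produces while "native" means big-endian.
stored = struct.pack('>I', value)

# What a prefix-less unpack would do if "native" later meant little-endian.
misread = struct.unpack('<I', stored)[0]
print(misread)  # 16777216

# Reading with the original byte order recovers the value.
print(struct.unpack('>I', stored)[0])  # 1
```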

Documentation
=============

The documentation states that sys.byteorder is "an indicator of the native byte order. This will have the value 'big' on big-endian (most-significant byte first) platforms, and 'little' on little-endian (least-significant byte first) platforms.".

This statement is correct.

The only thing that might perhaps be added is that, on jython, because the JVM is big-endian, sys.byteorder will never contain any other value than "big": it will never contain the value "little".

Objections
==========

If the OP objects to this resolution, then please re-open the issue with either

A: A rebuttal of the points above
B: Concrete evidence of data corruption
msg7691 Author: Alan Kennedy (amak) Date: 2013-02-17.23:54:09
Sorry, bad paste of that python bug link.

This is the issue I meant to link.

Array pickling exposes internal memory representation of elements
http://bugs.python.org/issue2389
msg7692 Author: Alan Kennedy (amak) Date: 2013-02-18.00:21:15
Lastly, in relation to array processing, the cpython array module documentation

http://docs.python.org/2/library/array.html

says this

"The actual representation of values is determined by the machine architecture (strictly speaking, by the C implementation)"

The machine architecture in this case is the JVM, not the architecture of the x86, Sparc or other CPU it is running on.

And the statement "strictly speaking, by the C implementation" can only be taken as a cpython specific statement, since python interpreters do not necessarily have to be written in C. Python interpreters can and have been written in Java, C#, Javascript, Haskell, Lisp, Ruby, and of course, Python.

http://wiki.python.org/moin/PythonImplementations
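
For code that must write array data portably regardless of what the interpreter treats as native, one option is to normalise the byte order explicitly before writing. This is a sketch under stated assumptions: the helper to_little_endian_bytes is hypothetical, it assumes the 'H' typecode is two bytes wide (true on common platforms, but only a minimum in the documentation), and it assumes the array module's layout matches sys.byteorder.

```python
import array
import sys


def to_little_endian_bytes(values):
    """Serialize unsigned 16-bit values as little-endian bytes."""
    arr = array.array('H', values)
    # Assumption: the array layout follows sys.byteorder, so a swap is
    # only needed when the interpreter reports big-endian.
    if sys.byteorder == 'big':
        arr.byteswap()
    # tobytes() exists on Python 3; older versions only have tostring().
    dump = getattr(arr, 'tobytes', None) or arr.tostring
    return dump()


print(to_little_endian_bytes([1, 258]) == b'\x01\x00\x02\x01')
```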
History
Date                 User         Action  Args
2013-02-18 00:21:15  amak         set     messages: + msg7692
2013-02-17 23:54:09  amak         set     messages: + msg7691
2013-02-17 20:19:35  amak         set     status: open -> closed
                                          assignee: amak
                                          resolution: wont fix
                                          messages: + msg7690
                                          nosy: + amak
2012-08-10 20:58:50  fwierzbicki  set     nosy: + fwierzbicki
                                          messages: + msg7377
2011-07-15 08:35:20  lukas        set     messages: + msg6573
2011-01-31 08:13:29  lukas        set     messages: + msg6374
2011-01-29 00:35:51  pjenvey      set     nosy: + pjenvey
                                          messages: + msg6361
2011-01-26 16:22:23  lukas        set     files: + byteorder.patch
                                          keywords: + patch
                                          messages: + msg6355
2011-01-26 13:48:48  lukas        create