Issue1802339

classification
Title: Problem printing unicode when stdout intercepted
Type: Severity: normal
Components: Core Versions: 2.5.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pjenvey Nosy List: cgroves, nriley, pekka.klarck, pjenvey, richardwoolliscroft
Priority: normal Keywords:

Created on 2007-09-25.23:13:24 by pekka.klarck, last changed 2009-10-28.18:41:58 by pjenvey.

Files
File name Uploaded Description
unic.patch pekka.klarck, 2007-09-27.22:52:42
Messages
msg1933 Author: Pekka Klärck (pekka.klarck) Date: 2007-09-25.23:13:24
Running the following code fails with Jython 2.2.1 rc 1 but succeeds with Jython 2.2 (and earlier alphas/betas/rcs) and Python 2.3/2.4/2.5.

- - - - - - - - - -
import sys
from StringIO import StringIO

msg = u'Circle is 360\u00B0'
sys.stdout = StringIO()

print msg

assert sys.stdout.getvalue() == msg + '\n'
- - - - - - - - - -


The traceback below shows that printing a unicode string fails even though stdout has been intercepted with a StringIO.

- - - - - - - - - -
Traceback (innermost last):
  File "unictest.py", line 7, in ?
UnicodeError: ascii encoding error: ordinal not in range(128)
- - - - - - - - - -


Being able to print unicode strings like this is crucial in our case. We've been implementing a test automation framework that runs on Python and Jython. It can be extended with so-called test libraries, which can write messages to a common test log simply by writing to stdout. This keeps the API between the framework and the libraries simple, and it works the same way whether a library is written in Python or in Java (we intercept java.lang.System.out too).
msg1934 Author: Philip Jenvey (pjenvey) Date: 2007-09-26.04:17:08
This one actually fails on CPython 2.2 though

CPython > 2.2 calls PyObject_Str on anything printed. Jython doesn't have an equivalent function; in this case it just calls __str__ (in StdoutWrapper) on any object printed.

PyObject_Str looks like a safer version of __str__ for situations like these: it specially handles unicode objects, returning PyUnicode_AsEncodedString (which is like our encode_UnicodeEscape).

We could special-case unicode objects in StdoutWrapper, but I see PyObject_Str used in a few places in CPython, so patching StdoutWrapper might miss other cases where this is a problem:

$ grep -r PyObject_Str\( * | grep \.c:
Modules/_csv.c:         str = PyObject_Str(field);
Modules/_tkinter.c:     PyObject *v = PyObject_Str(value);
Modules/_tkinter.c:     PyObject *v = PyObject_Str(value);
Objects/descrobject.c:  return PyObject_Str(pp->dict);
Objects/fileobject.c:                        value = PyObject_Str(v);
Objects/object.c:                 s = PyObject_Str(op);
Objects/object.c:PyObject_Str(PyObject *v)
Objects/stringobject.c:         op = (PyStringObject *) PyObject_Str((PyObject *)op);
Objects/stringobject.c: return PyObject_Str(x);
Objects/stringobject.c:                  temp = PyObject_Str(v);
Objects/stringobject.c:                     PyObject_Str() assure this */
Objects/unicodeobject.c:                temp = PyObject_Str(v);
Objects/unicodeobject.c:                       PyObject_Repr() and PyObject_Str() assure
Python/bltinmodule.c:             po = PyObject_Str(v);
Python/codecs.c:        PyObject *string = PyObject_Str(name);
Python/errors.c:                tmp = PyObject_Str(v);
Python/exceptions.c:        out = PyObject_Str(tmp);
Python/exceptions.c:        out = PyObject_Str(args);
Python/exceptions.c:    str = PyObject_Str(msg);
Python/pythonrun.c:     v = PyObject_Str(v);
Python/pythonrun.c:     w = PyObject_Str(w);
Python/pythonrun.c:               PyObject *s = PyObject_Str(value);
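
The difference between calling __str__ on everything and special-casing unicode objects can be sketched as follows. This is only an illustrative model written for Python 3, where str stands in for a 2.x unicode object and bytes for a 2.x str. RecordingStream, print_forcing_bytes and print_special_casing_text are hypothetical names; none of this is the actual StdoutWrapper code or the eventual fix.

```python
class RecordingStream:
    """Hypothetical stand-in for an intercepted sys.stdout (e.g. a StringIO)."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)


def print_forcing_bytes(obj, stream, default_encoding="ascii"):
    # Models calling __str__ on every printed object: text has to be
    # encoded implicitly, which fails for characters outside the encoding.
    data = obj.encode(default_encoding) if isinstance(obj, str) else bytes(obj)
    stream.write(data + b"\n")


def print_special_casing_text(obj, stream):
    # Models special-casing unicode: text is handed to the stream as-is,
    # so the stream decides what to do with it.
    if isinstance(obj, str):
        stream.write(obj + "\n")
    else:
        stream.write(bytes(obj) + b"\n")


msg = "Circle is 360\u00B0"

intercepted = RecordingStream()
print_special_casing_text(msg, intercepted)

try:
    print_forcing_bytes(msg, RecordingStream())
    forced_ok = True
except UnicodeEncodeError:
    forced_ok = False
```

In this model only the special-casing variant delivers the message to the intercepted stream; the forcing variant raises the same kind of ascii encoding error as the traceback above.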
msg1935 Author: Pekka Klärck (pekka.klarck) Date: 2007-09-26.07:34:24
This might be a bit involved for me to investigate and fix but if nobody else is doing it I can try. Getting the original example working would be a big step forward and even if other places were missed that would be better than nothing.

I hope that this failing on CPython 2.2 doesn't mean that it won't be fixed in Jython 2.2. At least for us that would be really inconvenient because it'll take some time before Jython 2.3 (or whatever the version will be) is released. We can of course instruct people needing to use unicode to stick with 2.2 but then they won't get any other fixes/features in 2.2.x releases.
msg1936 Author: Charlie Groves (cgroves) Date: 2007-09-27.04:42:28
Without a patch in hand and a good understanding of the problem, I think this is too big of a change to attempt between release candidates.  Even Philip's explanation below isn't complete because if CPython were just using unicode_escape on the printed objects, your final assert would fail.  sys.stdout.getvalue() would have a str object in it which isn't equal to the unicode object from above.  It definitely passes though.  While 2.2.1 is too far along to fix this, I wouldn't mind making a 2.2.2 for this and whatever else comes up.  

That said, as long as you're not relying on unicode objects coming out of getvalue() (which I don't think could be the case, since that wouldn't have happened under 2.2 either), you might be able to get around this by setting the default encoding.  The reason it's complaining about ascii is that ascii is the default default encoding.  You can change that to any encoding supported by Jython in your site.py, and then whenever Jython attempts to turn a unicode object into a str without an explicit encoding, it'll use that encoding to do the work.  It works the same in the opposite direction when decoding a str into a unicode object without an explicit encoding.
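
As a rough illustration of why the default encoding matters, the snippet below models the implicit unicode-to-str conversion with an explicit encode call. It is written for Python 3 (str stands in for 2.x unicode, bytes for 2.x str), and implicit_encode is a hypothetical helper, not a Jython API.

```python
msg = "Circle is 360\u00B0"


def implicit_encode(text, default_encoding):
    """Hypothetical stand-in for an implicit unicode -> str conversion."""
    return text.encode(default_encoding)


# With ascii as the default encoding, the degree sign cannot be encoded.
try:
    implicit_encode(msg, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With utf-8 as the default encoding, the same conversion succeeds and
# round-trips back to the original text.
utf8_bytes = implicit_encode(msg, "utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```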
msg1937 Author: Pekka Klärck (pekka.klarck) Date: 2007-09-27.22:52:42
Philip pointed me to StdoutWrapper and after playing with it a little bit I was able to come up with a simple patch (attached) that makes the original example pass. I ran dist/Lib/test/regrtest.py on the 2.2 maint branch both w/ and w/o the patch and got the same failures, so it doesn't break everything.

I have to confess that I don't really know the code in StdoutWrapper nor the code using it so I may very well be missing something totally obvious. The patch is rather ugly (catching Throwable is probably not the best idea) and should be taken as a prototype at this phase.

File Added: unic.patch
msg1938 Author: Charlie Groves (cgroves) Date: 2007-09-30.01:49:39
I don't think this patch is going in the right direction.  Rather than slipping in a quick fix for this particular case, we need to figure out exactly what CPython was doing in 2.2 and what CPython is doing currently.  If the current behavior won't break 2.2's expectations in a horrible way, we can add it to our 2.2.  Just shoehorning a fix in for this one case could lead to weirdly inconsistent behavior in different parts of the code, which I really want to avoid.

Did you try setting the default encoding?  You can do it from Java with org.python.core.codecs.setDefaultEncoding.
msg1939 Author: Pekka Klärck (pekka.klarck) Date: 2007-09-30.22:21:49
I totally agree that fixing this issue with a hack that just seems to solve the problems is not the right thing to do. My patch was just an example showing that somehow modifying StdoutWrapper might be a part of the solution. Unfortunately I don't understand Jython (nor CPython) internals well enough to be able to figure out a real fix. =/

Thanks for mentioning org.python.core.codecs.setDefaultEncoding. I played with it a little and it seems that we could even have a workaround for the problem in our system. I changed my original example slightly and was able to get "print <unicode>" working. There are still some differences between different Jython versions and CPython but we should be able to handle them.

Here's the new code:

- - - - - - - - - -
import sys
import os
from StringIO import StringIO
if os.name == 'java':
    from org.python.core import codecs
    codecs.setDefaultEncoding('utf-8')

    print 'Jython', sys.version
else:
    print 'Python', sys.version

sys.stdout = StringIO()
msg = u'Circle is 360\u00B0'
print msg

out = sys.stdout.getvalue()
sys.stdout = sys.__stdout__
print out, type(out)
print msg, type(msg)
assert out == msg + '\n'
- - - - - - - - - -

And here are the outputs from a few different interpreters:

- - - - - - - - - -
Jython 2.2rc3
Circle is 360°
<type 'str'>
Circle is 360° <type 'unicode'>
- - - - - - - - - -
Jython 2.2.1rc1
Circle is 360°
<type 'str'>
Circle is 360° <type 'unicode'>
Traceback (innermost last):
  File "unictest.py", line 21, in ?
AssertionError: 
- - - - - - - - - -
Python 2.5.1 (r251:54863, May  2 2007, 16:56:35) 
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
Circle is 360°
<type 'unicode'>
Circle is 360° <type 'unicode'>
- - - - - - - - - -
msg3176 Author: Pekka Klärck (pekka.klarck) Date: 2008-05-02.20:06:17
This issue might be related to http://bugs.jython.org/issue1032
msg3178 Author: Pekka Klärck (pekka.klarck) Date: 2008-05-02.20:13:55
While investigating #1032 I also noticed that the example in msg1939
passes on Jython 2.2.1 if setDefaultEncoding('utf-8') is changed to
setDefaultEncoding('iso-8859-1'). Unfortunately it only works if every
character in the printed string can be encoded as ISO-8859-1, as in the
example -- if it contains anything else, the familiar UnicodeError reappears.
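
The limitation can be checked with a small sketch. It is written for Python 3 and uses a hypothetical helper, fits_encoding, plus a made-up second sample string containing a euro sign; it only shows which characters ISO-8859-1 can represent, not how Jython 2.2.1 behaves internally.

```python
def fits_encoding(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


degree_msg = "Circle is 360\u00B0"   # degree sign, part of ISO-8859-1
euro_msg = "Price is 5\u20AC"        # euro sign, not part of ISO-8859-1

degree_fits_latin1 = fits_encoding(degree_msg, "iso-8859-1")
euro_fits_latin1 = fits_encoding(euro_msg, "iso-8859-1")
euro_fits_utf8 = fits_encoding(euro_msg, "utf-8")
```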
msg4832 Author: Philip Jenvey (pjenvey) Date: 2009-06-21.21:49:59
This is still present on 2.5
msg4850 Author: Pekka Klärck (pekka.klarck) Date: 2009-06-22.02:58:09
pjenvey, if you need help testing this let me know. I'd like to get
Unicode working fully with Robot Framework (http://robotframework.org)
also on Jython.
msg4894 Author: Philip Jenvey (pjenvey) Date: 2009-07-11.23:23:02
fixed in r6529. Hope this helps, Pekka
msg4895 Author: Pekka Klärck (pekka.klarck) Date: 2009-07-11.23:46:32
Awesome! I'm currently on holiday trying to avoid work related tasks,
but I added verifying this behavior to our Jython 2.5(.1) compatibility
issue (http://code.google.com/p/robotframework/issues/detail?id=198).
msg5281 Author: Richard Woolliscroft (richardwoolliscroft) Date: 2009-10-28.09:32:49
The fix does not solve the case where stdout is a PyFileWriter that
uses an encoding.
msg5282 Author: Richard Woolliscroft (richardwoolliscroft) Date: 2009-10-28.09:54:39
Actually, I'm not sure a change to StdoutWrapper would fix this.

The problem is that displayhook in PySystemState calls __repr__ on
everything to be printed to stdout.

If the object is a PyUnicode then its __repr__ method returns a PyString,
which is passed into Py.stdout.println, so StdoutWrapper cannot
distinguish between something that was originally a PyUnicode and a
PyString. So any unicode string would always come out with a u at the
start and not be properly encoded.
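
The loss of type information can be modeled roughly as follows. This is a Python 3 sketch in which str stands in for PyUnicode and bytes for PyString, and ascii() is used to approximate the escaping that repr() applies to unicode objects on 2.x. The display function and the received list are hypothetical, not the PySystemState code.

```python
received = []


def display(value):
    # Models a displayhook that takes the repr first and only then hands
    # the result to the output wrapper: the wrapper never sees the object.
    received.append(ascii(value))


display("360\u00B0")   # originally text
display(b"360\xb0")    # originally a byte string

# Both values arrive at the wrapper as plain strings of ASCII characters.
types_seen = {type(item) for item in received}
```

Whatever the wrapper does afterwards, it only has the escaped representation to work with, so it cannot apply an output encoding to the original text.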
msg5288 Author: Philip Jenvey (pjenvey) Date: 2009-10-28.18:33:02
That issue should really be a new ticket. It also needs a test.
History
Date User Action Args
2009-10-28 18:41:58 pjenvey set nosy: + nriley
2009-10-28 18:33:03 pjenvey set messages: + msg5288
2009-10-28 09:54:40 richardwoolliscroft set messages: + msg5282
2009-10-28 09:32:49 richardwoolliscroft set nosy: + richardwoolliscroft; messages: + msg5281
2009-07-11 23:46:32 pekka.klarck set messages: + msg4895
2009-07-11 23:23:03 pjenvey set status: open -> closed; resolution: fixed; messages: + msg4894
2009-06-22 02:58:09 pekka.klarck set messages: + msg4850; title: [221rc1] Problem printing unicode when stdout intercepted -> Problem printing unicode when stdout intercepted
2009-06-21 23:38:45 pjenvey set assignee: pjenvey
2009-06-21 21:49:59 pjenvey set messages: + msg4832; versions: + 2.5.1, - 2.2.2
2009-03-14 03:02:28 fwierzbicki set versions: + 2.2.2
2008-12-15 17:09:52 fwierzbicki set components: + Core, - None
2008-05-02 20:13:55 pekka.klarck set messages: + msg3178
2008-05-02 20:06:17 pekka.klarck set messages: + msg3176
2007-09-25 23:13:24 pekka.klarck create