Message5863

Author rdesgroppes
Recipients rdesgroppes
Date 2010-06-30.11:26:55
Message-id <1277897217.13.0.83259667503.issue1625@psf.upfronthosting.co.za>
In-reply-to
Content
Hi,
Coercion to and from Unicode should always honor the default encoding, which is typically set by calling sys.setdefaultencoding(name) in sitecustomize.py.

With utf-8 as the default encoding:
1. Explicit str-to-unicode coercion works: unicode("utf-8 string") -> the utf-8 decoder is used. Yes!
2. Explicit unicode-to-str coercion is wrong: str(u"unicode string") -> the utf-8 encoder is not used (the latin-1 one, perhaps?).
3. Implicit str-to-unicode coercion is wrong: java.lang.String("utf-8 string") -> the utf-8 decoder is not used.
4. Implicit unicode-to-str coercion is wrong: str(java.lang.Object) or str(java.lang.Object.toString()) -> the utf-8 encoder is not used.
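
To illustrate why the choice of codec matters in cases 2 to 4, here is a minimal sketch in plain Python. It does not reproduce the Jython behavior itself. The sample string and the roundtrip helper are made up for illustration, and the snippet assumes nothing beyond the standard utf-8 and latin-1 codecs.

```python
# Illustration only: what a mismatched codec does to non-ASCII text.
# Runs on both Python 2 and Python 3.

SAMPLE = u"caf\u00e9"  # hypothetical sample text ("cafe" with an accented e)


def roundtrip(text, encode_codec, decode_codec):
    """Encode text with one codec, then decode the bytes with another."""
    return text.encode(encode_codec).decode(decode_codec)


# Matching codecs give the original text back.
same = roundtrip(SAMPLE, "utf-8", "utf-8")

# UTF-8 bytes read as latin-1: the two-byte accented e becomes two characters.
garbled = roundtrip(SAMPLE, "utf-8", "latin-1")

# The encoded length differs between the two codecs as well.
utf8_len = len(SAMPLE.encode("utf-8"))      # accented e takes two bytes
latin1_len = len(SAMPLE.encode("latin-1"))  # accented e takes one byte
```

Whenever the encoder and the decoder disagree, as in garbled above, any non-ASCII text is silently corrupted, which is why a consistently applied default encoding matters.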

Quick workaround for items 3 and 4 above, in the form of a custom hook in sitecustomize.py:
----
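    import os
    import sys

    # Set the default encoding used for str/unicode coercion.
    sys.setdefaultencoding("utf-8")

    if os.name == "java":  # running under Jython
        from java.lang import Object

        def utf8_str(obj, orig_str=str):
            # Encode unicode objects explicitly as UTF-8.
            if isinstance(obj, unicode):
                return obj.encode("utf-8")
            # Java objects: encode their toString() result as UTF-8.
            if isinstance(obj, Object):
                return obj.toString().encode("utf-8")
            # Anything else falls back to the original built-in str.
            return orig_str(obj)

        # Replace the built-in str with the UTF-8-aware wrapper.
        sys.builtins["str"] = utf8_str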
----

Thanks for your attention,
Regis
History
Date                 User         Action  Args
2010-06-30 11:26:57  rdesgroppes  set     recipients: + rdesgroppes
2010-06-30 11:26:57  rdesgroppes  set     messageid: <1277897217.13.0.83259667503.issue1625@psf.upfronthosting.co.za>
2010-06-30 11:26:56  rdesgroppes  link    issue1625 messages
2010-06-30 11:26:55  rdesgroppes  create