Issue1663711

classification
Title: 32767 characters is max string constant size
Type: Severity: normal
Components: Core Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: cgroves, codecraig, kzuberi, rluse, zyasoft
Priority: low Keywords:

Created on 2007-02-19.18:44:20 by codecraig, last changed 2008-09-14.06:09:15 by zyasoft.

Messages
msg1495 (view) Author: craig (codecraig) Date: 2007-02-19.18:44:20
Jython can't handle strings over 32767 characters....


StringBuffer sb = new StringBuffer();
for (int i = 0; i < 32768; i++) {
    sb.append("a");
}

PythonInterpreter pi = new PythonInterpreter();
pi.exec("data = {}");

String x = "data[\"stuff\"] = {\"val\" : \"" + sb + "\"}";
pi.exec(x);


When the line "pi.exec(x)" is executed, the following exception occurs:

Exception in thread "main" Traceback (innermost last):
  (no code object) at line 0
SyntaxError: ('string constant too large (more than 32767 characters)', ('<string>', 1, 23, ''))

Can this be fixed?
msg1496 (view) Author: Khalid Zuberi (kzuberi) Date: 2007-02-20.00:23:17
To clarify your description, it's a limit on the size of string constants in the source and not a limit on the size of strings handled by the program (i think that's what you mean anyway). Looking in the source history, it seems to have been introduced with this ancient checkin:

  http://jython.svn.sourceforge.net/viewvc/jython?view=rev&revision=131

But the bug number mentioned there refers to a system that predated our use of the sourceforge trackers (a jitterbug instance?), and i've not been able to dig up the actual bug report.

Experimenting with that limit removed in CodeCompiler.java using a little one-liner like:

 exec('x="%s"' % ('1' * 65536 ))

shows an underlying problem. The relevant bit of stacktrace is:
 
 java.io.UTFDataFormatException: encoded string too long: 65536 bytes
         at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
         at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
         at org.python.compiler.ConstantPool.UTF8(ConstantPool.java:88)
         at org.python.compiler.ConstantPool.String(ConstantPool.java:188)
 
So i think what's happening here is that the string constants that appear in the source are stored in the java class's constant pool, but that the max size allowed there and allowed by writeUTF() is 65535 bytes (just under 64k). Here's an old reference to this limit:

 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4071592

Notice that the check in CodeCompiler.java is comparing the number of (presumably 16-bit encoded) characters in the string to this 32767 limit and not the length of its encoding in UTF-8. So it's possible that we are disallowing string constants that would actually fit, say in the case of the plain old ascii subset, where each character is represented by a single byte in UTF-8.
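
To make the difference between the two measurements concrete, here is a rough sketch in Python of what a byte-based check might look like. The helper name `fits_in_constant_pool` and the constant `MAX_UTF8_BYTES` are made up for illustration and are not taken from CodeCompiler.java. It also uses standard UTF-8, which only approximates the modified UTF-8 that writeUTF() emits (the two differ for NUL and for characters outside the Basic Multilingual Plane).

```python
# Hypothetical helper, not part of Jython: it checks the encoded size of a
# string rather than its character count.
MAX_UTF8_BYTES = 65535  # largest encoded length writeUTF() accepts


def fits_in_constant_pool(text):
    """Return True if the UTF-8 encoding of text is at most MAX_UTF8_BYTES long."""
    return len(text.encode("utf-8")) <= MAX_UTF8_BYTES


ascii_text = u"a" * 40000      # 40000 characters, 1 byte each in UTF-8
euro_text = u"\u20ac" * 30000  # 30000 characters, 3 bytes each in UTF-8

print(fits_in_constant_pool(ascii_text))  # True
print(fits_in_constant_pool(euro_text))   # False
```

The first string is over 32767 characters yet encodes to 40000 bytes, while the second is under 32767 characters but encodes to 90000 bytes, so a character count alone does not say whether a constant fits.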

Anyhow, if you can control your input, you may be able to work around this by transforming your large string constants into smaller constants concatenated at runtime. It would be interesting to see if a similar transformation were possible to do automagically within jython, but i wouldn't expect it for the upcoming release.
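
The workaround of splitting a large constant into smaller ones could be sketched roughly as follows. The helper `chunked_literal` and the chunk size of 30000 are assumptions made for illustration and are not part of Jython. The sketch also assumes the compiler does not fold the concatenation back into a single constant at compile time.

```python
# Hypothetical helper, not part of Jython.
def chunked_literal(text, chunk_size=30000):
    """Return Python source for an expression that evaluates to text,
    built from string literals of at most chunk_size characters each."""
    if not text:
        return repr(text)
    # repr() takes care of quoting and escaping each piece.
    pieces = [repr(text[i:i + chunk_size])
              for i in range(0, len(text), chunk_size)]
    # Use explicit "+" so the pieces are joined when the code runs.
    return "(" + " + ".join(pieces) + ")"


big = "a" * 32768
source = 'data = {"stuff": {"val": %s}}' % chunked_literal(big)

namespace = {}
exec(source, namespace)
print(len(namespace["data"]["stuff"]["val"]))  # 32768
```

The generated source contains two literals (30000 and 2768 characters), each below the 32767-character limit.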

Lowering priority and removing assignment to next beta.

- kz 
msg1497 (view) Author: craig (codecraig) Date: 2007-02-20.01:48:25
Currently I have to do my own management for this problem, where I check the length of any string before putting it into Python and split it into pieces smaller than 32767 characters.

guess that'll do for now :)
msg1498 (view) Author: Charlie Groves (cgroves) Date: 2007-05-01.09:40:36
Just to be clear, Jython can handle any string that fits in memory.  It can't handle a string *literal* longer than 32767 characters.  Because you're putting the string directly into the exec'd code it's turned into a literal and compiled into bytecode.  If you set a PyString in the interpreter, this will work fine. 

        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < 32768; i++) {
            sb.append('a');
        }
        PythonInterpreter pi = new PythonInterpreter();
        pi.exec("data = {}");

        pi.set("s", new PyString(sb.toString()));
        pi.exec("data[\"stuff\"] = {\"val\" : s}");
        pi.exec("print len(s)");

That prints '32768'.  
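
The same idea applies when the code being run is plain Python: pass the value in through the namespace instead of pasting it into the source. This is a minimal sketch that uses the built-in exec with an explicit namespace dict as a stand-in for pi.set(); the names `big` and `namespace` are illustrative only.

```python
big = "a" * 32768

# The executed source only mentions the name s, so it contains no large
# string literal; the value itself travels through the namespace dict.
namespace = {"s": big}
exec('data = {}', namespace)
exec('data["stuff"] = {"val": s}', namespace)

print(len(namespace["data"]["stuff"]["val"]))  # 32768
```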
msg1499 (view) Author: Bob Luse (rluse) Date: 2007-12-24.21:18:17
I just ran the following script and it was still going strong at over 75000 character string length :


seed = 'a'
testString = 'a'
while True:
    testString = testString + seed
    if len(testString) % 1000 == 0:
        print len(testString)

So, I think you can close this one also.
    
msg1500 (view) Author: Charlie Groves (cgroves) Date: 2007-12-25.00:32:06
The problem isn't the length of any string, it's the length of string literals in code.  That's what Khalid's code does by execing a string of length 65536, so this issue still exists.
msg3587 (view) Author: Jim Baker (zyasoft) Date: 2008-09-14.06:08:38
Fixed for 2.5 in r5196
Currently testing is being done by other packages such as pygments.
History
Date User Action Args
2008-09-14 06:09:15 zyasoft set title: 32767 characters is max string size -> 32767 characters is max string constant size
2008-09-14 06:08:38 zyasoft set status: open -> closed; nosy: + zyasoft; resolution: fixed; messages: + msg3587
2007-02-19 18:44:20 codecraig create