Issue2609

classification
Title: PyType.fromClass publication race (discovered in strptime and re)
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: Ivan, jeff.allen, zyasoft
Priority: Keywords:

Created on 2017-07-18.13:29:53 by Ivan, last changed 2017-09-05.20:52:59 by zyasoft.

Files
File name Uploaded Description Edit Remove
parse3a.py jeff.allen, 2017-07-30.09:35:22 Python script sometimes demonstrating the error
parse4a.py jeff.allen, 2017-08-04.07:43:21 Variant used to test the back-out of 2 sets
Messages
msg11479 (view) Author: Ivan (Ivan) Date: 2017-07-18.13:29:53
Got this in logs:
'org.python.modules.sre.MatchObject' object has no attribute 'end'
Call stack:
    gs.time_stamp = datetime.strptime(data_dict['time_stamp'], '%Y-%m-%dT%H:%M:%S')
  File "/opt/flute/lib/jython-standalone-2.7.1.jar/Lib/datetime.py", line 1791, in strptime
  File "/opt/flute/lib/jython-standalone-2.7.1.jar/Lib/_strptime.py", line 326, in _strptime

line 326 of _strptime is     if len(data_string) != found.end()..

Unfortunately, I have no idea what was that in data_dict['time_stamp'].
msg11490 (view) Author: Ivan (Ivan) Date: 2017-07-21.17:01:38
Further investigation shown that this issue is due to using compiled re objects in multi-threading environment. Using the script below one can reproduce errors saying that certain attributes not present in org.python.modules.sre.MatchObject. This errors are not reproducible for jython2.7.0!

---- Then the question is -- should re be thread-safe at all?


import thread
import re
import time
REGEX = re.compile('((?:foo|bar)+)(\d*)E?')

def parse(line):
  m = REGEX.search(line)
  if m.groups():
    print m.group(1)

thread.start_new_thread(parse, ('foobarfoofoofoofoofoofoofoofoofoofoo234',))
thread.start_new_thread(parse, ('foobarbarbarbarbarbarbarbarbarbarfoo4E',))
thread.start_new_thread(parse, ('foofoofoofoofoofoofoofoofoofoofoofoo434',))
thread.start_new_thread(parse, ('barbarbarbarbarbarbarbarbarfoofoo4',))
thread.start_new_thread(parse, ('foobarbarbarbarbarbarbarfoofoofoofoofoofoofoobarfoo4',))
thread.start_new_thread(parse, ('foobarbarbarbarbarbarbarbarbarfoo3424',))
thread.start_new_thread(parse, ('foofoobarfoobarfoobarfoobarfoobar333334E',))
thread.start_new_thread(parse, ('barfoofoo4barfoofoobarfoofoo4barfoofoo4barfoofoo4barfoofoo4',))
thread.start_new_thread(parse, ('foofoofoobarbarfoofoobarbarfoofoobarbarfoofoobarbar4',))
thread.start_new_thread(parse, ('foobarbarbarbarbarbarbarfoofoofoofoofoofoofoobarfoo4',))
thread.start_new_thread(parse, ('foobarbarbarbarbarbarbarbarbarfoo3424',))
thread.start_new_thread(parse, ('foofoobarfoobarfoobarfoobarfoobar333334E',))
thread.start_new_thread(parse, ('barfoofoo4barfoofoobarfoofoo4barfoofoo4barfoofoo4barfoofoo4',))
thread.start_new_thread(parse, ('foofoofoobarbarfoofoobarbarfoofoobarbarfoofoobarbar4',))

time.sleep(1)
msg11491 (view) Author: Ivan (Ivan) Date: 2017-07-21.17:26:31
And here is the script to reproduce the original problem (run it several times to catch an exception). The bug is obviously due to regexp object shared somewhere between threads:

import thread
import re
import datetime
import time

def printtime(time):
  print datetime.datetime.strptime(time, '%Y-%m-%dT%H:%M:%S')


thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))


time.sleep(1)
msg11492 (view) Author: Ivan (Ivan) Date: 2017-07-22.07:36:59
Another issue with the same root is with json.loads:

    dict = json.loads(flute.params)
  File "/opt/flute/lib/jython-standalone-2.7.1.jar/Lib/json/__init__.py", line 338, in loads
  File "/opt/flute/lib/jython-standalone-2.7.1.jar/Lib/json/decoder.py", line 366, in decode
AttributeError: 'org.python.modules.sre.MatchObject' object has no attribute 'end'
msg11494 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-23.13:14:34
Thanks for investigating this so carefully.

You write "The bug is obviously due to regexp object shared somewhere between threads", I suppose because in your example it's the only object that obviously is shared between threads. A couple of things puzzle me:
1. Very little here has changed since 2.7.1b3 (Feb 2016).
2. The only obviously shared object (REGEX) appears not to be mutated. 

Which causes me to wonder whether there might be a fundamental problem where module-global objects might be inconsistent between threads. Others may have more insight.
msg11495 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-23.16:34:42
I was unable to reproduce this on my Windows box. Here's what I evolved Ivan's demonstration into:

import thread
import re
import time
import random

LEN = 2000
COUNT = 10000
REGEX = re.compile('((?:foo|bar)+)(\d*)E?')

counter = 0
counter_lock = thread.allocate_lock()

def count(inc=0):
    global counter
    with counter_lock:
        counter += inc
        return counter
    
def parse(line):
    try:
        m = REGEX.search(line)
    finally:
        count(1)
        if LEN <= 10 and m.groups():
            print line, m.group(1)

def make():
    nonsense = list()
    for i in range(LEN):
        nonsense.append(random.choice(('foo', 'bar')))
    nonsense.append(str(random.randint(1, 99999)))
    nonsense.append(random.choice(('', 'E')))
    return ''.join(nonsense)


# Make up material
tests = list()
for i in range(COUNT):
    tests.append(make())

print "Start all threads"
start = thread.start_new_thread
for test in tests:
    start(parse, (test,))

print "Wait for completion"
while True:
    c = count()
    print c
    if not c < COUNT:
        break
    time.sleep(0.1)

I know there's a lock here, but it doesn't interfere until after the threads complete.

I get:
 ..\inst\bin\jython .\iss2609_parse2.py
Start all threads
Wait for completion
9984
10000
and no exceptions. During the run, jvisualvm peaks at 1038 live threads and the resource monitor shows all 4 CPUs busy, so I think I'm getting as much concurrency out of it as I have available. I am using:
Jython 2.7.1 (default:0df7adb1b397, Jun 30 2017, 19:02:43)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_141
msg11496 (view) Author: Ivan (Ivan) Date: 2017-07-23.21:25:58
Hello, Jeff, thanks for reply!

For some reason your version of script also works on my PC without any errors.

Could you please try this one: 

import thread
import re
import datetime
import time

def printtime(time):
    datetime.datetime.strptime(time, '%Y-%m-%dT%H:%M:%S')

for i in range(10):
    for i in range(10):
         thread.start_new_thread(printtime, ('2017-07-01T11:11:11',))
    time.sleep(0.5)


This reproduces the error on my Windows 8 laptop with 4 core CPU with the following callstack:

Unhandled exception in thread started by <function printtime at 0x2>
Traceback (most recent call last):
  File "d:\temp\deleteme\foo.py", line 7, in printtime
    datetime.datetime.strptime(time, '%Y-%m-%dT%H:%M:%S')
  File "D:\jython2.7.1\Lib\datetime.py", line 1791, in strptime
    struct, micros = _strptime(date_string, format)
  File "D:\jython2.7.1\Lib\_strptime.py", line 326, in _strptime
    if len(data_string) != found.end():
AttributeError: 'org.python.modules.sre.MatchObject' object has no attribute 'end'


Our production environment (where the problem came from) are 2-core DigitalOcean droplets with Ubuntu 16.04
msg11497 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-24.18:42:24
I had to make it 1000 blocks of 100 calls, but I did got one exception out of it. But not when I try and catch it :(
msg11498 (view) Author: Ivan (Ivan) Date: 2017-07-24.20:17:27
I reproduce this each time I run the script on my development machine. 

But what is worse, we've got lots of these errors in production environment logs. We reverted the dependency to 2.7.1b3 which doesn't seem to be affected by this problem.

Another script to reproduce the problem:

import thread
import re
import json
import time

def process_json(str):
    json.loads(str)

for i in range(10):
    for i in range(10):
         thread.start_new_thread(process_json, ('{"a": 15, "b": 20}',))
    time.sleep(0.1)

fails with error

  File "d:\temp\deleteme\foo2.py", line 7, in process_json
    json.loads(str)
  File "D:\jython2.7.1\Lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "D:\jython2.7.1\Lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
AttributeError: 'org.python.modules.sre.MatchObject' object has no attribute 'end'


That's what we have in production: regexp, strptime and json.loads calls are affected.
msg11499 (view) Author: Jim Baker (zyasoft) Date: 2017-07-25.12:35:20
When running on OS X 10.12 I can reproduce this problem by increasing the number of threads and/or test runs. I see two types of errors:

AttributeError: 'org.python.modules.sre.MatchObject' object has no attribute 'end'
AttributeError: 'null' object has no attribute 'end'

The strptime implementation in datetime.py and _strptime.py is not written for multithreaded performance, but it does look correct.
msg11508 (view) Author: Ivan (Ivan) Date: 2017-07-28.14:58:14
What we faced in production (Ubuntu 16.04 virtual machine, 2Core-CPU, 2Gb memory) is:

1) multiple issues with strptime, json.loads and regex (variations of 'MatchObject has no attribute')

2) errors with 'object has no attribute' with reference to http://bugs.jython.org/issue2487

So we had to roll back to Jython2.7.1b3: b3 version does not have these issues.

Apparently something is broken with synchronization, and apparently regexps are affected.
msg11509 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-29.05:07:19
I have a real machine with the same specification and will try it there.
msg11510 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-30.08:19:38
I've now seen this exception raised in the "parse" and "json" cases on my Linux system. They are rare for me. No program I have is guaranteed to raise one.

I tentatively observe:

1. They are more likely to occur early in execution (if they occur at all).
2. Often I get no actual concurrency (only 1 or 2 more threads than daemons in jvisualvm).
3. Attempting to catch the exception (almost) stops it happening :(

I prefer the parse example as it ought to be possible to preserve the crime scene.

I have a version of the program that (really) runs 100 threads, parsing 2000 foo/bar elements, and drops into pdb if it raises. I'm repeatedly launching it in a Bash while loop. No luck yet.
msg11511 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-30.09:35:21
Hmmm ... I'm not sure that actual concurrency (large numbers of parsers working at the same time) has much to do with this. I run my test as:

> while ../inst/bin/jython parse3a.py ; do echo "ok"; done

which lets me fiddle with the source as it runs. It can be quite a wait. The times I've caught this, the numbers were small. Here's one:

REPEATS=2, THREADS=10,  LENGTH=2.
barfoo
foofoo
foobar
barfoo
foofoo
barfoo
barfoo
barfoo
barfoo
barfoo
barfoo
barfoo
foofoo
foobar
foobar
barfoofoofoo

barfoo
barfoo
 *** Caught one! ***
Traceback (most recent call last):
  File "parse3a.py", line 25, in parse
    if m.groups(): short_print(m.group(1))
AttributeError: 'org.python.modules.sre.MatchObject' object has no attribute 'groups'
> /home/jeff/Jython/iss2609/work/parse3a.py(25)parse()
-> if m.groups(): short_print(m.group(1))
(Pdb) p m
<org.python.modules.sre.MatchObject object at 0x2>
(Pdb) m.groups()
('foobar', '52536')
(Pdb) p tests
['barfoo84894', 'barfoo38345E', 'foofoo8017', 'barfoo42953', 'foobar52536E', 'foobar18661', 'barfoo35695E', 'foofoo89916', 'barfoo94921', 'barfoo32379']
(Pdb) p i
0

Now, as you can see, the object m behaved in the thread as if it had no groups() method, but later in the debugger it is present, and it yields the right result. This happens in the first block of threads.

My theory is that use of this object by the thread is running ahead of initialising the object or (quite likely) its class dictionary. I'll test that, see if I can get an incriminating snapshot.
msg11512 (view) Author: Jim Baker (zyasoft) Date: 2017-07-30.21:29:17
[also sent to jython-dev, apparently I'm not properly registered to respond to this tracker to use email responses]

I increasingly believe this issue is almost certainly
http://bugs.jython.org/issue2487 as Ivan pointed out earlier in this
thread. Note that the synchronization in PyType.fromClass, which is
responsible for constructing this map, was causing deadlocks was removed in
2.7.1b3 and replaced with a busy wait in 2.7.1rc1.

I want to ask this question: *should we revisit our original decision to
remove this synchronization*? (In
https://hg.python.org/jython/rev/cffe5b824a21 and at least another followup
commit set). It is possible that we got this mixed up with the low memory
problem, and how it had a nasty action at a distance problem, see
http://bugs.jython.org/issue2536

At the very least this is a low cost thing to try out, although it may not
be conclusive.

I should point out that I spent too much time trying to revisit how this
publishes, and I never found a workable solution that didn't do the busy
waiting that Darjus proposed as an alternative. Clearly we don't care about
this code path being synchronized, only that it doesn't deadlock; it is not
going to be hot except in some pathological usage scenario (constant
reloading of Java classes through distinct class loaders, then used in some
sort of callback scenario is the only thing that I can think of; otherwise
this cost is only a startup cost).
msg11513 (view) Author: Jeff Allen (jeff.allen) Date: 2017-07-31.08:27:02
I agree that it is likely to be a general problem with initialisation rather than specific to the thread safety of re. Investigation has led me into code that is generic to the support of types.

This is because, in circumstances (on my Linux box) where the error occurs, I've been able to catch m.__class__.__dict__.items() empty.

The observation that if its going to happen, it happens early, is misleading. I think this may be something to do with it being more likely when the machine is busy (or not busy, who can say). On my Windows machine, which is the only place debugging is a satisfatory experience, the same code will produce the error, but it takes around ten thousand calls. I can't reconcile this with the idea that it is synchronisation of the initialisation of the type that's at fault.

However, not understanding https://hg.python.org/jython/rev/cffe5b824a21 and https://hg.python.org/jython/rev/f3458ef83e08 fully, and given they're in the time frame b3-release, reverting them seems worth a try to me. I could probably try that this evening (UK time). It is clear from comments that we had our doubts about them. With hindsight, given this is a critical bit of the Jython core, and threading is so difficult, either proof or a massively beefy test of correctness was in order.

So far my own investigation has only led me into into MethodCache.lookup_where(). Speaking of hindsight, what I really need for this is a way to catch the miss, then single-step *backwards* into methods that just returned :)
msg11514 (view) Author: Ivan (Ivan) Date: 2017-07-31.09:15:32
Hello guys, if my evidence can be of any help:

1. On our production machines this errors is indeed a very tiny fraction of successful calls. I forgot to mention that we have 20 Ubuntu machines and watch aggregated logs. BUT on my Windows laptop I can reproduce these errors easily, without many retries.

2. Have been running for 2 days after rolling back to 2.7.1b3: no 'pathological' 'object has no attribute' on logs.
msg11515 (view) Author: Jim Baker (zyasoft) Date: 2017-07-31.17:49:13
There's one more wrinkle here, which I thought about given Jeff's
observation that it doesn't occur at initialization, yet this looks
like a publishing problem to me. In running the test program of
msg11496, I tried the below change in PyType.java:

classToType = new MapMaker().weakKeys().weakValues().makeMap();

to

classToType = new MapMaker().makeMap();

and with that change, I'm not seeing the test program fail.

The weak references here are essential for Jython to be able to work
in the context of unloading corresponding Java classes and/or Python
types. But weak references of course do make it harder to think about
object lifetimes. For extra fun, I also threw in a gc.collect() in the
test program. With forced GC, that seemed to make it more reliable to
get the publishing failure using the current weak keys/values
construction, and at different times after startup. (Need to do this
testing more comprehensively to be sure, however.)

Note that we really don't care so much about it being weak keys/weak
values per se, only that it is possible to eventually do this cleanup
in some reasonable period of time. In other words, this should be a
cache. Right now, classToType doesn't appear to be caching terribly
well, given that we must be rebuilding these mappings despite being in
a tight inner loop.

More analysis required!
msg11516 (view) Author: Jim Baker (zyasoft) Date: 2017-07-31.17:51:13
Small clarification from my previous post: doesn't *necessarily* occur at initialization
msg11519 (view) Author: Jeff Allen (jeff.allen) Date: 2017-08-01.21:36:09
I think you've got it Jim.

To elaborate on my understanding ... two threads Alice & Bob start. Alice calls REGEX.match and gets a MatchObject result. Alice finds no PyType for a MatchObject, and adds one (maybe here: https://hg.python.org/jython/file/tip/src/org/python/core/PyType.java#l1509). Meanwhile Bob starts up, receives a MatchObject and finds the PyType Alice created. But Alice is still building it, and Bob uses it before Alice can finish filling it in.

An observation supporting this theory is that in a few cases the error is a Java NPE thrown from here: https://hg.python.org/jython/file/tip/src/org/python/core/PyDictProxy.java#l70 . 

I find, on Windows, that I get the error much sooner if I print out the arguments during parse(), in the first block normally. Not sure why. However, in that condition of frequent early failure, if I add early in the main program:

sample = REGEX.search("foo23E")

it will run forever (ok, 20 minutes) without raising. Comment it out and it fails immediately. Of course, that line will fully construct (and keep alive) an instance of PyType corresponding to MatcherObject, and do so before the threads start.

There's an attempt here:
https://hg.python.org/jython/file/tip/src/org/python/core/PyType.java#l1512
to do without the synchronisation, arguing that multiple construction is harmless, because the first thread to set a PyType in the map will win, and the duplicates can be discarded. This argument only works if the object is fully constructed by the time it is entered in the map.

But even then, might this be overlooking the question of cache coherence? The map entry could be absent in one processor's view of memory and not in the other's.
msg11521 (view) Author: Jeff Allen (jeff.allen) Date: 2017-08-02.08:17:41
Ok, given the map is thread safe, the question about cache coherency is probably not an issue. I see the intended design here as relying on the map's thread safety rather than synchronising PyType as a whole. I guess we're trying to avoid holding a lock on a global (PyType.class) for longer than is necessary in this hot part of Jython.

As implemented, though, we leave completing the job until after the put(), and therefore unsynchronised (if I've understood it correctly). This was not so when the lock was on the PyType class. I assume the motivation for this two-phase construction is to deal with recursive reference. And you can't complete it before the put() for that reason. (Parallel: when loading Python modules we enter them in sys.modules *before* initialising them in case they refer to themselves.)

Is there a case for a map of part-constructed PyTypes that we are allowed to use in constructing other PyTypes but not in resolving regular accesses? Or is that just another thinking error?

Weak references make the problem occur more frequently because we repeat the faulty construction (I think). I appreciate the question of class unloading, but as written it seems like every time we stop using the exposed version of a class momentarily, we're going to pay for a new PyType. (Every time but the last, that is.) I think this is behind the comment in the docs for MapMaker here: https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/MapMaker.html#weakValues(), still recommending softValues() despite having removed that method.

Is ClassValue what we ought to be using here? I'm pleased to find that it is available from Java 7, not 8 as I thought. You'd still have to complete construction before making it visible to attribute access.

Also, JLS 17.4.4: "Programmers do not need to reason about reorderings to determine that their code contains data races."
msg11523 (view) Author: Jeff Allen (jeff.allen) Date: 2017-08-04.07:43:21
I can confirm that backing out https://hg.python.org/jython/rev/cffe5b824a21 and https://hg.python.org/jython/rev/f3458ef83e08 makes my test run indefinitely (meaning 40 min). To be specific, the test is parse4a.py as attached, and it runs in a shell loop:

do { dist\bin\jython parse4a.py } until ($LASTEXITCODE -ne 0)

The same thing *without* the 2 changesets backed out drops into the debugger within a few minutes.

This leaves us with #2487 still to fix, and without reproducing this one. More there later.
msg11526 (view) Author: Jeff Allen (jeff.allen) Date: 2017-08-07.18:58:34
Fix to push after a last regression test completes.
History
Date User Action Args
2017-09-05 20:52:59zyasoftsetstatus: pending -> closed
2017-08-07 18:58:34jeff.allensetstatus: open -> pending
title: strptime in Jython2.7.1 not working correctly in multithreaded environment -> PyType.fromClass publication race (discovered in strptime and re)
messages: + msg11526
assignee: jeff.allen
components: + Core, - Library
milestone: Jython 2.7.2
type: behaviour
2017-08-04 07:43:22jeff.allensetfiles: + parse4a.py
messages: + msg11523
2017-08-02 08:17:42jeff.allensetassignee: jeff.allen -> (no value)
messages: + msg11521
components: + Library, - Core
2017-08-01 21:36:10jeff.allensetassignee: jeff.allen
messages: + msg11519
components: + Core, - Library
2017-07-31 17:51:13zyasoftsetmessages: + msg11516
2017-07-31 17:49:14zyasoftsetmessages: + msg11515
2017-07-31 09:15:32Ivansetmessages: + msg11514
2017-07-31 08:27:04jeff.allensetassignee: jeff.allen -> (no value)
messages: + msg11513
components: + Library, - Core
2017-07-30 21:29:18zyasoftsetmessages: + msg11512
2017-07-30 09:35:23jeff.allensetfiles: + parse3a.py
assignee: jeff.allen
messages: + msg11511
components: + Core, - Library
2017-07-30 08:19:40jeff.allensetmessages: + msg11510
2017-07-29 05:07:20jeff.allensetmessages: + msg11509
2017-07-28 14:58:15Ivansetmessages: + msg11508
2017-07-25 12:35:20zyasoftsetresolution: accepted
messages: + msg11499
nosy: + zyasoft
2017-07-24 20:17:28Ivansetmessages: + msg11498
2017-07-24 18:42:24jeff.allensetmessages: + msg11497
2017-07-23 21:25:58Ivansetmessages: + msg11496
2017-07-23 16:34:43jeff.allensetmessages: + msg11495
2017-07-23 13:14:35jeff.allensetnosy: + jeff.allen
messages: + msg11494
2017-07-22 07:37:00Ivansetmessages: + msg11492
2017-07-21 17:26:32Ivansetmessages: + msg11491
title: re module lost thread safety in jython 2.7.1? -> strptime in Jython2.7.1 not working correctly in multithreaded environment
2017-07-21 17:01:39Ivansetmessages: + msg11490
title: 'MatchObject object has no attribute 'end'' when calling strptime -> re module lost thread safety in jython 2.7.1?
2017-07-18 13:29:53Ivancreate