Our working assumption for this sort of test failure is that several GC passes are often needed to clear weak references, and that GC in general is never quite as deterministic as we would like for such testing. (CPython's reference counting, in contrast, does give deterministic collection for most objects, so tests written against it are a bit harder to reuse here. In practice this should not matter: one should not rely on deterministic GC even in CPython, given the possibility of reference cycles.) The specific code we use is in the extra_collect function.

In particular, test_weak_values does not use the extra_collect function; it calls gc.collect directly instead. It would be worth rerunning that test with extra_collect to see whether it makes a difference.
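As a rough sketch of what such a check could look like: the helper below, collect_repeatedly, is a hypothetical stand-in and not the actual extra_collect code; its pass count and pause length are assumptions. Payload and weak_values_cleared are likewise illustrative names. The idea is simply to run several collections with short pauses before asserting that a WeakValueDictionary has emptied.

```python
import gc
import time
import weakref


class Payload(object):
    """Plain object that supports weak references."""


def collect_repeatedly(passes=3, pause=0.1):
    # Hypothetical stand-in for extra_collect (values are assumptions):
    # run several GC passes, pausing between them so that a collector
    # running on another thread has a chance to make progress.
    for _ in range(passes):
        gc.collect()
        time.sleep(pause)


def weak_values_cleared(count=10):
    # Build a weak-value mapping over objects we hold strongly for now.
    values = [Payload() for _ in range(count)]
    cache = weakref.WeakValueDictionary(dict(enumerate(values)))
    assert len(cache) == count

    # Drop the only strong references, then collect more than once.
    del values
    collect_repeatedly()

    # True if every entry was removed from the mapping.
    return len(cache) == 0
```

If weak_values_cleared() returns True with repeated passes but a single gc.collect() leaves entries behind, that would point at collection timing rather than at the weakref implementation.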

Note that if the weakref is never collected, that would be a bug!
