First: I really appreciate that this topic is getting some traction!
Your suggestion for OS-specific sections sounds good to me as a mid-term solution. For the short term, I would propose sorting tests more rigorously into stable vs. unstable, so that regrtests quickly become reliable again.
From that position we can, step by step, fix unstable tests or sort them into OS-specific sections, keeping regrtests reliable throughout the transition period.
