Internally we've seen a rare crash arise in the runtime since Go 1.14. The error message is typically `sudog with non-nil elem`, stemming from a call to `releaseSudog` from `chansend` or `chanrecv`.
The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.
- GW wants to suspend GS to scan its stack. It calls `suspendG`.
- GS is about to `gopark` in `chansend`. It calls into `gopark`, and changes its status to `_Gwaiting` BEFORE calling its `unlockf`, which sets `gp.activeStackChans`.
- GW observes `_Gwaiting` and returns from `suspendG`. It continues into `scanstack` where it checks if it's safe to shrink the stack. In this case, it's fine. So, it reads `gp.activeStackChans`, and sees it as false. It begins adjusting `sudog` pointers without synchronization. It reads the `sudog`'s `elem` pointer from the `chansend`, but has not written it back yet.
- GS continues on its merry way and sets `gp.activeStackChans` and parks. It doesn't really matter when this happens at this point.
- GR comes in and wants to `chanrecv` on the channel. It grabs the channel lock, reads from the `sudog`'s `elem` field, and clears it. GR readies GS.
- GW then writes the updated `sudog`'s `elem` pointer and continues on its merry way.
- Sometime later, GS wakes up because it was readied by GR, and tries to release the `sudog`, which has a non-nil `elem` field.
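
The interleaving above can be replayed deterministically in a toy model. This is only a sketch: the `g` and `sudog` types, the `waiting` field (standing in for `_Gwaiting`), and `replayRace` are hypothetical stand-ins, not the runtime's real definitions, and the "stack copy" is reduced to swapping one pointer.

```go
package main

import "fmt"

// Toy stand-ins for the runtime's sudog and g (hypothetical, heavily simplified).
type sudog struct{ elem *int }

type g struct {
	waiting          bool // stands in for status == _Gwaiting
	activeStackChans bool
	sg               *sudog
}

// replayRace performs the events described above, in order, on one thread.
// It reports whether GS would find a non-nil elem when releasing its sudog.
func replayRace() bool {
	// oldSlot/newSlot model the sent value's address before and after
	// the stack is copied during a shrink.
	oldSlot, newSlot := new(int), new(int)
	gs := &g{sg: &sudog{elem: oldSlot}}

	// GS: gopark publishes the waiting status before running unlockf.
	gs.waiting = true

	// GW: sees the waiting status, finds activeStackChans still false,
	// so it skips synchronization and reads elem.
	synchronized := gs.activeStackChans
	read := gs.sg.elem

	// GS: unlockf finally sets the flag, too late to help.
	gs.activeStackChans = true

	// GR: receives under the channel lock and clears elem.
	gs.sg.elem = nil

	// GW: writes back the adjusted pointer, clobbering GR's clear.
	if !synchronized && read == oldSlot {
		gs.sg.elem = newSlot
	}

	// GS: wakes up and checks elem in releaseSudog.
	return gs.sg.elem != nil
}

func main() {
	fmt.Println("sudog with non-nil elem:", replayRace())
}
```

The model makes the lost update visible: GW's read and write of `elem` straddle GR's clear, so the clear is overwritten.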
The fix here, I believe, is to set `gp.activeStackChans` before the `unlockf` is called. Doing this ensures that the value is updated before any worker that could shrink GS's stack observes a useful G status in `suspendG`. This could alternatively be fixed by changing the G status after `unlockf` is called, but I worry that will break a lot of things.
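
A minimal sketch of why the proposed ordering helps, using the same kind of toy model. It assumes that a worker which sees `activeStackChans` set takes the channel lock before touching the `sudog`; `replayFixed`, `chanLock`, and the simplified types are hypothetical and do not mirror the runtime's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Toy stand-ins for the runtime's sudog and g (hypothetical, heavily simplified).
type sudog struct{ elem *int }

type g struct {
	waiting          bool // stands in for status == _Gwaiting
	activeStackChans bool
	sg               *sudog
}

// replayFixed replays the scenario with the flag set before the status.
// grFirst selects whether GR or GW wins the channel lock first.
// It reports whether GS would find a non-nil elem when releasing its sudog.
func replayFixed(grFirst bool) bool {
	var chanLock sync.Mutex
	oldSlot, newSlot := new(int), new(int)
	gs := &g{sg: &sudog{elem: oldSlot}}

	// Proposed ordering: publish the flag first, then the waiting status.
	gs.activeStackChans = true
	gs.waiting = true

	// GW: because the flag is visible, the adjustment runs under the lock,
	// so its read and write of elem cannot straddle GR's clear.
	adjust := func() {
		if gs.activeStackChans {
			chanLock.Lock()
			defer chanLock.Unlock()
		}
		if gs.sg.elem == oldSlot {
			gs.sg.elem = newSlot
		}
	}

	// GR: receives under the channel lock and clears elem.
	recv := func() {
		chanLock.Lock()
		defer chanLock.Unlock()
		gs.sg.elem = nil
	}

	if grFirst {
		recv()
		adjust()
	} else {
		adjust()
		recv()
	}
	return gs.sg.elem != nil
}

func main() {
	fmt.Println(replayFixed(true), replayFixed(false))
}
```

In this model, either order of GR and GW leaves `elem` nil, because the adjustment is atomic with respect to the receive.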