Can’t create thread (11) (ThreadError) in Ruby

I have been working on some networked code in Ruby which uses EventMachine. This is part of my work as a Research Associate at the University of Birmingham. I recently had a headache with threads and processes, whereby I would get this error during thread creation:

Can’t create thread (11) (ThreadError)

Here is the code I was running:

I knew I wasn’t creating too many threads and I knew I wasn’t running out of memory. After some serious digging, and two hundred browser tabs later, I found the reason: zombie (defunct) processes.

The ’11’ in the error message is actually not from Ruby at all, but from the pthread library which is used to create new threads on Linux (among other OSes). It is the numerical value of the error code EAGAIN, returned by the function pthread_create, which occurs when:

Insufficient resources to create another thread, or a system-imposed limit on the number of threads was encountered
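
The mapping from the number back to EAGAIN can be checked from Ruby itself. This is a minimal sketch, assuming a Linux machine (where EAGAIN is 11; other platforms may use a different number). The variable names and the "pthread_create" label are only there for illustration.

```ruby
# Errno::EAGAIN::Errno holds the platform's numeric value for EAGAIN
# (11 on Linux; this is an assumption about the platform, not a constant of Ruby).
code = Errno::EAGAIN::Errno
puts "EAGAIN = #{code}"

# SystemCallError.new maps a raw errno back to the matching Errno class.
# The "pthread_create" string is just a label for the message.
error = SystemCallError.new("pthread_create", code)
puts error.class
puts error.message
```

Running the same lookup with whatever number appears in the ThreadError is a quick way to find out which system error is hiding behind it.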

I had used Ruby’s Thread.list.size to ensure I wasn’t using too many threads, so what could it be?
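
It is worth seeing why that check can be misleading. Thread.list only covers the threads of the current Ruby process, while the limits behind EAGAIN are imposed by the system. The sketch below compares the two, assuming a Linux machine where RLIMIT_NPROC is available and counts threads as well as processes. Whether that particular limit was the one hit here is an assumption, and the `gate` and `workers` names are made up for the example.

```ruby
# A queue used to keep the worker threads parked until we release them.
gate = Queue.new

before  = Thread.list.size
workers = 3.times.map { Thread.new { gate.pop } }
during  = Thread.list.size
puts "Ruby threads in this process: #{during} (was #{before})"

# Per-user limit on processes; on Linux, threads count against it too.
# This is a system-level number that Thread.list knows nothing about.
soft, hard = Process.getrlimit(Process::RLIMIT_NPROC)
puts "RLIMIT_NPROC soft=#{soft} hard=#{hard}"

# Release the workers and wait for them to finish.
workers.size.times { gate.push(:done) }
workers.each(&:join)
```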

Well, during runtime I happened to check how many Ruby-related processes were running, and I found a huge list of processes marked <defunct>, with the Z+ (zombie) state attached. This tipped me off. The processes I was spawning were not going away until I killed the originating process itself. This seemed strange, since I knew for a fact the work had been completed in both the originating process and its children. However, a ‘zombie’ process in Unix is a process which has exited but remains in the process table so that its parent can read its exit status.
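
The zombie state is easy to reproduce in isolation. The following is a minimal sketch, assuming Linux with /proc mounted. `process_state` is a hypothetical helper written for this example, not part of Ruby. It forks a child that exits at once, observes the "Z" state, and then reaps the child with Process.wait.

```ruby
# Hypothetical helper (Linux only): returns the state letter from
# /proc/<pid>/stat, e.g. "S" for sleeping or "Z" for zombie.
# Returns nil once the pid no longer exists.
def process_state(pid)
  stat = File.read("/proc/#{pid}/stat")
  # The state is the first field after the parenthesised command name.
  stat.rpartition(")").last.split.first
rescue Errno::ENOENT, Errno::ESRCH
  nil
end

pid = fork { exit!(0) } # the child finishes immediately

# Poll for up to two seconds until the kernel reports the child as a zombie.
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 2
until process_state(pid) == "Z" ||
      Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
  sleep 0.01
end
state_before = process_state(pid)

Process.wait(pid) # the parent collects the exit status
state_after = process_state(pid)

puts "before wait: #{state_before.inspect}, after wait: #{state_after.inspect}"
```

Until the parent calls Process.wait, the finished child keeps its entry in the process table. Afterwards the entry is gone.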

Since, in my code, the parent was forking a new process and then never collecting the child’s exit status, the dead child would sit in the process table indefinitely. Each zombie still counts towards the system’s process limits, which is presumably why thread creation eventually failed. The fix simply required adding a Process.wait(pid) so that the (already-finished) process could be removed from the process table, allowing more threads to be created.

Here’s how the code should have looked:

Since I launched several child processes, I simply stored their pids until the end of the parent process code and called ‘wait’ for them all (which only took some tens of milliseconds). I hope this helps somebody out there.
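
For anyone wanting a starting point, the pattern of storing pids and waiting at the end can be sketched as below. This is an illustration under assumptions and not the original code: `do_work` is a hypothetical stand-in for the real job, and the number of children is arbitrary.

```ruby
# Hypothetical stand-in for the real work done inside each child.
def do_work(n)
  Thread.new { n * 2 }.value
end

# Fork the children and remember their pids.
pids = 3.times.map do |n|
  fork do
    do_work(n)
    exit!(0) # leave the child without running the parent's at_exit hooks
  end
end

# ... the parent carries on with its own work here ...

# At the end, reap every child so that none is left as a zombie.
statuses = pids.map do |pid|
  _, status = Process.wait2(pid)
  status
end

puts "all children exited cleanly: #{statuses.all?(&:success?)}"
```

Process.wait2 returns the exit status as well, which is handy for spotting a failed child. If the status is of no interest, Process.detach(pid) is another option, although it reaps each child from a separate thread.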