Description
For the past few months we have been troubled by an issue where all instances on our machines get stuck at 100% CPU usage. A typical machine runs 5 instances and ZEO, with 2 CPUs (2.4 GHz) and 8 GB RAM. We could avoid the issue by stopping all instances and then starting them one at a time, waiting for the CPU load to drop before starting the next one. Restarting all instances via supervisor reliably reproduced the issue. As soon as one instance got into this state, starting or restarting additional instances would cause them to use all available CPU as well.

Today we tried adding `http-fast-listen = off` to the buildout configuration for the instances, and now we can restart all instances without any problem.
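For reference, the workaround looks roughly like this in the buildout configuration. This is a minimal sketch: the part name `[instance]` is an assumption, and the option goes into whichever part builds each Zope instance.

```ini
# Hypothetical part name; use the part that builds your instance.
[instance]
# Workaround described above: disable the fast-listen behaviour.
http-fast-listen = off
```

Buildout has to be re-run and the instances restarted for the change to take effect.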
While investigating the issue we noticed that a high CPU load while the instances were starting made matters worse. For example, by running `dd if=/dev/zero of=/dev/null` three times in parallel, we could trigger the issue by starting 3 instances at the same time rather than 5. We have also had this issue on a machine that runs 5 instances, each in a separate Docker container.
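The load generation described above can be sketched as a pair of shell helpers. The names `start_load`, `stop_load` and `LOAD_PIDS` are made up for illustration; only the `dd` command itself comes from our setup.

```shell
#!/bin/sh
# Sketch: generate background CPU load with dd before (re)starting instances.
# start_load, stop_load and LOAD_PIDS are hypothetical names.

LOAD_PIDS=""

# Start N dd processes copying /dev/zero to /dev/null; each keeps one CPU busy.
start_load() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/zero of=/dev/null 2>/dev/null &
        LOAD_PIDS="$LOAD_PIDS $!"
        i=$((i + 1))
    done
}

# Stop every dd process that start_load launched.
stop_load() {
    for pid in $LOAD_PIDS; do
        kill "$pid" 2>/dev/null
    done
    LOAD_PIDS=""
}
```

A reproduction attempt would then be along the lines of `start_load 3`, followed by `supervisorctl restart all` (assuming the instances are managed by supervisor, as ours are), and `stop_load` once the instances have either settled or become stuck.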
I attempted to use the Plone Docker image to build a reproducible test case by restricting the CPU available to the container and then creating a high load (with `dd`, as above) before starting the instance, but this proved more difficult than I had hoped.