COSMOS2 rebooted

On 2017-07-24 at about 10:00 UTC, the batch job scheduler on the system started misbehaving. Running jobs continued and new jobs were still being scheduled, but access to the batch queue status and the ability to submit new jobs were both intermittent.

After repeated attempts to resolve the problem without a reboot failed, COSMOS2 was rebooted around midnight, once all running jobs had either drained or run over their wall-time limit (the system was unable to kill the jobs that overran).

The system is now healthy again.