
System Status

System related news items.

Please subscribe to our RSS news feed.

Power cut affecting cosmos machines on 18 August

Dear Users,

On 18 August at around 22:40, the server room hosting the Cosmos machines was affected by a power failure. Although cosmos2 and cosmic remained online, the /fast and /slow filesystems became unavailable. The system became fully operational again at 00:50 on 19 August. Unfortunately, some jobs failed yesterday evening because the /fast and /slow filesystems were missing. Please check the output of your jobs and resubmit them if necessary.

Kind regards,

Cosmos Management Team

Maintenance on 16 August

Dear Users,

We are currently observing problems with one of the NUMAlink cables in the COSMOS systems. Because this can affect the stability of the systems, we have scheduled another maintenance on Wednesday, 16 August, to replace the cable. All systems will therefore be unavailable from 8:00 am. We apologise for any inconvenience caused.

Kind regards, Cosmos Management Team

Update

The replacement of the NUMAlink cable on 16 August was successful, and the system no longer reports any errors on this connection.

Kind regards, Cosmos Management Team

Maintenance on 9 August 2017

Dear Users,

We have scheduled a maintenance day on Wednesday, 9 August. Among other tasks, we will attempt to replace the RAID controller that caused problems on cosmos2 on 11 July and 13 June. All systems will therefore be unavailable from 8:00 am. We apologise for any inconvenience caused.

Kind regards,

Cosmos Management Team

Cosmos machines not available

Unfortunately, due to a hardware error, the Cosmos machines are currently not available. We are working to bring them back into service as soon as possible. We apologise for any inconvenience caused.

Kind regards,

Cosmos Management Team

Update 1 Aug 2017 5:00

Cosmos2 and cosmic are now back online.

COSMOS2 rebooted

On 2017-07-24 at about 10:00 UTC, the batch job scheduler started misbehaving on the system. Jobs kept running and new jobs kept being scheduled, but access to the batch queue status was intermittent, as was the ability to submit new jobs.

After repeated attempts to resolve the problem without a reboot failed, cosmos2 was rebooted around midnight, once all jobs had either drained or run over their wall-time limit (the system did not manage to kill the latter).

The system is now healthy again.