System Status

System related news items.

Please subscribe to our RSS news feed.

COSMOS2 Recent crash #Update#

Dear users,

Cosmos2 crashed this morning due to a hardware failure, the cause of which is currently being investigated. Updates will be added here.

Kind Regards,

James

## update 04/02 1:30pm ##

The system is back up again. One of the NUMALink cables had failed, and a replacement part has been ordered.

COSMOS2 scheduled for maintenance on 2014-01-29 from 09:00 [FINISHED]

COSMOS2 maintenance

We have scheduled maintenance for 2014-01-29, starting at 09:00, by which time the queues will have been drained. We have reserved a 24-hour window for the maintenance, so unless something unexpected and insurmountable occurs, COSMOS2 will be back online before 09:00 the following morning.

This maintenance is a rescheduling of the one on 2014-01-22 and the goal is to finish the work started last year to replace the faulty PCIe sockets on 4 of the nodes.

Sorry for any inconvenience caused.

Best regards,
Juha

** Update **

Work finished without any problems. The system is back online and the queues have been restarted.

Scheduled downtime on Cosmos2 Wednesday 22/01/14

Cosmos2 maintenance

This Wednesday there is scheduled downtime on Cosmos2, starting at 09:30. This is to finish the work started last year to replace the faulty PCIe sockets on 4 of the nodes.

Queues will be drained by 09:00 on Wednesday morning. The system should be back up by about 3pm.

Sorry for any inconvenience caused.

Cheers,
James

Cosmos/Universe recent disruption

There was some service disruption last night and this afternoon on both of the UV1 systems, cosmos and universe. This is apparently due to some users' jobs not dying properly and turning into zombie processes that slow down or crash the system.

*NB* The DiRAC system Cosmos2 has been unaffected by this.

This has been opened as a case with our scheduler vendor and we should hopefully find out what is going wrong and how to prevent it in future.

If you have trouble connecting to cosmos via SSH, please try universe (and vice versa).

Sorry to those whose jobs were disrupted.

Cheers,

James

** 13/1/14 **

This happened again on Saturday. We are investigating on universe today, so universe will be unavailable. Please log in via cosmos instead.

** update **

Universe is now open again. The cause of the problem is still unclear.

Cosmos short scheduled network downtime 8/01/14 [FINISHED]

Network upgrade

There will be a short network downtime tomorrow morning at 10am, which should last about 15-30 minutes.

This is to upgrade the network switch in the cosmos room. This should fix some of the network issues that we've been having since the power cut in November.

During this downtime you will not be able to access the system via SSH. Jobs running on the systems at the time will be unaffected.

*** 1030 Downtime prolonged ***

The entire network switch turned out to be faulty and has to be replaced.

*** 1330 Finished successfully ***

The network is back up and running again and the queues have been restarted. Sorry for the inconvenience.