UCSF Wynton Status

Queue Metrics

[Graphs: queue usage during the last day, week, month, and year]

Upcoming and Current Incidents

December 12, 2018

Power failure

Investigating: The compute nodes named mac-* (in the Sandler building) went down due to a power failure on Wednesday December 12, starting around 5:50am. Nodes are being rebooted.
Dec 12, 09:05 PDT

November 28-December 19, 2018

Migration to New, Larger, and Faster Storage Space including Users’ Home Space

Update: /wynton/scratch has been taken offline.
Dec 12, 10:20 PDT

Reminder: All of /wynton/scratch will be taken offline and completely wiped starting Wednesday December 12 at 8:00am.
Dec 11, 14:45 PDT

Notice: On Wednesday December 12, 2018, the global scratch space /wynton/scratch will be taken offline and completely erased. Over the following week, we will add to and reconfigure the storage system in order to provide all users with new, larger, and faster (home) storage space. The new storage will be served using BeeGFS, a new and much faster file system that we have prototyped and tested via /wynton/scratch. Once migrated to the new storage, a user’s home directory quota will be increased from 200 GiB to 500 GiB. In order to do this, the following upgrade schedule is planned:

We hope to keep users’ home accounts, the login nodes, the transfer nodes, and the development nodes available throughout this upgrade period.

NOTE: If our new setup proves more challenging than anticipated, we will postpone the SGE downtime until after the holidays, to Wednesday January 9, 2019. Wynton will remain operational over the holidays, though without /wynton/scratch.
Dec 6, 14:30 PDT

Past Incidents

November 8, 2018

Partial shutdown due to planned power outage

Resolved: The cluster is fully functional. It turns out that none of the compute nodes, and therefore none of the running jobs, were affected by the power outage.
Nov 8, 11:00 PDT

Update: The queue-metric graphs are being updated again.
Nov 8, 11:00 PDT

Update: The login nodes, the development nodes and the data transfer node are now functional.
Nov 8, 10:10 PDT

Update: Login node wynlog1 is also affected by the power outage. Use wynlog2 instead.
Nov 8, 09:10 PDT

Notice: Parts of the Wynton cluster will be shut down on November 8 at 4:00am. The shutdown is needed because UCSF Facilities is shutting down power in Byers Hall. Jobs running on affected compute nodes will be terminated abruptly. Compute nodes with battery backup or in other buildings will not be affected. Nodes will be rebooted as soon as the power comes back. To follow the reboot progress, see the ‘Available CPU cores’ curve (target 1832 cores) in the graph above. Unfortunately, the above queue-metric graphs cannot be updated during the power outage.
Nov 7, 15:45 PDT

September 28 - October 11, 2018

Kernel maintenance

Resolved: The compute nodes have been rebooted and are accepting new jobs. For the record, on day 5 approx. 300 cores were back online, on day 7 approx. 600 cores were back online, on day 8 approx. 1500 cores were back online, and on day 9 the majority of the 1832 cores were back online.
Oct 11, 09:00 PDT
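To put these figures in perspective, the short sketch below converts the reported core counts into shares of the 1832-core target. The variable and function names are illustrative only, and the counts are the approximate values quoted above.

```python
# Approximate number of cores back online per day, as quoted in the notice.
TARGET_CORES = 1832
cores_online = {5: 300, 7: 600, 8: 1500}  # day -> approx. cores


def percent_online(cores, target=TARGET_CORES):
    """Return the share of the target core count in percent, to one decimal."""
    return round(100.0 * cores / target, 1)


# Share of the target that was back online on each reported day.
recovery = {day: percent_online(n) for day, n in cores_online.items()}

for day, pct in sorted(recovery.items()):
    print(f"day {day}: approx. {pct}% of {TARGET_CORES} cores")
```

Roughly one sixth of the cores were back by day 5, one third by day 7, and about four fifths by day 8.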

Notice: On September 28, a kernel update was applied to all compute nodes. To begin running the new kernel, each node must be rebooted. To achieve this as quickly as possible and without any loss of running jobs, the queues on the nodes were all disabled (i.e., they stopped accepting new jobs). Each node will reboot itself and re-enable its own queues as soon as all of its running jobs have completed. Since the maximum allowed run time for a job is two weeks, it may take until October 11 before all nodes have been rebooted and are accepting new jobs. In the meantime, there will be fewer available slots on the queue than usual. To follow the progress, see the ‘Available CPU cores’ curve (target 1832 cores) in the graph above.
Sept 28, 16:30 PDT
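The drain-and-reboot procedure described in this notice can be sketched as a small simulation. This is only an illustration under stated assumptions: the node names, core counts, job end times, and the exact time the queues were disabled are all hypothetical, and the time a reboot itself takes is ignored.

```python
from datetime import datetime


def reboot_time(queues_disabled_at, job_end_times):
    """A drained node reboots once its last running job has finished.

    A node with no running jobs can reboot as soon as its queues are disabled.
    """
    return max([queues_disabled_at, *job_end_times])


def cores_available(nodes, queues_disabled_at, when):
    """Count cores on nodes that have rebooted (and re-enabled queues) by `when`."""
    return sum(
        node["cores"]
        for node in nodes
        if reboot_time(queues_disabled_at, node["job_ends"]) <= when
    )


# Hypothetical data: three nodes with different amounts of running work.
disabled_at = datetime(2018, 9, 28, 16, 30)  # assumed time of queue disabling
nodes = [
    {"name": "node-a", "cores": 24, "job_ends": []},
    {"name": "node-b", "cores": 24, "job_ends": [datetime(2018, 9, 30, 8, 0)]},
    {"name": "node-c", "cores": 32, "job_ends": [datetime(2018, 10, 2, 12, 0),
                                                 datetime(2018, 10, 9, 18, 0)]},
]

for day in (datetime(2018, 9, 28, 16, 30), datetime(2018, 10, 1), datetime(2018, 10, 10)):
    print(day.date(), cores_available(nodes, disabled_at, day), "cores available")
```

In this model a node with a single long-running job stays offline until that job ends, which is why the number of available cores recovers gradually rather than all at once.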

October 1, 2018

Kernel maintenance

Resolved: The login, development, and data transfer hosts have been rebooted.
Oct 1, 13:30 PDT

Notice: On Monday October 1 at 1:00 pm, all of the login, development, and data transfer hosts will be rebooted.
Sept 28, 16:30 PDT

September 13, 2018

Scheduler unreachable

Resolved: Around 11pm on Wednesday September 12, the SGE scheduler (“qmaster”) became unreachable, so the scheduler could not be queried and no new jobs could be submitted. Jobs that relied on run-time access to the scheduler may have failed. The problem, which was caused by a misconfiguration, was resolved early in the morning on Thursday September 13.
Sept 13, 09:50 PDT

August 1, 2018

Partial shutdown

Resolved: Nodes were rebooted on August 1 shortly after the power came back.
Aug 2, 08:15 PDT

Notice: On Wednesday August 1 at 6:45am, some of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 30, 20:45 PDT

July 30, 2018

Partial shutdown

Resolved: The nodes brought down during the July 30 partial shutdown have been rebooted. Unfortunately, the same partial shutdown has to be repeated within a few days because the work in the server room was not completed. The exact date for the next shutdown is not known at this point.
Jul 30, 09:55 PDT

Notice: On Monday July 30 at 7:00am, some of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 29, 21:20 PDT

June 16-26, 2018

Power outage

Resolved: The NVIDIA driver issue occurring on some of the GPU compute nodes has been fixed.
Jun 26, 11:55 PDT

Update: Some of the compute nodes with GPUs are still down due to issues with the NVIDIA drivers.
Jun 19, 13:50 PDT

Update: The login nodes and the development nodes are functional. Some compute nodes that went down are back up, but not all.
Jun 18, 10:45 PDT

Investigating: The UCSF Mission Bay Campus experienced a power outage on Saturday June 16, causing parts of Wynton to go down. One of the login nodes (wynlog1), the development node (qb3-dev1), and some of the compute nodes are currently non-functional.
Jun 17, 15:00 PDT