UCSF Wynton Status

Queue Metrics

Queue usage during the last day
Queue usage during the last week
Queue usage during the last month
Queue usage during the last year

Past Incidents

September 28 - October 11, 2018

Kernel maintenance

Resolved: The compute nodes have been rebooted and are accepting new jobs. For the record, on Oct 3 (day 5) approx. 300 cores were back online, on Oct 5 (day 7) approx. 600 cores were back online, on Oct 6 (day 8) approx. 1500 cores were back online, and on Oct 7 (day 9) the majority of the 1832 cores were back online.
Oct 11, 09:00 PDT

Notice: On September 28, a kernel update was applied to all compute nodes. To begin running the new kernel, each node must be rebooted. To achieve this as quickly as possible and without any loss of running jobs, the queues on all nodes were disabled (i.e., they stopped accepting new jobs). Each node will reboot itself and re-enable its own queues as soon as all of its running jobs have completed. Since the maximum allowed run time for a job is two weeks, it may take until October 11 before all nodes have been rebooted and are accepting new jobs. In the meantime, there will be fewer available slots on the queue than usual. To follow the progress, see the ‘Available CPU cores’ curve (target 1832 cores) in the graph above.
Sept 28, 16:30 PDT
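
For illustration only, the drain-and-reboot behavior described in the notice above can be sketched as a small simulation. This is not the script used on Wynton: the Node class, the function names, and the example node sizes and job times are all hypothetical, and the time a reboot itself takes is ignored. The sketch assumes a node comes back as soon as its longest-running job has finished.

```python
from dataclasses import dataclass, field
from typing import List

# Maximum allowed run time for a job (two weeks), expressed in hours.
MAX_RUNTIME_HOURS = 14 * 24


@dataclass
class Node:
    """Hypothetical compute node whose queues were disabled at drain time."""
    name: str
    cores: int
    # Hours left for each job that was still running when the queues were disabled.
    remaining_job_hours: List[float] = field(default_factory=list)


def reboot_time(node: Node) -> float:
    """Hours after the drain started at which the node can reboot.

    A node reboots once all of its running jobs have completed, i.e. when
    its longest remaining job ends. An idle node can reboot immediately.
    """
    return max(node.remaining_job_hours, default=0.0)


def cores_online(nodes: List[Node], hours_since_drain: float) -> int:
    """Cores on nodes that have rebooted and re-enabled their queues."""
    return sum(n.cores for n in nodes if reboot_time(n) <= hours_since_drain)


# Made-up example: one idle node, one with short jobs, and one running a
# job that uses the full two-week limit.
example_nodes = [
    Node("node-a", 28),
    Node("node-b", 28, [2.0, 50.0]),
    Node("node-c", 36, [float(MAX_RUNTIME_HOURS)]),
]
```

Under these assumptions, cores_online(example_nodes, t) grows step by step as t increases, which is the shape the 'Available CPU cores' curve is expected to follow. A single job running for the full two weeks keeps its node offline until the end of the window.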

October 1, 2018

Kernel maintenance

Resolved: The login, development, and data transfer hosts have been rebooted.
Oct 1, 13:30 PDT

Notice: On Monday October 1 at 1:00 pm, all of the login, development, and data transfer hosts will be rebooted.
Sept 28, 16:30 PDT

September 13, 2018

Scheduler unreachable

Resolved: Around 11 pm on Wednesday September 12, the SGE scheduler (“qmaster”) became unreachable, so the scheduler could not be queried and no new jobs could be submitted. Jobs that relied on run-time access to the scheduler may have failed. The problem, which was caused by a misconfiguration, was resolved early in the morning on Thursday September 13.
Sept 13, 09:50 PDT

August 1, 2018

Partial shutdown

Resolved: Nodes were rebooted on August 1 shortly after the power came back.
Aug 2, 08:15 PDT

Notice: On Wednesday August 1 at 6:45 am, some of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 30, 20:45 PDT

July 30, 2018

Partial shutdown

Resolved: The nodes brought down during the July 30 partial shutdown have been rebooted. Unfortunately, the same partial shutdown has to be repeated within a few days because the work in the server room was not completed. The exact date for the next shutdown is not known at this point.
Jul 30, 09:55 PDT

Notice: On Monday July 30 at 7:00 am, some of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 29, 21:20 PDT

June 16-26, 2018

Power outage

Resolved: The NVidia-driver issue occurring on some of the GPU compute nodes has been fixed.
Jun 26, 11:55 PDT

Update: Some of the compute nodes with GPUs are still down due to issues with the NVidia drivers.
Jun 19, 13:50 PDT

Update: The login nodes and the development nodes are functional. Some compute nodes that went down are back up, but not all.
Jun 18, 10:45 PDT

Investigating: The UCSF Mission Bay Campus experienced a power outage on Saturday June 16, causing parts of Wynton to go down. One of the login nodes (wynlog1), the development node (qb3-dev1), and some of the compute nodes are currently non-functional.
Jun 17, 15:00 PDT