UCSF Wynton Status

Queue Metrics

Queue usage during the last day
Queue usage during the last week
Queue usage during the last month
Queue usage during the last year

Compute Nodes

All compute nodes are functional.

Upcoming and Current Incidents

Past Incidents

March 21-April 5, 2019

Kernel maintenance

Resolved: All compute nodes have been rebooted.
April 5, 12:00 PDT

Update: Nearly all compute nodes have been rebooted (~5,200 cores are now available).
Mar 29, 12:00 PDT

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. Each node will be rebooted as soon as all of its running jobs have completed, which may take up to two weeks (the maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 5,424 cores) in the graph above.
Mar 21, 15:30 PDT
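For reference, the slot availability shown in the graphs above can also be inspected from the command line on a login or development node. The snippet below is a minimal sketch assuming the standard SGE client tools (e.g. qstat) are on the PATH; the queue name long.q is only an example and may not match an actual Wynton queue.

```sh
# Cluster-wide queue summary: the AVAIL column shows how many slots
# currently accept new jobs, TOTAL the configured maximum.
qstat -g c

# Narrow the summary down to a single queue of interest
# (replace long.q with an actual queue name from the output above).
qstat -g c | grep -E 'CLUSTER QUEUE|long\.q'
```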

March 22, 2019

Kernel maintenance

Resolved: The login, development and transfer hosts have been rebooted.
March 22, 10:35 PDT

Notice: On Friday March 22 at 10:30am, all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. These hosts should be offline for less than 5 minutes.
Mar 21, 15:30 PDT

January 22-February 5, 2019

Kernel maintenance

Resolved: All compute nodes have been rebooted.
Feb 5, 11:30 PDT

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. Each node will be rebooted as soon as all of its running jobs have completed, which may take up to two weeks (the maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 1,944 cores) in the graph above.
Jan 22, 16:45 PDT

January 23, 2019

Kernel maintenance

Resolved: The login, development and transfer hosts have been rebooted.
Jan 23, 13:00 PDT

Notice: On Wednesday January 23 at 12:00pm (noon), all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. The hosts should be offline for less than 5 minutes.
Jan 22, 16:45 PDT

January 14, 2019

Blocking file-system issues

Resolved: The file system under /wynton/ is back up again. We are looking into the cause and taking steps to prevent this from happening again.
Jan 14, 12:45 PDT

Investigating: The file system under /wynton/ went down around 11:30 am, resulting in several critical failures, including the scheduler failing.
Jan 14, 11:55 PDT

January 9, 2019

Job scheduler maintenance downtime

Resolved: The SGE job scheduler is now back online and is accepting new job submissions again.
Jan 9, 12:45 PDT

Update: The downtime of the job scheduler will begin on Wednesday January 9 at noon and is expected to be completed by 1:00pm.
Jan 8, 16:00 PDT

Notice: There will be a short job-scheduler downtime on Wednesday January 9 due to SGE maintenance. During this downtime, already running jobs will keep running and queued jobs will remain in the queue, but no new jobs can be submitted.
Dec 20, 12:00 PDT

January 8, 2019

File-system server crash

Investigating: One of the parallel file-system servers (BeeGFS) appears to have crashed on Monday January 7 at 7:30pm and was recovered at 9:20pm. We are currently monitoring its stability and investigating the cause and what impact it might have had. We believe users might have experienced I/O errors on /wynton/scratch/, whereas /wynton/home/ was not affected.
Jan 8, 10:15 PDT

December 21, 2018

Partial file system failure

Resolved: Parts of the new BeeGFS file system were non-functional for approximately 1.5 hours on Friday December 21 when a brief maintenance task failed.
Dec 21, 20:50 PDT

December 12-20, 2018

Nodes down

Resolved: All but one of the msg-* compute nodes are operational.
Dec 20, 16:40 PDT

Notice: Starting Wednesday December 12 around 11am, several msg-* compute nodes went down (~200 cores in total). The cause of this is unknown. Because it might be related to the BeeGFS migration project, the troubleshooting of this incident will most likely not start until the BeeGFS project is completed, which is projected to be done on Wednesday December 19.
Dec 17, 17:00 PDT

December 18, 2018

Development node does not respond

Resolved: Development node qb3-dev1 is functional.
Dec 18, 20:50 PDT

Investigating: Development node qb3-dev1 does not respond to SSH. This will be investigated first thing tomorrow morning (Wednesday December 19). In the meantime, development node qb3-gpudev1, which is “under construction”, may be used.
Dec 18, 16:30 PDT

November 28-December 19, 2018

Installation of new, larger, and faster storage space

Resolved: /wynton/scratch is now back online and ready to be used.
Dec 19, 14:20 PDT

Update: The plan is to bring /wynton/scratch back online before the end of the day tomorrow (Wednesday December 19). The planned SGE downtime has been rescheduled to Wednesday January 9. Moreover, we will start providing the new 500-GiB /wynton/home/ storage to users who explicitly request it (before Friday December 21) and who also promise to move the content under their current /netapp/home/ to the new location. Sorry, users on both QB3 and Wynton will not be able to migrate until the QB3 cluster has been incorporated into Wynton HPC (see Roadmap) or until they give up their QB3 account.
Dec 18, 16:45 PDT

Update: The installation and migration to the new BeeGFS parallel file servers is on track, and we expect to go live as planned on Wednesday December 19. We are fine-tuning the configuration and running performance and resilience tests.
Dec 17, 10:15 PDT

Update: /wynton/scratch has been taken offline.
Dec 12, 10:20 PDT

Reminder: All of /wynton/scratch will be taken offline and completely wiped starting Wednesday December 12 at 8:00am.
Dec 11, 14:45 PDT

Notice: On Wednesday December 12, 2018, the global scratch space /wynton/scratch will be taken offline and completely erased. Over the following week, we will be adding to and reconfiguring the storage system in order to provide all users with new, larger, and faster (home) storage space. The new storage will be served using BeeGFS, a new, much faster file system that we have been prototyping and testing via /wynton/scratch. Once migrated to the new storage, a user’s home directory quota will be increased from 200 GiB to 500 GiB. In order to do this, the following upgrade schedule is planned:

It is our hope to be able to keep the users’ home accounts, the login nodes, the transfer nodes, and the development nodes available throughout this upgrade period.

NOTE: If our new setup proves more challenging than anticipated, then we will postpone the SGE downtime to after the holidays, on Wednesday January 9, 2019. Wynton will remain operational over the holidays, though without /wynton/scratch.
Dec 6, 14:30 PDT [edited Dec 18, 17:15 PDT]
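Once the migration to BeeGFS is complete, usage against the new 500-GiB home quota can be checked with the BeeGFS client tools. The command below is a sketch only; it assumes the beegfs-ctl utility is installed on the login/development nodes and that quota tracking is enabled for the /wynton/home/ file system, so the exact invocation on Wynton may differ.

```sh
# Report the current BeeGFS disk and chunk-file usage, plus the
# configured limits, for your own user account.
beegfs-ctl --getquota --uid "$(id -u)"
```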

December 12-14, 2018

Power failure

Resolved: All mac-* compute nodes are up and functional.
Dec 14, 12:00 PDT

Investigating: The compute nodes named mac-* (in the Sandler building) went down due to power failure on Wednesday December 12 starting around 5:50am. Nodes are being rebooted.
Dec 12, 09:05 PDT

November 8, 2018

Partial shutdown due to planned power outage

Resolved: The cluster is fully functional. It turns out that none of the compute nodes, and therefore none of the running jobs, were affected by the power outage.
Nov 8, 11:00 PDT

Update: The queue-metric graphs are being updated again.
Nov 8, 11:00 PDT

Update: The login nodes, the development nodes and the data transfer node are now functional.
Nov 8, 10:10 PDT

Update: Login node wynlog1 is also affected by the power outage. Use wynlog2 instead.
Nov 8, 09:10 PDT

Notice: Parts of the Wynton cluster will be shut down on November 8 at 4:00am. This shutdown takes place because UCSF Facilities is shutting down power in Byers Hall. Jobs running on affected compute nodes will be terminated abruptly. Compute nodes with battery backup or in other buildings will not be affected. Nodes will be rebooted as soon as the power comes back. To follow the reboot progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above. Unfortunately, the above queue-metric graphs cannot be updated during the power outage.
Nov 7, 15:45 PDT

September 28 - October 11, 2018

Kernel maintenance

Resolved: The compute nodes have been rebooted and are accepting new jobs. For the record, on day 5 approx. 300 cores were back online, on day 7 approx. 600 cores, on day 8 approx. 1,500 cores, and on day 9 the majority of the 1,832 cores were back online.
Oct 11, 09:00 PDT

Notice: On September 28, a kernel update was applied to all compute nodes. To begin running the new kernel, each node must be rebooted. To achieve this as quickly as possible and without any loss of running jobs, the queues on the nodes were all disabled (i.e., they stopped accepting new jobs). Each node will reboot itself and re-enable its own queues as soon as all of its running jobs have completed. Since the maximum allowed run time for a job is two weeks, it may take until October 11 before all nodes have been rebooted and are accepting new jobs. In the meantime, there will be fewer available slots on the queue than usual. To follow the progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above.
Sept 28, 16:30 PDT

October 1, 2018

Kernel maintenance

Resolved: The login, development, and data transfer hosts have been rebooted.
Oct 1, 13:30 PDT

Notice: On Monday October 1 at 1:00 pm, all of the login, development, and data transfer hosts will be rebooted.
Sept 28, 16:30 PDT

September 13, 2018

Scheduler unreachable

Resolved: Around 11pm on Wednesday September 12, the SGE scheduler (“qmaster”) became unreachable, such that the scheduler could not be queried and no new jobs could be submitted. Jobs that relied on run-time access to the scheduler may have failed. The problem, which was caused by a misconfiguration, was resolved early in the morning on Thursday September 13.
Sept 13, 09:50 PDT

August 1, 2018

Partial shutdown

Resolved: Nodes were rebooted on August 1 shortly after the power came back.
Aug 2, 08:15 PDT

Notice: On Wednesday August 1 at 6:45am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 30, 20:45 PDT

July 30, 2018

Partial shutdown

Resolved: The nodes brought down during the July 30 partial shutdown have been rebooted. Unfortunately, the shutdown will have to be repeated within a few days because the work in the server room was not completed. The exact date for the next shutdown is not yet known.
Jul 30, 09:55 PDT

Notice: On Monday July 30 at 7:00am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton’s server rooms.
Jul 29, 21:20 PDT

June 16-26, 2018

Power outage

Resolved: The NVidia-driver issue occurring on some of the GPU compute nodes has been fixed.
Jun 26, 11:55 PDT

Update: Some of the compute nodes with GPUs are still down due to issues with the NVidia drivers.
Jun 19, 13:50 PDT

Update: The login nodes and the development nodes are functional. Some compute nodes that went down are back up, but not all.
Jun 18, 10:45 PDT

Investigating: The UCSF Mission Bay Campus experienced a power outage on Saturday June 16 causing parts of Wynton to go down. One of the login nodes (wynlog1), the development node (qb3-dev1), and parts of the compute nodes are currently non-functional.
Jun 17, 15:00 PDT