UCSF Wynton HPC Status

Queue Metrics

Graphs: queue usage and GPU queue usage during the last day, week, month, and year.

Compute Nodes

All compute nodes are functional.

Upcoming and Current Incidents

Current Incidents

Past Incidents

August 19, 2020

Cluster inaccessible (due to BeeGFS issues)

Resolved: Our BeeGFS file system was non-responsive between 17:22 and 18:52 today because one of its metadata servers hung while the other attempted to synchronize to it.
August 19, 19:00 PDT

Notice: The cluster is currently inaccessible for unknown reasons. The problem was first reported around 17:30 today.
August 19, 18:15 PDT

August 10-13, 2020

Network and hardware upgrades (full downtime)

Resolved: The cluster is fully back up and running. Several compute nodes still need to be rebooted, but we consider this upgrade cycle completed. The network upgrade took longer than expected, which delayed the process. We hope to bring the new lab storage online during the next week.
August 13, 21:00 PDT

Update: All login, data-transfer, and development nodes are online. Additional compute nodes are being upgraded and will soon rejoin the pool serving jobs.
August 13, 14:50 PDT

Update: Login node log1, data-transfer node dt2, and the development nodes are available again. Compute nodes are going through an upgrade cycle and will soon start serving jobs again. The upgrade work is taking longer than expected and will continue tomorrow Thursday August 13.
August 12, 16:10 PDT

Notice: All of the Wynton HPC environment is down for maintenance and upgrades.
August 10, 00:00 PDT

Notice: Starting early Monday August 10, the cluster will be powered down entirely for maintenance and upgrades, which includes upgrading the network and adding lab storage purchased by several groups. We anticipate that the cluster will be available again by the end of Wednesday August 12.
July 24, 15:45 PDT

July 6, 2020

Development node failures

Resolved: All three development nodes have been rebooted.
July 6, 15:20 PDT

Notice: The three regular development nodes have all gotten hung up on one particular process. This affects basic system operations and prevents basic commands such as ps and w from completing. To clear this state, we’ll be doing an emergency reboot of the dev nodes at about 15:15.
July 6, 15:05 PDT

July 5, 2020

Job scheduler not working

Resolved: The SGE scheduler produced errors when queried or when jobs were submitted or launched. The problem started at 00:30 and lasted until 02:45 early Sunday 2020-07-05.
July 6, 22:00 PDT

June 11-26, 2020

Kernel maintenance

Resolved: All compute nodes have been rebooted.
June 26, 10:45 PDT

Update: Development node dev3 is back online.
June 15, 11:15 PDT

Update: Development node dev3 is not available. It failed to reboot and requires on-site attention, which might not be possible for several days. All other log-in, data-transfer, and development nodes were rebooted successfully.
June 11, 15:45 PDT

Notice: New operating-system kernels have been deployed. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~10,400 cores) in the graph above. Log-in, data-transfer, and development nodes will be rebooted at 15:30 on Thursday June 11.
June 11, 10:45 PDT
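For reference, the number of free slots can also be checked directly from a login or development node by querying SGE. A minimal sketch, assuming the standard SGE client tool qstat is on the PATH (the exact queue names and columns depend on the local configuration):

    # Summarize all cluster queues; the AVAIL column lists free slots per queue
    qstat -g c

    # List your own pending and running jobs (skipping the two header lines)
    qstat -u "$USER" | tail -n +3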

June 5-9, 2020

No internet access on development nodes

Resolved: Internet access from the development nodes is available again. A new web-proxy server had to be built and deployed.
June 9, 09:15 PDT

Notice: Internet access from the development nodes is not available. This is because the proxy server providing them with internet access had a critical hardware failure around 08:00-09:00 this morning. At the moment, we cannot provide an estimate for when this server will be restored.
June 5, 16:45 PDT

May 18-22, 2020

File-system maintenance

Update: The upgrade of the BeeGFS filesystem introduced new issues. We decided to roll back the upgrade and are working with the vendor. No new upgrade is planned for the near term.
June 8, 09:00 PDT

Update: The BeeGFS filesystem has been upgraded using a patch from the vendor. The patch was designed to lower the amount of resynchronization needed between the two metadata servers. Unfortunately, after the upgrade we have observed an increase in resynchronization. We will keep monitoring the status. If the problem remains, we will consider a rollback to the BeeGFS version used prior to May 18.
May 22, 01:25 PDT

Update: For a short moment around 01:00 early Friday, both of our BeeGFS metadata servers were down. This may have led to some applications experiencing I/O errors around this time.
May 22, 01:25 PDT

Notice: Work to improve the stability of the BeeGFS filesystem (/wynton) will be conducted during the week of May 18-22. This involves restarting the eight pairs of metadata server processes, which may result in several brief stalls of the file system. Each should last less than 5 minutes and operations will continue normally after each one.
May 6, 15:10 PDT

May 28-29, 2020

GPU compute nodes outage

Resolved: The GPU compute nodes are now fully available to serve jobs.
May 29, 12:00 PDT

Update: The GPU compute nodes that went down yesterday have been rebooted.
May 29, 11:10 PDT

Investigating: A large number of GPU compute nodes in the MSG data center are currently down for unknown reasons. We are investigating the cause.
May 28, 09:35 PDT

February 5-7, 2020

Major outage due to NetApp file-system failure

Resolved: The Wynton HPC system is considered fully functional again. The legacy, deprecated NetApp storage was lost.
February 10, 10:55 PST

Update: The majority of the compute nodes have been rebooted and are now online and running jobs. We will actively monitor the system and assess how everything works before we consider this incident resolved.
February 7, 13:40 PST

Update: The login, development and data transfer nodes will be rebooted at 01:00 today Friday February 7.
February 7, 12:00 PST

Update: The failed legacy NetApp server is the cause of the problems, e.g. compute nodes becoming non-responsive, which in turn causes problems for SGE. Because of this, the whole cluster - login, development, transfer, and compute nodes - will be rebooted tomorrow, Friday 2020-02-07.
February 6, 10:00 PST

Notice: Wynton HPC is experiencing major issues due to a NetApp file-system failure, despite this storage being deprecated and not used much these days. The first user report came in around 09:00 and the job-queue logs suggest the problem began around 02:00. It will take a while for everything to come back up, and there will be a brief BeeGFS outage while we reboot the BeeGFS management node.
February 5, 10:15 PST

January 29, 2020

BeeGFS failure

Resolved: The BeeGFS file-system issue has been resolved by rebooting two metadata servers.
January 29, 17:00 PST

Notice: There is currently an issue with the BeeGFS file system. Users are reporting that they cannot log in.
January 29, 16:00 PST

January 22, 2020

File-system maintenance

Resolved: The BeeGFS upgrade issue has been resolved.
Jan 22, 14:30 PST

Update: The planned upgrade caused unexpected problems to the BeeGFS file system resulting in /wynton/group becoming unstable.
Jan 22, 13:35 PST

Notice: One of the BeeGFS servers, which serve our cluster-wide file system, will be swapped out starting at noon (11:59am) on Wednesday January 22, 2020 and the work is expected to last one hour. We don’t anticipate any downtime because the BeeGFS servers are mirrored for availability.
Jan 16, 14:40 PST

December 20, 2019 - January 4, 2020

Kernel maintenance

Resolved: All compute nodes have been updated and rebooted.
Jan 4, 11:00 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~7,500 cores) in the graph above. Log-in, data-transfer, and development nodes will be rebooted at 15:30 on Friday December 20. GPU nodes already run the new kernel and are not affected.
December 20, 10:20 PST

December 22, 2019

BeeGFS failure

Resolved: No further reboots were needed during the BeeGFS resynchronization. Everything is working as expected.
December 23, 10:00 PST

Update: The login issues occurred because the responsiveness of one of the BeeGFS file servers became unreliable around 04:20. Rebooting that server resolved the problem. The cluster is fully functional again, although slower than usual until the file system has been resynced. After that, one more brief reboot might be needed.
December 22, 14:40 PST

Notice: It is not possible to log in to the Wynton HPC environment. The reason is currently not known.
December 22, 09:15 PST

December 18, 2019

Network/login issues

Resolved: The Wynton HPC environment is fully functional again. The BeeGFS filesystem was not working properly during 18:30-22:10 on December 18 resulting in no login access to the cluster and job file I/O being backed up.
December 19, 08:50 PST

Update: The BeeGFS filesystem is non-responsive, which we believe is due to the network switch upgrade.
December 18, 21:00 PST

Notice: One of two network switches will be upgraded on Wednesday December 18 starting at 18:00 and lasting a few hours. We do not expect this to impact the Wynton HPC environment other than slowing down the network performance to 50%.
December 17, 10:00 PST

October 29-November 11, 2019

Kernel maintenance

Resolved: All compute nodes have been updated and rebooted.
Nov 11, 01:00 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). GPU nodes will be rebooted as soon as all GPU jobs complete. During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~7,000 cores) in the graph above.
Oct 29, 16:30 PDT

October 25, 2019

Byers Hall power outage glitch

Resolved: Development node qb3-dev2 was rebooted. Data-transfer node dt1.wynton.ucsf.edu is kept offline because it is scheduled to be upgraded next week.
October 28, 15:00 PDT

Update: Most compute nodes that went down due to the power glitch have been rebooted. Data-transfer node dt1.wynton.ucsf.edu and development node qb3-dev2 are still down - they will be brought back online on Monday October 28.
October 25, 14:00 PDT

Notice: A very brief power outage in the Byers Hall building caused several compute nodes in its Data Center to go down. Jobs that were running on those compute nodes at the time of the power failure did unfortunately fail. Log-in, data-transfer, and development nodes were also affected. All these hosts are currently being rebooted.
October 25, 13:00 PDT

October 24, 2019

Login non-functional

Resolved: Log in works again.
October 24, 09:45 PDT

Notice: It is not possible to log in to the Wynton HPC environment. This is due to a recent misconfiguration of the LDAP server.
October 24, 09:30 PDT

October 22-23, 2019

BeeGFS failure

Resolved: The Wynton HPC BeeGFS file system is fully functional again. During the outage, /wynton/group and /wynton/scratch were not working properly, whereas /wynton/home was unaffected.
October 23, 10:35 PDT

Notice: The Wynton HPC BeeGFS file system is non-functional. It is expected to be resolved by noon on October 23. The underlying problem is that the power backup at the Diller data center did not work as expected during a planned power maintenance.
October 22, 21:45 PDT

September 24, 2019

BeeGFS failure

Resolved: The Wynton HPC environment is up and running again.
September 24, 20:25 PDT

Notice: The Wynton HPC environment is nonresponsive. The problem is being investigated.
September 24, 17:30 PDT

August 23, 2019

BeeGFS failure

Resolved: The Wynton HPC environment is up and running again. The reason for this downtime was that the BeeGFS file server became nonresponsive.
August 23, 20:45 PDT

Notice: The Wynton HPC environment is nonresponsive.
August 23, 16:45 PDT

August 15, 2019

Power outage

Resolved: The Wynton HPC environment is up and running again.
August 15, 21:00 PDT

Notice: The Wynton HPC environment is down due to a non-planned power outage at the Diller data center. Jobs running on compute nodes located in that data center were terminated. Jobs running elsewhere may also have been affected because /wynton/home went down as well (despite it being mirrored).
August 15, 15:45 PDT

July 30, 2019

Power outage

Resolved: The Wynton HPC environment is up and running again.
July 30, 14:40 PDT

Notice: The Wynton HPC environment is down due to a non-planned power outage at the main data center.
July 30, 08:20 PDT

July 8-12, 2019

Full system downtime

Resolved: The Wynton HPC environment and the BeeGFS file system are fully functional after updates and upgrades.
July 12, 11:15 PDT

Notice: The Wynton HPC environment is down for maintenance.
July 8, 12:00 PDT

Notice: Updates to the BeeGFS file system and the operating system that require bringing down all of Wynton HPC will start on the morning of Monday July 8. Please make sure to log out before then. The downtime might last the full week.
July 1, 14:15 PDT

June 17-18, 2019

Significant file-system outage

Resolved: The BeeGFS file system is fully functional again.
June 18, 01:30 PDT

Investigating: Parts of /wynton/scratch and /wynton/group are currently unavailable. The /wynton/home space should be unaffected.
June 17, 15:05 PDT

May 17, 2019

Major outage due to file-system issues

Resolved: The BeeGFS file system and the cluster is functional again.
May 17, 16:00 PDT

Investigating: There is a major slowdown of the BeeGFS file system (/wynton), which in turn causes significant problems throughout the Wynton HPC environment.
May 17, 10:45 PDT

May 15-16, 2019

Major outage due to file-system issues

Resolved: The BeeGFS file system, and thereby also the cluster itself, is functional again.
May 16, 10:30 PDT

Investigating: The BeeGFS file system (/wynton) is experiencing major issues. This has caused all of Wynton HPC to become non-functional.
May 15, 10:00 PDT

May 15, 2019

Network/login issues

Resolved: The UCSF-wide network issue that affected access to Wynton HPC has been resolved.
May 15, 15:30 PDT

Update: The login issue is related to UCSF-wide network issues.
May 15, 13:30 PDT

Investigating: There are issues logging in to Wynton HPC.
May 15, 10:15 PDT

March 21-April 5, 2019

Kernel maintenance

Resolved: All compute nodes have been rebooted.
April 5, 12:00 PDT

Update: Nearly all compute nodes have been rebooted (~5,200 cores are now available).
Mar 29, 12:00 PDT

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 5,424 cores) in the graph above.
Mar 21, 15:30 PDT

March 22, 2019

Kernel maintenance

Resolved: The login, development and transfer hosts have been rebooted.
March 22, 10:35 PDT

Notice: On Friday March 22 at 10:30am, all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. These hosts should be offline for less than 5 minutes.
Mar 21, 15:30 PDT

January 22-February 5, 2019

Kernel maintenance

Resolved: All compute nodes have been rebooted.
Feb 5, 11:30 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 1,944 cores) in the graph above.
Jan 22, 16:45 PST

January 23, 2019

Kernel maintenance

Resolved: The login, development and transfer hosts have been rebooted.
Jan 23, 13:00 PST

Notice: On Wednesday January 23 at 12:00 (noon), all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. The hosts should be offline for less than 5 minutes.
Jan 22, 16:45 PST

January 14, 2019

Blocking file-system issues

Resolved: The file system under /wynton/ is back up again. We are looking into the cause and taking steps to prevent this from happening again.
Jan 14, 12:45 PST

Investigating: The file system under /wynton/ went down around 11:30, resulting in several critical failures, including the scheduler failing.
Jan 14, 11:55 PST

January 9, 2019

Job scheduler maintenance downtime

Resolved: The SGE job scheduler is now back online and accepts new job submission again.
Jan 9, 12:45 PST

Update: The downtime of the job scheduler will begin on Wednesday January 9 @ noon and is expected to be completed by 1:00pm.
Jan 8, 16:00 PST

Notice: There will be a short job-scheduler downtime on Wednesday January 9 due to SGE maintenance. During this downtime, already running jobs will keep running and queued jobs will remain in the queue, but no new jobs can be submitted.
Dec 20, 12:00 PST

January 8, 2019

File-system server crash

Investigating: One of the parallel file-system servers (BeeGFS) appears to have crashed on Monday January 7 at 07:30 and was recovered at 9:20pm. Right now we are monitoring its stability and investigating the cause and what impact it might have had. Currently, we believe users might have experienced I/O errors on /wynton/scratch/, whereas /wynton/home/ was not affected.
Jan 8, 10:15 PST

December 21, 2018

Partial file system failure

Resolved: Parts of the new BeeGFS file system were non-functional for approx. 1.5 hours on Friday December 21 when a brief maintenance task failed.
Dec 21, 20:50 PST

December 12-20, 2018

Nodes down

Resolved: All of the msg-* compute nodes but one are operational.
Dec 20, 16:40 PST

Notice: Starting Wednesday December 12 around 11:00, several msg-* compute nodes went down (~200 cores in total). The cause of this is unknown. Because it might be related to the BeeGFS migration project, the troubleshooting of this incident will most likely not start until the BeeGFS project is completed, which is projected to be done on Wednesday December 19.
Dec 17, 17:00 PST

December 18, 2018

Development node does not respond

Resolved: Development node qb3-dev1 is functional.
Dec 18, 20:50 PST

Investigating: Development node qb3-dev1 does not respond to SSH. This will be investigated first thing tomorrow morning (Wednesday December 19). In the meanwhile, development node qb3-gpudev1, which is “under construction”, may be used.
Dec 18, 16:30 PST

November 28-December 19, 2018

Installation of new, larger, and faster storage space

Resolved: /wynton/scratch is now back online and ready to be used.
Dec 19, 14:20 PST

Update: The plan is to bring /wynton/scratch back online before the end of the day tomorrow (Wednesday December 19). The planned SGE downtime has been rescheduled to Wednesday January 9. Moreover, we will start providing the new 500-GiB /wynton/home/ storage to users who explicitly request it (before Friday December 21) and who also promise to move the content under their current /netapp/home/ to the new location. Sorry, users on both QB3 and Wynton HPC will not be able to migrate until the QB3 cluster has been incorporated into Wynton HPC (see Roadmap) or they give up their QB3 account.
Dec 18, 16:45 PST

Update: The installation and migration to the new BeeGFS parallel file servers is on track and we expect to go live as planned on Wednesday December 19. We are working on fine-tuning the configuration and running performance and resilience tests.
Dec 17, 10:15 PST

Update: /wynton/scratch has been taken offline.
Dec 12, 10:20 PST

Reminder: All of /wynton/scratch will be taken offline and completely wiped starting Wednesday December 12 at 8:00am.
Dec 11, 14:45 PST

Notice: On Wednesday December 12, 2018, the global scratch space /wynton/scratch will be taken offline and completely erased. Over the week following this, we will be adding to and reconfiguring the storage system in order to provide all users with new, larger, and faster (home) storage space. The new storage will be served using BeeGFS, a new, much faster file system that we have been prototyping and testing via /wynton/scratch. Once migrated to the new storage, a user’s home directory quota will be increased from 200 GiB to 500 GiB. In order to do this, the following upgrade schedule is planned:

  • Wednesday November 28-December 19 (21 days): To all users, please refrain from using /wynton/scratch - use local, node-specific /scratch if possible (see below). The sooner we can take it down, the higher the chance is that we can get everything in place before December 19.

  • Wednesday December 12-19 (8 days): /wynton/scratch will be unavailable and completely wiped. For computational scratch space, please use local /scratch unique to each compute node (see the job-script sketch after this notice). For global scratch needs, the old and much slower /scrapp and /scrapp2 may also be used.

  • Wednesday December 19, 2018 (1/2 day): The Wynton HPC scheduler (SGE) will be taken offline. No jobs will be able to be submitted until it is restarted.

  • Wednesday December 19, 2018: The upgraded Wynton HPC with the new storage will be available including /wynton/scratch.

  • Wednesday January 9, 2019 (1/2 day): The Wynton HPC scheduler (SGE) will be taken offline temporarily. No jobs will be able to be submitted until it is restarted.

It is our hope to be able to keep users’ home accounts, login nodes, transfer nodes, and development nodes available throughout this upgrade period.

NOTE: If our new setup proves more challenging than anticipated, then we will postpone the SGE downtime to after the holidays, on Wednesday January 9, 2019. Wynton HPC will remain operational over the holidays, though without /wynton/scratch.
Dec 6, 14:30 PST [edited Dec 18, 17:15 PST]
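For users adapting their workflows to node-local /scratch, here is a minimal job-script sketch. It is illustrative only: the staging paths, results directory, and program name (my_analysis) are placeholders, and $JOB_ID is the job identifier set by SGE.

    #!/bin/bash
    #$ -S /bin/bash    # interpret the job script with bash
    #$ -cwd            # start in the submission directory

    # Create a private workspace on the node-local /scratch
    WORKDIR="/scratch/$USER/$JOB_ID"
    mkdir -p "$WORKDIR"

    # Stage input, run the computation, and copy results back home (placeholders)
    cp "$HOME/input.dat" "$WORKDIR/"
    cd "$WORKDIR"
    "$HOME/bin/my_analysis" input.dat > output.dat
    cp output.dat "$HOME/results/"

    # Clean up the node-local workspace before the job ends
    rm -rf "$WORKDIR"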

December 12-14, 2018

Power failure

Resolved: All mac-* compute nodes are up and functional.
Dec 14, 12:00 PST

Investigating: The compute nodes named mac-* (in the Sandler building) went down due to a power failure on Wednesday December 12, starting around 05:50. Nodes are being rebooted.
Dec 12, 09:05 PST

November 8, 2018

Partial shutdown due to planned power outage

Resolved: The cluster is fully functional. It turns out that none of the compute nodes, and therefore none of the running jobs, were affected by the power outage.
Nov 8, 11:00 PST

Update: The queue-metric graphs are being updated again.
Nov 8, 11:00 PST

Update: The login nodes, the development nodes and the data transfer node are now functional.
Nov 8, 10:10 PST

Update: Login node wynlog1 is also affected by the power outage. Use wynlog2 instead.
Nov 8, 09:10 PST

Notice: Parts of the Wynton HPC cluster will be shut down on November 8 at 4:00am. This shutdown takes place because UCSF Facilities is shutting down power in Byers Hall. Jobs running on affected compute nodes will be terminated abruptly. Compute nodes with battery backup or in other buildings will not be affected. Nodes will be rebooted as soon as the power comes back. To follow the reboot progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above. Unfortunately, the above queue-metric graphs cannot be updated during the power outage.
Nov 7, 15:45 PST

September 28 - October 11, 2018

Kernel maintenance

Resolved: The compute nodes have been rebooted and are accepting new jobs. For the record, on day 5 approx. 300 cores were back online, on day 7 approx. 600 cores were back online, on day 8 approx. 1,500 cores were back online, and on day 9 the majority of the 1,832 cores were back online.
Oct 11, 09:00 PDT

Notice: On September 28, a kernel update was applied to all compute nodes. To begin running the new kernel, each node must be rebooted. To achieve this as quickly as possible and without any loss of running jobs, the queues on the nodes were all disabled (i.e., they stopped accepting new jobs). Each node will reboot itself and re-enable its own queues as soon as all of its running jobs have completed. Since the maximum allowed run time for a job is two weeks, it may take until October 11 before all nodes have been rebooted and are accepting new jobs. In the meanwhile, there will be fewer available slots on the queue than usual. To follow the progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above.
Sept 28, 16:30 PDT

October 1, 2018

Kernel maintenance

Resolved: The login, development, and data transfer hosts have been rebooted.
Oct 1, 13:30 PDT

Notice: On Monday October 1 at 01:00, all of the login, development, and data transfer hosts will be rebooted.
Sept 28, 16:30 PDT

September 13, 2018

Scheduler unreachable

Resolved: Around 11:00 on Wednesday September 12, the SGE scheduler (“qmaster”) became unreachable such that the scheduler could not be queried and no new jobs could be submitted. Jobs that relied on run-time access to the scheduler may have failed. The problem, which was due to a misconfiguration being introduced, was resolved early morning on Thursday September 13.
Sept 13, 09:50 PDT

August 1, 2018

Partial shutdown

Resolved: Nodes were rebooted on August 1 shortly after the power came back.
Aug 2, 08:15 PDT

Notice: On Wednesday August 1 at 6:45am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton HPC’s server rooms.
Jul 30, 20:45 PDT

July 30, 2018

Partial shutdown

Resolved: The nodes brought down during the July 30 partial shutdown have been rebooted. Unfortunately, the same partial shutdown has to be repeated within a few days because the work in the server room was not completed. The exact date for the next shutdown is not known at this point.
Jul 30, 09:55 PDT

Notice: On Monday July 30 at 7:00am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton HPC’s server rooms.
Jul 29, 21:20 PDT

June 16-26, 2018

Power outage

Resolved: The NVidia-driver issue occurring on some of the GPU compute nodes has been fixed.
Jun 26, 11:55 PDT

Update: Some of the compute nodes with GPUs are still down due to issues with the NVidia drivers.
Jun 19, 13:50 PDT

Update: The login nodes and the development nodes are functional. Some compute nodes that went down are back up, but not all.
Jun 18, 10:45 PDT

Investigating: The UCSF Mission Bay Campus experienced a power outage on Saturday June 16 causing parts of Wynton HPC to go down. One of the login nodes (wynlog1), the development node (qb3-dev1), and parts of the compute nodes are currently non-functional.
Jun 17, 15:00 PDT