Scheduled Maintenance

The first non-holiday Wednesday of each month is typically reserved for maintenance.  Maintenance usually begins between 9 AM and 10 AM and can run as late as 6 PM.  If maintenance takes longer than expected, an additional email notification will go out during the maintenance window with an updated estimate of when it will conclude.
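The date rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the `holidays` argument are hypothetical, the page does not say which holiday calendar is used, and because the schedule is described as "typically" following this rule, announced dates take precedence over anything computed this way.

```python
import calendar
from datetime import date, timedelta


def first_non_holiday_wednesday(year, month, holidays=frozenset()):
    """Return the first Wednesday of the month that is not in `holidays`.

    `holidays` is a caller-supplied collection of dates (an assumption;
    the actual holiday calendar is not specified in this document).
    """
    first = date(year, month, 1)
    # Days to add to reach the first Wednesday (Monday is 0, Wednesday is 2).
    offset = (calendar.WEDNESDAY - first.weekday()) % 7
    candidate = first + timedelta(days=offset)
    # Step forward one week at a time until a non-holiday Wednesday is found.
    while candidate.month == month:
        if candidate not in holidays:
            return candidate
        candidate += timedelta(weeks=1)
    raise ValueError("no non-holiday Wednesday in this month")


# January 1, 2025 is a Wednesday; treating it as a holiday moves the date a week.
print(first_non_holiday_wednesday(2025, 1, {date(2025, 1, 1)}))  # 2025-01-08
```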

Two types of maintenance can take place each month, as outlined below.  Notification for routine patching goes out a week before the maintenance day.  For the more disruptive core infrastructure updates, we try to notify users a month in advance and again a week before the maintenance day.  These notifications are provided only for scheduled maintenance.  For unplanned or emergency maintenance, we will try to give users as much notice as we can, regardless of service impact.

 

Type

Frequency

Systems (nodes) affected

How updates are applied

Impact on Users

Routine patching

9 to 11 times a year

Border systems:

  • Login nodes

  • Open OnDemand nodes

  • Data transfer nodes

Internal systems:

  • Compute nodes

  • Firewall

  • File Systems

Updates are applied in a rolling fashion by power cycling all nodes during the maintenance window, except for the compute nodes.  Compute nodes are updated by first having the scheduler drain each node and then power cycling it once it is idle.  After restarting, the node is placed back into service.

Interactive jobs may be impacted, but batch jobs placed in non-preempting partitions should not be impacted.

 

Momentary disruptions to storage systems and access via the border systems will occur as systems are rebooted for patching.

Core infrastructure updates

1 to 3 times a year

Border systems:

  • Login nodes

  • Open OnDemand nodes

  • Data transfer nodes

Internal systems:

  • Compute nodes

  • Switches

  • File systems

  • Firewall

All jobs, regardless of progress, are halted and requeued, or canceled if they were submitted as not requeueable.

User access is denied and all servers are idled or powered down for the duration of the maintenance window.

Once maintenance is complete, nodes are restarted, and jobs that were requeued are released and begin to run again.  Jobs that do not save checkpoints will lose any progress they made before the downtime.

A maintenance reservation is put in place for the day of the maintenance, starting at 8 AM HST.

Batch jobs whose time limit would run past the start of the maintenance reservation will be queued and will wait until the reservation ends before being allocated resources.

Interactive jobs that request a time limit that would overlap the maintenance reservation will be denied by the scheduler.

Jobs that do not overlap the maintenance reservation will be scheduled as normal.
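As a rough illustration of the overlap rule described above, the following Python sketch checks whether a job's requested time limit would run into the 8 AM HST reservation, and computes the longest time limit that would still fit. The function names are hypothetical, and the sketch assumes the job would start at the given time, before the reservation begins. The scheduler's actual decision also depends on queue wait and resource availability, which are not modeled here.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hawaii Standard Time is UTC-10 year round (no daylight saving time).
HST = timezone(timedelta(hours=-10), "HST")
RESERVATION_START = time(8, 0)  # 8 AM HST, as stated above


def reservation_start(maintenance_day):
    """Datetime at which the maintenance reservation begins."""
    return datetime.combine(maintenance_day, RESERVATION_START, tzinfo=HST)


def overlaps_reservation(start_time, time_limit, maintenance_day):
    """True if a job starting at start_time could still be running at 8 AM HST."""
    return start_time + time_limit > reservation_start(maintenance_day)


def longest_safe_time_limit(start_time, maintenance_day):
    """Largest time limit that ends by the reservation start (zero if none)."""
    return max(reservation_start(maintenance_day) - start_time, timedelta(0))


# A job starting at 8 PM HST the evening before maintenance has 12 hours.
start = datetime(2025, 1, 7, 20, 0, tzinfo=HST)
day = date(2025, 1, 8)
print(overlaps_reservation(start, timedelta(hours=13), day))  # True
print(longest_safe_time_limit(start, day))                    # 12:00:00
```

Requesting a time limit no longer than the value returned by `longest_safe_time_limit` is one way to keep a job eligible to start before the maintenance day rather than waiting for the reservation to end.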