Maintenance-Mode

From time to time, updates and maintenance are required to keep our Cloud platform running as well as it possibly can. Patching of all aspects of the stack is monitored, and all new revisions are evaluated and tested in a Test/Dev environment before being rolled into production. In addition to basic updates and maintenance, capacity across the stack is routinely expanded: Storage and Compute nodes are regularly added to scale out as needed.

Many updates and scale-out operations are done on a regular basis without customer notification. Our Cloud platform is designed so that certain updates can be applied non-disruptively.

The majority of our updates fall under “Level II Impact” in the communication and description matrix below. We perform updates that are known to have no customer impact as needed, at a time of our choosing, with no communication required. This is by far the most common type of maintenance that we perform: no impact, no communication.

From time to time we will physically enter rack space to perform updates, moves, adds, or changes to physical equipment. When physical access is required, we will send e-mail communications to our customers 48 hours before the work is performed, even if no impact is expected. This work can fall outside any of the windows listed below, and the e-mail is more of a courtesy notification to customers.
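The 48-hour lead time can be checked mechanically. The following is a minimal sketch only, not a description of our actual tooling; the function names `latest_notice_time` and `notice_is_on_time` are hypothetical, and it assumes both timestamps are expressed in the same time zone.

```python
from datetime import datetime, timedelta

# Lead time stated in the policy text: e-mail goes out 48 hours before the work.
NOTICE_LEAD = timedelta(hours=48)


def latest_notice_time(work_start: datetime) -> datetime:
    """Return the latest moment a notice can be sent for work starting at work_start."""
    return work_start - NOTICE_LEAD


def notice_is_on_time(sent_at: datetime, work_start: datetime) -> bool:
    """True if the notice was sent at least 48 hours before the work starts."""
    return sent_at <= latest_notice_time(work_start)


if __name__ == "__main__":
    start = datetime(2024, 6, 15, 23, 0)
    print("Send notice no later than:", latest_notice_time(start))
```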

It is important to stress that at no time do we touch the running state of your Virtual Machines or storage. Virtual Machines and storage systems are never stopped. We have the ability to evacuate storage systems for maintenance as needed, and we will not halt Virtual Machines. This question comes up from time to time, so it is worth addressing. There is no technical need to verify data integrity, VM state, or application functionality after any maintenance event. However, many customers choose to do so to adhere to their own internal processes.

Every once in a while we will fail over or transition the state of particular networking equipment. That equipment is typically perimeter core routing and security equipment that operates in an Active/Passive fashion. A planned failover event of this kind moves all states to the partner (Passive) equipment. The time to move these states can range from 0 to approximately 4 seconds. During that time, users may experience what we refer to as “Sub Optimal Routing”: a user operating within the Cloud in question will experience a brief pause in network connectivity into our Cloud and will then pick right back up within the existing session. This only impacts internet connectivity into and out of our Cloud. Inter-VM communications within the Cloud are not impacted. This type of maintenance is considered Level III Impact and is scheduled for an appropriate time and date. Typically, Saturday 11 PM to 3 AM is chosen for this kind of maintenance, with e-mail communication 48 hours in advance. Customers with point-to-point connectivity options into our Cloud are not impacted by perimeter networking maintenance.
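As an illustration of the Saturday 11 PM to 3 AM window, here is a short sketch that tests whether a given local time falls inside it. It is an assumption-laden example rather than our scheduling system: the function name `in_saturday_window` is hypothetical, the input is taken to be a naive datetime already expressed in the window's local time zone, and the 3 AM end is treated as exclusive.

```python
from datetime import datetime, time

WINDOW_START = time(23, 0)  # Saturday 11 PM
WINDOW_END = time(3, 0)     # Sunday 3 AM (treated as exclusive; an assumption)


def in_saturday_window(local_dt: datetime) -> bool:
    """Return True if local_dt lies between Saturday 11 PM and Sunday 3 AM."""
    weekday = local_dt.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    if weekday == 5:
        # Saturday: only the last hour of the day is inside the window.
        return local_dt.time() >= WINDOW_START
    if weekday == 6:
        # Sunday: only the first three hours of the day are inside the window.
        return local_dt.time() < WINDOW_END
    return False


if __name__ == "__main__":
    # 15 June 2024 is a Saturday.
    print(in_saturday_window(datetime(2024, 6, 15, 23, 30)))
```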

We take every step and put forth a lot of effort to maintain the uptime and integrity of our Cloud, even during maintenance windows. We spend 99% of our time planning for maintenance so that implementation takes only 1% of the time required.

 

True Cloud Server Maintenance Type and Communication Matrix

LEVEL I IMPACT

Any work performed on any component of True North’s hosting infrastructure between 11 p.m. on Saturday and 3 a.m. on Sunday qualifies as a Level I Impact event.

Impact to Service:  No Anticipated Interruption of Service

Examples:  Scheduled Maintenance Window

Maintenance Window:  Saturdays, 11 p.m. – 3 a.m. (PST)

Timeframe for Communication:  At least 48 hours prior to event

Announcement:  Required

Responsible:  Hosting Team

Recipients:  All hosting customers, True North Staff

Communication Details:  Email Blast

LEVEL II IMPACT

Any non-customer-impacting work performed on any component of True North’s hosting infrastructure outside of peak operating hours or the scheduled maintenance window, as the Hosting Manager deems necessary, qualifies as a Level II Impact event.

Impact to Service:  No Anticipated Interruption of Service

Examples:  Urgent Firmware update, storage changes, etc.

Maintenance Window:  Outside of peak system hours (10 p.m. – 3 a.m.)

Timeframe for Communication:  Optional

Announcement:  Optional

Responsible:  Hosting Team (At Hosting Manager’s discretion)

Recipients:  All hosting customers, True North Staff

Communication Details:  Email Blast

LEVEL III IMPACT

In some cases, third-party vendors may require a quick response to apply fixes or patches to a customer’s application or infrastructure element to ensure optimal functionality. In such a case, the assigned True North team member, once the True North department manager has signed off on the course of action within the ticket, will agree on a maintenance window with the customer (the customer’s written approval is required within the ticket) and proceed with the work at the agreed-upon time.

Impact to Service:  Minor Interruption of Service Anticipated

Examples:  Urgent patches/updates, EMR patches/fixes

Maintenance Window:  Urgent/Ad-hoc

Timeframe for Communication:  May vary

Announcement:  Required

Responsible:  Assigned True North employee

Recipients:  True North Customer, Department Manager  

Communication Details:  Appropriate resource will contact customer to discuss details and to arrange for a short-term maintenance window.

LEVEL IV AND LEVEL V IMPACTS

For an event to be categorized as Level IV or V, multiple employees at one or more customer locations must be experiencing issues that would imply a possible hosting service interruption.

Because managing such an event involves multiple moving parts, an additional workflow document has been created to capture the process of managing impacts of this level. The engagement process for Levels IV and V is the same (see details below); the only differentiating factor is the duration of the service interruption.
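Since duration is the only thing separating the two levels, the distinction can be expressed as a single comparison against the 30-minute figure used in the Level IV and Level V entries. This is an illustrative sketch, not part of the workflow document; the function name `unplanned_outage_level` is hypothetical, and the handling of an interruption lasting exactly 30 minutes is an assumption, because the matrix lists only "< 30 min." and "> 30 min."

```python
from datetime import timedelta

# Boundary taken from the Level IV ("< 30 min.") and Level V ("> 30 min.") entries.
LEVEL_THRESHOLD = timedelta(minutes=30)


def unplanned_outage_level(duration: timedelta) -> str:
    """Classify an unplanned service interruption as Level "IV" or "V" by duration."""
    if duration < timedelta(0):
        raise ValueError("duration cannot be negative")
    # Assumption: exactly 30 minutes is treated as Level V (the matrix does not say).
    return "IV" if duration < LEVEL_THRESHOLD else "V"


if __name__ == "__main__":
    print(unplanned_outage_level(timedelta(minutes=12)))
    print(unplanned_outage_level(timedelta(hours=2)))
```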

LEVEL IV IMPACT

Impact to Service:  Unplanned Interruption of Service < 30 min.

Examples:  Service or Application crashes, servers stuck but register as “online”

Maintenance Window:  N/A

Timeframe for Communication:  During and after the event

Announcement:  Required

Communication Process:

  • Text Message from Communications Lead – < 30 min. following incident discovery/report
  • Text Message from Communications Lead – Immediately upon resolution
  • Email from Communications Manager – Follow up within 48 hours of resolution

 

LEVEL V IMPACT

Impact to Service:  Unplanned Interruption of Service > 30 min.

Examples:  Entire datacenter outage, data corruption

Maintenance Window:  N/A

Timeframe for Communication:  During and after the event

Announcement:  Required

Communication Process:             

  • Text Message from Communications Lead – < 30 min. following incident discovery/report
  • Text Message from Communications Lead – updates every 30 min. until issue resolved
  • Text Message from Communications Lead – Notification immediately upon restoration of Services
  • Email from Communications Manager – Follow-up and post mortem within 96 hours of resolution
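
The Level V communication steps above can be laid out as a timeline. The sketch below is only an illustration of the listed intervals, not the workflow document itself; the function name `level_v_schedule` is hypothetical, and it assumes that the 30-minute updates are counted from the initial-notice deadline and that resolution comes more than 30 minutes after discovery.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)  # initial notice deadline and update cadence
FOLLOW_UP_LIMIT = timedelta(hours=96)    # e-mail follow-up and post mortem deadline


def level_v_schedule(discovered: datetime, resolved: datetime):
    """Return a list of (time, channel, message) tuples for a Level V event."""
    if resolved < discovered:
        raise ValueError("resolved must not be earlier than discovered")
    # First text message is due within 30 minutes of discovery/report.
    events = [(discovered + UPDATE_INTERVAL, "text", "initial notice due")]
    # Assumption: further updates follow every 30 minutes after that deadline.
    next_update = discovered + 2 * UPDATE_INTERVAL
    while next_update < resolved:
        events.append((next_update, "text", "status update"))
        next_update += UPDATE_INTERVAL
    # Text message immediately upon restoration of services.
    events.append((resolved, "text", "restoration notice"))
    # E-mail follow-up and post mortem within 96 hours of resolution.
    events.append((resolved + FOLLOW_UP_LIMIT, "email", "follow-up and post mortem due"))
    return events


if __name__ == "__main__":
    timeline = level_v_schedule(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 40))
    for when, channel, message in timeline:
        print(when, channel, message)
```

Each tuple gives a deadline rather than a required send time, so a notice sent earlier than the listed time still satisfies the matrix.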

 
