Introduction
In a new industry, where defined metrics do not yet exist, expectations are
unclear: what should a consumer expect with regard to a Service
Level Agreement (SLA)? Is this a
mere insurance policy against failures, is it nothing more than a discount
structure applied through punitive means, or is there room for a true SLA to
provide competitive advantage to a business customer?
The intent of this document is to provide businesses in the
outsourcing marketplace with a road map for what to expect when navigating the
many issues surrounding the successful implementation of a Service Level
Agreement.
Online backup is an example of an outsourced service and some of the criteria below may apply to this simple, yet critical business process. This document does not, however, imply that the criteria below are strictly relevant to online backup nor that all the criteria are applied by Backup Direct™ in relation to its online backup service. Details of the Backup Direct™ SLA in relation to its online backup services can be found here.
Readers of this document can access the Backup Direct™ free trial software here.
Why ask for an SLA anyway
The first question that must be answered in dealing with any SLA is ‘why
bother?’ While it may seem trite, understanding why a business
wants an SLA is fundamental to the mutual success of both provider and
consumer. At its core, an SLA is a punitive document. It is part
marketing brochure, part boast. It is a statement about what capabilities a
business believes it can offer, and what performance it can sustain.
But at its heart, the document boils down to punitive measures enforced
when performance does not meet promises. Remedies
are typically financial and can seldom repair the damage that arises from
non-performance. Consider, for example,
a mission-critical application, hosted by a third-party service provider, that goes
down due to infrastructure failures at the service provider.
While the clock may start ticking against the SLA-warranted
downtime, the damage incurred by the business consumer will far outweigh the
credits given. These credits are generally applied against the hosting service charges
on a particular month’s bill. Credits
offered against minuscule monthly hosting fees are insignificant against the lost
revenues of a business consumer, damage to business reputation and credibility,
and potential career damage to the decision maker who authorized the
outsourcing contract in the first place. Since
no financial remedy will ever be adequate, focusing on the punitive
aspects of an SLA is one of the pitfalls to avoid when crafting this document.
The business consumer
should avoid paying too much attention to remedies for failure, and focus more energy
on the mechanisms that prevent failure from occurring in the first place.
A successful SLA will remove doubt from the consumer’s mind, and
help ensure business continuity rather than merely offer financial compensation for
substandard performance.
What approach makes the most sense
One of the contested issues around SLAs in the Online Backup marketplace is an
ongoing philosophical difference between component-based agreements and holistic
/ service-based agreements. Providers
are more often able to measure elements or components of discrete services and
thus tend to offer remedies and documentation that address
component-by-component the items of a given service – without ever tying it
all together. Other providers have opted for a more holistic approach at the
service level, but the pitfalls have been non-specific warranties, generic
language, and mismatched expectations between provider and business consumer.
The most successful SLA implementations begin by looking at
the service as a whole, from the customer’s point of view.
Identifying the key aspects of a solution that must function properly serves as
a first step in identifying how to tie these elements together under a complete SLA. Providers
must realize that a partial solution is no solution and use component-based
capabilities to build a complete overview that warrants all the components of a
service and gives the business consumer assurances that the overall solution
will be delivered as promised.
What should be in an effective SLA
There are three key areas that are generally addressed in the successful SLA.
The first is categorized as ‘infrastructure warranties’.
In the Online Backup marketplace this category tends to include
performance characteristics around facilities, connectivity, hardware
reliability and general availability of discrete technologies.
The second key category of an SLA is ‘process warranties’.
This category includes items like turnaround times for work
process events such as adding a new user, deleting a user, and setting up a new account.
However, it may be extended to include items like developing a new
scripting module to accommodate ‘x’ requirement in ‘y’ period of time.
The third key category of a successful SLA is ‘escalation warranties’. This category is
designed to give as much assurance as possible during unforeseen failures, acts
of God, external contributor failures, etc.
As no-one, perhaps excluding God, can guarantee perfection, this category
is designed to outline the flow of how a failure is resolved, what time frames
to expect, what percent of failures may fall within a given level of disaster,
etc.
Infrastructure Warranties
The infrastructure warranties section of an SLA is the easiest section for a
business consumer to become enamoured with.
Providers are generally quick to throw out impressive quality standard
numbers such as ‘five nines’ (99.999%) uptime.
Some vendors only count the nines behind the decimal point; some
imply it. But the net effect is to
offer an extremely high level of availability to the consumer, and ideally lift
that consumer’s feelings about how the vendor will perform given the
guarantee. But herein lies another
pitfall to avoid: the number of nines in a given guarantee can quickly be
negated by two factors. The first is exclusions; the second is relevance.
Exclusions
Most service providers who offer such high availability standards provide for a
laundry list of exclusions from which time is exempt against the overall SLA
measurement. Common exclusions are
‘scheduled maintenance windows’ which may involve anything from upgrading
equipment, to periodic reboots, to backing up critical information.
For example, depending on the application technology being offered by the
Online Backup provider and the proficiency and skills of that provider, an NT platform is
often placed on scheduled reboots to ‘clean up’ a given system’s
performance and thereby reduce the number of ‘unplanned’ negative events.
If this reboot schedule is more than once a week, and takes 15-30 minutes
or more to complete, the ability to offer true five nines performance is
negated. Other exclusions may
involve the ever famous acts of God, warfare, terrorism – but more importantly
may exclude SLA provisions for failure of a third party which the service
provider does not directly control.
This limitation is often rightly used to exclude a local ISP
of the business consumer where the provider does not directly control the
ISP’s operations. It may also be
deployed against software vendors for ‘anomalies in the code base’, which
the software vendor must fix itself.
Sometimes connectivity providers fall into this arena for the sections of
the network they operate between the consumer and the ultimate hosting provider.
Business consumers should take care to closely examine the exclusions of an
agreement and ensure that the ‘real’ availability of the service matches the
perceived warranties.
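The arithmetic behind this pitfall can be sketched in a few lines. The following Python sketch assumes a 365-day year and a hypothetical schedule of one 15-minute reboot per week that the provider excludes from its SLA measurement; the function names are illustrative only and do not come from any provider's tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def effective_availability_pct(excluded_minutes_per_year: float,
                               unplanned_minutes_per_year: float = 0.0) -> float:
    """Availability the consumer experiences once excluded time is counted."""
    down = excluded_minutes_per_year + unplanned_minutes_per_year
    return 100 * (1 - down / MINUTES_PER_YEAR)


# Hypothetical schedule: one 15-minute reboot per week, excluded from the SLA.
weekly_reboot_minutes = 15 * 52

print(round(downtime_budget_minutes(99.999), 2))   # yearly budget at five nines
print(round(effective_availability_pct(weekly_reboot_minutes), 3))
```

Under these assumptions the five nines budget is a little over five minutes a year, so the excluded reboots alone consume it many times over even if nothing unplanned ever happens.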
A business consumer should also take special note of the ‘Application’ wording in the outsourced marketplace with respect to exclusionary provisions. Closely examine the software license agreement of any off-the-shelf product and you will find that the language used excludes the software from providing any benefits at all, except, it would appear, by sheer accident. Every possible liability is disclaimed. Given that primary software providers disclaim all warranties of their software, Online Backup providers then have a difficult task attempting to warrant a service which relies at its core on products the original manufacturers refuse to warrant in any way. While premium service providers will attempt to ‘own’ the entire service experience for their customer base, at least from a single point of contact or responsibility point of view, no service provider can warrant software code above what the developer will take responsibility for.
Relevance
The second negating factor to examine against multi-nine performance warranties is
relevance. The most resilient
component of a given service is generally the one touted during a multi-nines
warranty claim. This is commonly
the resiliency of the facility, or data centre building itself.
The second most common measurement is against network uptime /
connectivity, which, if the provider is worth their salt, generally spans more
than one vendor, involves peering arrangements, and is redundant enough to
warrant true ‘high availability’. In
either case, whether it is the data centre or the network layer that is being measured,
in the Online Backup marketplace the truly relevant measure is several
layers up the OSI model. Ultimately
the only measure that truly counts is at the application tier.
This implies performance of the application itself, middleware platforms
(if any), the operating system, the hardware, the network, and yes, the building
itself. It is also not common to
offer multi-nine warranties when measured at this layer, since the complexity of
the environment (variability in the implementation of the OSI model) makes it
hard to predict. Business consumers
should probe what measurement statements are being made about the service and
ensure that the most relevant measures are in place to warrant overall
successful service delivery.
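One way to see why a multi-nine figure for the building says little about the application tier is to multiply the layers together. The sketch below assumes every layer must be up at the same time and that layers fail independently, which is a simplification; the per-layer figures are invented for illustration and do not describe any real provider.

```python
from math import prod

# Hypothetical per-layer availabilities; none of these figures are real.
layers = {
    "building": 0.99999,
    "network": 0.9999,
    "hardware": 0.9995,
    "operating system": 0.999,
    "middleware": 0.999,
    "application": 0.998,
}


def serial_availability(parts):
    """Availability of a chain in which every part must be up at once."""
    return prod(parts)


end_to_end = serial_availability(layers.values())
print(f"best single layer: {max(layers.values()):.3%}")
print(f"application tier:  {end_to_end:.3%}")
```

With these assumed figures the end-to-end result is lower than the weakest single layer, which is why a warranty quoted for the most resilient component overstates what the end user will see.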
Frequency
Another common pitfall to avoid in the infrastructure sections of an SLA is lack of
attention to frequency and capacity issues.
It seems like a ‘no-brainer’, but it is surprising how little published
information service providers offer in SLA documents about the frequency of the
measurements that are taken against a given component of the service.
For example, it is easy (albeit risky) to tout higher
availability if only measuring a service characteristic once per day, or perhaps,
as is more commonly seen, once per hour.
A single successful response can mask up to 59 minutes of downtime that
may or may not have gotten attention from staff.
Polling at smaller intervals makes overall service
response time much better, focuses attention on problem areas more quickly, and
effectively ‘ensures’ better system performance, even though statistically
the SLA results could look similar to the end consumer.
A reasonable measurement interval is generally around 5
minutes. More frequent measurement
than that can burden the system (the ‘we affect what we watch’ theory); less
frequent measurement tends to lose responsiveness to potential issues.
Business consumers should require from providers a list of the tools used to
measure the system, and should ask what frequency of measurement
is used to determine the numbers provided.
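The effect of polling frequency on reported availability can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: a single hypothetical 59-minute outage that begins one minute after an hourly poll, polls taken at fixed minute offsets from the start of the day, and function names invented for this example.

```python
def measured_availability(outages, poll_interval_min, period_min):
    """Share of polls that find the service up.

    outages is a list of (start_minute, end_minute) pairs, end exclusive.
    Polls are taken at minutes 0, poll_interval_min, 2 * poll_interval_min, ...
    """
    polls = range(0, period_min, poll_interval_min)
    failed = sum(
        1 for t in polls
        if any(start <= t < end for start, end in outages)
    )
    return 1 - failed / len(polls)


def true_availability(outages, period_min):
    """Share of minutes the service was actually up."""
    down = sum(end - start for start, end in outages)
    return 1 - down / period_min


DAY = 24 * 60
# Hypothetical outage: 59 minutes, starting one minute after an hourly poll.
outages = [(601, 660)]

print(true_availability(outages, DAY))          # what really happened
print(measured_availability(outages, 60, DAY))  # hourly polling sees nothing
print(measured_availability(outages, 5, DAY))   # 5-minute polling catches it
```

In this scenario the hourly measurement reports a perfect day, while the 5-minute measurement lands close to the true figure.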
Time Span
Generally
it is easier for a service provider to achieve better uptime results over a
longer period of measurement time. This
is not necessarily a disadvantage to either the service provider or business
consumer as the goal is to raise performance statistics over time (i.e. the
longer a service runs the better off a business consumer is).
However the remedies listed against an SLA should be tied to the billing
frequency of the business consumer. For
example, if the service being provisioned is billed monthly, which is most
common in the Online Backup sector, the SLA measurement periods should also be
monthly. This allows the business
consumer to review on the invoice each month the line item credits associated
with any breach in SLA performance. It
also facilitates a tie-in between the value the service provider offers and the
components of the service that either exceed, match, or do not meet performance
expectations of the consumer. Business
consumers should ensure that the measurement times of the service correspond to
the billing cycle, and that line item credits for SLA non-performance are easy
to understand, clearly tie to discrete system performance, and are clearly
articulated by the service provider on the billing invoice.
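Tying the measurement period to the billing period makes the invoice credit a simple calculation. The sketch below is illustrative only: the credit tiers, the fee, and the function name are all assumptions made for the example, since real credit schedules are whatever the two parties negotiate.

```python
# Hypothetical credit schedule: (minimum availability %, credit as % of monthly fee).
# Checked from best to worst; these tiers are illustrative, not from any real SLA.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25), (0.0, 50)]


def monthly_credit(fee, minutes_in_month, downtime_minutes):
    """Return (availability %, credit amount) for one billing period."""
    availability = 100 * (1 - downtime_minutes / minutes_in_month)
    for floor, credit_pct in CREDIT_TIERS:
        if availability >= floor:
            return availability, fee * credit_pct / 100
    raise ValueError("downtime exceeds the length of the billing period")


# A 30-day month with 6 hours of downtime on a 200.00 monthly fee.
availability, credit = monthly_credit(200.0, 30 * 24 * 60, 6 * 60)
print(f"availability {availability:.2f}%, credit {credit:.2f}")
```

Because the period measured and the period billed are the same, the credit can appear as a single line item that the consumer can check against the reported downtime.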
Capacity
Capacity
concerns are sometimes overlooked in an SLA, with negative results.
It is sometimes difficult to warrant certain characteristics of a given
solution, for example that CPU utilization or RAM consumption
will not exceed ‘x’ percent.
However, a provider can take certain steps to ensure that peak network
utilization, for example, never exceeds a sustained 70% of overall bandwidth
availability. A provider can
warrant in the process section of an SLA that hard drive capacity utilization
will be escalated to the consumer as key targets are reached: 50% filled, 75%
filled, 90% filled, for example. It
is important when considering capacity utilization issues that the business
consumer does not attempt to warrant ‘best practices’ as part of an SLA.
Best practices tend to
evolve and change over time, and a service provider should be free to manage the
solution for the customer by learning on a continual basis.
Forcing a service provider to keep a given component utilization at some
arbitrary number may in fact do more harm to a system than good.
Business consumers should be careful to avoid constricting language that
does not allow a service provider to implement improvements in capacity
management as they evolve over time.
Sample Set
The statistical sample set used in the measurement is also significant to the
business consumer. For example, it
is easier to maintain higher uptime reporting standards when measuring downtime
over a large base of servers, than on an individualized (my server only) basis.
A service provider should be able to measure ‘dedicated’ servers, or
those used specifically by a single customer, and provide reporting to that
individual customer regarding their discrete performance.
However, measuring common components of a solution becomes much more
difficult. A fax gateway or e-mail gateway translation service, for
example, may be deployed for a large group of customers, where throughput of
messages is likely measured at the gateway itself rather than by individual
customer. Higher
value-added providers, however, will find a way to offer the business consumer
visibility into their discrete usage of common infrastructure utilities.
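The difference between a fleet-wide figure and a ‘my server only’ figure is easy to demonstrate. The following sketch uses invented data for ten servers over a 30-day month; the server names and downtime values are hypothetical.

```python
# Hypothetical monthly downtime, in minutes, for ten servers.
# Nine had no outages; one customer's dedicated server was down for 8 hours.
downtime_by_server = {f"server-{n}": 0 for n in range(1, 10)}
downtime_by_server["server-10"] = 8 * 60

MINUTES_IN_MONTH = 30 * 24 * 60


def availability_pct(down_minutes, total_minutes):
    """Availability as a percentage of the total minutes measured."""
    return 100 * (1 - down_minutes / total_minutes)


fleet = availability_pct(
    sum(downtime_by_server.values()),
    MINUTES_IN_MONTH * len(downtime_by_server),
)
worst = availability_pct(downtime_by_server["server-10"], MINUTES_IN_MONTH)

print(f"fleet-wide: {fleet:.2f}%")   # averaged over every server
print(f"server-10:  {worst:.2f}%")   # what that one customer experienced
```

The same eight-hour outage costs the fleet-wide figure about a tenth of a percentage point, but costs the affected customer more than a full point.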
Resiliency
The old adage – you get what you pay for – tends to ring true with respect to
heading off potential service outages by purchasing additional resiliency in the
components of any given hosted solution. For
example, a hosted solution may only require a 300 MHz CPU, a 4 GB hard
drive, and 64 MB of RAM in order to function properly.
Consumers will readily purchase higher-MHz CPUs, since the going rate
of speed as of this writing is 800+ MHz; they will not think twice about spending
on the additional hardware in order to achieve ‘theoretically’ better
system performance, or in an effort to anticipate additional solution
requirements at some future time. Purchasing
resiliency, however, may meet the future performance objectives while at the
same time offering immensely better protection against potential service
failures. Instead of buying the
additional MHz in the CPU, it may be better to buy a second server entirely that
could be mirrored or load balanced to protect against outages.
The load balancing with a second machine may wind up offering much better
overall performance than simply upgrading a single box.
Service Providers should be able to provide the business customer with
detailed system hardware and software requirements for optimum solution
performance; this should include single configuration solutions, as well as load
balanced solutions. If the consumer
opts to purchase more resiliency, the metrics for SLA performance of technical
components of the system should be markedly higher.
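A rough sense of how much a second server can raise the achievable metric comes from the standard formula for redundant components. The sketch below assumes the servers fail independently and that failover between them is instantaneous and perfect, neither of which holds exactly in practice; the 99% figure for a single box is hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` identical servers can carry the load."""
    return 1 - (1 - single) ** copies


single_server = 0.99  # hypothetical figure for one box
print(f"{redundant_availability(single_server, 1):.4%}")
print(f"{redundant_availability(single_server, 2):.4%}")
```

Under these assumptions a mirrored pair of 99% servers behaves like a 99.99% service, an improvement that no CPU upgrade on a single box can provide.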
Process Warranties
Work Orders
Technical
jargon and detailed explanations of infrastructure measurement techniques can
help assure a user that the service provider understands how to run the
solution. However, it does nothing
to assure the consumer that the provider understands the criticality of the full
solution to the business they serve. This
is the area where process warranties make all the difference.
By warranting specific tasks on specific timelines, for example, the
consumer can be assured that setting up a new account will occur in
‘x’ minutes and that deleting a terminated end-user will occur throughout the system
in ‘y’ minutes. These
assurances allow the business consumer to develop business practices that the
solution provider can warrant performance against.
Knowing that requests for change, or work orders (if you will), are
warranted for turnaround in defined time periods takes the guesswork out of
planning, and assures both the provider and the consumer that the important
functionality of the application is being addressed in a way that is meaningful to the
consumer. It also gives the
provider a productivity target against which it can try to improve
performance over time.
Notification
Process
warranties are not limited to work-order related tasks.
They can also include business information dissemination related to
utilization of the solution resources themselves.
For example, a process warranty that notifies the consumer when 50%, 75%,
and 90% of a hard drive’s capacity has been used allows the consumer
to address the issue. It may be
that the consumer enacts utilization policies for the end-user base that uses
the solution, such as ‘please limit your online storage to 5 MB per user’,
‘please delete information older than “x” days’, or ‘please delete
these sections of data within the solution itself’.
Proactive notification becomes a competitive advantage,
allowing the consumer to craft the appropriate response for his business.
When storage of the data is necessary over a long term, and growth is
high or unpredictable, the business consumer may want more notifications between
50% and full – these become items of negotiation in the deployment of a
successful SLA that again keep the provider and consumer focused on the key
aspects of solution performance.
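A threshold notification of this kind can be sketched very simply. The example below uses the 50%, 75% and 90% thresholds named above; the daily readings and the function name are hypothetical, and the `print` call stands in for whatever notification channel the provider actually uses.

```python
# Notification thresholds named in the text; a negotiated SLA could add more.
THRESHOLDS = (50, 75, 90)


def newly_crossed(previous_pct, current_pct, thresholds=THRESHOLDS):
    """Thresholds passed since the last check, so each one notifies only once."""
    return [t for t in thresholds if previous_pct < t <= current_pct]


# Hypothetical daily readings of disk utilization, in percent.
readings = [42, 48, 53, 61, 77, 76, 91]
previous = 0
for current in readings:
    for threshold in newly_crossed(previous, current):
        print(f"notify consumer: disk is {current}% full (passed {threshold}%)")
    previous = max(previous, current)
```

Tracking the highest level seen so far means a small dip below a threshold, followed by a rise back above it, does not trigger a repeat notice. A consumer who negotiates extra notifications between 50% and full would only need a longer threshold list.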
Response Times
Process
warranties can also include response times of the overall solution to stimuli
(web page retrievals or mouse clicks most often).
While response times can also appear in the infrastructure section of an
SLA, identification in the process section sends a message to the business
consumer that this service provider knows the truly important items to be
measured in the SLA. Who cares if
the turnaround time on a database transaction is less than 3 seconds, for
example, if it takes 12 seconds to refresh the screen?
Measurement of key response times to system stimuli should cover
the most important aspects of the solution, and should take into account
variability for sections of the solution that the provider may not directly be
able to control. For example,
having a 3 second response time to process a transaction is common (including
the screen refresh), but what if the user base extends to international
locations? Will the system perform
acceptably from Tokyo to London to New York?
Should it? Response time
warranties force the business consumer and the service provider to think outside
of the box, to consider end-user scenarios that may not be ‘normal’ to
business operations, but may arise from growth or other factors.
This will help avoid problems with differing expectations later in the
relationship.
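The location question can be made concrete by adding network time to the server and screen-refresh times. Every number in the sketch below is an assumption chosen for illustration (processing time, refresh time, round-trip times, and the number of round trips per request); only the 3-second target echoes the example in the text.

```python
# Hypothetical timings in seconds; real figures would come from measurement.
SERVER_SECONDS = 1.2   # transaction processing at the provider
RENDER_SECONDS = 0.8   # screen refresh on the client
TARGET_SECONDS = 3.0   # the 3-second target used as an example in the text

# Assumed network round-trip time from each office, and round trips per request.
ROUND_TRIP_SECONDS = {"New York": 0.02, "London": 0.09, "Tokyo": 0.3}
ROUND_TRIPS_PER_REQUEST = 4


def response_time(location):
    """End-to-end seconds for one request as seen by a user at `location`."""
    network = ROUND_TRIP_SECONDS[location] * ROUND_TRIPS_PER_REQUEST
    return SERVER_SECONDS + RENDER_SECONDS + network


for location in ROUND_TRIP_SECONDS:
    seconds = response_time(location)
    verdict = "meets" if seconds <= TARGET_SECONDS else "misses"
    print(f"{location}: {seconds:.2f}s {verdict} the target")
```

With these assumed figures the same system meets the target from two offices and misses it from the third, which is exactly the kind of scenario worth settling before the warranty is signed.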
Measurement
Measurement
within the Process section is as critical as it is within the infrastructure
sections of the document. Who will
monitor that the processes are functioning according to the specifications?
Normally, the provider will assign this task to internal personnel, but
it is equally important for the provider to document who, how, and when
compliance attributes are to be monitored and reported to the business consumer.
It is not unreasonable for the business consumer to negotiate outside
auditing of the process section of an SLA.
Indeed these functions are the easiest to measure, document, and report
against by third parties. The
infrastructure section of an SLA may rely on proprietary techniques or inside
technical information that forms the basis of a competitive advantage to the
service provider. But process
warranties are formed at a higher tier, generally involve business processes
only, and therefore are easier to validate by internal or external parties.
The business consumer should be explicit with the service provider as to
who will audit, how often an audit will be conducted, and of course how
disparities in reporting will be addressed.
Escalation Warranties
Customer Care
Sometimes
referred to as the ‘customer care’ portion of an SLA, this section deals
primarily with what to expect when the unforeseeable occurs.
It is in this section of the SLA where a service provider has the
opportunity to distinguish himself from the pack.
The SLA should contain language to help set expectations regarding
failure classifications, frequency, and then define escalations both inside and
outside the provider’s direct control. For
example, a business consumer of Online Backup services should expect that 80% of
the support calls to the service provider would be quickly and efficiently
diagnosed as ‘client’ problems. History
and statistics demonstrate that the piece of any application solution most
likely to fail is where variability is the highest.
The PC is truly a ‘personal’ device, generally subject to the end
user downloading applications from the Internet, changing configurations to
accommodate games, or participating in other equally ‘personal’ behaviour on
a computer system despite the fact that the PC is generally a corporate owned
asset. Therefore variability is
generally highest at any given point in time at the client – driving the 80%
causal factor for failure. There
is less variability in connectivity, typically the next known
culprit in perceived system non-performance; this is equalled by lack of training on
product functionality / feature sets, followed by true errors in the system
infrastructure, and lastly by real errors in the code base of a given
application. This then becomes a hierarchical pecking order for system
failures at an Online Backup provider. It is
important for the business consumer to understand this as they negotiate
escalation procedural warranties. Avoid
the pitfall of requesting the 80% expected client issues be escalated and
reported immediately throughout the management chain at the provider and the
business consumer sites – focus on the remaining 20% of issues that can be far
more significant in terms of impact on the solution, and potentially far longer
lasting in terms of outage time.
Escalation Paths
Escalations
should be complete (covering both internal and external resources), and describe the
expertise hierarchy of the provider. For
instance, a common escalation chain may start at the tier 1 level of customer
support; failing resolution within a defined period of time, it would move to the
tier 2 level of customer support. This
pattern may generally repeat across 2 to 4 levels within support, but then should
reflect an escalation to the Operations and/or Engineering staff.
Escalations should further detail the levels within Ops or Engineering
[how many, at what time intervals, to what groups in parallel (if any), etc.].
Finally, the service provider should identify what types of support
relationships they have with the third party providers upon which they rely.
For instance, if a service provider relies upon MCI for
Internet circuits, they should provide the business consumer with an itinerary for
how an escalation takes place from the provider to MCI.
There may be special expedited support mechanisms in place
with third party providers to a given Online Backup provider that distinguish that
provider from the rest. The business
consumer should pay particular attention to these relationships and escalation
paths, as the most significant failures of a solution will inevitably wind up
here. Defined relationships, a
history of collaboration with vendors on problem resolutions, etc., will go a
long way in shortening the time to resolution, and restoring the service to the
consumer.
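An escalation path of the kind described can be written down as a simple time-based ladder. The sketch below follows the tiers named in the text, but the time limits are placeholders to be negotiated, and the function names are invented for this example.

```python
# Hypothetical escalation ladder: (minutes unresolved, who is engaged).
# The tiers follow the text; the time limits are placeholders to be negotiated.
ESCALATION_LADDER = [
    (0, "Tier 1 support"),
    (30, "Tier 2 support"),
    (90, "Operations / Engineering"),
    (180, "Third-party provider (e.g. circuit vendor)"),
]


def engaged_levels(minutes_open):
    """Every level that should be working a ticket open for this long."""
    return [who for after, who in ESCALATION_LADDER if minutes_open >= after]


def current_owner(minutes_open):
    """The most senior level reached so far."""
    return engaged_levels(minutes_open)[-1]


for minutes in (10, 45, 200):
    print(minutes, "->", current_owner(minutes))
```

Writing the ladder as data rather than prose makes it easy for both parties to see how long an issue may sit at each level and where the third-party hand-off occurs.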
Crisis Status
Status
during a crisis is the most important element of this section of the SLA.
A business consumer has a right to expect regular updates on where a
given open item is within the resolution processes of the service provider.
The best mechanism for providing this information is generally via a
secured Internet site, as accessibility is most open via this mechanism.
Business consumers should expect to see trending information related to
the performance of the service provider in resolving open trouble tickets.
Keep in mind that the customer support organization cannot mandate the
quality of a service solution; only engineering can do this.
Trending data will show the business consumer how quickly items are
resolved, how many items are submitted and open, and, when examined over a 12 to
15 month period of time, the overall quality of the service.
The number of tickets should go down over time when a system performs as
expected, assuming no new features are deployed and the environment remains
consistent.
Reporting
Reporting
of status information and escalation compliance is something the business
consumer should take note of as well. How
does the Online Backup provider report against prescribed escalation procedures?
What mechanisms are in place to ensure this information is
accurate, and not produced only after prompts from the business consumer on a
challenge? Ideally the consumer
should see a monthly report that matches SLA information against services billed
within the period. It should
provide highlights for compliance against objectives, and show credits due for
any non-performance issues that may have arisen.
Reporting on a more frequent basis than monthly becomes a costly
proposition for the service provider, and while they may be able to accommodate
the request, the business consumer should expect to pay a premium for this type
of additional reporting capability.
So how does a business derive competitive advantage from an SLA
Presentation
The
savvy business consumer should examine a service provider’s capabilities
regarding the SLA, as well as their vision regarding the interface of SLA
information to the customer. Presenting
information to the consumer on a timely basis that allows the customer to make
intelligent business decisions around the usage of a given solution is the end
goal. Trending data is critical to proper analysis, and is
indicative of advanced systems and capabilities of a service provider.
It implies a graphical presentation (charts), it implies storage of
historical data, and it implies the ability to archive data over time (or the
Online Backup provider will drown in data overflow).
The business consumer does not need real-time feeds of this information,
as ‘real-time’ implies significantly higher costs than the value it
provides, with the exception of crisis status or service outage notification.
One-day-old information is generally the most valuable and easiest to
collect and present. While the
Online Backup provider may not have developed the actual tools for the delivery of this
information yet, it is critical to the business consumer that the provider
shares this vision and is actively engaged in making it a reality.
Intelligent Auto-Repair
Most
Online Backup providers are still struggling with building intelligent capabilities
into their service offerings and monitoring tool sets.
Implementing auto-fix scripts that take into account information from the
network, hardware, OS, and application layers, analyze it from a
composite point of view, and apply automatic error correction is still a lofty goal
more than common practice at this point.
But again the business consumer must examine the Online Backup provider’s vision
with regard to SLA implementation and ensure that this capability is an end goal
of the monitoring and measurement systems put in place to collect and report
against SLA performance. This type
of capability is new to the industry as a whole, but represents the best hope
for ensuring stability in highly complex computing environments.
Mutually beneficial outcomes
Structuring a ‘win/win’ agreement may motivate both parties to perform.
An often-overused cliché, the ‘win/win’ agreement financially
motivates both the service provider and the business consumer to achieve common
goals. Online Backup providers tend to
use this concept to gain additional revenue from the services provided if they
hit all the performance objectives. But
this is simply added cost to the consumer, the reverse of punitive discount
structures. It does not represent a
true incentive to perform, only additional revenue to the provider.
True ‘win/win’ agreements will expand into revenue
sharing opportunities for both parties. For
example, if the reliability of a service can win industry awards for the
business consumer that truly distinguish their solution from others, the value
of these awards to the marketing organization could be reflected in financial
compensation to the provider. Setting up reciprocal
service agreements where service provider and business consumer refer each
other’s customers for cross-sell opportunities is another way to achieve real
‘win/win’. The implication from
such terms and conditions is that performance will have to meet or exceed
expectations or revenue flow would cease for both parties.
What is online data backup?
An online data backup service is a proven alternative to traditional optical and
tape backup solutions and can be considered a perfect solution to use on an
outsourced basis - as it is a critical but non-core business function.
Most businesses understand the need to protect their most valuable business asset. It is too easy to become victim to human error, PC crash, a virus, malicious actions, flood, fire, theft or loss of a PC.
Traditional backup solutions can be effective, but require capital expenditure and internal staff to maintain and operate them. They are generally considered a hassle, especially during a crisis when key data needs to be recovered quickly.
Online data backup is a refreshing, modern outsourced solution for ensuring cost-effective, simple and secure resolution of the niggling backup challenge, so why not try it risk-free for 15 days?