Review: Nexthink

This independent review is part of our 2013 Incident and Problem Review. See all participants and terms of the review here.

Executive Summary

Elevator Pitch: If systems management monitoring takes care of servers, Nexthink presents everything you need to know about the end-user side of the coin.

Nexthink sits apart from the nuts and bolts of Service Management tooling, but offers guidance to analysts to help expedite resolution with real-time End-user IT Analytics, integrated into major ITSM tools to significantly reduce problem diagnosis times.

Strengths
  • Lightweight, non-invasive kernel-driven footprint on end-user targets helps define trouble spots in real time
  • Complements and integrates with existing IT Service Management deployments
Weaknesses
  • With so much technical capability, it needs a very strong balancing hand of strategy to get the best out of the combination of this product, a service management suite, and server monitoring.
Primary Market Focus: Based on the information provided, Nexthink’s customer base ranges from Small (100 end users) to Very Large (250,000+). They are classified for this review as: Specialised tooling, requiring integration to ITSM.

Commercial Summary

Vendor Nexthink
Product Nexthink V4
Version reviewed V4.3
Date of version release February 19 2013
Year founded 2004. Turnover is not disclosed, but growth is 100% year on year. Two million user licences sold to date
Customers 400
Pricing Structure Priced per number of IT users, with perpetual or subscription licences. On-premise Enterprise product and Cloud/SaaS offering (Q2 2013)
Competitive Differentiators Nexthink provides unique real-time end-user IT analytics across the complete infrastructure. This perfectly complements existing performance monitoring systems to drive better ITSM initiatives; end-user IT analytics are used to:

  • Diagnose and isolate problems in real time so the service desk becomes more effective and responsive, for higher customer satisfaction
  • Continuously compute metrics and KPIs for proactive action, so IT operations can improve service quality for higher business agility and productivity
  • Keep configuration and change management fully under compliance control
Additional Features Nexthink’s product provides real-time discovery, dependency and relationship mapping, real-time activity monitoring, alerting, and reporting on all object and data analytics available.

It doesn’t rely on any external product or data to function. However, integration methods exist to enrich Nexthink data with external data sources (e.g. Active Directory, event databases, CMDB) and to export Nexthink data/analytics to other tools to create end-to-end correlated views/results (CMDB, ITSM, Security Event Management). See an example here: http://www.youtube.com/watch?v=rSstzl_KMdc

Independent Review

Credit should be given to Nexthink for putting themselves up against “traditional” ITSM Vendors, as their product does not do traditional Incident Management and Problem Management.

What it can do, however, is significantly shorten the amount of time it takes to resolve an incident and/or problem, by showing end user data in real time.

Nexthink have established major partnerships and product integrations with companies such as BMC, HP and ServiceNow, providing a button on those ITSM consoles to allow analysts to view the data when required.

It almost presents itself as the super-hero of incident and problem diagnosis.

But staying with that super-hero image for just a moment, it is easy to get carried away with the technical potential of a shiny, real-time mapping toy.

It is much more than that, and it needs a sharp strategic mind to position it – remembering the key drivers of any ITSM related deployment.

The potential to drive down incident resolution time and, more importantly, problem root cause analysis time makes it a compelling companion to an ITSM tool for achieving tangible business efficiency benefits.

Widening the scope to look at the effects of IT Transition projects, and the potential business benefits of understanding what specification end-user machines need to meet to ensure speedier access to services, for example, could reap significant rewards.

Systems Monitoring vs. Real Time End User Monitoring

Nexthink acknowledge that infrastructure monitoring is an established discipline.

There are all manner of event and systems management tools that can also integrate into service management tools to present an organisation with enterprise level management.

What Nexthink do is focus on the end user perspective.

A kernel driver is deployed to end-user machines and loads into memory on boot-up.

Real-time data is then sent to a central server, which can be interrogated as and when required.

Incident & Problems Scope

Forrester research has shown that 80% of the time during the lifecycle of an incident is spent trying to isolate the problem itself.

Source: Forrester (http://apmdigest.com/5-it-operations-challenges-%E2%80%93-and-1-main-cause)

Nexthink offer a way of shortening that timeframe, mapping out relationships between failing components to see where the problem has occurred.

For example, a user may ring with a general issue of a slow response time.

Ordinarily, a support analyst would then have to drill down through applications, servers, configuration mapping to see what may be affected and how.

Nexthink can demonstrate where the issue lies and could isolate the failing link in the chain a lot more rapidly.

Nexthink’s own interface can even be used to directly query the user’s asset to assist with the diagnosis.

The information gathered can also be used to supplement the CMDB in the ITSM tool.

All this could then be used to drive more accurate logging and categorisation, and linking to any subsequent processes to resolve the situation.

The knock-on benefit is improved resolution times, potential workarounds and knowledge-base material, not to mention improved reporting.

When Nexthink is integrated with an ITSM tool, the support analysts will work off the ITSM console, but they will have a Nexthink button to be able to access the real-time analytics data.

Looking in the context of incidents and problems, whether major or otherwise, the ability to have multiple teams looking at the related end user data in terms of applications and services is invaluable.

Conclusion

There is a lot to appeal to the technical heart, looking at the depth of analytical data possible.

A key point to remember, though: it takes everything from the end-user point of view, and is not geared to sit on servers themselves to do that level of monitoring.

Taking just Incident and Problem Management, it is easy to see how the investigation can be shortened as an incident call comes in.

But looking at Problem Management, it can take proactive root cause analysis to another level.

If that is then combined with ITSM tools and their own abilities to manage multiple records (in the case of Major Incidents or Problems) then it is a powerfully complementary part of a company’s overall ITSM strategy.


In Their Own Words:

Nexthink provides unique real-time end-user IT analytics across the complete infrastructure. This perfectly complements existing application performance monitoring systems to drive better ITSM initiatives.

End-user IT analytics are the path to better IT quality, security and great efficiency and cost savings.

Nexthink provides IT organizations with a real-time view of the IT activity and interaction across the complete enterprise, from the end-user perspective. This unique visibility and analytics give IT the capability to truly evaluate and understand how organizations are performing and rapidly diagnose problems or identify security risks. Nexthink uniquely collects in real time millions of events and their respective dependencies and relationship to IT services from all users, all their applications, all their devices, all workloads, and all network connection patterns (server accessed, ports, response time, duration, failure, timeouts, etc.).

Nexthink helps IT connect, communicate and collaborate to achieve their major initiatives and improve their business end-users’ IT experience. Nexthink is complementary, hence integrates well with traditional application performance management (network and server), help desk, operations management, and security tools, and eases ITIL change and release management processes.


2013 Incident and Problem Tools Review



INTRODUCTION

Incident and Problem Management are such mainstays of an ITSM tool that it is quite hard to find a way to dig through the differentiators.

The processes and their related workflows seem so straightforward that you have to ask: are there really any ways to improve?

Not only that, but it has to be looked at in the context of the trends in the industry to focus on the end-user’s experience. That’s all fine when we take a look at the options available to an end-user logging an incident from a self-service portal.

But in reality, people still call service desks.

The answer is – there are ways to improve, and in many ways they are subtle features that make tools stand out.

This review brought out nuances and features to help make a couple of mature processes look exciting again:

  • Stylish use of forms, questions and linkage to knowledge bases
  • Resourcing and task planning
  • Real-time end-user analytics

These tools do more than just provide a mechanism to move an incident or a problem from A to B.

They look to improve the lifecycle, and put into practice the points of the Process Certification that vendors put themselves through.

A word should be said, though, about the knowledge levels of the people who market these products day in, day out.

I would like to share an insightful tweet from Forrester’s Stephen Mann

 

In both the reviews I have done, it has been good to work with qualified consultants who have a very good understanding of balancing what the tool can do, functionally, against what the real world sometimes requires.

The devil for all these tools is in the detail of the customisation – any tool, with dedicated customisation, knowledge, and pragmatic consultancy, can get the best out of any record-pushing mechanism.

Having replaced many a tool in large-scale ITSM deployments, I often recognised shortcomings in both the outgoing and the incoming tool-set.

But the key question remains – can the vendor impart a sense of comfort that they not only understand their tool and the processes that need to be translated into workflow, but can also identify ways to improve?

Having people who not only understand the tool, but also recognise the need to encompass evolving best practices goes a long way to make a tool stand out from its peers in the crowd.


MARKET POSITIONING

For the purposes of this review, vendors were classified based on their primary market focus, and product capabilities.

  • Axios assyst – Target market size: Large, Very Large; Discovery: own tool; Event Management & Monitoring: third-party integration
  • BMC FootPrints – Target market size: Medium, Large; Discovery: own tool; Event Management & Monitoring: own (via integration)
  • Cherwell Service Management – Target market size: Small, Medium, Large, Very Large; Discovery: own tool; Event Management & Monitoring: third-party integration
  • Nexthink – Target market size: Small, Medium, Large, Very Large; Discovery: own tool; Event Management & Monitoring: third-party integration
  • TOPdesk – Target market size: Small, Medium, Large; Discovery: own tool; Event Management & Monitoring: third-party integration

COMPETITIVE OVERVIEW

Below is a high-level overview of the competitive differences between the tools:

  • Elevator Pitch – An independent assessment of what this module has to offer
  • Strengths – key positive points, highlighted during the review
  • Weaknesses – areas perceived to be lacking, during the review
Axios assyst

Elevator Pitch: A tidy interface, driven by product hierarchies, and backed up with a potentially powerful CMDB. Work put in to customise the Info Zone, Guidance and FAQs can make the job of the Service Desk and Analysts, and even the end-user interaction, easier.

Strengths:
  • Crisp and clean interface, with not much clutter
  • From a self-service point of view, a nice touch in walking end users through investigation before logging a ticket
  • For those logging directly with the service desk, pulls in pre-populated forms and guidance to make that role easier/more efficient

Weaknesses:
  • Very much rooted in the technical – with the product hierarchy very comprehensive. Would be nice to see perhaps an incorporation of more business language
  • The ability to record an analyst's time against a charge code also seems to drive a specific cost as well – whilst this could just be a notional cost, some form of correlation between the two, removing the need for the analysts to know financials as well as resolving an incident, might be more beneficial
  • There are some elements of earlier ITIL iterations in the tool, as nothing is taken out, which could be cumbersome to customise out
BMC FootPrints

Elevator Pitch: An improved interface and comprehensive coverage of Incident and Problem Management, with some added innovation to make scheduling work a little easier for Service Desks and support staff alike.

Strengths:
  • Logging by Type, Category and Symptom adds a meaningful level of granularity
  • Incorporates an availability-of-resources view by integrating with Outlook/Exchange
  • Subscription function for end users for major incidents, as well as pop-ups for potential SLA breaches

Weaknesses:
  • Design elements behind the scenes are still largely text based
Cherwell Service Management

Elevator Pitch: Cherwell use intelligent interfaces and well-constructed forms to automate the basics of the processes in a comprehensive and informative way.

Strengths:
  • Core stages of process management as part of the user interface
  • In-context configuration mapping that makes handling concurrent incident and problem mapping very easy
  • Potential depth of customisation in terms of use of forms (Specifics) lends itself to improving/enhancing investigation and first-time fix

Weaknesses:
  • While promotion to a Major Incident, automatic raising of a Problem, linkage to the Global Alerts feature and the ability for users to indicate they are affected too from Self Service is great, that indication is linked to the automatically linked problem record, not the Major Incident
  • Customers seem to have indicated interest in linkage to the Major Incident as an out-of-the-box capability and it would make sense to provide it
Nexthink

Elevator Pitch: If systems management monitoring takes care of servers, Nexthink presents everything you need to know about the end-user side of the coin. Nexthink sits apart from the nuts and bolts of Service Management tooling, but offers guidance to analysts to help expedite resolution with real-time End-user IT Analytics, integrated into major ITSM tools to significantly reduce problem diagnosis times.

Strengths:
  • Lightweight kernel-driven footprint on end-user targets helps define trouble spots in real time
  • Complements existing IT Service Management deployments

Weaknesses:
  • With so much technical capability, it needs a very strong balancing hand of strategy to get the best out of the combination of this product, a service management suite, and server monitoring
TOPdesk

Elevator Pitch: TOPdesk adds Kanban-style resource scheduling to add a new dimension to Incident and Problem Management.

Strengths:
  • The Plan Board incorporates a Kanban-style approach to scheduling tasks to help drive efficient resourcing
  • Keywords trigger standard solutions, linking into a two-tier Knowledge Base (for Analysts and End Users)
  • Task Board for individual support staff can be sliced and diced by the most time-critical events

Weaknesses:
  • Sometimes “over-customisability” can rear its head in reviews – just because it is possible to have 7 different priorities, it does not mean it is good practice to do so
  • Some terminology (which can be changed with a little more detailed knowledge) can be a little cumbersome – for example, “Objects” for “Assets”

CUSTOMERS

Approximate number of customers for each vendor:

  • Axios assyst – 1000+
  • BMC FootPrints – Approximately 1000 customers across Europe and 5000 worldwide
  • Cherwell Service Management – 400+
  • Nexthink – 400
  • TOPdesk – approximately 3,150 TOPdesk Enterprise customers, more than 5,000 unique customers in total

Analysis

Axios assyst
Functionality: A tidy interface with a lot of focus on driving the product hierarchies for categorisation. Pre-populated forms and scripted guidance for the service desk. Chat function for support staff to collaborate.
Innovation: Axios focus on ways to automate as much as possible.
Analysis: Backed up with a very comprehensive CMDB structure at its core, work put into the configuration of a system up front will reap rewards in efficiency down the line.

BMC FootPrints
Functionality: Great to see a vendor improve from customer (and analyst) feedback; the result is a modern-looking tool that handles the “bread and butter” tasks of Incident and Problem efficiently.
Innovation: FootPrints links to Microsoft Exchange to display a view of the support staff resources and allocation of repetitive tasks. Logging by Type, Category and potentially Symptom adds an appealing level of granularity.
Analysis: BMC FootPrints is not alone in exploring and incorporating a view of the support staff resources, and it is evolving to be a very smart-looking, mid-market offering that can punch above its weight.

Cherwell Service Management
Functionality: Cherwell add a number of features that make the process speedier – and their Specifics forms provide a great touch in terms of initial investigation.
Innovation: Cherwell get the balance right, with customisable features (forms and macros), and include a breadcrumb trail throughout the lifecycle of the record.
Analysis: Cherwell recognise that it is not just IT functions that need to use the tool – the Impact and Urgency in business language (Incident) and their other features all make it a roundly comprehensive tool to appeal to organisations of all sizes.

Nexthink
Functionality: Nexthink is not a traditional ITSM tool. Instead it offers a chance for support analysts to proactively resolve issues faster by means of end-user real-time analytics.
Innovation: Its power comes from being able to assess elements from an end-user perspective, and it integrates with existing ITSM tools to provide a comprehensive view of an end-user’s machine.
Analysis: There are a number of ways that Nexthink and ITSM tools can co-exist – Nexthink is a powerful enabler for much more proactive incident and problem resolution.

TOPdesk
Functionality: TOPdesk use wizards and keyword matching to help drive efficient Incident and Problem logging and resolution.
Innovation: TOPdesk takes resource planning to another level, planning shift patterns and operating a Kanban-style method of dragging and dropping tasks to less loaded support staff.
Analysis: The whole combination of the resource board, the way their task board can focus on the most pressing items first, and their links to Knowledge Management made this a very attractive tool to review. There were some configuration niggles, which can all be customised (some more easily than others), but it is certainly heading in the right direction.

BEST IN CLASS

Vendor – End-User Base (S = Small, M = Medium, L = Large, VL = Very Large) – Product Characteristics

  • Axios assyst – End-user base: L, VL – Specialised Service Management Suite; Integration for Event Monitoring
  • BMC FootPrints – End-user base: M, L – Specialised Service Management Suite; Integration for Event Monitoring
  • Cherwell Service Management – End-user base: S, M, L, VL – Specialised Service Management Suite; Integration for Event Monitoring
  • Nexthink – End-user base: S, M, L, VL – Specialised tooling, requiring integration to ITSM
  • TOPdesk – End-user base: S, M, L – Specialised Service Management Suite; Integration for Event Monitoring

Best in Class (Small-Med-Large) – BMC FootPrints

BMC FootPrints have taken on board customer feedback, and even observations from previous reviews to make subtle but very noticeable adjustments to their interface.

The result is a tool that offers more intuitive investigation diagnostics as calls are being logged, and is continually looking to improve.

FootPrints is getting a real benefit from being part of the larger BMC brand, but is fast establishing itself as a tool to appeal across the entire market-place.

Best in Class (Small-Med-Large-V Large) – Cherwell

As with FootPrints, the inclusion of diagnostic forms, within records, linked to the categories makes Cherwell stand out when logging Incidents, in particular.

Best in Class (All Tools): TOPdesk

The inclusion of the Kanban-style resourcing board, but also the way in which tasks can be placed and moved about really made this stand out, in terms of the way that innovation within a tool can really make processes less cumbersome.

Honourable Mention: Nexthink

This tool deserves to stand apart from its Service Management cousins.

It adds a unique element, which can truly help drive efficiencies, especially where Problem Management is concerned.

With the right business drivers and strategic vision, not to mention strong partnership with some of the ITSM industry big-hitters, Nexthink’s real-time end-user analysis can help in so many more service management disciplines.  I feel we have only scratched the surface of its potential.




DISCLAIMER, SCOPE & LIMITATIONS

The information contained in this review is based on sources and information believed to be accurate as of the time it was created. Therefore, the completeness and current accuracy of the information provided cannot be guaranteed. Readers should therefore use the contents of this review as a general guideline and not as the ultimate source of truth.

Similarly, this review is not based on rigorous and exhaustive technical study. The ITSM Review recommends that readers complete a thorough live evaluation before investing in technology.

This is a paid review. That is, the vendors included in this review paid to participate in exchange for all results and analysis being published free of charge without registration. For further information please read the ‘Group Tests’ section on our Disclosure page.

Coming Soon: Axios, BMC, Cherwell, NetSupport, TOPdesk & Nexthink Slog it out

Incident and Problem Product Review
Axios, BMC, Cherwell, NetSupport, TOPdesk & Nexthink slog it out for our Incident and Problem Management review

Axios, BMC, Cherwell, NetSupport, TOPdesk and Nexthink are confirmed participants for our upcoming ‘Incident and Problem Management’ review.

Our Assessment Criteria at a Glance:

  • Logging & Categorization
  • Tracking
  • Lifecycle Tracking
  • Prioritisation
  • Escalations
  • Major Incidents and Problems
  • Incident and Problem Models
  • Incident and Problem Closure

Full details of the assessment criteria can be found here.

Reviewer: Ros Satar 

Confirmed Participants:

All results will be published free of charge without registration on The ITSM Review. You may wish to subscribe to the ITSM Review newsletter (top right of this page) or follow us on Twitter to receive a notification when it is published.

Assessment Criteria for Incident and Problem Management

We will soon begin our review of Incident and Problem Management offerings in the ITSM marketplace. As with our previous comparison of Request Fulfilment, our goal is to highlight the key strengths, competitive differentiators and innovation in the industry.

During Request Fulfilment our original aim was to look at how the tool supported the process, but refreshingly vendors who participated also shared their experiences and some insight into their consulting approaches.

When assessing the bread-and-butter elements of Incident Management, the challenge will be to identify the true differentiators in a discipline that is quite rigid.

We would like to encourage the same philosophy of identifying how deployment experiences have shaped the evolution of tools.

Incorporating Incident & Problem Management for a review

In my experience, deployments often implement Request Fulfilment and Incident Management in the early phases of projects, while Problem Management is left until later phases.

Yet the two processes, in tool terms, are often linked together – quite often the record functionality and layout is the same for Incident and Problem (and Request, for that matter).

My assessment criteria for the Incident and Problem Management review are below, if you have any comments or recommendations please contact us.


Suggested Criteria for Incident & Problem Management

Overall Alignment

  • Have our target vendors aligned to ITIL and if so, to which version?
  • How do they set up roles and users to perform functions?
  • What demo capabilities can they offer potential customers?

Logging & Categorisation

These can either be made simple, and to great effect, or made so complex that they become irrelevant as the Service Desk totally ignores them and picks the first thing on the list!

  • What information is made mandatory on the incident and problem record?
  • What categories and/or sub-categories are provided out-of-the-box?
  • How easy is it to customise these fields and values?
  • Show us how incident/problem matching and linkage to known errors are presented to users and/or service staff to expedite the process.
  • How much administration is needed to do bespoke changes?

Tracking

“Oh come on now,” I hear you cry, “What tool cannot track incidents and problems?”

But there can be a lot more to tracking these records than meets the eye:

  • What statuses are included out-of-the-box, and how easy is it to add/modify status definitions to suit customer requirements?
  • Can your tool show how many “hops” a record may face if wrongly assigned?

Lifecycle Tracking

Perhaps the best way of allowing vendors to show off their tool’s capabilities is for them to really go to town in terms of playing out scenarios.

The aim of this assessment is to look at how tools can help keep communication going during the lifecycle of an incident/problem and its linkage to other processes.

  • First time Fix from the Service Desk
  • Resolved via support group(s)
  • Demonstrating visibility of the incident/problem through its lifecycle, from end-user, Service Desk and support group(s) points of view
  • Linkage to other processes

Prioritisation

  • How are priorities determined and managed (out-of-the-box)?
  • What happens when the priority is adjusted during the lifecycle of the incident/problem?
  • We would like to also give vendors an opportunity to show us how they link SLAs to Incidents.

Escalations

  • Demonstrate routing to multiple groups
  • Show the tool’s capability for handling SLA breaches – in terms of notifications and the effect on the Incident record during that time
  • Show us what happens when an incident/problem has NOT been resolved satisfactorily
  • Demonstrate integration between incident and other processes

Major Incidents and Problems

Much like categorisation, this can be kept very simple, or made so complex that more time is spent negotiating the process than fixing the issue in the first place.

  • Provide an end-to-end scenario to demonstrate how the tool handles the management and co-ordination across multiple groups for a Major Incident & Problem

Incident & Problem Models

All of the above criteria are what I consider the basics of an ITSM tool.

But I am keen to delve deeper into what vendors understand by the concept of Models.

In turn, how can their tools add significant value in this area?

There are several ways of looking at this concept (there will be no points for throwing it over the fence to Problem Management and focussing on Known Errors).

There are assessment criteria around the handling of Models, and we want to see how tools help in this aspect.

  • Demonstrate how your tool facilitates the use of Models (include if/where links to other relevant processes/support groups as part of the demonstration).

Incident & Problem Closure

It makes sense to end our list of assessment criteria examining how tools resolve and/or close incidents and problems by default.

  • Show how an incident/problem is routed for closure.

This assessment will be quite scenario heavy, and we want to give participating vendors the freedom to develop their scenarios without limiting them to defined parameters (for example, specifying which service has failed, or which groups to use).

A key part of the assessment will also include how flexible the tool is with regards to customisation.

Incident Management can sometimes be taken for granted, so we would like participating vendors to really take a look at how Incident Management can be made “everyone’s” business.

But more importantly, Problem Management is often left to later phases, while organisations focus on processes like Request Fulfilment, Incident and Change – perhaps there is a case to make for implementing them hand in hand?


What is your view, what have we missed?

Please leave a comment below or contact us. Similarly if you are a vendor and would like to be included in our review, please contact us.

Rob England: Incident Management at Cherry Valley, Illinois

It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.

An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h, pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.

In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.

Today we will be focused on the rail traffic controller (RTC) who was the on-duty train dispatcher at the CN‘s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.

While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.

Back in an earlier article we talked about the services of a railroad. One of the major services is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”, such as giant ethanol trains, and the “hotshot” trans-continental double-stack container trains with nine locomotives that get rail-fans like me all excited. Some of the goods carried are perishables: fruit and vegetables from California, stock and meat from the midwest, all flowing east to the population centres of the USA.

The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.

A number of RTCs and other staff in Homewood switched into familiar routines:

  • The US rail network is complex – a true network. Trains were scheduled to alternate routes, and traffic on those routes was closed up as tightly bunched together as the rules allowed to create extra capacity.
  • Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
  • Customer relations staff called clients to negotiate new delivery times.
  • Traffic managers searched rail yard inventories for alternate stock of ethanol, that could be delivered early.
  • Crew managers told crews to pick up their trains in new locations and organised transport to get them there.

Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.

This is Incident Management.

None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.

At least I don’t think it is. I think they had a problem.

An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.

To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.


The RBS Glitch – A Wake Up Call?

More than a fortnight (from the last couple of weeks of June) after a “glitch” affected Royal Bank of Scotland (RBS), Natwest and Ulster Bank accounts, the fall-out continues with the manual processing backlog still affecting Ulster Bank customers.

Now, the Online Oxford Dictionary defines a glitch as:
a sudden, usually temporary malfunction or fault of equipment

I don’t think anyone affected would see it in quite the same way.

So when did this all happen?

The first I knew about it was a plaintive text from a friend who wanted to check her balance, and could not because:
“My bank’s computers are down”
By the time the evening rolled around, the issue had become national news and it was very clear that this was more than just a simple outage.

On the night of Tuesday 19th June, batch processes to update accounts were not being processed and branches were seeing customer complaints about their balances.

As the week progressed, it became clear that this was no simple ‘glitch’, but the result of some failure somewhere, affecting 17 million customers.

What actually happened?

As most people can appreciate, transactions to and from people’s accounts are typically handled and updated using batch processing technology.

However, that software requires maintenance, and an upgrade to the software had to be backed out. As part of the back-out, it appears that the scheduling queue was deleted.

As a result, inbound payments were not being registered and balances were not being updated correctly, with the obvious knock-on effect of funds showing as unavailable for bills to be paid, and so on.

The work to fix the issues meant that all the information that had been wiped had to be re-entered.

Apparently the order of re-establishing accounts was RBS first, then NatWest, and customers at Ulster Bank were still suffering the effects as we moved into early July.

All the while news stories were coming in thick and fast.

The BBC reported on someone who had to remain an extra night in jail as his parole bond could not be verified.

House sales were left in jeopardy as money was not showing as being transferred.

Even if you did not have your main banking with any of the three banks in the RBS group, you were likely to be affected.

If anyone in your payment chain banked with any of those banks, transactions were likely to be affected.

Interestingly enough, I called in to a local branch of one of the affected banks in the week of the crisis, as it was the only day I had to pay in money, and it was utter chaos.

And I called in again this week and casually asked my business account manager how things had been.

The branches had very little information coming to them at the height of the situation.

When your own business manager found their card declined while buying groceries that week, you have to wonder about the credibility of their processes.

Breaking this down, based on what we know

Understandably, RBS has been reticent to provide full details, and there has been plenty of discussion as to the reasons, which we will get to, but let’s start by breaking down the events based on what we know.

  • Batch Processing Software

What we are told is that RBS uses CA Technologies’ CA-7 batch processing software.

A back-out error was made after a failed update to the software, when the batch schedule was completely deleted.

  •  Incidents Reported

Customers were reporting issues with balance updates to accounts early on in the week commencing 20th June, and soon it became clear that thousands of accounts were affected across the three banks.

Frustratingly, some, but not all, services were affected – ATMs were still working for small withdrawals, but some online functions were unavailable.

  •  Major Incident

As the days dragged on, and the backlog of transactions grew, the reputation of RBS and NatWest in particular came under fire.

By the 21st June, there was still no official fix date, and branches of NatWest were being kept open for customers to be able to get cash.

  •  Change Management

Now we get to the rub.

Initial media leaks pointed to a junior administrator making an error in backing out the software update and wiping the entire schedule, causing the automated batch process to fail.

But what raised eyebrows in the IT industry initially, was the thorny subject of outsourcing.

RBS (let me stress, like MANY companies) has outsourced elements of IT support off-shore.

Some of that has included administration support for their batch processing, but with a group also still in the UK.

Many of these complex systems have unique little quirks.  Teams develop “in-house” knowledge, and knowledge is power.

Initial reports seemed to indicate that the fault lay with the support and administration for the batch processing software, some of which was off-shore.

Lack of familiarity with the system also pointed to possible issues in the off-shoring process.

However, in a letter to the Treasury Select Committee, RBS CEO Stephen Hester informed the committee that the maintenance error had occurred within the UK-based team.

  •  Documentation

The other factor is the human need to have an edge on the competition – after all, knowledge is power.

Where functions are outsourced, there are two vital elements that must be focussed on (and all too often are either marginalised or ignored due to costs):

1)      Knowledge Transfer

I have worked with many clients where staff who will be supporting the services to be outsourced are brought over to learn (often from the people whose jobs they will be replacing).

Do not underestimate what a very daunting and disturbing experience this will be, for both parties concerned.

2)      Documentation

Even if jobs are not being outsourced, documentation is often the scourge of the technical support team.  It is almost a rite of passage to learn the nuances of complex systems.

Could better processes help?

It is such a negative situation, I think it is worth looking at the effort that went into resolving it.

The issues were made worse by the fact that the team working to resolve the problem could not access the record of transactions that were processed before the batch process failed.

But – the processes do exist for them to manually intervene and recreate the transactions, albeit via lengthy manual intervention.

Teams worked round the clock to clear the backlog, as batches would need to be reconstructed once they worked out where they failed.

In Ulster Bank’s case, they were dependent on some NatWest systems, so again something somewhere must dictate the order in which to recover, else people would be trying to update accounts all over the place.

Could adherence to processes have prevented the issue in the first place?

Well, undoubtedly. After all, this is not the first time the support teams will have updated their batch software, nor will it have been the first time they have backed out a change.

Will they be reviewing their procedures?

I would like to hope that the support teams on and off shore are collaborating to make sure that processes are understood and that documentation is bang up-to-date.

What can we learn from this?

Apart from maybe putting our money under the mattress, I think this has been a wake up call for many people who, over the years, have put all their faith in the systems that allow us to live our lives.

Not only that, though, but in an environment where quite possibly people have been the target of outsourcing in their own jobs, it was a rude awakening to some of the risks of shifting support for complex integrated systems without effective training, documentation and, more importantly, back-up support.

Prior to Mr Hester’s written response to the Treasury Select Committee, I had no problem believing that elements such as poor documentation/handover, and a remote unfamiliarity with a system could have resulted in a mistaken wipe of a schedule.

What this proves is that anyone, in ANY part of the world can make a mistake.

7 Benefits of Using a Known Error Database (KEDB)

KEDB - a repository that describes all of the conditions in your IT systems that might result in an incident for your customers.

I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?

The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.

Most of all because it makes the job of the Servicedesk easier.

It is true to say that an effective KEDB can both increase the quality and decrease the time for Incident resolution.

The Aim of Problem Management and the Definition of “The System”

One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.

Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.

When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:

  • Hardware components
  • Software components
  • Networks, connectivity, VPN
  • Services – in-house and outsourced
  • Policies, procedures and governance
  • Security controls
  • Documentation and Training materials

Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation would cause an Incident. A user may rely on this documentation and make assumptions on how to use a service, discover they can’t and contact the Servicedesk.

This documentation component has caused an Incident and would be considered the root cause of the Problem.

Where the KEDB fits into the Problem Management process

The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.

As users report issues, support engineers would follow the normal steps in the Incident Management process: logging, categorisation, prioritisation. Soon after that they should be on the hunt for a resolution for the user.

This is where the KEDB steps in.

The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.
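
To make that search-and-retrieve interaction concrete, here is a minimal, hypothetical sketch (not any particular ITSM product's API or schema): Problem records carry illustrative known_error and workaround fields, and the engineer's search terms are matched against the Known Error text to surface the approved Workaround.

```python
# Hypothetical sketch of a KEDB lookup: Problem records carry "known_error"
# and "workaround" attributes, and engineers search the Known Error text
# to retrieve the approved Workaround. Field names are illustrative only.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProblemRecord:
    number: str
    known_error: str            # customer-facing description, used as the search key
    workaround: Optional[str]   # approved steps, or None if nothing is published yet


KEDB: List[ProblemRecord] = [
    ProblemRecord(
        number="PRB0001",
        known_error=("When accessing the Timesheet application using Internet "
                     "Explorer 6, users see 'Javascript exception at line 123' "
                     "when submitting the form."),
        workaround=("Add the timesheet application to the list of Trusted sites: "
                    "Internet Explorer > Tools > Options > Security Settings."),
    ),
]


def search_kedb(query: str) -> List[ProblemRecord]:
    """Return Problem records whose Known Error text contains all search terms."""
    terms = query.lower().split()
    return [p for p in KEDB if all(t in p.known_error.lower() for t in terms)]


# A Servicedesk engineer searching on the user's reported symptoms:
for hit in search_kedb("timesheet javascript"):
    print(hit.number, "->", hit.workaround or "No workaround published yet")
```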

The “Known Error”

The Known Error is a description of the Problem as seen from the user’s point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshots of error messages, as well as the text of the message, to aid searching. We should also include accurate descriptions of the conditions that users have experienced. These are the types of things we should be describing in the Known Error field. A good example of a Known Error would be:

When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.

The error message reads “Javascript exception at line 123”

The Known Error should be written in terms reflecting the customer’s experience of the Problem.

The “Workaround”

The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:

To workaround this issue add the timesheet application to the list of Trusted sites

1. Open Internet Explorer
2. Tools > Options > Security Settings [ etc etc ]

The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Servicedesk should take to help the user, has multiple benefits – some more obvious than others.

Seven Benefits of Using a Known Error Database (KEDB)

  1. Faster restoration of service to the user – The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience that the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error which makes the Problem easy to find also means that the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the user’s issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
  2. Repeatable Workarounds – Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk for the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
  3. Avoid re-work – Without a KEDB we might find that engineers are often spending time and energy trying to find a resolution for the same issue. This would be likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a user’s issue, to be told “Yes, I fixed this for someone else last week!”? Would you have preferred to have found that information in an easier way?
  4. Avoid skill gaps – Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team that are all experts in every functional area and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional. You might see a centralised application support function in a head office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer-facing issues.
  5. Avoid dangerous or unauthorised Workarounds – We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers, asked how they fixed issues, and internally winced at the methods they used. Disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. A well-maintained KEDB helps eliminate dangerous workarounds.
  6. Avoid unnecessary transfer of Incidents – A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else’s queue of work, often with not enough detailed context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
  7. Get insights into the relative severity of Problems – Well-written Known Errors make it easier to associate new Incidents with existing Problems. Firstly this avoids duplicate logging of Problems. Secondly it gives better metrics about how severe the Problem is. Consider two Problems in your system: a condition that affects a network switch and causes it to crash once every 6 months, and a transactional database that is running slowly and adding 5 seconds to timesheet entry. You would expect that the first Problem would be given a high priority and the second a lower one. It stands to reason that a network outage on a core switch would be more urgent than a slowly running timesheet system. But which would cause more Incidents over time? You might be associating 5 new Incidents per month against the timesheet problem whereas the switch only causes issues irregularly. Being able to quickly associate Incidents against existing Problems allows you to judge the relative impact of each one.

The KEDB implementation

Technically, when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least, a decent implementation would have it set up that way.

There is a one-to-one mapping between Known Error and Problem so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc) also holds the data you need for the KEDB.

It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in separate locations – but my own preference is to keep it all together.

Known Error and Workaround are both attributes of a Problem

Is the KEDB the same as the Knowledge Base?

This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.

I would argue that although your implementation of the KEDB might store its data in the Knowledgebase, they are separate entities.

Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.

The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.

Knowledgebase articles have a more permanent use. Although they too might be retired, if they refer to an application due to be decommissioned, they don’t have the same lifecycle as a Known Error record.

Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.

There is benefit in using the Knowledgebase as a repository for Known Error articles however. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of your implementation and typically your Knowledge tools will have nice authoring, linking and commenting capabilities.

What if there is no Workaround

Sometimes there just won’t be a suitable Workaround to provide to customers.

I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.

It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is normally a binary state – you either have adequate power or not.

A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.

If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.

Another example would be the February 29th Microsoft Azure outage. I’m sure a lot of customers experienced a Problem using many different definitions of the word but didn’t have a viable alternative for their users.

In this case there is still value to be found in the Known Error Database. If there really is no known workaround it is still worth publishing to the KEDB.

Firstly, to aid in associating new Incidents with the Problem (using the Known Error as a search key), and to stop engineers wasting time searching for an answer that doesn’t exist.

You could also avoid engineers trying to implement potentially damaging workarounds by publishing the fact that the correct action to take is to wait for the root cause of the Problem to be resolved.

Lastly, with a lot of Problems in our system we might struggle to prioritise our backlog. Having the Known Error published to help route new Incidents to the right Problem will bring the benefit of being able to prioritise your most impactful issues.

A user’s Known Error profile

With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.

Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.

If we understand our users’ environments through systems such as the Configuration Management System (CMS) or Asset Management processes, we should be able to determine a user’s exposure to Known Errors.

For example, when a user phones the Servicedesk complaining of an interruption to service, we should be able to quickly learn about her configuration: where she is geographically, which services she connects to, and her personal hardware and software environment.

With this information, and some Configuration Item matching, the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
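
As a rough illustration of that idea (hypothetical field names, not a real CMS integration), the sketch below matches the configuration items recorded for a user against the configuration items each Known Error is tagged with, to build her Known Error profile.

```python
# Hypothetical sketch: deriving a user's Known Error profile by matching her
# configuration items (from a CMS/asset system) against the CIs each Known
# Error is tagged with. Identifiers are illustrative only.
from typing import Dict, List, Set

# Known Errors tagged with the configuration items they relate to.
known_error_cis: Dict[str, Set[str]] = {
    "PRB0001": {"timesheet-app", "internet-explorer-6"},
    "PRB0002": {"branch-office-switch-LON01"},
}


def known_error_profile(user_cis: Set[str]) -> List[str]:
    """Known Errors the user is exposed to, given the CIs in her environment."""
    return [prb for prb, cis in known_error_cis.items() if cis & user_cis]


# A caller running the timesheet application on Internet Explorer 6:
user_environment = {"timesheet-app", "internet-explorer-6", "windows-xp"}
print(known_error_profile(user_environment))   # -> ['PRB0001']
```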

Measuring the effectiveness of the KEDB

As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.

Here are some metrics that would help give your KEDB a health check.

Number of Problems opened with a Known Error

Of all the Problem records opened in the last X days, how many have published Known Error records?

We should be striving to create as many high quality Known Errors as possible.

The value of a published Known Error is that Incidents can be easily associated with Problems avoiding duplication.

Number of Problems opened with a Workaround

How many Problems have a documented Workaround?

The Workaround allows for the customer Incident to be resolved quickly and using an approved method.

Number of Incidents resolved by a Workaround

How many Incidents are resolved using a documented Workaround? This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.

Number of Incidents resolved without a Workaround or Knowledge

Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge?

If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?

Are there gaps in our Knowledge Management, meaning that customers are contacting the Servicedesk and we don't have an answer readily available?

A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
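
As a rough sketch only (the record layouts and field names are assumptions, not any particular tool's export format), these health-check numbers could be pulled together like this:

    # Problems and Incidents exported from the ITSM tool (hypothetical shape).
    problems = [
        {"id": "PRB001", "known_error_published": True,  "workaround": "Restart the print spooler"},
        {"id": "PRB002", "known_error_published": False, "workaround": None},
    ]
    incidents = [
        {"id": "INC100", "resolved_with": "workaround"},
        {"id": "INC101", "resolved_with": "knowledge_article"},
        {"id": "INC102", "resolved_with": None},            # engineer researched their own fix
    ]

    total_problems         = len(problems)
    with_known_error       = sum(p["known_error_published"] for p in problems)
    with_workaround        = sum(p["workaround"] is not None for p in problems)
    resolved_by_workaround = sum(i["resolved_with"] == "workaround" for i in incidents)
    resolved_unaided       = sum(i["resolved_with"] is None for i in incidents)

    print(f"Problems with a Known Error published: {with_known_error}/{total_problems}")
    print(f"Problems with a Workaround:            {with_workaround}/{total_problems}")
    print(f"Incidents resolved by a Workaround:    {resolved_by_workaround}")
    print(f"Incidents resolved without Workaround or Knowledge: {resolved_unaided}")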

OLAs

We want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents with existing Problems.

One method of measuring how quickly we are publishing Known Errors is to use Operational Level Agreements (or SLAs if your ITSM tool doesn't define OLAs).

We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.

You could consider tracking Time to generate Known Error and Time to generate Workaround as performance metrics for your KEDB process.
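
For illustration, here is a minimal sketch of those two measurements against an example OLA target; the timestamps and the four-hour target are invented for the example.

    from datetime import datetime, timedelta

    OLA_TARGET = timedelta(hours=4)                         # example target only

    problem_opened        = datetime(2013, 2, 19, 9, 0)
    known_error_published = datetime(2013, 2, 19, 11, 30)
    workaround_published  = datetime(2013, 2, 19, 14, 15)

    time_to_known_error = known_error_published - problem_opened
    time_to_workaround  = workaround_published - problem_opened

    print("Time to generate Known Error:", time_to_known_error, "| within OLA:", time_to_known_error <= OLA_TARGET)
    print("Time to generate Workaround: ", time_to_workaround,  "| within OLA:", time_to_workaround <= OLA_TARGET)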

Additionally, we could measure how quickly Workarounds are researched, tested and published. If there is no known Workaround, that is still valuable information for the Servicedesk, as it eliminates effort in trying to find one, so an OLA would be appropriate here too.

In summary

A well-maintained KEDB shortens the time it takes to resolve Incidents, stops engineers duplicating diagnosis effort, and helps Problem Management prioritise its backlog. Like any other process, it only delivers that value if it is measured and kept up to date.

Planning for Major Incidents

Do regular processes go out of the window during a Major Incident?

Recently I’ve been working on Incident Management, and specifically on Major Incident planning.

During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.

For example, in an organisation I worked in previously we had a run of Storage Area Network outages. The first couple caused absolute mayhem, and I could see people pushing back against the idea of breaking out the process book because all that mattered was finding the technical fix and getting the storage back up and running.

At the end of the Incident, once we'd restored the service, we found that we had, perhaps unsurprisingly, a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.

ITIL talks about Major Incident planning in a brief but fairly helpful way:

A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.

The Incident model, its categories and states (New > Work In Progress > Resolved > Closed) all work fine, and we shouldn't be looking to stray too far from what we already have in terms of tools and process.

What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.

Working with a Major Incident

When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.

Where a normal Incident will be handled by a single person (the Incident Owner), we might find that multiple people are involved in a Major Incident: one to handle the overall co-ordination for restoring service, one to handle communications and updates, and so on.

Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications, confusing or conflicting updates are bound to happen, so split those tasks.

If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!

What is a Major Incident?

It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, otherwise the team won't know under what circumstances to start the process. Without clear guidance you might also find that a team treats a server outage as Major one week (with excellent communications) and handles the same outage the next week with poor communications.

Having this defined is an important step, but will vary between organisations.

Roughly speaking, a generic definition of a Major Incident could be (a rough classification sketch follows the list):

  • An Incident affecting more than one user
  • An Incident affecting more than one business unit
  • An Incident on a device of a certain type – Core switch, access router, Storage Area Network
  • Complete loss of a service, rather than degradation
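
Purely as an illustration of the criteria above (the thresholds and the device list are assumptions, and every organisation will set its own), the check might look like this:

    CORE_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

    def is_major_incident(users_affected: int,
                          business_units_affected: int,
                          device_type: str | None,
                          complete_loss_of_service: bool) -> bool:
        """Generic criteria only; each organisation must agree its own definition."""
        return (
            users_affected > 1
            or business_units_affected > 1
            or (device_type or "").lower() in CORE_DEVICE_TYPES
            or complete_loss_of_service
        )

    print(is_major_incident(1, 1, "Core switch", False))    # True  – core device affected
    print(is_major_incident(1, 1, "Laptop", False))         # False – single user, degradation only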

Is a P1 Incident a Major Incident?

No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.

Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably for cases where the impact spans multiple users.
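
To make that distinction concrete, here is an illustrative Impact/Urgency lookup of the kind many ITSM tools use; the matrix values are an assumption, not a standard.

    # (impact, urgency) -> priority, where 1 is the highest priority.
    PRIORITY_MATRIX = {
        ("high", "high"):     1,
        ("high", "medium"):   2,
        ("medium", "high"):   2,
        ("medium", "medium"): 3,
        ("low", "low"):       5,
    }

    def priority(impact: str, urgency: str) -> int:
        return PRIORITY_MATRIX.get((impact, urgency), 4)

    # A single-user Incident can still be P1 (high impact for that user, high urgency)
    # without invoking the Major Incident plan, which is reserved for high impact and
    # urgency across multiple users.
    print(priority("high", "high"))     # 1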

Do I need a single Incident or multiple Incidents for logging a Major Incident?

This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each user affected by the Incident when they contact the Servicedesk.

The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.

If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep for example) where another customer isn’t too bothered because they use the affected service less frequently.

Having an Incident opened for each user/customer allows you to judge the severity of the Incident exactly. The challenge then becomes to manage those Incidents easily, and to communicate consistently with your customers.
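
As a small sketch of that record-linking idea (the identifiers and fields are hypothetical), each user gets their own Incident, all pointing at the same Major Incident:

    major_incident = {"id": "MI001", "summary": "Email service outage"}

    incidents = [
        {"id": "INC200", "user": "Sales user",    "note": "pitch in 1 hour",     "parent": "MI001"},
        {"id": "INC201", "user": "User on leave", "note": "low personal impact", "parent": "MI001"},
    ]

    linked = [i for i in incidents if i["parent"] == major_incident["id"]]

    # Severity and communication can be judged per customer, while the overall
    # impact of the Major Incident is the set of linked Incidents.
    print(f"{major_incident['id']}: {len(linked)} affected users")
    for i in linked:
        print(" -", i["id"], i["user"], "|", i["note"])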

Is a Major Incident a Problem?

No, although if we don't already have a Problem record open for this Major Incident we should probably open one.

Remember the intended outcome of the Incident and Problem Management processes:

  • Incident Management: The outcome is a restoration of service for the users
  • Problem Management: The outcome is the identification and possibly removal of the causes of Incidents

The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources: the removal of the root cause, a documented Workaround, or a new Workaround that we have to find.

Although the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.

How can I measure my Major Incident Procedure?

I have some metrics for measuring the Major Incident procedure, and I'd love to know your thoughts in the comments for this article; a rough sketch of how they might be computed follows the list.

  • Number of Incidents linked to a Major Incident: Where we are creating Incidents for each customer affected by a Major Incident, we should be able to measure the relative impact of each occurrence.
  • The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan
  • Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, which would hope to see Major Incidents happen less frequently over time.
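
A rough sketch of how those three measurements might be computed from logged Major Incident records (the records and dates below are invented):

    from datetime import datetime, timedelta

    major_incidents = [
        {"id": "MI001", "logged": datetime(2013, 1, 10), "linked_incidents": 42},
        {"id": "MI002", "logged": datetime(2013, 2, 2),  "linked_incidents": 7},
        {"id": "MI003", "logged": datetime(2013, 3, 1),  "linked_incidents": 19},
    ]

    impact_per_mi = {mi["id"]: mi["linked_incidents"] for mi in major_incidents}
    number_of_mis = len(major_incidents)

    logged = sorted(mi["logged"] for mi in major_incidents)
    gaps = [later - earlier for earlier, later in zip(logged, logged[1:])]
    mean_time_between_mis = sum(gaps, timedelta()) / len(gaps)

    print("Incidents linked per Major Incident:", impact_per_mi)
    print("Number of Major Incidents:", number_of_mis)
    print("Mean Time Between Major Incidents:", mean_time_between_mis)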

There you go. In summary, handling Major Incidents isn't a huge leap from the method that you use to handle day-to-day Incidents. It requires enhanced communication and possibly additional measurement.

I hope that you found this article helpful.

Interview: Simon Morris, 'Sneaking ITIL into the Business'

Ignoring the obvious may lead to a nasty mess

I found Simon Morris via his remarkably useful ITIL in 140 app. Simon recently joined ServiceNow from a FTSE100 Advertising, Marketing and Communications group. He was Head of Operations and Engineering and part of a team that led the Shared Services IT organisation through its transition to IT Service Management process implementation. Here, Simon kindly shares his experiences of ITSM at the rock face.

ITSM Review: You state that prior to your ITSM transformation project you were ‘spending the entire time doing break-fix work and working yourselves into the ground with an ever-increasing cycle of work’. Looking back, can you remember any specific examples of what you were doing, that ITSM resolved?

Simon Morris:

Thinking back I can now see that implementing ITSM gave us the outcomes that we expected from the investment we made in time and money, as well as outcomes that we had no idea would be achieved. Because ITIL is such a wide-ranging framework I think it’s very difficult for organisations to truly appreciate how much is involved at the outset of the project.

We certainly had no idea how much effort would be spent overall on IT Service Management, but we were able to identify results early on, which encouraged us to keep going. By the time I left the organisation we had multiple people dedicated to the practice, and of course ITSM processes affect all engineering staff on a day-to-day basis.

As soon as we finished our ITILv3 training we took the approach of selecting processes that we were already following, and adding layers of maturity to bring them into line with best practice.

I guess at the time we didn’t know it, but we started with Continual Service Improvement – looking at existing processes and identifying improvements. One example that I can recall is Configuration Management – with a very complex Infrastructure we previously had issues in identifying the impact of maintenance work or unplanned outages. The Infrastructure had a high rate of change and it felt impossible to keep a grip on how systems interacted, and depended on each other.

Using Change Management we were able to regulate the rate of change, and keep on top of our Configuration data. Identifying the potential impact of an outage on a system was a process that went from hours down to minutes.

Q. What was the tipping point? How did the ITSM movement gather momentum from something far down the to-do list to a strategic initiative?

If I’m completely honest we had to “sneak it in”! We were under huge pressure to improve the level of professionalism, and to increase the credibility of IT, but constructing the business case for a full ITSM transition was very hard. Especially when you factor in the cost of training, certification, toolsets and the amount of time spent on process improvement. As I said, at the point I left the company we had full time headcount dedicated to ITSM, and getting approval for those additional people at the outset would have been impossible.

We were lucky to have some autonomy over the training budget and found a good partner to get a dozen or so engineers qualified to ITILv3 Foundation level. At that point we had enough momentum, and our influence at departmental head level to make the changes we needed to.

One of the outcomes of our “skunkworks” ITIL transition that we didn’t anticipate at the time was a much better financial appreciation of our IT Services. Before the project we were charging our internal business units on a bespoke rate card that didn’t accurately reflect the costs involved in providing the service. Within a year of the training we had built rate cards that both reflected the true cost of the IT Service and included long-term planning for capacity.

This really commoditised IT Services such as Storage and Backup and we were able to apportion costs accurately to the business units that consumed the services.

Measuring the cost benefit of ITSM is something that I think the industry needs to do better in order to convince leaders that it’s a sensible business decision – I’m absolutely convinced that the improvements we made to our IT recharge model offset a sizeable portion of our initial costs. Plus we introduced benefits that were much harder to measure in a financial sense such as service uptime, reduced incident resolution times and increased credibility.

Q. How did you measure you were on the right track? What specifically were you measuring? How did you quantify success to the boss? 

It goes back to my point that we started by reviewing existing processes that were immature, and then adding layers to them. We didn't start out with process metrics, but we added them quite early on.

If I had the opportunity to start this process again I’d definitely start with the question of measurements and metrics. Before we introduced ITSM I don’t think we definitively knew where our problems were, although of course we had a good idea about Incident resolution times and customer satisfaction.

Although it’s tempting to jump straight into process improvement I’d encourage organisations at the start of their ITSM journey to spend time building a baseline of where they are today.

Surveys from your customers and users help to gauge the level of satisfaction before you start to make improvements. (Of course, this is a hard measurement to take, especially if you've never asked your users for honest feedback before; I've seen some pretty brutal survey responses in my time!)

Some processes are easier to monitor than others – Incident Management comes to mind as one that is fairly easy to gather metrics on; Event Management is another.

I would also say that having survived the ITIL Foundation course it’s important to go back into the ITIL literature to research how to measure your processes – it’s a subject that ITIL has some good guidance on with Critical Success Factors (CSFs) and Key Performance Indicators (KPIs).

Q. What would you advise to other companies that are currently stuck in the wrong place, ignoring the dog? (See Simon’s analogy here). Is there anything that you learnt on your journey that you would do differently next time? 

Wow, this is a big question.

Business outcomes

My first thought is that IT organisations should remember that our purpose is to deliver an outcome to the business, and your ITSM deployment should reflect this. In the same way that running IT projects with no clear business benefit or alignment to an overall strategy is a bad idea, we shouldn't be implementing ITIL just for the sake of doing it.

For every process that you design or improve, the first question should be “What is the business outcome?”, closely followed by “How am I going to prove that I delivered this outcome?”. An example for Incident Management would be an outcome of “restoring access to IT services within an agreed timeframe”, so the obvious answer to the second question is “measure the time to resolution”.

By analysing each process in this way you can get a clearer idea of what types of measurement you should take to:

  • Ensure that the process delivers value and
  • Demonstrate that value.

I actually think that you should start designing the process back-to-front. Identify the outcome, then the method of measurement and then work out what the process should be.
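
As a minimal back-to-front sketch for the Incident Management example above (the records, timestamps and eight-hour target are purely hypothetical), the measurement step might look like this:

    from datetime import datetime, timedelta

    AGREED_TIMEFRAME = timedelta(hours=8)                   # hypothetical target

    incidents = [
        {"id": "INC300", "opened": datetime(2013, 3, 4, 9, 0),  "resolved": datetime(2013, 3, 4, 12, 30)},
        {"id": "INC301", "opened": datetime(2013, 3, 4, 10, 0), "resolved": datetime(2013, 3, 5, 9, 0)},
    ]

    for inc in incidents:
        time_to_resolution = inc["resolved"] - inc["opened"]
        print(inc["id"], "time to resolution:", time_to_resolution,
              "| within agreed timeframe:", time_to_resolution <= AGREED_TIMEFRAME)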

Every time I see an Incident Management form with hundreds of different choices for the category (Hardware, Software, Keyboard, Server etc.) I always wonder if the reporting requirements were ever considered, or whether fields were just added for the sake of it.

Tool maturity

Next I would encourage organisations to consider their process maturity and ITSM toolset maturity as two different factors. There is a huge amount of choice in the ITSM suite market at the moment (of course I work for a vendor now, so I'm entitled to have a bias!), but organisations should remember that all any vendor offers is a toolset, and nothing more.

The tool has to support the process that you design, and it's far too easy to take a great toolset and implement a lousy process. A year into your transition to ITSM you won't be able to prove the worth of the time and money spent, and you run the risk of the process being devalued or abandoned.

Having a good process will drive the choice of tool, and the design decisions on how that tool is configured. Having the right toolset is a huge factor in the chances of a successful transition to ITSM. I've lived through experiences with legacy, unwieldy ITSM vendors and it makes the task so much harder.

Participation at every level

One of the best choices we made when we transitioned to ITSM was that we trained a cross-section of engineers across the company. Of the initial group of people to go through ITILv3 Foundation training we had engineers from the Service desk, PC and Mac support, Infrastructure, Service Delivery Managers, Asset management staff and departmental heads.

The result was that we had a core of people who were motivated enough to promote the changes we were making all across the IT department at different levels of seniority. Introducing change, and especially changes that measure the performance of teams and individuals will always induce fear and doubt in some people.

Had we limited the ITIL training to just the management team I don’t think we would have had the same successes. My only regret is that our highest level of IT management managed to swerve the training – I’ll send my old boss the link to this interview to remind him of this!

Find the right pace

A transition to ITSM processes is a marathon, not a sprint so it’s important to find the right tempo for your organisation. Rather than throwing an unsustainable amount of resource at process improvement for a short amount of time I’d advise organisations to recognise that they’ll need to reserve effort on a permanent basis to monitor, measure and improve their services.

ITIL burnout is a very real risk.

 


My last piece of advice is not to feel that you should implement every process on day one. I can’t think of one approach that would be more prone to failure. I’ve read criticism from ITSM pundits that it’s very rare to find a full ITILv3 implementation in the field. I think that says more about the breadth and depth of the ITIL framework than the failings of companies that implement it.

There’s an adage from the Free Software community – “release early, release often” that is great for ITSM process improvements.

By the time that I left my previous organisation we had iterated through 3 versions of Change Management, each time adding more maturity to the process and making incremental improvements.

I’d recommend reading “ITIL Lite: A Road Map to Full or Partial ITIL Implementation” by Malcolm Fry. He outlines why ITILv3 might not be fully implemented, and the reasons make absolute sense:

  • Cost
  • No customer support
  • Time constraints
  • Ownership
  • Running out of steam

IT Service Management is a cultural change, and it's worth altering people's working habits gradually over time, rather than exposing them to a huge amount of process change all at once.

Q. Lastly, what do you do at ServiceNow?

I work as a developer in the Application Development Team in Richmond, London. We’re responsible for the ITSM and Business process applications that run on our Cloud platform. On a day-to-day basis this means reviewing our core applications (Incident, Problem, Change, CMDB) and looking for improvements based on customer requirements and best practice.

Obviously the recent ITIL 2011 release is interesting as we work our way through the literature and compare it against our toolset. Recently I’ve also been involved in researching how best to integrate Defect Management into our SCRUM product.

The fact that ServiceNow is growing at an amazing rate (we're currently the second fastest-growing tech company in the US) shows that ITSM is being taken seriously by organisations, and that they are investing money to get the returns that a successful transition can offer. These should be encouraging signs to organisations that are starting their journey with ITIL.

@simo_morris