A structured approach to problem solving

Those who have worked in IT Operations have a strong affinity with the skills of problem solving and troubleshooting. Although a huge amount of effort goes into improving the resiliency and redundancy of IT systems, the ability to quickly diagnose the root cause of problems has never been more important.

IT Service Management has gone a long way towards making practices standardised and repeatable. For example, you don’t want individual creative input when executing standard changes or fulfilling requests. Standard Operating Procedures and process manuals mean that we expect our engineers and practitioners to behave in predictable ways. Those reluctant to participate in these newly implemented processes might even complain that all the fun has gone out of IT support.

A Home for Creative and Inquiring Minds?

However, there is still a place for creative and inquiring minds in ITSM-driven organisations. Complex systems are adept at finding new and interesting ways to break and stop functioning. Problem analysis still needs some creative input.

When I recruited infrastructure engineers into my team I was always keen to find good problem solvers. I’d find that some people were more naturally inclined to troubleshooting than others.

Some people would absolutely relish the pursuit of the cause of a difficult network or storage issue… thinking of possible causes, testing theories, hitting dead ends and starting again. They tackled problems with the mindset of a stereotypical criminal detective… finding clues, getting closer to the murder weapon, pulling network cables, tailing the system log.

These kinds of engineers would rather puzzle over the debug output from their core switch than get stuck into the daily crossword. I’m sure if my HR manager let me medically examine these engineers I’d find that the underlying psychological brain activity and feeling of satisfaction would be very similar to crossword puzzlers and sudoku players. I was paying these guys to do the equivalent of the Guardian crossword 5 days a week.

Others would shy away from troubleshooting sticky problems. They didn’t like the uncertainty of being responsible for fixing a situation they knew little about, or making decisions based on the loosest of facts.

They felt comfortable in executing routine tasks but lacked the capability to logically think through sets of symptoms and errors and work towards the root cause.

The problem I never solved

Working in a previous organisation I remember a particularly tricky problem. Users of Apple computers running Microsoft PowerPoint would regularly find that their open presentation would lock and stop them saving. They would have to save a new version and rename the file back to its original name.

It was a typical niggling problem that rumbled on for ages. We investigated different symptoms, spent a huge amount of time running tests and debugging network traces. We rebuilt computers, tried moving data to different storage devices and found the root cause elusive. We even moved affected users between floors to rule out network switch problems.

We dedicated very talented people to resolving the problem and made endless promises of progress to our customers, all of which proved false as we remained unable to find the root cause of the problem.

Our credibility ran thin with that customer and we were alarmed to discover that our previous good record of creatively solving problems in our infrastructure was under threat.

What’s wrong with creative troubleshooting?

The best troubleshooters in your organisation share some common traits.

  • They troubleshoot based on their own experiences
  • They (probably) aren’t able to always rationalise the root cause before attempting to fix it

Making assumptions based on your experiences is a natural thing to do – of course as you learn skills and go through cycles of problem solving you are able to apply your learnings to new situations. This isn’t a negative trait at all.

However it does mean that engineers approach new problems with a potentially limited set of skills and experiences. To network engineers all problems look like a potentially loose cable.

Not being able to rationalise the root cause reflects the balance between intuition and evidence gathered through research. Your troubleshooter will work towards the root cause and will sometimes have hard evidence to confirm it.

“I can see this in the log… this is definitely the cause!”

But in some cases the cause might be suspected, but you aren’t able to prove anything until the fix is deployed.

Wrong decisions can be costly

Attempting the wrong fix is expensive in many ways, not least financially. It’s expensive in terms of time, user patience and most critically the credibility of IT to fix problems quickly.

Expert troubleshooters are able to provide rational evidence that confirms their root cause before a fix is attempted.

A framework is needed

As with a lot of other activities in IT, a process or framework can help troubleshooters identify the root cause of problems quickly. In addition to providing quick isolation of the root cause, the framework I’m going to discuss can provide evidence as to why we are suggesting this as the root cause.

Using a common framework has other benefits. For example:

  1. To allow collaboration between teams – Complex infrastructure problems can span multiple functional areas. You would expect to find subject matter experts from across the IT organisation working together to resolve problems. Using a common framework in your organisation allows teams to collaborate on problems in a repeatable way. Should the network team have a different methodology for troubleshooting than the application support team?
  2. To bring additional resources into a situation – Often ownership of Problems will be handed between teams in functional or hierarchical escalation. External resources may be brought in to assist with the problem. Having a common framework allows individuals to quickly get an appraisal of the situation and understand the progress that has already been made.
  3. To provide a common language for problem solvers – Structured problem analysis techniques have their own terminology. Having a shared understanding of “Problem Area”, “Root cause” and “Probable cause” will prevent misunderstandings and confusion during critical moments.

The Kepner Tregoe Problem Analysis process

Kepner-Tregoe is a global management consultancy firm specialising in improving the efficiency of their clients.

The founders, Chuck Kepner and Ben Tregoe, were social scientists living in California in the 1950s. Chuck and Ben studied the methods of problem solvers and managers and consolidated their research into process definitions.

Their history is an interesting one and a biography of the organisation is outside the scope of this blog post – but definitely worth researching.

One of the processes developed, curated and owned by Kepner-Tregoe, is Structured Problem Analysis, known as KT-PA.

KT-PA is used by hundreds of organisations to isolate problems and discover the root cause. It’s a framework used by problem solvers and troubleshooters to resolve issues and provide rational evidence that the investigation has discovered the correct cause.

Quick overview of the process

1. State the Problem

KT-PA begins with a clear definition of the Problem. A common mistake in problem analysis is a poor description of the problem, which often leads to resources being dedicated to researching symptoms of the problem rather than the issue itself.

Having a clear and accurate Problem Statement is critical to finding the root cause quickly. KT-PA provides guidance on identifying the correct object and its deviation.

A typical Problem Statement might be:

Users of MyAccountingApplication are experiencing up to 2 second delays entering ledger information

This problem statement is explicit about the object (“Users of MyAccountingApplication”) and the deviation from normal service (“2 second delays entering ledger information”).

2. Specify the Problem

The process then defines how to specify the problem into Problem Areas. A Problem is specified in four dimensions, all of which should be considered: What, Where, When and Extent.

  1. What object has the deviation?
  2. What is the deviation?
  3. Where is the deviation on the object?
  4. When did the deviation occur?
  5. Extent of the deviation (how many deviations are occurring, what is the size of one deviation, is the number of deviations increasing or decreasing?)

The problem owner inspects the issue from these dimensions and documents the results, recorded in the format of IS and IS NOT. Using the IS/IS NOT logical comparison starts to build a profile of the problem. Even at this early stage certain causes might become more apparent or less likely.

Troubleshooters will already be getting benefit from the process. The fact that, in the Where dimension, the 2 second delay “IS Users in London” but “IS NOT Users in New York” is hugely relevant.

The fact that the delay occurs in entering ledger information but not reading ledger information is also going to help subject matter experts think about possible causes.
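To make the IS/IS NOT comparison concrete, here is a minimal sketch of how the specification above might be captured. The dimension names follow the process, but the data structure and the example values are my own illustration, not part of KT-PA:

```python
# A hypothetical IS / IS NOT specification for the ledger-delay problem above.
problem_spec = {
    "WHAT":   {"is": "2 second delay entering ledger information",
               "is_not": "Delay reading ledger information"},
    "WHERE":  {"is": "Users in London",
               "is_not": "Users in New York"},
    "WHEN":   {"is": "First noticed in August 2012",
               "is_not": "Reported before 30th July"},
    "EXTENT": {"is": "Every ledger entry for affected users",
               "is_not": "Getting worse over time"},
}

for dimension, facts in problem_spec.items():
    print(f"{dimension}: IS {facts['is']} | IS NOT {facts['is_not']}")
```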

3. Distinctions and Changes

Having specified the problem and made the IS/IS NOT comparisons for each problem area, the next step is to examine Distinctions and Changes.

Each answer to a specifying question is examined for Distinctions and Changes.

  • What is distinct about users in London when compared to users in New York? What is different about their network, connectivity or workstation build?
  • What has changed for users in London?
  • What is distinct about August 2012 when compared to July?
  • What changed around the 30th July?

As these questions are asked and discussed possible root causes should become apparent. These are logged for testing in the next step.
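As a small illustration of how Distinctions and Changes feed candidate causes, the same style of sketch can be extended. Apart from the August anti-virus rollout mentioned below, the entries here are invented purely for the example:

```python
# Hypothetical Distinctions and Changes recorded against each problem dimension.
distinctions_and_changes = {
    "WHERE (London vs New York)": {
        "distinctions": ["Different network connectivity", "Different workstation build"],
        "changes": [],
    },
    "WHEN (August 2012 vs July)": {
        "distinctions": [],
        "changes": ["New anti-virus installation rolled out across the company",
                    "Office internet link re-routed around 30th July"],
    },
}

# Every distinction or change is a prompt for a possible cause to log and test in the next step.
possible_causes = [entry
                   for answers in distinctions_and_changes.values()
                   for entry in answers["distinctions"] + answers["changes"]]
print(possible_causes)
```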

4. Testing the cause

The stage of testing the cause before confirmation is, for me, the most valuable step in the KT-PA process. It isn’t particularly hard to think of possible root causes for a problem. Thinking back to “the problem I never solved”, we had many opinions from different technical experts on what the cause might be.

If we had used KT-PA with that problem we could have tested each suspected cause against the problem specification to see how probable it was.

As an example, let’s imagine that during the Distinctions and Changes stage of our problem above, three possible root causes were suggested:

  • LAN connection issue with the switch the application server is connected to
  • The new anti-virus installation installed across the company in August is causing issues
  • Internet bandwidth in the London office is saturated

When each possible root cause is evaluated against the problem specification you are able to test it using the following question:

“If ‘LAN connection issue with the switch the application server is connected to’ is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

This possible root cause doesn’t sound like a winner. If there were network connectivity issues with the server wouldn’t all users be affected?

“If ‘the new anti-virus installation installed across the company in August’ is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

We came to this root cause because of a distinction and change in the WHEN problem dimension: in August a new version of anti-virus was deployed across the company. But this isn’t a probable root cause for the same reason: it can’t explain why New York users aren’t affected.

“If ‘Internet bandwidth in the London office is saturated’ is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

So far this possible root cause sounds the most probable. The cause can explain the WHERE dimension. Does it also explain the other dimensions of the problem?

“If ‘Internet bandwidth in the London office is saturated’ is the true cause of the problem, then how does it explain why the issue was first noticed in August 2012 and not reported before 30th July?”

Perhaps now we’d be researching Internet monitoring charts to see if the possible root cause can be confirmed.
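To round the walkthrough off, here is a hedged sketch of how this testing step could be mechanised against the specification. The question template mirrors the wording above, but the code itself is only an illustration and not part of the KT-PA method:

```python
# A cut-down problem specification in the same shape as the earlier sketch.
problem_spec = {
    "WHERE": {"is": "Users in London", "is_not": "Users in New York"},
    "WHEN":  {"is": "First noticed in August 2012", "is_not": "Reported before 30th July"},
}

candidate_causes = [
    "LAN connection issue with the switch the application server is connected to",
    "The new anti-virus installation rolled out across the company in August",
    "Internet bandwidth in the London office is saturated",
]

def test_questions(cause: str, spec: dict):
    """Yield one test question per problem dimension for a candidate cause."""
    for dimension, facts in spec.items():
        yield (f"If '{cause}' is the true cause, how does it explain why the problem "
               f"IS '{facts['is']}' but IS NOT '{facts['is_not']}'? ({dimension})")

for cause in candidate_causes:
    for question in test_questions(cause, problem_spec):
        print(question)
```

A cause that cannot answer one of these questions, like the first two above, drops down the list of probable causes; one that can answer them all is worth confirming with evidence.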

The New Rational Manager

You might find it hard to believe that I’d recommend a book published in 1965 as one of the most relevant Problem Management books I’ve read.

But I’m recommending it anyway.

The New Rational Manager, written by Charles H. Kepner and Benjamin B. Tregoe, is a must-read for anyone who needs to solve problems, be they manufacturing, industrial, business or Information Technology problems.

It explains the process above in a readable way with great examples. I think the word “computer” is mentioned once – this is not a book about modern technology – but it teaches the reader a process that can be applied to complex IT problems.

In Summary

Problem Management and troubleshooting are critical skills in ITSM and Infrastructure and Operations roles. Many talented troubleshooters make their reputation by applying creative, technical knowledge to a problem and finding the root cause.

Your challenge is harnessing that creativity into a process to make their success repeatable in your organisation and to reduce the risk of fixing the wrong root cause.

Rob England: Incident Management at Cherry Valley, Illinois

It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.

An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h, pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.

In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.

Today we will be focused on the rail traffic controller (RTC) who was the on-duty train dispatcher at the CN‘s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.

While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.

Back in an earlier article we talked about the services of a railroad. One of the major services is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”, such as giant ethanol trains, and the “hotshot” trans-continental double-stack container trains with nine locomotives that get rail-fans like me all excited. Some of the goods carried are perishables: fruit and vegetables from California, stock and meat from the midwest, all flowing east to the population centres of the USA.

The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.

A number of RTCs and other staff in Homewood switched into familiar routines:

  • The US rail network is complex – a true network. Trains were scheduled to alternate routes, and traffic on those routes was closed up as tightly bunched together as the rules allowed to create extra capacity.
  • Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
  • Customer relations staff called clients to negotiate new delivery times.
  • Traffic managers searched rail yard inventories for alternate stocks of ethanol that could be delivered early.
  • Crew managers told crews to pick up their trains in new locations and organised transport to get them there.

Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.

This is Incident Management.

None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.

At least I don’t think it is. I think they had a problem.

An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.

To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.

Review: itSMF Continual Service Improvement SIG

Like many who work in ITSM, I am of course aware of the need for, and the importance of Continual Service Improvement throughout the Service Management Lifecycle.

But what does it entail in real terms, and not just what I read on the ITIL course/in the books?

I came along to the itSMF CSI SIG, held in London, to find out.

CSI in a nutshell

The purpose of CSI is to constantly look at ways of improving service, process and cost effectiveness.

It is simply not enough to drop in an ITSM tool to “fix” business issues (backed up, of course, with reasonable processes) and then walk away thinking: “Job well done.”

Business needs and IT services constantly evolve and change. CSI supports the whole lifecycle and is iterative – the review and analysis process should be a continual focus.

Reality

CSI is often aspired to and talked about in initial workshops, but it gets swallowed up in the push to configure and push out a tool and to tweak and force in processes, and all too often it gets relegated to almost “nice to have” status.

A common question one sees in LinkedIn groups is:

“Why do ITIL Implementations fail?”

A lack of commitment to CSI is often the reason, and this session looked to try and identify why that might be.

Interactive

I have never been to a SIG before, and it was very clear from the outset that we were not going to be talked at, nor would we quite be doing the speed-dating networking element from my last regional venture.

SIG chair Jane Humphries started us off by introducing the concept of a wall with inhibitors.  The idea was that we would each write down two or three things on post-it notes for use in the “Speakers Corner” segment later in the day.

What I liked about this, though, was that Jane has used this approach before, showing us a wall-graphic with inhibitors captured and written on little bricks, to be tackled and knocked down in projects.

Simple but powerful, and worth remembering for workshops, and it is always worth seeing what people in the community do in practice.

Advocates, Assassins, Cynics and Supporters

The majority of the sessions focussed on the characteristics of these types of potential stakeholders – how to recognise them, how to work with them, and how to prioritise project elements accordingly.

The first two breakout sessions split the room into four groups, to discuss these roles and the types of people we probably all have had to deal with in projects.

There was, of course, the predictable amusement around the characteristics of Cynics – they have been there and seen it all before, as indeed a lot of us had, around the room.

But what surprised me was a common factor in terms of managing these characteristics: What’s in it for me? (WIIFM)

Even for Supporters and Advocates, who are typically your champions, there is a delicate balancing act to stop them from going over to the “dark side” and becoming Cynics, or worse, Assassins of your initiative.

The exercises which looked at the characteristics, and how to work with them, proved to be the easiest.

Areas to improve

What didn’t work so well was a prioritisation and point-scoring exercise which just seemed to confuse everyone.

For our group we struggled to understand if the aim was to deliver quick wins for lower gains, or go for more complex outcomes with more complex stakeholder management.

Things made a little more sense when we were guided along in the resulting wash-up session.

The final element to the day was a take on the concept of “Speakers’ Corner” – the idea being that two or three of the Post-It inhibitors would be discussed.  The room was re-arranged with a single chair in the middle and whoever had written the chosen topic would start the debate.

To add to the debate, a new speaker would have to take the chair in the centre.

While starting the debate topics was not an issue, the hopping in and out of the chair proved hard to maintain, but the facilitators were happy to be flexible and let people add to the debate from where they sat.

Does Interactive work?

Yes and no.

I imagined that most people would come along and attend a Special Interest Group because they are just that – Interested!

But participating in group sessions and possibly presenting to the room at large may not be to everyone’s liking.

I have to admit, I find presenting daunting enough in projects where I am established.  So to have to act as scribe, and then bite the bullet and present to a huge room of people is not a comfortable experience for me, even after twenty years in the industry.

But you get out of these sessions what you put in, so I took my turn to scribe and present.  And given the difficulties we had, as a group, understanding the objectives of the third breakout session, I was pleased I had my turn.

The irony is Continual Service Improvement needs people to challenge and constantly manage expectations and characters in order to be successful.  It is not a discipline that lends itself to shy retiring wallflowers.

If people are going to spend a day away from work to attend a SIG, then I think it makes sense for them to try and get as much out of it as they can.

Perhaps my message to the more shy members in the room who hardly contributed at all is to remember that everyone is there to help each other learn from collective experience.  No-one is there to judge or to act as an Assassin/Cynic so make the most of the event and participate.

For example, in Speakers’ Corner, the debate flowed and people engaged with each other, even if the chair hopping didn’t quite work, but acknowledgement also needs to go to the SIG team, who facilitated the day’s activities very well.

I have attended three events now, a UK event, a Regional Seminar and a SIG and this was by far the most enjoyable and informative so far.

A side note: Am I the only one that hears CSI and thinks of crime labs doing imaginative things to solve murders in Las Vegas, Miami, and New York?  No?  Just me then.

Supplier Relationship Management – An emerging capability in the ITSM toolbox

"Development opportunities can be completely missed because the two organizations have not properly explored how to grow together, indeed contractor enthusiasm may be misinterpreted as land grabbing."

Paul Mallory is VP Europe and Africa for the IACCM, with responsibility for member services, training and certification and research.

The recent article on the role that SLAs play in the relationship style between two organizations made me think. For some relationships, SLAs replace or even reduce the effort that an organization puts into managing the strategic development of opportunities between client and contractor.

If the contractor is seen to be achieving their SLAs then they are considered to be doing their job effectively.  If they are missing their SLAs then there is a large focus on understanding why they have failed and potentially much discussion around any mitigating circumstances the contractor puts forward.

SLAs definitely have their place as they allow the client and contractor to look at service development and continuous performance improvement through stretch targets based on the existing contractual agreement.

However, supplier relationship management (SRM) comes into effect when you want to truly transform the way in which you work with your suppliers.

So first up, how should we define SRM?

In our recently launched SRM training course, the IACCM defines SRM as:

“The function that seeks to develop successful, collaborative relationships with key suppliers for the delivery of significant tangible business benefits for both parties”.

Why is SRM important?

The average tenure of a CIO is about 4.5 years.  Most IT Service Management contracts (be they for any of the ITIL disciplines, applications or data centre outsourcing) run for between 5 and 10 years, with public sector contracts often reaching and even exceeding the upper end of that range.  It follows that the people who were in place at the outset, developing the IT strategy, may not be there further down the contract lifecycle, yet the contractual relationship continues to exist and needs the right management practices to bring the expected benefit to both sides.

Furthermore, with the cost of IT services re-procurement often being around 30% of the annual contract cost (once transition, exit and procurement time has been taken into consideration), implementing a successful contract extension becomes a financial KPI.

Keld Jensen, Chairman of the Centre for Negotiation at the Copenhagen Business School, has identified that 42% of contract value is left on the table, not addressed or even recognised by either party during the initial negotiations.  This means that in ITSM contracts there is great opportunity for both parties to access that 42% once the negotiation and procurement teams have left the room.  The supplier relationship manager is part of the mechanism to enable that.

What does a Supplier Relationship Manager do?

First, we must remember that IT Service contracts can incorporate a number of inter-related disciplines (especially if we take the ITIL view).  Each of those teams is going to be heavily focussed on their immediate needs and how their portion of the supply chain is delivering to them.  They will also be interested in where there are process hand-offs, but my experience has shown me that there is often no single, joined-up view across IT disciplines.

If this is the case, a weak supplier will maximise this to their advantage, especially where they are delivering many facets of ITSM.  They will minimise exposure to their service shortcomings and keep their network of relationships separate and distinct.  A good supplier though may just accept the frustration of dealing with a discordant IT department and focus its development opportunities on its other customers, the “customers of choice”.

In both scenarios the client does not access the 42% of value that Keld Jensen discusses, whether through fire-fighting performance issues or through an inability to properly interact with the supplier for lack of a focal point.  This is where the supplier relationship manager steps in, because they are there to:

  • Manage all aspects of the inter-company relationship, especially where the supplier’s remit goes beyond ITSM
  • Build trust through open communications, both internally and with the supply chain
  • Understand the full capability of the supply chain and seek to develop successful, collaborative relationships with key, strategic suppliers
  • Share company strategy, mission and values with the suppliers
  • Ensure that the relationship follows appropriate governance requirements
  • Have ready access to, and influence from, the top levels of management

By understanding not only the strategy of IT but also of the company as a whole, they are in a position to create a collaborative relationship with the strategic suppliers where mutual win-win opportunities are developed and encouraged.  Innovation can be targeted at the right teams, process efficiency can be realised and cross fertilisation of ideas can occur between teams who may not have realised they were working towards similar outcomes.

The supplier is turned into a strategic asset that can positively affect your organisation’s success, rather than an entity whose invoices are paid each month if service targets were met.

Through the conversations that the IACCM is having with its membership, we see that SRM is an emerging discipline which is becoming more important in these times of austerity.  There needs to be value for every pound, dollar or euro an organisation spends.  Effective SRM is there to ensure that value is realised.

Request Fulfilment in ITIL 2011

"ITIL 2011 sees a hefty revision for the Request Fulfilment process."

What is it?

The ITIL® Request Fulfilment process exists to fulfil Service Requests – for the most part minor changes or requests for information.

Request Fulfilment landed on us in ITIL v3, when a clear distinction was made between service interruptions (Incidents) and requests from users (Service Requests, for example password resets).

And what does ITIL 2011 give us?

ITIL 2011 sees a hefty revision for the Request Fulfilment process.  There are more detailed sub-processes involved with steps broken down logically.

Now, I like a good diagram, and finally Request Fulfilment gets a decent flow; most importantly, the interfaces to the other lifecycle stages are included in a lot more detail.

Perhaps the most impressive thing is the far greater detail in the section on Critical Success Factors (CSFs) and Key Performance Indicators (KPIs).  Having experienced the hilarity of over-complex metric definitions, this is a good starter for ten straight off the bat, and of course it can be extended to suit an organisation’s needs.

But what does all this REALLY mean?

It means nothing if the best practices cannot be applied and adapted into real life.

  • Now we all know that at the back of a Service Request is a process that will step through authorisation, any interfaces to other processes etc., but the business value is to provide a quick and easy way for end users to get new services.
  • A mechanism to reduce costs through centralising functions.
  • Understand what other stages of the lifecycles are needed alongside Request Fulfilment – this does not happen in glorious isolation.

Is there such a magic bullet?

The simple answer?  NO!

But there are a few things that should be taken into consideration when looking at implementing Request Fulfilment (often as part of an integrated solution).

Let’s look at the easy stuff first:

  • Look at starting nice and easily with simple Request Models that will occur often and can be met with a consistently repeatable solution (see the sketch after this list).
  • Look at what kind of options you are going to put in front of the user.  Most people are now familiar with the shopping basket approach from the internet, so offer them a familiar interface with as many options pre-defined as possible.
  • Make sure that the different stages of the request can be tracked – the purpose is two-fold:
    • End users don’t get (as) ratty
    • Reporting and routing can be made simpler and more accurate with meaningful status definitions
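As a rough sketch of what a pre-defined Request Model with trackable stages might look like (the field names, options and statuses below are invented for illustration, not taken from ITIL or any particular tool):

```python
# A hypothetical Request Model definition. Everything here is illustrative:
# the point is that options and statuses are decided up front, so fulfilment
# is repeatable and progress is always reportable.
new_starter_laptop = {
    "name": "New starter laptop",
    "options": ["13-inch", "15-inch"],     # pre-defined choices offered to the user
    "approval_required": True,             # keep it to a simple yes/no at this stage
    "fulfilment_group": "Desktop Support",
    "statuses": [                          # meaningful, trackable stages
        "Submitted", "Awaiting approval", "Approved",
        "In fulfilment", "Fulfilled", "Closed",
    ],
}

def next_status(model: dict, current: str):
    """Return the next stage in the request lifecycle, or None once closed."""
    stages = model["statuses"]
    index = stages.index(current)
    return stages[index + 1] if index + 1 < len(stages) else None

print(next_status(new_starter_laptop, "Approved"))  # -> "In fulfilment"
```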

Getting the hang of this…

  • Give some thought to how you want to prioritise and escalate requests depending on their complexity to fulfil, and again pre-define where possible.

Let’s do the whole shebang…

  • Eventually there will be a need to include financial approval(s) which in turn means sticky things like deputies and budget limits
  • There may also be external interactions with fulfilment groups dealing with procurement

Back up a second – who now?

  • Give some thought to which groups are going to be involved.  In my experience it is sometimes easier to work backwards, from the outcome to the selection and fill out all the bits you need in between.
  • Easy stuff is most likely taken care of by a single, often centralised group – typically the Service Desk, or in some cases specific co-ordinators who work at that Level One tier.
  • Decide if your existing resolver groups are appropriate for some fulfilment tasks or where you need specialised groups and build your workflows to suit.  Typically the first-line support group handling the request always has the ability to track the progress of the request, and is the point of contact for the end users.

Is that it?

  • Whether your request is a simple “How do I…?” or a “Hand-craft me a personally engraved and gift-wrapped iPad”, it needs a defined closure procedure.  There has to be a mechanism to validate that the request has been fulfilled satisfactorily before it is closed.

How do we go about deciding what works and what doesn’t?

There is something I will state, use and promote constantly, and that is the use of scenarios.  These are invaluable whether you are testing a deployment, performing user-acceptance testing with a client, or whether you are just evaluating products.

  • Decide on what criteria you need to establish your end goal
  • Break them down to manageable steps, and here the ITIL 2011 activities and points are very nicely presented to give a starter for ten
  • For a product review, for example, look at how easy it is to configure – can I do this myself using demos on the web, or do I need a proper on-site demo or webinar with a tool administrator?
  • As an aside, what kind of administrative skill is required for your tool of choice?

This is a doddle, no?

A number of things can kill an otherwise promising and/or straightforward deployment:

  • Poorly defined scope – People wanting the process to do too much or not really grasping the idea that Service Request models should be pre-definable, and consistently repeatable.
  • Poorly Designed User Interfaces – The best back end workflows in the world will not help you if the user interface makes no sense to an end user.  Too often I have banged my head against a desk with developers who love how THEY understand what is being asked, so who cares if some desk jockey can’t – they can ring the help desk, right?  WRONG!  That misses the entire point of the business benefit: removing the need to drive everything through one-to-one service desk interaction.
  • What is worse than a front end you need a degree in programming to work through?  A haphazard back end workflow that twists and turns like a snake with a stomach upset.  Just keep it simple.  Once it starts to get super-complex, ask yourself whether this is really a minor request or something that requires specific change planning.
  • Make sure your tool of choice is capable of measuring meaningful metrics.  Remember, there are lies, damned lies and statistics.  What are you looking to improve, why, what is the benefit, and what can it lead to in terms of Continual Service Improvement?

There are, of course, interactions that I haven’t gone into in any great detail in this article; Rob England has already touched on this in What is a Service Catalogue? here on The ITSM Review.

7 Benefits of Using a Known Error Database (KEDB)

KEDB - a repository that describes all of the conditions in your IT systems that might result in an incident for your customers.

I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?

The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.

Most of all because it makes the job of the Servicedesk easier.

It is true to say that an effective KEDB can both increase the quality of Incident resolution and decrease the time it takes.

The Aim of Problem Management and the Definition of “The System”

One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.

Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.

When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:

  • Hardware components
  • Software components
  • Networks, connectivity, VPN
  • Services – in-house and outsourced
  • Policies, procedures and governance
  • Security controls
  • Documentation and Training materials

Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation would cause an Incident. A user may rely on this documentation and make assumptions on how to use a service, discover they can’t and contact the Servicedesk.

This documentation component has caused an Incident and would be considered the root cause of the Problem.

Where the KEDB fits into the Problem Management process

The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.

As users report issues, support engineers would follow the normal steps in the Incident Management process: logging, categorisation, prioritisation. Soon after that they should be on the hunt for a resolution for the user.

This is where the KEDB steps in.

The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.

The “Known Error”

The Known Error is a description of the Problem as seen from the user’s point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshots of error messages, as well as the text of the message, to aid searching. We should also include accurate descriptions of the conditions that users have experienced. These are the types of things we should be describing in the Known Error field. A good example of a Known Error would be:

When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.

The error message reads “Javascript exception at line 123”

The Known Error should be written in terms reflecting the customer’s experience of the Problem.

The “Workaround”

The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:

To work around this issue, add the timesheet application to the list of Trusted sites:

1. Open Internet Explorer
2. Tools > Options > Security Settings [ etc etc ]

The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Servicedesk should take to help the user, has multiple benefits – some more obvious than others.

Seven Benefits of Using a Known Error Database (KEDB)

  1. Faster restoration of service to the user – The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience that the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error which makes the Problem easy to find also means that the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the users issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
  2. Repeatable Workarounds – Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk for the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
  3. Avoid Re-work – Without a KEDB we might find that engineers are often spending time and energy trying to find a resolution for the same issue. This is likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a user’s issue, only to be told “Yes, I fixed this for someone else last week!”? Would you have preferred to find that information in an easier way?
  4. Avoid skill gaps – Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team who are all experts in every functional area, and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional: you might see a centralised application support function in a head office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer-facing issues.
  5. Avoid dangerous or unauthorised Workarounds – We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers, asked how they fixed issues and internally winced at the methods they used. Disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. Published Workarounds can help eliminate dangerous ones.
  6. Avoid unnecessary transfer of Incidents – A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else’s queue of work, often without enough detailed context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
  7. Get insights into the relative severity of Problems – Well-written Known Errors make it easier to associate new Incidents with existing Problems. Firstly this avoids duplicate logging of Problems. Secondly it gives better metrics about how severe the Problem is. Consider two Problems in your system: a condition that affects a network switch and causes it to crash once every 6 months, and a transactional database that is running slowly and adding 5 seconds to timesheet entry. You would expect the first Problem to be given a high priority and the second a lower one. It stands to reason that a network outage on a core switch is more urgent than a slowly running timesheet system. But which would cause more Incidents over time? You might be associating 5 new Incidents per month with the timesheet problem whereas the switch only causes issues irregularly. Being able to quickly associate Incidents with existing Problems allows you to judge the relative impact of each one.

The KEDB implementation

Technically, when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least, a decent implementation would have it set up that way.

There is a one-to-one mapping between Known Error and Problem so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc) also holds the data you need for the KEDB.

It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in separate locations – but my own preference is to keep it all together.

Known Error and Workaround are both attributes of a Problem
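As a minimal sketch of that preference (the field names are mine, not from any particular ITSM tool), the Known Error and Workaround simply live on the Problem record and the “database” is a search over those records:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Problem:
    """Illustrative Problem record: the KEDB fields are just attributes of the Problem."""
    number: str
    assignment_group: str
    known_error: Optional[str] = None   # customer-facing description, used as the search key
    workaround: Optional[str] = None    # approved steps the Servicedesk can follow
    work_notes: List[str] = field(default_factory=list)

def search_kedb(problems: List[Problem], term: str) -> List[Problem]:
    """Return Problems whose Known Error text matches the search term."""
    term = term.lower()
    return [p for p in problems if p.known_error and term in p.known_error.lower()]

kedb = [Problem("PRB0001", "Application Support",
                known_error="Timesheet application in IE6 shows 'Javascript exception at line 123' on submit",
                workaround="Add the timesheet application to the list of Trusted sites")]

print([p.number for p in search_kedb(kedb, "javascript exception")])  # -> ['PRB0001']
```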

Is the KEDB the same as the Knowledge Base?

This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.

I would argue that although your implementation of the KEDB might store its data in the Knowledgebase, they are separate entities.

Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.

The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.

Knowledgebase articles have a more permanent use. Although they too might be retired (if they refer to an application due to be decommissioned, for example), they don’t have the same lifecycle as a Known Error record.

Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.

There is benefit in using the Knowledgebase as a repository for Known Error articles, however. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of your implementation, and typically your Knowledge tools will have good authoring, linking and commenting capabilities.

What if there is no Workaround?

Sometimes there just won’t be a suitable Workaround to provide to customers.

I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.

It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is normally a binary state – you either have adequate power or you don’t.

A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.

If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.

Another example would be the February 29th Microsoft Azure outage. I’m sure a lot of customers experienced a Problem using many different definitions of the word but didn’t have a viable alternative for their users.

In this case there is still value to be found in the Known Error Database. If there really is no known workaround it is still worth publishing to the KEDB.

Firstly, it aids in associating new Incidents with the Problem (using the Known Error as a search key) and it stops engineers wasting time searching for an answer that doesn’t exist.

You could also avoid engineers trying to implement potentially damaging workarounds by publishing the fact that the correct action to take is to wait for the root cause of the Problem to be resolved.

Lastly with a lot of Problems in our system we might struggle to prioritise our backlog. Having the Known Error published to help routing new Incidents to the right Problem will bring the benefit of being able to prioritise your most impactful issues.

A user’s Known Error profile

With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.

Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.

If we understand our users’ environments through systems such as the Configuration Management System (CMS) or Asset Management processes, we should be able to determine a user’s exposure to Known Errors.

For example, when a user phones the Servicedesk complaining of an interruption to service, we should be able to quickly learn about her configuration: where she is geographically, which services she connects to, and her personal hardware and software environment.

With this information, and some Configuration Item matching, the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
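Here is a hedged sketch of that idea, assuming a very simple CMS lookup; the data model and the Configuration Item names are invented for illustration:

```python
# Hypothetical KEDB entries recording which Configuration Items each Known Error affects,
# and a user's CIs as they might be recorded in the CMS.
known_errors = {
    "PRB0001": {"summary": "Timesheet submit fails in IE6",
                "affected_cis": {"timesheet-app", "ie6"}},
    "PRB0002": {"summary": "Intermittent crash on branch switch",
                "affected_cis": {"switch-london-3"}},
}

def known_error_profile(user_cis: set, kedb: dict) -> list:
    """Return the Known Errors this user is exposed to, based on CI overlap."""
    return [number for number, ke in kedb.items() if ke["affected_cis"] & user_cis]

london_user_cis = {"timesheet-app", "ie6", "switch-london-3"}
print(known_error_profile(london_user_cis, known_errors))  # -> ['PRB0001', 'PRB0002']
```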

Measuring the effectiveness of the KEDB

As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.

Here are some metrics that would help give your KEDB a health check.

Number of Problems opened with a Known Error

Of all the Problem records opened in the last X days how many have published Known Error records?

We should be striving to create as many high quality Known Errors as possible.

The value of a published Known Error is that Incidents can be easily associated with Problems avoiding duplication.

Number of Problems opened with a Workaround

How many Problems have a documented Workaround?

The Workaround allows for the customer Incident to be resolved quickly and using an approved method.

Number of Incidents resolved by a Workaround

How many Incidents are resolved using a documented Workaround. This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.

Number of Incidents resolved without a Workaround or Knowledge

Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge?

If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?

Are there gaps in our Knowledge Management, meaning that customers are contacting the Servicedesk and we don’t have an answer readily available?

A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
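Here is a minimal sketch of how these health-check numbers could be pulled together, assuming you can export Problem and Incident records from your ITSM tool; the field names are invented:

```python
# Minimal stand-ins for exported Problem and Incident records.
problems = [
    {"number": "PRB0001", "has_known_error": True,  "has_workaround": True},
    {"number": "PRB0002", "has_known_error": True,  "has_workaround": False},
    {"number": "PRB0003", "has_known_error": False, "has_workaround": False},
]
incidents = [
    {"number": "INC1001", "resolved_by_workaround": True},
    {"number": "INC1002", "resolved_by_workaround": False},
    {"number": "INC1003", "resolved_by_workaround": True},
]

def pct(part: int, whole: int) -> float:
    """Percentage helper, safe against an empty export."""
    return round(100 * part / whole, 1) if whole else 0.0

print("Problems with a Known Error:", pct(sum(p["has_known_error"] for p in problems), len(problems)), "%")
print("Problems with a Workaround:", pct(sum(p["has_workaround"] for p in problems), len(problems)), "%")
print("Incidents resolved by a Workaround:", pct(sum(i["resolved_by_workaround"] for i in incidents), len(incidents)), "%")
print("Incidents resolved without one:", pct(sum(not i["resolved_by_workaround"] for i in incidents), len(incidents)), "%")
```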

OLAs

We want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents with existing Problems.

One method of measuring how quickly we are publishing Known Errors is to use Operational Level Agreements (or SLAs if your ITSM tool doesn’t define OLAs).

We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.

You could consider tracking “Time to generate Known Error” and “Time to generate Workaround” as performance metrics for your KEDB process.
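As a rough illustration of that measurement (the timestamps and the five-day target below are made up):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps taken from a Problem record.
problem_opened        = datetime(2012, 8, 1, 9, 0)
known_error_published = datetime(2012, 8, 3, 16, 30)
ola_target = timedelta(days=5)   # e.g. "publish a Known Error within 5 days of opening the Problem"

time_to_known_error = known_error_published - problem_opened
print("Time to generate Known Error:", time_to_known_error)
print("Within OLA target:", time_to_known_error <= ola_target)
```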

Additionally, we could measure how quickly Workarounds are researched, tested and published. If there is no known Workaround, that is still valuable information for the Servicedesk, as it eliminates the effort of trying to find one, so an OLA would be appropriate here too.

In summary

A well-maintained KEDB gives the Servicedesk faster, safer and more repeatable resolutions, and gives Problem Management better insight into which root causes are hurting users the most.

Free Access to Ten Gartner ITSM Research Papers

Cracks in the Paywall?

It appears Gartner are banging the new business drum and are offering 90-day trial access to their online research.

I’m not a Gartner client so I took advantage of this offer to gain access to the following:

  • How to Assess ITIL Effectiveness With a Balanced Scorecard Strategy Map ~ Tapati Bandopadhyay & Patricia Adams
  • IT Metrics: New Economic Rules of IT Spending and Staff Metrics ~ Kurt Potter
  • Use the Balanced Scorecard and Strategy Map to Link ITIL initiatives with Business Value ~ Tapati Bandopadhyay & Patricia Adams
  • How to Influence the Collective Mind to Adopt ITIL for IT Service Management ~ Tapati Bandopadhyay & Patricia Adams
  • Service Management, ITIL and the Process-Optimizing IT Delivery Model ~ Colleen M. Young
  • ITIL and Process Improvement Key Initiative Overview ~ David M. Coyle
  • How to Identify Key Pain Processes to Prioritize ITIL Adoption ~ Tapati Bandopadhyay
  • How to Create an I&O Cost Optimization Process and Improve Financial Performance ~ John Rivard
  • Best Practices for Supporting ‘Bring Your Own Mobile Devices’ ~ Nick Jones, Leif-Olof Wallin
  • IT Budgeting: Fundamentals ~ Michael Smith

See also: Free Access to Ten Gartner ITAM Research Papers

The offer includes ten research papers. It is not immediately obvious which papers are paid for and which are free to access, but if you click on the search preferences you can filter out any paid content.

How to Provide Support for VIPs

One of the outcomes of IT Service Management is the regulation, consistency and predictability in the delivery of services.

I remember working in IT before Service Management was adopted by our organisation and realising that we would over-service some customers and under-service others. Not intentionally, but we didn’t have a way of regulating our work and making our output predictable.

Our method of work delivery seemed to be somewhere between “First come first served” and “She who shouts loudest shall get the best service”. Not the best way to manage service delivery.

Chris York tweeted an interesting message recently;

It’s a great topic to talk about and one that I remember having to deal with personally in previous jobs.

I have two different views on VIP treatment – I think it’s a complex subject and I’d love to know your thoughts in the comments below.

The Purist

Firstly IT Service Management is supposed to define exactly how services will be delivered to an organisation. The service definition includes the cost, warranty and utility that is to be provided.

Secondly, there is a difference between the Customer of the service and the User of the service. The Customer is characterised as the people that pay for the service. They also define and agree the service levels.

Users are characterised as individuals that use the service.

There are loads of great analogies to reinforce this point – from outsourced local government services (the local government is the customer, the local resident is the user) to restaurants and airports. The IT Skeptic has a good discussion on the subject.

It’s also true to say that the Customer might not also be a user of the service, although in organisations I’ve worked in it is usually so.

This presents an interesting dilemma for both the Provider and the Customer. Should the Customer expect more from the service than they originally negotiated with the Provider? I think the most common area where this dilemma occurs is end-user services – desktop support.

The people that would “sign on the dotted line” for the IT Services we used to provide would be Finance Directors, IT Directors, CFOs or CIOs: very senior people with responsibility for the cost of their services and for making sure the company gets a good deal.

Should we be surprised when senior people that ultimately pay for the service expect preferential treatment? No – but we should remind them of the service warranty that they agreed would be supplied.

Over-servicing VIPs has to be at the cost of someone else – and by artificially raising the quality of service for a few people we risk degrading the service for everyone.

The Pragmatist

The reality is that IT Service Management is a people business and a perception business, especially end-user services.

People call the Service desk when they want something (a Request) or they need help (an Incident). Both of these are quite emotional human states.

The performance and usability of someone’s IT equipment is fundamental to their own productivity and their own success. It feels very personal when the equipment you rely on stops functioning.

Although we can gather SLA and performance statistics for our stakeholder meetings, we have the problem that we are often seen as only being as good as our last experience with that individual person. It shouldn’t be this way – but it is.

I’ve been to meetings full of good news about the previous month’s service only to be ripped to pieces for a request submitted by the CEO that wasn’t actioned. I’ve been to meetings after a period of generally poor service and had good reviews because the Customer had a (luckily) excellent experience with the Service desk.

Much as we don’t like it, prioritising VIP support has an overall positive effect when we do it.

The middle ground (or “How I’ve seen it done before”)

If you don’t like the Pragmatist view above there are ways to come to a compromise. Stephen Mann touched on an idea I have seen before:

Deciding business criticality is obviously a challenge.

In my previous role, in the advertising world, the most important people in an agency are the Creatives.

These guys churn out graphical and video content and work on billable hours. When their equipment fails the clock is ticking to get them back up and running again.

So calculating the financial cost of an individual’s downtime and assigning a role accordingly is one method of designating those who can expect prioritised support.

As a Service Provider in that last role our customer base grew and our list of VIPs got longer. We eventually allocated 5% of each company’s headcount to have “VIP” status in our ITSM tool.
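
To make the arithmetic concrete, here is a minimal sketch of the two ideas above – costing an individual’s downtime by role, and capping VIP flags at 5% of each customer’s headcount. The roles, rates and figures are illustrative assumptions rather than values from any real contract or ITSM tool.

```python
# A minimal sketch: costing downtime by role and capping VIP slots at a
# percentage of headcount. All names and figures are illustrative assumptions.
from math import ceil

# Hypothetical billable rates per hour by role
BILLABLE_RATE_PER_HOUR = {
    "creative": 150.0,
    "account_manager": 90.0,
    "back_office": 40.0,
}

def downtime_cost(role: str, hours_down: float) -> float:
    """Rough financial impact of one person's downtime."""
    return BILLABLE_RATE_PER_HOUR.get(role, 0.0) * hours_down

def vip_allowance(headcount: int, ratio: float = 0.05) -> int:
    """Number of VIP slots a customer may allocate (5% of headcount, rounded up)."""
    return ceil(headcount * ratio)

print(downtime_cost("creative", hours_down=4))  # 600.0 in billable time at risk
print(vip_allowance(headcount=230))             # 12 VIP slots
```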

I think there are ways to write VIP support into an IT Services contract that allows the provider to plan and scale their support to cater for it.

Lastly, we should talk about escalated Incidents. This is a more “formal” approach to Service Management (the Purist would be happy) where a higher level of service is allocated to resolving an Incident if it meets the criteria for being escalated.

When dealing with Users it is worth having a view of that person’s overall experience with the Service Provider. If a user already has one escalated Incident should she expect a better service when she calls with another? Perhaps so – the Pragmatist would see that although we file each Incident separately, her perception of the service is based on the overall experience. With our ITSM suite we use informational messages to guide engineers as to the overall status of a User.
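
As a rough illustration of those informational messages, the sketch below assumes a simple incident record structure (not any particular ITSM suite’s API) and builds a short note for the engineer when a caller already has escalated, or several, open Incidents.

```python
# Sketch: summarise a caller's open Incidents so the engineer sees the overall
# experience. The record structure is a stand-in, not a real ITSM suite API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Incident:
    ref: str
    state: str        # e.g. "New", "Work In Progress", "Resolved", "Closed"
    escalated: bool

def user_status_banner(open_incidents: List[Incident]) -> Optional[str]:
    """Return a short note for the engineer, or None if nothing needs flagging."""
    active = [i for i in open_incidents if i.state not in ("Resolved", "Closed")]
    escalated = [i.ref for i in active if i.escalated]
    if escalated:
        return (f"Caller already has {len(escalated)} escalated Incident(s) open "
                f"({', '.join(escalated)}) – consider prioritising this contact.")
    if len(active) >= 3:
        return f"Caller has {len(active)} open Incidents – overall experience may be poor."
    return None
```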


In summary…

I think everyone would agree that VIP support is a pain.

The Purist will have to deal with the fact that although he kept his service consistent regardless of the seniority of the caller, he might have to do some unnecessary justification at the next review meeting.

The Pragmatist will have to suffer an unexpected drain on her resources when the CEO’s laptop breaks and everything must be focussed on restoring that one user’s service.

Those occupying the middle ground will be controlling the number of VIPs by defining a percentage of headcount for the Customer to allocate. Hopefully the Customer will understand the business well enough to allocate them to the correct roles (and probably herself).

The Middle Ground will also be looking at a user’s overall experience and adjusting service to make sure that escalated issues are dealt with quickly.

No-one said IT Service Management was going to be easy!


A Great Free ITSM & ITAM Process Tool (via #Back2ITSM)

Cognizant Process Model

This is a very cool online tool for anyone in ITAM or ITSM.

COGNIZANT PROCESS MODEL

This great resource was kindly shared by Shane Carlson of Cognizant.

Shane is a founding member of the #Back2ITSM community, whereby ITSM professionals are encouraged to share their expertise for the benefit of others (and therefore develop the industry).

The process model includes the following modules:

  • Request Management
  • Incident Management
  • Event Management
  • Problem Management
  • Change Management
  • Configuration Management
  • Release Management
  • Service Level Management
  • Availability Management
  • Capacity Management
  • IT Service Continuity Management
  • Continuity Operations
  • Financial Management for IT Services
  • Asset Management
  • Service Catalog
  • Knowledge Management
  • Information Security Management
  • Security Operations
  • Access Management
  • Portfolio Management
  • Program and Project Management

Each module includes guidance on the following areas:

  • Process Diagram
  • Benefits
  • Controls
  • Goal
  • Metrics
  • Policies
  • Process Team
  • Resources
  • Roles
  • Scope
  • Specification

According to the blurb…

“PathFinder is specifically designed to those:

  • Tasked with designing an IT Process.
  • Seeking validation that a process has been validated in the industry.
  • Looking to increase effectiveness of their current process design.
  • Seeking assistance with the cultural adoption of their IT process.
  • Faced with meeting compliance regulations.”

VIEW THE COGNIZANT PROCESS MODEL

Thanks to Shane for sharing this great free resource.

Yes, free. No registration, no 30 day trial, no salesman will call. Enjoy! If you find it useful please share the link and don’t forget to mention #Back2ITSM.

Planning for Major Incidents

Do regular processes go out of the window during a Major Incident?

Recently I’ve been working on Incident Management, and specifically on Major Incident planning.

During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.

For example, in an organisation I worked in previously, we had a run of Storage Area Network outages. The first couple caused absolute mayhem, and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.

At the end of the Incident, once we’d restored the service, we found that we, perhaps unsurprisingly, had a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.

ITIL talks about Major Incident planning in a brief but fairly helpful way:

A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.

The Incident model, its categories and states (New > Work In Progress > Resolved > Closed) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.

What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.

Working with a Major Incident

When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.

Where a normal Incident will be handled by a single person (The Incident Owner) we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates and so on.

Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications this is bound to happen – so split those tasks.

If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!

What is a Major Incident?

It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. Or you might find that, without clear guidance, a team treats a server outage as Major one week (with excellent communications) but not the next (with poor communications).

Having this defined is an important step, but the definition will vary between organisations.

Roughly speaking, a generic definition of a Major Incident could include any of the following (a classification sketch follows the list):

  • An Incident affecting more than one user
  • An Incident affecting more than one business unit
  • An Incident on a device of a certain type – core switch, access router, Storage Area Network
  • Complete loss of a service, rather than degradation
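
Purely as a sketch, here is that generic definition expressed as a simple check. The thresholds and the list of critical device types are assumptions that each organisation would replace with its own.

```python
# Sketch: apply the rough Major Incident criteria above; any one is enough.
# The device list and thresholds are illustrative assumptions.
CRITICAL_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

def is_major_incident(affected_users: int,
                      affected_business_units: int,
                      device_type: str = "",
                      complete_loss_of_service: bool = False) -> bool:
    return (
        affected_users > 1
        or affected_business_units > 1
        or device_type.lower() in CRITICAL_DEVICE_TYPES
        or complete_loss_of_service
    )

# Example: a degraded service on a core switch still qualifies
assert is_major_incident(affected_users=1, affected_business_units=1,
                         device_type="Core switch")
```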

Is a P1 Incident a Major Incident?

No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.

Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
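
To illustrate the distinction, the sketch below uses a common 3×3 Impact × Urgency matrix. The exact mapping is an illustrative assumption rather than anything prescribed by ITIL; the point is that a single-user emergency can land at P1 without meeting the multi-user bar for a Major Incident.

```python
# Sketch: priority derived from Impact and Urgency via a common 3x3 matrix.
# The mapping below is an illustrative assumption (1 = highest for all values).
PRIORITY_MATRIX = {
    # (impact, urgency): priority
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}

def priority(impact: int, urgency: int) -> int:
    return PRIORITY_MATRIX[(impact, urgency)]

# A single-user outage can still be P1 (impact 1, urgency 1) without
# triggering the Major Incident plan.
print(priority(1, 1))  # -> 1
```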

Do I need a single Incident or multiple Incidents for logging a Major Incident?

This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each affected user when they contact the Service desk.

The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.

If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep, for example) whereas another customer isn’t too bothered because they use the affected service less frequently.

Having an Incident opened for each user/customer allows you to judge the severity of the Incident exactly. The challenge then becomes to manage those Incidents easily, and to be able to communicate consistently with your customers.
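
A rough sketch of that approach is below. The record classes are stand-ins for whatever your ITSM toolset provides; the point is that relative impact can be judged by counting the linked Incidents, and one update can be broadcast consistently to every affected user.

```python
# Sketch: one Incident per affected user, linked to a Major Incident record.
# The classes are stand-ins for an ITSM toolset, not a real API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChildIncident:
    ref: str
    user: str
    notes: List[str] = field(default_factory=list)

@dataclass
class MajorIncident:
    ref: str
    children: List[ChildIncident] = field(default_factory=list)

    def log_affected_user(self, ref: str, user: str) -> ChildIncident:
        child = ChildIncident(ref=ref, user=user)
        self.children.append(child)
        return child

    def impact(self) -> int:
        return len(self.children)    # relative severity of this occurrence

    def broadcast_update(self, message: str) -> None:
        for child in self.children:  # the same message goes to every affected user
            child.notes.append(message)

mi = MajorIncident(ref="MI0001")
mi.log_affected_user("INC1001", "user heading to a sales pitch")
mi.log_affected_user("INC1002", "user about to go on holiday")
mi.broadcast_update("SAN outage confirmed; estimated restoration 14:00.")
print(mi.impact())  # -> 2
```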

Is a Major Incident a Problem?

No, although if we don’t already have a Problem record open for this Major Incident I think we should probably raise one.

Remember the intended outcome of the Incident and Problem Management processes:

  • Incident Management: The outcome is a restoration of service for the users
  • Problem Management: The outcome is the identification and possibly removal of the causes of Incidents

The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources – the removal of the root cause, a documented Workaround, or possibly a Workaround we have to find ourselves.

While the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.

How can I measure my Major Incident Procedure?


I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.

  • Number of Incidents linked to a Major Incident: Where we are creating Incidents for each customer affected by a Major Incident, we should be able to measure the relative impact of each occurrence.
  • The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan.
  • Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, where they would hope to see Major Incidents happen less frequently (a rough calculation sketch follows this list).
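
As a rough illustration of the last metric, the sketch below calculates Mean Time Between Major Incidents by averaging the gaps between the timestamps at which Major Incidents were logged. The dates are made up.

```python
# Sketch: Mean Time Between Major Incidents from logged timestamps (illustrative data).
from datetime import datetime
from statistics import mean
from typing import List

def mean_time_between_major_incidents(logged_at: List[datetime]) -> float:
    """Average gap, in hours, between consecutive Major Incidents."""
    stamps = sorted(logged_at)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(stamps, stamps[1:])]
    return mean(gaps) if gaps else float("inf")

major_incidents = [
    datetime(2012, 5, 1, 9, 0),
    datetime(2012, 5, 14, 16, 30),
    datetime(2012, 6, 2, 11, 15),
]
print(round(mean_time_between_major_incidents(major_incidents), 1))  # hours between MIs
```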

There you go. In summary, handling Major Incidents isn’t a huge leap from the method that you use to handle day-to-day Incidents. It requires enhanced communication and possibly some additional measurement.

I hope that you found this article helpful.
