Pink14 Preview: What’s the big idea?

"Sometimes you're so busy putting out fires that you don't have time to improve fire-fighting or fire-safety"
“Sometimes you’re so busy putting out fires that you don’t have time to improve fire-fighting or fire-safety”

Do you ever get a Big Idea?  You’ll be talking or reading about ITSM and the proverbial light bulb comes on.  You see a connection or an underpinning concept that you hadn’t seen before.  Sometimes it appears to be an original insight, one you haven’t heard expressed exactly that way before.  And very occasionally it really is novel and it really is right: you subject it to the scrutiny of others and it stands up.

It happens to me.  Because I’m privileged to spend so much time interacting with some of the best minds in ITSM worldwide – and thinking and writing about what I learned in those discussions, and applying that knowledge as a consultant – it happens to me quite often, about once a year. In fact I will be presenting on some of these big ideas at the upcoming Pink Elephant IT Service Management Conference and Exhibition (PINK14).

Standard+Case

A couple of years ago my Big Idea was Standard+Case, a topic on which I will be running a half-day workshop at PINK14.

Standard+Case is a synthesis of our conventional “Standard” process-centric approach to responding, with Case management, a discipline well-known in industry sectors such as health, social work, law and policing.

The combination of Standard and Case concepts gives a complete description of ticket handling, for any sort of activity from Incidents to Changes.

  • Standard tickets are predefined because they deal with a known situation. They use a standard process to deal with that situation. They can be modelled by BPM, controlled by workflow, and improved by the likes of Lean IT and ITIL.

  • Case tickets present an unknown or unfamiliar situation. They rely on the knowledge, skills and professionalism of the person dealing with them. They are best dealt with by experts, being knowledge-driven and empowering the operator to decide on suitable approaches, tools, procedures and process fragments.

ITIL and Lean do fit this S+C paradigm, if you use them in the right situation: Standard responses. S+C extends them with better tools for non-Standard cases: Adaptive Case Management, Kanban, Knowledge Centered Support (KCS)… Better still, this S+C approach might let the ITIL and anti-ITIL camps live in peace and harmony at last.
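To make the split concrete, here is a minimal sketch of Standard+Case triage in Python. The ticket fields, the catalogue of standard models and the routing rule are all hypothetical, purely to illustrate the idea of sending known situations down a predefined workflow and everything else to an empowered expert.

```python
# Minimal sketch of Standard+Case triage (illustrative only).
# The ticket fields and the catalogue of standard models are hypothetical.
from dataclasses import dataclass
from typing import Optional

# Catalogue of known situations with a predefined ("Standard") response model.
STANDARD_MODELS = {
    "password-reset": "run password-reset workflow",
    "disk-full": "run disk-cleanup workflow",
}

@dataclass
class Ticket:
    summary: str
    matched_model: Optional[str] = None  # set by categorisation/matching

def route(ticket: Ticket) -> str:
    """Standard tickets follow a predefined process; anything unmatched
    is treated as a Case and handed to a knowledge worker."""
    if ticket.matched_model in STANDARD_MODELS:
        return f"STANDARD: {STANDARD_MODELS[ticket.matched_model]}"
    return "CASE: assign to an expert, who chooses the approach and tools"

print(route(Ticket("User cannot log in", matched_model="password-reset")))
print(route(Ticket("Intermittent data corruption in nightly batch")))
```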

Slow IT

Last year it was Slow IT.  Slow IT is a provocative name.  It doesn’t mean IT on a go-slow.  It means slowing down the pace of business demands on IT so as to focus better on what matters, and to reduce the risk to what already exists.  Think Slow Food and, more recently, Slow Business and mindfulness.

The intent of Slow IT is to allow IT to deliver important results more quickly.  It does this by concentrating on the interfaces between business executives and CIOs.  Slow IT highlights the importance of Governance of IT and of Service Portfolio in order to make the right decisions to do the right things in the right way at the right time, to maximise benefit and minimise risk.

Right now the pace of change in IT is approaching human limits.  Many IT shops are overwhelmed by change, drowning in projects.  More are overheating: working at lunatic pace because the IT community convinces us we have to.  Slow IT challenges the hysterias and fads of IT to ensure that these results are really needed as quickly as we think they are.  Slow IT is about trying to introduce more measured responses, to bring some sanity to the current dangerous madness that is organisational IT (you can read more on this here).

I’ll be presenting on Slow IT at PINK14.  In addition we’ll talk about my Meet-In-The-Middle strategy to address the Slow IT issues by offering a quid pro quo: Fast IT.  If the organisation will slow down the demands on IT, IT will have the breathing space to implement approaches to respond faster, such as Lean, Agile, DevOps, and good old CSI.  Right now too many IT teams are so flat out serving the business that they don’t have the bandwidth to introduce better methods properly.  It’s the old catch-22 of being so busy putting out fires that you can’t improve fire-fighting or fire-safety.  Slow IT takes off a bit of pressure, giving the team some headroom to make improvements.

I hope to see you at the Pink Elephant ITSM conference.  I’m honoured to be assembling some of those great ITSM minds at the Pink Think Tank, to address one of the biggest issues facing IT today: how to manage a multi-sourced IT value chain.  We’ll be looking to produce tangible actionable advice, so look out for the results.  I have a feeling it may be the catalyst for my next Big Idea.

What do YOU think the next “big idea” will be?


Find me at PINK14:

Rob England: Incident Management at Cherry Valley, Illinois

It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.

An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h., pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid-air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half, the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.

In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.

Today we will be focused on the rail traffic controller (RTC) who was the on-duty train dispatcher at the CN‘s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.

While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.

In an earlier article we talked about the services of a railroad. One of the major services is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”, such as the giant ethanol trains, and the “hotshot” trans-continental double-stack container trains with nine locomotives that get rail-fans like me all excited. Some of the goods carried are perishables: fruit and vegetables from California, stock and meat from the Midwest, all flowing east to the population centres of the USA.

The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.

A number of RTCs and other staff in Homewood switched into familiar routines:

  • The US rail network is complex – a true network. Trains were rescheduled onto alternate routes, and traffic on those routes was bunched together as tightly as the rules allowed to create extra capacity.
  • Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
  • Customer relations staff called clients to negotiate new delivery times.
  • Traffic managers searched rail yard inventories for alternate stock of ethanol that could be delivered early.
  • Crew managers told crews to pick up their trains in new locations and organised transport to get them there.

Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.

This is Incident Management.

None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.

At least I don’t think it is. I think they had a problem.

An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.

To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.


Planning for Major Incidents

Do regular processes go out of the window during a Major Incident?

Recently I’ve been working on Incident Management, and specifically on Major Incident planning.

During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.

For example, in an organisation I worked in previously, we had a run of Storage Area Network outages. The first couple caused absolute mayhem, and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.

At the end of the Incident, once we’d restored the service, we found that we had, perhaps unsurprisingly, a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.

ITIL talks about Major Incident planning in a brief but fairly helpful way:

A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.

The Incident model, its categories and states (New > Work In Progress > Resolved > Closed) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.
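As a minimal illustration that the same lifecycle serves for Major Incidents, here is a sketch of those states and their forward transitions in Python; the state names follow the model above, but the transition table is an assumption rather than any particular tool’s data model.

```python
# Sketch of the incident lifecycle (New > Work In Progress > Resolved > Closed).
# The transition table is illustrative, not a vendor's schema.
from enum import Enum

class IncidentState(Enum):
    NEW = "New"
    WORK_IN_PROGRESS = "Work In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"

# Allowed forward transitions; a Major Incident uses the same lifecycle.
ALLOWED = {
    IncidentState.NEW: {IncidentState.WORK_IN_PROGRESS},
    IncidentState.WORK_IN_PROGRESS: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.CLOSED},
    IncidentState.CLOSED: set(),
}

def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Move an incident to a new state, rejecting transitions the model forbids."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target

state = transition(IncidentState.NEW, IncidentState.WORK_IN_PROGRESS)
```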

What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.

Working with a Major Incident

When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.

Where a normal Incident will be handled by a single person (the Incident Owner), we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates, and so on.

Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications this is bound to happen – split those tasks.

If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!

What is a Major Incident?

It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. You might also find that, without clear guidance, a team treats a server outage as Major one week (with excellent communications) and not the next, with poor communications as a result.

Having this defined is an important step, but the definition will vary between organisations.

Roughly speaking, a generic definition of a Major Incident could be any of the following (a rough sketch follows the list):

  • An Incident affecting more than one user
  • An Incident affecting more than one business unit
  • An Incident on a device of a certain type – core switch, access router, Storage Area Network
  • Complete loss of a service, rather than degradation
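As a rough sketch only, that generic definition might translate into a check like this; the thresholds and the device list are placeholders that each organisation would set for itself.

```python
# Hypothetical Major Incident check; thresholds and device list are placeholders.
CRITICAL_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

def is_major_incident(users_affected: int,
                      business_units_affected: int,
                      device_type: str,
                      complete_loss_of_service: bool) -> bool:
    return (
        users_affected > 1
        or business_units_affected > 1
        or device_type.lower() in CRITICAL_DEVICE_TYPES
        or complete_loss_of_service
    )

is_major_incident(users_affected=40, business_units_affected=2,
                  device_type="core switch",
                  complete_loss_of_service=False)  # True
```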

Is a P1 Incident a Major Incident?

No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.

Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
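To illustrate the distinction, here is a hypothetical impact/urgency matrix in Python; the 3×3 scale and the labels are an assumption rather than anything mandated by ITIL, but they show how an Incident can come out as P1 without meeting the Major Incident definition above.

```python
# Illustrative priority matrix: priority is derived from impact and urgency.
# The 3x3 scale (1 = highest) and labels are assumptions for this sketch.
PRIORITY = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

def priority(impact: int, urgency: int) -> str:
    """Look up priority from impact and urgency."""
    return PRIORITY[(impact, urgency)]

# A single-user outage can still compute as P1, but that alone does not make
# it a Major Incident: it must also meet the definition above (e.g. multiple
# users affected, critical device, complete loss of service).
priority(1, 1)  # "P1"
```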

Do I need a single Incident or multiple Incidents for logging a Major Incident?

This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each affected user when they contact the Service Desk.

The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.

If you have a system of hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep, for example) whereas another customer isn’t too bothered because they use the affected service less frequently.

Having an Incident opened for each user/customer allows you to judge exactly the severity of the Incident. The challenge then becomes to manage those Incidents easily, and be able to communicate consistently with your customers.
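Here is a minimal sketch of that record-per-user approach; the field names are hypothetical rather than any specific ITSM suite’s schema.

```python
# Sketch of one child Incident per affected user, linked to a Major Incident.
# Field names are hypothetical, not a specific ITSM suite's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Incident:
    user: str
    impact_note: str          # e.g. "about to deliver a sales pitch"
    escalated: bool = False   # hierarchical escalation raised by that customer

@dataclass
class MajorIncident:
    summary: str
    linked_incidents: List[Incident] = field(default_factory=list)

    def severity(self) -> int:
        # One child Incident per affected user lets us judge the real severity.
        return len(self.linked_incidents)

mi = MajorIncident("SAN outage - shared storage unavailable")
mi.linked_incidents.append(Incident("alice", "sales pitch in one hour", escalated=True))
mi.linked_incidents.append(Incident("bob", "about to go on holiday for 2 weeks"))
mi.severity()  # 2
```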

Is a Major Incident a Problem?

No, although if we don’t already have a Problem record open for this Major Incident I think we should probably open one.

Remember the intended outcome of the Incident and Problem Management processes:

  • Incident Management: The outcome is a restoration of service for the users
  • Problem Management: The outcome is the identification and possibly removal of the causes of Incidents

The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources – the removal of the root cause, a documented Workaround, or a Workaround we have yet to find.

While the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.

How can I measure my Major Incident Procedure?


I have some metrics for measuring the Major Incident procedure, and I’d love to know your thoughts in the comments for this article (a rough calculation sketch follows the list).

  • Number of Incidents linked to a Major Incident: Where we create an Incident for each customer affected by a Major Incident, we should be able to measure the relative impact of each occurrence.
  • The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan
  • Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, which would hope to see Major Incidents happen less frequently over time.
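As a rough illustration, assuming each Major Incident record carries an opened timestamp and a count of linked child Incidents (both hypothetical fields), the three measures might be calculated like this:

```python
# Rough sketch of the three measures; the records below are hypothetical data.
from datetime import datetime
from statistics import mean

major_incidents = [
    {"opened": datetime(2014, 1, 6, 9, 0),    "linked_incidents": 42},
    {"opened": datetime(2014, 1, 20, 14, 30), "linked_incidents": 7},
    {"opened": datetime(2014, 2, 3, 11, 15),  "linked_incidents": 19},
]

count = len(major_incidents)  # how often the Major Incident plan is invoked
avg_linked = mean(mi["linked_incidents"] for mi in major_incidents)  # relative impact

opened = sorted(mi["opened"] for mi in major_incidents)
gaps = [(b - a).total_seconds() / 3600 for a, b in zip(opened, opened[1:])]
mtbmi_hours = mean(gaps)  # Mean Time Between Major Incidents, in hours
```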

There you go. In summary, handling Major Incidents isn’t a huge leap from the method that you use to handle day-to-day Incidents. It requires enhanced communication and possibly measurement.

I hope that you found this article helpful.
