“Fightback” – manage a technology failure – a case study

A technology failure: they all blame you, but you don’t have a clue what caused it.

This story is based on real events. The names of individuals have been changed, and companies are not identified. Simplifications have been made for the sake of narrative.

The story illustrates the pain that comes from swapping one piece of technology for another in a retail business. It also demonstrates handling unknown-unknowns using intelligence-led project management.

A 9-page story, plus commentary.
Author: Adrian Cowderoy

“Fightback”

Ankur’s dream

Ankur started the day so early that the birds had just started singing in the tree outside. He glanced through the window at a crow. It had scared the small birds from the feeding table. Usually Ankur would have been angry with the thief, but after the excitement of yesterday, he was still hyper. The crow meant nothing.

Once, there was a dream. It came from the antiquated sequence the company used for managing the product information in its online store. Ankur and the tech team talked of shining new technology that worked properly. The senior execs had a vision of technologies that would allow the company to go beyond its current limitations. And the small army of people who managed the day-to-day business longed for their lives to be easier. A shared dream. If everyone believes it, then it must happen.

A tech team was born, with the sole purpose of delivering the dream. By the time of the launch it had a dozen people in it, some full time like Ankur, others helping when needed. Ankur was responsible for the quality of the product data. It needed to flow smoothly through the daisy chain of technologies, and it all needed to be there. Missing products or badly displayed details meant lost sales. Lost sales meant a lot of angry people at the desks near him. They’d blame him.

For two years Ankur and the rest of the team struggled through one problem after another. Replacing old technology with new was like an ultramarathon, with shifting types of pain, growing exhaustion, and a supporters’ club that grew bored. And all the time there were fears that when they switched the old for the new, something would break in the daisy chain of technologies. He feared the launch of the new system would be like a space rocket starting to lift off, then flopping back onto the launch pad in a controlled abort. A “rollback” was the term the team used.

Ankur’s fears had proven groundless. Yesterday’s launch was smooth, like a dream. Yes, technology sometimes fails. But not this time.

Man concentrating on a computer screen, illustrating the pain of a technology failure.
(Adobe stock image, used for illustrative purposes.)

The worst nightmare

Ankur glanced at his mobile as he plodded into the kitchen. His attention was not on the spotless white surfaces and gleaming equipment; it was on the company’s webstore. It was still there, serving page after page of beautifully presented products.

Ankur had just sorted his son’s breakfast cereals when he noticed a web page displaying nothing but the product name, a blank where there should have been images, and no detail about the product.

“Add milk,” he said to his son and headed to the section of the living room he used for his home office. One of his slippers fell off and he did not even pause to retrieve it.

His laptop gave him access to technology backdoors. Via these he could study the daisy chain of technologies. He checked for error messages from the systems, but there were none. It should all be working. It worked yesterday – thousands of products, with just four unexplained errors that had come from corrupted data. The senior management said they weren’t holding up the launch because of four minor products. “Sort them out after launch.”

Ankur had a hunch. He compared the new faulty product to the four unexplained errors. It was the same pattern. He went through a tech sequence to look for corrupted data. Where there had been four errors, there were now 60. All of them were “data corruptions”. There was no explanation, and there was no way to fix them. Even republishing them didn’t work.

“Paapa, you left your shoe,” his son interrupted his thoughts.

Ankur looked up and forced a smile of thanks.

“Paapa, what’s wrong?”

My worst nightmare, Ankur said to himself. “Urgent work,” he said to his son.

Hanging on to hope

Ankur checked for colleagues showing as “Available”. The Ukrainian tech team was active because their clocks were two hours ahead of British time. He typed a help message in the team chat. And with that, the floodgates opened. People who had been hiding from their messages appeared within minutes. With the help came the questions. “What caused this?” “Can it be corrected?” “Will it happen again?” Give me time, he wrote back. But the clock clicked with digital simplicity. Three hours later, the likely causes had been disproved. “I don’t know,” Ankur was forced to admit.

The executive order came like a thunderclap. “Abort, roll back.” It triggered a complex technical sequence, as the new system was replaced by the old technology they’d hated.

Ankur’s worst nightmare had happened: an unexplained technology failure.

As he dived into another video call, he waited for the inevitable question: “After all the months of your team’s testing, what went wrong?” There was a hint of accusation in their voices. He tensed his face muscles as he prepared to reply. “I’ve got suspicions,” he said. “There’s still hope.”

Focus shifted onto the discussion of hope, but the flickers of hope were being snuffed out like candle flames. Ankur stood to get something, and as he did he realised he was still wearing his pyjama bottoms. He slumped into his chair and put his face in his hands. I thought I knew this technology, he said to himself, I was wrong. He thought of his colleagues. I’ve let them down – all of them.

The worst of it was the reaction from the Chief Technology Officer. There were questions. The questions took time to answer – time that Ankur wanted to spend on the investigation. And there were no words of criticism. The lack of criticism was intimidating. What’s he really thinking? Ankur wondered.

The Zombies

Among the team there was a new project manager. Jeanne had a smile and a laugh, bounding enthusiasm, and she’d fitted into the team immediately. Within a fortnight it felt like she’d been there for months. Jeanne did not know the company, although she’d worked in so many before that it barely mattered. She didn’t understand the details of the technology in the way her long-standing predecessor had, but she was learning fast. And she did have personality: she was one of those overpowering figures, with a laugh like a bullhorn, who didn’t care what people thought of her.

Ankur watched Jeanne’s reaction to the news of the rollback. She sat in her home office, watching the team’s faces on her screen, and wrinkled her nose. “It reminds me of those old zombie films,” she said. “First there were four zombies, then sixty, and if we’d left it longer there could have been hundreds wandering around and scaring the customers.” Jeanne’s analogies were odd.

“Well,” she continued, “the good news is that we’re not fighting zombies. We’ve hit an unknown-unknown. We’ve all seen technology failures before in our careers. This one is as bad as it gets, but the routine is still the same. We need to be methodical: theorise, test the hypothesis, and keep going until we understand enough to look for a solution.” Jeanne leant back in her chair and stared straight into the camera. “Think like a scientist, not a zombie hunter.”

For a moment Ankur wondered if Jeanne was serious, but Jeanne had seen the gravity of the situation as fast as anyone, and this was her first smile for hours. Her confidence in science was like a ray of hope. But it was just hope. The other side of the coin came from previous experiences: there might be no easy solution.

A message flashed on Ankur’s computer. It was an announcement from the Chief Technology Officer, sent to everyone Ankur knew in the company and many others as well. Ankur read the words. “We took a risk, it didn’t work, we rolled back, it was the right decision.” It read like he was protecting the team from criticism. Bless him.

But the reality wasn’t as simple as words. The next day, Ankur was in the office. He could see sections of the company replanning their work, and changing their aspirations. There were furtive glances at him, and when people spoke it was with guarded words.

Researching the unknown

Ankur turned his attention to the failed technology. To the customers, the webstore looked like a single app. What they didn’t see was the daisy chain of ten different technologies. Each link in the daisy chain contributed more information or transformed the data into a different form. It allowed the webstore to get data from a variety of different sources. But with a daisy chain, there were plenty of places where faults could occur.

Everyone had ideas about the cause of the fault, and they all wanted their ideas checked. That was slow, because there were only two test copies of the system, so only two people could test at a time.

A routine developed. There were catchups two or three times a day to compare notes, and see how people could help each other. Drafts and results were shared. Nobody ever said a word of complaint, even when Ankur made the same mistake twice because he was so tired. And at the end of the day they worked out the priority for the next day. It’s an action-on intelligence cycle, Jeanne explained. Ankur didn’t care what it was called as long as it worked.

Within days it became clear that the technology failure was not in one of the usual suspect areas. It was something entirely new. “That’s a comfort,” Jeanne commented, “because it means we didn’t miss it in testing. We just didn’t know. It really was an unknown-unknown.”

Ankur heard the words, but he still criticised himself for not knowing. And more, the Chief Technology Officer had gone quiet. Curtis was the best CTO Ankur had known. As a human, Curtis was striking, with a tall athletic frame and long hair that made him look like a techie. His casual attitude caused confusion for newcomers because they assumed he was just one of them. As a CTO, Curtis knew the technology and its strengths and limitations, yet he let the different teams work in their own ways, and he was careful to avoid the wildcat ideas that Ankur had seen with some CTOs.

Curtis was not normally a quiet person. But now all he gave Ankur was words of encouragement, and some suggestions. Curtis’s smile and patience were frightening, given the number of problems he was now facing. He’s judging us, Ankur decided. He wondered how long he’d still have a job. Then he heard one of the other people on the team saying the same thing.

The dance of the managers

It wasn’t just Jeanne and Curtis who were pestering Ankur. There was a string of others, some of whom he barely knew. There were more managers asking questions than there were people in the technical team. “The dancing managers”, he called them in private. Ankur resented the time wasted on repeating the same message to different people. Answering Curtis was bad enough. Now he was spending more time talking than working on the problem.

That’s when Jeanne and Curtis cut in. One by one, they tracked down the people asking questions. They set up alternative communications for the dancing managers. With that, the questions to Ankur stopped and the words of support started. For the first time in days, Ankur realised he no longer needed to protect his back. He could just focus on the problem.

It was clearly not so for Jeanne or Curtis. From their messages, he could see the long hours they were working. When he posted an announcement about something, there’d be a line of little smiles or tears or celebration hats after his message. They were as close as friends.

There were also Jeanne’s executive summaries. She seemed intent on making complicated things seem absurdly simple. She’d be there in almost all of the technical meetings. Sometimes she prompted them with ideas, sometimes with questions to clarify, and at times of tension she broke the sequence with a joke. Mostly she was silent, nodding approval. And then towards the end of the meeting, she’d summarise it with just a sentence or two. “Yes,” Ankur would say, “that’s true if you add in this extra word, but there’s much more to it than that.” To which Jeanne replied: “Too many words.”

Ankur puzzled over how managers could operate with so little detail. It was like watching people dancing in the dark. They kept bumping into each other.

Management fantasies

The demand for a relaunch date came from the dancing managers. “We must have a date to put in our calendars,” they insisted.

“We don’t know,” Ankur protested. “Any date we give you will be a fantasy.”

Jeanne relayed their response. “They’re not happy. Their world is one of certainties, broken by coping with problems. They want a solution, and they want it now.”

“But Jeanne, it’s pointless speculating about solutions when we don’t understand the problem. Anything we change might not fix the problem. Fixing this could take weeks or months, or perhaps never.”

Jeanne massaged her left earlobe. It had a tiny gold stud. “Let’s try this a different way. If you could identify the problem in the next three days, and it was a simple one to build, how long until we go live again? I’d guess one week for technology, two weeks for testing, and one week contingency. Does that sound right?”

Ankur gestured with his arms. “It’s too dangerous a promise.”

“It’s not a promise. It’s a statement that if everything went perfectly, that’s the earliest. It’s an aspiration – the best that could happen. Clearly, alternative scenarios exist.”

“Why do they want a target that’s a fantasy?”

“Because where fear and pain are all around, people like to hope.”

Ankur’s hunch

After the abort and rollback, Ankur had imagined they’d find the fault quickly. But a week into the testing and they still hadn’t found the cause. That was the trouble with a daisy chain of technologies, where a problem with one technology could trigger problems with another, further down the chain.

Ankur had a hunch. It related to a series of tests he’d performed before the failure occurred. Those were tests to see what happened when there were large updates. He repeated the tests, looking at extreme cases. He wanted to know what happened when the technology was intentionally overloaded.

He was on a roll. It took another week with his Quality Assurance team, but he found what was causing the technology failure. It was a design limitation in one of the core building blocks. Nobody had anticipated it, and the warning signals had been too subtle to notice.

The snag was, the fault could be triggered by an entirely innocent action, and it could come at any time. Worse, there was no obvious way to correct it. It was clearly a technology failure.

“Keep hoping,” the team said to each other. “There must be a way to fix it.”

A broken component. Everyone on the team had ideas. Most of the ideas involved replacing it with something that worked better. The hope of finding the solution gave Ankur a buzz. He worked from early in the morning to late in the evening.

The buzz lasted less than two days. Then it became obvious the cost of reengineering it would be enormous and it would take ages to build … and even then it might not work. All options seemed bad.

Turning point

Then one of the team had an idea about the technology daisy chain, and how it was set up. Ankur ran an experiment overnight and spent the next day checking. The number of failures reduced a little. So he changed the settings, and tried again on the following day.

Finding the correct setup was going to be trial and error. Each trial took a day. That meant plodding on for days. Sometimes there was an improvement, and at others it went backwards. And each time Curtis wanted a report.

The breakthrough came when Ankur found a setup that would always work. Now, at least, there was a way of correcting any fault that found its way to the website. That was a day of celebration. But it was bittersweet. It worked, but it was incredibly slow.

Another week of testing, and Ankur and his team had found a “sweet spot”. It was a compromise, but one Curtis could live with. He gave it the green light. “Complete it, test it, and prepare to launch it.”

The ghost of a zombie

Evil ghost.
Adobe stock image.

Ankur still couldn’t sleep at night. It was the fear of something else going wrong. Until now, the urgency had been to find a solution. Now he had a new fear: what if the new solution also broke? To suffer one disaster had been a nightmare. To suffer a second within such a short time would be heart-breaking. Worse, people would ask about Ankur’s professional competence.

The clock was ticking. Curtis and the dancing managers needed the new technology working again, other projects were being held up, and costs were increasing. The old tech had to be switched off.

Ankur pushed himself from early morning into the evenings, researching the possibilities of what might go wrong. He had his team work through the weekends, testing each of the possibilities. The analysts were posting new ideas for testing. Anybody he wanted for help, he got.

The ten-day count-down to launch had started. Data was being prepped, people retrained, experts put on standby to help with surprises, and marketing activities were rescheduled. Everybody was watching.

Relaunch

Ankur longed for a relaunch, but he also feared it. One abort was heart-breaking. Two would be a level of pain that he wasn’t sure he could handle. He could feel his guts churning, and at times his back hurt.

Ultimately, the launch decision had to be made by Curtis and the senior execs. They called it a go/no-go call. Ankur’s successes were an input, but so was his list of the things he hadn’t been able to test. The rest of the team had added a list of caveats that went on for three pages. “The caveats are rare edge-cases,” Curtis insisted, “and if any of them did happen, there’s a way to cope.” But Ankur still remembered the “rare” case that had brought down the system a month and a half earlier. As Jeanne said, “unknowns happen”.

Launch day. The launch was not a single dramatic moment like a rocket lift-off. Like last time, it took hours of work, shifting the setup of different technologies, checking, then shifting more, until finally the product data started flowing down the new daisy chain and into the webstore. The old system sat silent and barely used.

This time, problems appeared which hadn’t happened last time. We can solve this – everyone agreed. Ankur worked with the team, puzzling them out, fixing them. Finally the site was live. He watched as they tested to ensure the full daisy chain of technologies was still working.

That night, Ankur hoped to sleep well. But the fear was still there. Last time, the big problems had begun the day after launch.

The next day produced more surprises, but they were quick to handle. The day after was mad from the start, but by the end of the day they’d fixed the new problems. It was that same intelligence sequence they’d been using all along: detect, prioritise, research, analyse, decide, build & test, then inform everyone.

The day after that was quiet – so quiet that Ankur could take the afternoon off to play with his kids. The next week had surprises, but the team was quick at research-analyse-implement. Two and a half months after the disaster, Ankur’s world was no longer dominated by fears. Yes, there would be more surprises, but the team had become expert at resolving them.

Commentary on how to manage a technology failure

The method for handling the crisis used intelligence-led project management (see more) within an agile project discipline.

What if it had not been an agile project?

In classic project management, such as PMP or PRINCE2, a technology failure of this kind represents a “project exception”. We write a report for the senior execs, explaining the cause, the consequences, the options and recommendations, and we add some lessons learned. For a sample structure, see https://www.stakeholdermap.com/project-templates/prince-2-exception-report.html

  • In the case illustrated here, the consequences were obvious and could be reported immediately. The request was to go into a research mode, then come back at the end with a new proposal.
  • The execs were not pushed into making big judgement calls – it was clearly the best way forward. During the subsequent phase, they made small tactical decisions from the options presented to them. They relied on the team doing the research and consultation before the calls. 
  • By the time the final “go/no-go” decision was made for re-launch, the technology risks were well understood. One of the strengths of intelligence-led project management is that it increases the level of understanding.

Unknown unknowns

The discipline of managing unknown-unknowns starts with techniques to identify them, using just anomalous symptoms. The method is described here on this website. 

In the example above, at the time of the original launch the only symptoms were four corrupt product records. There was no indication that they could multiply to dozens or hundreds. It was a complete surprise.

Gambling in go/no-go calls

Group of dice close up.
Adobe stock image.

Should the original launch have gone ahead? 

If this had been a safety-critical system, then no. But in such systems the level of testing is exhaustive and very expensive. This was not one of those cases.

The senior exec decision was based on the business urgency of proceeding without further delays. The risk of proceeding was a gamble, but acceptable because “rollback” was possible and few customers would notice that something went wrong. Given the successful rollback, their choice was valid.

Daily updates

The project was following the action-on intelligence cycle, with a daily cadence that’s superficially similar to the Kanban agile method.

However, the details are different. Ankur’s and Jeanne’s project followed the same sequence as in terrorist-prevention and other critical intelligence-led activities. Every day there were new priorities, new research goals for the day, follow-up on the previous day’s analysis, and prep for the management reporting. Information was shared within the team so the daily sequence could be maintained.

Sequence of activities for identifying risk: brief, research, analysis, engagement, and decision.
Action-on intelligence cycle – for more please see here
(c) Adrian Cowderoy

Research focus

The case study focuses disproportionately on the intelligence research process that leads to understanding, and on the project management role of communications. A lot of analysis work was continuing at the same time, but for story-telling reasons that has been skipped here.

Images of work ethic, teamwork, adaptability, problem solving, communication, creativity, time management, leadership.
For more about the roles and the skills people need, please see here.

Training in intelligence-led project management

The project manager was experienced at applying intelligence-led project management, but nobody else had been trained. Training was done entirely by precedent and without explaining the theory.

This suggests the technique is easy to learn and intuitive, provided there is a mentor.

For another case study of the method, see “Today will be different” and there’s a fictionalised example at “Sweet and Sour Chaos” – both on this website.

Review of the intelligence-led process

How efficiently did the intelligence-led approach work?

With hindsight, the team might have been able to save one week by reviewing the research progress more thoroughly. However the new launch date could not have been brought forward, because it would have clashed with holiday commitments. In terms of effort, the cost could not have been reduced without slowing the process.

The upside? The system is now well understood, and well documented. It’s been possible to reduce the staffing level for business-as-usual to a very low level. And everyone is proud of what they achieved.

Commentary on the risks from replatforming a PIM

This case was about a Product Information Management system (PIM) and the many related technologies linked to it. The technologies were being replaced by new versions – called “replatforming”. Any of these technologies had the potential to fail, bringing down the whole system.

The notes here illustrate the range of things that can go wrong.

As background, PIMs are databases of product information – they list information about every product a business sells (or has sold in recent years). They include the details that customers see plus details for internal use, together with links to images and videos, and links to related products.

  • PIMs are business critical – if any part of their technology fails, then either there are no product updates or trading stops entirely.

PIMs are a hub that represents a “single source of truth” about product data. PIMs supply product data to webstores, warehouses, accounting systems, and to other companies who resell the products. Unfortunately, most of these other systems store product data in different formats, and the data needs to be translated/transformed when it’s passed between the systems. PIMs also use data from other sources, such as accounting, pricing and image stores. That data also needs to be converted.

  • PIMs are heavily dependent on the quality of the integrations with other systems. When replatforming, some of the old integration methods may need to be replaced with more reliable and secure technology.
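To make the format-translation problem concrete, here is a minimal sketch, in Python, of the kind of mapping an integration layer performs when passing a PIM record to a webstore. All the field names, the target schema and the currency handling are invented for illustration; a real integration would be defined by the PIM vendor’s tooling and the webstore’s actual data model.

```python
# Hypothetical sketch: translating one PIM product record into the format a
# webstore expects. All field names and the target schema are invented.

def pim_to_webstore(record: dict) -> dict:
    """Map a PIM record to the (hypothetical) webstore schema, defensively."""
    images = record.get("media", {}).get("images", [])
    return {
        "sku": record["sku"],                       # mandatory - fail loudly if absent
        "title": record.get("name", {}).get("en", ""),
        "description_html": record.get("long_description", ""),
        "price_pence": int(round(float(record.get("price", {}).get("GBP", 0)) * 100)),
        "image_urls": [img["url"] for img in images if "url" in img],
        "related_skus": record.get("cross_sell", []),
    }

if __name__ == "__main__":
    sample = {
        "sku": "AB-1234",
        "name": {"en": "Walking boots"},
        "price": {"GBP": "89.99"},
        "media": {"images": [{"url": "https://example.com/ab-1234.jpg"}]},
    }
    print(pim_to_webstore(sample))
```

The defensive handling of optional fields matters: a record that arrives with missing images or an unparseable price is exactly the kind of “blank product page” described in the story.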

Dirty data. Replacing a PIM that has been in use for years involves converting all its existing data to a new format. Errors and inconsistencies in the old data may have to be removed, and unorthodox rules have to be resolved.

  • Data cleaning effort can be unpredictable if there’s not a careful review at the start. Much of the effort falls on the product teams, not the technical ones.
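As a hedged illustration of that review, the sketch below audits legacy records before migration and tallies the problems, so the cleaning effort can be estimated up front. The rules and field names are hypothetical; a real audit is driven by the target PIM’s data model and the review done at the start of the project.

```python
# Hypothetical sketch of a pre-migration data audit. The rules and field
# names are invented; a real audit follows the new PIM's data model.

from collections import Counter

def audit_record(record: dict) -> list[str]:
    """Return the problems that would block or corrupt this record's migration."""
    problems = []
    if not record.get("sku"):
        problems.append("missing sku")
    if not record.get("name", {}).get("en"):
        problems.append("missing English name")
    if not record.get("media", {}).get("images"):
        problems.append("no images")
    price = record.get("price", {}).get("GBP")
    try:
        if price is None or float(price) <= 0:
            problems.append("missing or non-positive price")
    except (TypeError, ValueError):
        problems.append("unparseable price")
    return problems

def audit_catalogue(records: list[dict]) -> Counter:
    """Tally problems across the catalogue so the cleaning effort can be sized."""
    tally = Counter()
    for record in records:
        tally.update(audit_record(record))
    return tally

if __name__ == "__main__":
    legacy = [
        {"sku": "AB-1234", "name": {"en": "Walking boots"}, "price": {"GBP": "89.99"},
         "media": {"images": [{"url": "https://example.com/ab-1234.jpg"}]}},
        {"sku": "", "name": {}, "price": {"GBP": "free"}},   # a typically dirty record
    ]
    print(audit_catalogue(legacy).most_common())
```

Running an audit like this early moves the surprises to the start of the project, where the product teams can plan the cleaning work alongside everything else.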

Load testing. The testing before launch should check that the system can handle huge bursts of data being changed at the same time. For example, if someone in the product team changes a popular cross-sell recommendation, it forces changes to a large number of records. Testing of large volumes of data needs to be done across the entire daisy chain, in a safe test environment where damage can be repaired.

  • Surprises happen, because the real world is not the same as the test world. Some of the test environments are not perfect replicas. And in test conditions, humans don’t make the same irrational actions and mistakes they do in day-to-day conditions.
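The burst scenario above can be scripted against a safe test environment. The sketch below is hypothetical: the staging URL, payload and concurrency level are invented, it assumes the third-party requests library is installed, and the follow-up verification of the full daisy chain is only indicated in a comment.

```python
# Hypothetical sketch: firing a burst of product updates at a safe staging
# environment to see whether the daisy chain keeps up. The URL, payload and
# volumes are invented; it assumes the third-party 'requests' library.

import concurrent.futures
import time

import requests

STAGING_URL = "https://pim-staging.example.com/api/products/{sku}"  # hypothetical

def push_update(sku: str) -> bool:
    """Send one product update; return True if the staging PIM accepted it."""
    payload = {"cross_sell": ["XS-0001", "XS-0002"]}  # the 'popular recommendation' case
    resp = requests.put(STAGING_URL.format(sku=sku), json=payload, timeout=30)
    return resp.ok

def burst_test(skus: list[str], workers: int = 20) -> None:
    """Fire all the updates at once and report how long the burst took."""
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(push_update, skus))
    print(f"{sum(results)}/{len(skus)} updates accepted in {time.time() - start:.1f}s")
    # Follow-up checks (not shown): confirm every record reached the webstore
    # intact, and that none came through blank or corrupted.
```

The point is not the absolute numbers but whether anything downstream silently drops or corrupts records under load – the failure mode that forced the rollback in the story.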

Changed workflows. There’s a sequence of tasks that the admin staff use for adding and changing product details. When a PIM is replatformed, the detail of the tasks change because of the new technology. There may also be larger changes because the new technology opens possibilities that did not exist before. But if the company hasn’t documented the detailed subtleties of how they currently work, there could be critical cases that are not being handled by the new technology.

  • If workflow change is not managed, the launch is likely to be delayed while the business teams adapt, or urgent changes have to be made to the technology.

In conclusion: PIM replatforming is inherently risky. It’s important to have an experienced integration partner to manage the entire process.