Technology failure. This is a case study of a quality assurance manager facing his worst nightmare, and how he reacted.
It’s an actual example, illustrating the pain that comes from replatforming – swapping one piece of technology for another. It also demonstrates the intelligence-led project management described on this website.
An 8½ page story, plus commentary on the methodology.
1. Ankur’s dream
Ankur started the day so early that the birds were singing in the tree outside. He glanced through the window at a crow. It had scared the small birds from the feeding table. Usually Ankur would have been angry with the thief, but after the excitement of yesterday’s launch, he was still hyper. The crow meant nothing.
Once, a long time ago, there was a dream. It came from the antiquated sequence the company used for managing the product information in its online store. Ankur and the tech team talked of shining new technology that worked properly. The senior execs had a vision of technologies that would allow the company to go beyond its current limitations. And the small army of people who managed the day-to-day business longed for their lives to be easier. “Just one little thing,” they insisted. “The new technology must work almost exactly the same way as the old one.” A shared dream. If everyone believes it, then it must happen.
A tech team was born, with the sole purpose of delivering the dream. By the time of the launch it had a dozen people in it, some full-time like Ankur, others helping when needed. Ankur was responsible for the quality of the product data. It needed to flow smoothly through the daisy chain of technologies, and it all needed to be there. Missing products or badly displayed details meant lost sales. Lost sales meant a lot of angry people in the desks near him. They’d blame him.
For two years Ankur and the rest of the team struggled through one problem after another. Replacing old technology with new was like an ultramarathon, with shifting types of pain, growing exhaustion, a supporter’s club that became bored, and all the time there were fears that when they switched the old for the new, something would break in the daisy chain of technologies. He feared the launch of the new system would be like a space rocket exploding on the launch pad.
Ankur’s fears had proven groundless. The launch had been smooth, like a dream. Yes, technology sometimes fails. But not this time.
2. The worst nightmare
Ankur glanced at his mobile as he plodded into the kitchen. His attention was not on the spotless white surfaces and gleaming equipment; it was on the company’s webstore. It was still there, showing page after page of beautifully presented products.
Ankur had just sorted his son’s breakfast cereals when he noticed a web page displaying nothing but the product name, a blank where there should be images, and no detail about the product.
“Add milk,” he said to his son and headed to the section of the living room he used for his home office. One of his slippers fell off and he did not even pause to retrieve it.
His laptop gave him access to technology backdoors. Via these he could study the daisy chain of technologies. He checked for error messages from the systems, but there were none. It should all be working. It worked yesterday – thousands of products, with just four unexplained errors that had come from corrupted data. The senior management said that was too minor to hold up a launch.
Ankur had a hunch. He compared the new faulty product to the four unexplained errors. It was the same pattern. He went through a tech sequence to look for corrupted data. Where there had been four errors, there were now 60 errors – “data corruptions”. There was no explanation, and there was no way to fix them.
“Paapa, you left your shoe,” his son interrupted his thoughts.
Ankur looked up and forced a smile of thanks.
“Paapa, what’s wrong?”
My worst nightmare, Ankur said to himself. “Urgent work,” he said to his son.
3. Hanging on to hope
Ankur checked for colleagues showing as “Available”. The tech team in Ukraine was active because their clocks were two hours ahead of British time. He typed a help message in the team chat. And with that, the floodgates opened. People who had been hiding from their messages appeared within minutes. With the help came the questions. “What caused this?” “Can it be corrected?” “Will it happen again?” Give me time, he wrote back. But the clock clicked with digital simplicity. Three hours later, the likely causes had been disproved. “I don’t know,” Ankur was forced to admit.
The executive order came like a thunderclap. “Shut it down and roll back.” It triggered a complex technical sequence, as the new system was replaced by the old technology they’d hoped would go.
Ankur’s worst nightmare had happened: an unexplained technology failure.
As he dived into another video call, he waited for the inevitable question: “After all the months of your team’s testing, what went wrong?” His fear now was of looking incompetent. “I’ve got suspicions,” he said. “There’s still hope.”
Focus shifted onto the discussion of hope. A tricky subject given that the flickers of hope were being snuffed out like candle flames. He stood to get something, and as he did he realised he was still wearing his pyjama bottoms.
Ankur finished the call, slumped into his chair and put his face in his hands. He thought he knew the technology. I was wrong, he said to himself. He thought of his colleagues on the tech team, and the internal users who had been like colleagues throughout the project. I’ve let them down – all of them.
The worst of it was the reaction from the Chief Technology Officer. There were questions. The questions took time to answer – time that Ankur wanted to spend on the investigation. And there were no words of criticism. The lack of criticism was a relief, but it was also intimidating. What’s he really thinking? Ankur wondered.
3. The Zombies
Among the team there was a new project manager. Jeanne had a smile and a laugh, bounding enthusiasm, and she’d fitted into the team immediately. Within a fortnight it felt like she’d been there for months. Jeanne did not know the company, although she’d worked in so many before that it barely mattered. She didn’t understand the details of the technology in the way her long-standing predecessor had, but she was learning fast. And she did have personality. She was one of those overpowering figures, with a laugh like a bullhorn, and one who did care what people thought of her.
Ankur watched Jeanne’s reaction to the news of the rollback. She sat in her home office, watching the team’s faces on her screen, and wrinkled her nose. “It reminds me of those old zombie films,” she said. “First there were four zombies, then sixty, and if we’d left it longer there could have been hundreds wandering around and scaring the customers.” Jeanne’s analogies were odd.
“Well,” she continued, “the good news is that we’re not fighting zombies. We’ve hit an unknown-unknown. We’ve all seen technology failures before in our careers. This one is as bad as it gets, but the routine is still the same. We need to be methodical: theorise, test the hypothesis, and keep going until we understand enough to look for a solution.” Jeanne leant back in her chair and stared straight into the camera. “Think like a scientist, not a zombie hunter.”
For a moment Ankur wondered if Jeanne was serious, but Jeanne had seen the gravity of the situation as fast as anyone, and this was her first smile for hours. Her confidence in science was like a ray of hope. But it was just hope. The other side of the coin came from previous experiences: there might be no easy solution.
A message flashed on Ankur’s computer. It was an announcement from the Chief Technology Officer, sent to everyone Ankur knew in the company and many others as well. Ankur read the words. “We took a risk, it didn’t work, we rolled back, it was the right decision.” It read like he was protecting the team from criticism. Bless him.
But the reality wasn’t as simple as words. The next day, Ankur was in the office. He could see sections of the company replanning their work, and changing their aspirations. There were furtive glances at him, and when people spoke it was with guarded words.
4. Researching the unknown
Ankur turned his attention to the failed technology. To the customers, the webstore looked like a single app. What they didn’t see was the daisy chain of ten different technologies. Each link in the daisy chain contributed more information or transformed the data into a different form. It allowed the webstore to get data from a variety of different sources. But with a daisy chain, there were plenty of places where faults could occur.
Everyone had ideas about the cause of the fault, and they all wanted their ideas checked. That was slow, because there were only two versions of the system, so only two people could test at a time.
A routine developed. There were catchups two or three times a day to compare notes, and see how people could help each other. Drafts and results were shared. Nobody ever said a word of complaint, even when Ankur made the same mistake twice because he was so tired. And at the end of the day they worked out the priority for the next day. It’s an action-on intelligence cycle, Jeanne said. Ankur didn’t care what it was called as long as it worked.
Within days it became clear that the failure was not coming from the usual suspect areas in the technology. It was something entirely new. “That’s a comfort,” Jeanne commented, “because it means we didn’t miss it in testing. We just didn’t know. It really was an unknown-unknown.”
Ankur heard the words, but he still criticised himself for not knowing. And more, the Chief Technology Officer had gone quiet. Curtis was the best CTO Ankur had known. As a human, Curtis was striking, with a tall athletic frame and long hair that made him look like a techie. He knew everyone and his casual attitude caused confusion for newcomers to tech, when they assumed he was just one of them. As a CTO, Curtis knew the technology and its strengths and limitations, he let the different teams work in their own ways, and he carefully avoided the wildcat ideas that Ankur had seen with some CTOs.
Curtis was not the kind of person who was normally quiet. But now it was just words of encouragement he gave Ankur, and some suggestions. Curtis’s smile and patience were frightening, given the number of problems he was now facing. He is judging us, Ankur decided. He wondered how long he’d have a job for. And then he heard one of the other people on the team saying the same thing.
5. The dance of the managers
It wasn’t just Jeanne and Curtis who were pestering Ankur. There was his own boss, and also a string of others, some of whom he barely knew. There were more managers asking questions than there were people in the technical team.
Ankur hated the questions, and the repetition. Answering Curtis was bad enough. Now it looked like it was going to become impossible.
That’s when Jeanne and Curtis cut in. One by one, they tracked down the people asking questions. There must have been some private conversations, because afterwards the questions stopped and the words of support started. For the first time in days, Ankur felt he did not need to protect his back. He could just focus on the task.
It was clearly not so for Jeanne or Curtis. From their messages, he could see the long hours they were working. When he posted an announcement about something, there’d be a line of little smiles or tears or celebration hats after his message. They felt as close as friends.
There were also Jeanne’s executive summaries. She seemed to be intent on making complicated things seem absurdly simple. She’d be there in almost all of the technical meetings. Sometimes she prompted them with ideas, sometimes with questions to clarify, and at times of tension she broke the sequence with a joke. Mostly she was silent, nodding approval. And then towards the end of the meeting, she’d summarise it with just a sentence or two. “Yes,” Ankur would say, “that’s true if you add in this extra word, but there’s much more to it than that.” To which Jeanne replied: “Too many words.”
How can managers perform with so little detail in their heads? Ankur wondered. It was like watching people dancing in the dark. They kept bumping into each other.
6. “It’s a target, Jeanne, but not a target as we know it”
The demand for a relaunch date came from Curtis and the dancing managers. They didn’t say it to Ankur, but to Jeanne. It seemed to be her job to break the request to Ankur and the tech team.
“We can’t say when we can relaunch,” Ankur protested.
“They insist there must be a re-plan,” Jeanne relayed the response. “One of the troubles with managers is that they need dates and deliverables in order to plan and schedule. Uncertainty and continual change break the way they work.”
“But Jeanne, we don’t know the problem, we don’t know the solution, so there is no date,” Ankur insisted.
Jeanne massaged her left earlobe, where there was a tiny gold stud. “Let’s try this a different way. If you could identify the problem in the next three days, and it was a simple one to fix, how long until we go live again? I’d guess one week for technology, two weeks for testing, and one week contingency. Does that sound right?”
Ankur gestured with his arms. “It’s too dangerous a promise.”
“It’s not a promise. It’s a statement that if everything went perfectly, that’s the earliest. It’s an aspiration, not a promise. I’ll make it clear that there’s a one in six chance we’ll hit the target and they must have contingency.”
“Why do they want a target that’s so unlikely?”
“It’s an aspiration to show we’re trying, and it buys them time to build their Plan B.” Jeanne shrugged. “Aspirational targets remind me of Mr Spock in Star Trek, describing an unfamiliar new form of intelligent life. It’s life, Jim, but not life as we know it.”
7. Ankur’s hunch
After the failure and the rollback, Ankur had imagined they’d find the fault quickly. But a week into the testing they still hadn’t found the cause. That was the trouble with a daisy chain of technologies, where a problem in one place could have an impact much further down the chain.
Ankur had a hunch. It related to a series of tests he’d performed before the failure occurred. Those were tests to see what happened when there were large updates. He repeated the tests, looking to see what happened in the extreme cases where the technology was intentionally overloaded.
He was on a roll. It took another week with his Quality Assurance team, but he found what was causing the technology failure. It was a design limitation in one of the core building blocks. Nobody had anticipated it, and the warning signals had been too subtle to notice.
The snag was, the fault could be triggered by an entirely innocent action, and it could come at any time. Worse, there was no obvious way to correct it. The technology had failed. Years of effort would be lost unless something could be done.
“Keep hoping,” the team said to each other. “There must be a way to fix it.”
A broken component. Everyone on the team had ideas. There was a suggestion to replace it with something that worked better. Ankur feared the suggestion. It would be slow and costly, and could introduce new risks. His preference was to reengineer the technology to get around the limitation.
The hope of finding the solution gave Ankur such a buzz that he was working into the evening and from early in the morning.
The buzz lasted less than two days. Then it became obvious the cost of reengineering it would be enormous and it would take ages to build … and even then it might not work. All options seemed bad.
8. Turning point
Then one of the team had an idea about the technology daisy chain, and how this section of it was set up. Ankur ran an experiment overnight and spent the next day checking. The number of failures reduced a little. So he changed the settings, and tried again on the following day.
Finding the correct setup was going to be trial and error. Each trial took a day. That meant plodding on for days. Sometimes there was an improvement, and at others it went backwards. And each time Curtis wanted a report.
The breakthrough came when Ankur had a setup that would always work. Now, at least, there was a way of correcting any fault that found its way to the website. That was a day of celebration. But it was bittersweet. It worked, but it was incredibly slow.
Another week of testing, and Ankur and his team had found a “sweet spot”. It was a compromise, but one Curtis could live with. He gave it the green light. “Complete it, test it, and prepare to launch it.”
9. The ghost of a zombie
Ankur still couldn’t sleep at night. It was the fear of something else going wrong. Until now, the urgency had been to find a solution. Now Ankur had a new fear: what if the new solution also broke? To suffer one disaster had been a nightmare. To suffer a second within such a short time would be heart-breaking. Worse, people would ask about Ankur’s professional competence.
The clock was ticking. Curtis and the dancing managers needed the new technology working again, other projects were being held up, and costs were increasing. The old tech had to be switched off.
Ankur pushed himself from early morning into the evenings, researching the possibilities of what might go wrong. He had his team work through the weekends, testing different possibilities. The analysts were posting new ideas for testing. Anybody he wanted for help, he got.
The ten-day count-down to launch had started. Data was being prepped, people retrained, experts put on standby to help with surprises, and marketing activities were rescheduled. Everybody was watching.
The launch decision had to be made by Curtis and the senior execs. They called it a go/no-go call. Ankur’s successes were an input, but so was his list of the things he hadn’t been able to test. The rest of the team had added a list of caveats that went on for three pages. “The caveats are rare edge cases,” Curtis insisted, “and if any of them did happen, there’s a way to cope.” But Ankur still remembered the “rare” case that had brought down the system a month and a half earlier. As Jeanne says, “unknowns happen”.
10. Relaunch
Launch day. The launch was not a super flash like a rocket. Like last time, it took hours of work, shifting the setup of different technologies, checking, then shifting more, until finally the product data started flowing down the new daisy chain and into the webstore. The old system sat silent, and barely used.
This time, problems appeared which hadn’t happened last time. We can solve this – everyone agreed. Ankur worked with the team, puzzling them out, fixing them. Finally the site was live. He watched as they tested to ensure the full daisy chain of technologies was still working.
That night, Ankur hoped to sleep well. But the fear was still there. Last time, the big problems had begun the day after launch.
The next day produced more surprises, and they were handled. The day after was mad from the start, but by the end of the day they’d fixed the problems. It was that same intelligence sequence they’d been using all along: detect, prioritise, research, analyse, decide, build & test, then inform everyone.
The day after that was quiet – so quiet that Ankur could take the afternoon off to play with his kids. The next week had surprises, but the team was quick at research-analyse-implement.
Two and a half months after the disaster, Ankur’s world was no longer dominated by fears. Yes, there would be more surprises, but the team had become expert at resolving them.
Commentary on how to manage a technology failure
The method for handling the crisis used intelligence-led project management within an agile project discipline.
What if it had not been an agile project?
In classic project management, such as PMP or PRINCE2, a technology failure of this kind represents a “project exception”. We write a report for the senior execs, explain the cause, the consequences, the options and recommendations, and we add some lessons learned. For a sample structure, see https://www.stakeholdermap.com/project-templates/prince-2-exception-report.html
- In the case illustrated here, the consequences were obvious and could be reported immediately. The request was to go into a research mode, then come back at the end with a new proposal.
- The execs were not pushed into making big judgement calls – it was clearly the best way forward. During the subsequent phase, they made small tactical decisions when there were options. They did not need to do research and consultation before the calls.
- By the time the “go/no-go” decision was made for launch, the level of understanding of the technologies was way beyond the previous launch – the risk assessment was thorough. One of the strengths of intelligence-led project management is that it increases understanding.
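As an illustration only, the exception report sections named above (cause, consequences, options, recommendations, lessons learned) could be held in a small structure. The sketch below is in Python. The class name, field names and sample wording are assumptions made for this example; they are not part of PRINCE2 or of the project described. It shows the position on the day of the rollback: the consequences could be reported straight away, while the cause was still blank.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExceptionReport:
    """Hypothetical container for the report sections named in the text."""
    cause: str
    consequences: List[str]
    options: List[str]
    recommendation: str
    lessons_learned: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "cause": self.cause,
            "consequences": self.consequences,
            "options": self.options,
            "recommendation": self.recommendation,
            "lessons_learned": self.lessons_learned,
        }
        return [name for name, value in sections.items() if not value]


# Sample wording paraphrased from the story; the cause is unknown at this point.
report = ExceptionReport(
    cause="",
    consequences=["Product pages showing no images or details"],
    options=["Roll back to the old system"],
    recommendation="Go into research mode, then return with a new proposal",
)

print(report.missing_sections())
```

The empty sections are the ones the research phase has to fill in before a new proposal can go back to the execs.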
The discipline of managing unknown-unknowns starts with techniques to identify them, using just anomalous symptoms. The method is described here on this website.
In the example above, when there was the original launch, the only symptoms were four corrupt product records. There was no indication that they could multiply to dozens or hundreds. It was a complete surprise.
Gambling in go/no-go calls
Should the original launch have gone ahead?
If this had been a safety-critical system, then no. But in those technologies, the level of testing is exhaustive and very expensive. This was not one of those cases.
The senior exec decision here was that the business urgency could not tolerate further delays for risks that seemed unlikely and manageable. It was a gamble to ignore the four corrupt records. They took the gamble knowing that it was possible to roll back to the previous technology within hours and few if any customers would notice. Given the successful rollback, their choice was valid.
The project was following the action-on intelligence cycle, with a daily cadence that’s superficially similar to the Kanban agile method. But the details are different. Ankur’s and Jeanne’s project followed the same sequence as in terrorism prevention and other critical intelligence-led activities. Every day there were new priorities, new research goals for the day, follow-up on the previous day’s analysis, and prep for the management reporting. Information was shared within the team so the daily sequence could be maintained.
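One pass of that cycle can be sketched in a few lines of Python. The stage names are taken from the sequence in the story (detect, prioritise, research, analyse, decide, build and test, inform). The names `Stage`, `next_stage` and `one_pass` are hypothetical, and the wrap-around from inform back to detect is an assumption about how the cycle repeats from one day to the next.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    """Stages of the cycle, in the order given in the story."""
    DETECT = 1
    PRIORITISE = 2
    RESEARCH = 3
    ANALYSE = 4
    DECIDE = 5
    BUILD_AND_TEST = 6
    INFORM = 7


def next_stage(stage: Stage) -> Stage:
    """Advance one step; after INFORM the cycle wraps back to DETECT (assumed)."""
    members = list(Stage)
    return members[(members.index(stage) + 1) % len(members)]


def one_pass(start: Stage = Stage.DETECT) -> List[Stage]:
    """Return one full pass through every stage, beginning at `start`."""
    ordered = [start]
    while len(ordered) < len(Stage):
        ordered.append(next_stage(ordered[-1]))
    return ordered


print(" -> ".join(stage.name for stage in one_pass()))
```

Starting a pass from a different stage, for example `one_pass(Stage.DECIDE)`, gives the same loop entered part-way through, which is roughly what happens when a day begins with a decision carried over from the day before.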
The case study focuses disproportionately on the intelligence research process, which leads to understanding, and on the project manager’s communications role. There was a lot of analysis work continuing at the same time, but for story-telling reasons that’s been skipped here.
Training in intelligence-led project management
The project manager was experienced at applying intelligence-led project management, but nobody else had been trained. Training was done entirely by precedent and without explaining the theory.
This suggests the technique is easy to learn and intuitive.
Review of the intelligence-led process
How efficiently did the intelligence-led approach work?
With hindsight, the team might have been able to save one week by reviewing the research progress more thoroughly. However, the new launch date could not have been brought forward, because it would have clashed with holiday commitments. In terms of effort, the cost could not have been reduced without slowing the process.
The upside? The system is now well understood, and well documented. It’s been possible to reduce the staffing level for business-as-usual to a very low level. And everyone is proud of what they achieved.