
British Airways IT failure

  • 28-05-2017 2:35pm
    #1
    Registered Users, Registered Users 2 Posts: 1,667 ✭✭✭


    A few billion pounds' worth of aircraft were unable to fly on Saturday and Sunday due to British Airways' computer system failure, causing pain to customers and so on.

    Is it not possible for a large company with mission-critical applications to have two or three sites running their entire system, with cross-server updates taking place in real time, 10 seconds later and, say, 10 minutes later? Ideally one of these should be located on another continent.

    (I hate the airline and haven't used them for 15 years: over-priced, old aircraft, crap food, needless security (or, to be more precise, extra security because GB has so many enemies in the world). However, the issue could impact any airline, without notice.)
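For what it's worth, the setup described in the question is roughly what database people call a real-time replica plus deliberately delayed replicas: every change is shipped to all sites immediately, but some sites apply it only after a lag, so a bad write or corruption that hits the live copy has not yet reached the lagged one. A minimal Python sketch of the idea, with every name and number purely illustrative (nothing here is BA's actual architecture):

```python
import time

# Purely illustrative sketch (all names hypothetical): a primary ships every
# change to three replica sites, and each site applies changes only after its
# configured delay. Real systems get this from the database's own replication
# (e.g. one streaming replica plus deliberately delayed replicas).

class DelayedReplica:
    def __init__(self, name, apply_delay_s):
        self.name = name
        self.apply_delay_s = apply_delay_s
        self.pending = []   # (received_at, change) pairs not yet applied
        self.state = {}     # this site's copy of the data

    def receive(self, change):
        self.pending.append((time.time(), change))

    def apply_due(self):
        # Apply every change whose delay has elapsed, in arrival order.
        now = time.time()
        while self.pending and now - self.pending[0][0] >= self.apply_delay_s:
            _, (key, value) = self.pending.pop(0)
            self.state[key] = value

class Primary:
    def __init__(self, replicas):
        self.state = {}
        self.replicas = replicas

    def write(self, key, value):
        self.state[key] = value
        for replica in self.replicas:   # ship the change to every site at once
            replica.receive((key, value))

sites = [
    DelayedReplica("site-B-realtime", apply_delay_s=0),
    DelayedReplica("site-C-10s", apply_delay_s=10),
    DelayedReplica("site-D-10min", apply_delay_s=600),
]
primary = Primary(sites)
primary.write("booking:BA0123", {"passenger": "A N Other", "status": "CONFIRMED"})
for site in sites:
    site.apply_due()    # only the real-time site has applied the change so far
```

The lagged copies matter for exactly the scenario discussed later in the thread: if a corrupting event reaches the live copy, the 10-minute copy still holds a clean view to fall back to.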


Comments

  • Closed Accounts Posts: 3,378 ✭✭✭CeilingFly


    Always check info before making comments.

    It affected Saturday - very little effect today.

    It was a major power outage.

    Considered to have best airline food these days, fairly modern fleet, security is an airport issue not an airline issue. Airlines must adhere to airport security rules.


  • Registered Users, Registered Users 2 Posts: 5,950 ✭✭✭Charles Babbage


    CeilingFly wrote: »
    Always check info before making comments.

    It affected Saturday - very little effect today.

    I'm watching the BBC news right now, there clearly is a major effect today, so perhaps you should follow your own advice.
    It was a major power outage.

    It is a complete disgrace. As the first post noted, there should have been multiple systems, including sources of power, which would prevent a cock up on this scale.



  • Registered Users, Registered Users 2 Posts: 1,667 ✭✭✭Impetus


    CeilingFly wrote: »
    Always check info before making comments.

    It affected Saturday - very little effect today.

    It was a major power outage.

    Considered to have best airline food these days, fairly modern fleet, security is an airport issue not an airline issue. Airlines must adhere to airport security rules.

    I did check this morning on http://www.flightstats.com/go/FlightStatus/flightStatusByFlight.do and it did show delays this AM. For LHR the story was one of cancelled, delayed and scheduled - but a few hours later the typical 'scheduled' flight was still showing as scheduled, long past its due time.

    The BA CEO said on Twitter later that it was a power supply issue, not a computing issue. As an ex-financial director of a British PLC I know all about the mains power failures in the London area - which challenge the running of computer systems on an ongoing basis if you don't have a generator or similar to back them up. I am in a former British colony at the moment (now a republic), but with a legacy of crappy British wiring and 3-pin sockets, and I have to put up with power failures from time to time, as well as having to carry clumpy 3-pin to European-standard 2-pin adapters around with me.

    The last power failure I had in my apartment in France was 14 years ago, and it was notified to residents a week in advance, i.e. it was planned. They were installing new kit in the building, and said that the power would be off on day x from 09h00.00 to 09h00.30 (i.e. 30 seconds). The break took 29 seconds, while they moved to the new meters.

    Neither French nor German has a word for a 'trade'. Everybody has a profession and is trained to be professional - be they an electrician, plumber, heart surgeon, accountant or IT specialist. Unfortunately not so in the Anglo-Saxon countries - hence the poor state of infrastructure in GB, IRL and the USA - not that IRL is Anglo-Saxon, but it hasn't got over the invasion yet.

    By the way, if you consult flightstats, there are still a huge number of BA delays at LHR. I am a trained professional, and was trained not to make statements without checking facts.


  • Registered Users, Registered Users 2 Posts: 24,482 ✭✭✭✭lawred2


    Any fit-for-purpose IT system should have redundancy and failover capabilities. A power cut at one site should not take down an entire system.

    Desperate stuff.


  • Registered Users, Registered Users 2 Posts: 1,667 ✭✭✭Impetus


    Does anybody have an answer to the root question?

    "Is it not possible for a large company with mission critical applications, to have two or three sites running their entire system with cross server updates taking place in real time, 10 secs later, and say 10 minutes later? Ideally one of these should be located on another continent. "

    (aside from posting ejit responses)


  • Registered Users, Registered Users 2 Posts: 24,482 ✭✭✭✭lawred2


    Impetus wrote: »
    Does anybody have an answer to the root question?

    "Is it not possible for a large company with mission critical applications, to have two or three sites running their entire system with cross server updates taking place in real time, 10 secs later, and say 10 minutes later? Ideally one of these should be located on another continent. "

    (aside from posting ejit responses)

    Especially with cloud computing being so ubiquitous and ready to adopt.. Typical large company intransigence and risk averse culture.


  • Registered Users, Registered Users 2 Posts: 34,132 ✭✭✭✭listermint


    Impetus wrote: »
    I did check this morning on http://www.flightstats.com/go/FlightStatus/flightStatusByFlight.do and it did show delays this AM. For LHR the story was one of cancelled, delayed and scheduled - but a few hours later the typical 'scheduled' flight was still showing as scheduled, long past its due time.

    The BA CEO said on Twitter later that it was a power supply issue, not a computing issue. As an ex-financial director of a British PLC I know all about the mains power failures in the London area - which challenge the running of computer systems on an ongoing basis if you don't have a generator or similar to back them up. I am in a former British colony at the moment (now a republic), but with a legacy of crappy British wiring and 3-pin sockets, and I have to put up with power failures from time to time, as well as having to carry clumpy 3-pin to European-standard 2-pin adapters around with me.

    The last power failure I had in my apartment in France was 14 years ago, and it was notified to residents a week in advance, i.e. it was planned. They were installing new kit in the building, and said that the power would be off on day x from 09h00.00 to 09h00.30 (i.e. 30 seconds). The break took 29 seconds, while they moved to the new meters.

    Neither French nor German has a word for a 'trade'. Everybody has a profession and is trained to be professional - be they an electrician, plumber, heart surgeon, accountant or IT specialist. Unfortunately not so in the Anglo-Saxon countries - hence the poor state of infrastructure in GB, IRL and the USA - not that IRL is Anglo-Saxon, but it hasn't got over the invasion yet.

    By the way, if you consult flightstats, there are still a huge number of BA delays at LHR. I am a trained professional, and was trained not to make statements without checking facts.

    If BA have continual problems with power in London then their stuff should be hosted in data centres with DR and redundancy.

    Working in the SaaS business myself, an outage of this scale in our line of business would be unheard of and simply not tolerated.

    We have to report outages of five minutes to the board of directors and treat them with the utmost importance and severity.

    Companies of the scale of BA don't seem to understand that they are both an airline and an IT company. The technology behind their systems is both their shop window and the backbone of their daily operations. Get serious about it, real serious.


  • Registered Users, Registered Users 2 Posts: 3,739 ✭✭✭BigEejit


    Several people should lose their jobs. Unfortunately they outsourced all their useful IT staff to Tata in India some time ago, and those people will probably get blamed. BA is an old company that had to embrace technology many decades ago, so I would not be surprised if the core of their environment contained old mainframes with little or no capability to run a consistent DR setup that can be spooled up to take the load, or even to work in an active-active mode. One thing is certain: having no IT staff who were well acquainted with the power, network and systems elongated the downtime.

    Apparently a switch (or possibly multiple switches) went down somewhere in their environment:
    (starts about 12m in) http://www.bbc.co.uk/programmes/b08rp2xd

    A: "On Sat morning, We had a power surge in one of our DCs which affected the networking hardware, that stopped messaging -- millions and millions of messages that come between all the different systems and applications within the BA network. It affected ALL the operations systems - baggage, operations, passenger processing, etc. We will make a full investigation... "

    Q: "I'm not an IT expert but I've spoken to a lot of people who are, some of them connected to your company,. and they are staggered, frankly and that's the word I'd use, that there isn't some kind of backup that just kicks in when you have power problems. If there IS a backup system, why didn't it work? Because these are experts - professionals -- they cannot /believe/ you've had a problem going over several *days*."

    A: "Well, the actual problem only lasted a few minutes. So there WAS a power surge, there WAS a backup system, which DID not work, at that particular point in time. It was restored after a few hours in terms of some hardware changes, but eventually it took a long time for messsaging, and for systems to come up again as the operation was picking up again. We will find out exactly WHY the backup systems did not trigger at the right time,and we will make
    sure it doesn't happen again."
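The interviewer's "why didn't the backup just kick in?" question comes down to the failover trigger. As a toy sketch of the usual pattern - heartbeats from the primary, promotion of the standby only after a sustained silence - here is a hedged Python illustration in which every name and threshold is invented rather than taken from anything BA actually runs:

```python
import time

# Toy sketch of an automatic failover trigger (all names and thresholds are
# assumptions): the standby site watches heartbeats from the primary and
# promotes itself only after a sustained silence, so a brief blip does not
# cause an unnecessary switchover.

HEARTBEAT_TIMEOUT_S = 30   # how long the silence must last before failover

class StandbySite:
    def __init__(self):
        self.last_heartbeat = time.time()
        self.active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat message arrives from the primary.
        self.last_heartbeat = time.time()

    def check(self):
        # Run periodically (e.g. every few seconds) by a scheduler.
        silence = time.time() - self.last_heartbeat
        if not self.active and silence > HEARTBEAT_TIMEOUT_S:
            self.promote()

    def promote(self):
        # In reality this is the hard part: redirect traffic, replay the
        # message backlog, and verify the standby's data is consistent.
        print("Primary silent - promoting standby to active")
        self.active = True
```

Detecting the failure is the easy half; as the transcript hints, replaying the message backlog and proving the standby's data is sound is what takes the hours.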


  • Registered Users, Registered Users 2 Posts: 4,704 ✭✭✭Bacchus


    BigEejit wrote: »
    A: "Well, the actual problem only lasted a few minutes. So there WAS a power surge, there WAS a backup system, which DID not work, at that particular point in time. It was restored after a few hours in terms of some hardware changes, but eventually it took a long time for messsaging, and for systems to come up again as the operation was picking up again. We will find out exactly WHY the backup systems did not trigger at the right time,and we will make
    sure it doesn't happen again."

    What a BS answer... Oh, we did have a backup system, it just didn't work at that particular time. It was working all the time when we didn't need it, though... No, you did not have a backup system in place, at least not one that appears to have had any sort of regular testing or maintenance (seeing as hardware had to be replaced!). What a monumental c*ck-up from their IT dept. Failure to plan for a simple power surge :rolleyes:


  • Registered Users, Registered Users 2 Posts: 9,353 ✭✭✭markpb


    Bacchus wrote: »
    What a BS answer... Oh, we did have a backup system, it just didn't work at that particular time. It was working all the time when we didn't need it, though... No, you did not have a backup system in place, at least not one that appears to have had any sort of regular testing or maintenance (seeing as hardware had to be replaced!). What a monumental c*ck-up from their IT dept. Failure to plan for a simple power surge :rolleyes:

    I think people are jumping to conclusions on this one. BA announced early on that it was a power incident so everyone assumed they didn't have UPS or generators. This, unsurprisingly enough, turned out to be false. Of course they had generators! Then BA announced that it was a networking/messaging problem and people are assuming they didn't have a working DR solution. Of course they did.

    The problem wasn't a lack of preparedness or any of the simple things that people here are jumping to. The problem was that the power surge caused a corruption of data which made it difficult to resume service. All the UPS, generators and DR sites in the world won't help you if your data is unavailable. It's very tricky to test a DR plan which covers every possible error scenario. Anyone who tells you otherwise is full of hot air.

    People have a tendency to think that this stuff is simple and that their **** smells of roses but very few systems in the world could survive all the possible things that could go wrong. The guys in BA will have to take a long hard look at what happened and how to stop it happening again. If we in IT were as professional as the aviation industry, the exact detail of what happened would be published so other companies could learn from it and avoid it too. We're not, of course.
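markpb's point about corruption is the crux: failover only helps if the copy you fail over to can be trusted. One common mitigation is to store a checksum alongside every record and verify it before promoting a replica. A minimal, purely hypothetical Python illustration (the record layout and keys are invented, nothing BA-specific):

```python
import hashlib
import json

# Hypothetical illustration: before promoting a replica, verify each record
# against the checksum stored when it was written. A power event that corrupts
# records (or the messages that built them) shows up as a mismatch, and the
# replica should not be promoted blindly.

def checksum(record: dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def verify_replica(records: dict, checksums: dict) -> list:
    """Return the keys whose data no longer matches its stored checksum."""
    return [key for key, rec in records.items()
            if checksums.get(key) != checksum(rec)]

# Example: one clean record and one that was silently corrupted.
records = {
    "booking:1": {"passenger": "A N Other", "status": "CONFIRMED"},
    "booking:2": {"passenger": "B Example", "status": "CONF1RMED"},  # corrupted
}
checksums = {
    "booking:1": checksum({"passenger": "A N Other", "status": "CONFIRMED"}),
    "booking:2": checksum({"passenger": "B Example", "status": "CONFIRMED"}),
}
print(verify_replica(records, checksums))   # -> ['booking:2']
```

Anything flagged here has to be repaired or restored from an older copy before the replica is safe to serve from - which is one plausible reason a "few minutes" power event can turn into a multi-day recovery.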


  • Registered Users, Registered Users 2 Posts: 1,667 ✭✭✭Impetus


    Michael O'Leary was in New York today, and was interviewed for about 10 minutes on CNBC TV. He said that they replicate their systems over five sites in various places. Most of the interview was taken up with criticism of Brexit and BA's shooting themselves in the foot over the weekend. Apparently it cost BA GBP 82 million in ticket revenues and there is a tail cost of an estimated GBP 150 million in PAX claims for hotels, meals, baggage delays, breach of contract - who knows what else. He repeated that Britain could be without air services to/from the EU if the legal arrangements for inter-state flying do not continue under the Brexit deal. Of course he would say that.

    Some media are reporting that BA moved some of their IT functions to India and fired several hundred IT staff in GB. The inference (or so the media suggest) is that the IT staff who remained in GB may have had a hand in the power failure, to show the company what damage they can do, presumably in case any more terminations of IT staff are on the cards. It appears that they did have one other backup site, but that backup failed too, which seems rather strange (at the same time as the power failure in site A).


  • Registered Users, Registered Users 2 Posts: 3,739 ✭✭✭BigEejit


    No properly set up large enterprise data center should lose multiple items of network infrastructure due to a 5-minute power issue.

    http://www.bbc.co.uk/news/business-35662763
    British Airways and Iberia's parent company IAG reported a 64% rise in yearly pre-tax profits to €1.8bn (£1.4bn), helped in part by lower fuel prices.

    Massively profitable company offshores IT jobs to India and has had three outages since.
    https://www.theregister.co.uk/2017/04/11/british_airways_website_down/
    https://www.theregister.co.uk/2016/09/06/ba_check_in_outage/


  • Registered Users, Registered Users 2 Posts: 4,704 ✭✭✭Bacchus


    markpb wrote: »
    I think people are jumping to conclusions on this one. BA announced early on that it was a power incident so everyone assumed they didn't have UPS or generators. This, unsurprisingly enough, turned out to be false. Of course they had generators! Then BA announced that it was a networking/messaging problem and people are assuming they didn't have a working DR solution. Of course they did.

    The problem wasn't a lack of preparedness or any of the simple things that people here are jumping to. The problem was that the power surge caused a corruption of data which made it difficult to resume service. All the UPS, generators and DR sites in the world won't help you if your data is unavailable. It's very tricky to test a DR plan which covers every possible error scenario. Anyone who tells you otherwise is full of hot air.

    People have a tendency to think that this stuff is simple and that their **** smells of roses but very few systems in the world could survive all the possible things that could go wrong. The guys in BA will have to take a long hard look at what happened and how to stop it happening again. If we in IT were as professional as the aviation industry, the exact detail of what happened would be published so other companies could learn from it and avoid it too. We're not, of course.

    Saying that anyone who disagrees with you is "full of hot air" doesn't make you any less wrong.

    How can you say it wasn't a lack of preparedness when clearly they were not prepared for a brief power surge? You even outline how they repeatedly mis-diagnosed the issue! In fact, this story of a power surge is looking more and more suspect as everyone realizes how ridiculous it would be for a power surge to do this much damage.

    Also, you think having no data backup or redundancy mechanisms is acceptable for a company as large as BA? We're talking about basic data mirroring here, not to mention how critical their systems are. Sure, it's a costly system to set up and maintain, but this is one of the largest airlines in the world, not an SME.

    Where are you getting this info anyway that it was a data corruption issue? You accuse others of jumping to conclusions, but there is no conclusive evidence that points to THE cause of the outage, nor why it took so long to get back up and running. If there is, please do link it here.


  • Moderators, Regional Midwest Moderators Posts: 11,157 Mod ✭✭✭✭MarkR


    Whatever backup system they had doesn't seem to be fit for purpose. Their redundancy was also ineffective. I don't think there's any defence really. It's not like the Data Centre was hit by a nuke. Some hardware failed, and took down everything. That shouldn't be possible.


  • Registered Users, Registered Users 2 Posts: 46 nate.drake


    Impetus wrote: »
    Does anybody have an answer to the root question?

    "Is it not possible for a large company with mission critical applications, to have two or three sites running their entire system with cross server updates taking place in real time, 10 secs later, and say 10 minutes later? Ideally one of these should be located on another continent. "

    (aside from posting ejit responses)

    The short answer is, "Yes, in theory". In principle BA could migrate all their servers to a cloud cluster spanning a dozen servers across the world and duplicate data across each and every one, along with custom client software that would automatically connect to another instance if one goes down.

    I imagine it's the initial cost of reworking their systems, and of maintaining said cluster afterwards, that keeps them from doing this. Far cheaper just to pay a few mill in compensation every now and then when the power goes down. Don't forget it's we who would end up paying, as they would increase the price of airline tickets to cover everything! :)

    Edit: One of my treasured associates has said this would also be a great usage scenario for Ethereum - storing the booking application in the blockchain means it's independent of any one server. I doubt they'd go for it though, too new. :)
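The "custom client software which automatically connects to another instance" part is plain client-side failover. A short hedged sketch in Python - the endpoints and the /bookings/ path are invented purely for illustration, not a real BA API:

```python
import urllib.error
import urllib.request

# Illustrative client-side failover (all endpoints invented): try each replica
# of the booking service in turn and return the first successful response, so
# losing one site costs latency rather than availability.

ENDPOINTS = [
    "https://bookings-eu.example.invalid",
    "https://bookings-us.example.invalid",
    "https://bookings-apac.example.invalid",
]

def fetch_booking(reference: str, timeout_s: float = 3.0) -> bytes:
    last_error = None
    for base in ENDPOINTS:
        url = f"{base}/bookings/{reference}"
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc   # this site is unreachable; try the next one
    raise RuntimeError(f"all booking endpoints failed: {last_error}")
```

In practice you would add health checks, retries with backoff and some session stickiness, but the shape is the same; the hard part, as the rest of the thread shows, is keeping the data behind those endpoints consistent.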


  • Registered Users, Registered Users 2 Posts: 1,193 ✭✭✭liamo


    From today's Indo
    A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems
    ...
    the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

    Oops!!


  • Registered Users, Registered Users 2 Posts: 4,704 ✭✭✭Bacchus


    liamo wrote: »
    From today's Indo



    Oops!!

    And the UPS units didn't kick in because...? Still a lot of questions for BA to answer. You can't just pin it on human error; you should be accounting for human error in your infrastructure.

    As much of an "epic fail" as that is for the contractor, how did he/she have such easy access to such a critical switch that it could be flipped accidentally? (That Father Ted episode on the plane springs to mind.)


  • Registered Users, Registered Users 2 Posts: 1,835 ✭✭✭NickNickleby


    IT guy: "So on Saturday, your electricians arrive on site to carry out essential maintenance work on our power".

    Electrical supervisor : "Correct"

    IT Guy :" and this will not impact on the actual supply during that period"

    Electrical Supervisor : " that is correct, your UPS and Generator will protect you".

    IT Guy : "Fine, see you Saturday"

    Roll on Saturday...

    IT guy : " so at 09:00 you will commence your work, you will isolate the mains supply , do your work and restore everything?"

    Electrician on site : "that's right, no interruption".

    IT guy : " ok, so just run through it with me before we start"

    Electrician : " we flip the mains, balh blah blah..."

    IT guy : "hang on! isn't true that by flipping THAT switch AND that switch, you're removing both the UPS and the mains? "

    Electrician : "yes that's right, but your generator comes in, in less than 5 seconds, you won't even notice"

    IT Guy :"Goodbye"

    As ridiculous as it sounds, that is a true story. I cannot recall the exact configuration, but it was foolproof. Unless a fool came along.


  • Registered Users, Registered Users 2 Posts: 3,739 ✭✭✭BigEejit


    Many years ago, in a small DC in Cork, we had scheduled work on the 'A' PDU, UPS and backup generator. We were told that the power would likely be on and off a bit on that PDU over the weekend, so we were told to disconnect all the A power leads on Friday evening.

    Saturday morning, a half-asleep electrician turns up and powers off B.

    That was my first time having to recover ESXi and then 30-40 VMs that were all in complete siht.


  • Registered Users, Registered Users 2 Posts: 2,080 ✭✭✭ItHurtsWhenIP


    TL;DR - If you don't spec and test your alternate power supplies properly BEFORE you have an issue ... you'll have an issue. :D

    Waaaay back in the day (1990), we got a (supposedly) kick-ass UPS to protect our AS/400 (a Model B30 as I recall, with 2 x 1.6m racks) from short outages. At the time there seemed to be a lot of electrical storms, and the slightest flicker would cause a 70-minute IPL, as opposed to the normal 20-minute IPL.

    First electrical storm and it did its job ... brilliant says we. Happy out. Oh, we also had a diesel genny for longer outages.

    Roll on 1991 and the ESB unions wanted gold in their wages or something :rolleyes:. Cue rolling outages and we thought ... sure we'll be grand ... UPS will hold until the genny cuts over in 5-10 seconds.

    Power goes, UPS holds, no sign of the Genny spinning up, suddenly the UPS decides to go into bypass mode for no good reason and drops the AS/400, which goes down hard.

    UPS comes out of bypass and, as the AS/400 is configured to reboot on power restore, it starts its IPL ... which goes fine until the DASD units in the second rack spin up, the UPS has a sh1t attack and switches to bypass and ... you guessed it ... drops the AS/400, which goes down hard again.

    UPS comes out of bypass and, as the AS/400 ... you get the picture ... I had to yank the mains supply to the primary rack to stop its reboot loop.

    Power comes back a few hours later and she starts up slowly.

    UPS engineer out - couldn't figure out what could possibly have caused the behaviour I described. So, after hours, we rebooted the AS/400 (while there was mains). We watched as the second rack of DASD units spun up and saw why the UPS switched to bypass: the load spiked to nearly 20 amps, while the UPS was only specced to take 16 amps (her runtime load was about 9 amps). So we shut down again and the UPS was upgraded to 24 amps (a quick sanity check of those numbers is sketched at the end of this post).

    Genny engineer also dragged out; he tests and fixes the issues around the failure to start and take over from the mains.

    Next day's outage comes around. Power goes, UPS holds, Genny spins up, switchover occurs, the genny surge nukes the UPS. AS/400 goes down hard ... again.

    UPS engineer called - takes a look - you never told us you had a genny. :mad: Fits a surge protector or whatever was needed to protect the UPS.

    At this stage the boss said we'd had enough downtime and while the AS/400 was off, fired up the genny manually and had it provide power until the strikes were called off.

    After that we never had a power cut long enough to make the genny kick in, so we never found out whether we'd fixed the problem or not. :rolleyes:
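The numbers in that story make a tidy worked example of the TL;DR: the UPS held a roughly 9 amp steady load but tripped to bypass when the DASD spin-up pushed the draw to nearly 20 amps, beyond its 16 amp rating. A trivial Python check using just the figures quoted above:

```python
# Back-of-envelope check of the AS/400 story: size the UPS for the worst-case
# inrush (all DASD units spinning up at once), not the steady-state load.

def ups_adequate(rated_amps: float, steady_amps: float, inrush_amps: float) -> bool:
    """Adequate only if the rating covers the peak draw, not just the day-to-day load."""
    return rated_amps >= max(steady_amps, inrush_amps)

print(ups_adequate(rated_amps=16, steady_amps=9, inrush_amps=20))  # False - trips to bypass
print(ups_adequate(rated_amps=24, steady_amps=9, inrush_amps=20))  # True - the upgrade
```

Sizing (and testing) against the worst-case inrush rather than the day-to-day load is exactly the "spec and test BEFORE you have an issue" lesson.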

