
This is wildly disingenuous. I speak as a flight instructor and major IT incident investigator. Modern software authors have the professional discipline of a cute puppy in comparison to aviation practitioners.
I agree with Chris. This is the kind of thinking that leads to "Why can't we just have building codes for software? It worked to protect against earthquakes and fire!" Earthquakes and fire aren't conscious adversaries. Try writing a standards document on how to win at chess.
257 replies and sub-replies as of Aug 10 2018

Airplanes *are* under constant attack by gravity, weather, system complexity, human factors, delivery process, training consistency, traffic congestion, and even under attack now by software developers
But every time a disaster happens, we learn from it, publicly, and we share. We're still learning from crashes decades ago. Software developers? Bullshit.
Software is authored by an organization, including programmers, architects, technical writers, quality assurance, UX, and most importantly, management. They're all authors with a responsibility to implement and improve standards of professional discipline.
The PMP credential exists for a real reason, because even management can grasp the value of a shared body of knowledge to use in the construction and improvement of workflows and processes. Don't blame "management," they're trying.
Here's an excellent example - 35 years ago, an airline bought their first metric airliner, and management cancelled the project to update all the ground paperwork to metric. The plane ran out of gas and the engines shut down in the air. 200-page report: data2.collectionscanada.gc.ca/e/e444/e011083…
Where's the detailed 200-page public report from Facebook on how their management failed to prevent major disinformation campaigns in the US election? There isn't one, because they're just not that mature.
And "try to write a standards document on how to win at chess" give me a break my dude there is one and it is software and it works and you know that.
Sadly, @ErrataRob is making the same mistakes here. The objective in aviation is "Safe Transportation" and not "Preventing accidents" - a subtle wording difference, but an entirely different mindset at a much higher level.
That XKCD on voting machine software is wrong
The latest XKCD comic on voting machine software is wrong, profoundly so. It's the sort of thing that appeals to our prejudices, but mistake...
blog.erratasec.com
Similarly, the objective in elections is "Confidence in democracy" and not "stopping attackers," which the CSE clearly lays out as one of many fronts: cse-cst.gc.ca/sites/default/…
Simplistic focus on the machine and loss of perspective on the bigger system & society is the hubris that keeps the technology industry trapped in the footgun cycle.
I mean "Airplanes and elevators are designed to avoid accidental failures" come on have you never heard of fail-safe design? Elevators and planes fail *all the time* but they fail SAFE.
Since this is picking up steam, I want to be clear that it's not "engineering standards" or "way more money" that gives the aviation industry the edge -- it's the constant, daily, global, organized and disciplined continuous improvement.
But the medical community is learning now, and Microsoft even brought in a surgeon to lecture them on lessons the medical community has learned from the aviation industry:
The Checklist Manifesto
We live in a world of great and increasing complexity, where even the most expert professionals struggle to master the tasks they face. Longer training, more...
youtube.com
Surgeons didn't want to use checklists because they were too full of themselves, but then accidental deaths fell by 30-50% in hospitals that adopted them. Know who else often suffers from the same hubris? Programmers.
So many programmers are feeling defensive because they just think I'm talking about bugs. I'm not. There were no "bugs" exploited in the theft of Podesta's emails, there were no "bugs" exploited in the 2016 Facebook disinformation campaign.
Google and Facebook need to get absolutely spanked around because they keep pretending that they are software companies when they are not. They're platforms, environments, ecosystems, societies, whatever.
I know @zeynep has been talking about this stuff for ages -- so long as these companies think that their product is software, and don't get held accountable by society, humanity will increasingly suffer.
More amplification of smarter voices than mine: "How Complex Systems Fail," Cognitive Technologies Laboratory, web.mit.edu/2.75/resources…
STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity: snafucatchers.github.io
Root Causes don't reflect a technical understanding of the nature of failure:
Friends, never forget that “post accident attribution to a ‘root cause’ is fundamentally wrong” @ri_cook web.mit.edu/2.75/resources…
If you wonder if gov departments can make progress on reducing vulnerability and threats to their constituents, have a jealous read of the UK's @NCSC Active Cyber Defence report, one year in, by Dr. Ian Levy. Practical. Measured. Informative.
You'll want to learn more about Therac-25; here's a good start:
I wrote about software safety practices, for people starting a programming career.
jorendorff/talks
Some talks I've given
github.com
IT Practitioners can't even start to make things better unless they start with a baseline of psychological safety: usenix.org/system/files/l…
Don't just learn from your own mistakes, learn from *every* industry that has to manage complexity:
We will "just" experience unforseen consequences.
So, I uttered the words DO-178C and Esterel/ANSYS SCADE in reply to Stamos; what do you think about those two? Personally, I think they missed the point: the comparison being made is not between election systems and avionics, but between deliberate attacks on elections and on avionics.
I have no problems with your points, really - but it's worth pointing out that "software" is an incredibly wide industry. Not saying aviation is simple or a narrow field, but every player has the same interest. In software, you could be making Facebook, control software for (1/x)
surgical robots, traffic lights, a guest book for your personal home page or the means to control a space station. It's just a tool to accomplish something within a "real" domain, so to speak. And many of these disciplines are just not as mature as the aviation industry, (2/x)
...where everyone pulls in the same direction. And then, for orgs with smaller budgets, the expectations are insanely high even for short term, "cheap" projects. Everyone's colored by how Google and Facebook works, and if their software is in any way worse, (3/x)
it's not good enough. Even though their budget is tiny in comparison. However, with all the open source tooling, all the conferences that are out there etc, I would indeed say the software industry is interested in learning from its own mistakes. (4/x)
It's just an insane amount of interacting parties, and very few standards bodies in comparison. There are some, and many things are indeed standardised, but probably not even close to how the aviation industry is regulated. Sadly. Let's give software twenty more years and see 🙃
Agreed on many points. Software written in mature industries is mature, software written in the software industry is not.
Another problem in the "software industry", I guess, is that many companies that really want good software hire consultants on short term project basis. They come, stay for two years, and leave. A year later, they hire new consultants to upgrade/fix/whatever and then leave.
Instead of hiring their own people permanently, which would probably be cheaper, give better software and a more stable delivery rate, and be a better way to keep knowledge in-house. At least here in Norway this is a very clear trend. Exceptions exist, but for many this is it.
They're governments of societies. Digital societies where subjects have no vote and are essentially serfs.
I'd rather spank around the many, at all levels, who still insist on regulating, breaking, nationalizing, rebuilding... whatever those same platforms, instead of making them obsolete by realistic, but truly decentralized personal clouds. E.g. this, mfioretti.com/2018/02/calicu…
My first reaction was also: Well managing an airplane and surgery are repeatable (whilst complex) operations. Constructing a software system for a domain/use case for which there is no precedence (framework, library) has much more variability.
But I am wondering myself now whether that is actually true. A lot of software engineering efforts focus on making knowledge on a particular domain reusable despite the variability. Would that maybe be a good starting point? Making attention to critical aspects "reusable"?
Maybe if you work in a place that hasn't adopted CI and static analysis tools (are there any of those left?). Automated checklists.
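As a minimal sketch of what an "automated checklist" can look like (the tools named below are only examples I've chosen, not anything this thread prescribes), a pre-merge gate can simply run every check and refuse to pass if any item fails:

```python
#!/usr/bin/env python3
"""A toy pre-merge "automated checklist": run each check, report which items failed."""
import subprocess
import sys

# Hypothetical checklist; substitute whatever linter, type checker, and test runner your project uses.
CHECKS = [
    ("static analysis", ["ruff", "check", "."]),
    ("type check", ["mypy", "src"]),
    ("unit tests", ["pytest", "-q"]),
]


def main() -> int:
    failures = []
    for name, cmd in CHECKS:
        print(f"==> {name}: {' '.join(cmd)}")
        try:
            if subprocess.run(cmd).returncode != 0:
                failures.append(name)
        except FileNotFoundError:
            failures.append(f"{name} (tool not installed)")
    if failures:
        print("Checklist failed: " + ", ".join(failures))
        return 1
    print("All checklist items passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Wired into CI, the checklist runs on every change instead of relying on anyone's memory, which is the whole point of the aviation comparison.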
Likewise: Doctors didn’t want to use flow approaches because “you can’t reduce our work to a factory”
We know about washing our hands now too, yes. Ha
I have seen software devs be highly resistant to having even the most rudimentary operational review requirements for new services. Simple checklists of things like "does it have alerts", "does it log", etc.
Well, one reason is that sometimes those checklists trump and overtake the actual functionality. Then the project gets software that does things right without doing the right things.
Proper observability of production software is basic functionality.
Just those two phrases are way too scary. ☑️Produces actionable alerts according to recommendation XYZ... ☑️Sends structured event logs while masking sensitive data. Good checklists take work.
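For what a checklist item like "sends structured event logs while masking sensitive data" might translate to in practice, here is a minimal sketch using Python's standard logging module; the logger name, field names, and the set of fields treated as sensitive are all invented for illustration, not taken from the thread.

```python
import json
import logging

# Hypothetical list of sensitive fields; a real policy would come from data classification rules.
SENSITIVE_KEYS = {"password", "ssn", "credit_card", "email"}


class StructuredFormatter(logging.Formatter):
    """Emit one JSON object per log record, masking fields flagged as sensitive."""

    def format(self, record: logging.LogRecord) -> str:
        event = {"level": record.levelname, "logger": record.name, "message": record.getMessage()}
        # Structured fields are passed via `extra={"event_fields": {...}}` and land on the record.
        for key, value in getattr(record, "event_fields", {}).items():
            event[key] = "***MASKED***" if key in SENSITIVE_KEYS else value
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Usage: sensitive fields are masked before they ever reach the log sink.
log.info("charge processed",
         extra={"event_fields": {"order_id": 1234, "credit_card": "4111-1111-1111-1111"}})
```

The tweet's caveat still holds: deciding which fields are sensitive and which alerts are actionable is the hard part; the code is the easy part.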
Aviation is the safest way to travel thanks to checklists.
What I frequently call "culture", or "engineering culture"—am I wrong?
Some culture is bad, some is good. Not all culture has constant, daily, global, organized and disciplined continuous improvement.
Right. And I think current bubble economics tends to amplify the problems you've mentioned.
How does the organised continuous improvement happen? Enforced at a professional or organisational level? Just culture?
Money is not sufficient, but is necessary. The Shuttle group had the culture described, IIRC there were 17 documented defects in 400k LOC over ~30 years, and each one prompted new learning. Perceived risk/benefit for most software is far less. Outlay (effort) correspondingly less
Which is only an explanation. Because while all planes have to fail safe, it truly isn't justified to spend that kind of effort on Angry Birds. So the question is how to better judge value and risk, so that truly important things can be resourced to support the culture you describe.
The mantra of Silicon Valley is "move quickly and break things". That's all you need to know. Collateral damage is other people's concern, not theirs.
#aviation does well because it embeds #SystemsEngineering & #SystemSafetyEngineering in the design of #aircraft, aircraft systems and operations. Don't get me wrong, #aviation is not perfect, but it's light years ahead of other industries, especially IT, which is simply not as mature.
I have yet to see credible analysis showing electronic voting offers the same level of security as paper ballots. It's an extra risk, and for what? Saving a few hours of counting ballots? Just because a problem can be addressed with software doesn't mean it should be.
Hmm, I believe civil aviation devices (planes :-) are not well protected against attacks. Like from rockets (omg, happens more often than I thought: en.m.wikipedia.org/wiki/List_of_a…), bombs and malicious insiders (MH370?, GWI18G). I think that is what @ErrataRob is talking about.
What he's talking about is that he thinks software programmers can be trusted with the management of democracy. That was the joke in the comic.
=8-o I need to read the article.
Ok. I reread the article. I still don't think it imposes "trust the software developers". It imposes failsafes which work despite the software.
I studied software engineering as a degree. One point that was made by a professor stuck with me. Software engineers are NOT engineers. Engineers build structures that don’t fall down: bridges, buildings... The IT industry writes software which breaks ALL the time.
Really??? * Pedestrian footpath collapse in Florida * Hyatt Regency walkway collapse * Tacoma Narrows Bridge Should I continue? Henry Petroski's "To Engineer Is Human: The Role of Failure in Successful Design" is a good read regarding this.
can always count on his sixhead to float up my timeline being obtuse
Dear IT folks, please read @www_ora_tion_ca's thread. Think on it. Read it again. ↑
Please don’t put me in the position of having to defend an XKCD comic. That’s a bridge too fucking far.
FWIW the Stockfish chess engine code annotates relevant routines with expected ELO gains. IIRC the engine goes through a suite of test positions for hours against other engines, stats are taken, it's not guesswork. Some teachable stuff in there IMO.
Major difference here - government body vs private company. There is no upside for Facebook to spend the time or effort on this or to release it. Inside the government and some large companies massive reports do exist, they just aren’t often published.
Additionally while I agree with your general thought - often catastrophic things have to occur to spark this level of insight (Eg planes falling out of the sky and people dying). It’s only recently begun happening in software. There is simply less history of truly bad failures
My argument would be that the regulators are part of the sector, and the lack of regulators are a major contributor to the lack of maturity in software creation.
Agree on that front. The question is what parts need to be regulated and at what level? Lots of software in the world, not much of it is actually critical /important.
The rules in Canada are simple: Engineering is regulated and requires licencing. Programming is not regulated and can be done with just a CS degree. That's enough of a start, the remaining details come naturally with continuous improvement feedbackery loops over decades.
CS degree optional
Contractors follow rules too when building a building. Degree or not. Apprenticeships work for unions, and they work for CS degrees in Germany.
That not every programmer who has passed through formal training has heard of the Therac-25 is a goddamn indictment of our profession.
There's no legal obligation for Facebook to prevent "disinformation campaigns", and not even consensus that there's an ethical obligation, given the free speech issues involved. Get that consensus first, and then you can talk about professional discipline.
In early days, web software was cheap infrastructure for nonessential tasks. “Don’t worry. It’s just a chat client.” But we got tempted to use the same designs for critical infrastructure. The industry needs to mature and do the unglamorous work of fixing that mistake.
We overpromised on the security and safety of software. We continue to overpromise. The industry is still focused on “disruption” and “innovation”. We throw a fit and make excuses whenever anyone asks about safety, security, or trust. We’re not young anymore. We have to grow up.
That they had a glider pilot in the cockpit was a GD miracle
Not a miracle: it was a government initiative to give hundreds of kids taxpayer-subsidized scholarships every year to fly gliders, to make sure there was a recruitment pipeline for military and industry. This is a significant factor in why western countries have far safer airlines.
i.e. Bob Pearson and Sully Sullenberger and Chris Hadfield all had taxpayer-funded glider training as kids. Investment pays off.
Presumably the Gimli Glider. Thorough.
Without looking at the link - the Gimli Glider?
And customers accept ridiculous liability statements. How did we get here?
Too much control of underlying net infrastructure given to private actors who value their control over public scrutiny
OSS software is often authored by a handful of individuals left alone to support tens or hundreds of thousands of users (or more) with little to no resources. I *wish* I had tech writers, QA, UX, and project management. Please tell me where/how to get them with my zero budget.
I will not defend all s/w devs, but a lot of times it's their mgmt that is the problem. The devs (some of them) *want* to do the right thing in the right way.
Aviation has known for decades that management kills people, so they fixed management. Your argument is invalid.
no, you support it. I blame mgmt for a lot of our life's ills.
Management is a major part of the software development process, and a major part of the industry. If you think programming alone is what results in software.... Well.
You're misreading where I'm coming from. I initially felt you were giving undue emphasis to the devs themselves, rather than the larger, more inclusive, picture. I've worked several roles within both telephony/internet systems and medical industry. I've seen sausage being made.
This is why I find Air Crash Investigation so reassuring. Whenever something goes wrong, there's so much effort put into working out what it was and what to do to prevent it from happening again. Everyone thinks I'm weird.
All the best people are weird
Don't forget snakes.
Yeah, you don’t get it, though. A lot of users actively don’t want security in their software. What would aviation look like if 30% of air passengers actively wanted the plane to crash?
What would it look like if the government wanted there to be a vulnerability in every airplane so they could crash planes on demand? If users sought other forms of influence to mandate air travel un-safety?
Hardware itself defeats security. What would aviation look like if Boeing built every aircraft with a button on each passenger’s seat that would let them take control of the cockpit?
Tip from someone who "gets it": There *are* government-mandated vulnerabilities in airliners so that governments can crash them.
Like users of airplanes don't actively not want security on the airplanes? I see tons of people complaining about and/or ignoring security measures all the time. Until we are at freaking NASA moon landing software standards, we're nowhere near other industry standards.
Um: Neil Armstrong switched that computer off and landed manually because he did not want to die.
1202 Data overflow alarm at mission time 102:38:26, 1201 alarm at 102:42:19, Manual control due to computer navigational error established at 102:43:15 -- hq.nasa.gov/alsj/a11/a11.l…
Sure, but things like that never happen in aviation? I thought it was about higher standards, not about perfect systems, because those don’t exist. You know Armstrong did it because it’s public. There is probably a ton of research documentation on every trip into space.
It's not about how high the standards are, it's about applying momentum to continuous improvement of those standards. Any standard can be improved, and it's that improvement cycle that's missing/immature in the software industry.
Fact: Eagle was off course because of navigational computer computation error, headed for a crater wall. Fact: Neil Armstrong landed Eagle manually. Fact: The 1201 and 1202 alarms were unrelated to the above, but added to the confusion at the time.
Ars is disputing that the 1201 and 1202 alarms almost caused an abort, which is a fair argument to make, but unrelated to the fact that the computer was about to fly Eagle sideways into rock.
Do you have good pointers on the “navigational computer computation error”?
Why mention 1201/1202 then? It especially added confusion to your tweet (at least it confused me :)
Because space jargon is FUN
I think you're making Rob's point quite nicely: users don't want "no security". They want to get the(ir) job done, they want convenience, they need that attachment, etc. Air passengers don't want crashes, they want less hassle at security checks or smoke a cig onboard.
We're talking about a comic about voting machine software, not everyone's daily software to "get their job done" and fart around with attachments.
OK. I think my point still holds, but the examples change. Users want to vote quickly and get the job of managing an election done easily. There may be a higher risk of deliberate attacks if they're easy and people think they can get away with it (e.g. voting twice).
The attachment example also underlines the "no culture of learning from incidents" point -- remember the "iloveyou" virus? Nothing has been learned WRT executing active content from mail. That's why more than 10 years later, ransomware-by-mail attacks were so easily carried out.
It's not that users don't want security, it's that they have been trained by horrendous IT to be used to exceptionally low levels of security. Like LinkedIn asking for your GMail password to get to your contacts. That should get people fired.
NASA & JPL do have software engineers who can write low defect software. Doing so is slow and expensive, and may also require clear requirements. Most end user software is written like toddlers build towers, piling stuff up and hoping for the best.
Eh, not toddlers building towers, more like Lego (but not nec. the Technic kind) - I mean it works, and there can be some good principles in it, but rarely is it truly robust especially in new environs.
Aviation Eng and Builders are solving one set of well-defined problems: keep a plane flying et al, keep a building standing (OK, dramatic oversimplification). Software is asked to solve a HUUGE variety of problems (and cheaply! w/ stakes usually low). Oh, and new UI plz.
You're looking directly at programmers -- but software authorship is done at a management level, defining requirements and such. Of course software programming sucks when software management sucks -- my comments are at the authorship level.
Programmers write code, they do not make software. It takes a hell of a lot more than programming to make a software or a service, but programmers still think management is an unmanageable boogeyman.
I can see why you got into aviation... your horse is so high you need a whirlybird to reach the saddle.
The whole "mgmt writes specs & coders are just modern day scribes" ethos you're flaunting here is how software was done 50 yrs ago.
As @kirkjerk was saying, modern software is everywhere, solving every problem. Doesn't make sense to apply the same methodology to every project. Many software shops wouldn't be able to exist if they followed waterfall or somesuch.
Which is to say, this whole "you ain't writing software" gatekeeping act you're putting on here is nonsense. Some software shops are completely flat, no mgmt, no ivory tower handing down algorithms and flow charts... and some of those are making 8+ figures of revenue.
yes yes and your value to society and humanity is defined by your revenue right sure
Humanity, huh... so is that an integration test I need to add to my CI/CD pipeline? Can't ship to production unless a child achieves enlightenment by my algorithms?
And here I thought you couldn't get any more sanctimonious. Boy, have I got some egg on my face!
Seriously, software is just a fucking job. I write it to get a paycheck. If my software makes enough money to sustain the business, what more could I ask for? Nothing. That's why I mention revenue. That's really all that matters here. The rest of this diatribe is faff.
A job and a profession are two completely different things. It's nice you have a job, I'm glad you enjoy it, I hope your doctor is a professional.
You push paper around a desk and punch keys on a keyboard. You don't have a life in your hands. (The lives that use your product are in the hands of thousands) If your head got any bigger it'd start affecting the tides.
We're talking about a comic about voting machine software, not punching keyboards and pushing paper.
That might be what you're thinking, but it sure ain't how you're coming off.
Agree. From personal experience, after moving to aviation from "normal" software industry, it takes a bit of time to appreciate the role of proper management and process. But when the product needs to be in service for 35 years, the "startup fever" is not the right mindset.
This isn't a dichotomy, nor is it even a spectrum. There are many dimensions to our work products. "35 years" is a useless figure to someone who works on HFT algos or med imaging.
I'm glad that you have changed the pitch from "this is how it was 50 yrs ago". Yes, there are many dimensions, so please allow others do the safety critical stuff the way it needs to be done. Have fun with HFT, but beware med imaging - some of it might need certification.
And shortly after, somewhere else I found this (fresh and close to "med imaging"): bbc.co.uk/news/av/techno… You don't want it. And making billions in revenue on software that is selling ads gives no credibility in *this* area. Same for autonomous cars, etc.
Hack attack can stop people's hearts
Researchers disclose an unfixed vulnerability that threatens medical devices.
bbc.co.uk
as a modern software engineer, I agree. The few times we let software control really important stuff like airplanes, the rules are completely different.
Conversely, having just gotten to peer into the cockpit of the B757 we were supposed to take but was grounded, and beheld the finest software 1990 had to offer (as well as the iPads suction cupped to the windows to compensate): there are downsides to excessive conservatism.
I’ve worked both in aerospace and the software tech industry and I completely agree with this.
The cowboys who built Facebook and Google are now billionaires. The software industry doesn't reward careful engineering right now... it rewards shiny new bells and whistles. I do a lot of programming, and there's always huge pressure to build more features in less time.
If my boss on the ship I work on accidentally dumped oil overboard, he could: lose his engineer's license, get tossed in jail, or be required to pay a huge fine.
More like a kitten on super catnip laced with cocaine
"All of my previous work was JavaScript but I had a revelation while drunk and high last night and learned Rust and we're coding your key components in it now wheeeeeee I'm still high wheeeee" (Not *quite* a literal quote. But far far too close...)
There is a field that deals with safety critical software engineering, but I'd bet an index finger 90% of professional software engineers couldn't describe any of those principles or practices. The knowledge exists to enable us to build robust software systems; we choose not to.
Yes, modern software discipline is a lot worse than aviation discipline. But the attacks are also more powerful. Weather, etc. are like the network connection going out. Software should be able to survive that, and often doesn't. But an *attack* is like an AA gun.
Almost none of the things on your list of things attacking airplanes actually *want* the plane to fail, they just make it hard to succeed. With software, you often *do* have a malicious attack, which is harder to deal with.
You deal with it the same way you deal with everything else: disciplined continuous improvement, which the software industry sucks at compared to the aviation industry.
Yes, it does suck. But there's still a material difference. If something can go wrong with a flight one time in a hundred billion by chance, there haven't been enough flights to notice. If there's a five-byte input that causes your program to crash, an attacker will find it.
Okay so why would you want a voting machine like that
I didn't say I did; I was just pointing out that it's not *all* bad discipline. (Although I think with enough care (and time and money), we could have secure, safe voting machines. We're not there yet and a lot of smart people disagree it's even in-principle possible, though.)
There is also a lot of work to make electronic systems safer. Solvers to prove algorithms will work the same way. Fuzzing to find those five byte inputs. There is plenty of work and education left to go into incremental improvement before “trust” isn’t a question for software.
Electronic voting systems also don’t just have electronics problems. You can make your box “unhackable” and then the shipping company sends the wrong one that has a nicer PCB. Change in process of large systems requires methodical introspection.
The point of the five byte figure is it's a nice round number greater than 100 billion, the number I used for airplane problems. I should have taken into account that running the software is cheaper and more common than flying; the point is "wouldn't happen accidentally".
Some fuzzers (AFL, for example) are smart enough to try to find inputs that trigger weird behavior, but if I had said "one kilobyte" instead of "five bytes", the point would hold but fuzzing might not find it.
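A toy illustration of that point (my own sketch, not anything from the thread): a specific five-byte crashing input is effectively invisible to blind random testing, but a fuzzer-style search that gets coverage-like feedback walks straight to it. Real fuzzers such as AFL or libFuzzer derive that feedback from instrumented branch coverage; the `coverage` helper below simply fakes that signal, and the magic bytes are invented.

```python
import random

MAGIC = b"\x13\x37\xBE\xEF\x42"  # hypothetical 5-byte input that crashes the program


def target(data: bytes) -> None:
    """Toy program under test: crashes only on one specific 5-byte input."""
    if data == MAGIC:
        raise RuntimeError("crash")


def coverage(data: bytes) -> int:
    """Stand-in for coverage feedback: how many bytes lie on the crashing path.

    Real fuzzers get this from instrumented branch coverage, not by peeking at MAGIC.
    """
    return sum(1 for a, b in zip(data, MAGIC) if a == b)


# Blind random testing: ~256**5 (about 1.1 trillion) candidate inputs, so chance won't find it.
print(f"Random search space: {256**5:,} inputs")

# Coverage-guided search: mutate one byte at a time, keep mutations that increase coverage.
best = bytearray(5)
tries = 0
while bytes(best) != MAGIC:
    candidate = bytearray(best)
    candidate[random.randrange(5)] = random.randrange(256)
    tries += 1
    if coverage(bytes(candidate)) > coverage(bytes(best)):
        best = candidate
print(f"Coverage-guided search found the crashing input in {tries:,} tries")

try:
    target(bytes(best))
except RuntimeError:
    print("...and it does crash the target.")
```

The asymmetry is the whole point: "it would never happen by accident" says nothing about whether an adversary with feedback will find it.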
An attack against software looks like what happened to MH17. I don't think the airline industry is building defenses for civilian airliners. Bad weather is a router reboot.
You could get software regulated to the level of airlines if you could convince people to pay the same kind of pricing as for aircraft, and do similar maintenance. There is a small limited market for this.
Sometimes software on my *phone*, which should be used to a flaky network, misbehaves on router reboot. Software quality could be a lot higher than it is, even without attackers. But I got that app for free, and nobody's giving out free airplanes.
If a civilian airplane is perfectly safe unless 40mm, 800g balls of metal hits it a couple times a second at faster than the speed of sound, nobody will buy it because it's too heavy and they want something lighter even if it is less safe.
So how many lines of code have come off of your fingers and into a product someone loved?
Quibble: cute puppies can be housebroken in less than a year.
Indeed. Speaking as a software engineer who was on a National Academies panel on software dependability, I totally agree. Yes, software is hard. But Facebook didn't use known best practices like social threat modeling. Shockingly unprofessional. cc @digitalsista
Speaking as a software developer, tutor, and code reviewer: the software development industry does *not* have any kind of widespread concept of 'professional ethics'. That's where the problem starts. There's absolutely no common sense of responsibility.
This creates a lot of compounding issues: 1) Issues are treated as 'solved, never think about again'. 2) Because most developers are negligent, clients expect no expenses on security; which makes ethical developers non-competitive. 3) And so on.
Which unfortunately means that this is something that can't be solved on an individual level; it's a systemic issue and there are no incentives for individual developers to change it.
Great thread, and something I've been going on about for ages, but it's not disingenuous, it's cultural. Software as a field is so new and subjected to so much light and heat that the practitioners don't know it could be better.
There are other structural problems with bringing standards and good practice to software, the wild growth of the field combined with the lack of centralization or structure to the field, comparatively.
It would be as if most people who got into aviation did so by building a plane at home first. But despite this I do think software liability would go a long way to starting to get the incentives in line with creating good standards similar to medicine and aviation.
There’s also an issue of scale: if planes failed the way software does, the next time a single Boeing 777 crashes, every other Boeing 777 in the air would also lose power.
There are various places where the metaphor fails, but also doesn't. In a way that's exactly what we saw with WannaCry -- over and over again.
and honestly, i think the reason wannacry gets to keep exploiting SMB v1 and Boeings don't get to keep crashing is that we can take pictures of one, and see it clearly, and treat it as an event mentally.
There’s also the monoculture vs diversity aspect though: each airplane has its own set of pilots double-checking each other. Software is like a single godlike pilot flying all the world’s planes at the same time, but she sometimes gets sleepy.
i think that's a misfitting metaphor. pilots might match better with sysadmins/ops, who aren't ime the biggest fans of programmer bullshit.
Today a key subsystem failed because the third-party datasource API on which it depended simply disappeared, deactivated and delisted without warning. Whose fault was it—theirs for going down or ours for relying on it? The turtles are all wobbly, all the way down.
it literally doesn't matter. fault is the wrong frame, as the original thread made clear: responsibility is the right frame. Who deals with it and what recourse do they have against others who may have failed their own responsibilities in the chain.
Attribution is always a political choice, right now that choice is to attribute in a way that protects vendors from any liability or responsibility for what their software does in the world.
The fault was with the management that decided years ago not to care
no one in the chain cares. i know we love to beat up on the suits, but this is a toxic culture, and has been as long as i've been in it. the privilege to treat quality as esoteric and impossible is of the same piece as why it's so dominated by whiteness, sexism, etc.
bullshit is bullshit, and you see it expressed in myriad ways.
sorry i'm so ranty this has been many many years of my life and i'm so over it.
Anyway, this is the most recent piece I've written about it: emptywheel.net/2017/09/14/sof…
Software is not authored by a programmer, it takes a village. The problems need management to be fixed first.
dev culture is a lot of the problem, and management alone can't fix that. the place it needs to get fixed first, or can be, is software liability. honestly, it's going to be the goddamned insurance companies if it's anyone.
Maybe this says more about your career and the places you work as "incident investigator" than it says about the software engineering field?
You *can* have reliable software, and it can be developed in a professional and deeply risk averse manner. It costs about 100x as much and takes 10x as long.
But that doesn’t make for a great ivory tower monologue
Are you also taking into account the cost of having the US Elections hacked by Russia? Externalities are frequently ignored, but no less costly.
That's the cost of designing systems that will fail safely in the event of a hack. It might be more expensive than that, even, so for now the gold standard will continue to be paper.
It's not wildly disingenuous, it's completely oblivious and uninformed. There's no need to even discuss it.
Robustness (robustness against random errors) is much, much easier to handle than security (robustness against targeted errors). Imagine that everyone could change gravity for each individual atom all over the world, then design an airplane for that.
You might think so, but it's just a matter of properly designing for the threat model and failure modes. Hiring a Red Team to attack the system at various stages of the process (including initial design documents) is one way to develop an appropriate threat model.
I had a reality check when I read somewhere that there was no way NASA would be interested in most tech companies, as they need things to work with little room for error, and could do without side challenges, like npm package management or github merge conflicts 😂
Many topics covered in this thread and incredulity is expected. However, as someone who has been “bridging” Systems Safety/Human Factors and modern software engineering for the past ~8 years, I can confidently say: it’s not as simple as you make it out.
Yes, the longevity and maturity of “software engineering” is part of this. Yes, differences in codes of ethics (possibly licensing, but I’m unsure of that) and ‘professionalism’ between aviation and software likely matter. But: this is a simplistic comparison between fields.
Regulation plays a significant role here. Some positive in some directions (independent investigation organizations, for example) and some negative in other directions (reports that list bullshit such as “pilot error” or “loss of situational awareness” as causes).
*All* software has potential for unintended consequences, regardless of the domain. Airplanes, cars, social media, email...all of it. Those unintended consequences manifest sometimes as ‘vulnerabilities’ exploited by adversaries or bugs resulting in unavailability or...
...or software that works exactly as the developer intended but used differently by users, or many others. The same is true for other domains. Comparisons like these are not even apples and oranges, they’re apples and doorknobs.
It's not simple to successfully accomplish. It's simple to start trying.
Trying should include a) resisting the temptation to compare incompatible domains, and b) oversimplifying complex adaptive systems in 280 characters. :)
Many major diffs, inc. ICAO and major accidents
I’d agree that he is drastically understating the threat that the information security community faces, but that makes it even more concerning that he is bang on the mark about the degree of seriousness that the problem is approached with.
And this is mostly because information security tends to be critically under-resourced compared to the level of risk.
And that is in large part because their employers aren’t being held to account for their negligence, and legislators aren’t being held to account for their failure to engage with the world we have been actually living in for the last thirty years...
In particular the Equifax and Facebook incidents stand out as opportunities to send a message, that were dramatically overlooked.
One thing I think the industry needs to do is get a lot more forthright about the damage being caused by these incidents, rather than the “no personal details were leaked” bullshit that often gets dragged out.
And the public, governments and (I’m guessing the financial sector) need to get far more confrontational about the way risk is being externalised to them.
"Degree of seriousness" is just another form of "lack of airmanship" found in aviation accident reports. If there are no breaches or they are decreasing, does that indicate a proper degree of seriousness? (no)
Calls for "better" standards of practice is a common reaction to all consequential accidents. It's easy to do, adds very little to dialogue about future improvement, and ignores the real "messy details" of actual work.
Having operated nuclear submarines for many years and more recently computer services, I get excited every time this debate comes up. I authored a few RCAs in the Navy, and read thousands of others. Almost all were attributed to "Human Error."
When I switched from submarines to software services, the differences were puzzling: why were there no operating procedures, periodic maintenance schedules, incident procedures, on-call rotations, checklists, standing orders, hydrostatic tests, incident drills, etc.?
...some of this has changed over the last 15 years. A wiki serves as an evolving operating/incident/maintenance procedure on some teams. On-call rotations are ubiquitous. And yet these debates are often, to quote Crash Davis in Bull Durham, like a "martian talking to a fungo".
I've puzzled over these differences and have followed the work of Drs Allspaw and Cook with great interest as they create a new field. Yet the debate still explodes occasionally with a practitioner of traditional accident investigation saying, "do the RCAs, human error is real...
...MTTR, MTTB, MTTD, MTTC." My experience is that the traditional methodologies worked very well on submarines, but were much less successful in software services. For a while I attributed this to immature culture, lack of leadership, lack of accountability.
This thread already mentions two other differences, regulation and the maturity of "the field." My sense is there are three attributes of "traditional operations" that makes "traditional Problem Management" (RCA, Post Mortems, Continuous Improvement) work:
1. stability of architecture 2. horizontal scaling 3. absence of Moore's Law (related to #2)
An operator from WWII would be quite familiar with the "architecture" of a submarine: engine room in the back, control room, sonar, torpedo room, ballast tanks. While the flat-screen displays would surprise, we have replicated the intuitive utility of the dial gauge on those.
Any software service that scales, by necessity, changes architecture. One could argue that at some super high-level there is just a "front end" and "back end", but under the hood there will be entirely new components every few years that dramatically change how operators interact
2. horizontal scaling - sure, everyone is going to say, "of course I horizontally scale my service," but let's compare the count of submarines, airplanes or automobiles to web search services or crypto-currencies. Replication of the platform creates a fleet of identical machines,
...each with a different Ops team. This creates a large 'n' for a central office to collect and compare incidents, thereby refining and converging procedures and designs. Read the introductory chapters of ITIL.
The beginning of the submarine reactor plant manual stated that "everything you will ever need to do to this machine is documented here. If you think you need to do something that is not documented, read it again. If you still don't find it, surface the ship and radio home"
In years of operations, I never encountered an exception. Name a software service wiki for which that is true. The scaling and rapid iteration of architecture is enabled in services by 3. Moore's Law.
If submarines were more like software services, you'd have to imagine a single submarine in the fleet that held a billion torpedoes, traveled at half the speed of light and fit in the palm of your hand.
The operators of that machine would likely be challenged to use traditional methods of accident investigation. I am excited how the new field generates methodologies that can be retrofitted back onto the traditional Ops disciplines. That has often happened in history. (EOM)
Human error is a constant of the universe and never a root cause. Root cause is always process, management, culture, etc.
The rule of thumb I like is: you’ve got the root cause if you have a (simple/understood/doable) course of action which clearly fixes the problem. Otherwise you’re still working with symptoms and your investigation is incomplete.
Nope. There is no single root cause of complex systems failure. It doesn't exist, it's not a thing, and it's not only a waste of time trying to find it, it's dangerous to assert confidence with. That this concept continues to survive is why we will continue to have accidents.
True that there is never only one root cause if you look hard enough, though I always have to keep investigations from stopping until we've found at least one that meets the definition.
Root Cause: The most basic cause (or causes) of an incident that management has control to fix (i.e. a process/procedure that is Missing, Incomplete or Not followed) and, when fixed, will prevent (or significantly reduce the likelihood of) additional problems of the same type.
(The actual definition is longer, I shortened it for the character limitation)
My strong suggestion is to read the multiple sources of research on how these (and other) definitions are critically problematic. You could start here: kitchensoap.com/2012/02/10/eac… or cut to the chase and read Dekker's "Field Guide To Understanding Human Error" 3rd edition.
Kitchen Soap – Each necessary, but only jointly sufficient
I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity c
kitchensoap.com
Was waiting for you to spot this.
spot-on thread, Rob. (I'm a professional software developer).
Cost of developing aviation software vs cost of developing mundane business software is very different, because risks are very different. Market forces clearly say that non-safety IT accidents are acceptable.
This is a very interesting thread, but I have to disagree on the simple matter of scope. The standard for SSL and its implementers do follow an open process very much like what you're describing.
But to do so for all of software engineering would be like the aviation industry setting standards for everything from a tricycle to an attack helicopter. There's just too much variation in the types of software being produced for a single set of best practices to work.
That's why most companies do it internally and have their own best practices to prevent past mistakes. But in a capitalist economy, they'll consider those trade secrets as long as they can (<- one place I totally agree with you; major failure analysis should be public).
If you tried to create a "programming checklist" even for a smaller-scope, such as writing netcode, there's still so much variation and innovation that half the items would be checked off as N/A. That's not a useful checklist.
A former boss once told me that he wished everyone did their job like a fighter pilot -- because if a fighter pilot @#$%s up, they *die.*
Safety and complexity related to software operating machinery in the physical world are very different to safety and complexity operating in a human design problem space. A lot of abstract thought is needed to have meaningful discussions on risk.
The way I see it, most software doesn't actually need to be that error-free. Where we've recognized that it does, it actually is (airplanes, etc). The issue is that we haven't recognized that some categories, including voting software (!), need to be error-free.
Airplane software is *not* error-free. I've had flight computers crash on me mid-flight several times. But, I'm alive and well -- because the software is designed to be a safely fallible component.
I was using the term error-free holistically, as in "software on planes never crashes so badly as to crash the plane" which isn't true of, say, many word processors. (If there's a better word that captures that concept, lmk so I can improve my communication)
The chess document's been written. The real issue is that none of the standards stakeholders can agree on or want a decent standard. Try writing a standard when the earthquakes and meteors from outer space are the standards authors.