
This is wildly disingenuous; I speak as a flight instructor and major IT incident investigator. Modern software authors have the professional discipline of a cute puppy in comparison to aviation practitioners.
I agree with Chris. This is the kind of thinking that leads to "Why can't we just have building codes for software? It worked to protect against earthquakes and fire!" Earthquakes and fire aren't conscious adversaries. Try writing a standards document on how to win at chess.

Airplanes *are* under constant attack by gravity, weather, system complexity, human factors, delivery process, training consistency, traffic congestion, and even under attack now by software developers
But every time a disaster happens, we learn from it, publicly, and we share. We're still learning from crashes decades ago. Software developers? Bullshit.
Software is authored by an organization, including programmers, architects, technical writers, quality assurance, UX, and most importantly, management. They're all authors with a responsibility to implement and improve standards of professional discipline.
The PMP credential exists for a real reason, because even management can grasp the value of a shared body of knowledge to use in the construction and improvement of workflows and processes. Don't blame "management," they're trying.
Here's an excellent example - 35 years ago, an airline bought their first metric airliner, management cancelled the project to update all ground paperwork from metric. Plane ran out of gas and engines shut down in the air. 200 page report: data2.collectionscanada.gc.ca/e/e444/e011083…
Where's the detailed 200-page public report from Facebook on how their management failed to prevent major disinformation campaigns in the US election? There isn't one, because they're just not that mature.
And "try to write a standards document on how to win at chess" give me a break my dude there is one and it is software and it works and you know that.
Sadly, @ErrataRob is making the same mistakes here. The objective in aviation is "Safe Transportation" and not "Preventing accidents" - a subtle wording difference, but an entirely different mindset at a much higher level.
That XKCD on voting machine software is wrong
The latest XKCD comic on voting machine software is wrong, profoundly so. It's the sort of thing that appeals to our prejudices, but mistake...
blog.erratasec.com
Similarly, the objective in elections is "Confidence in democracy" and not "stopping attackers," which the CSE clearly lays out as one of many fronts: cse-cst.gc.ca/sites/default/…
Simplistic focus on the machine and loss of perspective on the bigger system & society is the hubris that keeps the technology industry trapped in the footgun cycle.
I mean "Airplanes and elevators are designed to avoid accidental failures" come on have you never heard of fail-safe design? Elevators and planes fail *all the time* but they fail SAFE.
Since this is picking up steam, I want to be clear that it's not "engineering standards" or "way more money" that gives the aviation industry the edge -- it's the constant, daily, global, organized and disciplined continuous improvement.
But the medical community is learning now, and Microsoft even brought in a surgeon to lecture them on lessons the medical community has learned from the aviation industry:
The Checklist Manifesto
We live in a world of great and increasing complexity, where even the most expert professionals struggle to master the tasks they face. Longer training, more...
youtube.com
Surgeons didn't want to use checklists because they were too full of themselves, but then accidental deaths fell by 30-50% in hospitals that adopted them. Know who else often suffers from the same hubris? Programmers.
So many programmers are feeling defensive because they just think I'm talking about bugs. I'm not. There were no "bugs" exploited in the theft of Podesta's emails, there were no "bugs" exploited in the 2016 Facebook disinformation campaign.
Google and Facebook need to get absolutely spanked around because they keep pretending that they are software companies when they are not. They're platforms, environments, ecosystems, societies, whatever.
I know @zeynep has been talking about this stuff for ages -- so long as these companies think that their product is software, and don't get held accountable by society, humanity will increasingly suffer.
More amplification of smarter voices than mine: "How Complex Systems Fail," Cognitive Technologies Laboratory, web.mit.edu/2.75/resources…
STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity: snafucatchers.github.io
Root Causes don't reflect a technical understanding of the nature of failure:
Friends, never forget that “post accident attribution to a ‘root cause’ is fundamentally wrong” @ri_cook web.mit.edu/2.75/resources…
UK's @NCSC Active Cyber Defence report, one year in, by Dr. Ian Levy
If you wonder if gov departments can make progress on reducing vulnerability and threats to their constituents, have a jealous read of the UK's @NCSC Active Cyber Defence report, one year in, by Dr. Ian Levy. Practical. Measured. Informative.
You'll want to learn more about THERAC-25, here's a good start:
I wrote about software safety practices, for people starting a programming career.
jorendorff/talks
Some talks I've given
github.com
IT Practitioners can't even start to make things better unless they start with a baseline of psychological safety: usenix.org/system/files/l…
Don't just learn from your own mistakes, learn from *every* industry that has to manage complexity:
For the record: Checklists are not the solution, they are a first baby step towards maturity. The aviation industry has already moved well beyond just checklists to a full SMS model. en.wikipedia.org/wiki/Safety_ma…
The responsibility for fixing the software industry is 100% with software industry leadership, and not with the grunt programming labour told to do whatever management wants.
But it‘s also the other way round. Management thinks the programmers cannot think, cannot decide. As always the truth is in the middle. Managers and Developers share the responsibility for success in the end.
Programmers can't even organize themselves well enough to get offices with doors, how can we expect programmers to effect moral change?
The railway engineers who worked out the timetables for Sobibór and Bełżec likely had a good explanation for why it was better to do the loathsome job competently than leave it to the less skilled and virtuous.
So many programmers with fake twitter names trying to explain to me how planes work right now. [zero empathy expected from you, ladies, feel free to chuckle]
For folks late to this thread, this is about **voting machine software**, and how the software industry (Google and Facebook, specifically) hasn't earned the trust required to manage democracy securely.
Two heavyweights from Google (Chrome "engineer") and Facebook's former CISO saw a cartoon and attacked the cartoonist:
Voting Software
xkcd.com
They called the cartoonist a "non-practitioner" (though he had programming experience at NASA), and a nihilist, and belittled the relative maturity of risk management in the aviation industry (they appear to have been "non-practitioners" of aviation).
There's nothing that the software industry can gain by belittling the life-safety industries that they can learn so much from, just because a comic strip hurt their feelings.
Thank you for facilitating this discussion. I am learning a lot....my own profession of Medicine still has a way to go.
How many THOUSAND people got out of the twin towers due to the fire safety rating that kept them standing nearly an hour in the worst case, despite an attack way beyond what was anticipated/imaginable when they were designed? NOW let’s talk life- and society-critical software
They got out safely only because of lessons learned from the previous terrorist attacks on those two towers specifically
Procedures implemented long after the building was constructed. The original procedures would have left thousands more dead.
This is an example proving my point - what's needed is just the Deming wheel, everything else follows
Also because of fire safety standards during design & construction. Standards that slowed spread of the fire, delayed the collapse
Prior to the 1990s bombings at the WTC, the building's procedures explicitly stated that the towers should *not* be evacuated. My argument in this thread is that systems maturity in complex environments comes from industry-wide continuous improvement processes.
The fire safety standards during design and construction are absolutely a product of industry-wide continuous improvement processes -- tailored to the needs of the industry based on lessons learned and public sharing of knowledge.
I don't want to say that "standards" are the solution to the things that are wrong in the software industry, because that's way too specific. Standards might be part of what gets implemented, but it'll be a waste of time until they have industry-wide continuous improvement
There were plenty of standards and regulations in place that didn't prevent the Grenfell tower disaster. I think that incident is a powerful example for the need to firmly entrench industry-wide continuous improvement with Just Culture and SMS.
There are many ways to share information between competitors to create a greater good. Patents, with lags and leakage issues. Guild-type codes. Employee turnover⇒info sharing. See Elinor Ostrom’s Nobel work on “the commons.” And Standards, enforced by an external regulator.
Where do you think standards come from?
#actually bit of a funny story there: the WTC towers had waivers for NYC’s fire code, which were granted over the objections of the FDNY commissioner at the time. “The Fires” by Joe Flood, goes into this in some detail.
The miracle of the 9/11 collapse is that it didn’t kill _more_ people.
Almost as if “Move fast and break things” was A Thing before Facebook. Thanks for the point, showing you can’t look at “standards,” “protection” or “regulations” in isolation from the political process that enforces them. That they evolve in a confrontational dialectic
Just wait until they have to deal with the equivalent of mud dauber wasps in the pitot tubes …
If we'd adopted "building codes" for software and were working on the next level of prevention, people would understand. Arguing "building codes won't stop nuclear strikes" then letting homes routinely fall over in rainstorms, isn't going to cut it.
You're working very hard to miss the point.
I don't want to say that "standards" are the solution to the things that are wrong in the software industry, because that's way too specific. Standards might be part of what gets implemented, but it'll be a waste of time until they have industry-wide continuous improvement
Maybe I'm making a different one? We're on the same page with the need to improve. The question is what you mean by standards. We can think of them as both things we do to ensure a good result (processes and rules we follow). I think that's what you meant by "standards" right?
I think you're trying to sound smart without first reading the things I wrote.
If that's what you think, I'll leave you to it.
But standards also play a role in enabling continuous improvement. Without a definition of what's to be produced (a standard definition, or requirement) there is no way to measure the quality of what you produce and whether you're getting better or not.
And it's worse than that. As long as I'm making up my own definitions as I do the work, I'm never wrong and there is no need to improve. This is where we find the software industry: we say they need to improve, and their answer is, "Improve what? My stuff is great!"
So yes, we can simply say "continuously improve", but there is a fundamental cognitive issue that keeps that from happening: they see the software they wrote yesterday as "good" and failures in the newspaper as "outside hackers", and as long as they do your words can't be heard.
That makes me…laugh? Sigh? I worked on B-1Bs starting in 1987 or so. We had one crash and people die because of a fucking pelican hitting the wing pivot and severing all four hydraulic lines. Because what were the odds? As it turned out…
When I saw this cartoon, I printed it out, replaced "blockchain" with "Rust" (because I had a reputation for advocating for it), and put it on my cubicle wall.
Reading this back then, I concluded we are living through the 747 moment of IT: the future will not be higher, faster, better, but more reliable and failing safely. idlewords.com/talks/web_desi…
'splanation solidarity high five
Rob - this is a FANTASTIC thread. Thanks for focusing on SAFETY, Organizational Authorship and lack of professional maturity in software. There's a lot of wisdom in THE CHECKLIST MANIFESTO. I've been banging this drum for a decade.
Actually the best engineering teams in history have never had offices with doors. Witness NASA from nearly day 1 and countless similar examples. The science behind this is overwhelming.
Sorry for being flippant about the door comment before the thread blew up - my core audience would recognize that as a long-running joke from @Pinboard and @TechSolidarity
The best engineering teams succeed despite offices with no doors. The open plan office productivity myth has been debunked multiple times. Here’s a recent study: rstb.royalsocietypublishing.org/content/373/17…
@statvfs Oh there is SO much wrong with this study I don't know where to begin...
@statvfs I guess it's worth saying that blindly opening up all office spaces with no clue on how to do it would def yield the results in this "study".
OK I was deeply into you and your thread and THEN I saw this comment and I died. Bravo.
No, I disagree. People are able to think, to speak up, to demand change. Nobody gets off the hook. Everybody has the power and the responsibility to do something to make life better!
Much value in this thread. But! Take grunt out of the vocabulary.
Not just the word (obviously) but all it signifies; the attitude that leaks through in that word undermines the thread
If non-licenced developers stop calling themselves engineers, I'll stop calling them grunts. I think that's a reasonable compromise.
Because calling yourself an engineer when you don't have an engineering licence is dangerous.
Nah, strawman. It is an argument that has a place, but not as a defense for calling _anyone_ a grunt. And no industry gets better if,
I'm regularly told that calling any codemonkey an Engineer is part of the software industry culture in the US. I'm happy to set that straw man on fire.
After three years you get "Senior Engineer"
Oh gee guys, you’re so right. My bad. Let’s fix this by calling people grunts and monkeys. Sure. I’ll just shut up now. NOT!!!!
Digging in doesn’t make it right.
If much of the problem is hubris, then much of the solution is humility.
You don’t induce humility by demeaning; you do model (lead by example) humility by recognizing when you made a mistake.
This is a thread about a cartoon on a bird website. This is exactly where I aim to make the most mistakes.
I can see that ;) Seriously, it’s a great thread — but it is undermined at the point where, not only do you erroneously attribute all
responsibility to (senior) management, you also demean.
Our industry has strengths — even if it can be hard to disentangle the strengths from the weaknesses. And one of our strengths is how
often we will speak our truth to power :)
Not often enough, by any means. But still.
Super interesting stuff. Too interesting for short-form, almost. There are certainly a lot of practices and mentalities, especially around dealing with errors and crises and the question of consequences, that software can learn from earlier critical infrastructure.
Thankfully, checklists (e.g. MITRE ATT&CK and PCI DSS) are beginning to catch on. As you say, far from complete solutions, but if your org works through diligently and in good faith (ha!), you'll be in a better position than orgs that don't.
Thx for that, I heard of the CSR report but couldn't put my finger on it
I can't recommend Trevor Kletz's "What Went Wrong" highly enough on this. There's even a chapter on problems that computers introduce to chemical engineering!
Really interesting thread!
that accident and the amazing Leveson report detailing the failures changed me. i’ve been annoyed at my own profession ever since, and consider the title “software engineer” [i wore it on and off at various int’l orgs] a howler.
I think most 'hacking' is social hacking, not code.
Not all failures are accidents, though. Some are repeatable, and happen over and over—those frequently have smaller sets of causes, or even a single root cause.
I interpret the cited Short Treatise as saying this: Thinking in terms of *root cause* doesn't reflect a technical understanding of the nature of failure. Thinking in terms of *root causes* might help, though.
While we're at it, how is it that we're both in Toronto but we've never met?
A good root cause analysis would include factors similar to the ones named. Any modern iso management system is supposed to consider environments and roles and responsibilities- all the way to the C suite.
A good fishbone diagram may end up with dozens of factors and relations.
We will "just" experience unforseen consequences.
So, I uttered the words DO-178C and Esterel/ANSYS SCADE in reply to Stamos; what do you think about those two? Personally, I think they missed the point: the comparison being made is not between election systems and avionics, but deliberate attacks on elections and on avionics.
I have no problems with your points, really - but it's worth pointing out that "software" is an incredibly wide industry. Not saying aviation is simple or a narrow field, but every player has the same interest. In software, you could be making Facebook, control software for (1/x)
surgical robots, traffic lights, a guest book for your personal home page or the means to control a space station. It's just a tool to accomplish something within a "real" domain, so to speak. And many of these disciplines are just not as mature as the aviation industry, (2/x)
...where everyone pulls in the same direction. And then, for orgs with smaller budgets, the expectations are insanely high even for short term, "cheap" projects. Everyone's colored by how Google and Facebook work, and if their software is in any way worse, (3/x)
it's not good enough. Even though their budget is tiny in comparison. However, with all the open source tooling, all the conferences that are out there etc, I would indeed say the software industry is interested in learning from its own mistakes. (4/x)
It's just an insane amount of interacting parties, and very few standards bodies in comparison. There are some, and many things are indeed standardised, but probably not even close to how the aviation industry is regulated. Sadly. Let's give software twenty more years and see 🙃
Agreed on many points. Software written in mature industries is mature, software written in the software industry is not.
Another problem in the "software industry", I guess, is that many companies that really want good software hire consultants on short term project basis. They come, stay for two years, and leave. A year later, they hire new consultants to upgrade/fix/whatever and then leave.
Instead of hiring their own people permanently, which would probably be cheaper, produce better software, give a more stable delivery rate, and keep knowledge in-house. At least here in Norway this is a very clear trend. Exceptions exist, but for many this is it.
They're governments of societies. Digital societies where subjects have no vote and are essentially serfs.
I'd rather spank around the many, at all levels, who still insist on regulating, breaking, nationalizing, rebuilding... whatever those same platforms, instead of making them obsolete by realistic, but truly decentralized personal clouds. E.g. this, mfioretti.com/2018/02/calicu…
they are a bunch of buffoons "running code" (scripted) on someone else's tech. environment would be appropriate. true programmers write real apps and applications and work on the console.
My first reaction was also: Well managing an airplane and surgery are repeatable (whilst complex) operations. Constructing a software system for a domain/use case for which there is no precedence (framework, library) has much more variability.
But I am wondering myself now whether that is actually true. A lot of software engineering efforts focus on making knowledge on a particular domain reusable despite the variability. Would that maybe be a good starting point? Making attention to critical aspects "reusable"?
Are a tech company's business practices in the realm of engineering here? Software engineering is hella immature but these examples seem like a stretch
All of the flaws exploited by the Russians were executive leadership flaws. To build an airliner, the whole company needs a license, not just the engineers.
Some airlines have shit business practices; are those under the remit of engineering? Facebook is both Boeing and *example airline X* in this case, so the distinction between engineering, operator training (aircrew vs ... devOps/SRE?) and business decision makers?
Engineering, as other disciplines, is under the remit of the management environment of their respective industries. These aren't engineering challenges, they are industry challenges.
Maybe if you work in a place that hasn't adopted CI and static analysis tools (are there any of those left)? Automated checklists.
Likewise: Doctors didn’t want to use flow approaches because “you can’t reduce our work to a factory”
We know about washing our hands now too, yes. Ha
I have seen software devs be highly resistant to having even the most rudimentary operational review requirements for new services. Simple checklists of things like "does it have alerts", "does it log", etc.
Well, one reason is that those checklists can trump and overtake the actual functionality. Then the project gets software that does things right without doing the right things.
Proper observability of production software is basic functionality.
Just those two phrases are way too scary. ☑️Produces actionable alerts according to recommendation XYZ... ☑️Sends structured event logs while masking sensitive data. Good checklists take work.
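For illustration only, a minimal sketch of what an automated operational-readiness gate of the kind discussed above could look like in Python. The manifest fields and check names here are hypothetical, not any particular org's real checklist.

```python
# Minimal sketch of an automated operational-readiness checklist.
# The manifest fields and check names are hypothetical, for illustration only.

SERVICE_MANIFEST = {
    "name": "example-service",
    "has_alerts": True,
    "emits_structured_logs": True,
    "masks_sensitive_fields": False,
    "runbook_url": None,
}

CHECKS = [
    ("produces actionable alerts", lambda m: m.get("has_alerts", False)),
    ("sends structured event logs", lambda m: m.get("emits_structured_logs", False)),
    ("masks sensitive data in logs", lambda m: m.get("masks_sensitive_fields", False)),
    ("has an on-call runbook", lambda m: bool(m.get("runbook_url"))),
]

def run_checklist(manifest):
    """Return a list of (check name, passed) results for the manifest."""
    return [(name, bool(check(manifest))) for name, check in CHECKS]

if __name__ == "__main__":
    failures = [name for name, ok in run_checklist(SERVICE_MANIFEST) if not ok]
    for name in failures:
        print(f"FAIL: {name}")
    # A CI gate could refuse to ship while any check fails.
    raise SystemExit(1 if failures else 0)
```

The hard part, as the tweet says, is deciding what belongs in the manifest and what "actionable" or "structured" actually mean; the gate itself is trivial.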
Aviation is the safest way to travel thanks to checklists.
And yet programmers who work for companies like Intuitive Surgical and Mazor Robotics ARE in fact writing program code for robotic surgery that is BETTER than the human surgeons.
And management got those companies licensed and certified in their industry. That's software written in the medical industry, not in the software industry where management has no regulation.
Ok...perhaps starting to follow you a bit more. When you say "software industry" you should clarify the boundaries. Software is rarely "stand alone" and is embedded in almost every other industry (as a core or side effect)...
I'm pretty sure that my thread, as a whole, makes it quite clear what I'm talking about.
For folks late to this thread, this is about **voting machine software**, and how the software industry (Google and Facebook, specifically) hasn't earned the trust required to manage democracy securely.
Lots of stuff on the interwebs. Sometimes I don't read every single post. But enjoyed your perspective even if not aligned on all points. Good dialog tho!
What I frequently call "culture", or "engineering culture"—am I wrong?
Some culture is bad, some is good. Not all culture has constant, daily, global, organized and disciplined continuous improvement.
Right. And I think current bubble economics tends to amplify the problems you've mentioned.
How does the organised continuous improvement happen? Enforced at a professional or organisational level? Just culture?
It's called not killing people, and every time you do, the NTSB (or other nation's equivalent) are all over you and make sure everyone in your industry learns from your fuck up
Money is not sufficient, but is necessary. The Shuttle group had the culture described, IIRC there were 17 documented defects in 400k LOC over ~30 years, and each one prompted new learning. Perceived risk/benefit for most software is far less. Outlay (effort) correspondingly less
Which is only an explanation. Because while all planes have to fail safe, it truly isn't justified to spend that kind of effort on Angry Birds. So the question is how to better judge value and risk, so that truly important things can be resourced to support the culture you describe
The mantra of Silicon Valley is "move fast and break things". That's all you need to know. Collateral damage is other people's concern, not theirs.
#aviation does well because it embeds #SystemsEngineering & #SystemSafetyEngineering in the design of #aircraft, aircraft systems and operations. Dont get me wrong, #aviation is not perfect, but its light years ahead of other industries, especially IT who are simply not as mature
I have yet to see credible analysis showing electronic voting offers the same level of security as paper ballots. It's an extra risk, and for what? Saving a few hours of counting ballots? Just because a problem can be addressed with software doesn't mean it should be.
Hmm, I believe the civil aviation devices (planes :-) are not well protected against attacks. Like from rockets (omg, happens more often than I thought en.m.wikipedia.org/wiki/List_of_a…), bombs and malicious insiders (MH370?, GWI18G). I think that is what @ErrataRob is talking about
What he's talking about is that he thinks software programmers can be trusted with the management of democracy. That was the joke in the comic.
=8-o I need to read the article.
Ok. I reread the article. I still don't think it imposes "trust the software developers". It imposes failsafes which work despite the software
I studied software engineering as a degree. One point that was made by a professor stuck with me. Software engineers are NOT engineers. Engineers build structures that don’t fall down: bridges, buildings... The IT industry writes software which breaks ALL the time.
Really??? * Pedestrian footpath collapses in Florida * Hyatt Regency walkway collapse * Tacoma Narrows Bridge Should I continue? Henry Petroski's "To Engineer Is Human: The Role of Failure in Successful Design" is a good read regarding this
Yes really. You just proved my point in that those failures are heavily investigated and are well known because they are rare. However software failures are way more common and not investigated. Are you trying to counter the principles I’m putting forward with specific examples?
"Engineers build structures that don’t fall down"??? The fall down. Even after 10k years of building bridges! Socioeconomic will cause systems to fail. & Don't get me wrong. U should learn all u can from other disciplines. But I see the "real engineering" discussion as damaging
I was generalizing to save space. I guess I should have said “On the whole engineers mostly build structures that do not regularly collapse. Whereas software engineers build software that crashes all the time.” Hopefully that clarifies the point I’m making.
I believe all engineering is under constant pressure to optimise, to build things cheaper and faster. Until it breaks. Then it takes a step back.
can always count on his sixhead to float up my timeline being obtuse
Dear IT folks, please read @www_ora_tion_ca's thread. Think on it. Read it again. ↑
Please don’t put me in the position of having to defend an XKCD comic. That’s a bridge too fucking far.
FWIW the Stockfish chess engine code annotates relevant routines with expected ELO gains. IIRC the engine goes through a suite of test positions for hours against other engines, stats are taken, it's not guesswork. Some teachable stuff in there IMO.
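As a rough illustration of "stats are taken, it's not guesswork": a match record between two engine versions can be converted into an estimated Elo difference with the standard logistic Elo model. This is not Stockfish's actual testing pipeline, and the match numbers below are made up.

```python
import math

def elo_diff_from_score(wins, draws, losses):
    """Estimate the Elo difference implied by a match record using the
    standard logistic Elo model (illustrative; not Stockfish's real pipeline)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score in (0.0, 1.0):
        return float("inf") if score == 1.0 else float("-inf")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical match result for a candidate patch vs. the current engine.
print(round(elo_diff_from_score(wins=1200, draws=2600, losses=1100), 1))  # ~ +7 Elo
```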
Major difference here - government body vs private company. There is no upside for Facebook to spend the time or effort on this or to release it. Inside the government and some large companies massive reports do exist, they just aren’t often published.
Additionally while I agree with your general thought - often catastrophic things have to occur to spark this level of insight (Eg planes falling out of the sky and people dying). It’s only recently begun happening in software. There is simply less history of truly bad failures
My argument would be that the regulators are part of the sector, and the lack of regulators are a major contributor to the lack of maturity in software creation.
Agree on that front. The question is what parts need to be regulated and at what level? Lots of software in the world, not much of it is actually critical /important.
The rules in Canada are simple: Engineering is regulated and requires licencing. Programming is not regulated and can be done with just a CS degree. That's enough of a start, the remaining details come naturally with continuous improvement feedbackery loops over decades.
CS degree optional
Contractors follow rules too when building a building. Degree or not. Apprenticeships work for unions, and they work for CS degrees in Germany.
That not every programmer who has passed through formal training has heard of the Therac-25 is a goddamn indictment of our profession.
There's no legal obligation for Facebook to prevent "disinformation campaigns", and not even consensus that there's an ethical obligation, given the free speech issues involved. Get that consensus first, and then you can talk about professional discipline.
In early days, web software was cheap infrastructure for nonessential tasks. “Don’t worry. It’s just a chat client.” But we got tempted to use the same designs for critical infrastructure. The industry needs to mature and do the unglamorous work of fixing that mistake.
We overpromised on the security and safety of software. We continue to overpromise. The industry is still focused on “disruption” and “innovation”. We throw a fit and make excuses whenever anyone asks about safety, security, or trust. We’re not young anymore. We have to grow up.
That they had a glider pilot in the cockpit was a GD miracle
Not a miracle: it was a government initiative to give hundreds of kids taxpayer-subsidized scholarships every year to fly gliders, to make sure there was a recruitment pipeline for military and industry; this is a significant factor in why western countries have far safer airlines
i.e. Bob Pearson and Sully Sullenberger and Chris Hadfield all had taxpayer-funded glider training as kids. Investment pays off.
Presumably the Gimli Glider. Thorough.
Without looking at the link - the Gimli Glider?
And customers accept ridiculous liability statements. How did we get here?
Too much control of underlying net infrastructure given to private actors who value their control over public scrutiny
OSS software is often authored by a handful of individuals left alone to support tens or hundreds of thousands of users (or more) with little to no resources. I *wish* I had tech writers, QA, UX, and project management. Please tell me where/how to get them with my zero budget.
I will not defend all s/w devs, but a lot of times it's their mgmt that is the problem. The devs (some of them) *want* to do the right thing in the right way.
Aviation has known for decades that management kills people, so they fixed management. Your argument is invalid.
no, you support it. I blame mgmt for a lot of our life's ills.
Management is a major part of the software development process, and a major part of the industry. If you think programming alone is what results in software.... Well.
You're misreading where I'm coming from. I initially felt you were giving undue emphasis to the devs themselves, rather than the larger, more inclusive, picture. I've worked several roles within both telephony/internet systems and medical industry. I've seen sausage being made.
This is why I find Air Crash Investigation so reassuring. Whenever something goes wrong, there's so much effort put into working out what it was and what to do to prevent it from happening again. Everyone thinks I'm weird.
All the best people are weird
Don't forget snakes.
Yeah, you don’t get it, though. A lot of users actively don’t want security in their software. What would aviation look like if 30% of air passengers actively wanted the plane to crash?
What would it look like if the government wanted there to be a vulnerability in every airplane so they could crash planes on demand? If users sought other forms of influence to mandate air travel un-safety?
Hardware itself defeats security. What would aviation look like if Boeing built every aircraft with a button on each passenger’s seat that would let them take control of the cockpit?
Tip from someone who "gets it": There *are* government-mandated vulnerabilities in airliners so that governments can crash them.
Like users of airplanes don’t actively not want security on the airplanes? I see tons of people complaining and/or ignoring them all the time. Until we are at freaking NASA moon landing software standards, we’re no where near other industry standards
Um: Neil Armstrong switched that computer off and landed manually because he did not want to die.
1202 Data overflow alarm at mission time 102:38:26, 1201 alarm at 102:42:19, Manual control due to computer navigational error established at 102:43:15 -- hq.nasa.gov/alsj/a11/a11.l…
Sure, but things like that never happen in aviation? I thought it was about higher standards, not about perfect systems, because those don’t exist. You know Armstrong did it because it’s public. There is probably a ton of research documentation on every trip into space.
It's not about how high the standards are, it's about applying momentum to continuous improvement of those standards. Any standard can be improved, and it's that improvement cycle that's missing/immature in the software industry.
Fact: Eagle was off course because of navigational computer computation error, headed for a crater wall. Fact: Neil Armstrong landed Eagle manually. Fact: The 1201 and 1202 alarms were unrelated to the above, but added to the confusion at the time.
Ars is disputing that the 1201 and 1202 alarms almost caused an abort, which is a fair argument to make, but unrelated to the fact that the computer was about to fly Eagle sideways into rock.
Do you have good pointers on the “navigational computer computation error”?
Computational error based on incorrect inputs from the ground, search this oral history for the keyword "perturbations"
The Oral History of the Apollo 11 Moon Landing
The knuckle-biting story of the first lunar landing from the people who were there.
popularmechanics.com
Fun quote from later in there: "Under the software control, it did a software restart. Five times during the landing, the whole software was flushed and reconstructed in terms of what was being executed."
Read that, but again it looks related to the 1201/1202. It seems that the suboptimal trajectory was indeed because ground gave the initial go 4-5s too late
Why mention 1201/1202 then? It especially added confusion to your tweet (at least it confused me :)
Because space jargon is FUN
My understanding of this is that 1201/1202 were harmless for the LM but @TheRealBuzz and Neil didn't know it at first, and their handling prevented them from scanning the landing area to choose a safe one. Then Neil decided to switch to manual to pick the precise landing site himself
I'm not aware of a navigational error (based on 102:42:35) so any relevant (and accurate, considering the amount of noise one can find on this) pointer would be nice :)
I think you're making Rob's point quite nicely: users don't want "no security". They want to get the(ir) job done, they want convenience, they need that attachment, etc. Air passengers don't want crashes, they want less hassle at security checks or smoke a cig onboard.
We're talking about a comic about voting machine software, not everyone's daily software to "get their job done" and fart around with attachments.
OK. I think my point still holds, but the examples change. Users want to vote quickly and get the job of managing an election done easily. There may be a higher risk of deliberate attacks if they're easy and people think they can get away with it (e.g. voting twice).
The attachment example also underlines the "no culture of learning from incidents" point -- remember the "iloveyou" virus? Nothing has been learned WRT executing active content from mail. That's why more than 10 years later, ransomware-by-mail attacks were so easily carried out.
It's not that users don't want security, it's that they have been trained by horrendous IT to be used to exceptionnaly low levels of security. Like LinkedIn asking your GMail password to get to your contacts. That should get people fired.
You make a good point. This problem needs to be addressed. Large passenger aircraft are safer than small aircraft and cars because most societies have zero appetite for 100s of people dying in a fireball. I wonder if Malaysian Airlines is still hemorrhaging $.
NASA & JPL do have software engineers who can write low defect software. Doing so is slow and expensive, and may also require clear requirements. Most end user software is written like toddlers build towers, piling stuff up and hoping for the best.
Eh, not toddlers building towers, more like Lego (but not nec. the Technic kind) - I mean it works, and there can be some good principles in it, but rarely is it truly robust especially in new environs.
You know what's more expensive? The current infosec industry. The financial, social and PERSONAL cost of breaches. Shoddy software is only cheap because we've externalized the cleanup costs.
Aviation Eng and Builders are solving one set of well-defined problems: keep a plane flying et al, keep a building standing (Ok, dramatic oversimplification) Software is asked to solve HUUGE variety of problems (and cheaply! w/ stakes usually low) Oh, and new UI plz.
You're looking directly at programmers -- but software authorship is done at a management level, defining requirements and such. Of course software programming sucks when software management sucks -- my comments are at the authorship level.
Programmers write code, they do not make software. It takes a hell of a lot more than programming to make a software or a service, but programmers still think management is an unmanageable boogeyman.
I can see why you got into aviation... your horse is so high you need a whirlybird to reach the saddle.
The whole "mgmt writes specs & coders are just modern day scribes" ethos you're flaunting here is how software was done 50 yrs ago.
As @kirkjerk was saying, modern software is everywhere, solving every problem. Doesn't make sense to apply the same methodology to every project. Many software shops wouldn't be able to exist if they followed waterfall or somesuch.
Which is to say, this whole "you ain't writing software" gatekeeping act you're putting on here is nonsense. Some software shops are completely flat, no mgmt, no ivory tower handing down algorithms and flow charts... and some of those are making 8+ figures of revenue.
yes yes and your value to society and humanity is defined by your revenue right sure
Humanity, huh... so is that an integration test I need to add to my CI/CD pipeline? Can't ship to production unless a child achieves enlightenment by my algorithms?
And here I thought you couldn't get any more sanctimonious. Boy, have I got some egg on my face!
Seriously, software is just a fucking job. I write it to get a paycheck. If my software makes enough money to sustain the business, what more could I ask for? Nothing. That's why I mention revenue. That's really all that matters here. The rest of this diatribe is faff.
A job and a profession are two completely different things. It's nice you have a job, I'm glad you enjoy it, I hope your doctor is a professional.
You push paper around a desk and punch keys on a keyboard. You don't have a life in your hands. (The lives that use your product are in the hands of thousands) If your head got any bigger it'd start affecting the tides.
We're talking about a comic about voting machine software, not punching keyboards and pushing paper.
That might be what you're thinking, but it sure ain't how you're coming off.
There's no P.E. or M.D. after your name. Looks like you have a job too... professionals are accredited and regulated.
I have a real name that I'm willing to stand behind, you don't even have that. Yes, I have accreditations and professional designations and licences, but that's not what makes me right about stuff on a silly bird website.
Ooh he's moved from gatekeeping to bullying. You want to meet me in the parking lot, that why you want my name? How very...
yes yes im the first to start bullying in this thread boo hoo anonymous coward
Is that part of attaining a professional accreditation: demonstrating mastery of playground taunts? What letters do you get after your name, there... P.U.?
Agree. From personal experience, after moving to aviation from "normal" software industry, it takes a bit of time to appreciate the role of proper management and process. But when the product needs to be in service for 35 years, the "startup fever" is not the right mindset.
This isn't a dichotomy, nor is it even a spectrum. There are many dimensions to our work products. "35 years" is a useless figure to someone who works on HFT algos or med imaging.
I'm glad that you have changed the pitch from "this is how it was 50 yrs ago". Yes, there are many dimensions, so please allow others do the safety critical stuff the way it needs to be done. Have fun with HFT, but beware med imaging - some of it might need certification.
And shortly after, somewhere else I found this (fresh and close to "med imaging"): bbc.co.uk/news/av/techno… You don't want it. And making billions in revenue on software that is selling ads gives no credibility in *this* area. Same for autonomous cars, etc.
Hack attack can stop people's hearts
Researchers disclose an unfixed vulnerability that threatens medical devices.
bbc.co.uk
But it‘s also the other way round. Management thinks the programmers cannot think, cannot decide. As always the truth is in the middle. Managers and Developers share the responsibility for success in the end.
1/ It sounds like you’re saying the root cause of problematic s/w is human error & humans at fault are management. But management answers to Board/investors, who are in turn “answering” to profits/customers; custs are operating w/incomplete info. Root cause attrib is hard.
2/ Calling out Facebook‘s failure to prevent foreign election influence conveniently ignores how ludicrous the very idea of bots influencing an election was, not long ago. FB has scaled 2000x in 14 years. Outcomes are more obvious in hindsight.
Sociologists predicted it, gamer gate proved it was coming, but Facebook ignored their expertise and warnings.
There is no single root cause, just accountability.
Rob - your concept of "authorship" is flawed. A building author IS the architect. S/he is the one licensed, regulated & accountable to produce a safe, well engineered building. The prop developer cannot override the laws & regs nor will be liable like the architect for flaws.
You're being overly theoretical and superficial. 72 people died in Grenfell, and I'll wager a beer that no individual architect will lose their license.
Not unique to ANY industry. And theory is precisely the core of the aviation industry (and others). Being theoretical is kinda what engineers (and pioneers) do, ESP in aerospace. So let's set attacks on proper science aside.
as a modern software engineer, I agree. The few times we let software control really important stuff like airplanes, the rules are completely different.
Conversely, having just gotten to peer into the cockpit of the B757 we were supposed to take but was grounded, and beheld the finest software 1990 had to offer (as well as the iPads suction cupped to the windows to compensate): there are downsides to excessive conservatism.
The slightest change in avionics needs to be Certified. And Certification is... *very* expensive. Hence the conservatism.
I’m aware. The discussion was about software/design/compliance. My point was pointing to aviation as an unqualified success of more rigorous design does not tell the whole story.
Rigorous design wasn't what saved the aviation industry, it was applying the Deming wheel of continuous improvement with transparency and accountability and sharing of best practices.
That actually happens pretty well in software too. Issue is one of priorities and incentives: cloud providers take this stuff real seriously, IoT providers (for instance) can’t be bothered to care.
Would you rather that Aviation be conservative. Or have a shiny UI each flight with a 99% chance of crashing?
Aviation is the most innovative industry we have, from kitty hawk to supersonic space shuttle landings in a single person's lifetime. Aviation isn't conservative, it's disciplined.
100% agree. Am updating my mental models...
Sigh. Meaninglessly hyperbolic statement. 20th century had a lot of technological low-hanging fruit that was grabbed. Aviation, along with basically every other area I can think of, had remarkable advances.
There is actually a variety of interesting discussions to be had here if anyone cared about nuance. The dynamics of aviation are very different from other tech.
Needs to be much more conservative than other areas because of safety, but unclear whether it needs to be as heavyweight and expensive as it is. Conversely, reasons for crappy reliability in other areas of software have mostly to do with incentives, not ignorance.
E.g., cloud providers do a pretty good job with security, actually, balanced against fast iteration. IoT vendors are terrible because they have little incentive to do better.
I would like airline pilots to have access to UI advances from the past 30 years to give them better situational awareness. I have better tools at my disposal as a private pilot than airline pilots generally do.
The iPads largely replace a lot of paper, including heavy flight bags pilots used to have to carry. They could build that functionality into the airplane, but it turns out the ideal form factor for this interface is a... tablet, that the pilots can hold like a piece of paper.
Yep, the iPad is part of the pilot's equipment, not part of the aircraft's equipment -- and at no point does the safety of the aircraft depend on the iPad.
Sure does if misreading the approach plate means pilot flies into a mountain, which is more of a problem than wings failing. Also, iPads tend to be suction cupped to window, so not exactly loose in cabin. “Pilot” vs “plane” not interesting w.r.t. safety: unified system.
That's like saying the aircraft's safety in the paper approach plate days relied on paper. The pilot misreading the approach plate has nothing to do with it being on an iPad versus paper. Of course, the pilot is not going to set the iPad down loosely when not using it.
My point is planes (currently) don’t fly themselves. You can end up dead on the best engineered plane if the pilot makes a mistake. And user interface design affects the pilot’s awareness of what is going on. How information is displayed to pilot, and when, 100% affects safety.
Pilots train on user interfaces before flying with them, it's never a surprise.
Not the issue. E.g., G1000 gives better awareness of what’s going on, lowers cognitive load, alerts urgent issues, highlights most relevant info. Equivalently experienced pilot using g1000 is _safer_ than pilot using old steam gauges. User interfaces matter. Ask any pilot.
What does this have to do with iPads "compensating" for the 757's software? A G1000 is aircraft equipment, and iPad is not. The iPad does not interface with the aircraft's controls or displays.
I'm a pilot, and a flight instructor, so I asked myself. You're very, very wrong. We have to turn all that blinky shit off for our student pilots.
No digital readout will ever beat the intuitive design of these instruments. Here's the ASI that I flew with today -- I don't even have to read a number, I get what I need with a glance in a quarter of a second, then I get my head back out of the cockpit.
Unless you're flying IFR, the G1000 is a dangerous distraction.
Every second that my eyes are on an instrument will take me away from the real dangers that can knock me out of the sky -- rich doctors in SR22s staring at fancy G1000s who aren't looking where they're going.
I’ve worked both in aerospace and the software tech industry and I completely agree with this.
The cowboys who built Facebook and Google are now billionaires. The software industry doesn't reward careful engineering right now... it rewards shiny new bells and whistles. I do a lot of programming, and there's always huge pressure to build more features in less time.
If my boss on the ship I work on accidentally dumped oil overboard, he could: lose his engineer's license, get tossed in jail, or be required to pay a huge fine.
More like a kitten on super catnip laced with cocaine
"All of my previous work was JavaScript but I had a revelation while drunk and high last night and learned Rust and we're coding your key components in it now wheeeeeee I'm still high wheeeee" (Not *quite* a literal quote. But far far too close...)
There is a field that deals with safety critical software engineering, but I'd bet an index finger 90% of professional software engineers couldn't describe any of those principles or practices. The knowledge exists to enable us to build robust software systems; we choose not to.
Yes, modern software discipline is a lot worse than aviation discipline. But the attacks are also more powerful. Weather, etc. are like the network connection going out. Software should be able to survive that, and often doesn't. But an *attack* is like an AA gun.
Almost none of the things on your list of things attacking airplanes actually *want* the plane to fail, they just make it hard to succeed. With software, you often *do* have a malicious attack, which is harder to deal with.
You deal with it the same way you deal with everything else: disciplined continuous improvement, which the software industry sucks at compared to the aviation industry.
Yes, it does suck. But there's still a material difference. If something can go wrong with a flight one time in a hundred billion by chance, there haven't been enough flights to notice. If there's a five-byte input that causes your program to crash, an attacker will find it.
Okay so why would you want a voting machine like that
I didn't say I did; I was just pointing out that it's not *all* bad discipline. (Although I think with enough care (and time and money), we could have secure, safe voting machines. We're not there yet and a lot of smart people disagree it's even in-principle possible, though.)
There is also a lot of work to make electronic systems safer. Solvers to prove algorithms will work the same way. Fuzzing to find those five byte inputs. There is plenty of work and education left to go into incremental improvement before “trust” isn’t a question for software.
Electronic voting systems also don’t just have electronics problems. You can make your box “unhackable” and then the shipping company sends the wrong one that has a nicer PCB. Change in process of large systems requires methodical introspection.
The point of the five byte figure is it's a nice round number greater than 100 billion, the number I used for airplane problems. I should have taken into account that running the software is cheaper and more common than flying; the point is "wouldn't happen accidentally".
Some fuzzers (AFL, for example) are smart enough to try to find inputs that trigger weird behavior, but if I had said "one kilobyte" instead of "five bytes", the point would hold but fuzzing might not find it.
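For illustration of the point about attackers finding crashing inputs, a minimal brute-force fuzzing sketch: a toy parser with a planted out-of-bounds bug, and a naive random fuzzer that stumbles onto a short crashing input. Real coverage-guided fuzzers like AFL are far smarter than this; the parser and its bug are entirely hypothetical.

```python
import random

def parse_record(data: bytes) -> bytes:
    """Toy parser with a planted bug: it trusts the declared length in
    byte 0 and never checks it against the actual size of the input."""
    declared_len = data[0]
    payload = data[1:1 + declared_len]
    # Bug: if the input is shorter than declared, this index is out of range.
    checksum = data[1 + declared_len]
    return payload + bytes([checksum])

def fuzz(iterations: int = 100_000, max_len: int = 8):
    """Naive random fuzzer: throw short random inputs at the parser and
    return the first one that raises an unexpected exception."""
    rng = random.Random(0)
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randint(1, max_len)))
        try:
            parse_record(data)
        except IndexError:
            return data  # a crashing input, typically found almost immediately
    return None

if __name__ == "__main__":
    crash = fuzz()
    print("crashing input:", crash.hex() if crash else "none found")
```

Random generation finds this planted bug quickly because almost any short input triggers it; a specific magic byte sequence, as the tweet notes, would need coverage guidance or far more time to hit.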
An attack against software looks like what happened to MH17. I don't think the airline industry is building defenses for civilian airliners. Bad weather is a router reboot.
You could get software regulated to the level of airlines if you could convince people to pay the same kind of pricing as for aircraft, and do similar maintenance. There is a small limited market for this.
Sometimes software on my *phone*, which should be used to a flaky network, misbehaves on router reboot. Software quality could be a lot higher than it is, even without attackers. But I got that app for free, and nobody's giving out free airplanes.
If a civilian airplane is perfectly safe unless 40mm, 800g balls of metal hits it a couple times a second at faster than the speed of sound, nobody will buy it because it's too heavy and they want something lighter even if it is less safe.
So how many lines of code have come off of your fingers and into a product someone loved?
Quibble: cute puppies can be housebroken in less than a year.
But not after your puppies have puppies. (15 years for GHOST on Linux, who knows how long link_ntoa on BSD...)
Indeed. Speaking as a software engineer who was on a National Academies panel on software dependability, I totally agree. Yes, software is hard. But Facebook didn't use known best practices like social threat modeling. Shockingly unprofessional. cc @digitalsista
Speaking as a software developer, tutor, and code reviewer: the software development industry does *not* have any kind of widespread concept of 'professional ethics'. That's where the problem starts. There's absolutely no common sense of responsibility.
This creates a lot of compounding issues: 1) Issues are treated as 'solved, never think about again'. 2) Because most developers are negligent, clients expect no expenses on security; which makes ethical developers non-competitive. 3) And so on.
Which unfortunately means that this is something that can't be solved on an individual level; it's a systemic issue and there are no incentives for individual developers to change it.
Great thread, and something I've been going on about for ages, but it's not disingenuous, it's cultural. Software as a field is so new and subjected to so much light and heat that the practitioners don't know it could be better.
There are other structural problems with bringing standards and good practice to software, the wild growth of the field combined with the lack of centralization or structure to the field, comparatively.
It would be as if most people who got into aviation did so by building a plane at home first. But despite this I do think software liability would go a long way to starting to get the incentives in line with creating good standards similar to medicine and aviation.
There’s also an issue of scale: if planes failed the way software does, the next time a single Boeing 777 crashes, every other Boeing 777 in the air would also lose power.
There are various places where the metaphor fails, but also doesn't. In a way that's exactly what we say with wannacry -- over and over again.
and honestly, i think the reason wannacry gets to keep exploiting SMB v1 and Boeings don't get to keep crashing is that we can take pictures of one, and see it clearly, and treat it as an event mentally.
There’s also the monoculture vs diversity aspect though: each airplane has its own set of pilots double-checking each other. Software is like a single godlike pilot flying all the world’s planes at the same time, but she sometimes gets sleepy.
i think that's a misfitting metaphor. pilots might match better with sysadmins/ops, who aren't ime the biggest fans of programmer bullshit.
Today a key subsystem failed because the third-party datasource API on which it depended simply disappeared, deactivated and delisted without warning. Whose fault was it—theirs for going down or ours for relying on it? The turtles are all wobbly, all the way down.
it literally doesn't matter. fault is the wrong frame, as the original thread made clear: responsibility is the right frame. Who deals with it and what recourse do they have against others who may have failed their own responsibilities in the chain.
Attribution is always a political choice, right now that choice is to attribute in a way that protects vendors from any liability or responsibility for what their software does in the world.
The fault was with the management that decided years ago not to care
no one in the chain cares. i know we love to beat up on the suits, but this is a toxic culture, and has been as long as i've been in it. the privilege to treat quality as esoteric and impossible is of the same piece as why it's so dominated by whiteness, sexism, etc.
bullshit is bullshit, and you see it expressed in myriad ways.
sorry i'm so ranty this has been many many years of my life and i'm so over it.
Anyway, this is the most recent piece I've written about it: emptywheel.net/2017/09/14/sof…
Software is not authored by a single programmer; it takes a village. The problems need management to be fixed first.
dev culture is a lot of the problem, and management alone can't fix that. the place it needs to get fixed first, or can be, is software liability. honestly, it's going to be the goddamned insurance companies if it's anyone.
liability for what though? FOSS is strictly non-liable, and strict liability sounds more like indentured servitude, which is worse if you work with non-geniuses.
As @allspaw points out, this could be as much good as bad. I personally think bad. Something that is under-discussed, and that we ought to talk about more, is how we move the culture forward toward safety. Lots of possibilities for action, some already happening.
Personally, I would love to see existing groups like the ACM begin to apply their code of ethics publicly... changing the fubar interview process and more remote jobs could help in allowing developers to say "no". We forget that a huge part of the field is invisible.
All these "enterprise" devs that still spend their days doing Java or Perl or COBOL at BoA, at most hardware manufacturers, at car manufacturers, at Samsung, etc. You never see them, but they also do not see us. The whole conference circuit: they do not know it. They are stuck in the 90s.
Giving them personal liability will do nothing, because they will not know about it. How would they know? A trial? Really? That will just not happen. They know nothing about the consequences of their acts: no one understands these systems. They are complex and "organically grown".
Maybe this says more about your career and the places you work as "incident investigator" than it says about the software engineering field?
You *can* have reliable software, and it can be developed in a professional and deeply risk averse manner. It costs about 100x as much and takes 10x as long.
But that doesn’t make for a great ivory tower monologue
Are you also taking into account the cost of having the US Elections hacked by Russia? Externalities are frequently ignored, but no less costly.
That's the cost of designing systems that will fail safely in the event of a hack. It might be more expensive than that, even, so for now the gold standard will continue to be paper.
@rantydave This is a thread specifically about voting machines.
@rantydave Doesn’t make @rantydave’s point any less valid. Even if it were deemed important enough to make it “the secure way”, the mere fact that other software is made at such a reduced price + speed almost guarantees someone will cut corners to be more economical
@rantydave My argument was that Facebook and Google, as corporations, should not be trusted to build voting machines. I fail to see any counter-argument in David's tweet.
@rantydave I think he's saying that other corporations in other industries could theoretically be trusted, which is a point irrelevant to my argument and thread.
@rantydave Ahhh thank you for clarifying. Yeah, let’s not give them the contracts for voting 😅
It's not wildly disingenuous, it's completely oblivious and uninformed. There's no need to even discuss it.
Robustness (robustness against random errors) is much, much easier to handle than security (robustness against targeted errors). Imagine that anyone could change gravity for each atom all over the world, then design an airplane for that.
You might think so, but it's just a matter of properly designing for the threat model and failure modes. Hiring a Red Team to attack the system at various stages of the process (including initial design documents) is one way to develop an appropriate threat model.
The real problem there is that the threat-defense model is organization-based, but the actual threat surface ranges over 50 billion nodes, some of which are obsolete smart light bulbs and XP machines.
Those are the potential attack vectors. If you design a system, you have opportunities at several stages to limit the attack surface: say, by not using massive general-purpose OS distributions as a base, and by not using public networks for communication.
That seems pretty restrictive for general software.
For word processors and games, sure, but there are sensitive applications like voting machines where it is the minimal cost of entry if you want a secure system. I'd say it applies to light bulbs, too.
From the xkcd is wrong link: " Security against human attack consists of the entire infrastructure outside the plane, such as TSA forcing us to take off our shoes, to trade restrictions to prevent the proliferation of Stinger missiles." 1/2
And the same thing in the OP: "Simplistic focus on the machine and loss of perspective on the bigger system & society is the hubris that keeps the technology industry trapped in the footgun cycle." So while we can prolly agree TSA is silly, the larger issue is the larger surface.
True, but when designing systems you define attack surfaces at multiple layers. This isn't just the machine, but also the operators and the administrators of the machine, and how all of them together interact with the outside world. We can do this, with discipline and patience.
This is a thread about voting machines, not general software
"Where's the detailed 200-page public report from Facebook" kind of threw me off. Sorry.
I had a reality check when I read somewhere that there was no way NASA would be interested in most tech companies, as they need things to work with little room for error, and could do without side challenges, like npm package management or github merge conflicts 😂
Many topics are covered in this thread, and incredulity is expected. However, as someone who has been “bridging” Systems Safety/Human Factors and modern software engineering for the past ~8 years, I can confidently say: it’s not as simple as you make it out to be.
Yes, the longevity and maturity of “software engineering” is part of this. Yes, it’s likely that differences in codes of ethics (possibly licensing, but I’m unsure of that) and ‘professionalism’ between aviation and software play a part. But: this is a simplistic comparison between fields.
Regulation plays a significant role here. Some positive in some directions (independent investigation organizations, for example) and some negative in other directions (reports that list bullshit such as “pilot error” or “loss of situational awareness” as causes).
*All* software has potential for unintended consequences, regardless of the domain. Airplanes, cars, social media, email...all of it. Those unintended consequences manifest sometimes as ‘vulnerabilities’ exploited by adversaries or bugs resulting in unavailability or...
...or software that works exactly as the developer intended but used differently by users, or many others. The same is true for other domains. Comparisons like these are not even apples and oranges, they’re apples and doorknobs.
It's not simple to successfully accomplish. It's simple to start trying.
Trying should include resisting the temptation to a) compare incompatible domains, and b) oversimplify complex adaptive systems in 280 characters. :)
Many major diffs, inc. ICAO and major accidents
I’d agree that he is drastically understating the threat that the information security community faces, but that makes it even more concerning that he is bang on the mark about the degree of seriousness with which the problem is approached.
And this is mostly because information security tends to be critically under-resourced compared to the level of risk.
And that is in large part because their employers aren’t being held to account for their negligence, and legislators aren’t being held to account for their failure to engage with the world we have been actually living in for the last thirty years...
In particular the Equifax and Facebook incidents stand out as opportunities to send a message, that were dramatically overlooked.
One thing I think the industry needs to do is get a lot more forthright about the damage being caused by these incidents, rather than the “no personal details were leaked” bullshit that often gets dragged out.
And the public, governments and (I’m guessing the financial sector) need to get far more confrontational about the way risk is being externalised to them.
"Degree of seriousness" is just another form of "lack of airmanship" found in aviation accident reports. If there are no breaches or they are decreasing, does that indicate a proper degree of seriousness? (no)
Calls for "better" standards of practice is a common reaction to all consequential accidents. It's easy to do, adds very little to dialogue about future improvement, and ignores the real "messy details" of actual work.
...non-linear systems with emergent properties. This is a really hard problem. We know enough about how hard software engineering is to know that voting machines are a threat to democracy. Secret ballot on paper, public scrutiny of the tally and counting process is all we need.
Having operated nuclear submarines for many years and more recently computer services, I get excited every time this debate comes up. I authored a few RCAs in the Navy, and read thousands of others. Almost all were attributed to "Human Error."
When I switched from submarines to software services, the differences were puzzling: why were there no operating procedures, periodic maintenance schedules, incident procedures, on-call rotations, checklists, standing orders, hydrostatic test, incident drills, etc
...some of this has changed over the last 15 years. A wiki serves as an evolving operating/incident/maintenance procedure on some teams. On-call rotations are ubiquitous. And yet these debates are often, to quote Crash Davis in Bull Durham, like a "martian talking to a fungo"
I've puzzled over these differences and have followed the work of Drs Allspaw and Cook with great interest as they create a new field. Yet the debate still explodes occasionally with a practitioner of traditional accident investigation saying, "do the RCAs, human error is real...
...MTTR, MTTB, MTTD, MTTC." My experience is that the traditional methodologies worked very well on submarines, but were much less successful in software services. For a while I attributed this to immature culture, lack of leadership, lack of accountability.
This thread already mentions two other differences, regulation and the maturity of "the field." My sense is there are three attributes of "traditional operations" that makes "traditional Problem Management" (RCA, Post Mortems, Continuous Improvement) work:
1. stability of architecture 2. horizontal scaling 3. absence of Moore's Law (related to #2)
An operator from WWII would be quite familiar with the "architecture" of a submarine: engine room in the back, control room, sonar, torpedo room, ballast tanks. While the flat-screen displays would surprise them, we have replicated the intuitive utility of the dial gauge on those.
Any software service that scales, by necessity, changes architecture. One could argue that at some super high level there is just a "front end" and a "back end", but under the hood there will be entirely new components every few years that dramatically change how operators interact with it.
2. Horizontal scaling: sure, everyone is going to say "of course I horizontally scale my service," but let's compare the count of submarines, airplanes, or automobiles to the count of web search services or cryptocurrencies. Replication of a platform creates a fleet of identical machines,
...each with a different Ops team. This creates a large 'n' for a central office to collect and compare incidents, thereby refining and converging procedures and designs. Read the introductory chapters of ITIL.
The beginning of the submarine reactor plant manual stated that "everything you will ever need to do to this machine is documented here. If you think you need to do something that is not documented, read it again. If you still don't find it, surface the ship and radio home"
In years of operations, I never encountered an exception. Name a software service wiki for which that is true. The scaling and rapid iteration of architecture is enabled in services by #3, Moore's Law.
If submarines were more like software services, you'd have to imagine a single submarine in the fleet that held a billion torpedoes, traveled at half the speed of light and fit in the palm of your hand.
The operators of that machine would likely be challenged to use traditional methods of accident investigation. I am excited how the new field generates methodologies that can be retrofitted back onto the traditional Ops disciplines. That has often happened in history. (EOM)
All this boils down to "if we don't know how the machine works, we can't understand how it failed, and if we don't take the time to understand how it might fail, we'll never understand how it works."
Rickover took uranium from the ground and put it to work in a submarine -- a feat of engineering much more difficult than anything most programmers will encounter. He was able to publish the book you read because he made choices and established rules for how the machine operated.
Along the way there were thousands of decisions made to eliminate risk and create the best system. There were experiments with multi-reactors and sodium-cooled models, which were abandoned. But all of it began and ended with safety, a mindset that nuclear accidents weren't OK.
There is no such thinking in most software development. But there is in some areas, and in those areas the kinds of things we're discussing routinely happen and the software is expected to work every time.
It is not that SW is some special, magical gift unlike anything else in our universe. It's that we can either have an attitude that errors are inevitable or an attitude that errors can and must be prevented. It's as simple as that.
Here's the reality, @www_ora_tion_ca, hospitals are literally killing their patients from medical errors and can't agree that this shouldn't happen. Each industry sees what is different about it and uses it as a shield such that ideas from outside are difficult to consider.
We have a perfect example here. Someone who worked in a very disciplined industry goes to a much less-disciplined one and what happens? Do they change the industry or does the industry change them?
If SW devs begin flight school, do they argue that complexity compels them to make errors during landing? Or do they focus their minds and talents to properly land the plane every time? And when they've mastered that and go back to work, their mindset changes once again.
It's not a matter of "can", it's a matter of will. Of want to. The same people coding in the morning and flying in the afternoon adopt very different mindsets, each adapting to their environment. Each seems rational by that standard, yet each is clearly a choice.
SW devs have been some of my most challenging student pilots during instruction; their tendency isn't to complain about complexity, the challenge is to prevent a distracting hyperfocus on a minor detail.
And if they do, will they be leaping on their own, or will they have been pushed from a comfortable resting place?
They're getting pushed by Bezos, Buffett, and Dimon.
While you may intend derision with "don't know how the machine works" that is in fact the starting point for these new approaches to system failure. Hard work, brilliant engineering and unparalleled adherence to high standards were the keys to Rickover's incredible achievement...
...resulting in a machine that for all practical purposes we fully understand, even in failure modes. But systems live on a spectrum of complexity from my toaster on one end to the global economy on the other. Cook et al have been pretty clear that when they say "complex"...
...they mean that it isn't fully understood by any single operator. Now you might say it's irresponsible, or lazy or indicative of a lack of character to build a machine that isn't fully understood, but in truth hard nosed entrepreneurs build them all the time...
...and people are better off for them. If you don't believe that complex systems (defined in this way) do or should exist, then you're expressing an orthogonal argument.
I think complexity is something that we can generally structure ourselves out of if we choose to do so. You've already seen Rickover do that by turning the atomic bomb into atomic power plants run by 20 year-old kids.
If you want to argue that sending and processing information through a set of pre-defined rules is more complex than that, OK; but deferring to someone who has spent more time on complexity science than I have, I'll give you @hvgard and his article: pni2.org/2014/12/comple…
And it is also "used" by totally untrained, ordinary, people from across the world to do everything from finding a job to arranging a coffee meeting to buying a house to learning about a nuclear submarine's power plant failure protocols.
A key source of complexity is that successful software gets built upon by other systems. Just look at how we rely on a Simple Mail Transfer Protocol (SMTP) that was invented by one person. Much software complexity comes from vertical integration of systems
Raymond Tomlinson, Who Put the @ Sign in Email, Is Dead at 74
The computer programmer chose the “at” sign to separate a user name from a destination address in his new messaging program.
nytimes.com
Human error is a constant of the universe and never a root cause. Root cause is always process, management, culture, etc.
The rule of thumb I like is: you’ve got the root cause if you have a (simple/understood/doable) course of action which clearly fixes the problem. Otherwise you’re still working with symptoms and your investigation is incomplete.
Nope. There is no single root cause of complex systems failure. It doesn't exist, it's not a thing, and it's not only a waste of time trying to find it, it's dangerous to assert confidence with. That this concept continues to survive is why we will continue to have accidents.
True that there is never only one root cause if you look hard enough, though I always have to keep investigations from stopping until we've found at least one that meets the definition.
Root Cause: The most basic cause (or causes) of an incident that management has control to fix (i.e. a process/procedure that is Missing, Incomplete or Not followed) and, when fixed, will prevent (or significantly reduce the likelihood of) additional problems of the same type.
(The actual definition is longer, I shortened it for the character limitation)
Can I just add a slight counterpoint? The failure of 9/11 was a social phenomenon, in that technology (big planes) was used for an unplanned purpose. These problems clearly exist for both software and hardware, and safety systems sometimes amount to nought. All systems fail.
Perhaps in some systems, especially social ones, we can never achieve 'fail-safe'? If that isn't an option, what are the other ones? Fail less often? Fail less catastrophically? Fail publicly? Fail too often and you get regulated? I really don't know.
My strong suggestion is to read the multiple sources of research on how these (and other) definitions are critically problematic. You could start here: kitchensoap.com/2012/02/10/eac… or cut to the chase and read Dekker's "Field Guide To Understanding Human Error" 3rd edition.
Kitchen Soap – Each necessary, but only jointly sufficient
I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity c
kitchensoap.com
While there is a hint at convenience, there should be a nod to “socially constructed”. The ideas of “will prevent” and “same type” are also problematic in *complex* systems, esp. wrt emergence. All safety specialists would benefit from reading about the philosophy of causation.
You can identify the start of the failure cascade, but is it even fair to claim a system is complex if it can fail with a singular root? Lack of preventative measures is just as much a causal issue.
As is inadequate monitoring! I see each causal chain as having a root, but there are many chains.
Root Cause: [...] (or causes) All the rest of the palaver is made irrelevant by this idiocy. Gods, give us fucking patience. When the second train carriage is derailed the root cause is the engine being derailed, not the first carriage being derailed.
Let's just tell the second carriage to "not follow" the first carriage.
"Ah, but the root cause of the second carriage being derailed was the damaged line that derailed the engine." Wrong. By that iterative logic, the root cause was the Earth coming into existence, or God always existing. Doh. Engineer your answers.
You haven't gotten to anything that management has control to fix yet. Why was the line damaged? Why didn't maintenance catch it?
In this example, 18 different causal chains were traced to their roots: tsb.gc.ca/eng/rapports-r…
And the causes weren't really engineering.
Yes, the train carriages weren't a very good example, just a bad approximation. Language syntax is a part of this problem, specifically here, the interpretation of root. As @allspaw says "...no single root cause of complex systems failure..."
Introducing "that management has control to fix" appears to prevent reductio ad absurdum but masks unpredictability in complex intersections. This artificial bound provides for KPI's at the cost of admitting the improbable making an appearance.
Attempting to allow for all improbables is simply a no-go. Compiler design has long recognised this with a state of "undefined".
The point is that KPI based thinking can instill unwarranted confidence in covering all the bases. It's a necessary evil, without which nobody would have a job, or the wrong people would have the job.
But it abrogates recognition that in complex systems, things will go wrong, unpredictably.
Applying the word "single" as part of a root cause is scary. Looking for the roots of many causal chains is a better goal.
Delayed response, work... We've come to believe that effort in mitigation of the results of failure in complex systems should be viewed as having higher "goodness" in employment KPI terms than extended root causes (note, plural) analysis.
This is not meant to imply better, it's a very soft concept, couched in social mores, not amenable to hard analysis, particularly in purely financial terms. Or more succinctly, "what does a human life cost?".
In case of dick-measuring reactions: we've been the buck-stops-here person for a national airline network and have consulted for other life-at-risk industry verticals.
So your search continues until you’ve found something that satisfies an arbitrarily created definition without basis in the realities of a complex world. Interesting.
It's not an arbitrary definition, there's a tremendous amount of research behind it.
Really not. Complexity science doesn't agree at all.
My reference texts seem to be in general agreement as to the definition of a root cause, and I've already spent 15 years arguing over specific word choices. It's a high-level definition, because every complex situation is different.
Research? Can you cite a paper that tested it against something else as a measure of reality?
Not playing that game with someone only bringing negativity here. Say what you believe to be true, don't just say that others are wrong.
Imagine if this is how science actually worked.
I'm sorry you seem to be confusing a bird website with sci-hub
My apologies, but differences of opinion = diversity. Negativity is something else. The concept of “root causes” is an oversimplified model of the world. What are culture, processes and management but the actions of people?
How about: “root cause” language helps people to point to situational factors they believe to be salient while system safety and resilience engineering language help people to elicit and combine all available information, and to seek out broader perspective as necessary.
Re-upping: like most tragedies, accidents and breaches play out on a “stage” that was constructed years in advance. System safety + resilience help illuminate both the script and the stage, not just an arbitrary fatal flaw (“root cause”). twitter.com/anoncept/statu…
A metaphor for people seeking rich(er) narratives: if accidents are tragedies, then on what stage are they played out? — and when and how was the stage constructed and set, by whom, and to what ends?
An especially important class of example: any time we see the same story unfold multiple times with different participants, it’s a good bet that there’s an important common environmental cause and that systems safety tools may help. After all: replacing the people didn’t.
I have given supporting references in this thread. I’m genuinely and earnestly interested in what research you’re referring to, since modern Safety Science, Human Factors and related fields have long since dismissed the concept of root cause in complex systems.
Sorry you got caught up in my frustration, John, my criticism was of Steven and Andrew's behaviour. I agree that the concept of a single root cause is bogus.
That said, my executive stakeholders demand root causes at the conclusion of my investigations. My assertions here are that the term is defined.
Even dismissed concepts still have definitions, and those definitions have been very useful to me in managing executive stakeholder expectations.
I've been successful in completing investigation reports with a dozen causal chains, where each chain gets traced to a root.
I can't finish a report with zero root causes, so I compromise by finishing the reports with several.
As a practitioner in industry, I don't have many of the luxuries that academics can bandy about.
Just today I had to work with senior executives who insisted that a root cause is what you get after asking "why" 5 times.
I (and @docgrose and @StevenShorrock) understand this dilemma in industry (we are all practitioner-researchers) and have been working to change this.
It can be done, albeit slowly. For a great example, see the US Forest Service’s Learning Review Guide as an attempt to bring systems thinking to accident investigation in wilderness firefighting: wildfirelessons.net/HigherLogic/Sy…
"behaviour"... I suggest that you rethink think that wrt science. Seriously.
Demonstrably mature comment, consistent with expectations.
No need for name calling. My point was that the "behaviour" you referred to was asking for evidence and making the point that root cause is a debunked concept in safety science, complexity science, and the philosophy of causation. That's all.
In European ATM this is now understood, with the help of David Woods, Erik Hollnagel, Richard Cook, etc., who have explained why this is the case to audiences of aviation safety specialists. It may seem a small language issue, but it has significant ramifications wrt analysis, e.g. RCA in healthcare.
People also increasingly understand the problem with the ideas of linearity, fishbones, and causal chains. There are still reversions to simple and mechanistic system language, but understanding is growing, and finally folk are talking about patterns, interactions, influences, emergence...
Admittedly this process also involves some simplifications in information presentation in order to enhance understanding (eg. skybrary.aero/index.php/Tool…) since the research on safety science, complexity and philosophy of causation is not accessible to many.
Sorry. But Really not.
It's disappointing when experts grumpily fling poop instead of contributing positively to a discussion. If we will stipulate that you're smart, can you be constructive?
Human error can rarely be the "root cause" of an incident in isolated cases. Everything else is systemic.
It's never a root cause. Ever. By definition.
Was waiting for you to spot this.
spot-on thread, Rob. (I'm a professional software developer).
The cost of developing aviation software vs the cost of developing mundane business software is very different, because the risks are very different. Market forces clearly say that non-safety IT accidents are acceptable.
This is a very interesting thread, but I have to disagree on the simple matter of scope. The standard for SSL and its implementers do follow an open process very much like what you're describing.
But to do so for all of software engineering would be like the aviation industry setting standards for everything from a tricycle to an attack helicopter. There's just too much variation in the types of software being produced for a single set of best practices to work.
That's why most companies do it internally and have their own best practices to prevent past mistakes. But in a capitalist economy, they'll consider those trade secrets as long as they can (<- one place I totally agree with you; major failure analysis should be public).
If you tried to create a "programming checklist" even for a smaller-scope, such as writing netcode, there's still so much variation and innovation that half the items would be checked off as N/A. That's not a useful checklist.
What makes programmers unique is that we won't even attempt to make the checklist and learn from which parts are useful or not. We just assume that we know best, it's too complicated and hard for other fields to bother trying to help us organize, so we just keep buttonmashing.
A former boss once told me that he wished everyone did their job like a fighter pilot -- because if a fighter pilot @#$%s up, they *die.*
Safety and complexity related to software operating machinery in the physical world are very different to safety and complexity operating in a human design problem space. A lot of abstract thought is needed to have meaningful discussions on risk.
The way I see it, most software doesn't actually need to be that error-free. Where we've recognized that it does, it actually is (airplanes, etc). The issue is that we haven't recognized that some categories, including voting software (!), need to be error-free.
Airplane software is *not* error-free. I've had flight computers crash on me mid-flight several times. But, I'm alive and well -- because the software is designed to be a safely fallible component.
I was using the term error-free holistically, as in "software on planes never crashes so badly as to crash the plane" which isn't true of, say, many word processors. (If there's a better word that captures that concept, lmk so I can improve my communication)
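The "safely fallible component" idea is worth making concrete. Below is a minimal fail-safe sketch in C; the names and values (compute_pitch_command, the 0.1 gain, the NaN sensor reading) are hypothetical, not any real avionics code. The point is only the structure: the fancy computation is allowed to fail, and the caller degrades to a known-safe default instead of taking the whole system down.

/* Minimal fail-safe sketch -- hypothetical names, not real avionics code.
 * The component may fail, but the system holds a known-safe output
 * instead of failing dangerously. */
#include <math.h>
#include <stdio.h>

/* Returns 0 on success, -1 when the "fancy" path fails (e.g. a bad sensor). */
static int compute_pitch_command(double sensor_value, double *out) {
    if (isnan(sensor_value))
        return -1;                     /* component failure */
    *out = sensor_value * 0.1;         /* stand-in for the real computation */
    return 0;
}

int main(void) {
    const double safe_default = 0.0;   /* neutral, known-safe command */
    double inputs[] = { 1.0, 2.0, NAN, 3.0 };   /* third reading is garbage */
    for (int i = 0; i < 4; i++) {
        double cmd;
        if (compute_pitch_command(inputs[i], &cmd) != 0) {
            cmd = safe_default;        /* fail SAFE and keep going */
            fprintf(stderr, "component failed on input %d, holding safe default\n", i);
        }
        printf("command %d: %.2f\n", i, cmd);
    }
    return 0;
}

The contrast with a word processor that takes your document down with it is exactly the difference being described: same class of component failure, very different system-level outcome.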
Agree. Good point on its own. However I think that: - most developers/orgs overestimate (not deliberately) the quality & completeness of their software. - the realization that “software doesn’t actually need to be that error-free” is NOT the reason why software quality is poor.
I'd say it's not the developers but the customers who are making that determination, and setting up appropriate incentives
Customers are definitely complicit but not equally culpable. Customers don’t (can’t) appreciate the costs of developing software. Developers don’t either and so overpromise & underdeliver. It’s a vicious cycle that results in mediocre (at best) software & missed expectations.
I guess I just see it differently; I don't see it as necessarily a bad thing. Perfection isn't always the goal, and I think software is an example of an area where good enough is good enough.
The chess document has been written. The real issue is that none of the standards stakeholders can agree on, or want, a decent standard. Try writing a standard when the earthquakes and meteors from outer space are the standard's authors.
An old ('94) article about this: "The Professional Responsibilities of Software Engineers" by David Parnas.