
Update on cloud outage impacting ~400 customers. As part of scheduled maintenance our team ran a script to delete legacy data from a deprecated service. Instead of deleting the data, the script erroneously deleted sites, along with their connected products, users, and 3rd-party apps. (1/5)
178 replies and sub-replies as of Apr 14 2022

Atlassian maintains extensive backup and recovery systems, and there has been no reported data loss for customers that have been restored to date. This incident was not the result of a cyberattack and there has been no unauthorized access to customer data. (2/5)
We know this outage is unacceptable & are fully committed to achieving a full & safe restoration ASAP. So far we’ve restored functionality for over 35% of impacted users & estimate the process to last up to 2 more weeks due to complexity of rebuilds for each customer site (3/5)
We're communicating directly with each customer with full transparency on the length of their outage. We know you have questions. If you’re impacted, please reach out via the existing support ticket. In addition to our direct communication see status.atlassian.com (4/5)
We also want to apologize for the lack of comms on the incident. We’ve been focused on getting all the right information directly to impacted customers, and should have shared more externally. You'll see a more detailed and technical update from us later today. (5/5)
As an impacted customer, I would say we haven't gotten very good information from you. ¯\_(ツ)_/¯
Hi there, we apologize for not being more proactive in our communication with you - are you yet in touch with our support team via ticket? If yes, we encourage you to continue working with the team directly, and if not, please DM us your email address so we can connect you. twitter.com/messages/compo…
I'd be happy if you were just transparent about it publicly.
Let me be a bit frank - I'm not trying to give you guys a hard time, but as an IT professional, I would have been embarrassed working for a company that handled something this way. Everything in this post should have been shared April 5th, not April 12th.
hi @atom8bit - Arseny here from Atlassian comms. It's clear we failed to live up to our own standards in transparency and urgency in comms during this incident. We'll share lessons & actions in the public post-incident review to prevent this in the future.
I realize your team is working hard and I feel for your sysadmins/techs/engineers, I really do. I don't envy them by any means. But if I were in their position, I'd be ten times as frustrated by the public perception of how this was handled as by having to work to fix the problem.
I can't stress enough that transparency and communication are *critical* in incident response. So I'm shocked by the lack of/delay in communication when the incident was, in part, caused by an internal communication gap, according to your blog post.
In the spirit of being constructive, perhaps I could point you to Gitlab's outage in Jan. 2017 as a model of an incident being handled very well:
Postmortem of database outage of January 31, 2017, with the lessons we learned (about.gitlab.com)
Additionally, once we have restored service for all customers, we will conduct and share a detailed post-incident review with our findings and next steps. This report will be made public and available to everyone. 2/2
Please give us back the server option 🙏
A technical update two weeks after the incidents.....slow clap
I would agree the information to impacted clients has not really been there, until yesterday, when we were told it could be 2 more weeks.
Hi Alan, again we apologize for this. We will be following up shortly with a detailed blog about what happened, what we're doing to recover, and how we'll be updating customers frequently and transparently moving forward.
We are one of the impacted customers and your assessment does not begin to explain the situation. We have been down for almost a full week and have nothing to go on. This lack of transparency has made it very difficult to plan. You have brought our IT teams to a standstill.
Hi Collin, we understand our products are mission-critical for you & want to make sure we're relaying the most accurate information possible in a timely manner. We'll continue to work directly with your organization to restore your site, communicating 1:1 via support ticket. 1/2
We have just released a company blog with more detailed updates on this incident at atlassian.com/engineering/ap… and once we have restored service for all customers, we will conduct and share a detailed post-incident review with our findings and next steps. 2/2
April 2022 outage update (atlassian.com): "This incident remains the top priority for my engineering team and for our entire company."
Thankfully our JIRA / Confluence instances remain ON-PREM. Let's hope Atlassian learns from this and never repeats it.
… it’s an entirely different kind of knowledge, altogether.
Bro, I totally did that! I feel you, my delete affected a subset of customers and their data… this is next level. May your support staff receive the rest they need when this is all back online again. Xoxo We still love you!
Good teamwork going on 24/7.
Ouch, two weeks to complete a restore?
Hi Mesh, if you'd like to read more about the incident and the timeline to full resolution, we've just published a company blog with more details: atlassian.com/engineering/ap…. Once we have restored service for all customers, we will conduct and share a detailed post-incident review.
Two more weeks ... O U C H
I would get sacked if my team delivered the same "extensive backup and recovery systems" as your company, and we are a fraction of your size. Software 101: REGULARLY TEST YOUR BACKUP AND RECOVERY PROCESS
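For illustration, a minimal sketch of the kind of automated restore drill the tweet above is demanding: periodically restore the latest backup into a scratch environment and sanity-check the result. The paths, the table name, and the copy-based "restore" are hypothetical stand-ins, not Atlassian's actual process.

```python
# Minimal sketch of a recurring restore drill (all paths/names hypothetical):
# restore the latest backup into a scratch location and sanity-check the result.
import shutil
import sqlite3
from pathlib import Path

BACKUP = Path("/backups/latest.sqlite")        # hypothetical backup artifact
SCRATCH = Path("/tmp/restore-drill.sqlite")    # throwaway restore target

def restore_to_scratch() -> Path:
    # Stand-in for the real restore procedure (here just copying a SQLite file).
    shutil.copy(BACKUP, SCRATCH)
    return SCRATCH

def verify(db_path: Path) -> bool:
    # Cheap sanity checks: the restored copy opens and actually contains data.
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM sites").fetchone()
        return count > 0
    finally:
        conn.close()

if __name__ == "__main__":
    restored = restore_to_scratch()
    print("restore drill", "passed" if verify(restored) else "FAILED")
```

Run on a schedule, a drill like this turns "we have backups" into "we have recently proven we can restore from them."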
Yet you insist on moving customers away from self hosted...
Well, I see 2 options after this worst-case outage! 1. Atlassian continues to provide on-prem. 2. Atlassian dies faster than they would ever have imagined! There were much bigger companies failing in the past because of such errors.
All self-hosted services are going away (sadly). Check your Office suite: do you own it, or are you renting it? It's the same for games nowadays. Unfortunately, this hurts the end user but grants short-term profits to the companies, and so here we are...
Office is installed on my computer. I can operate it all day without the cloud. I (still) run the Atlassian suite on self-hosted computers for a few reasons, one of which is f*ckups are mine, I don't have core business systems crashing because someone in AUS writes bad scripts.
If you use a new O365 (not migrated from 2016), its services are in the cloud. Even if they weren't, the license keys are checked compulsively, which means that even with MS Word installed it won't work if you don't pay a monthly fee.
This is why Office 97 is the best office.
Atlassian took ages to even publicly acknowledge the issue to not scare people from their main push of going cloud. It's like govs pushing COVID jab and burying adverse reactions reports to "prevent hesitancy". On prem? Are you having "cloud hesitancy", cloud denier?
Hey guys, do you know the four-eyes (or six-eyes) principle? Reviewing code before running it, or using a dry-run mode (which only shows what would be deleted) before finally running it for real? -> Dudes, get some new devs!
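As a rough sketch of the dry-run idea mentioned above: the cleanup script defaults to only printing what it would delete, and destructive behaviour has to be requested explicitly. All names, flags, and the sample site data are hypothetical.

```python
# Minimal sketch of a cleanup script with a dry-run default (names hypothetical).
import argparse

def find_legacy_site_ids(all_sites):
    """Return IDs of sites that belong to the deprecated service only."""
    return [s["id"] for s in all_sites if s.get("service") == "legacy-app"]

def delete_sites(site_ids, dry_run=True):
    for site_id in site_ids:
        if dry_run:
            print(f"[dry-run] would delete site {site_id}")
        else:
            print(f"deleting site {site_id}")
            # the real deletion call would go here

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--execute", action="store_true",
                        help="actually delete; default is a dry run")
    args = parser.parse_args()

    sites = [{"id": "a1", "service": "legacy-app"},
             {"id": "b2", "service": "live-customer"}]
    targets = find_legacy_site_ids(sites)
    delete_sites(targets, dry_run=not args.execute)

if __name__ == "__main__":
    main()
```

The point of the default is that someone has to read the dry-run output and consciously add --execute before anything is destroyed.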
I am not sure @Atlassian has implemented any principles their ISO 27001 documentation suggests they have implemented.
In the real world, ISO 27001 means nothing.
Bugs and errors are happening no matter how many certificates you have.
Bugs and errors indeed happen. Major mishaps like this one suggest a serious lack of controls. It is not the original incident but the fact that they apparently have not tested their recovery process that is the major source of concern.
What would have happened if ten or 25 times as many sites were affected? It would take over a year to restore everything. Sh*t happens all the time; as a good software company, however, you make sure you can recover from it.
Don’t blame the devs. It’s an environment that every engineer, IC and manager, should be working to improve.
It’s almost never a single person’s error. It’s a system that enables these sorts of issues. The system has to evolve.
This is 100% due to a systemic culture and management problem for which ICs should NOT be blamed.
Blaming or even firing the dev who pressed enter on the DB script will not solve any of Atlassian's problems. They need to figure out how to fix their development and SRE processes instead. I hope they realise this.
I think you are barking up the wrong tree. They've probably focused more on buying companies and increasing revenue than on improving processes.
That's a big oopsie. And more: how is the data stored, and are there any access permissions in place at all, if a "delete legacy data" script can reach and delete users, 3rd-party apps, and sites? Also… no redundancy?
If you want to make sure that mistakes are swept under the rug, you blame the devs. Experienced and highly functional teams adapt blameless culture for good reason. sre.google/sre-book/postm….
👍👍👍 Reliability is never based on a foundation of finger pointing.
The problem with software is that a single character or unexpected path can be catastrophic. They (probably) don’t need new devs, but an improved release and testing process. But even with that, mistakes will happen. In any case, the time-to-recovery is completely bonkers.
Hi there, if you'd like to read more about this incident and our timeline to full resolution, we've published a company blog with full details at atlassian.com/engineering/ap…. We will also conduct and share a detailed post-incident review with our findings and next steps.
Atlassian making this mistake is crazy because they literally own Confluence, Jira and fucking OpsGenie. $PD salespeople must be excited right now.
Actually, first you tell your customers on prem is dead after 2024 and now you kill your hosted version? This should give you enough reasons to continue to provide the on prem version!
On-prem is not going to be deprecated, as far as I remember. In 2024 only DC licenses will be available, so you can have your own on-prem instance, if you wish…
That’s a bit disingenuous. For small companies far below the DC licenses’ 500-user minimum, losing access to an affordably priced on-prem option from Atlassian is a significant hit.
I understand your point, but I think a blanket statement saying that on-prem is not going to be offered at all is a bit disingenuous.
I think we can all agree that the change in stance is disrespectful to the small companies that rely on Atlassian’s products. Also highlights how awful the on prem market has become.
I understand it. That’s all I can say…
What’s your take? Consolidated code base? Lowered maintenance? Just curious. I opened a ticket five years ago and it’s still open and receives comments weekly asking why it’s not fixed. 😛
Now it's clearer than ever - Atlassian Cloud will definitely *not* be an option for my org. On-prem or another vendor.
Hope restoring the data works without any problems. Good luck 🍀
I talk to my students a lot about backing up their data and protecting others’ data. I will continue to refer to @Atlassian as a company they can learn from and this is a great example. Owning the problem, communicating the problem and working towards a solution with backups.
You must be blind or something. It is a slacker's dream company. Engineers go there when they want to 'rest and vest' and just coast; otherwise they work at FAANG companies if they want to do some real work. What comms, what engineering... Lying to customers... A deletion script... Really?
Surely you're a parody account right?
So you don’t have a test script in a segregated environment?
“We don’t always test our code. But, when we do we do it in production!” 😉
My point exactly- I won’t even dare ask if they do any #owasp !
Haha…I remember first learning about owasp when I was just starting out. I used to think only small businesses struggled to meet standards. Silly me thinking enterprises were any different. I’m more shocked when places are actually following proper standards
I'm not even a user, just an auditor who asks for evidence from organisations that use this software. I remember this company pushing customers to drop support for on-premise applications... Zoho and Salesforce must be rubbing their hands together!!
I can only imagine the things you’ve seen as an auditor. Their competitors, probably building use cases as why they’re better. Idk how those affected will function, if they have to wait 2 weeks for data. “It’s not if something bad will happen. But, when…” 😬
TONIGHT, WE TEST IN PROD!
“Only the strongest servers will be granted glory!!!” 💪
This can't be true. I have it on good authority from @Atlassian that you only disabled sites, and not deleted them.....or is that the same outrageous lie as your SLAs?
Hi, a blog will be coming out shortly with details about what happened and what we're doing to recover.
Lol, reminds me of the saying that software ain't perfect.
For some reason some of your impacted customers find it easier to reach out to me than to communicate with you. If I were you, I’d fix this. Not a good look when customers say the official Twitter comms doesn’t reflect their reality as a customer. twitter.com/gergelyorosz/s…
Unfortunately, impacted customers are telling me @Atlassian is not doing what they are communicating publicly. This is from a company who has been down since 5 April. Atlassian, why are you not talking with your own, paying customers? Why do you not give alternatives? Shame…
“We’re communicating directly with each customer.” And yet a customer impacted tells me: “Our bill is close to $10k/month and I doubt we are a big enough customer to care about. They certainly haven’t shown us that we matter. There have been zero personalized communications.”
Hi @gergely Orosz, Atlassian CTO here - we’re communicating proactively with every customer - but you’re right that we can and should do better. (1/8)
You can find a full explanation of why this took longer than it should have and what customers can expect next in terms of communication here: atlassian.com/engineering/ap… (2/8)
Restoring customer sites: We will continue to work directly with affected customers to restore their sites, communicating 1:1 via support tickets and through our Customer Support team. I’m committed to moving through this as fast as possible. (3/8)
Post-incident review: Following this incident, we will conduct and share a detailed post-incident review with our findings and next steps. This report will be made public and available to everyone. (5/8)
Thanks to our customers for the partnership as we navigate each step together. We know you have stakeholders to answer to and that our failure has resulted in major disruptions to your business. (6/8)
We’re committed to restoring service as soon as possible, and doing what we can to make this right for you. (7/8)
As a reminder, we are posting daily updates on our status page at: status.atlassian.com. Once again, thank you. (8/8)
I would LOVE to have an estimate or a detailed plan for the restoration of the sites. This outage is a disaster for my company. Oh… a tip for the future: ever heard of the "ROLLBACK TRAN" statement? 🙈 #massiveFAIL
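The jab about "ROLLBACK TRAN" points at a general technique: run the destructive statement inside a transaction and only commit after a sanity check on how many rows it touched. A minimal sketch using Python's sqlite3; the table, the data, and the row-count threshold are all hypothetical.

```python
# Minimal sketch: run the destructive DELETE inside a transaction and only commit
# if the affected row count is in the expected range; otherwise roll it back.
import sqlite3

EXPECTED_MAX_ROWS = 100  # sanity limit on how many rows the cleanup may touch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, legacy INTEGER)")
conn.executemany("INSERT INTO sites (legacy) VALUES (?)",
                 [(1,)] * 50 + [(0,)] * 500)
conn.commit()

cur = conn.cursor()
cur.execute("DELETE FROM sites WHERE legacy = 1")  # transaction opened implicitly
if cur.rowcount > EXPECTED_MAX_ROWS:
    conn.rollback()  # the SQL-Server-style equivalent would be ROLLBACK TRAN
    print(f"rolled back: {cur.rowcount} rows would have been deleted")
else:
    conn.commit()
    print(f"committed: {cur.rowcount} rows deleted")
```

A transaction does not replace backups or dry runs, but it does give the operator one last chance to stop when the numbers look wrong.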
This situation puts critical perspective on “…we will use commercially reasonable efforts…” from the Atlassian SLA.
Glad that after a week of customers being in the dark you’re committing to doing better. To be clear, I really don’t want to be in the business of listening to your customers complain to me. But if they cannot get responses from you: it’s what customers do. Good luck restoring.
On behalf of all developers in the trenches, let me say that @Jira has been a worthless cancer on our entire industry and has increased agile’s micromanagement and hegemony and the destruction of our projects, careers, and work life balance. We savor the justice while it lasts.
the fact that jira has been down for a week and I’m only just hearing about it tells me that a million programmers are in a frenzy accomplishing all the things in their mental backlog that don’t reduce to tickets
you just have had shit managers bro. go find some good peeps cheers
Damage control too late.
Good thing we still have it working on-site. We are laughing at your emails about moving to the cloud: "If you anticipate future growth, we encourage you to consider our Cloud offerings to meet your future needs."
"Communicating" has been a copy/paste from the status page. Zero personalized communication, and all questions asked have been ignored.
That is completely in line - you can build yourself a Chrome extension to fix their terrible editor behavior faster than you can get the fix from Atlassian.
Hang in there .. Get well soon !!
anyone missing a sabot?
The fancy term the industry folks use to sound cool around that shit is "ChaosOps - PROD version"
Running a [delete] script in prod without dry running it first is always a thrilling experience. 100% recommended.
"Fire in the hole!" *clicks on 'run'*
Hands in resignation letter.
“Didn’t we estimate this deletion would take 3 minutes to run? Why is it still going 30 minutes later?” *sigh*
That was exactly the trigger moment :p
From a memorable incident report: "not to ask unwanted questions, but why is there so much free disk space?"
Probably it's good timing to review the licensing model and bring hosted licenses back.
No lessons learned or improvement points? Terrifying incident, especially the time for recovery. Good luck, folks. ✌️
A typical DELETE without a WHERE clause.
That sounds like a very painful recovery for presumably the best Atlassian product support on the planet…an interesting data point for future orgs considering these products.
that's the thing, most recovery plans are for total data loss, not partially restoring a handful of customers' data
Yeah that’s kind of the point though, it’s not like teams doing disaster recovery have the luxury of controlling what they lose. Your statement seems more relevant in highlighting the difference between single and multi-tenant recovery operations.
indeed, you'd normally have to restore to another platform and export the lost tenants individually to restore to production... that takes time
all depends on how their data is structured though, I've got no insight into that
That’s fair, and why I said they are probably better suited than anyone else to manage this; it was more a comment on their SaaS vs. self-managed offerings than on how this particular team is succeeding or failing in this incident.
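A minimal sketch of the multi-tenant recovery flow described a few replies up: restore the full backup into a side environment, then export only the affected tenants and re-import them into production. Every name, path, and data shape here is a hypothetical stand-in, not a description of Atlassian's actual tooling.

```python
# Minimal sketch of "restore elsewhere, then export individual tenants back
# into production" (all names, paths, and data shapes are hypothetical).
from typing import Dict, List

def load_snapshot(path: str) -> Dict[str, List[dict]]:
    """Pretend to load a full multi-tenant backup, keyed by tenant ID."""
    return {
        "tenant-a": [{"page": "Home", "body": "..."}],
        "tenant-b": [{"page": "Runbook", "body": "..."}],
    }

def export_tenant(snapshot: Dict[str, List[dict]], tenant_id: str) -> List[dict]:
    """Pull one tenant's records out of the restored copy."""
    return snapshot[tenant_id]

def import_into_production(tenant_id: str, records: List[dict]) -> None:
    """Stand-in for re-importing a single tenant into the live platform."""
    print(f"re-importing {len(records)} records for {tenant_id}")

if __name__ == "__main__":
    restored = load_snapshot("/backups/full-dump")  # restored to a side environment
    for tenant in ["tenant-a"]:  # only the deleted tenants, not everyone
        import_into_production(tenant, export_tenant(restored, tenant))
```

The slow part in practice is not the loop itself but restoring the full snapshot and validating each tenant's data before re-import, which is why per-tenant recovery takes so much longer than a wholesale restore.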
Thanks for being transparent. It might happen to any of us. No matter how much attention you pay to it, somehow it happens.
Not if you run deletion scripts against a dev platform, not production.
Did you execute that command in Hadoops?
What about reversing on-site server license decision?
In case anyone is looking for a fast Issue tracker that won’t delete your data.
Atlassian’s reaction and information policy to customers has been slow and contradictory causing quite some damage, especially to companies in the healthcare sector. We look forward to full restoration and follow up.
Hi Angel - this is our biggest priority, and we have mobilized hundreds of engineers to work on this incident 24/7. We will continue to work directly with your organization to restore your site, communicating 1:1 via support ticket and through our Customer Support team. 1/2
In the meantime, we have released a company blog with more details on this incident and the timeline to full resolution: atlassian.com/engineering/ap… and we will also conduct & share a detailed post-incident review with our findings and next steps. 2/2
Join me in watching this sad story.
#HugOps to my friends down under
I still don't understand how this could happen. Wouldn't that script be tested rigorously and its results thoroughly verified?
Nope, the Swedish intern did it
Companies are getting rid of QA because it makes devs "more accountable" 😂
It sounds like the script did exactly what it was told to do. The problem was that a human told it to do the wrong things.
Yeah, you are right! 😂😂
Agile Development ™ ©
Can we get story points and subtasks?
Bad stuff can and will happen. Your main focus needs to be on making sure it can't happen to all customers and regions at once, i.e., that changes are rolled out gradually, with time to see if they break things, and can be stopped.
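A minimal sketch of that kind of staged rollout: apply a risky change to progressively larger batches of sites and halt at the first failed health check. The site IDs, batch sizes, and health check are hypothetical placeholders.

```python
# Minimal sketch of a staged rollout with a stop condition (all names hypothetical).
import time

def apply_change(site_id: str) -> None:
    print(f"applying change to {site_id}")

def healthy(site_id: str) -> bool:
    """Stand-in for a post-change health check (error rates, smoke tests, ...)."""
    return True

def staged_rollout(site_ids, batch_sizes=(1, 10, 100), soak_seconds=0):
    done = 0
    for size in batch_sizes:
        batch = site_ids[done:done + size]
        if not batch:
            break
        for site in batch:
            apply_change(site)
        time.sleep(soak_seconds)  # wait and watch before widening the blast radius
        if not all(healthy(site) for site in batch):
            print("health check failed; halting rollout")
            return done
        done += len(batch)
    return done

if __name__ == "__main__":
    sites = [f"site-{i}" for i in range(400)]
    completed = staged_rollout(sites)
    print(f"rolled out to {completed} of {len(sites)} sites")
```

The design choice is simply that each batch is a chance to notice breakage while the damage is still small.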
Ojaja... Prod environment >>> test environment
In addition to forcing their customers to migrate to the cloud and increasing their monthly bills, Atlassian can't help wiping their data.
Can you also delete @chiefchimpanzee's account? Thank you!
Whilst you’re there, I have some specific users I’d like you to delete - it will help a lot of people.
Here’s the thing. Agile actively discourages the creation of safeguards that prevent human error. Work on such preventative things doesn’t produce the crunchy, delicious story points that look so lovely in @Jira reports…
Why blame agile for that? I see no reason why it doesn’t allow for scheduling such improvements. Not every story needs to be a feature. It’s merely that those making the calls on what to prioritize fail to understand the value & necessity of certain improvements.
I guess continuously shipping features and “value” won out over writing safe and reliable automation scripts. #jiradown is what happens when you let #agile into a business. #AgileKillsKittens
A key to customer value, according to @Atlassian #CTO @SriViswan: "Continuously shipping features." #AtlassianSummit #agile #devops #appdev #productmanagement
Atlassian software alternatives... bye-bye-server.com
I deleted production once. We all make a big boo-boo eventually. I hope y’all reassured the developers that Shit Happens and use this as a learning opportunity.
Oh well happens to the best of us.
Oh wow. At least you’re owning it.
Trying to press CTRL-C but it was too late....
I stopped using Atlassian as soon as they moved to a cloud first strategy
Good luck Guys 🍀
We are not having any problems because we run Jira SERVER (not Cloud, and not Data Center). Which makes me wonder why @Atlassian is ending support for the SERVER versions of your products....?
I'm not a customer, but wow, this situation must suck for a whole lot of people who didn't create this FUBAR. Wish all of you a speedy recovery, no pun intended.
Someone forgot to put in a double quote somewhere... :) Jokes aside, well done on the transparency.
Godspeed response team. I can only imagine some very long days/nights as you guys sort this out.
Damn, been there, done that, though never this bad