Where does engineering rep come from? Here's a Google engineer slagging on Amazon engineering, which they say is mediocre. I don't think this is an unusual opinion; I've heard this from people both inside and outside of Google. Google has the best engineering, Amazon is mediocre.
300 replies and sub-replies as of May 24 2020

When I worked on cloud at MS, it was the same -- although AWS was clearly in the lead, a major concern was that Google's superior engineering would allow them to crush AWS and Azure; "preparing for a knife fight with Amazon, but Google is going to bring a gun to this knife fight"
But when I looked at execution speed on actual projects (via backchannel communications), AWS was smoking both us and Google. In one case, I heard that they got the idea for a project from our product announcement and they still shipped before we did.
They weren't moving fast and breaking things -- when I looked at 3rd party measured uptime, AWS was clearly #1 and we were going back and forth with Google for #2. This understates AWS's edge since they had fewer global outages and less flakiness that didn't count as downtime.
The more I looked into this, the more impressed I was with Amazon engineering. But AFAICT this never translated into any kind of reputational change. I don't think this is unique to Amazon either. When I compare general reputation to what I can observe, they seem uncorrelated.
BTW, I don't mean this thread as an attack on MS or Google. It's more that if I could take a sabbatical from my job and intern somewhere to learn from them, Amazon would be at the top of my list and I don't think many others would put any company in my top 3 in their top 50.
I've also never understood Windows snobbery. StackOverflow was running on 11 IIS boxes + 4 MySQL boxes in 2016, could tolerate failing down to 1 IIS box. Meanwhile, some trendy SV companies were serving multiple orders of magnitude less traffic at multiple OoM greater cost.
PG also says: Python programmers are smarter than Java programmers, good hackers prefer Python. Odd, Google was built on Java & C++ (w/some Python). But if you were on the MS stack you wouldn't have to choose between the perf & IDE support of Java and Python expressiveness.
Until Kotlin, there wasn't a mainstream non-MS language that had anything close to the same combination of: * Performance * Ease of use / ease of onboarding new devs * IDE support * General expressiveness (arguably Go, but I would disagree on expressiveness & IDE support)
Why care about performance? While trendy $1B to $70B SV companies were devoting a ton of time, money, & effort to scaling up a v. low performance stack, SO was humming along with relatively little effort devoted to scaling because they started with a moderate performance stack.
If you compare the 2013 and 2016 StackOverflow architectures, the changes aren't radical: nickcraver.com/blog/2013/11/2… nickcraver.com/blog/2016/02/1… Meanwhile, trendy SV companies were and are moving to incredibly complex architectures to scale out despite having much less traffic.
what impresses you about amazon engineering?
Yeah it’s pretty amazing what they are able to do. Scaling out can be quite expensive. It’d be fascinating to see their estimated costs on AWS vs. their approach.
Yes! And @expensify was running for quite some time on a sqlite in a single box. blog.expensify.com/2018/01/08/sca… Get the traffic first and then you can think how to scale, no need to follow complex architectures born from the problems you might never have.
Scaling SQLite to 4M QPS on a Single Server (EC2 vs Bare Metal)
Tips and tricks for scaling SQLite for 4M real-ish queries per second on a single server, and how bare metal compares to EC2 for this workload.
blog.expensify.com
Very good points. But complexity buys flexibility: continuous deployment and so on
Your messages are pretty contradictory. Python is a low-performance stack, but Java and Go are not. Open TechEmpower: C# is nowhere to be seen at the top. The SO example just says that the amount of work for each request is low. Comparing the number of "boxes", where one is 4 cores and another is 96?
I'm curious on the part describing the amount of work per request as low - what do you consider "low"? We're querying the database for question, answers, votes, comments (baking comment markdown), sidebar, top user - hitting Redis for various things...mismatch on "low" here?
So there are just 3 entities - service, db and redis. While services with a complex flow should do actions across multiple domains, involving many services, many datastore, events and so on. And communication is unfortunately very expensive.
*should*? You've lost me there. Who decided everything *should* be that complex? Is it wrong to keep everything as simple as possible to do what you actually need these days? That's all we're doing.
That's what I was talking about. For SO there was no need for complexity. Of course it's better to avoid it while you can.
There isn't need for such complexity in *most* systems, but that's the path they've chosen. SO isn't unique here, just want to be clear about that. We're fetching as much information from various sources on a page as a lot of people, just not being complex about it.
I'm not sure about "with relatively little effort devoted" @Nick_Craver & @marcgravell are always trying to squeeze more performance out of the stack.
To be fair - we do spikes, but just not ignoring it and dedicating some conscious time over the last decade is the main thing (not constant love). That's why MiniProfiler exists: gives us a number in the corner of every page load: put it in every dev's face. Make it a priority.
Exactly this. We don't sit there thinking "perf perf perf" as our day job, but we a: monitor load on an ongoing basis, and b: occasionally investigate what the big pain points are, and solve those. We only solve actual problems.
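For anyone curious what "a number in the corner of every page load" can look like, here's a minimal sketch in Python. It is not MiniProfiler and not how SO does it; `timed_page`, `question_page`, and `timings_ms` are made-up names for illustration. The assumption is just that you wrap the render function, time it, and put the result on the page where every dev sees it.

```python
import time
from functools import wraps

# Hypothetical in-memory record of page render times, in milliseconds.
timings_ms = []

def timed_page(render):
    """Wrap a page-rendering function and append its render time to the output."""
    @wraps(render)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        body = render(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings_ms.append(elapsed_ms)
        # Surface the number on the page itself so it is hard to ignore.
        return f"{body}\n<!-- rendered in {elapsed_ms:.1f} ms -->"
    return wrapper

@timed_page
def question_page(question_id):
    # Placeholder for the real work (database queries, cache lookups, templating).
    return f"<h1>Question {question_id}</h1>"
```

Calling `question_page(42)` returns the page body with the timing appended, and `timings_ms` keeps the history so the occasional "what are the big pain points" investigation has data to start from.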
In 2015 I went from C#/Visual Studio/Resharper at a startup to Java/Eclipse at Google. I was flabbergasted with how bad the tooling and language support was.
Just try Goland IDE
According to Steven Levy's In The Plex, the first version of Google (aka Backrub) was in Python, then Urs came in and rewrote it in C++.
Google much more built on C++ than Java
I'd generalise to: I don't understand snobbery! That said, the observable correlation between "really good programmer" and "doesn't use Windows" has been strong in many software sub-areas (but not e.g. games?) for years. But I have seen some wonderful counter-examples too.
The lesson this impressed upon me is that someone who really learns their tools can be good no matter the specific tools. The question I don't know the answer to is: do some tools make it easier to become good or do some tools attract people who want to become good?
I think you mean SQL Server, actually, which actually emphasizes the point. nickcraver.com/blog/2016/02/1…
You're totally right. Good catch, thanks!
No problem. Also that quote is Paul Graham of YC. Just silliness. I think this pretty perfectly sums both the quote and Paul up...
They have a culture of execution. I hear Nvidia is also quite good.
Oh, we absolutely rock it. I think it comes to giving work to people who want to do it. I didn't see much change in my work efficiency pre-covid and in-covid. Always trying to find a way to get the job done
Also, nvidia is an SOL company. If you can't achieve it, find solutions to get there
Speed Of Light
Amazon would be my choice as well, but without the hand-wringing around engineering skill. AWS executes - other CSPs, at one point, made a big deal about how poor the AWS internal networking was. Did it matter? No. Pragmatic, efficient delivery is interesting for its own sake.
AWS gets customer obsession better than Google. It gets operating at scale better than Microsoft. Neither comes from hiring who is best at solving CS problems on a whiteboard, nor the funnest job as an engineer there. But it does enable them to rapidly build new robust platforms.
What would be the other two (I mean, Amazon is an unsurprising pick for you- you've already worked at Google and Microsoft. I'm just curious if either Apple or Facebook is in your top 5)
Unrelated, but I'd bet you that you're 95%ing and that most programmers who aren't In The Know would love to work at Amazon
Being in the 95th percentile and assuming everybody else is as knowledgeable as you are, as in danluu.com/p95-skill/
You'd be 95%ing if you went "nobody would read my thoughts on running a business, they all already know this stuff"
I suspect my top pick that would surprise the most people would be Pivotal/CloudFoundry. They seem to actually value being nice. Almost everyone says they do, but if you looked at who gets hired and promoted, they clearly don't mean it. PCF seems to mean it.
They also have a set of processes that values finding roles that are a good fit for people over PIPing or sidelining unproductive people. This seems to allow them to get value out of a larger fraction of their programmers than any other company I know of.
What is PIPing?
Doing CYA process to fire someone.
Finding the right fit is so undervalued. An under-performer on one team can easily be a superstar on another — and the high costs of recruiting new engineers makes it even better.
this was exactly my impression when I joined Pivotal, < 1 month after VMware acquisition was announced -- absolutely grateful to experience it in person pre-acq (not to say it has changed, but many people have left since acq)
I'm curious if Pivotal culture will survive the high turnover + apparent lack of understanding of the culture by VMware leadership (demonstrated by, among other things, the People Ops layoffs). I think attrition is a sign that many employees are bearish on this, unfortunately.
There have been some missteps but so far, from my perspective, things look promising on this front. Former Pivotal is a big part of a new org within VMware and I'm hopeful that we'll also be able to incorporate some needed changes. But only time will tell...
Oh, heh, I have bent Dan's ear about exactly this topic, except I think my metric was +3 sigma.
I'm not sure if that's exactly the same. I don't think being in the top 5% of programming would make you specifically disinclined to work at Amazon.
Dan is definitely in the top 5% of people with knowledge about various companies' working conditions
I think you'll be unsurprised to find that I believe that pretty much any programmer could be 95%-ile with respect to this pretty easily (since most people I know spend almost no effort trying to figure out what a job is like, even one they're about to take).
But I agree that this makes my original comment suspect. I don't really know how to guess what someone who has no information about any of these companies would think (my mental model was someone who has the generic beliefs you see stated on HN/reddit/Twitter/etc.).
It's not so much "throwing money" as it is "throwing bodies with pagers", and presuming a high % of those people will burn out and quit before the backloaded grant stock vests. I'd keep that in mind before your hypothetical internship.
I'm curious; why would you opt hypothetically for interning if you could also just work there?
How much do you think that execution velocity has to do with the fact that Amazon has more engineers / project (vs Google) and the siloed team structure? At scale, maybe clear silos between teams remove significant execution overhead and is just the way to go.
Hi Dan. In your opinion is there any reason to believe that $goog or $msft will catch up in the next few years?
One possible reason is that Google is the biggest ad agency in the world and they definitely use this power to promote perception of their excellence. I can see this clearly in my area (deep learning research).
Also, there seems to be an endless stream of books, papers and posts about their culture and processes from employees. From Amazon or Microsoft? Not so much. I don’t think I've heard anything about development inside AWS, except product press releases
Not an enormous amount in here yet, but here’s some:
The Amazon Builders' Library
aws.amazon.com
Do you think we're also seeing the effects of structural/organisational differences between MS/GOOG/AMZN?
Mediocre's etymology is the one always in the middle ground, and sometimes the virtue lies in the middle, where you may be neither on top nor at the bottom, just aiming for "a little above the average win"
Google are clearly very good at data structures and algorithms, so maybe these people are looking at that rather than Amazon's ability to put together architectures that work..
On a scale of 1 to 10 how much of this is folk lore spill over from Steve Yegge's blog. :)
THIS. Post-Google Yegge slags Google in language eerily similar to how post-Amazon Yegge slagged Amazon.
There's definitely a type of blogger that the moment they switch companies they will start criticizing their old company and extolling the virtues of their new company. Each company they said was amazing becomes terrible the moment they move on.
I was absolutely not going for the Steve Yegge dunk on this one haha. Merely noting that his blog (and similar ones) likely have an oversized impact on public perception since otherwise people won't interact much with say, the engineering dept at goog. And I thought it was funny.
He's really fun to read, and there's usually some directionally interesting truth to what he writes, but ... sometimes a good storyteller doesn't let facts get in the way of the narrative.
Phew! Good thing I avoid this by criticizing past, present, and future employers! Hard to know what future employers I might have, but I try to cover all the bases.
My serious comment is, in my current job, I've never failed to close a candidate and have convinced a lot of people who were uninterested to interested to apply; my "one weird trick" is that I'm honest about the downsides. Most pitches are oversold, I think honesty is refreshing
If you're talking to me, the job's not perfect, but at least you know what the downsides are If you're talking to someone who says the job is amazing in every dimension, all you know is that they're snowing you, you're going to be surprised by the downsides when you take the job
Sorry if I wasn't clear, but this was not a comment about you. What I consider dishonest are bloggers/influencers who will join some company, write about how it's the greatest thing since sliced bread then soon after leaving write a post trashing said company.
My impression was that Googlers didn't want to work on Google Cloud and there's the general perception that eng/mgmt was worse at cloud than other orgs. But I guess another plausible explanation of the observed phenomenon is that Google lost to AWS and Googlers are sore losers?
Perhaps Google's engineers are good but their processes/structures/incentives suck?
I wonder if this is Amazon’s engineering chops being undervalued, or Google’s being overvalued...
Lamport built TLA+ at Microsoft, but Microsoft only started using TLA+ for Azure after Amazon published their success using it to verify AWS.
Lamport built most of TLA+ while at Compaq.
supplement hearsay with a well-designed survey. large sample, clear intention, make assumptions explicit and so forth. assess behavior and attitude.
might be "worse is better". we see this with Alexa—although Google Voice Search is more sophisticated (can take context from previous searches, eg "who came before them"), Alexa has a robust collection of regexes which seems to be what customers care about
could also be attributed to the quality of the product managers/leadership—are they focused on the right thing or not? even if you have an excellent engineering team, if they're not doing the right thing you might not do well in the market.
but going back to the question about reputation, I would guess it has to do with aesthetics. people form a ranking of companies based on things like "how easy is it to get a job there", and "how much do they pay"
my takeaway would be that large organisations probably have teams at lots of different levels of engineering competency. AWS is good at Amazon, Google Cloud is bad at Google. I'm honestly unqualified to judge the distribution of quality at each org.
I suspect it's interview difficulty and Google/Amazon social network feeds the reputation.
You can be fast and still mediocre. Amazon gets the core parts right and the rest doesn’t seem to matter too much. Also mediocrity is a super power - not building ultra-complex systems is always a better choice when possible.
Avg Amazon Dev may be worse but top ones are great and lead the masses. Part of the culture is accepting high operations and fire-fighting to maintain availability. You can achieve high availability in spite of a great number of crappy sub systems and processes.
I guess most people equate pay with “quality” of engineering skills. The Dunning-Kruger effect is highly visible within FAANG companies.
There is still a disconnect between “simple and reliable” vs “impressive” in engineering.
My take: different defs of "engineering." At Amzn, engineering probably == time to market, stability, and $$$. Google, they probably over index on QoL for the engineers: good tools & readability, which doesn't contribute to shipping speed or stability as much as they think. 1/
Another thing to consider: Google Cloud products are much more powerful than AWS's. BigQuery & GKE blow Redshift & EKS out of the water, respectively. It's not even close. 2/2
I think it's a product/business strategy difference, not an engineering difference. Amazon builds fewer features but they build the right ones.
Would it make sense to treat AWS and Amazon engineering separately when talking about reputation?
Perhaps @Google uses other criteria beyond engineering to judge talent (i.e., political correctness), and/or they are not as good at managing their engineers as @amazon. Thanks for sharing.
I think this is a function of what engineers value. For example, many engineers value elegance, or performance. Fewer value speed of delivery to market, even though doing that well and repeatedly is hard.
I don't disrespect the skill of AWS engineers, but I also think MS and Google's Cloud efforts are fundamentally solving bigger problems, and so AWS naturally moves faster.
honestly the most ominous shade possible towards GCP & Azure
Nah, when MS and Google catch up with specific regional offerings from Amazon, the market is going to get very wild very quickly. Amazon simply doesn't have the kinds of infrastructure Microsoft and Google have.
For example, they don't have end-to-end best-in-class developer tooling like MS does. They don't have the supermassive global network and datacenter footprint like Google does.
Amazon's principal means of competition is that they're just *way* less conservative from a methodological standpoint than Google or Microsoft. Also, they're very happy to work people into the ground and reward abusive personalities as "high performers." Very Uber-esque.
The frontline dev leverage MS has (and brand, and LoB trust) is a big deal, those things will keep them in the top 2 ~~permanently now that they've turned the corner outside of the windows franchise.
My ~uninformed reckon is that either of the other players can buy their way over any systemic advantage GCP has
I assure you this is not the case. Google is 3-5 years ahead of the global curve on networking. Which is not to say they don't fall for weird topology traps like other folks do. It's more that the networks and data centers are phenomenally robust and efficient.
There are places in the world where if Google's infrastructure goes down, you will experience a [redacted]x increase in latency to Google properties as you go back to using the normal internet.
I can believe that. The real question tho is, do people care enough vs. the benefits (tangible and otherwise) of just slogging away with AWS? At least so far, the answer there is mostly "no".
Depends. It's less about "caring" and more about "what is the value of a global footprint to you?" Google is the only game in town for a lot of the global load balancing stuff. But then their regional offerings are somewhat behind the competition.
Sure, I was using "care" as a shorthand for "are the benefits worth the costs"
If at some point Google can offer regional data ownership while still doing global load balancing I think a lot of people will care. Google isn't much more expensive than Amazon, in fact they're often cheaper for medium sized businesses. There is a weird dynamic in play here
Google warns you that you really need to put your app in multiple regions to be safe, and it absolutely can break if you don't. Amazon just throws humans into the gears to make sure us-east-1 stays up. But when it goes down it takes a big fraction of the digital economy with it.
I reckon Google infra advantage can only shine if there is uptake of managed tools like Spanner. Reasonably sized engineering teams will always be able to make do slogging away with typical cloud tools. Getting the high-level things cheap to start w/ is key for GCP imo
Certainly true. Firebase is a good example. It also doesn't help Google that a lot of their core GCP and Networking offerings lack a lot of key manual management features.
I think the K8S stuff, which predated my tenure, was a big example there of doing exactly that. And it worked. There are other things in this vein I cannot talk about but they will be pretty exciting when shipped
I'm certain AWS could just match GCP penny for penny and not lose any appreciable ground. The number on the cloud invoice is not the biggest cost, even for greenfield projects. It really is an IBM in the 70s or MS in the 90s kind of situation.
It does matter how much money you have, but it takes a lot of time to secure and build that global network. E.g., securing real estate and power.
I think that I misunderstood your point. I'm not sure they can. If they could, I suspect they would.
My own pedestrian experience vis a vis networking and such w/ AWS is that I've really benefitted from the sort of market indemnification where, when my stuff is down, so are much larger services/properties, and customers mostly just shrug. It _almost_ makes outages non-events.
If you don't feel you have an SLA with your customers, then I suppose so. Similarly a German company wants guarantees of user data staying in Germany. The global nature has negative value.
I do, that's the remarkable part. When it's clearly an AWS issue, the SLA isn't even mentioned. 🤷‍♂
I don't really understand why folks do that, but lucky for you at the end of the day! 😄
Neither do I tbh. At some point outages will end up getting credited automatically, but "oh, aws is down" really is a solid get-out-of-jail-free card
I have never given folks that out. If I have an SLA, it's the vendor's job to deliver it. Going, "Regionalization is not something I budgeted for, but I forgot to mention this in the SLA" is not a very professional practice if I'm honest.
Definitely. To clarify, I've never used that as a response, despite the preceding colloquialism; SLAs are SLAs. It really is more that they're not triggered when they could have been. The real pro move is automatically crediting. Not there yet.
Microsoft's SDN tech also has the potential to redefine what's possible and I'm excited to see where it goes. They approach it as a dynamic hardware problem with FPGAs. That may take years more to fully develop, but when they do it has big implications.
I don't think this is true. There's a lot of places where Amazon is taking leadership, like serverless and ARM. I also think Nitro, their EC2 platform, is pretty far ahead of the rest of the industry at this point. Each cloud has strengths and weaknesses; ignore AWS at your peril.
I don't disagree but when I said "bigger" I mean specifically in scope. Microsoft is addressing a bigger part of the cloud stack in a unified way. Google is working from the perspective of a fundamentally global-first system.
It's worth studying history to look at why EC2 became the model for cloud computing and not Azure Cloud Services (the product). Vertical integration has its benefits but is not sufficient to capture a market where customers want maximum flexibility.
I think AWS's biggest handicap is that individually, a lot of their services are just agonizing to work with. Once other options can get the specific feature parity they need, we may see non-whale customers start to shift.
Is it possible that AWS realized you need a few exceptional technical leaders who can plan and spec reliable systems and then regular engineers can build? I’ve also heard AWS’ “mediocre” engineering but I’d also heard that amazon’s principal engineers were _very_ strong.
(the latter part is funny to me because someone I referred to $UNICORN didn’t pass the on-site but two-weeks later was hired as a principal at AWS and has thrived there)
Uptime wasn't the metric that led to my low opinion of Amazon engineering, it was the alacrity with which I ran into "internal error" upon venturing off the happy path. This was cemented after working with GCP for a couple of years and having things work as documented.
You've been a lot luckier than the people I know who've done major GCP migrations :-)
Entirely possible. My sample size of services on both platforms is a pretty small fraction of the whole catalogue.
Really curious what project this was...
anecdotally, it was Snowball. don't remember source but someone on twitter said google made a press release and AWS just shipped the api without the underlying impl actually existing yet. thirdhand info so take with a mountain of salt..
Isn't that just taking the customer obsession to the extreme? I see nothing wrong with that; it's what every clever company should do if they're unsure whether the product is a hit or a miss.
Managing, communication between multiple teams and stakeholders !== Engineering skills. Past a certain size, no amount of engineering can fix bureaucracy. To be clear all 3 are good. AWS just seems more focused on communication over raw tech, in their internal processes & hiring
Yes, Google has a real arrogance problem in this regard and has never at large figured out how to move fast. There are very deeply set subcultures that push back against fixing this, and where it's bad it's driven me out of PAs because I can't watch my teams be hamstrung by it.
Amaya obviously has the best engineering.
To put it harshly, a common attribute of mediocre engineering is to think that it is superior engineering. I wonder if part of Amazon's success is to take a broader view of what counts as engineering excellence.
Yeah my alarm bells for platonic definitions of quality are ringing when I read this. "We ship slower, have less satisfied users, and don't lead the market - but we do it in a theoretically better way"
Dunning-Kruger?
I don't think it's Dunning-Kruger so much as focusing on the wrong stuff, stuff that has become associated with reputation but doesn't really translate to better results. You see this kind of thing in universities, too.
Maybe marketing? I mean google has made a movie about themselves.
If Google has the best engineering, we are all doomed.
Oh, I'm pretty sure we're doomed either way.
Afaict Google takes pride in only “hiring the best”. Therefore, if you work at google you must be the best, and if you don’t you probably aren’t.
Strictly outsider’s impression, fwiw...
I'm a xoogler. Google feels they have a higher bar for eng hiring than other large companies. But they know there are great engineers elsewhere too. Many reasons for it: - Noisy hiring signals mean Google rejects many great engineers - Engineers' preference (eg geography) etc
I got rejected. The fact that Xoogler is a word but XAmazonian isn’t proves that society thinks I’m inferior because I got rejected. Should I kill myself?
- please don't kill yourself over your job and salary. Life is way larger and more valuable than your job / salary. You are still privileged to get the money. Please breathe, relax and enjoy life's surprises!
I wonder if this was an accidental admission of Google's failure of imagination to consider that A) some great engineers might not want to work for you. B) some great engineers you shouldn't hire for behavioural/leadership concerns. Tech is just one facet of engineering.
I had an ex-colleague use me as a reference for google. The question they asked was ‘was he the best engineer on the team?’ I said no and she was taken aback. They hired him but the thought he might not be the best of the best was troubling for her
>throwing money at the problem and *undercutting* everyone else What does that even mean? Is it about pricing? Because I find GCP cheaper than AWS quite consistently. AWS is the mainstream but expensive choice. Or where does the undercutting idea come from?
I get the impression that Google has great programmers, but actively self-sabotaging product management. Whereas Amazon has so-so programmers, but a terrifying focus on shipping things customers will pay for. Since programming is <5% of success, Amazon wins.
why would you say amazons programmers are only so-so? by what metric are you measuring this ?
Nothing scientific! Just general impressions, plus the reports of the handful of people I know who have worked at either.
I’ve worked with a few people from AWS and a few people from Google and honestly the AWS people were all *far* more diligent engineers
ex-Googler: "this complicated and fragile nonsense solution is right, I worked at google!" ex-AWS: "Sure, whatever this works and should be pretty robust." Guess who was right nearly all the time?
Interesting. I work with an ex-Googler now, and he's the opposite of that. But then he left a long time ago.
Amazon and AWS are 2 different beasts. AWS has decent tech, and the dev culture is similar to all the other cool places. Amazon retail not so much.
Amazon hires great devs. The problem is that at least Amazon retail has 20+ years of tech debt which is never addressed, because there is no clear code ownership, but heavily extended with new features all the time. Nobody could write great code in that scenario.
Funnily enough, I’ve had Amazon (some still at Amazon, some ex-Amazon) employees who then went on to work at google tell me that the thing they *like* and *miss* about Amazon is the simplicity (unsophisticated-ness) of things.
A lot of systems were very unsophisticated compared to the counterparts at Google (might translate to “worse engineering” in some people’s minds), but that made working with those systems simple and the failure modes (and limitations) of those systems well-understood.
purely going by some postmortems of very public outages, it seems like a lot of GCP’s problems stem from certain engineering patterns used across the board which have very nasty failure modes arising in no small part due to the complexity of the architecture in the first place.
And the same can also be seen in the papers Amazon and Google publish. I find most of Google engineering to be incredibly sophisticated but also almost always incredibly complex. Amazon, almost always, favors pragmatism. I always think about this paper: twitter.com/copyconstruct/…
Link to paper bit.ly/2AYOPBx It's interesting in an era where transactional systems are making something of a comeback and Google's preaching about why we should choose strong consistency, whenever possible (bit.ly/2AYq6wY), Amazon picks different tradeoffs
Interesting and thoughtful analysis by both! What's the saying... "Culture eats strategy for breakfast." 😂
I think this is also an indicator of reward systems at work at Amazon and Google. If a complex design gets 10/10 during code review, and the engineer who designed the system is socially rewarded as a genius within the company - they would build increasingly more complex systems.
Yeah, GCP postmortems are interesting! How many other companies could have an outage that stemmed from running their global SDN on top of their cluster manager? Free blog post idea: a list of GCP outages that most companies couldn't even dream of having. status.cloud.google.com/incident/cloud…
Hmmm. Left unchecked, software engineers tend to gravitate towards 'sophisticated' architecture
Fair point. You *can* engineer this thing to within an inch of its life, but *should* you? A parallel comes up for me with e.g. traction control & safety systems. If engineers start offloading their problems to hyper-engineered solution X, can their system tolerate X failing?
Once in a while I've thought I got a glimpse of how AWS do something, often due to some sort of failure, and I've been struck by how ordinary it seems. I think they use well-understood building blocks and then they must relentlessly hunt down operational problems.
It’s always interested me that software engineering is the only engineering discipline that promotes overly-complicated designs as a good thing. All other forms of engineering require exact precision of design with no tolerance for waste or unnecessary complexity.
Did you happen to ask their thoughts about *why* this was the case? What was it about the companies that resulted in these different strategies?
I can't speak to AWS but one of Google's promotion criteria is dealing with complexity. Guess what feature their systems have in common? GCP has the best UI/UX though. They take that seriously.
Interesting! I wonder if AWS has "rigorous simpleton" as their promotion criteria? :D
If so I would respect it. Some of Steve Yegge's old blog posts go into a bunch of detail about the differences in culture between the two.
I am always amazed and amused by the use of Emacs and Elisp for implementing an early service at Amazon
tour-de-babel - steveyegge2
sites.google.com
Picked up on this reading a blog of an ex-Googler having to go the pragmatic route that a startup could afford VS the layered resiliency at n levels for a system implementation. It's another level at @Google VS basically anywhere else.
perhaps it is the selectivity of the interviews? the most visible (to devs) portion of the internal eng culture? google takes pride in high false negative rate.
IME, over 10 years of consulting at companies large and small, every organization has its own version of a reality distortion field in which they view themselves as superior to others. It takes healthy introspection and a lot of swimming upstream to challenge the status quo :|
One of the reasons I wholly subscribe to the notion that "consultancy years are dog years" for career development is the perspective of seeing that companies big and small are often tackling the same hard problems with similar degrees of (non-)success.
sounds like you have some great stories to tell... do tell them someday!
This is of course true for individuals as well.
Yes, everyone thinks I’m a retard because I work at Amazon. I want to kill myself daily
that high false negative has a cost
Only in a universe where "the best" lives on a single dimension would such broad statements be true. Culture, motivation, communication, compromise, consistency, focus... it all goes in the mix.
I'm probably going to regret tweeting this, but I think we're 100% fine with Google folks thinking this, but they would never be comfortable the other way around. And that's the difference.
I'm also going to regret tweeting this, but I can think of at least 3 "big" Google papers (Maglev, Spanner, ALTS) with designs we had rejected years and years earlier, with seemingly no commentary on the problems that we had discovered. Anyway, we prefer to ship and keep on :)
Also, say what you will about Chime, but Amazon has a significantly more clear strategy in the messaging space ;)
Where can we read why they were rejected?
We rejected a Maglev-like design because probabilistic LB doesn't work for the vast majority of workloads. Most customers have only 2 LB targets, they're also often slow, and subject to garbage-collection pauses. Probabilistic LB increases utilization way too much.
It's a design that works well when you have lots of very fast, very consistent targets. You could say it worked well at Google then, but I'm not sure I'd agree. It also imposes that constraint tax on your ecosystem; teams may be forced to optimize way earlier.
Our world view of load balancers is that they are primarily an organizational tool designed to free teams from problems and complexity. They help you not work quite as much on HA, GC, or long-tail latency. The paper reads like awesome bin-packing is what LB is about.
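One way to read the argument above is that a balancer which assigns flows without looking at target state keeps sending work to a target that has stalled. The sketch below is a toy model under stated assumptions, not a model of Maglev or of any AWS load balancer: two targets, one request arriving per tick, each target finishing one request per tick, and the second target frozen for a window that stands in for a GC pause. The function name `simulate`, the policy names, and all the numbers are made up for illustration.

```python
import hashlib


def simulate(policy, ticks=200, pause=(50, 90)):
    """Return the deepest backlog seen on either of two targets."""
    queues = [0, 0]
    deepest = 0
    for t in range(ticks):
        # One new request arrives per tick.
        if policy == "hash":
            # State-blind: the target depends only on a hash of the flow id.
            digest = hashlib.sha256(str(t).encode()).digest()
            target = digest[0] % 2
        else:
            # "least": pick the target with fewer outstanding requests.
            target = 0 if queues[0] <= queues[1] else 1
        queues[target] += 1
        deepest = max(deepest, max(queues))
        # Each target completes one request per tick unless it is paused.
        for i in range(2):
            paused = i == 1 and pause[0] <= t < pause[1]
            if queues[i] > 0 and not paused:
                queues[i] -= 1
    return deepest


hash_backlog = simulate("hash")
least_backlog = simulate("least")
print("hash-based backlog:", hash_backlog)
print("least-outstanding backlog:", least_backlog)
```

In this toy setup the state-aware policy never queues behind the paused target, while the hash-based policy keeps feeding it for the whole pause. With many fast targets the same pause would touch a much smaller share of traffic, which matches the point about where the design fits.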
ALTS shares many of the problems I put in my rant about mTLS. Though ALTS is mTLS done about as well as it can be. We put our weight behind SIGv4 instead, even internally, and I think it's much much better.
Ok. tweet thread time! Too long ago I promised to write a screed explaining how much I hated mutual-auth TLS and why. I got distracted, and I wasn't happy with the writing, so here it is in tweet thread form instead! But basically: Client certs and Mutual-Auth TLS is TERRIBAD.
The pre-auth TCB of SIGv4 is basically SHA256. The pre-auth TCB of ALTS includes Protobufs. And it's a layering violation that puts AAA in a layer where request-smuggling can happen. Anyway, it's something now fixed AIUI, and gRPC is a good example of something better.
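To make the "pre-auth TCB is basically SHA256" remark concrete, here is a minimal sketch of the SigV4 signing-key derivation and a signature check. It covers only the HMAC-SHA256 chain; building the canonical request and the string to sign is omitted, and the secret, date, region, service, and string to sign below are placeholders rather than real values.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    # Chain of HMACs scoping the long-term secret to a day, region and service.
    k_date = _hmac_sha256(("AWS4" + secret).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(signing_key: bytes, string_to_sign: str, presented: str) -> bool:
    # Constant-time comparison of the recomputed and presented signatures.
    return hmac.compare_digest(sign(signing_key, string_to_sign), presented)


# Placeholder inputs, for illustration only.
key = derive_signing_key("example-secret", "20200524", "us-east-1", "s3")
signature = sign(key, "placeholder string to sign")
print(verify(key, "placeholder string to sign", signature))
print(verify(key, "tampered string to sign", signature))
```

Everything the verifier runs before it trusts the caller is hashing and byte comparison. No certificate parsing or structured-message decoding has to happen first.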
Spanner relies on assumptions about network reliability and clock reliability that simply aren't true ... and have now been borne out by at least one outage.
I am reminded of the marketing papers that say Spanner gets around the CAP theorem because the network will be close to perfect. cloud.google.com/blog/products/… … I am just going to use my outside voice when I roll my eyes from now on. 5/5
Inside Cloud Spanner and the CAP Theorem
cloud.google.com
You can build fancy atomic clocks in data-centers, and cool PTP networks to servers (we do all this too!) , but there's still quartz and clock-cycles where the TX actually happens, and the CAP theorem still holds about networks.
So you’re saying the CAP theorem throws a wrench into spanner?
more like FLP and reality i would say :P
Colm, thank you for sharing. I remember being impressed by many a google paper, including the bit about atomic clocks. It is absolutely fascinating to hear that there is more to the story than just the papers, and also how quickly Amazon moved in data infra.
fwiw we had a similar experience and went with a split L4/L7 design to get the performance and resiliency we wanted and expose just enough config to the user so they can get things done and not worry too much about HA, service discovery, etc
Our very own @williamsjoe is presenting GLB, the system that processes every user request to GitHub, and how it uses #HAProxy to provide intelligent routing, health checking, and observability, all with #ChatOps managed over @SlackHQ github.co/2VwrLot
FB also uses a split L4/L7 design. It seems to be the thing to do.
When you say probabilistic LB, do you mean "something fancier than hashing the flow" (like Maglev is fancier)? Or that the balancing should mostly be done at the application level and use something like round-robin? (Thanks for sharing your thoughts, always thought-provoking)
I think so - Maglev is not aware of anything about the workloads behaviour - it just takes in a model of the target endpoints (e.g. 10% 10% 10% 10% 60% for 4 smalls and a huge), and then spreads new connections in an arbitrary fashion such that in aggregate flows are distributed
in those fractions. OTOH my understanding is that traffic in Google is not terminated by Maglev - it isn't an equivalent to ALB/NLB - rather it is a terminate load balancer - load balancing across your NLB endpoints (for instance). In k8s it routes into a nodeport on every node
in the cluster for instance, which then will have a Service (iptables / nftables) picking that up to deliver to envoy for TLS termination, and from there to actual workloads.
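For readers who have not seen it, the lookup-table idea described above can be sketched in a few lines. This follows the table-population scheme from the Maglev paper as I understand it, with equal weights only (the uneven split in the example above is not modeled). The backend names, the SHA-256-based hash helper `_h`, and the tiny table size are assumptions for illustration.

```python
import hashlib


def _h(name: str, salt: str) -> int:
    # Hypothetical hash helper; any two independent hashes would do.
    return int.from_bytes(hashlib.sha256((salt + name).encode()).digest()[:8], "big")


def build_table(backends, size=13):
    """Fill a lookup table of prime `size` with backend indices."""
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [-1] * size
    filled = 0
    while True:
        # Backends take turns claiming the next free slot in their own
        # preference order, which keeps the shares nearly equal.
        for i in range(len(backends)):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] >= 0:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = i
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def pick(table, backends, flow: str) -> str:
    # The choice depends only on the flow identifier, not on backend load.
    return backends[table[_h(flow, "flow") % len(table)]]


BACKENDS = ["be-a", "be-b", "be-c"]
table = build_table(BACKENDS)
counts = [table.count(i) for i in range(len(BACKENDS))]
print(counts)
print(pick(table, BACKENDS, "10.0.0.1:4242->10.0.0.9:443"))
```

Because `pick` looks only at the flow identifier, any node holding the same table sends a given flow to the same backend without shared state. That is also why it cannot react to a slow or paused backend on its own.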
But back to the use cases. The options are: a) push fail-over/ recovery to the client (expose TLS terminator instance IP's to the exterior). b) Use point to point anycast - envelope hashing in routers or a NFV layer to pin sessions to a single TLS terminator. c) Sync TLS and IP
state between TLS terminators to permit packets to hit any instance. d) don't exceed a single terminators network / cpu bottlenecks for that service. The second use case maglev has is extending that same (b) abstraction globally, as detailed in the paper.
Anyhow, in my re-implementation of the Maglev whitepaper I eventually concluded that using Linux XDP and extending LVS with the Maglev hash would bring all its features in without reimplementing everything (though it was fun doing so).
Hope y'all didn't mind me jumping in - I do think there is merit to @colmmacc position on this too - I'm not saying maglev is best at all, just trying to articulate my understanding of the argument for it, after I did a fairly deep dive a few years ago.
Thanks for proving that not only Google engineers can be pretentious. (opinion mine, not my employer's)
One thing that you can salvage out of his tweet is that Amazon raised the meta-level of the whole industry with virtual-machines-as-a-service. Like there's plenty of discussion happening at higher levels of the stack, but VMaaS is truly the desired baseline in the whole industry.
... and their Graviton2 business is also very interesting to me: usable ARM workhorses. Whether that excuses a dig at Spanner because the network has a control-plane across regions? Your call ;) I remember twitting with a Xoogler who: a) remembers such planes fondly, and b) ...
... b) yeah, that's a trade-off? I think there's a good architecture discussion to be had there, and this “meh, it Just Doesn't Work” is not conducive to anything useful. Because you always have that global plane, once you consider human operators ;)
This is the subthread I'm replying to:
We rejected a Maglev-like design because probabilistic LB doesn't work for the vast majority of workloads. Most customers have only 2 LB targets, they're also often slow, and subject to garbage-collection pauses. Probabilistic LB increases utilization way too much.
It is fair to say they both have different objectives. I'll appeal to apples being compared to oranges.
How do you see the objectives of GCP and AWS being different?
The services are not like for like. It is my experience users of both services tend toward different use cases. Both tool chains came from eventual consistency and getting that click to be a sale. Maybe the end objective is customers but that doesn't tell the story.
+1 - I'd much rather be in the Amazon camp (and was). Engineering is a tool to solve business needs - not the end-all.
Disclaimer: I do not in fact believe amazon has mediocre engineering... but I’ll add, the ability to be successful building and operating amazon services with mediocre engineering would in fact be a major engineering feat... 1/2
Lemma: The best designed architectures are those that get along just fine with having mere mortals perform the implementations for their various components.
For me the difference is: engineers satisfaction vs customers satisfaction. Google probably has more smart engineers but they are making projects for their ego rather than for the customers.
Disclaimer: I work for AWS so not totally neutral here
What even are “the best engineers”?
Depends, right. Can be based on hacker rank or on most delivered products and so on. I guess each big company has their own definition of “the best engineer” and hires accordingly.
maybe covid has peaked my cynicism, but i have a hard time seeing this as more than shallow cred from some combo of industry hype "relevance" (e.g. k8s), jealousy, fart sniffing, and inflated egos.
Best reply 🏆
I've never heard that, but I don't buy into big generalizations, so probably wasn't paying attention 😅. I do perceive that they are good at different things and both have different values. Like I think of Google for machine learning and AWS for cloud infrastructure.
I think people are associating hiring bar difficulty with the engineering quality of the entire org. I’ve often heard that Amazon interviews are much easier than G/F
Yes they are. This is why I’m paid less and why everyone thinks I’m so stupid
Sorry to hear that :( I’d say as long as you focus on your ability to build and produce value (assuming engineering), your long term worth will surpass that of people who are optimizing for passing interview bars
It doesn’t matter I’ll never make that money back. Instead I’m now just an unsophisticated Amazon engineer that just makes $150k a year. I want to jump off a bridge daily because of this
I'm sure $150k plus RSUs.
No I’m including RSUs. I’m L4. The number comes directly from my personal compensation summary from this year that I got ~1 month ago.
Nice. Still beats the national average for a software engineer. Definitely for entry level. You don't want to know how little I made out of college in 2009 during the recession. But that's also my own fault for being too comfortable with where I was interning and not searching
I’ve worked too hard to be compared to the “average new grad engineer” that spent college playing league or smash or partying or whatever. I hustled hard and I’m getting absolutely fucked for it.
It's not just higher than the new grad average, it's higher than the national average for a software engineer's salary. Why did you accept the position if you're unhappy about the compensation?
It was the best offer I got. They were all in the 150k range. Only exception was citadel, which was $175k, 200k first year. I can’t deal with being such a career failure
Being too ahead of one's time is also a bane
Internally at Amzn, I think the feeling was a tradeoff between intellectual purity vs customer obsession. We're much more likely to deliver a more pragmatic, high-impact feature than something intellectually pure with less measurable impact.
Re: throwing money at the problem, that was more true some years than others. There were years when I felt there must be a limitless sea of hardware out there. Others I had to fight for every host. One year I had to present to a Distinguished Engineer our scaling plan for Q4 peak
I think this is the issue that engineering has in general. There is a divorced-from-reality view that engineers have, where they believe that all that matters is how technically clean a codebase is and what patterns are used. In reality customers just care about what they can use.
Which is not to say those things are not important, they are but in the end you aren't judged based on code cleanliness. A friend of mine worked at amazon in 01 and he explained how much of a mess the ama. code base was. Yet they obviously were killing it, even then.
My intention wasn’t to suggest Amzn didn’t value eng excellence. But there’s a difference btwn academically perfect and practically good enough. E.g., there are many systems at Amzn that weren’t built to handle retail website scale. Some started small and were rebuilt later.
Analytic systems are a good example. When I was on the A/B testing team, we kept our own copy of the clickstream data. So did personalization. So did ad targeting. Rather than a Big Central Data Store for high throughput analytics, each team built a small one perfectly tailored
But we all did code reviews for every change. We all tracked unit test coverage, we all had metrics in our systems and monitors to alert of failures. We wrote down designs and reviewed them thoroughly. There are a lot of tradeoffs that don’t sacrifice code quality.
No, I knew what you meant. It's simply that a lot of times engineering purity wags the dog. If you were doing a demo for a VC are tests necessary? No. It's just a demo. But the mentality some eng's have would make you think you are committing a sin.
If it was something critical to the demo- as in, it absolutely needs to work or I don’t get funding- I’d probably write a test. But otherwise, totally agree. TBH, too much testing hasn’t been the big problem in SW engineering IME. But I concede the point re: test coverage metrics
I think testing is really important for really important things. Banking, transactional stuff, etc. I worked at a company and they had a guy working on staff to write cucumber tests to verify pixel accuracy on the site. It was never right. So some things don't matter as much
Another possibility is that Amazon has a more well-developed pipeline for software engineering. Higher levels of division of labor, more high-level analysis of how the work gets distributed and handed off- developing a process that is less error-prone
In this scenario it wouldn't be a contradiction to say that Amazon does in fact work by "throwing money at the problem" and "getting by with mediocre engineering [talent, which is what I presume they mean]" since that's actually the strategy.
"getting by with mediocre engineering [talent]" meaning that Amazon tries to lower the amount of skill/knowledge needed to contribute to the process. "throwing money at the problem" by taking the now lowered bar and getting more people on the job.
The High Data Table. They allocate the yearly supply.
Google is probably the most arrogant self aggrandizing company around.
Seattle is packed with companies, including google, that came here to try to leech aws talent.
AWS engineering is pretty good. It's Amazon retail that has tech debt going back 20 years.
Underrated tweet
From what I recall 20 years ago they have some crazy templating language that you actually have to use comments to actually write code for. Something to that effect. It was crazy but they made it work somehow.
In the meantime, S3, EC2, DynamoDB etc. somehow just work at mind-boggling scale.
Superior engineering? Lmao. They need to resolve their network issues in their GCP product then I might consider that.
At end of day if cos. have good engineers they find innovative solutions to get round the “mediocre” engineering from cloud providers, each has its pros and cons. As an end user you rely heavily on wider adoption and references so being early has its benefits cf. Betamax vs VHS
I'm made a bit uncomfortable by ranking engineers like this. We're all human beings and every engineer I've worked with does their best and brings unique value to the table. You could create a team from the 6 "smartest" engis on the planet and they could still suck. 🤷
Memo from the dept of tech arrogance
The difference for me is that, when I deploy anything to Amazon, I know it will keep working, as is, forever. I won't need to think about it again until it's time to update my credit card details. With MS&Google, the service might be retired before my coffee pours the next day.
I literally ran a box on Amazon I had lost the credentials for, for 9 years, because it had a single script on it that was used by something else :) I couldn't even imagine that with a Microsoft or Google product.
Engineering is such a vast term. You can optimize for different aspects. Anecdotally, when I interviewed at Google, I was not asked a single non-algorithmic coding question. I was no fresh grad - a senior in a small shop. This tells me what Google optimizes for in hiring & product
One perspective on this: I was an Amazon intern after my sophomore year on a brand new team. (and like I actually had “two years of college and one prior internship” experience, not secretly coding since age 12 or something). Full-time new hires didn’t realize I wasn’t full-time!
You could spin this as “ha, they mustn’t have been hiring very good FTEs for an intern to be just as good”, but I think that’s wrong.
My experience was that the agile/sprint/project management framework was just really efficient. I was able to just start contributing *because* the tasks were extremely clear, I was managed very well, etc.
By contrast, it's basically understood at other companies that a new hire does nothing useful for their first year.
Google’s client facing teams are embarrassing. I would bet Amazon’s are at least serviceable.
It’s all PR spin IMO. Internally, different cultures force out different behaviors or ignore kinds of behaviors but most of that is surface level signalling. Rarely does it actually translate into actual “quality”. More on the PR aspect:
Coloring the Whole Egg: Fixing Integrated Marketing
Three kids are selling lemonade in their neighborhoods one hot day, to passers-by. Kid Red yells things like “The best lemonade in town!” Kid Green yells things like “Hey Joe, how…
ribbonfarm.com
I manage the biggest @amazon alumni group on @LinkedIn - happy to put you in touch with a network of 9000+ and growing.
There's an alum group? how come there's no fancy name like xooglers? nobody ever puts X-Amazon in their twitter bios!
Haven't seen code from either Google or Amazon so I may be completely off, I think engineers @ Amazon probably write code to use and maintain for a longer period of time, engineers @ Google write fancy code.
Amazon also seems to put a premium on internal collaboration - the story of bezos mandating that every team have an API comes to mind. Google is the engineering led company that overvalues source code and tech chops and undervalues everything else.
That's something you say to maintain your street cred when you can't compete. "Google" also said their engineers are too dumb to learn Java and that's why they started Go
My friends at Amazon often had less prestigious backgrounds than my Google or even Microsoft friends. I suspect some of the issue is Amazon bothers to mine worse schools and companies that Google would never deign to consider.
Google probably has a more "academic" culture. Remember Sun Microsystems?
I would not compare Google with Amazon so deliberately. Google is a technology company, Amazon is a service company. Amazon uses technology to provide value to customers, and it’s very good at doing it. Google built it (TensorFlow, Node, Go, Kubernetes, Android, Chrome, etc)
This kind of gels with a concept I’ve been thinking lately: there’s little to no correlation between business success and internal software engineering stack/tooling
Google hires very very smart people - maybe the smartest in the world, maybe not, but close enough. Once upon a time that was an advantage. Nowadays it just means they can tolerate way more complexity than anyone else would, and they pay that complexity tax all day, every day.
Also, smartest != best engineer. Smartest doesn’t have a globally accepted definition, but google has its definition, which isn’t synonymous with good engineering. That doesn’t even seem to be the goal. eg. they hire many researchers; some of the best, but not really engineers.
I’ve seen way too many “too smart by half” solutions over the years at every company I’ve been at. Usually by smart, inexperienced, engineers who haven’t paid the “complexity support tax”. I think everyone should maintain their code for a few years, to learn about that tax.
Super-chicken experiment.
Are you sure that complexity comes from hiring smart people? I've seen unbelievable complexity in much smaller places that run at much smaller scales. Maybe scale is a more significant factor here.
One problem is that Google’s reputation for being so smart leads small companies to blindly copy their architectures in situations where it’s not remotely justified. That kind of copycat complexity is a different skill than inventing it from first principles, I guess? :)