
If you're wondering what "P-four-nines" means, it's the latency at the 99.99th percentile, meaning only one in 10,000 requests has a worse latency. Why do we measure latency in percentiles? A thread about how it came to be at Amazon...
Was in a conversation about some service refactoring and the guy said “mean latency was a bit worse, but our P-four-nines was rock-solid and only twice the mean. Customer loved it.” #CloudThinking
156 replies and sub-replies as of May 30, 2019

In 2001, I was managing the Performance Engineering team. We were responsible for the performance of the website, and we were frustrated. We were a few engineers fighting against the performance entropy of hundreds of developers adding features to the web site.
Latency kept going up, but we had a big problem: how to convince people to care. The feature teams had web labs that could show "if we add feature X to the page, we get Y more revenue."
If that feature added 100ms to page latency and our only counter was "but latency is bad," our argument didn't carry the day. So we had to figure out how to make our argument stronger.
On March 21 and 22, 2001, our Oracle databases gave us a gift! At the time, the Amazon website was backed by seven "web" databases that were responsible for keeping track of the web activity, such as the contents of the customer's shopping cart.
On those dates, one of the databases, "web7" was having some weird locking problems. Every so often, when the website tried to query for the customer's data, it would pause for a few seconds. While this made life miserable for our customers, it was an opportunity for us!
Visitors to the website were randomly sharded across these databases, making this a great natural experiment. So I did some log diving! I broke down our page serve times a bunch of different ways, but for this thread, I'll just look at one, which is detail page latency.
On the Amazon web site, the "detail" page is the product page. The page that contains a single product with an add-to-cart button. We serve a lot of those pages and customers visit them for roughly the same reasons, so they were a good data set for me.
Here's a graph I made. The green is a histogram of number of pages served at a given latency. So, for example, we served about 2000 pages in that hour at 250ms latency, and about 8000 at 1s. Note that the x axis is log scale.
(Same graph). The red is the "abandonment rate." It's the ratio of pages for which that page was the last page the customer visited. So you can see that mostly every time we served a detail page, there was about a 20% chance that it would be the last page in a session.
(Same graph). The rate slopes down! Latency is good for keeping people on the website! Probably not, but it illustrates what I was up against. We knew that products with more reviews, more images, etc. sold better, and those things took more latency to serve.
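To make the analysis concrete, here is a minimal sketch of the kind of log diving described above. It is not the original Perl/gnuplot tooling; the record layout (session id, timestamp, latency in ms) and the function name are assumptions for illustration only.

```python
import math
from collections import defaultdict

def abandonment_by_latency(records, bucket_factor=2.0):
    """Bucket page serves by latency (log-scale buckets) and report, per bucket,
    the page count and the fraction of pages that were the last page of their
    session (the 'abandonment rate')."""
    # Last page-serve timestamp per session; a page counts as an abandonment
    # if it was the final page of its session.
    last_ts = {}
    for session_id, ts, _ in records:
        if ts > last_ts.get(session_id, float("-inf")):
            last_ts[session_id] = ts

    counts = defaultdict(int)
    abandons = defaultdict(int)
    for session_id, ts, latency_ms in records:
        # Exponential bucketing gives the log-scale x axis described above.
        bucket = math.floor(math.log(max(latency_ms, 1), bucket_factor))
        counts[bucket] += 1
        if ts == last_ts[session_id]:
            abandons[bucket] += 1

    # Map each bucket's lower bound (in ms) to (pages served, abandonment rate).
    return {bucket_factor ** b: (counts[b], abandons[b] / counts[b])
            for b in sorted(counts)}
```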
Here's the graph for that same time period for the customers being served by web7. That secondary hump in the histogram is the locking problems. The latency additions were effectively randomly distributed among the page loads (not correlated to customer).
(Same graph). So each time a customer who happened to be mapped to web7 tried to load a page, there was a 20-25% chance that it would be served with an extra ~8-16s of latency. And obviously the abandon rate was much higher for pages served at that latency.
But the really scary thing to me was comparing the "normal" latency sections below. What we saw was that a customer was much more likely to abandon the site even for normal latencies after tolerating a slow page.
The abandon rate went from ~18% at 2s latency to ~32%. That's a lot of pissed off customers! Extra latency for better content went from being an attractor to a detractor because of the customer's prior page serve times!
From this, we decided that we (the performance engineering team) would not fight for better average latency, but for a maximum latency budget. We didn't know exactly what the right answer was, but we knew it was somewhere between 4 and 8 seconds. So we picked 6.
And we knew that we couldn't start off with an edict that no page could *ever* be served with more than 6s of latency. Instead, we thought about probabilities that a customer would have a bad session, which led us to set our SLAs at percentiles.
If a session was 10 pages on average and we set our SLA to say that no more than 1 in 10,000 pages would have a bad latency, then we would know that no more than 1/1000 customers would have a bad session. In shorthand, the p99.99 of latency needed to be <6s.
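A quick worked check of that arithmetic, under the same deliberately simple assumption that each page's latency is an independent draw (the thread later notes why this model is only approximate):

```python
pages_per_session = 10
p_bad_page = 1 / 10_000   # SLA: at most 1 in 10,000 pages over the latency budget

p_bad_session = 1 - (1 - p_bad_page) ** pages_per_session
print(p_bad_session)      # ~0.000999, i.e. no more than about 1 in 1,000 sessions
```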
Armed with these graphs and this reasoning, I went to the VPs who owned the various web pages and argued that they needed to set these SLAs. In parallel, we (the perf engr) team built tools that made it really easy for developers to measure the latency of their pages.
That system, called PMET, let developers put little "start" and "stop" indicators in their code and then our system would scrape the logs and store the latency histograms in a database. If their page wasn't hitting SLA, they could drill down and figure out why.
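PMET was an internal tool, so the sketch below is only a generic illustration of the start/stop pattern described above, not PMET's actual API; the `measure` helper and the log format are hypothetical.

```python
import time
from contextlib import contextmanager

@contextmanager
def measure(metric_name, log):
    """Emit a start/stop record pair around a block of code."""
    log.append((metric_name, "start", time.monotonic()))
    try:
        yield
    finally:
        # A separate aggregator would scrape records like these from logs
        # and roll them up into per-metric latency histograms.
        log.append((metric_name, "stop", time.monotonic()))

records = []
with measure("detail_page.render", records):
    time.sleep(0.01)   # stand-in for real page-rendering work
```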
A colleague wrote the prototype service for collecting and aggregating the data, and I wrote a simple visualization tool (using Perl and gnuplot). The rest is history!
All things considered, it's probably the most-impactful thing I have done in my 20 years at Amazon, and I did it in my first few years! That focus on high percentiles instead of averages has driven so much good behavior.
When James Hamilton started in AWS, we were chatting about PMET. When he heard that I had been one of the creators, he told me that when he read the Dynamo paper, the thing that had the biggest impact on his thinking from the paper was that we were focused on percentile latency.
If you work at Amazon, you can hear me talk about this in an old PoA talk. If you search broadcast for "andrew certain gems" it should be the only hit (it's the first ten minutes of that video). That's it!
PS. This outage and many others like it were the driving force behind developing Dynamo and getting off of relational databases. So lots of change happened because of the problems with these databases!
PPS. This thread is getting *a lot* of attention. Lots more than if it were just me yelling at you to "use percentiles." Because humans love stories. And I agree it's one of the good things about Twitter: that we can hear stories from people we don't know.
But as useful as my story might be, it's not nearly as important as stories that help us remember our shared humanity. No matter who you are, but especially if you're a cis white guy in tech like me, I encourage you to follow people different from you and listen to their stories.
It can be challenging sometimes just to listen, especially when the stories express anger about people like you. But I encourage you to try, to hear what they are saying, and to remember that we're all just trying to be OK in the world.
And if you are a cis white dude in tech, I know that you can also be fighting internal battles, getting dumped on by the world, and feeling pain. I'm interested in your stories too. Someday I may have the courage to share some of my deeper vulnerabilities here.
❤️❤️❤️ ace writeup!
Thanks, dude! 💚💚
Well, you're my friend! It says so right there in your handle! :)
Thank you for this.
You just quintupled the quality of my Twitterverse with just this. Thanks.
1) Great write up 2) now I’m going to lose a morning down the rabbit hole of a dozen new perspectives. Thanks for both
I was here for the tech story from a time when I had big dreams but was running a network of just 85 systems at a small company in a small town, a bit jealous of where you were in 2001, and then I saw this. Thank you for using your platform and privilege to add this.
Very cool. Always love learning the back stories to tools and techniques that it’s easy to take for granted
Thank you for this thread. Selling performance budgets and improvements amid the drive for features is hard enough now; I can only imagine what it was like 18 years ago.
I think the key thing we realized was that we would never win fighting individual battles around features. We needed to set a standard and let the business owners make their own decisions about how to get there.
Agreed, trying local changes has a low ceiling. There's a genius touch in both discovering the standard and extending it to both business (conversion and session metrics) and engineering (Dynamo and non-relational design).
Thanks for the background! When I joined Amazon in 2008, moving from an average mindset to a P99.99 one made me a much better developer overall and has positively impacted every project I have undertaken since.
Very cool thread. I think what amazes me most is the fact that you still have those graphs from back then.
I squirrelled them away a long time ago. I'm really glad I did. Part of the reason I had them is because I had to do them all by hand with perl and gnuplot, so they were all sitting in my home directory when I went to look for them years later.
Nice thread, thanks. Pushing for a similar mentality shift in games: frame latency as a quality metric via the 99th percentile of frame time, vs. average frame time (ms/F) and, most popular today, [shudder] average FPS...
This was a great story and I’m going to look up the PoA talk tomorrow. It’s also a great insight to how data-driven we are and why that’s important. Thanks for sharing it!
I'll have to find and add that video to the monitoring bootcamp! One set of stats that PMET didn't and CloudWatch doesn't (yet) offer is 'trimmed mean' stats; I'm wondering if you'd ever considered them? Seems like they'd complement percentiles for use cases like observing page latency.
I hadn't. I'm very self-taught when it comes to statistics, so didn't have many tools in my toolbox! 😁
Me too, but @JimRoskind sold me on trimmed means at the offsite last week. Added to my list of hackathon ideas.
Related: overstats are an underappreciated gem!
They are, but I haven't seen anybody implement them well. Presumably because they are computationally expensive for non-hard-coded values?
Calculating overstats using an exponentially bucketed histogram (with bounded error) isn't too hard. O(N) in the number of buckets.
CloudWatch supports this, using Simple Exponentiating Histograms (SEH). It's awesome, though I wish it did non-linear interpolation within the buckets, but that's a quibble.
Fun fact: SEHs are CRDTs, and overstat calculation is monotonic, so can be calculated exactly in parallel with no coordination. Percentile calculation isn't monotonic, so needs some coordination.
Is that the same algorithm that hdrhistogram uses?
IIRC, hdrhistogram uses a top level sparse exponential histogram, with a second level of linear buckets inside each exponential bucket. SEH is just the first level.
What are overstats? (I tried googling for the term, but it's buried in overwatch results.)
A simple count of the number of data points over a given value. Eg the 1ms overstat counts every data point over 1ms.
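To make the overstat and exponential-bucketing ideas from the replies above concrete, here is a minimal sketch. It is not CloudWatch's SEH or hdrhistogram's implementation; the class and method names are illustrative. The merge method also shows why, as noted above, bucket counts can be combined across hosts with no coordination.

```python
import math
from collections import Counter

class ExpHistogram:
    """Exponentially bucketed histogram: bucket boundaries grow geometrically,
    so relative error is bounded by the bucket growth factor."""

    def __init__(self, factor=1.1):
        self.factor = factor
        self.buckets = Counter()          # bucket index -> sample count

    def _bucket(self, value):
        return math.floor(math.log(max(value, 1e-9), self.factor))

    def record(self, value):
        self.buckets[self._bucket(value)] += 1

    def overstat(self, threshold):
        # Count of samples above the threshold, accurate to within one bucket.
        b = self._bucket(threshold)
        return sum(n for idx, n in self.buckets.items() if idx > b)

    def merge(self, other):
        # Bucket counts simply add, so histograms from many hosts can be
        # combined in any order with no coordination.
        self.buckets.update(other.buckets)
```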
Wow, thanks for sharing! Fascinating to learn how complex systems evolve over multiple decades and how good tooling helps humans make progressive improvements and develop a sense of ownership 👍
THANK you. This was likely the most interesting thing I've learnt this week!
Any chance that video could be shared more widely?
Honestly, it's just ten minutes of me talking through these exact graphs. I'll see if there's anything I can do, though.
I was at Amazon a few years after that and hence experienced the consequences of this, the use of PMET, etc. Years later, moving on to Skype, similar experiments repeated the pattern of impact on purchasing. The impact of this goes further than Amazon! 😊
I loved gnuplot! That was the workhorse of my PhD. There was even a bit of perl. But mostly awk and bash.
Wow, you folks basically created rudimentary distributed tracing.
It's ironic that PMET has the most horrible web UI I've ever seen...
One of the most important points: give the developers a WHAT (decrease the latency), a WHY (direct impact on business) and the magic: a HOW TO CHECK (use this tool we have for you to test and monitor). Simply amazing!
Six seconds! Hello 2001! 😎
Most of our customers were still on dial-up! I also looked at page size in the analysis to make sure it was not confounding things!
Hah! Great job on that whole thread ... thanks for this.
Okay this is closer to my graphs.
It's seriously one of the most-useful tools in an operator's toolbox.
BTW, thanks for your work on driving awareness around high-percentile latencies in Amazon! That was where I first encountered the concept, and it has been very useful for me ever since.
That is a hell of a lognormal distribution though, mine don't look anything that good.
Lol wow so you're saying an Oracle outage basically started Amazon on the path to where it is now. good stuff
In fairness, it was way more than one...
100ms is bad. 150ns timings were MAXIMAL in the 1980s, and if I hit them, that made me sad. When I read your BS doing years later public triage of your for profit BS? It just makes me angry and mad.
Hundreds of developers? Therein lies the problem. How many good engineers does it take to run a train? Not hundreds, that way lies misery, stupidity, and you should harbor deserved shame.
Best leg of the thread: "frustrated by website performance in 2001." I just can't fathom that period of the Internet.
Thanks for sharing. The last two years have been an amazing learning experience for me, learning P50, P90, P99, and 4 nines and 5 nines. You hate them at first; as you come to understand them, you will love them.
If one in ten thousand requests have that latency, and every visitor does 100 requests, how many visitors actually see the P-four-nines latency?
While I used that as a very simple model, it doesn't necessarily get you super far in answering your question because it's unlikely that the model of "everybody sees 100 pages whose latencies are randomly drawn from the distribution" is accurate enough. For example,
Once you hit a bad latency, you're not likely to continue, so the population of users that get to 100 pages is not going to be the same as all users. Also many latencies are driven by customers, so while my data was uncorrelated with user, most aren't.
I understood only a few things, but I liked your post.
Sure, there is correlation. These days, 1 page is a lot more than 1 request, though! More importantly, I'm of the only-slightly-exaggerated opinion that "most of your customers will see your worst latency, so P-many-nines is the most important metric."
I think your thread was great, btw. Thanks for sharing! (And for remembering to preserve the graphs back then ...)
I just got lucky. I had to make them by hand, so they were sitting in my old home directory years ago when they were going away.
The math is scary. If L is the 99th percentile latency for a page and a single user hits 100 pages, only 37% of users will experience a consistently sub-L experience. To bring that up to 99%, L needs to be your 99.99th percentile SLO. See blog.golang.org/ismmkeynote
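Those numbers check out under the stated assumption of 100 independent page loads per user:

```python
print(0.99   ** 100)   # ~0.366: only ~37% stay under the p99 latency on every page
print(0.9999 ** 100)   # ~0.990: a p99.99 target keeps ~99% of users under L throughout
```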
So so interesting ... thanks for sharing.
Amazing story, thank you for sharing!
That is a way to attempt to ape the BS from AT&T "five nines" wake. I've gotten businesses from 99. to 99.9 and even beyond AT&T's horrid uptimes since that monopolistic clam bake.
If it were a normal distribution (but you can't have latency less than zero, so...) P-four-nines is very slightly less than 4-sigma (99.994%) but this is something you can measure fairly easily, and doesn't depend on the shape of the curve.
Affirmative. It is easy to measure @ATT's downtime when they have lots of media coverage about fiber cuts predating as a warning to a strike, e.g. nbcbayarea.com/news/local/Sou… During that time? In 2009? If you were in SF and an AT&T subscriber you couldn't even make a phone call.
$250K Reward Out for Vandals Who Cut AT&T Lines
AT&T upped the reward for information leading to the arrest of whoever is responsible for yesterday's massive cell phone outage to $250,000.
nbcbayarea.com
That's not all! The employer for whom I worked in 2009? We lost 1, but only ONE of our transit providers, the other? A-OK fine! You need to get beyond @ATT & Bell Labs and even beyond NANOG and silicon fabs. To know real network gurus. I've already worked with them, I knows.
You maybe know of this. I got my internet access through Demon in the UK. They had independent transatlantic cable routes to the USA. Trouble was, the two distinct pieces of wet string used the same bridge to cross a river in the USA, and it collapsed in a flood. (Last century.)
The UK and the USA will never see eye to eye! Colonial rejects, the monarchy? On it we shall never rely. But yes, bridges collapse in the 'Murica often. If you're in SF it may take > 2 decades to mend.
Well, if you want to take it like that, but I thought the key point was that the independence of the connections turned out to be an illusion. Different companies, different physical cables, but depending on the same bridge.
Which means you completely failed to read what I wrote. In 2009, when AT&T was down? My network was still passing packets, we understood BGP and don't taunt work forces threatening strikes to pound. I shall FOREVER hate. @ATT and those who deceive & charge make me irate.
That graph with the double peak doesn't exclude a normal distribution; it could be two different curves summed together, but the P-four-nines still has a clear meaning. And, tbh, I had to look up the number for 4-sigma. The double peak I would take as a warning of problems.
All the nomenclature is more or less BS meant to create: An extremely compliant workforce that at a drop of a hat, anyone can unpredictably expect a potential job terminate. Makes you feel helpless and irate? That golem doesn't care, upon its slaves it shall forever predate.
It’s entirely possible to use latency and abandon stats appropriately (arguably—they’re limited proxies for true value), but stack ranking is proof management will make up their own to justify misbehavior anyway if they’re so inclined so it’s senseless to hide data from yourself.
The real WTF is expecting to be able to top-down manage 5000+ people in any useful way in the first place. As far as I can tell it’s beyond human capacity.
Affirmative. In my experience, a team gets unwieldy with four or more. A company? Gets unwieldy once you go beyond 40 or more. Beyond that? It is misery galore.
These numbers fit modern infantry combat, 4 men in a fireteam, while 40 is at the high end of a platoon, the command for a single relatively new officer (or the sergeant who usually knows better what is going on).
I concur, there is too much overlap between businesses & militaries. Even if you are in any armed force? Which is horrid, and you should quit, and file for conscious objector status or if necessary a divorce. Still be better to be in a small ops team. If you know what I mean?
Those numbers are part of a general human pattern, with the Dunbar Number at the top end. The military puts some special pressure on the limits, smaller groups than you might find in a hunter-gatherer tribe, which can be smaller groups than in farming communities.
I am anti-military. Guerrilla 民間人. The way militarization? Post Prussian indoctrination? The military BLEEDS OVER into everything. It makes me want every uniform buried, they sting.
I persist in latency and incoherence. Against all currencies, there is no defense. Fight with all your might. Only the poor will fathom the real plight.
Compartmentalizing data, is mandatory to quantization. Otherwise how would be determine an AIFF from a PNG for differentiation? There is no "proof" in the difference as holding social standings. They're just datatypes, which are designed for file processing handlings.
This is so fascinating, thanks for writing this up! What a cool story and such impactful work :)
Interesting. I have a setup where I now show average latency open-bank.gklijs.tech/average-latenc… and max latency open-bank.gklijs.tech/max-latency.ht… I hadn't really realized the second one might be much more important.
Noooo... Max is less important, or rather, max alone is less important. What matters is the frequency of your worst response times, along with how bad they were compared to good responses.
Is there newer data on this? 2001 is ancient history of online shopping. :-)
And how do you define "page latency" in the age of lazy loading? Time to Start Render? Time to Interaction? Total Load Time?
Lazy loading is not new; "total load time" has historically been called "completion time" and is explicitly not the same as "latency". cf ics.uci.edu/~fielding/pubs…
Using that definition of latency, "first indication of a response", it would seem impossible to measure latency from the server side because it depends on available client-side resources.
Very well explained, thanks for sharing.
Thanks for sharing!
Amazing story! Thank you for sharing.
Related aside from the psychology literature: mean latency is correlated with the variance of skewed distributions. To measure cognitive effects, mean percentile differences are far more effective, stable, and scale-invariant.
It's fascinating how PMET and so many other things at Amazon started out as nifty tools and now each of them is an AWS service of sorts :)
you will probably like this "story"
Because averages lie to you and p95 makes sense.
I longingly remember the early days of your performance team, when no one had ever heard of P90!
Thanks for sharing! A great story and accomplishment.
Good read on SLAs and percentiles