
If you're wondering what "P-four-nines" means, it's the latency at the 99.99th percentile, meaning only one in 10,000 requests has a worse latency. Why do we measure latency in percentiles? A thread about how it came to be at Amazon...
Was in a conversation about some service refactoring and the guy said “mean latency was a bit worse, but our P-four-nines was rock-solid and only twice the mean. Customer loved it.” #CloudThinking
156 replies and sub-replies as of May 30, 2019

In 2001, I was managing the Performance Engineering team. We were responsible for the performance of the website, and we were frustrated. We were a few engineers fighting against the performance entropy of hundreds of developers adding features to the web site.
Latency kept going up, but we had a big problem: how to convince people to care. The feature teams had web labs that could show "if we add feature X to the page, we get Y more revenue."
If that feature added 100ms to page latency and our only counter was "but latency is bad," our argument didn't carry the day. So we had to figure out how to make our argument stronger.
On March 21 and 22, 2001, our Oracle databases gave us a gift! At the time, the Amazon website was backed by seven "web" databases that were responsible for keeping track of the web activity, such as the contents of the customer's shopping cart.
On those dates, one of the databases, "web7" was having some weird locking problems. Every so often, when the website tried to query for the customer's data, it would pause for a few seconds. While this made life miserable for our customers, it was an opportunity for us!
Visitors to the website were randomly sharded across these databases, making this a great natural experiment. So I did some log diving! I broke down our page serve times a bunch of different ways, but for this thread, I'll just look at one, which is detail page latency.
On the Amazon web site, the "detail" page is the product page. The page that contains a single product with an add-to-cart button. We serve a lot of those pages and customers visit them for roughly the same reasons, so they were a good data set for me.
Here's a graph I made. The green is a histogram of number of pages served at a given latency. So, for example, we served about 2000 pages in that hour at 250ms latency, and about 8000 at 1s. Note that the x axis is log scale.
(Same graph). The red is the "abandonment rate." It's the ratio of pages for which that page was the last page the customer visited. So you can see that mostly every time we served a detail page, there was about a 20% chance that it would be the last page in a session.
(Same graph). The rate slopes down! Latency is good for keeping people on the website! Probably not, but it illustrates what I was up against. We knew that products with more reviews, more images, etc. sold better, and those things took more latency to serve.
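To make the analysis concrete, here is a minimal sketch of the kind of log diving described above. It is not the original Perl/gnuplot tooling; the record layout (session id, timestamp, latency in ms) and the function name are assumptions for illustration only.

```python
import math
from collections import defaultdict

def abandonment_by_latency(records, bucket_factor=2.0):
    """Bucket page serves by latency (log-scale buckets) and report, per bucket,
    the page count and the fraction of pages that were the last page of their
    session (the 'abandonment rate')."""
    # Last page-serve timestamp per session; a page counts as an abandonment
    # if it was the final page of its session.
    last_ts = {}
    for session_id, ts, _ in records:
        if ts > last_ts.get(session_id, float("-inf")):
            last_ts[session_id] = ts

    counts = defaultdict(int)
    abandons = defaultdict(int)
    for session_id, ts, latency_ms in records:
        # Exponential bucketing gives the log-scale x axis described above.
        bucket = math.floor(math.log(max(latency_ms, 1), bucket_factor))
        counts[bucket] += 1
        if ts == last_ts[session_id]:
            abandons[bucket] += 1

    # Map each bucket's lower bound (in ms) to (pages served, abandonment rate).
    return {bucket_factor ** b: (counts[b], abandons[b] / counts[b])
            for b in sorted(counts)}
```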
Here's the graph for that same time period for the customers being served by web7. That secondary hump in the histogram is the locking problems. The latency additions were effectively randomly distributed among the page loads (not correlated to customer).
(Same graph). So each time a customer who happened to be mapped to web7 tried to load a page, there was a 20-25% chance that it would be served with an extra ~8-16s of latency. And obviously the abandon rate was much higher for pages served at that latency.
But the really scary thing to me was comparing the "normal" latency sections below. What we saw was that a customer was much more likely to abandon the site even for normal latencies after tolerating a slow page.
The abandon rate went from ~18% at 2s latency to ~32%. That's a lot of pissed off customers! Extra latency for better content went from being an attractor to a detractor because of the customer's prior page serve times!
From this, we decided that we (the performance engineering team) would not fight for better average latency, but for a maximum latency budget. We didn't know exactly what the right answer was, but we knew it was somewhere between 4 and 8 seconds. So we picked 6.
And we knew that we couldn't start off with an edict that no page could *ever* be served with more than 6s of latency. Instead, we thought about probabilities that a customer would have a bad session, which led us to set our SLAs at percentiles.
If a session was 10 pages on average and we set our SLA to say that no more than 1 in 10,000 pages would have a bad latency, then we would know that no more than 1/1000 customers would have a bad session. In shorthand, the p99.99 of latency needed to be <6s.
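A quick worked check of that arithmetic, under the same deliberately simple assumption that each page's latency is an independent draw (the thread later notes why this model is only approximate):

```python
pages_per_session = 10
p_bad_page = 1 / 10_000   # SLA: at most 1 in 10,000 pages over the latency budget

p_bad_session = 1 - (1 - p_bad_page) ** pages_per_session
print(p_bad_session)      # ~0.000999, i.e. no more than about 1 in 1,000 sessions
```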
Armed with these graphs and this reasoning, I went to the VPs who owned the various web pages and argued that they needed to set these SLAs. In parallel, we (the perf engr) team built tools that made it really easy for developers to measure the latency of their pages.
That system, called PMET, let developers put little "start" and "stop" indicators in their code and then our system would scrape the logs and store the latency histograms in a database. If their page wasn't hitting SLA, they could drill down and figure out why.
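PMET was an internal tool, so the sketch below is only a generic illustration of the start/stop pattern described above, not PMET's actual API; the `measure` helper and the log format are hypothetical.

```python
import time
from contextlib import contextmanager

@contextmanager
def measure(metric_name, log):
    """Emit a start/stop record pair around a block of code."""
    log.append((metric_name, "start", time.monotonic()))
    try:
        yield
    finally:
        # A separate aggregator would scrape records like these from logs
        # and roll them up into per-metric latency histograms.
        log.append((metric_name, "stop", time.monotonic()))

records = []
with measure("detail_page.render", records):
    time.sleep(0.01)   # stand-in for real page-rendering work
```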
A colleague wrote the prototype service for collecting and aggregating the data, and I wrote a simple visualization tool (using Perl and gnuplot). The rest is history!
All things considered, it's probably the most-impactful thing I have done in my 20 years at Amazon, and I did it in my first few years! That focus on high percentiles instead of averages has driven so much good behavior.
When James Hamilton started in AWS, we were chatting about PMET. When he heard that I had been one of the creators, he told me that when he read the Dynamo paper, the thing that had the biggest impact on his thinking from the paper was that we were focused on percentile latency.
If you work at Amazon, you can hear me talk about this in an old PoA talk. If you search broadcast for "andrew certain gems" it should be the only hit (it's the first ten minutes of that video). That's it!
PS. This outage and many others like it were the driving force behind developing Dynamo and getting off of relational databases. So lots of change happened because of the problems with these databases!
PPS. This thread is getting *a lot* of attention. Lots more than if it were just me yelling at you to "use percentiles." Because humans love stories. And I agree it's one of the good things about Twitter: that we can hear stories from people we don't know.
But as useful as my story might be, it's not nearly as important as stories that help us remember our shared humanity. No matter who you are, but especially if you're a cis white guy in tech like me, I encourage you to follow people different from you and listen to their stories.
It can be challenging sometimes just to listen, especially when the stories express anger about people like you. But I encourage you to try, to hear what they are saying, and to remember that we're all just trying to be OK in the world.
And if you are a cis white dude in tech, I know that you can also be fighting internal battles, getting dumped on by the world, and feeling pain. I'm interested in your stories too. Someday I may have the courage to share some of my deeper vulnerabilities here.
❤️❤️❤️ ace writeup!
Thanks, dude! 💚💚
Well, you're my friend! It says so right there in your handle! :)
Thank you for this.
You just quintupled the quality of my Twitterverse with just this. Thanks.
1) Great write up 2) now I’m going to lose a morning down the rabbit hole of a dozen new perspectives. Thanks for both
I was here for the tech story from a time when I had big dreams but was running a network of just 85 systems at a small company in a small town, a bit jealous of where you were in 2001, and then I saw this. Thank you for using your platform and privilege to add this.
Very cool. Always love learning the back stories to tools and techniques that it’s easy to take for granted
Thank you for this thread. Selling performance budgets and improvements amid the drive for features is hard enough now; I can only imagine what it was like 18 years ago.
I think the key thing we realized was that we would never win fighting individual battles around features. We needed to set a standard and let the business owners make their own decisions about how to get there.
Agreed, trying local changes has a low ceiling. There's a genius touch in both discovering the standard and extending it to both business (conversion and session metrics) and engineering (Dynamo and non-relational design).
Thanks for the background! When I joined Amazon in 2008, moving from an average mindset to a P99.99 one made me a much better developer overall and has positively impacted every project I have undertaken since.
Very cool thread. I think what amazes me most is the fact that you still have those graphs from back then.
I squirrelled them away a long time ago. I'm really glad I did. Part of the reason I had them is because I had to do them all by hand with perl and gnuplot, so they were all sitting in my home directory when I went to look for them years later.
Nice thread, thanks. Pushing for a similar mentality shift in games: frame latency as a quality metric via the 99th percentile of frame time, vs. average frame time (ms/F) and, most popular today, [shudder] average FPS...
This was a great story and I’m going to look up the PoA talk tomorrow. It’s also a great insight to how data-driven we are and why that’s important. Thanks for sharing it!
I'll have to find and add that video to the monitoring bootcamp! One set of stats that PMET didn't and CloudWatch doesn't (yet) offer is 'trimmed mean' stats; I'm wondering if you'd ever considered them? Seems like they'd complement percentiles for use cases like observing page latency.
I hadn't. I'm very self-taught when it comes to statistics, so didn't have many tools in my toolbox! 😁
Me too, but @JimRoskind sold me on trimmed means at the offsite last week. Added to my list of hackathon ideas.
Related: overstats are an underappreciated gem!
They are, but I haven't seen anybody implement them well. Presumably because they are computationally expensive for non-hard-coded values?
Calculating overstats using an exponentially bucketed histogram (with bounded error) isn't too hard. O(N) in the number of buckets.
CloudWatch supports this, using Simple Exponentiating Histograms (SEH). It's awesome, though I wish it did non-linear interpolation within the buckets, but that's a quibble.
Fun fact: SEHs are CRDTs, and overstat calculation is monotonic, so can be calculated exactly in parallel with no coordination. Percentile calculation isn't monotonic, so needs some coordination.
Is that the same algorithm that hdrhistogram uses?
IIRC, hdrhistogram uses a top level sparse exponential histogram, with a second level of linear buckets inside each exponential bucket. SEH is just the first level.
What are overstats? (I tried googling for the term, but it's buried in overwatch results.)
A simple count of the number of data points over a given value. Eg the 1ms overstat counts every data point over 1ms.
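To make the overstat and exponential-bucketing ideas from the replies above concrete, here is a minimal sketch. It is not CloudWatch's SEH or hdrhistogram's implementation; the class and method names are illustrative. The merge method also shows why, as noted above, bucket counts can be combined across hosts with no coordination.

```python
import math
from collections import Counter

class ExpHistogram:
    """Exponentially bucketed histogram: bucket boundaries grow geometrically,
    so relative error is bounded by the bucket growth factor."""

    def __init__(self, factor=1.1):
        self.factor = factor
        self.buckets = Counter()          # bucket index -> sample count

    def _bucket(self, value):
        return math.floor(math.log(max(value, 1e-9), self.factor))

    def record(self, value):
        self.buckets[self._bucket(value)] += 1

    def overstat(self, threshold):
        # Count of samples above the threshold, accurate to within one bucket.
        b = self._bucket(threshold)
        return sum(n for idx, n in self.buckets.items() if idx > b)

    def merge(self, other):
        # Bucket counts simply add, so histograms from many hosts can be
        # combined in any order with no coordination.
        self.buckets.update(other.buckets)
```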
Wow, thanks for sharing! Fascinating to learn how complex systems evolve over multiple decades and how good tooling helps humans make progressive improvements and develop a sense of ownership 👍
THANK you. This was likely the most interesting thing I've learnt this week!
Any chance that video could be shared more widely?
Honestly, it's just ten minutes of me talking through these exact graphs. I'll see if there's anything I can do, though.
I was at Amazon a few years after that and hence experienced the consequences of this, the use of PMET, etc. Years later, moving on to Skype, similar experiments repeated the pattern of impact on purchasing. The impact of this goes further than Amazon! 😊
I loved gnuplot! That was the workhorse of my PhD. There was even a bit of perl. But mostly awk and bash.
Wow, you folks basically created rudimentary distributed tracing.
It's ironic that PMET has the most horrible web UI I've ever seen...
One of the most important points: give the developers a WHAT (decrease the latency), a WHY (direct impact on business) and the magic: a HOW TO CHECK (use this tool we have for you to test and monitor). Simply amazing!
Six seconds! Hello 2001! 😎
Most of our customers were still on dial-up! I also looked at page size in the analysis to make sure it was not confounding things!
Hah! Great job on that whole thread ... thanks for this.
Okay this is closer to my graphs.
It's seriously one of the most-useful tools in an operator's toolbox.
BTW, thanks for your work on driving awareness around high-percentile latencies in Amazon! That was where I first encountered the concept, and it has been very useful for me ever since.
That is a hell of a lognormal distribution though, mine don't look anything that good.
Lol wow so you're saying an Oracle outage basically started Amazon on the path to where it is now. good stuff
In fairness, it was way more than one...
100ms is bad. 150ns timings were MAXIMAL in the 1980s, and if I hit them, that made me sad. When I read your BS doing years later public triage of your for profit BS? It just makes me angry and mad.
Hundreds of developers? Therein lies the problem. How many good engineers does it take to run a train? Not hundreds, that way lies misery, stupidity, and you should harbor deserved shame.
Best leg of the thread: "frustrated by website performance in 2001." I just can't fathom that period of the Internet.
Thanks for sharing. The last two years have been an amazing learning experience for me, learning P50, P90, P99, and 4 nines and 5 nines. You hate them at first; as you come to understand them, you will love them.
If one in ten thousand requests have that latency, and every visitor does 100 requests, how many visitors actually see the P-four-nines latency?
While I used that as a very simple model, it doesn't necessarily get you super far in answering your question because it's unlikely that the model of "everybody sees 100 pages whose latencies are randomly drawn from the distribution" is accurate enough. For example,
Once you hit a bad latency, you're not likely to continue, so the population of users that get to 100 pages is not going to be the same as all users. Also many latencies are driven by customers, so while my data was uncorrelated with user, most aren't.
I understood only a few things, but I liked your post.
Sure, there is correlation. These days, 1 page is a lot more than 1 request, though! More importantly, I'm of the only-slightly-exaggerated opinion that "most of your customers will see your worst latency, so P-many-nines is the most important metric."
I think your thread was great, btw. Thanks for sharing! (And for remembering to preserve the graphs back then ...)
I just got lucky. I had to make them by hand, so they were sitting in my old home directory years ago when they were going away.
The math is scary. If L is the 99th percentile latency for a page and a single user hits 100 pages, only 37% of users will experience a consistently sub-L experience. To bring that up to 99%, L needs to be your 99.99th percentile SLO. See blog.golang.org/ismmkeynote
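Those numbers check out under the stated assumption of 100 independent page loads per user:

```python
print(0.99   ** 100)   # ~0.366: only ~37% stay under the p99 latency on every page
print(0.9999 ** 100)   # ~0.990: a p99.99 target keeps ~99% of users under L throughout
```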
So so interesting ... thanks for sharing.
Amazing story, thank you for sharing!
That is a way to attempt to ape the BS from AT&T "five nines" wake. I've gotten businesses from 99. to 99.9 and even beyond AT&T's horrid uptimes since that monopolistic clam bake.
If it were a normal distribution (but you can't have latency less than zero, so...) P-four-nines is very slightly less than 4-sigma (99.994%) but this is something you can measure fairly easily, and doesn't depend on the shape of the curve.
Affirmative. It is easy to measure @ATT's downtime when they have lots of media coverage about fiber cuts predating as a warning to a strike, e.g. nbcbayarea.com/news/local/Sou… During that time? In 2009? If you were in SF and an AT&T subscriber you couldn't even make a phone call.
$250K Reward Out for Vandals Who Cut AT&T Lines
AT&T upped the reward for information leading to the arrest of whoever is responsible for yesterday's massive cell phone outage to $250,000.
nbcbayarea.com
That's not all! The employer for whom I worked in 2009? We lost 1, but only ONE of our transit providers, the other? A-OK fine! You need to get beyond @ATT & Bell Labs and even beyond NANOG and silicon fabs. To know real network gurus. I've already worked with them, I knows.
You maybe know of this. I got my internet access through Demon in the UK. They had independent transatlantic cable routes to the USA. Trouble was, the two distinct pieces of wet string used the same bridge to cross a river in the USA, and it collapsed in a flood. (Last century.)
The UK and the USA will never see eye to eye! Colonial rejects, the monarchy? On it we shall never rely. But yes, bridges collapse in the 'Murica often. If you're in SF it may take > 2 decades to mend.
Well, if you want to take it like that, but I thought the key point was that the independence of the connections turned out to be an illusion. Different companies, different physical cables, but depending on the same bridge.
Which means you completely failed to read what I wrote. In 2009, when AT&T was down? My network was still passing packets, we understood BGP and don't taunt work forces threatening strikes to pound. I shall FOREVER hate. @ATT and those who deceive & charge make me irate.
That graph with the double peak doesn't exclude a normal distribution; it could be two different curves summed together, but the P-four-nines still has a clear meaning. And, tbh, I had to look up the number for 4-sigma. The double peak I would take as a warning of problems.
All the nomenclature is more or less BS meant to create: An extremely compliant workforce that at a drop of a hat, anyone can unpredictably expect a potential job terminate. Makes you feel helpless and irate? That golem doesn't care, upon its slaves it shall forever predate.
It’s entirely possible to use latency and abandon stats appropriately (arguably—they’re limited proxies for true value), but stack ranking is proof management will make up their own to justify misbehavior anyway if they’re so inclined so it’s senseless to hide data from yourself.
The real WTF is expecting to be able to top-down manage 5000+ people in any useful way in the first place. As far as I can tell it’s beyond human capacity.
Affirmative. In my experience, a team gets unwieldy with four or more. A company? Gets unwieldy once you go beyond 40 or more. Beyond that? It is misery galore.
These numbers fit modern infantry combat, 4 men in a fireteam, while 40 is at the high end of a platoon, the command for a single relatively new officer (or the sergeant who usually knows better what is going on).
I concur, there is too much overlap between businesses & militaries. Even if you are in any armed force? Which is horrid, and you should quit, and file for conscious objector status or if necessary a divorce. Still be better to be in a small ops team. If you know what I mean?
Those numbers are part of a general human pattern, with the Dunbar Number at the top end. The military puts some special pressure on the limits, smaller groups than you might find in a hunter-gatherer tribe, which can be smaller groups than in farming communities.
I am anti-military. Guerrilla 民間人. The way militarization? Post Prussian indoctrination? The military BLEEDS OVER into everything. It makes me want every uniform buried, they sting.
I persist in latency and incoherence. Against all currencies, there is no defense. Fight with all your might. Only the poor will fathom the real plight.
Compartmentalizing data, is mandatory to quantization. Otherwise how would be determine an AIFF from a PNG for differentiation? There is no "proof" in the difference as holding social standings. They're just datatypes, which are designed for file processing handlings.
This is so fascinating, thanks for writing this up! What a cool story and such impactful work :)
Interesting. I have a setup where I now show average latency open-bank.gklijs.tech/average-latenc… and max latency open-bank.gklijs.tech/max-latency.ht… I hadn't really realized the second one might be much more important.
Noooo... Max is less important, or rather, max alone is less important. What matters is the frequency of your worst response times, along with how bad they were compared to good responses.
Is there newer data on this? 2001 is ancient history of online shopping. :-)
And how do you define "page latency" in the age of lazy loading? Time to Start Render? Time to Interaction? Total Load Time?
Lazy loading is not new; "total load time" has historically been called "completion time" and is explicitly not the same as "latency". cf ics.uci.edu/~fielding/pubs…
Using that definition of latency, "first indication of a response", it would seem impossible to measure latency from the server side because it depends on available client-side resources.
Very well explained, thanks for sharing.
Thanks for sharing!
Amazing story! Thank you for sharing.
Related aside from the psychology literature: mean latency is correlated with the variance of skewed distributions. To measure cognitive effects, mean percentile differences are far more effective, stable, and scale-invariant.
It's fascinating how PMET and so many other things at Amazon started out as nifty tools and now each of them is an AWS service of sorts :)
you will probably like this "story"
Because averages lie to you and p95 makes sense.
I longingly remember the early days of your performance team, when no one had ever heard of P90!
Thanks for sharing! A great story and accomplishment.
Good read on SLAs and percentiles