
Apple’s platforms offer a few quality modes for variable-speed audio playback. Safari on macOS — not other browsers, and not even iOS Safari — continues to use the worst/fastest one. Depending on the API in use, they could most likely fix this with a single line of code.
+@OvercastFM +@marcoarment Why is it that playing podcasts at 2x speed sounds amazing on Overcast on iOS/iPadOS, and even on macOS via M1 support for iOS apps, but very robotic when listening through Safari macOS? Is there a different audio API you use in the app?
34 replies and sub-replies as of Dec 18 2020

Seems like Apple’s own Podcast app uses the worst one as well.
Even Podcasts.app on Big Sur sounds like warbly garbage at anything other than 1x. Podcasts on iOS sounded great at other speeds around 13.2 or so.
Er, 12.2 not 13.2.
Ha I’ve been wondering about this but assumed the answer
This is super interesting, and shocking that Apple has not addressed this issue on Safari.
I had absolutely no idea that you could even do this on the website. Had never even considered it. This is amazing.
Could it be that they chose the fastest because they’re too obsessed with battery life 🤔
As the person who wrote the line of code in question, I have no idea what you’re talking about. trac.webkit.org/changeset/1810…
What algorithm specifically would you have Safari switch to?
Try AVAudioTimePitchAlgorithmTimeDomain.
You’ll note in the WebKit commit I posted, that is the algorithm we moved away from in favor of Spectral, in this case because of user complaints about YouTube playback. WSOLA algorithms like TimeDomain work well for speech but fall apart when there are any harmonics in the audio.
My fever dream would be to let web authors “hint” their content as “speech” vs “music” and pick the right algorithm in response.
What does iOS Safari use? On macOS, what does Chrome use? Both sound significantly better than Safari, last time I checked.
iOS Safari and macOS Safari use the exact same code in WebKit to pick the pitch algorithm. Chrome, as far as I know, uses their own WSOLA algorithm; not any provided by the OS. It’s possible that iOS and macOS have different implementations of Spectral, but I doubt it.
This is plausible, depending on when the AVPlayer class was written, and when kAudioUnitSubType_NewTimePitch showed up on both platforms. kAudioUnitSubType_TimePitch was macOS-only, and iOS had no equivalent counterpart for quite a while.
Interesting — I just tested it and found that there is indeed no obvious difference between macOS Safari and iOS Safari on sped-up speech. May be recent — a few releases ago, that wasn't the case. FWIW, Chrome/Brave still sound FAR better than Safari (and always have on Macs).
Time-domain methods (PSOLA/WSOLA) are cheap/easy to implement compared to the frequency-domain methods, and they do very well with speech (and other simple) audio content, but they completely fail for most musical and other complex content.
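For readers unfamiliar with the time-domain family mentioned here, the following is a minimal WSOLA-style sketch in plain Python. It is an illustration only, not the code used by WebKit, Chrome, or AVFoundation; the function name `wsola_stretch` and the `frame` and `search` defaults are assumptions chosen for the example. It copies overlapping windowed frames from the input at a faster hop than it writes them to the output, and nudges each frame by a few samples so it lines up with the waveform that would naturally have followed the previous frame.

```python
import math

def wsola_stretch(samples, rate, frame=256, search=32):
    """Toy WSOLA: play `samples` at `rate`x speed while keeping pitch.

    Hypothetical helper for illustration; parameters are not tuned.
    """
    hop_out = frame // 2  # output frames overlap by 50%
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    out_len = int(len(samples) / rate)
    out = [0.0] * (out_len + frame)
    norm = [0.0] * (out_len + frame)
    prev_start, written, k = 0, 0, 0
    while True:
        target = int(round(k * hop_out * rate))  # ideal read position in the input
        out_pos = k * hop_out                    # write position in the output
        if target + frame + search > len(samples) or out_pos + frame > len(out):
            break
        start = target
        if k > 0:
            # The segment that would naturally follow the previous frame.
            natural = prev_start + hop_out
            if natural + frame > len(samples):
                break
            ref = samples[natural:natural + frame]
            best = -math.inf
            # Search near the target for the frame most similar to `ref`.
            for d in range(-search, search + 1):
                cand = target + d
                if cand < 0:
                    continue
                score = sum(a * b for a, b in zip(ref, samples[cand:cand + frame]))
                if score > best:
                    best, start = score, cand
        # Overlap-add the chosen frame, tracking the window sum for normalization.
        for i in range(frame):
            out[out_pos + i] += samples[start + i] * window[i]
            norm[out_pos + i] += window[i]
        prev_start, written, k = start, out_pos + frame, k + 1
    n = min(out_len, written)
    return [out[i] / norm[i] if norm[i] > 1e-9 else 0.0 for i in range(n)]
```

The similarity search can only lock onto one periodicity at a time. That suits a single voice, and it is a plausible reason this family struggles once several instruments with unrelated periods play together.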
The ideal solution would be to automatically detect the content and switch accordingly. A decent alternative would be to provide the user with options, but that's some _deep_ knob-twiddling for a web browser. This is why we make purpose-built native apps, Marco! 😀
If you ran an FFT over the decoded audio and looked for multiple, simultaneous frequency spikes, you might be able to detect “speech” vs. “non-speech”. But heuristics like this are easy to get wrong, and importantly, aren’t free.
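A rough sketch of that idea, assuming a power-of-two frame of decoded samples: window the frame, take an FFT, and count the distinct spectral peaks that rise above a fraction of the strongest one. The function names and the `rel_threshold` value are hypothetical, and this is deliberately crude. Voiced speech has harmonics of its own, so a real detector would need far more than a peak count.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def count_spectral_peaks(frame, rel_threshold=0.25):
    """Count local maxima in the magnitude spectrum above rel_threshold * max."""
    n = len(frame)
    # A Hann window reduces leakage so each tone shows up as one clean peak.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    mags = [abs(c) for c in fft(windowed)[: n // 2]]  # positive frequencies only
    floor = rel_threshold * max(mags)
    if floor == 0:
        return 0  # silent frame
    return sum(
        1
        for k in range(1, n // 2 - 1)
        if mags[k] >= floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    )
```

Even this toy version runs an FFT per frame, which illustrates the cost concern raised above: the analysis has to run continuously alongside decoding.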
More “free” would be something like allowing the page to declare their content was “speech” vs. “music”. This would solve Marco/Overcast’s use case, but wouldn’t help sites like YouTube with user generated content.
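As a sketch of how such a declaration could drive the choice: the hint values and the function name below are hypothetical (no such web API is described in this thread), and the returned labels simply echo the TimeDomain and Spectral algorithms discussed above. Unhinted content falls back to Spectral, matching the default Safari is said to use today.

```python
def choose_pitch_algorithm(hint=None):
    """Map a page-declared content hint to a pitch algorithm label.

    Hypothetical: "speech" and "music" are illustrative hint values only.
    """
    table = {
        "speech": "TimeDomain",  # WSOLA-style, reported to work well for voices
        "music": "Spectral",     # frequency-domain, copes with harmonic content
    }
    # Unknown or missing hints keep the existing default.
    return table.get((hint or "").strip().lower(), "Spectral")
```

A podcast site could declare "speech" and get the time-domain behavior, while user-generated video with no hint would behave exactly as it does now.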
And making the user bear the cognitive cost of figuring out what pitch-correction algorithm to use isn’t great either. We don’t have any obviously great options here.
Maybe you’ve chosen the wrong default. Might speech be the more common case in non-1x variable-speed playback in the context of a web browser? (I’d assume so, but you probably know better.) And what’s worse when guessing wrong: speech under freq domain, or music under *SOLA?
That’s certainly possible! When we made this decision at first, YouTube was the primary user of non-default-rate playback in the wild. A lot has changed since then. I’m wary of just flip-flopping what use cases get the short end of the stick, though.
And now Netflix has added UI for non-1x playback. Anecdotally, I suspect between YouTube and Netflix, harmonic content is probably more commonly played at non-1x rates, but I have no data to back that up.
I bet YouTube and video sites are indeed the most common use for browser audio playback at non-1x. But it doesn’t necessarily follow that music is the most common non-1x content, or that the *SOLA artifacts in music are less acceptable than the spectral artifacts in speech.
As far as what content sounds worse, I don’t have a clear answer. From the samples I’ve been listening to, music and speech sound approximately equally bad, with speech sounding robotic and with music nearly dropping entire instruments.
Our speed-shifting could definitely be better. And it's something we're looking into. I think Jer's point is that it's not just an easy one-liner fix. We can choose a different set of tradeoffs for speech vs music, or else drive improvements to the set of available algorithms.
What’s the use case for speeding up music? (genuinely intrigued why people would do that)
I wonder that myself, but apparently there are people who want to do it.
Just to add context, my app Capo offers up to 1.5x playback for music practice, and it’s useful for pushing yourself to play a passage/song faster than the 1x speed so that 1x becomes that much more comfortable to play.
FWIW, we still get complaints about Chrome's algorithm (particularly at sub-1x rates), so +1 to it being a complicated series of tradeoffs.
I reported this many years ago, with a recording of a video playing in both Chrome and Safari showing how Safari sounded worse. The reply? “Can’t reproduce” 🤦🏻‍♂️ This is literally the only reason I have Chrome installed. It’s my media playback browser.
I’ve had 3 video chat platforms (Zoom, Google Meet, and an interviewing thing) not work on Safari because of the robotic voice thing. Pretty embarrassing for Apple when the other people go “oh are you using Safari, yeah switch to Chrome it will fix that” 😂
It’s a nightmare. Their Catalyst-based Podcasts app for macOS also uses the same lousy method. It makes me think nobody at Apple uses variable-rate playback on the Mac.