Behold, an AI startup with a real business
Jessica Powell still vividly remembers her first set of funding pitches for Audioshake six years ago. They were a disaster. She’d spent a decade at Google, rising to become its head of all communications. She knew many of the venture capitalists in Silicon Valley. But when she sat down with investors to actually pitch her company, they practically sneered at her.
“They were just like, ‘Anything but this. Anything but music. Anything but audio.’ … It was just bad on many, many, many fronts,” she said in an interview late last year with Conor Healy at a music conference.
I didn’t understand Audioshake back then either. Audioshake’s tech takes any song from any source and automatically splits that song into its component parts, called stems. So “American Idiot” by Green Day suddenly becomes a drum track, a guitar track, a bass track and a vocal track.
How was that any kind of business? Her primary customers, the big entertainment companies, were famously suspicious of new technology, especially out of Silicon Valley. It seemed she was facing an incredibly steep hill.
But this is what makes entrepreneurs so unique and interesting. The best ones know how to endure endless rejection because they are convinced they know something everyone around them doesn’t.
In this case she and co-founder Luke Miner – Audioshake’s CTO, a former Plaid data scientist and Powell’s husband – knew musicians and record executives desperately wanted technology like this. They knew they would pay for it. Audioshake just had to deliver it.
And that’s what’s finally starting to happen. Powell says that Audioshake, which now has about 25 employees – mostly PhDs – is now processing hundreds of millions of minutes of audio a year, up from single digit millions of minutes two years ago. She said they already have a handful of engagements worth millions of dollars.
Its customers now include more than 40 businesses, among them Universal Music Group, Warner Music Group and Studios, Disney Music Group, NFL Films and AI-Media. Audioshake is also powering features in two of the largest music streaming apps, as well as at many of the big AI companies, which she can’t name. A partnership with another big network is to be unveiled this week.
It all makes Audioshake a unique company in the AI bubble/boom: While the world is pouring money into data centers to fund AI chatbots and agents without any obvious long-term business model, Audioshake has built a business that helps its customers make money right now. It sprinkles AI fairy dust on entertainment libraries and makes them dramatically more valuable than they were.
Part of Audioshake’s secret, aside from its AI chops, is that both Powell and Miner are longtime musicians themselves. Powell plays piano. Miner plays saxophone. They know how that world thinks. And that cultural knowledge helped them turn the entertainment industry’s insularity into an asset: Help the music licensing division make more money, and other divisions want a piece of that too.
“We’d been thinking about something like this since 2013, when Luke and I lived in Tokyo. I was there for Google. One of my frustrations was that I’d always wanted to karaoke to old punk songs, but they were never in the karaoke books. ‘Wouldn’t it be so cool if you could just rip the vocals out of any song you wanted and sing along to that?’ we asked ourselves.”
By 2019, Miner believed AI had finally made it technically possible to do just that. “I remember the first song that we separated was a Smiths song, and Morrissey’s voice sounded really demonic,” Powell said.
They weren’t thinking about starting a company at first. They just thought they’d done something cool for musicians. Then they showed what they’d done to their friend Doug Merrill, Google’s former CIO, who’d also worked in the music industry. “‘Look what we can do,’ we said to him. And he was like, ‘You idiots. You don’t even realize what you’ve built and how useful it could be. Do you know how many times they want instrumentals out of songs for sync licensing but don’t have them?’”
So while traditional VCs initially said “no,” music industry investors said “yes.” Powell and Miner raised $5 million in two seed rounds from Metallica-backed Black Squirrel Partners, Crush Ventures, and Billy Mann. Mann is one of the best-connected producers, executives and musicians in the industry.
By last fall they had generated enough buzz and revenue among labels and studios to raise another $14 million from traditional VCs.
“One of the things that has happened in the AI boom is that … it’s become confusing whether buyers (of new AI tech) are using real dollars or play dollars (in the form of data center credits or other kinds of financing),” said Alex Hartz, a partner at Shine Capital, which led the $14 million VC round.
That’s not the case with Audioshake, he said. “It feels like foundational technology to me. If you think about what Stripe does for payments or what Twilio does for text messages or phone calls, I think that there’s a similar very large scale opportunity … for Audioshake in voice.”
Powell said that part of what’s taken so long is that training AI to split music into stems is much harder than training it on any other format. A spectrogram – a plot of a sound’s frequency content over time – doesn’t obviously distinguish between thunder, crowd noise and loud music. Sounds overlap. And big entertainment companies do move super slowly.
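To make the spectrogram point concrete, here is a rough sketch in Python using the open-source librosa library (not anything from Audioshake): it mixes a steady tone with broadband noise and computes the time/frequency grid Powell is describing. In the mixture, the two sources’ energy simply sits on top of each other.

```python
# Illustrative only: a simple mixture and its spectrogram, using librosa.
# The tone and the noise overlap in time and frequency, which is why a
# spectrogram alone doesn't tell you which energy belongs to which source.
import numpy as np
import librosa

sr = 22050
t = np.arange(sr * 2) / sr                       # two seconds of audio
tone = 0.5 * np.sin(2 * np.pi * 220 * t)         # a steady "instrument" tone
noise = 0.1 * np.random.randn(len(t))            # broadband "crowd noise"
mixture = tone + noise

# Short-time Fourier transform: frequency on one axis, time on the other
spec = np.abs(librosa.stft(mixture, n_fft=2048, hop_length=512))
spec_db = librosa.amplitude_to_db(spec, ref=np.max)
print(spec_db.shape)                             # (frequency bins, time frames)
```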
You’d think that, since music production and output have been digital for a generation, artists and labels would already have these stems lying around. But Powell and Mann both say they typically don’t, because few thought the pieces would be valuable until recently.
“It really wasn’t until around 2015 or so that creating the stems and handing the stems over to the label became more standard,” Powell said. “And even now a lot of times they are not made (for budget reasons.)”
It’s true that sound separation tech has existed for a while. It’s also true that a handful of open source AI models can now create music stems quickly too.
But Powell says that competitors don’t do it as well, and they don’t do it in a way that clears the many legal, operational and production hurdles associated with handling the valuable assets of big entertainment companies. Audioshake doesn’t plan a consumer-focused product. “Open source models tend to solve specific, narrow problems. We’ve built a broader set of capabilities – multi-speaker, dialogue/music/effects, music removal – that work across real-world, messy audio,” she said.
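For a sense of what that open-source baseline looks like, here is a minimal sketch using Meta’s Demucs separator, a stand-in for the general technique rather than Audioshake’s own system; exact flags and output locations vary by Demucs version.

```python
# Hypothetical quick test of open-source stem separation with Demucs
# (assumes `pip install demucs`; this is not Audioshake's technology).
import subprocess

# By default Demucs writes drums, bass, vocals and "other" stems for the
# track into a ./separated/<model>/<track>/ folder.
subprocess.run(["demucs", "song.mp3"], check=True)  # "song.mp3" is a placeholder path
```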
For Mann, what makes Audioshake so attractive is how fast it works. “Say I’m doing this new campaign for sneakers, and I want to use ‘Won’t Get Fooled Again’ by the Who. It’s a $280,000 sync license (a license to pair the song with video). The answer used to be, ‘Well, I gotta call and go to the warehouse and get permission to pull out the two-track master.’ Now, while we’re Zooming, I can get you the solo guitar on ‘Pinball Wizard’ with nothing else in 30 seconds.”
All this has created a powerful word-of-mouth cycle inside these giant companies for Audioshake, Powell said. “It started with sync licensing. Then it spread into other departments where they needed it for sampling or remixing or a remaster. And then the film people found us and asked, ‘Could you separate dialogue, music and effects?’ The first film we separated was for the Doctor Who franchise, which wanted to quickly dub the English version into German.”
Now broadcasters have also begun using Audioshake’s technology in real time to filter out background noise and improve the quality of broadcasts and of captioning. They’ve also begun using it to filter out music that might show up in the background of replays and other clips so they can avoid paying artist royalties.
Perhaps the biggest opportunity, according to Powell and Hartz at Shine, is for Audioshake to become the company all the AI firms use to clean up crosstalk and to improve the quality of the audio they use for training their models.
At the moment AI systems can hear in controlled settings, but they are still terrible at it in the real world, like in a noisy bar, Powell said. “So if you can extract and isolate sound at scale, running on devices, very low latency, you can help machines start to harness some of these human powers, like these nuanced powers of hearing, and provide important inputs for what is happening in real time in the real world.”
“If this whole AGI thing is going to happen, machines need to be able to see. But they also need to be able to hear.”
Here are snippets of our conversation, edited for length and clarity, from a Zoom interview and three email exchanges:
FV: Tell me more about the sync licensing business.
JP: When you’re watching a movie, a short, a TV show, or a commercial, and there’s music connected to the video playing in the background, that’s a sync. You’re syncing a video to a song, and typically there’s some subtle manipulation or editing of the music happening. Maybe the Lexus is going around the corner of the mountain, and they’re going to increase the energy, the bass, to show more excitement, whatever it might be.
And so what we were able to do is have a system that would give them the ability, right out of the gate, to basically create these stems or splits that the (film, commercial, or video) editors could work with.
FV: Can you explain more about how it is that artists and labels don’t already have these stems?
JP: Well, for starters, that leaves out a huge amount of old music that wasn’t multi-tracked (and never had stems to save). But it also assumes that labels saved all those tracks when they could. In lots of cases people didn’t necessarily see a value in holding on to the different components. In other cases, the people might have been interested in saving them, but they were prepared with certain plugins that don’t exist anymore. Or they were saved on physical tape, and that physical tape is hard to find right away and expensive to work with because it’s been sitting in some warehouse for years.
FV: Ok. But doesn’t the technology to do all this already exist? I see open source programs that offer stem creation. And before that, didn’t free audio editing tools exist for years? How else would karaoke versions of songs be available?
JP: Most karaoke tracks historically weren’t made from stems but re-recorded versions of popular songs. They were studio recreations made “in the style of” the original. Labels didn’t provide stems. Isolating vocals from a finished track wasn’t feasible. That’s why only a subset of songs—typically the most popular—ever made it into karaoke.
Now that’s changing. Newer karaoke platforms are striking deals with rights holders to use original recordings. With source separation, they can split the original track into vocals and instrumentals, making virtually any song usable for karaoke—not just the ones that were re-recorded. Companies like Singa, which uses Audioshake, are early examples of this shift. Karaoke is moving from a curated catalog to “any song, on demand.”
FV: Audio editing tools couldn’t do that quickly?
JP: I remember speaking to the audio engineer who worked on The Beatles at the Hollywood Bowl live album (released in 1977), and him saying it took him something like three years to isolate the different elements in those tracks. We would be able to do that in three seconds. So, yes, it’s been possible to create stems for a long time, but it was super time consuming and super expensive.
FV: What about competition from open source stem creators and others?
JP: Well, yes, they exist in music. But a) we now do a lot more than just music, and b) even there, we’re still orders of magnitude better than they are. There are actual benchmarks that prove this.
And c) our customers aren’t really choosing between “open source vs AudioShake” on model quality alone—they’re choosing between a model and a production system.
We’ve built a broader set of capabilities—multi-speaker, dialogue/music/effects, music removal—that work across real-world, messy audio. Our models are optimized for streaming, on-prem, SDKs, and APIs—so they can actually sit inside production systems.
And because we’ve built alongside rightsholders and creators from the start, we get access to high-quality datasets. We’re a trusted part of their workflows. That’s something almost no AI companies anywhere have right now.
FV: What makes audio so much harder for AI to understand than, say, photos, video or text?
JP: Most AI models learn from data that’s already structured. Audio is structured, but its structure is continuous and harder to discretize than text. Unlike text and images, raw audio contains no explicit, directly observable boundaries. Instead, that segmentation must be inferred from acoustic patterns.
The problem is even worse for LLMs trying to train on audio. We’re helping them with that. The world doesn’t naturally produce clean, labeled audio (isolated voices, instruments, etc.), so high-quality training data is scarce—you can’t just scrape your way to a model.
That’s why separation is increasingly becoming a foundational step for AI systems that work with audio—before you can understand or generate sound, you first have to structure it.
So what we do is train on hundreds of thousands of unique sounds to be able to understand the characteristics of those sounds.
Even then, the identification piece is difficult because so many sounds can sound alike. Think about a piano and a guitar, or your voice and another man’s voice. On top of that you often have frequency overlap, where sounds bleed into each other.
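Frequency overlap is easy to see with a little arithmetic. In this sketch (an illustration of mine, not Audioshake’s), a 220 Hz note’s second harmonic lands on top of a second source at 440 Hz, so one bin of the mixture’s spectrum contains energy from both:

```python
# Two sources whose energy collides in the same frequency bin.
import numpy as np

sr = 8000
t = np.arange(sr) / sr                                   # one second of audio
piano = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)  # A3 plus its 2nd harmonic
voice = 0.8 * np.sin(2 * np.pi * 440 * t)                # a second source at 440 Hz
mixture = piano + voice

spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(len(mixture), d=1 / sr)
bin_440 = np.argmin(np.abs(freqs - 440))
# The 440 Hz bin now holds energy from both sources; nothing in the mixture
# says how much came from which, so a separator has to infer it from context.
print(freqs[bin_440], spectrum[bin_440])
```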
FV: How do artists feel about this? Do they like it because they get money from stems? Do they object to having their art deconstructed? Have any threatened lawsuits?
JP: Stem separation tends to be much less controversial (for artists) than generative AI. The big debates you hear about—artists objecting, lawsuits, training data concerns—are largely happening in generative music and art.
Our work is different. We’re not training on massive amounts of scraped content to generate new songs, nor are we perceived to be competing with human creations. We’re working with audio to make existing content more usable, which in turn helps human artists make more money from their work.
The training side is also different. Technically, we don’t need specific artists’ catalogs to train our models. We’re learning patterns of sound like frequency overlap and noise. We’re not trying to replicate a particular artist’s style.
Because of that, we’ve generally been embraced by the creative industries. Our tools extend the value of existing work rather than competing with it.
And as AI becomes a bigger part of the ecosystem, we’re often the layer that ensures those systems can work with audio in a way that’s actually usable and rights-compatible.
FV: Are the stems you create new AI generated versions of the music? Or are they the actual individual instrument or singing with all else removed?
JP: AudioShake is not generative AI. We’re not creating new audio or synthetic copies. You can think of it as a subtractive process rather than an additive one. We’re unmixing what’s already there, not creating something new.
For example, take a podcast or news clip recorded in a noisy environment. A generative model (like Adobe Enhance) may “fill in” or modify parts of the signal to improve clarity. Our approach doesn’t introduce new information—it isolates what was actually captured in the original recording.
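One minimal way to picture that subtractive idea is a simple time-frequency mask (Audioshake hasn’t published its method, so this is only a sketch of the general approach): a mask between 0 and 1 is applied to the mixture’s spectrogram, so the output can only keep or attenuate audio that was already in the recording, never add anything new.

```python
# Mask-based "unmixing" sketch: nothing new is synthesized; bins of the
# original mixture are only kept or suppressed.
import numpy as np
import librosa

mixture, sr = librosa.load("mixture.wav", sr=None)    # placeholder path to any mono recording
stft = librosa.stft(mixture)                          # complex time-frequency grid

# Stand-in mask: keep bins louder than the median. A real separator would
# predict a mask (or several, one per source) with a neural network.
magnitude = np.abs(stft)
mask = (magnitude > np.median(magnitude)).astype(float)

isolated = librosa.istft(stft * mask)                 # only audio that was in the original
residual = librosa.istft(stft * (1.0 - mask))         # the rest; isolated + residual ≈ mixture
```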
FV: Can you go into more detail about the other work you are doing besides just music stems?
JP: Music remains a core part of our business—it’s just now one of several large-scale surfaces where the same underlying technology is deployed. Sync licensing is where we got our start—but as we’ve grown, we’ve expanded across the full lifecycle of audio workflows.
In music, that means moving beyond sync into platform-scale use cases. We now work with major labels and streaming platforms, where separation and lyric transcription power things like remixing, catalog optimization, and new product experiences. It’s the same core technology—just applied at a much larger scale and embedded directly into products rather than used case-by-case.
In film and TV, we started with use cases like dubbing and localization, but are now deployed more broadly across studios—for everything from dialogue cleanup to compliance to post-production and broadcast workflows.
And in speech, what began as an editing and enhancement tool has expanded into AI infrastructure—powering training data prep, improving speech transcription accuracy, and enabling real-time voice applications.
Beyond new revenue, a lot of the value we bring is about unlocking or protecting existing business. Broadcasters and rights holders use AudioShake to remove or replace music to avoid copyright fines and takedowns on social, or distribute material to new audiences. Others are re-releasing iconic content like The Wizard of Oz in immersive formats that require stems that didn’t previously exist.
Within AI, the value tends to be in unlocking at-scale workflows that are otherwise thwarted by messy audio. For example, in Voice AI, companies like AI-Media use our real-time tech to clean and isolate speech in real time—improving accuracy in their live dubbing, transcription, and captioning products.
Another example would be in data prep. You can use AudioShake to help create structured training data for AI models. For example, we have partners in the music generation space who are licensing content and need to train on the stems of the content they’ve licensed. Or in speech, you want to have crystal-clear speech inputs no matter the environment in which the data was recorded; we make that possible.
In short, in media we help customers make new money, save money, or unlock content that was previously unusable.