“AI has had a big leap forward, and we are finally able to work on mobile”
Voicemod’s Alex Bordanova explains how AI powers real-time voice transformation, prioritising ethical data use.
Welcome! In this edition, we meet Alex Bordanova, Chief Product and Technology Officer at sound platform Voicemod, a real-time AI voice transformation platform that enables people to modify their voices and add sound effects during live online conversations. He discusses how AI powers the company’s real-time voice transformation technology, shares insights on ethical AI implementation and the technical challenges of real-time processing, and explains how the Spanish company deliberately avoids voice cloning to protect against potential misuse.
Each edition of AI|G is a Q&A with a leader working at the intersection of games and AI. We’ve got several great conversations already in the can, including with ArenaX Labs and X&Immersion, with more to reveal soon, so keep us bookmarked. If your company is doing something cool in this space, drop us a line, and we’ll schedule a time to get you on board, too.
Scroll right to the end for tons of fresh AI news, including Google DeepMind’s Genie 2 model for 3D environments, itch.io’s AI disclosure tags, Meta’s new Llama 3.3 model, publisher Future and OpenAI striking a partnership, and more.
Alex Bordanova, Voicemod
Meet the Chief Product and Technology Officer of Voicemod, a real-time AI voice transformation platform that enables users to modify their voices during gaming, streaming, and online communication. With over 45 million downloads, it offers hundreds of voice options and a huge sound library, letting users create unique audio identities.
Top takeaways from this conversation:
A great many computations are needed to make one voice take on the timbre of another, in real time, without compromising performance. AI has a central role in that process.
Ethical AI implementation is important. Voicemod uses only consented data sets from professional voice actors and deliberately avoids voice cloning capabilities, showing that ethical considerations can be built into product design without compromising commercial success.
AI Gamechangers: Please give us some background on Voicemod. What was the moment you realised AI voice transformation was going to be the direction you were going in?
Alex Bordanova: As you can imagine, 10 years ago, AI did not occupy the space we know now. In fact, we do not occupy a place when it comes to “gen AI” – that’s not how we understand AI. We use AI technology in the way that many different industries use it.
Ten years ago, the first application was a mobile app, and it ran on the iPhone 3G. It was a small business run by three brothers who decided to just launch the application. It was used in an asynchronous mode, so you could record your own voice messages and share them. VoIP was the angle for quite a while. It was with the arrival of PUBG that games started to introduce voice chat, which is basically where our users operate in the gaming space.
[The founders] realised they had a big spike in traffic from desktop, so they said, “There's something there!” They pivoted quickly and put together this application for desktop, which ultimately used the same SDK that they were using for their mobile phone app – they were able to use that in a PC context with essentially the same sort of mechanics that we see today.
It was just voice, but then they realised there was a big business there to invest in, and they added the soundboard feature. It all came very naturally. Users have always been part of our community, and they helped put it together. Users also demanded Voicelab (the tool they use to create the voices). We introduced more and more of those features, and then we recently transitioned from v2 to v3.
“We are the only player in the space that is Fairly Trained, meaning we have been certified as a generative AI company that adheres to fair training data practices”
Alex Bordanova
On that journey, about five years ago, we started with another type of effect. It’s not a common audio effect like reverb or EQ that we could use to dress the voices and set them, for example, in a cave or give a radio effect. We were missing something very relevant: the timbre of the voice. The timbre of the voice is very special. It’s what makes us sound like ourselves. That is a very complex calculation! In order for “Alex” to sound like “John”, I need to take the signal and modify it in a way that is very complex. There is the combination of so many different physical factors: the throat, the age, the mouth, the nose, the accent. There are so many complex variables that it’s not simply a convolution; it’s a combination of so many calculations that only an AI could solve for us.
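To make that concrete, here is a minimal, purely illustrative sketch of the kind of neural mapping involved – a toy mel-to-mel network of the sort used in neural voice conversion, not Voicemod’s actual model. The weights here are random, so the output is meaningless; a production system would be trained on consented voice data and paired with a neural vocoder.

```python
# Toy timbre-conversion network: maps mel-spectrogram frames of one
# voice towards those of another. Illustrative only; weights are random.
import torch
import torch.nn as nn
import torchaudio

class ToyTimbreConverter(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        # 1D convolutions over time: each frame is reshaped using
        # context from its neighbouring frames.
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> converted mel, same shape
        return self.net(mel)

sample_rate = 48_000
mel_extract = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)

waveform = torch.randn(1, sample_rate)   # one second of dummy audio
mel = mel_extract(waveform)              # (1, 80, frames)
converted = ToyTimbreConverter()(mel)    # (1, 80, frames)
# A trained vocoder (e.g. HiFi-GAN) would turn `converted` back into audio.
print(mel.shape, converted.shape)
```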
The minute that we started to work on that, it sounded awful! It sounded like a saxophone – very artificial. But over time, we have invested a lot of resources in order to create the natural, real-time voice conversion that we know today.
You don’t do “voice cloning”, which is a topic in AI at the moment. You do voice changing. Please tell us what that means to you. What makes your product different from other solutions that people might have heard of recently?
Voice cloning refers to copying someone else’s voice. And it raises a lot of situations that we want to be mindful of. For instance, in order to clone a voice, you need to have original samples of that voice to reproduce the same sound. We’re not running down that route – we think this could harm society, and we are very respectful of that angle.
We are the only player in the space that is Fairly Trained, meaning we have been certified as a generative AI company that adheres to fair training data practices. All the data sets that we train on are proprietary or open source, meaning the voice actors have given consent. That means that when we use that data, whatever voice we produce is going to sound like those voices or a mix of them. It is with the amount of data that we are using and how we use it that we can protect against voice cloning – the technology is out there, and we have the ability today to do that; it’s just that we are not putting it in our product, on purpose.
Ethics is a major concern in AI. Can you tell us more about your approach to working with professional voice actors to get that data and the voices ethically?
We have a very strict process for creating voices. We need to take into consideration that we are an entertainment company, so we do not provide solutions for voiceovers or anything other than having fun and communicating with friends online.
We are also very aware that there are some people in the online space who need another voice for several reasons. Maybe how their voice sounds doesn’t “fit” them. For instance, we see a lot of VTubers (virtual content creators) who need a voice to accompany their visual avatar; therefore Voicemod completes the missing half of their digital identity.
To create the voices ethically, we directly interview the actors that we bring into the process. We hand them the scripts so we are sure that they are the right fit for the audio we need and are consistent with our quality.
As well as services for gamers, you have services for game-makers. Can you give us a taste of the range of b2b solutions you provide?
We are the global leader in the b2c space. We have our application that runs on desktop, and now we are starting to run on mobile.
We did a big push on this last year, and we multiplied the performance of our models by six so we can run on basically any device. And that’s been a big success. Alongside the AI voices, we have a huge variety of Digital Signal Processing [DSP] voices, or a mix of both. We have also just launched our first hardware device, Voicemod Key, which brings our real-time voice changer and soundboard to games consoles for the very first time.
“We want to hear how the sound and voice works, so we can embody it. For that, it goes to 45 milliseconds. This is where our algorithm sits: 45 milliseconds is what AI is hitting today”
Alex Bordanova
However, we also have an SDK that third parties can integrate into their own product. Within it, voice conversion is one of our products. The other one is the soundboard, with over 300,000 sounds. Again, our company is all about entertainment, and there’s a big space for sounds! It’s not just voice. It’s also all the sounds that you can create, share and play within the situation of being with friends and having fun. Like putting on a background sound.
For instance, I’m an RPG player – I love playing with my friends online. I use Voicemod a lot, of course. I put together all the layers to create the background when I’m the Dungeon Master, and I’m changing my voice to be one or other [of the NPCs]. These are the same features we want to provide to third parties, and we’re speaking with video game studios so that they use the SDK to integrate Voicemod into their games, allowing gamers to change their voice as easily as they switch [skins/characters] in-game.
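To give a flavour of what such an integration might look like, here is a hypothetical sketch. Every class and method name below is invented for this illustration; it is not Voicemod’s actual SDK API.

```python
# Hypothetical integration sketch only: these names are invented for
# illustration and are NOT Voicemod's actual SDK API.
class VoiceChangerSDK:
    def __init__(self) -> None:
        self.active_voice: str | None = None

    def set_voice(self, voice_id: str) -> None:
        # A real SDK would load and activate a voice model here.
        self.active_voice = voice_id

    def process(self, frame: bytes) -> bytes:
        # A real SDK would run voice conversion on the audio frame here.
        return frame  # passthrough in this stub

# A game could swap voices the way it swaps character skins:
sdk = VoiceChangerSDK()
sdk.set_voice("npc_innkeeper")
out = sdk.process(b"\x00" * 960)  # one 10 ms frame: 48 kHz, 16-bit mono
```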
You’ve secured partnerships with some IP holders like Warner Bros. How do you see collaborations like that evolving?
For Warner, it’s been almost three years, and it's been very successful. Our users are young adults [so] there’s a range of IPs we know they will love. Rick and Morty, for instance, is one of the big ones. We can also speak about Batman – I’m not that young, but I love Batman! We’re working on The Nun and on other franchises from the Warner Bros space.
We also work with Rovio Entertainment for Angry Birds and have found significant value for our users in being able to integrate original IP into their social experiences – and not just simply something that resembles it, but the real thing. Just as Fortnite leads the way in visual IP by featuring iconic skins from all franchises, we aim to be the leader in audio IP, creating equally authentic and immersive experiences in the audio space.
We are transitioning more and more to make our product more IP-driven. We work on skins so users can show that they own that IP. Streamers are a big part of our business – it’s all about streaming! We work on the visual assets that guarantee they have the original IP. There’s something about collecting and showing the value of what they purchased, a status that we think is very connected to IP. We introduced that just a few years ago, and we believe it elevates our brand a lot, as it creates trust with our users, which is what we’re all about at Voicemod.
“AI” means a lot of different things. What does AI mean to you, and how does it factor into your technical process?
Let’s look for a second at the three big pillars of what audio means because people probably aren’t very familiar with the problems of how audio works when we speak about doing it in real time. One is latency, the second is performance, and the third is quality.
For latency, when talking about real time, we like to say 150 milliseconds is the boundary. If we go beyond, say, 300 milliseconds, we have lip-sync problems. We want to make sure that we keep that low. At the same time, we want to make sure that we can listen to ourselves. We want to hear how the sound and voice works, so we can embody it. We can really feel that we are that voice, and this is suddenly part of our character. So for that, it goes to 45 milliseconds. (Even that is on the edge.) This is where our algorithm sits: more or less 45 milliseconds is what AI is hitting today.
That’s a very big constraint because, in order to do that, we need a very small window of less than six milliseconds of processing time. That’s very tight. Those are big, very heavy calculations, and we need to keep consumption very low. So that’s the second pillar, performance. We want to make sure that we are not increasing the consumption to the point that it does not allow users to play games (or to run Discord). If they have software that consumes the whole CPU, what’s the purpose? They cannot play! Then there’s no need for us, because they can’t do their main thing, which is having fun and communicating with their friends.
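To put those numbers in context, here is some back-of-the-envelope arithmetic for block-based real-time audio. The 45-millisecond target and sub-six-millisecond processing window are quoted above; the sample rate, block size and lookahead figures are illustrative assumptions.

```python
# Back-of-the-envelope latency budget for block-based real-time audio.
# The ~45 ms end-to-end target and <6 ms processing window come from the
# interview; sample rate, block size and lookahead are assumptions.
SAMPLE_RATE = 48_000   # Hz (assumed)
BLOCK_SIZE = 256       # samples per block (assumed)

block_ms = 1000 * BLOCK_SIZE / SAMPLE_RATE
print(f"Each block holds {block_ms:.2f} ms of audio")  # ~5.33 ms

# To keep up with real time, every block must be processed before the
# next one arrives, hence a deadline of roughly one block's duration:
print(f"Processing deadline per block: ~{block_ms:.1f} ms (interview: <6 ms)")

# End-to-end latency stacks input buffering, processing, output
# buffering, and any lookahead context the model needs (assumed 25 ms):
total_ms = block_ms + 6.0 + block_ms + 25.0
print(f"Illustrative end-to-end latency: {total_ms:.1f} ms vs the 45 ms target")
```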
For the third pillar, quality, you need to think that if you pull on one pillar, the other two will probably be affected! If you keep latency low, and you keep performance low, probably you will degrade quality. We want to keep quality high for intelligibility, so that you understand me, no matter if I use one voice or the other. You are going to be able to understand the message, and it has “naturalness”, so it sounds like it’s supposed to. And there are always going to be nuances. I mean, we all speak differently.
“There is going to be a palette of thousands of different voices that you will be able to use to create perfect AI voices”
Alex Bordanova
So you need to act; you need to add something from your side. Prosody – how we let the words out of our mouths: the pauses, the inflections, the level, how we put things – is also among the qualities that make you sound like a real person. But in the end, quality is one aspect that we need to keep improving bit by bit.
These three things are what pose the challenge. With this challenge, how do we make it run so fast that the experience is not broken, with enough quality, so that in the end it sounds perfect? AI has changed a lot. Most of the software you will find is not real-time; it’s asynchronous. That’s another, totally different, challenge. If, instead of less than 10 milliseconds, you use two seconds for a process, then we’re speaking about another league! Quality and performance will be affected. It’s again three strings pulling; you need to be very conscious of that. AI has made a very big leap forward in these last years, and we are finally able to work on mobile, which at the beginning was something that was very, very hard to conceive.
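The real-time pattern he describes looks roughly like this in code: a minimal sketch using the sounddevice Python library, with a passthrough standing in for the actual voice-conversion model. The asynchronous alternative would simply read a whole file, process it at leisure, and write it back out.

```python
# Minimal real-time (streaming) audio loop: each block must be handled
# before the next arrives. Requires the `sounddevice` package.
import sounddevice as sd

SAMPLE_RATE = 48_000
BLOCK_SIZE = 256  # ~5.3 ms per block at 48 kHz

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)  # over/underruns mean a deadline was missed
    # Whatever happens here must finish within one block's duration,
    # or the audio glitches. A real system would run its model here.
    outdata[:] = indata  # passthrough stand-in for voice conversion

with sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK_SIZE,
               channels=1, dtype="float32", callback=callback):
    sd.sleep(2000)  # keep the live loop running for two seconds
```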
Where do you go from here, looking ahead? How do you envisage your technology evolving over the next few years? Does AI open up even more possibilities?
There’s a lot to build, and AI is just another tool. For us, it’s more about providing an experience for the users that is flawless, where they do not perceive any difference – just like when you watch a movie today.
Look at the landing of the SpaceX rocket. To me, that feels like CGI. It’s almost like a movie! And that’s because our brain, I think, is not capable of assuming that this is real, because, “Wow.” I think that we are just about to get to the point where we find the same [with Voicemod].
We have a tool called Voicedesigner that we are just about to release for end users. It’s the same tool our sound designers use to create the voices. There is going to be a palette of thousands of different voices that you will be able to use to create perfect AI voices. You will craft your own voices, and you will be able to add DSP effects to them, so you can sound like you’re on a spaceship or create whatever you want for your games, immersive experiences, or just to have fun with your friends.
There’s a lot to do when it comes to creation. We are also expanding the conversations with IP owners because there are a lot of challenges here; we want to make sure that we are respectful with their IPs. Still, I foresee a big path ahead in order to consolidate the industry of voices for such IP owners.
You were recently named one of Spain’s top tech startups. How did that make you feel, and where does that lead you?
When you look at the startup landscape in Spain, you usually won’t find an audio company in the top tier. For us, it’s very relevant because suddenly, the audio is just kicking in. It’s taking the space that we think it deserves.
At the end of the day, voice is the missing half of the digital identity. More and more, we’re going to be looking at a big standard for the industry, and we need to open conversations with other b2b partners. Everything that helps to pave the way to that point is good.
At the same time, in Spain, we have big ventures when it comes to AI, at different levels. On the university and academic side, we have research groups (big not in size, but in importance) that are doing amazing things.
For instance, a couple of years ago, we acquired VoctroLabs, the global leader in sing-to-sing technology, which originally co-developed Hatsune Miku alongside Yamaha. We wanted to incorporate the ability for users to sing with the voices. We are combining that technology so we can open another entertainment space, which is music. That happened with a very small-scale startup inside the core of a university, spinning out until we could finally introduce it into the wider market.
I’m not a specialist in the singing field, but when you stretch the words, when you do not just say a letter, but you expand it – “aaaaaaa” – when you make it long, and you do inflections, up and down, and you modulate the voice in a way that needs to match a certain tempo and pitch, then the challenge is great.
“Voice cloning refers to copying someone else’s voice. We’re not running down that route – we think this could harm society, and we are very respectful of that angle”
Alex Bordanova
I am not a good singer! We are working on something for people like me, who do not sing at all: we are enabling one-button magic. Just turn it on, and because I use Voicemod, suddenly I can sound like a proper singer. That is magic, because it will put my voice at the right pitch. And it will transform my voice to sound like someone else’s voice. We have AISHA, one of the singers we trained a voice from. And wow, suddenly I’m singing like a woman, perfectly on pitch. And for really bad singers like me, we can correct the pitch as it changes, so it’s not just snapped to one note on the right scale; in the moment, it will also move from one note to the other.
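The basic pitch-correction idea can be sketched in a few lines: detect the fundamental frequency (the detector itself is out of scope here) and snap it to the nearest note of a chosen scale. This illustrates the general technique, not Voicemod’s implementation.

```python
# Snap a detected fundamental frequency (f0) to the nearest note of a
# scale. Illustrative; a real voice changer also glides between notes.
import math

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of C major (C = 0)

def snap_to_scale(f0_hz: float, scale=C_MAJOR) -> float:
    # Convert frequency to a (fractional) MIDI note number.
    midi = 69 + 12 * math.log2(f0_hz / 440.0)
    # Pick the nearest MIDI note whose pitch class is in the scale.
    best = min(
        (n for n in range(int(midi) - 2, int(midi) + 3) if n % 12 in scale),
        key=lambda n: abs(n - midi),
    )
    # Convert back to Hz; the voice changer would shift the signal here.
    return 440.0 * 2 ** ((best - 69) / 12)

print(snap_to_scale(450.0))  # ~440.0: snapped to A4
```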
And that’s the type of technology that we are [working on] – we can finally activate the singer in every user because that's a big demand from our audiences.
Further down the rabbit hole
Your handy digest of what’s hot in the world of games and AI recently. It’s been a busy week or so. Check out these links:
In its latest newsletter, Silicon Valley venture capital firm A16Z published its team’s predictions for games in 2025. Only five of the 14 ideas don’t feature AI in some form. Key themes include real-time generated games and intelligent AI NPCs.
Google DeepMind revealed Genie 2, which can generate playable 3D environments from a single image prompt. These environments respond to keyboard and mouse controls, maintain consistency for up to a minute, and can remember off-screen elements. The generated worlds include complex features like physics, lighting, object interactions, and character animations. While it can be used for prototyping, it’s also used to train other models.
And then DeepMind’s CEO (and former game developer) Demis Hassabis responded to an Elon Musk post on X, suggesting they should use it to make games together.
Google’s been busy: this week, it unveiled a quantum chip that "performed a standard benchmark computation in under five minutes that would take one of today’s fastest supercomputers 10 septillion years". It says advanced AI will significantly benefit from access to quantum computing.
World Labs has unveiled an AI platform that can generate interactive 3D scenes from a single image.
Indie game hosting site itch.io now has an “AI disclosure” tag, which will indicate whether or not a project uses generative AI.
Future, the publisher of games media brands like Edge, PC Gamer and GamesRadar, has inked a deal with OpenAI. Its three pillars are: Future's editorial content will now be available through ChatGPT; OpenAI-powered chatbots will answer readers’ questions on Future websites; and internal teams will be encouraged to use AI tools to boost productivity.
Speaking of OpenAI, it rolled out ChatGPT Pro on Thursday, a $200 monthly plan that gives access to its latest models and tools, including unlimited access to o1 (its smartest model), o1 pro (a version with more compute power that gives even better answers), o1-mini, GPT-4o, and Advanced Voice.
OpenAI has committed to releasing a new feature every day until 20th December. As part of that, this week it finally launched Sora, the AI video generation tool. Bad luck, anybody outside the US – it’s only available there.
Strauss Zelnick, CEO of Grand Theft Auto publisher Take-Two Interactive, declared in a recent interview that he doesn’t believe AI will revolutionise game development.
Meanwhile, Sony PlayStation CEO Hermen Hulst told the BBC, “I suspect there will be a dual demand in gaming: one for AI-driven innovative experiences and another for handcrafted, thoughtful content.”
Reforged Labs has launched an AI-powered video ad creation platform in open beta, offering mobile game studios automated ad generation through templated, data-driven creative solutions.