Tj_Vantoll:
Hey everybody and welcome to another episode of React Roundup. I am your host today, TJ VanToll, and with me on the panel, I have Paige Niedringhaus.
Paige_Niedringhaus:
Hey everyone.
Tj_Vantoll:
And our special guest today is actually a React Roundup returning champion. We have Ian Lavery here. Ian, welcome back to the show.
Ian_Lavery:
Hey, thanks for having me back
Tj_Vantoll:
Yeah, so why don't you start, you know, for people. We were looking back, and I think the last show was about a year ago. We'll have to look up the episode number and toss it in the show notes, but it's been a while. So why don't you tell people who you are, what you do, your background, why you're famous, all those good sorts of things.
Ian_Lavery:
Yeah, so I work for a speech recognition company called Picovoice, and we're a developer-focused company that tries to empower developers on any platform to bring voice to their platform. So we have a whole variety of different products that cover speech to text, voice activation, wake word, all that. And we just want everybody to have a voice on their platform. Besides that, I do interactive media art and I play bass in a couple bands. Ha ha ha.
Paige_Niedringhaus:
That's awesome. Not just one band, but multiple bands.
Tj_Vantoll:
That's it. Okay.
Ian_Lavery:
Yeah, I'm going over it cheaper, I guess.
Tj_Vantoll:
Well, cool. So Picovoice looks interesting. I remember us talking about it last time, but maybe you can give us an overview of how it works. If I use Picovoice, what am I
Ian_Lavery:
Hmm.
Tj_Vantoll:
getting? Am I getting a service that I can send audio to and it comes back with the words? What other features? Maybe you could give us the rundown of everything it does, everything you do.
Ian_Lavery:
Yeah. So the big thing with us, the thing that sets us apart from pretty much every other voice service, is that we're entirely on device. So there is no service. There's no cloud API that you're calling to send your audio to. I mean, look around: pretty much every single voice thing is just an API. So we're one of the only ones out there that is actually giving you the ability to hold on to your audio data and your users' audio data, process it on the device, and return the result. Again, we have a variety of products. So we have wake word detection, where it's just like "Hey Siri" and "OK Google." All it's doing is sitting there processing frames of audio, waiting for you to say the thing, and then when it wakes up, it does the thing that you tell it to do. But we also have voice activity detection, which just basically peaks when it hears somebody talking, and obviously speech to text. Everyone wants speech to text. So auto transcription of voice, yeah.
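[Editor's sketch] The "processing frames of audio" loop described above can be illustrated roughly as follows. This is a minimal sketch, not Picovoice's API: `FrameBuffer`, `FrameHandler`, and `firstDetection` are hypothetical names, and the detector itself is left as a callback, since the actual model is not described in the episode. The only idea shown is that microphone audio arrives in arbitrary-sized chunks and is re-cut into fixed-length frames before each frame is handed to a detector.

```typescript
// Hypothetical detector callback: returns true when a frame completes the wake phrase.
type FrameHandler = (frame: Int16Array) => boolean;

// Collects incoming samples and hands back fixed-length frames.
class FrameBuffer {
  private pending: number[] = [];

  constructor(private readonly frameLength: number) {}

  // Append newly captured samples and return every complete frame now available.
  // Leftover samples stay buffered until the next call.
  push(samples: Int16Array): Int16Array[] {
    for (let i = 0; i < samples.length; i++) {
      this.pending.push(samples[i]);
    }
    const frames: Int16Array[] = [];
    while (this.pending.length >= this.frameLength) {
      frames.push(new Int16Array(this.pending.splice(0, this.frameLength)));
    }
    return frames;
  }
}

// Feed chunks through the buffer and report the index of the first frame
// for which the detector fires, or -1 if it never does.
function firstDetection(
  chunks: Int16Array[],
  frameLength: number,
  detect: FrameHandler
): number {
  const buffer = new FrameBuffer(frameLength);
  let frameIndex = 0;
  for (const chunk of chunks) {
    for (const frame of buffer.push(chunk)) {
      if (detect(frame)) return frameIndex;
      frameIndex++;
    }
  }
  return -1;
}
```

In a browser the chunks would typically come from a microphone capture callback; here they are plain arrays so the sketch stays self-contained.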
Tj_Vantoll:
It's very cool. It's also one of those problems that I feel like is becoming more commonplace. We have smart devices in our house, our phones can listen to wake words and that sort of thing. But I'm still sort of fascinated by the underlying technology.
Ian_Lavery:
Hmm.
Tj_Vantoll:
Maybe you could just start, give us the world's simplest rundown of how it actually works on the backend? Like, do you just have a whole bunch of low-level C code that looks for patterns in audio data or, I don't know. We don't need to go on for like two hours, but I'm just...
Paige_Niedringhaus:
Ha ha.
Ian_Lavery:
Yeah, so that's a good question. So I mean, basically it's deep learning, right? It's machine learning. So through machine learning, we teach the machine a statistical model of what a word sounds like or what a series of sounds sounds like. When we're teaching our machine, all we're doing is sending it frames of audio that are labeled, and we get it to remember them and form a little statistical pattern. And then for something like wake word, it's just like, hey, remember this pattern of three things.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Just remember that and say, hey, I think I saw it. So it's a lot more complicated when you get into speech to text, because not only are you teaching it every sound in the language, but you're also teaching it every word in the language.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
Because then you're dealing with audio and writing, which are different things. I think people think language is a combination of those things, but really they're two entirely separate things. There's the series of sounds you make with your mouth that other people understand, and then there's the symbols you write them down with and the grammar and the punctuation and everything
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
that you put into the written form and they're different. So we actually have to treat them differently. But you'll see a lot of the big cloud providers out there. The reason they got it so right so fast is because they had such large machines in the cloud in order to do this. So
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
it was sort of like, it outpaced the actual progress of voice recognition.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
And now everything's kind of caught up and we can actually do it on device, which is a big win because to be honest, we were like boiling the ocean for like a while doing speech to text and now we can do it on like a microcontroller. So.
Paige_Niedringhaus:
So if you're using something like Pico Voice, is it something that you as a user have to train the models or are the models already there? It's trained, it knows you're speaking
Ian_Lavery:
Hmm.
Paige_Niedringhaus:
English or it knows you're speaking Spanish and it will just, it should be smart enough to be able to take that audio and translate it into the correct written words.
Ian_Lavery:
Right. So for speech to text, for instance, we basically just have a general language model. We offer eight different languages, and you just give it the language you want and it will understand that language. But we actually use this thing called transfer learning, and we have a website, Picovoice Console, where we have sort of a general model.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
But then you sort of do train it yourself, because for something like wake word, we have a model that understands a bunch of sounds in whatever language you give it. But then you want it to represent a certain series of sounds like "OK Google." So you literally type that in to our console and hit train, and then it will pop out a model that understands that. So that's sort of what we mean. When we say you train it,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
it's not like, oh, you have to go out and gather 4,000 recordings of this word and, you know, submit it to something and watch statistics go. No, no, we did the hard work. You just...
Paige_Niedringhaus:
Yeah.
Tj_Vantoll:
I was going to say, because by saying that you're implying that you went out and got 4,000 recordings of these different words, right?
Ian_Lavery:
No, no. So the thing is, again, we've trained the general model, so it understands the sounds we needed to understand.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
You just tell us which sounds you want to, you know, form your wake word, and we pop out a model that just waits for that series of sounds. Yeah.
Tj_Vantoll:
Interesting.
Paige_Niedringhaus:
That's cool.
Tj_Vantoll:
Because I would have guessed that your building of the model was to get a bunch of people to say it. It kind of breaks my mind a little bit that it's possible, right, that you can sort of generalize.
Ian_Lavery:
Well, that was the old style. So I worked at a speech recognition company right out of college, and we had one of the early, early wake word engines.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
And the company was all B2B. We would basically enter a contract with the company that says, hey, we're going to go out and gather 4,000 recordings of this wake word, and we're going to train it and then deliver you the model. And, you know, it was very formal. And that was basically state of the art at the time. But
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
we're actually a bit past that now because we're able to use this concept of transfer learning to take a general model and just kind of point it in the right direction. So we no longer need to do all that, you know, pounding the pavement, asking for people to say a wake word, because that was a lot of work and it took months. Like every time somebody signed a contract. And I know, cause I was running the crowdsourcing technology for that company.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
So I would have to post these jobs, and these people would record it on their mobile device, and I'd have to go through all the recordings and, you know...
Tj_Vantoll:
Filter out junk.
Ian_Lavery:
Yeah, some people would just, you know, speak their manifesto into the phone and I'd have to be like, no, no, no, no.
Tj_Vantoll:
Ha ha ha.
Paige_Niedringhaus:
So one thing that I'm curious about is I'm assuming that when you would do these wake word gatherings, you would have to take into account accents, because I know that that is something that every automated assistant struggles
Ian_Lavery:
Mm-hmm.
Paige_Niedringhaus:
with: English accents, Scottish accents, Caribbean accents, all speaking English, but all slightly differently. So is Picovoice able to account for that and be able to interpret, you know, a deep Southern accent
Ian_Lavery:
I'm sorry.
Paige_Niedringhaus:
versus maybe a New York Boston accent?
Ian_Lavery:
Yeah, so I mean, that's still a challenge for us, but I think the reason we're a bit more resilient to it is because we've trained this general model on like, geez, like 10,000, 100,000 hours of speech. It's heard all the accents. Not all the accents, but it's heard. It's
Paige_Niedringhaus:
A lot.
Ian_Lavery:
heard a lot of variation.
Paige_Niedringhaus:
Hehehehe.
Ian_Lavery:
So, um, it, it tends to be a bit more resilient. When, when I was doing the old style where we would get people to record, that was actually a lot less resilient to it because we only had like, you know, 300 participants, uh,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
recording these wake words and, you know, how much variety are you going to get between 300 people, like not enough, but
Paige_Niedringhaus:
Right.
Ian_Lavery:
when we have tens of thousands of different speakers, maybe
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
more. So we tend to be a lot more sensitive to the variations.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
But it is definitely a challenge because even us as humans, if you hear like a really thick accent that you're not used to, it can be confusing.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
Like, we're not perfect either with it. So
Paige_Niedringhaus:
Right.
Ian_Lavery:
it's a challenge.
Tj_Vantoll:
So I think you added multiple language support. I believe that's new or at least newish from
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
the last time we talked. So does that more generalized ability make that easier or I imagine there's still all sorts of challenges that go into that.
Ian_Lavery:
Yeah, so when you actually work with a totally different language, that's basically starting over. Because accents are one thing: you've already taught it the series of sounds in the language, and you're just looking for a combination of those sounds and those symbols. But when you move into a new language, there's a new set of symbols, and there's a new set of sounds. Every language has an inventory. We call it a phonemic inventory. It's basically the series of sounds that you hear in the language, and every language has a different phonemic inventory. And we need to train the machine to understand only that inventory of sounds and all the symbols that go into that. So when we start a new language, we have to do it completely from scratch. We have to get new data in that language, we need to get new text in that language, and we need to do our best to even understand the language enough to work with it, because we need to listen to these recordings, we need to, you know, normalize the text we get and make sure it's not full of symbols and stuff, but understand it enough so that we actually don't confuse the machine learning process. And that can be a real challenge. It's a lot of work, actually.
Tj_Vantoll:
So it's fascinating. Does that mean like when you kick off a new language, I feel like you almost need to have like a professional linguist on staff
Ian_Lavery:
Hmm.
Tj_Vantoll:
for almost each of these languages, right? Like, or do you like bring on somebody who's like, you know, a world class, I don't know, Spanish linguist to help
Paige_Niedringhaus:
I'm
Tj_Vantoll:
or
Paige_Niedringhaus:
gonna
Tj_Vantoll:
like,
Paige_Niedringhaus:
go.
Tj_Vantoll:
like how much of it are you able, as a software developer, to sort of test on your own, and how much do you have to rely on a native speaker as the only person that can actually figure some of these things out?
Ian_Lavery:
Yeah. So our machine learning team, they do have to be part linguist. Because, you know, if you've studied languages, you at least understand the components, and basically every language is just a combination of the components. So they have a lot of expertise in that field to understand, when they approach a new language, how it works. But that's not enough. So usually what we'll do is we'll get a native speaker. We'll basically hire somebody on a contract to work with us to help with the language, because you do need that expertise. The fact is, even somebody who's a language expert, if they sit down to an entirely new language, they're not gonna be able to understand it enough to do the work that needs to be done to actually
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
get it to a production-ready state. So we often do need to get a native speaker in there to provide their input. And that will really speed the process along. We tried to do it without experts a couple times and it's just like,
Tj_Vantoll:
Hey.
Ian_Lavery:
you just don't get the performance and you spend a lot more time. You waste a lot more time, I should say.
Paige_Niedringhaus:
Sure. I mean, that makes a lot of sense when you think about getting expertise in anything else.
Ian_Lavery:
Hmm
Paige_Niedringhaus:
It's a lot. It will almost undoubtedly go much quicker if you have somebody who is proficient in whatever it is that you're trying to do.
Ian_Lavery:
Yeah. Well, they can recognize mistakes in grammar and stuff, the stuff that's really
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
hard to pick up as a non-native speaker.
Paige_Niedringhaus:
Yes. So what languages do you currently offer PicoVoice for?
Ian_Lavery:
So I believe last year we announced we had Spanish, French, German, English,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
and then this year we added four new languages. We added, um, Japanese, Korean, uh, Portuguese and Italian.
Paige_Niedringhaus:
Oof, those
Ian_Lavery:
Yeah.
Paige_Niedringhaus:
are some tough ones.
Ian_Lavery:
Yeah. Well, especially like when you get into the, uh, you know, written forms of, uh, Korean and Japanese, they
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
become very challenging. Like, in English, we have 26 characters. Japanese has two alphabets of 56, and then an additional alphabet of tens of thousands. So yeah,
Paige_Niedringhaus:
ambitious.
Ian_Lavery:
so the text representation of that is really difficult. The actual spoken version of Japanese is a lot easier than English, because Japanese has 56 sounds and they all map to a combination of characters. In English, mapping a combination of characters to the sound is incredibly difficult.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Turns out we made some mistakes early on and we didn't really fix them.
Paige_Niedringhaus:
I mean, just thinking about the amount of spellings that we have for the same sounding word based
Ian_Lavery:
Yeah.
Paige_Niedringhaus:
on the context, I cannot even imagine how you would be able to figure that out for a transcript.
Ian_Lavery:
And it's all exceptions in English. It's like, oh yeah, it's this
Paige_Niedringhaus:
Hehehe
Ian_Lavery:
unless this or this unless this. And like, here's three different reasons why this rule is wrong.
Tj_Vantoll:
Yeah, you see this when you have younger kids that are starting to write and
Ian_Lavery:
Hmm.
Tj_Vantoll:
you look at their writing, because they don't know the exceptions yet, right? But
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Right.
Tj_Vantoll:
they can speak it
Ian_Lavery:
Yeah.
Tj_Vantoll:
because they know. So you get, it's words you don't even think about too, because we internalize them so quickly, because
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
one of my kids spelled "because" wrong. And then you're like, oh, well, "because" is pretty easy. But then you think about it for like a half a second and you realize, actually the word
Paige_Niedringhaus:
It's not.
Tj_Vantoll:
because makes absolutely no sense. Like
Ian_Lavery:
Right?
Tj_Vantoll:
I can't.
Ian_Lavery:
Like if you try and explain it, you suddenly find yourself going, ah, it just is what it is.
Tj_Vantoll:
Yes.
Paige_Niedringhaus:
Just memorize it.
Ian_Lavery:
Just memorize it. Yep.
Paige_Niedringhaus:
Well, I mean, that's really fantastic that you have taken on and, it sounds like, gotten through some very difficult languages. What are future languages that you hope to be able to process as well? Chinese?
Ian_Lavery:
So we're going, yeah, exactly. So we're going to try, next year, we're gonna double our language count again, I think. And we're gonna do, we're gonna do Chinese, Vietnamese. What else? Dutch.
Paige_Niedringhaus:
Mmm.
Ian_Lavery:
I believe Russian. Polish, I think. Yeah, I can't remember all of them. But to be a fully inclusive speech recognition company, you basically need a bare minimum of like 50 languages. So we're gonna get to like 20 of the most popular and hold there for a while
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
is kind of our plan because that covers a lot of people. Like that
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
covers the majority of people. Because even in the cases where the people might not speak the language, they usually are like, oh, but I speak this more popular language.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
But to really get up there, like, I mean, you do need
Paige_Niedringhaus:
Oh.
Ian_Lavery:
to get to like 50 or something. And I mean, Google has like 150. So...
Paige_Niedringhaus:
Hehehe
Ian_Lavery:
You know, it's kind of a never ending thing for us.
Paige_Niedringhaus:
Yeah, how about Hindi? That's a big one.
Ian_Lavery:
Oh yeah,
Tj_Vantoll:
Mm-hmm.
Ian_Lavery:
that's actually one of the other ones we're going to do next year.
Paige_Niedringhaus:
nice.
Tj_Vantoll:
Sorry, I guess I gotta ask one last question. Are there any languages like you've come to hate like because it was like very difficult or
Ian_Lavery:
English!
Tj_Vantoll:
like come to love?
Paige_Niedringhaus:
English.
Tj_Vantoll:
Yeah. Yeah.
Ian_Lavery:
It's funny how much you can hate your own language. No, actually, like, seriously, English is the only... Like, I look at all other languages we've done and I'm like, these are so much easier. Like,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
English is actually... It's just... it came out of a mess of languages. It was a lot of combinations that happened over time. And, you know, a lot of English developed during widespread illiteracy and stuff. So there's really interesting examples you can find of stuff where it's just like, oh yeah, this was just a mistake that happened, you know, 200 years ago, and they kept it. Or actually I have a fun fact. The word "dumb." You look at that and you're like, why does it have a B at the end? Apparently there was a time where the ruling class of England was trying to make it harder to write English so that the peasantry couldn't
Tj_Vantoll:
Ha ha
Ian_Lavery:
like
Tj_Vantoll:
ha.
Ian_Lavery:
pick it up and they literally just added some letters to the language here
Tj_Vantoll:
Thanks.
Ian_Lavery:
or there and, and we're like, this is the proper way to write it. And then just to confuse people. And we literally still have that to this day. So. English is so weird.
Paige_Niedringhaus:
Mm-hmm. So that's why knife has a K in front of it.
Ian_Lavery:
Yeah, yeah, like stuff like that. I think they were just messing with us, and now we're just like, we have to
Paige_Niedringhaus:
We're
Ian_Lavery:
live with
Paige_Niedringhaus:
stuck
Ian_Lavery:
it.
Paige_Niedringhaus:
with it.
Tj_Vantoll:
So I want to pivot a little bit and talk about the actual web development, the
Ian_Lavery:
Hmm.
Tj_Vantoll:
side where you might actually use a service like this. Because
Ian_Lavery:
Yeah.
Tj_Vantoll:
I remember last time we chatted a little bit, too, about common use cases. So
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
maybe we could just start with an overview. We have a lot of web developers that listen to this show.
Ian_Lavery:
Yeah.
Tj_Vantoll:
What do you think, I guess, A, what would using something like this look like? How do you actually get it in an app?
Ian_Lavery:
Hmm.
Tj_Vantoll:
And B, I guess, what are some common use cases that you see for use on the web as well?
Ian_Lavery:
Right, so one of the big things is, obviously, on the web people are a lot more comfortable calling an API. That is what they've come to expect for speech recognition and stuff. But we're actually kind of bringing back the power of the browser itself. I mean, the browser is a virtual environment that can run whatever you want, and we actually can run entirely in the browser on the client side. And that's big because these days we're getting a lot of progressive web apps, and the web app is a big thing, especially with SaaS companies. So if you're running a SaaS company and you want to integrate voice into your console or something, having it on the client side lowers the latency. It gives you a lot more direct control of what happens when you get voice. And it means you can be robust to connection issues, which is a huge thing. Not everyone has amazing internet, and you don't want to have to be making calls out to an API and just hoping it comes back for your feature to work. This will just work. And on top of all that, it's less expensive because we're not calling an API. We're not depending on cloud infrastructure. So
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
you're actually, if you're a developer and you integrate PicoVoice into your web app, your client is going to be using their machine to do the processing. So I think it's just a win-win situation for that.
Tj_Vantoll:
Yeah, I feel it's especially important considering it's audio too. So like bandwidth is like
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
you're, you're not just shipping off like a couple of things in a query string to some service. You're
Ian_Lavery:
No.
Tj_Vantoll:
like uploading audio.
Ian_Lavery:
Yeah. Megabytes of audio.
Tj_Vantoll:
Yeah. So the bandwidth consideration is amplified significantly.
Ian_Lavery:
Yeah, no, and that actually allows for something like real-time audio. Real-time audio is very challenging to do with an API because you basically need to stream it to the service
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
and have responses being streamed back. That's really expensive. That's like a constant bandwidth issue.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
But when you're doing real-time audio and it's all running in your browser on the client side, it's snappy and you can do things that require timing. And yeah.
Tj_Vantoll:
Very cool. And I know, I think I remember from last time too, that because one of the ways you keep it snappy is it's not JavaScript code running
Ian_Lavery:
Hmm.
Tj_Vantoll:
in the browser, right? It's your, uh, well, I don't remember your exact tech stack, but I know you have some sort of fancy way of doing that. Uh, maybe you could walk people through some of the magic and challenges of how that works.
Ian_Lavery:
Yeah, so our core code is in like C because we were trying to keep it as efficient and snappy as possible. Now C code and when you think of C code next to React, you're like, how does this even work? Like,
Paige_Niedringhaus:
Hehehehe
Tj_Vantoll:
Hehehehe.
Ian_Lavery:
can these two ever talk? Well, it turns out they can with Wasm. What we do is compile basically all our core code into a Wasm binary blob, and then we ship that with our npm package. So when you npm install Picovoice, part of what's going to be shipped with your website is our Wasm blob. And Wasm's really cool because it basically just wraps your native code in JavaScript and then allows you to attach to it like any sort of dynamic library: here's the functions I want to call, here's the data I'm going to put into it, and then you just call it like you would any other library. It's a little trickier to work with because, I mean, one of the things we're pretty aware of, and I'm sure the listeners of your show are aware of, is that JavaScript is like: types, whatever.
Paige_Niedringhaus:
Hahaha
Tj_Vantoll:
Hehehe
Ian_Lavery:
Even TypeScript is like, yeah, types, but, you know, a number is a number, right? Well,
Paige_Niedringhaus:
Right.
Ian_Lavery:
C is like what, how many bits is your number? Like, like it
Tj_Vantoll:
Yeah.
Ian_Lavery:
needs to know. So you start to need to think about that when you work with Wasm.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
You start to need to think about, okay, is this a 32-bit int going in here? And you need to start to think of memory. Like, okay, I need to have a pointer. I need to pass
Paige_Niedringhaus:
Hehehe
Ian_Lavery:
in a pointer here to get something back and then convert that pointer to like a JavaScript object of some sort.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
So it's challenging to work with, but once you get it working, it's extremely powerful, because then we can ship an incredibly complex piece of code and just put basically a slim interface of JavaScript around it. And then any JavaScript developer is just talking to it like it's JavaScript. They don't need to worry about the Wasm. That was our problem. Yeah, so it's challenging to work with, but if anybody has a challenging problem that requires the efficiency of C, don't be afraid of it. It's not that hard. And it is pretty amazing when you start working with it, actually.
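[Editor's sketch] The "how many bits is your number" and "pass in a pointer" points above can be made concrete with typed arrays. This is an illustration under stated assumptions, not Picovoice's code: a plain `ArrayBuffer` stands in for a Wasm module's linear memory, `alloc` is a toy bump allocator, and `nativeSumAbs` is a hypothetical stand-in for a compiled C function. A real module would export its own memory, allocator, and functions.

```typescript
// Stand-in for a Wasm module's linear memory: one flat block of bytes.
const memory = new ArrayBuffer(64 * 1024);
let nextFree = 8; // toy bump allocator state (assumption, for illustration only)

// Reserve byteLength bytes and return the offset ("pointer") of the block.
// Sizes are rounded up to 4 bytes so int16 and int32 views stay aligned.
function alloc(byteLength: number): number {
  const ptr = nextFree;
  nextFree += Math.ceil(byteLength / 4) * 4;
  if (nextFree > memory.byteLength) throw new Error("out of memory");
  return ptr;
}

// JavaScript numbers are 64-bit floats; native audio code usually wants 16-bit ints.
// Convert samples in [-1, 1] to 16-bit PCM and copy them into memory.
function writePcm16(samples: Float32Array): { ptr: number; length: number } {
  const ptr = alloc(samples.length * 2);
  const view = new Int16Array(memory, ptr, samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view[i] = Math.round(clamped * 32767);
  }
  return { ptr, length: samples.length };
}

// Hypothetical "native" function: takes (pointer, length) and writes a
// 32-bit int result at resultPtr, the way a C function would.
function nativeSumAbs(ptr: number, length: number, resultPtr: number): void {
  const input = new Int16Array(memory, ptr, length);
  let total = 0;
  for (let i = 0; i < length; i++) total += Math.abs(input[i]);
  new Int32Array(memory, resultPtr, 1)[0] = total;
}

// The slim JavaScript-facing wrapper: callers never see pointers or bit widths.
function sumAbs(samples: Float32Array): number {
  const { ptr, length } = writePcm16(samples);
  const resultPtr = alloc(4);
  nativeSumAbs(ptr, length, resultPtr);
  return new Int32Array(memory, resultPtr, 1)[0];
}
```

The wrapper at the bottom is the part an application developer would call; everything above it is the bookkeeping described as "our problem."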
Paige_Niedringhaus:
Okay, so it works or there is an npm package if you want to use JavaScript with it, but what if you are a Python developer or maybe you're working with a microcontroller like Arduino?
Ian_Lavery:
Mm-hmm.
Paige_Niedringhaus:
Are there options for other languages like that?
Ian_Lavery:
Yeah, so I mean, since we're a developer-focused company, we're pretty obsessed with our SDKs. So for our two most popular products, I think we have like 20 SDKs for each one, and it covers all the favorites. And we even have, no, we have four different web SDKs: we have vanilla JavaScript, but we also have React, Angular, and Vue. So we basically want it to be like, use it in your favorite environment. Like,
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
use it like you use anything else in your stack. We don't want to disturb that, basically.
Paige_Niedringhaus:
Right. That's awesome. So what are some of the use cases that you've seen people employing it for recently?
Ian_Lavery:
So we've actually come to some interesting ones lately. Auto content moderation is a big one right now. So let's say you're Minecraft or something,
Paige_Niedringhaus:
Hehehe
Ian_Lavery:
or you're, I guess, let's go Fortnite, and you have open audio streams
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
on hundreds of thousands of players, and you're trying to moderate all that,
Paige_Niedringhaus:
Mmm.
Ian_Lavery:
that's very difficult. And it turns out a lot of big companies out there are using auto moderation, which basically takes that audio and is basically looking for key phrases.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Let's call them.
Paige_Niedringhaus:
Yes, it's a good way to put it.
Tj_Vantoll:
Yeah, with some air quotes there. Yeah.
Ian_Lavery:
Yeah. Yeah. And it's just looking to flag them. And then they'll usually have a person go in and inspect the actual
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
content of it and decide whether it was a mistake or whether it is actually a bannable offense. So that's actually a new exciting one. Also, call centers, it turns
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
out. Again, we've got open phone lines, like a whole building full of them. And we're trying to understand, you know, what's being said on all these different calls. And you can't have people listening to all that audio. So a lot of big call center companies need some sort of automated system to take in all the audio from all their phones and do something with it. So we're, we're encountering more use cases like that lately, actually.
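[Editor's sketch] The moderation flow described above (scan for key phrases, flag, then let a person review) could look roughly like this once audio has been turned into text. All names here (`Segment`, `Flag`, `flagSegments`) are hypothetical, and the matching is a deliberately simple whole-phrase check that only handles ASCII letters and digits. It is not a description of how any particular game or call center does it.

```typescript
// One transcribed stretch of speech from one speaker.
interface Segment {
  speaker: string;
  startSec: number;
  text: string;
}

// A segment queued for human review, with the phrase that triggered it.
interface Flag {
  segment: Segment;
  phrase: string;
}

// Lowercase, drop punctuation, and collapse whitespace so matching is
// insensitive to case and punctuation.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the segments that contain any watch-list phrase as whole words.
// Nothing is banned automatically; the result is a review queue.
function flagSegments(segments: Segment[], watchList: string[]): Flag[] {
  const phrases = watchList.map(normalize).filter((p) => p.length > 0);
  const flags: Flag[] = [];
  for (const segment of segments) {
    const padded = " " + normalize(segment.text) + " ";
    for (const phrase of phrases) {
      if (padded.indexOf(" " + phrase + " ") !== -1) {
        flags.push({ segment, phrase });
        break; // one flag per segment is enough to queue it for review
      }
    }
  }
  return flags;
}
```

Padding with spaces keeps a phrase from matching inside a longer word, which reduces the false positives a human reviewer would otherwise have to clear.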
Tj_Vantoll:
Those are both really fascinating. It's funny, the content moderation one really resonated with me because, I don't know, my kids are 11, so they're right at that
Ian_Lavery:
Mm.
Tj_Vantoll:
impressionable age, but they're also right in the age where they want to play like games that are the sort where they have open audio.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Right.
Tj_Vantoll:
So there's a game we play that's like 5v5, so five people on each team, and it has a way for you to do audio communication. And the very first thing I did was make sure to shut that off, to disable it, because I'm a professional internet user, and that's the first thing you learn: I don't trust anybody.
Ian_Lavery:
Yeah.
Paige_Niedringhaus:
Right? Anyone.
Ian_Lavery:
No.
Tj_Vantoll:
I wouldn't even wanna hear it myself, much less my kids.
Paige_Niedringhaus:
Yep.
Ian_Lavery:
I know, it brings me back to when I was 11 or 12 and the internet was a new, exciting thing. I remember I was really into going to different video game websites and stuff. And then there were
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
these chat rooms about video games you could go
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
to. And it was literally just a room where everybody's microphones are on and you just start talking.
Tj_Vantoll:
Hehehehe
Ian_Lavery:
And it was like, when I think of that now, I'm like, oh my God, that's frightening.
Paige_Niedringhaus:
Yeah, yeah.
Ian_Lavery:
But yeah, I mean, the fact is, we can keep those spaces safe with these sorts of tools, because then the ne'er-do-wells out there at least get banned when they're being inappropriate or whatever.
Paige_Niedringhaus:
Ha ha ha! Oh, God. Well, one thing that you put in the show notes today that I would really like to hear more about is a new speech-to-text engine, or engines: Cheetah and Leopard. So maybe you could tell us a little bit more about those.
Ian_Lavery:
Yeah, so I think last time we spoke, we actually didn't have a publicly available speech-to-text engine. We were using our speech-to-intent engine, which is called Rhino. Yeah, the founder of the company is pretty obsessed with animals. So with Rhino, basically you teach it a small grammar, and then it understands that grammar, which is great for stuff like controlling a coffee maker, where there's only so many functions it needs to understand. But we decided to kind of go that extra mile and bring speech-to-text to devices. And traditionally, language models are in the gigabyte realm of size, and we actually got
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
ours down to 20 megabytes for language. And that's sort of the big win for this is like we can run on, you know, anything that can take 20 megabytes of memory or of storage. And so Leopard and Cheetah are actually two different sides of the same coin. So Leopard is a speech-to-text engine that takes in a set amount of audio. So like
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
an audio file or something and gives you a transcription of that. And that's an easier problem, because you can basically say, okay, this is all the audio I'm gonna get. So I'm gonna look forward, I'm gonna look back, I'm gonna make inferences based on the future and the past and give you a response. But then Cheetah, of course, because it's the fast one, is real time. So it has zero look-ahead, which
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
means it will take in every frame of audio that you give it and it will return what it thinks is being said.
Paige_Niedringhaus:
Wow.
Ian_Lavery:
So they're both speech-to-text engines, but they just serve different use cases. So, I mean, with audio files, the accuracy is much better,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
but of course you sacrifice the sort of real time effect.
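To make the look-ahead trade-off concrete, here is a toy sketch. It does not use the real Leopard or Cheetah APIs, and it does no actual speech recognition: each "frame" is a made-up phonetic token, and the `decode`, `transcribeFile`, and `StreamingTranscriber` names are hypothetical. The only point is that a batch transcriber can peek at what comes next, while a streaming one has to commit to a word immediately.

```typescript
// Stand-in for a chunk of audio: here, just a phonetic token (an assumption for this toy).
type Frame = string;

// Toy decoder: "tu" is ambiguous ("to" vs "two") unless we can see the next frame.
function decode(token: Frame, next?: Frame): string {
  if (token === "tu") {
    if (next === undefined) return "to"; // no look-ahead: fall back to a default guess
    return next === "kats" ? "two" : "to"; // look-ahead: pick based on what follows
  }
  if (token === "kats") return "cats";
  return token;
}

// Batch style: the whole recording is available, so each frame can look ahead.
function transcribeFile(frames: Frame[]): string {
  return frames.map((frame, i) => decode(frame, frames[i + 1])).join(" ");
}

// Streaming style: frames arrive one at a time and each is decoded with zero look-ahead.
class StreamingTranscriber {
  private words: string[] = [];

  // Process one frame and return the word emitted for it right away.
  process(frame: Frame): string {
    const word = decode(frame);
    this.words.push(word);
    return word;
  }

  // Everything emitted so far.
  transcript(): string {
    return this.words.join(" ");
  }
}
```

Feeding both the frames `["i", "have", "tu", "kats"]` gives "i have two cats" from the batch function and "i have to cats" from the streaming class, which is the accuracy-for-latency trade in miniature.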
Paige_Niedringhaus:
Yeah.
Tj_Vantoll:
So 20 megs is impressive, but is that still like small enough for a browser to use? Like does a user have to download that
Ian_Lavery:
Hmm.
Tj_Vantoll:
to use it in their web app?
Ian_Lavery:
So that was a challenge. We recently did the web SDKs for Cheetah and Leopard, and we actually had to kind of redesign our whole system of delivering language models to the browser to handle this. So yes, 20 megabytes is a lot, but we actually separate the language model from the package, so basically we let the developer decide how that's delivered to the user. We also made it a part of our system that it could be a Base64 representation that you can bake into your website if you just want it to always be there. Or if you want to be kind of smarter about it, what you can do is put it in your public folder and have it downloaded to the user's browser on first load, and then cached in local storage for the rest of the time. So the very first load, yeah, it'll be a 20 megabyte load, but the second load will be instant because they already
Paige_Niedringhaus:
Mm.
Ian_Lavery:
have the language model.
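The download-once-then-cache idea can be sketched like this. It is a minimal illustration under assumptions, not the Picovoice web SDK: `ModelStore`, `Downloader`, `loadModel`, and `MemoryStore` are names invented for the example, and the store is injected so that in a browser it could be backed by whatever persistent storage the app uses.

```typescript
// Hypothetical persistent key-value store for model bytes.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// Hypothetical download function, e.g. a thin wrapper around a network request.
type Downloader = (url: string) => Promise<Uint8Array>;

// Return the model from the store if present; otherwise download it once and save it.
async function loadModel(
  url: string,
  store: ModelStore,
  download: Downloader
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) {
    return cached; // second and later loads: no network needed
  }
  const bytes = await download(url); // first load: pay the full download cost
  await store.set(url, bytes);
  return bytes;
}

// In-memory store, useful for demos and tests; a real app would persist across sessions.
class MemoryStore implements ModelStore {
  private items = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }

  async set(key: string, value: Uint8Array): Promise<void> {
    this.items.set(key, value);
  }
}
```

Keying the cache by URL means that shipping a new model under a new file name naturally triggers a fresh download.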
Tj_Vantoll:
Yeah, it's a pretty neat system because I think it's the nature of the beast.
Ian_Lavery:
Hmm.
Tj_Vantoll:
Because I mean, in a way, it's kind of more of a native app feature, and native apps are downloading gigs of stuff at times. And so it's like a feature that helps the web sort of compete with that. So I think it makes sense.
Ian_Lavery:
Yeah.
Tj_Vantoll:
And I think honestly, I think that's the best. The solution is kind of clever because that's kind of all you can do because you can't.
Ian_Lavery:
Yeah.
Tj_Vantoll:
You can't magically get it to the user ahead of time, like through an
Ian_Lavery:
No,
Tj_Vantoll:
app store or something. So.
Ian_Lavery:
and a developer, if they're, you know, if they wanna be clever about it, they can stream it from their public folder asynchronously on the first load so that it's just like, you know, by the time the user wants to activate the voice feature, it's already downloaded, you know?
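One way to get that "already downloaded by the time they need it" behavior is to start the download at page load and hand the same in-flight promise to whoever asks later. The sketch below assumes a hypothetical `prefetchOnce` helper and a pretend download; none of it is taken from a real SDK.

```typescript
// Wrap a task so it runs at most once; every caller gets the same promise.
function prefetchOnce<T>(task: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = task(); // first call starts the work
    }
    return pending; // later calls reuse the in-flight or finished result
  };
}

// Pretend model download (an assumption for this sketch): resolves with 20 placeholder bytes.
const getModel = prefetchOnce(async () => new Uint8Array(20));

// Kick off the download at startup without waiting for it.
void getModel();

// Later, when the user activates the voice feature, await the same promise.
async function onVoiceFeatureActivated(): Promise<number> {
  const model = await getModel();
  return model.length;
}
```

This version keeps a failed download cached as a rejected promise, so real code would probably want some retry handling on top.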
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
It's just the sort of thing you need to handle in these sorts of ways. Because, you know, we were working with a company recently that does this all the time in their mobile apps,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
so their mobile app actually downloads stuff all the time to keep their app working, and it does it all asynchronously, like when you open up the app, and the user is none the wiser, but behind the scenes, there's all this stuff. So when you look up, you know, why is this app using 3.6 gigabytes when it was only 500 megabytes when I downloaded it? That's because they only delivered the core code, and the rest of it was downloaded later on.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
So it's just how you do stuff now,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
is just keep the package size small, but then just deliver the features kind of as they're being used.
Tj_Vantoll:
Yeah, I know iOS and Android even have APIs built in to help you do that sort
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
of thing because it's such a common model.
Ian_Lavery:
Yeah, I think all the big companies want that. You know, if you're Spotify, you just, you got to have the features. Uh, you don't want people to see 3.6 gigabytes when they go to download your app. There's like a sticker shock thing
Tj_Vantoll:
Yep.
Ian_Lavery:
that happens. So it's kind of a funny thing, because it's like when you book an Airbnb and there's all these extra expenses that get reported later, or like a,
Paige_Niedringhaus:
Hehehe
Tj_Vantoll:
Hehehehe.
Ian_Lavery:
or a flight where you get like the info later. It's sort of like that. It's like reduce the sticker shock and then we'll show you the expenses after.
Tj_Vantoll:
Yeah. So you also have an article in here about writing a podcast transcription server, which I'm
Paige_Niedringhaus:
Mm-hmm.
Tj_Vantoll:
struggling to pronounce for
Ian_Lavery:
Yeah
Tj_Vantoll:
some reason, which is a fascinating idea that I think like, I know when we were talking before the show too, we've done transcriptions on videos. I'm sure there's other people that, or call centers is another example, right?
Ian_Lavery:
Mm-hmm.
Tj_Vantoll:
Things that you want to transcribe. So
Paige_Niedringhaus:
Mm-hmm.
Tj_Vantoll:
does that use Cheetah or Leopard, or how does that work?
Ian_Lavery:
Yeah, so it uses Leopard, because we actually have the ability to take a whole file, like an hour-long podcast, and transcribe it from start to finish. And the reason I came up with that as an idea to demo our technology is, I've listened to podcasts for years, and so often on a long-running show, I'm sure on this show, you get the, hey, have we talked about that? Did we talk about this? I feel like we talked about this. And having show notes to go back to is probably a really helpful thing. I was also thinking of doing a next phase of the article where I actually make a podcast searchable. So far I made it transcribable and basically stored the text representation, but once you have the text representation, you can make it searchable. And then you can start being like, oh, when did I say this? And then it will just pop up the episode you said it in. So it was just kind of an idea I came up with because I see a lot of people using Leopard on a server to basically hook into an event that's happening somewhere, whether it be an RSS feed that's updated, or,
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
you know, yeah, like a new audio file or video is uploaded and it hooks into that event, it runs it through Leopard and then stores it in a database. I thought that was like kind of a universal use case. Like it's just so, it seems like a fundamental part of the web to like have something like that in a server.
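The hook-transcribe-store-search loop described here might look something like the sketch below. Everything in it is an assumption for illustration: `Episode`, `Transcriber`, `TranscriptIndex`, and `onFeedUpdated` are invented names, the transcriber is a synchronous stand-in for a real engine call, and the "database" is an in-memory map.

```typescript
// Hypothetical shape of an item from a podcast feed.
interface Episode {
  id: string;
  title: string;
  audioUrl: string;
}

// Stand-in for a call into a speech-to-text engine; a real one would likely be asynchronous.
type Transcriber = (audioUrl: string) => string;

// Split text into lowercase words.
function toWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((word) => word.length > 0);
}

// Stores transcripts and a word -> episode ids index so transcripts can be searched.
class TranscriptIndex {
  private transcripts = new Map<string, string>();
  private postings = new Map<string, string[]>();

  has(episodeId: string): boolean {
    return this.transcripts.has(episodeId);
  }

  add(episodeId: string, text: string): void {
    this.transcripts.set(episodeId, text);
    for (const word of toWords(text)) {
      const ids = this.postings.get(word) ?? [];
      if (ids.indexOf(episodeId) === -1) ids.push(episodeId);
      this.postings.set(word, ids);
    }
  }

  // Return ids of episodes whose transcript contains every word in the query.
  search(query: string): string[] {
    const words = toWords(query);
    if (words.length === 0) return [];
    let result = (this.postings.get(words[0]) ?? []).slice();
    for (const word of words.slice(1)) {
      const ids = this.postings.get(word) ?? [];
      result = result.filter((id) => ids.indexOf(id) !== -1);
    }
    return result;
  }
}

// Event handler: transcribe and index any episode not seen before; return the new ids.
function onFeedUpdated(feed: Episode[], index: TranscriptIndex, transcribe: Transcriber): string[] {
  const added: string[] = [];
  for (const episode of feed) {
    if (index.has(episode.id)) continue; // already transcribed on an earlier update
    index.add(episode.id, transcribe(episode.audioUrl));
    added.push(episode.id);
  }
  return added;
}
```

The search here matches episodes containing all the query words in any order, not exact phrases, which is enough to answer "which episode did we talk about this in?"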
Paige_Niedringhaus:
Yeah, I mean, it would be so useful. And it would help, I think, everybody, from people who just want to reread part of a podcast if
Ian_Lavery:
Hmm.
Paige_Niedringhaus:
they're looking for something specific instead of having to just kind of hop through trying to figure out where it was that that useful bit of information was.
Ian_Lavery:
Well, and, and you can, you can think too, like you can deliver these, like, let's say you attached it to your podcast. You can like deliver these transcripts along with your podcast, because if you have
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
the server hook in, transcribe it, and then deliver the transcript along with the podcast, suddenly you've got a follow-along-with-the-transcript podcast.
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
So these sorts of things are useful for like auto captioning, uh, like videos or audio as well for like accessibility.
Tj_Vantoll:
It's accessibility. It's also like marketing. People like it for SEO purposes too, because
Paige_Niedringhaus:
Mm-hmm.
Tj_Vantoll:
you know, audio, Google can't index audio,
Ian_Lavery:
That's
Tj_Vantoll:
but if you
Ian_Lavery:
right.
Tj_Vantoll:
have a transcription, it absolutely can.
Paige_Niedringhaus:
even better.
Ian_Lavery:
Yeah, a hundred percent correct. Yeah. Search engines aren't very good at indexing audio.
Paige_Niedringhaus:
I'm sorry.
Ian_Lavery:
So you just have to plaster the text somewhere.
Tj_Vantoll:
Can you recognize different speakers? Because that's the other thing about a transcript,
Ian_Lavery:
Mm.
Paige_Niedringhaus:
Mmm.
Tj_Vantoll:
right? Is knowing who's talking. Do you have the ability internally, even if you don't know names, obviously, but can you say like, this is voice one, voice two?
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
Yeah, so actually we're working right now on a, I'll give you guys the scoop. We're working right
Tj_Vantoll:
Yeah.
Ian_Lavery:
now on a speaker identification system that will basically be able to tell people apart. Because yeah,
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
when you think of something like doing like a Zoom meeting, if you want like to have meeting notes.
Tj_Vantoll:
Yeah.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
It would be really useful to have like, this came from this person, this came from this person, this came from this person. And you can
Paige_Niedringhaus:
Right.
Ian_Lavery:
use different, I mean, zoom obviously has the ability to know where the audio is coming from. So it can kind
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
of just label it. But if you have an anonymous audio stream with a bunch of different voices, that's challenging, because you don't know. You just have to base your assumptions on the character of the voice.
Paige_Niedringhaus:
right?
Ian_Lavery:
Who's, who's different? So that's actually a problem we're working on right now. And I mean, that's useful for not only speaker labeling, but also speaker verification. So like, if we wanna voice activate something that only responds to your voice,
Paige_Niedringhaus:
Mmm.
Ian_Lavery:
that's also
Tj_Vantoll:
Yeah.
Ian_Lavery:
like another use case for it.
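As a rough illustration of telling voices apart by "the character of the voice", the sketch below assumes each audio segment has already been turned into a numeric voice embedding by some model (that step is hypothetical and not shown). It then labels segments "voice 1", "voice 2", and so on by cosine similarity. This is a generic technique, not a description of Picovoice's system, and `labelSpeakers` and its threshold are invented for the example.

```typescript
// Cosine similarity between two equal-length vectors; 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Assign a 1-based voice label to each segment embedding.
// A segment joins the most similar known voice if similarity >= threshold;
// otherwise it becomes the reference for a new voice.
function labelSpeakers(embeddings: number[][], threshold: number): number[] {
  const references: number[][] = [];
  const labels: number[] = [];
  for (const embedding of embeddings) {
    let best = -1;
    let bestScore = threshold;
    for (let s = 0; s < references.length; s++) {
      const score = cosineSimilarity(embedding, references[s]);
      if (score >= bestScore) {
        best = s;
        bestScore = score;
      }
    }
    if (best === -1) {
      references.push(embedding); // first time we hear this voice
      best = references.length - 1;
    }
    labels.push(best + 1);
  }
  return labels;
}
```

The same similarity check could serve verification: compare a new embedding against one enrolled voice and accept only if it clears the threshold.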
Paige_Niedringhaus:
That would be cool.
Tj_Vantoll:
Well, Ian, this has been a blast. Is there anything that you wanted to discuss today that we have not gotten to at all?
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
Um, no, I think, I think, uh, we covered, uh, we covered a lot here. Yeah.
Tj_Vantoll:
Yeah, excellent. So then why don't we move into our picks. And Paige, do you wanna kick us off?
Paige_Niedringhaus:
Sure. So my pick is going to continue the trend that I started last week, which was Star Trek. As many of you who have been listening for a while know, I've been on the Star Trek journey through Next Generation and forward. So most recently, I've begun watching Star Trek: Lower Decks, which is their animated series. And I think there's maybe two, maybe three seasons of it. But it is in the style of Rick and Morty, and it is
Ian_Lavery:
Bye.
Paige_Niedringhaus:
the funniest Star Trek that I've ever seen, to the point where I'm actually laughing, which is unusual for anything animated, but it is really good. There's a lot of references to other Star Trek franchises. So if you are familiar with Next Generation or Voyager or Enterprise, they throw in all sorts of little jokes that are related to those characters. So I would definitely recommend it. It's as family friendly as the rest of the Star Trek franchise is. And it's also got a much bigger dose of humor than most of them do. So if you're looking for something that's quick 20, 25 minute episodes, I would definitely say it's a good one.
Tj_Vantoll:
Excellent. I still have not gotten into the Star Trek world. I've had it recommended several times, but I feel like it's such an, like, you can't casually wade into it, right?
Paige_Niedringhaus:
Hahaha. It's an undertaking.
Tj_Vantoll:
Like you kind of have to, yeah.
Ian_Lavery:
But it's fine.
Tj_Vantoll:
So, uh, my pick this week is going to be The Great British Bake Off, which I think was
Ian_Lavery:
Oh
Tj_Vantoll:
a
Ian_Lavery:
nice.
Tj_Vantoll:
previous
Paige_Niedringhaus:
I love that show.
Tj_Vantoll:
pick of yours, Paige. I started off watching it because I just wanted to know what it was about, right, just that sort of thing. And then next thing I knew I'd watched a few episodes and I didn't even really understand why, but.
Ian_Lavery:
It's so comforting, that show. Like
Paige_Niedringhaus:
It is.
Ian_Lavery:
there's just something so positive and warm about it.
Tj_Vantoll:
It's strangely compelling. Like
Ian_Lavery:
Yeah.
Paige_Niedringhaus:
Mm-hmm
Tj_Vantoll:
I can't even understand why I ended up watching it, but it's quite good. I think Netflix has like five seasons or so. I mean, I don't know how much I'm gonna watch, but it's a good thing to just have on when you're not sure what to do. It's just comforting, nice to have on in the background. So
Paige_Niedringhaus:
Mm-hmm.
Tj_Vantoll:
I've been sucked into that as well.
Ian_Lavery:
That's funny. My wife and I like literally just started watching that like a few weeks ago. And for the same reason like, why are people talking about this so much? And
Paige_Niedringhaus:
I'm sorry.
Ian_Lavery:
yeah,
Tj_Vantoll:
Yeah.
Ian_Lavery:
now we're like, you know, shouting at the screen that that's not a good bake. Look at the
Tj_Vantoll:
Yeah,
Ian_Lavery:
crumb
Paige_Niedringhaus:
Bruh!
Ian_Lavery:
on that.
Tj_Vantoll:
it's amateur hour. Yeah.
Paige_Niedringhaus:
Yeah, no.
Tj_Vantoll:
Excellent. Ian, what picks do you have for us?
Ian_Lavery:
Yeah, so I think last time I brought Mandy, a really gnarly horror film. So I figure I'll, uh,
Tj_Vantoll:
Mm-hmm.
Ian_Lavery:
back up and maybe do something a little different this time and actually go with something tech related. So we've been working with Mixpanel recently, which is an amazing service. It's really helped us, because we're trying to add some analytics to our website and our console, but custom analytics that allow us to basically track when somebody enters the website, what they interact with and for how long, and develop our own metrics based on the code we actually put in the website.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
And Mixpanel is amazing at this. What they basically do is they're just, all they do is they say, hey, we're just gonna take events and we're gonna represent them in a whole bunch of different ways. You can filter, you can form funnels, you can show user flow, you can take a user and actually watch where they go on the website and stuff. It's super, super helpful for us. We've actually been, we've had a total crush on them since we started working with their product because
Tj_Vantoll:
Yeah
Ian_Lavery:
they just, not only is their UI like so nice to work with, but it's just made our life. We were thinking of building this for... Basically, we wanted to add analytics to our website, but Google Analytics and that sort of general analytics were just not
Paige_Niedringhaus:
Yeah.
Ian_Lavery:
enough. We needed like really specific ones.
Paige_Niedringhaus:
Right.
Ian_Lavery:
And we were gonna build it ourselves. And then we stumbled upon Mixpanel and it was like, oh my God, this saved us. Like, we would have had to start a new company to make what they made. And it's just so helpful. So definitely, for any web developer out there that wants to add custom analytics, Mixpanel's really, really helpful.
Paige_Niedringhaus:
Awesome.
Tj_Vantoll:
Excellent. Well, Ian, this has been amazing. My last question for you: if people want to follow you, keep up with you, where are the best places to go to do that?
Ian_Lavery:
Yeah, I mean, let's see. I don't have like professional socials out there, but I do have, I am on Medium as Ian Lavery, so you can read any articles I put up there.
Paige_Niedringhaus:
Mm-hmm.
Ian_Lavery:
You can check out my bands, Fellow Kids and Sleep Circle. Check them out. Yeah.
Tj_Vantoll:
No, excellent. That's great. We'll get those in the show notes. And yeah, thanks for joining us today. This is a great chat.
Ian_Lavery:
Yeah, thanks for having me. This is great.
Tj_Vantoll:
Cool. All right, everybody. Well, until next week.
Paige_Niedringhaus:
See you then.