Wednesday, October 29, 2025

Raiza Martin on Building AI Applications for Audio – O’Reilly

Generative AI in the Real World

Audio is being added to AI everywhere: both in multimodal models that can understand and generate audio and in applications that use audio for input. Now that we can work with spoken language, what does that mean for the applications that we can develop? How do we think about audio interfaces: how will people use them, and what will they want to do? Raiza Martin, who worked on Google’s groundbreaking NotebookLM, joins Ben Lorica to discuss how she thinks about audio and what you can build with it.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Timestamps

  • 0:00: Introduction to Raiza Martin, who cofounded Huxe and formerly led Google’s NotebookLM team. What made you think this was the time to trade the comforts of big tech for a garage startup?
  • 1:01: It was a personal decision for all of us. It was a pleasure to take NotebookLM from an idea to something that resonated so widely. We realized that AI was really blowing up. We didn’t know what it would be like at a startup, but we wanted to try. Seven months down the road, we’re having a great time.
  • 1:54: For the 1% who aren’t familiar with NotebookLM, give a short description.
  • 2:06: It’s basically contextualized intelligence, where you give NotebookLM the sources you care about and NotebookLM stays grounded to those sources. One of our most common use cases was that students would create notebooks and upload their class materials, and it became an expert that you could talk with.
  • 2:43: Here’s a use case for homeowners: put all your user manuals in there.
  • 3:14: We have had a lot of people tell us that they use NotebookLM for Airbnbs. They put all the manuals and instructions in there, and users can talk to it.
  • 3:41: Why do people need a personal daily podcast?
  • 3:57: There are a lot of different ways that I think about building new products. On one hand, there are acute pain points. But Huxe comes from a different angle: What if we could try to build very delightful things? The inputs are a little different. We tried to imagine what the average person’s daily life is like. You wake up, you check your phone, you travel to work; we thought about opportunities to make something more delightful. I think a lot about TikTok. When do I use it? When I’m standing in line. We landed on transit time or commute time. We wanted to do something novel and interesting with that space in time. So one of the first things was creating really personalized audio content. That was the provocation: What do people want to listen to? Even in this short time, we’ve learned a lot about the amount of opportunity.
  • 6:04: Huxe is mobile first, audio first, right? Why audio?
  • 6:45: Coming from our learnings from NotebookLM, you learn fundamentally different things when you change the modality of something. When I go on walks with ChatGPT, I just talk about my day. I noticed that was a very different interaction from when I type things out to ChatGPT. The flip side is less about interaction and more about consumption. Something about the audio format made the types of sources different as well. The sources we uploaded to NotebookLM were different as a result of wanting audio output. By focusing on audio, I think we’ll learn different use cases than the chat use cases. Voice is still largely untapped.
  • 8:24: Even in text, people started exploring other form factors: long articles, bullet points. What kinds of things are available for voice?
  • 8:49: I think of two formats: one passive and one interactive. With passive formats, there are a lot of different things you can create for the user. The things you end up playing with are (1) what is the content about and (2) how flexible is the content? Is it short, long, malleable to user feedback? With interactive content, maybe I’m listening to audio, but I want to interact with it. Maybe I want to join in. Maybe I want my friends to join in. Both of those contexts are new. I think this is what’s going to emerge in the next few years. I think we’ll learn that the types of things we will use audio for are fundamentally different from the things we use chat for.
  • 10:19: What are some of the key lessons to avoid from smart speakers?
  • 10:25: I’ve owned so many of them. And I love them. My primary use for the smart speakers is still a timer. It’s expensive and doesn’t live up to the promise. I just don’t think the technology was ready for what people really wanted to do. It’s hard to think about how that could have worked without AI. Second, one of the most difficult things about audio is that there is no UI. A smart speaker is a physical device. There’s nothing that tells you what to do. So the learning curve is steep. So now you have a user who doesn’t know what they can use the thing for.
  • 12:20: Now it can do so much more. Even without a UI, the user can just try things. But there’s a risk in that it still requires input from the user. How do we think about a system that’s so supportive that you don’t have to come up with how to make it work? That’s the challenge from the smart speaker era.
  • 12:56: It’s interesting that you point out the UI. With a chatbot you have to type something. With a smart speaker, people started getting creeped out by surveillance. So, will Huxe surveil me?
  • 13:18: I think there’s something simple about it, which is the wake word. Because smart speakers are triggered by wake words, they’re always on. If the user says something, it’s probably picking it up, and it’s probably logged somewhere. With Huxe, we want to be really careful about where we believe consumer readiness is. You want to push a little bit but not too far. If you push too far, people get creeped out.
  • 14:32: For Huxe, you have to turn it on to use it. It’s clunky in some ways, but we can push on that boundary and see if we can push for something that’s more ambiently on. We’re starting to see the emergence of more tools that are always on. There are tools like Granola and Cluely: They’re always on, looking at your screen, transcribing your audio. I’m curious: Are we ready for technology like that? In real life, you can probably get the most utility from something that’s always on. But whether customers are ready is still TBD.
  • 15:25: So you’re ingesting calendars, email, and other things from the users. What about privacy? What are the steps you’ve taken?
  • 15:48: We’re very privacy focused. I think that comes from building NotebookLM. We wanted to make sure we were very respectful of user data. We didn’t train on any user data; user data stayed private. We’re taking the same approach with Huxe. We use the data you share with Huxe to improve your personal experience. There’s something interesting in creating personal recommendation models that don’t go beyond your usage of the app. It’s a little harder for us to build something good, but it respects privacy, and that’s what it takes to get people to trust.
  • 17:08: Huxe may notice that I have a flight tomorrow and tell me that the flight is delayed. To do so, it has had to contact an external service, which now knows about my flight.
  • 17:26: That’s a good point. I think about building Huxe like this: If I were in your pocket, what would I do? If I saw a calendar that said “Ben has a flight,” I can check that flight without leaking your personal information. I can just look up the flight number. There are a lot of ways you can do something that provides utility but doesn’t leak data to another service. We’re trying to understand things that are much more action oriented. We try to tell you about weather, about traffic; these are things we can do without stepping on user privacy.
  • 18:38: The way you described the system, there’s no social component. But you end up learning things about me. So there’s the potential for building a more sophisticated filter bubble. How do you make sure that I’m ingesting things beyond my filter bubble?
  • 19:08: It comes down to what I believe a person should or shouldn’t be consuming. That’s always tricky. We’ve seen what these feeds can do to us. I don’t know the right formula yet. There’s something interesting about “How do I get enough user input so I can give them a better experience?” There’s signal there. I try to think about a user’s feed from the perspective of relevance and less from an editorial perspective. I think the relevance of information is probably enough. We’ll probably test this once we start surfacing more personalized information.
  • 20:42: The other thing that’s really important is surfacing the right controls: I like this; here’s why. I don’t like this; why not? Where you inject tension in the system, where you think the system should push back: It takes a little time to figure out how to do that right.
  • 21:01: What about the boundary between giving me content and providing companionship?
  • 21:09: How do we know the difference between an assistant and a companion? Fundamentally the capabilities are the same. I don’t know if the question matters. The user will use it how the user intends to use it. That question matters most in the packaging and the marketing. I talk to people who talk about ChatGPT as their best friend. I talk to others who talk about it as an employee. On a capabilities level, they’re probably the same thing. On a marketing level, they’re different.
  • 22:22: For Huxe, the way I think about this is which set of use cases you prioritize. Beyond a simple conversation, the capabilities will probably start diverging.
  • 22:47: You’re now part of a very small startup. I assume you’re not building your own models; you’re using external models. Walk us through privacy, given that you’re using external models. As that model learns more about me, how much does that model retain over time? To be a good companion, you can’t be clearing that cache every time I log out.
  • 23:21: That question pertains to where we store data and how it’s passed off. We opt for models that don’t train on the data we send them. The next layer is how we think about continuity. People expect ChatGPT to have knowledge of all the conversations you have.
  • 24:03: To support that, you have to build a very durable context layer. But you don’t have to imagine that all of that gets passed to the model. A lot of technical limitations prevent you from doing that anyway. That context is stored at the application layer. We store it, and we try to figure out the right things to pass to the model, passing as little as possible.
  • 25:17: You’re from Google. I know that you measure, measure, measure. What are some of the signals you measure?
  • 25:40: I think about metrics a little differently in the early stages. Metrics in the beginning are nonobvious. You’ll get a lot of trial behavior in the beginning. It’s a little harder to understand the initial user experience from the raw metrics. There are some basic metrics that I care about, such as the rate at which people are able to onboard. But as far as crossing the chasm (I think of product building as a series of chasms that never end), you look for people who really love it, who rave about it; you have to listen to them. And then the people who used the product and hated it. When you listen to them, you discover that they expected it to do something and it didn’t. It let them down. You have to listen to these two groups, and then you can triangulate what the product looks like to the outside world. The thing I’m trying to figure out is less “Is it a hit?” but “Is the market ready for it? Is the market ready for something this weird?” In the AI world, the reality is that you’re testing consumer readiness and need, and how they’re evolving together. We did this with NotebookLM. When we showed it to students, there was zero time between when they saw it and when they understood it. That’s the first chasm. Can you find people who understand what they think it is and feel strongly about it?
  • 28:45: Now that you’re outside of Google, what would you want the foundation model builders to focus on? What aspects of these models would you like to see improved?
  • 29:20: We share so much feedback with the model providers. I can provide feedback to all the labs, not just Google, and that’s been fun. The universe of things right now is pretty well-known. We haven’t touched the space where we’re pushing for new things yet. We always try to drive down latency. It’s a conversation; you can interrupt. There’s some basic behavior there that the models can get better at. Things like tool-calling, making it better and parallelizing it with voice model synthesis. Even just the diversity of voices, languages, and accents; that sounds basic, but it’s actually pretty hard. Those top three things are pretty well-known, but it will take us through the rest of the year.
  • 30:48: And narrowing the gap between the cloud model and the on-device model.
  • 30:52: That’s interesting too. Today we’re making a lot of progress on the smaller on-device models, but when you think of supporting an LLM and a voice model on top of it, it actually gets a little bit hairy, where most people would just go back to commercial models.
  • 31:26: What’s one prediction in the consumer AI space that you would make that most people would find surprising?
  • 31:37: A lot of people use AI for companionship, and not in the ways that we imagine. Almost everyone I talk to, the utility is very personal. There are a lot of work use cases. But the growing side of AI is personal. There’s a lot more area for discovery. For example, I use ChatGPT as my running coach. It ingests all of my running data and creates running plans for me. Where would I slot that? It’s not productivity, but it’s not my best friend; it’s just my running coach. More and more people are doing these complicated personal things that are closer to companionship than enterprise use cases.
  • 33:02: You were supposed to say Gemini!
  • 33:04: I love all of the models. I have a use case for all of them. But we all use all the models. I don’t know anyone who only uses one.
  • 33:22: What you’re saying about the nonwork use cases is so true. I come across so many people who treat chatbots as their friends.
  • 33:36: I do it all the time now. Once you start doing it, it’s a lot stickier than the work use cases. I took my dog to get groomed, and they wanted me to upload his rabies vaccine. So I started thinking about how well it’s protected. I opened up ChatGPT, and spent eight minutes talking about rabies. People are becoming more curious, and now there’s an immediate outlet for that curiosity. It’s so much fun. There’s so much opportunity for us to continue to explore that.
  • 34:48: Doesn’t this indicate that these models will get sticky over time? If I talk to Gemini a lot, why would I switch to ChatGPT?
  • 35:04: I agree. We see that now. I like Claude. I like Gemini. But I really like the ChatGPT app. Because the app is a good experience, there’s no reason for me to switch. I’ve talked to ChatGPT so much that there’s no way for me to port my data. There’s data lock-in.
