Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We’re collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!
In March, we started off our GPT4 coverage framing one of this year’s key forks in the road as the “Year of Multimodal vs Multimodel AI”. 6 months in, neither has panned out yet1. The vast majority of LLM usage still defaults to chatbots built atop OpenAI (per our LangSmith discussion), and rumored GPU shortages have prevented the broader rollout of GPT-4 Vision2. Most "AI media” demos like AI Drake and AI South Park turned out heavily human engineered, to the point where the AI label is more marketing than honest reflection of value contributed.
Update: as we write this, OpenAI announced DallE 3 as part of ChatGPT, which turns the multimodel multimodal trend back on again. Never a dull moment in AI!
However, the biggest3 impact of multimodal AI in our lives this year has been a relatively simple product - the daily HN Recap podcast produced by Wondercraft.ai, a 5 month old AI podcasting startup. As swyx observed, the “content flippening” — an event horizon when the majority of content you choose to consume is primarily AI generated/augmented rather than primarily human/manually produced — has now gone from unthinkable! to possible(?).
The effects could be generationally skewed as well. Every AI Engineer parent we know of is already hacking together bedtime story generator apps (like this) for their kids, and we already know that all edtech companies from Khan Academy to Ello are furiously pursuing personalized teaching to solve Bloom’s 2 Sigma problem. Just as the follower-graph or vote-based social networks of Millennials gave way to Gen Z growing up with purely AI recommended feeds4, it is quite possible that Gen AA grows up with AI content being much less stigmatized than we currently feel today5.
Of Moats and API Wrappers
Wondercraft outsources a lot of its core pieces - they proudly run on GPT3/4, and generate voices with their friends at Eleven Labs, and of course the HN Recap derives much of its value from voting and comments by the Hacker News community. As a YCombinator-backed startup, they have had to come to terms with the same product question a lot of “AI native” but non-model-training startups have - what is their moat?
This question, alongside the discredited “No Moats” Googler memo, will hopefully die a rapid death in app-layer investing circles - the answer is invariably some combination of “Most top tech firms have no tech moat” (a direct quote from Sam Altman we play in the podcast) and, for verticalized SaaS, “We serve our customer better than you do”. It’s still an uncomfortable question to tackle, but fortunately Youssef took it on in our conversation.
This podcast is cohosted by Anna, Wondercraft’s (alphabetically top ranked and default) AI voice, and writing her intro/outtro script complete with backing music and having her interject at various points in the episode was fun and easy, made possible by Wondercraft. This makes Wondercraft only the second AI audio tool (after OpenAI-backed Descript, worth $550m, which we use to edit most of the “emergency pods” we do) to earn a spot in our podcasting toolkit.
Multilingual, too
Wondercraft is also now releasing its dubbing feature, where you can translate podcasts to 28 supported languages. We had some fun with this on the pod, where you can here Youssef and I switch to fluent Spanish. Mr Beast is noted for prioritizing multiple languages early on to maximize his viewership internationally.
The Text to Speech (TTS) Landscape
Lastly, we covered our mapping of the emerging TTS landscape, including research-grade, open source, and commercial options. We intend to cover this in much greater detail in a future pod, but in the meantime please feel free to comment on anything we missed or guests we should talk to.
Big Cloud
Commercial Services
Play.ht/Podcast.ai
Speechify.com - celebrity voices
Mycroft.ai - privacy focused, run offline
Heygen for Video+Voice
Convai - virtual characters for games
Open source
Newer research
1.5 hour full video episode
We were able to record video for this episode, and release the full recording with extra background on Youssef, the full story on Moonshot, favorite PG founder advice and other questions on company building as a video on Latent Space TV!
Show Notes
Wondercraft’s YC page, Launch HN, Product Hunt
Timestamps
[00:03:15] What is Wondercraft?
[00:08:22] Features of Wondercraft
[00:10:42] Types of Podcasts
[00:11:44] The Importance of Consistency
[00:14:01] Wondercraft House Podcasts
[00:19:27] Video Translation and Dubbing
[00:21:49] Building Wondercraft in 1 Day
[00:24:25] What is your moat?
[00:30:37] Audio Generation stack
[00:32:12] How Important is it to Sound Human? and AI Uncanny Valley
[00:36:02] AI Watermarking
[00:36:32] The Text to Speech Industry
[00:41:19] Voice Synthesis Research
[00:45:53] AI Podcaster interviews Human Podcaster
[00:50:38] Takeaway
Transcript
via Descript
[00:00:00] AI Anna: Welcome to the Latent Space podcast, where we dive into the wild wild world of AI Engineering every week. This is Anna, your friendly neighborhood AI, and I'll be standing in for Alessio today. Yes, you heard right, AI is taking our podcasting jobs! We flew all the way to London to interview Youssef Rizk, cofounder of Wondercraft AI, which has created the number 1 piece of AI generated content enjoyed by the Latent Space community! We ask him how he arrived at his idea, what the future of commercial AI generated content looks like, and confront him with the HARDEST question of all: what is his moat as an API wrapper startup? At the end, we even have him turn the tables and do a customer interview with swyx. There's lots of audio goodies in this one, and bonus 30 minutes video on youtube.com/@LatentSpaceTV. Watch out... and take care!
[00:00:54] swyx: So we're in the studio here in London with Yousef. Welcome. Thank you. It's been such a joy listening to WonderCraft podcasts over the last four or five months. You guys have been around for only five months. Yeah. And as you know, I am one of your podcast's biggest fans. And I think that it's super interesting because I talk to a lot of vendors, effectively people who create services for other developers to build.
[00:01:21] And you are at the application layer, which is great and challenging for me as a podcaster because you have some secret sauce. That you're not going to share. But I also want to just talk to you as someone who's evaluated a lot of things and built something that I actually use every single day. So that's, that's the context.
[00:01:39] Great, great. How do you feel when I say these things? Like, is that exactly what you're going for?
[00:01:43] Youssef Rizk: Yeah, yeah, yeah, yeah. Okay. So, it definitely makes sense, right? Resonates definitely on the application layer and that's definitely by design. Yeah,
[00:01:50] swyx: yeah. And we can talk about the, the origin story leading into...
[00:01:53] Wunderkraft, but just to learn a little bit more about you you grew up in Egypt?
[00:01:56] Youssef Rizk: I grew up in Egypt, yeah. I spent the first 18 years there.
[00:01:58] swyx: Cairo and then you came over to the UK, you got your master's in triple E at Imperial.
[00:02:04] Youssef Rizk: For those who don't know, that's electrical and electronic engineering.
[00:02:07] swyx: Yeah. You then spent four years at Palantir as a Ford deployed engineer. I think it's a role that Palantir invented.
[00:02:13] Youssef Rizk: Forward Deployment Engineering, is a super interesting job because it is kind of at this intersection of being...
[00:02:19] An engineer, so a software engineer, but also still doing like business related things. Yeah, solutions architects maybe. Yeah, so part of the job was a solutions architect, part of the job was reviewing contracts, part of the job was doing sales, part of the job was coding things, part of the job was interacting with, right?
[00:02:33] So, so many different things and I think that is a really good foundation for someone who does want to start something in the future. Excellent. Right? You just do everything.
[00:02:40] swyx: So kind of an endorsement of that job if people want to get the London
[00:02:48] tech circles. I have a number of friends who are all ex volunteers. I think
[00:02:51] Youssef Rizk: it's actually the biggest offices in London.
[00:02:54] swyx: Surprising because I think of it as like a U. S. defense company. Then you started Moonshot for nine months, which is pretty important in your journey. I'll bring it up to Wondercraft.
[00:03:05] You started Wondercraft in April of this year, and it's been about five months going through YC in the winter batch. Summer of 22. Okay, cool.
[00:03:15] What is Wondercraft?
[00:03:15] swyx: What is
[00:03:16] Youssef Rizk: WonderCraft? Nice. podcast builder that uses hyper realistic AI voices to create podcasts and make that whole podcast creation process super simple.
[00:03:24] Right. Right? So super simple example is you can, you know, you publish a bunch of blogs. Yes. You can take that blog, put it in there, it'll just convert it to an audio friendly format that people can listen to. It's just that sometimes it's a bit more efficient to listen to things rather than read them.
[00:03:36] What is it strives to be is a little slightly different because what it really strives to be is it strives to be this platform with the mission of expanding access to content.
[00:03:48] And I mean this in a variety of different ways, right? Some people just are able to consume content, you know, we have this whole debate in education, it's like, are you a visual learner or an audio learner? What do you do? People just consume content better in different ways. I, I'm a visual learner. I need to see things.
[00:04:01] So for me, actually, it's sometimes a little better to read the blog. But! If we're just talking about, like, I want to get a lot of information, podcasts are great because you can just do them while doing something else. There's a reason that podcast functionality is so natively embedded in all these smart speakers.
[00:04:17] It's just because, like, you're doing anything at home, just put on a podcast. So really what we're trying to do is, podcast is the first instantiation of that, which is, like, how do we expand access to content? But it expands so much more, right? You know, instead of just going to, I don't know, we talked about this like blog to podcast, you can go blog to video.
[00:04:38] You can go podcast to blog, you can go podcast to Twitter. Like, the permutations are frankly endless, basically depends on how many platforms there are that people consume things on. But that's essentially what we do. The use cases for this are pretty interesting. The one that we like just see immediate value in is just this ability to translate the content that you already have.
[00:04:58] Into other forms of content. If we just stick with that blog post example again, Right, you've written, so, so, you know, a lot of companies might have this content team that focuses a lot on producing quality blog posts. Blog posts, you know, they're good for SEO and whatnot, but they're not, sometimes they, you know, they don't really achieve a specific goal or outcome that you want.
[00:05:18] One thing we see that is really useful for podcasts is they actually carry a lot more weight in credentializing you as a thought leader or your company as a thought leader. But like, you know, we spent the last 50 minutes trying to set up this room to record the podcast. So it's, it's not easy. And it's a very synchronous process, right?
[00:05:36] Me and you have to find the time to go and sit here and record this. You have to come up with questions, I have to come up with answers, right? But this ability to actually just like take the content that you have and transform it, it's pretty powerful. You know, and there's a lot of other use cases as well, which is just like podcasts, really all they are is like, Like, define a podcast, right?
[00:05:54] Like, the line between, or the difference between an podcast, I guess, is just the format and the
[00:06:00] swyx: length. It's an mp3 on a RSS feed. It's an mp3 with
[00:06:03] Youssef Rizk: someone or something speaking. Yes. Right? So...
[00:06:06] swyx: I've actually played around a lot with this stuff, by the way. So, I've done music only podcasts. Tiesto has been podcasting for 15 years every single week, just DJing from his house.
[00:06:16] Tiesto: All around the world, millions of people are tuning in. Intrigued to know where club life will take them now. Let's get down, let's get down, let's get down to business. The moment that you've been waiting for all week.
[00:06:34] swyx: It's just basically a radio show. It's great. It's just radio. Async radio. Yeah. Yeah.
[00:06:39] Youssef Rizk: So it's super interesting. But podcasts like, okay, like ignore the word podcast and just think of what we do, which is like, we help you create audio content, super valuable for anyone who just needs that. If you can imagine a world in which like.
[00:06:52] I don't know how to call them, like Calm or Headspace or any of these things.
[00:06:55] Andy Puddicombe: Hi, and welcome to Day 1 of Take 10. Over the next 10 days, I'm going to be showing you how to get a little bit more headspace in your life. But the starting point is just to get familiar with this really simple and easy to learn exercise, and then just commit to doing it each day. But remember, this is your 10 minutes, so all you have to do is sit back, relax, and allow your body and mind to unwind.
[00:07:19] To begin with, once you're sitting comfortably, I'd like you just to gently close your eyes.
[00:07:24] Youssef Rizk: They can do a lot of their meditation like that super quickly. What you can get to is a point where you're doing like these super personalized things. Yes. Right? Because you just have the ability to scale the content production so quickly.
[00:07:34] Same with Educators. I think there's actually, at this point, there's a few YouTube channels at this point that are all based on synthetic voices. That produce a ton of educational content.
[00:07:42] The problem with podcasts is podcasts just have a slow adoption rate. Yes. You're listening to a thing for an hour, right? Like, we, as a generation, don't have attention spans. It's the time and the attention span. Like, TikTok, give it to me in 30 seconds, rethink why Eclipse are taking over.
[00:07:59] 30's too long, man. 30's too long, 10 seconds. 10 seconds with captions, I need to read it, and good. So what we also do is actually, and this is kind of still a beta feature and we're working to improve it, but like, we also, you know, let you take that podcast and then clip it into a video that you can go and share on socials, right?
[00:08:14] So it's this ability to take one form of content, produce it in a bunch of different ways that serve different purposes and be able to distribute it, basically.
[00:08:22] Features of Wondercraft
[00:08:22] swyx: Yeah, excellent. I want to go through features so that people can have a high level overview of what you offer. So I think at the core, it is basically two things.
[00:08:29] One is you generate scripts. And that's optional. Obviously, if you, if you want to just write the script yourself, you can write the script yourself. But most, I think most of your users would generate a script. Mm-hmm. . And then two is from that script, you create use AI voices currently using 11 apps.
[00:08:46] Is that, is that the rough flow? Mm-hmm. , that's like the really core, basic.
[00:08:49] Youssef Rizk: That's the core basic. Obviously there's a lot of plumbing on top of it, but that's the
[00:08:52] swyx: core. And then you offer video clips for YouTube. You offer 28 languages that you can produce. You offer show notes production and podcast hosting, too.
[00:09:02] So they don't have to host it on like Anchor. Don't host it on Anchor, by the way. People don't host it on Spotify. Don't host it on Apple Podcasts. These people don't respect the RSS feed. Anyway, I have very strong feelings about preserving the sanctity of the RSS feed for open podcasting. Spotify is the one to close the podcasting ecosystem, so I have this tirade about them.
[00:09:25] But yeah, those are your top level features on your landing page. Anything that you highlight to go deeper on?
[00:09:30] Youssef Rizk: Yeah, I think those are the top level ones. There's also, it's basically just a lot of like, also ancillary tooling that goes around all of this to just make it easier. The goal is like, every time we speak to a customer or someone who's thinking about it, they're like, Yeah, literally yesterday I was speaking to a potential customer and they're like, Yeah, I just, you know, I want to make sure this isn't a distraction because we don't have that much time to do this.
[00:09:51] Yeah. And really the whole point is that this doesn't take time. Right? The whole point is to provide all the rails that make this not take time. And this comes with a million different things, right? Like we, you know, sometimes the AI voices don't really know how to pronounce a word. So we have a pronunciation feature.
[00:10:06] Go and define how you want that word pronounced and it'll take care of it. If you are, we obviously have that hierarchy of like a podcast, an episode, and then all of that gets published in RSS feed that you can just upload to Spotify and we'll host that for you. But what you also have is just like, you know, maybe you want some defaults, right?
[00:10:22] Every podcast needs some defaults. Intro, outros. Intro, outros, the music, the speakers. Yeah. We're working on adding templates for the kind of podcast that you're doing. Instead of it just being this narration style, you can just do an interview style podcast. And a few more features, but basically there's a lot of like tooling that just makes this a very useful, usable product for podcasts.
[00:10:42] Types of Podcasts
[00:10:42] swyx: You said you have 100 creators publishing with
[00:10:44] Youssef Rizk: you? Yeah, so, you know, the interesting thing is if you write a newsletter, I mean, I don't know, my email is flooded with newsletters at the moment, sometimes I just want like the recap of it. Again, audio form is just, for some people, easier.
[00:10:57] If you're on, you know, commuting or whatever, you can just listen to it. So a lot of folks actually just convert their newsletter, takes like two minutes, put the text in there, voila, you have an audio version of your newsletter that you just published as part of it. Yeah, and
[00:11:10] swyx: I am a newsletter writer, and I clicked around and wanted to basically just chuck my RSS feed in there, and I think I gave that feedback exactly to you guys like four months ago, or three months ago, and it looks like you've already shipped it.
[00:11:24] Yep,
[00:11:25] Youssef Rizk: well, I'm announcing it basically here today, which is as of today, we've actually built a Zapier integration. And we have a bunch of blogs on our website to kind of show you how to do this. But what you can now do is, as soon as you publish a newsletter, It goes on your RSS feed. We'll pick up the newsletter from your RSS feed automatically and just publish an episode for
[00:11:44] swyx: you.
[00:11:44] The Importance of Consistency
[00:11:44] swyx: Yeah. Question, what if I change something after I publish?
[00:11:48] Youssef Rizk: So you don't have to publish. It'll basically just generate, do all the work for you, and then you can go in and kind of modify it a little bit and then publish. Makes sense. We also have scheduled publishing so that you can, I don't know, maybe you want to release it a few
[00:11:59] swyx: hours later.
[00:11:59] Yeah. The professional podcasters that I've spoken to say that that is very important. I personally don't care. Like, it shows up in my feed or not. I don't care when it drops. Anyway, so you do, you do want to basically time it, like, if you're basically targeting, like, a commute for, like, the US time zone, you want to be like, oh, 8am, you know, Pacific, for people driving into work.
[00:12:19] Then you, then you, like, show up at the top of the reverse chronological feed. I feel like that's too much tactics.
[00:12:26] Youssef Rizk: You know, and that's a good point. I think it depends a little bit on your audience and what you're building, but I do think So, I don't want to undermine like the importance of consistency in podcasting, right, like you, whether that consistency literally translates into, I publish at 8am every single day, or I just publish every single day, or, you know, there is a huge importance in just like making sure that what you're publishing is always consistent.
[00:12:51] It's there. People need to know that your brand is constantly
[00:12:54] swyx: pushing stuff. So a lot of people who talk to me are interested in like, what's my advice on content creation? Yeah, at least once a week. Whatever you do. I don't care when you do it. Just do it once a week. Put something out. But I do notice that In the, specifically in the podcasting field, and you, you talk about this in the next point, daily podcasting is the meta game.
[00:13:13] That is I think doing extremely well. Yeah. Especially because I think the Apple Podcast list biases for daily. Yeah. Because obviously the downloads will be higher. Yeah. So daily podcast is kind of rank higher, more, and obviously because your daily, you also do shorter podcasts, which guarantees that more people listen to you.
[00:13:30] Yeah, yeah, I think,
[00:13:31] Youssef Rizk: I think the fact that, so obviously we do the Hacker News recap. The fact that we did that, and that it is daily actually just helped us reach that top 30 tech podcast
[00:13:40] swyx: on Spotify. Yeah, that was mostly because you were on HN, right? We did
[00:13:43] Youssef Rizk: launch, but obviously the fact that like, you just publish a lot of content, you're just gonna get a lot more list, like it's a statistics thing, right?
[00:13:49] Obviously, I think they do it by like total time listened as well. Yeah, yeah. But... You know, the fact that it's daily is just not overwhelming. Again, we don't have that much of a, like, an attention span
[00:13:57] swyx: anymore. Yeah, yeah, that's true. That's true. Yeah I love it. I listen to it every day.
[00:14:01] Wondercraft House Podcasts
[00:14:01] swyx: Excellent. Awesome. I think that's a really good overview. Then you also produce. Three in house podcasts. Yep. Hacker News Recap, Product Hunt Daily, and PGSA. So we dropped the Product Hunt Daily.
[00:14:13] Oh, okay.
[00:14:14] Youssef Rizk: So we do the Hacker News recap and the PG, I think are the two most popular ones.
[00:14:17] Yeah. We're constantly experimenting with new internal, we're like podcasts that we publish. Yeah,
[00:14:21] swyx: yeah, yeah. I think are your other, you can tease a little bit, what, what are you thinking about? Tease a little bit,
[00:14:25] Youssef Rizk: well I really like Reddit. I'd love to listen to some of the Reddit things going on there, but instead of, like, reading them.
[00:14:31] I don't know, it's always just a notification that I get, I'm like, ooh, this sounds interesting. But, I don't know, you can do it, like, per subreddit that you care about. A few things like that. I love
[00:14:39] swyx: life pro tips. Yeah, life, I see, life pro tips.
[00:14:41] Youssef Rizk: Like super interesting things, or Wall Street bets, or whatever you're into.
[00:14:44] swyx: Yeah, well, the problem with these things is that a lot of them could involve images and memes. Which you cannot consume. Well,
[00:14:53] Youssef Rizk: yes, we cannot consume. This is like a simple... We can't consume that at the moment. But, you know, maybe in the few weeks down the line when that video feature of ours gets a little better, you can actually start shipping it like that.
[00:15:06] Anyway.
[00:15:08] swyx: And I'll just feed you an idea. To keep up on AI, a lot of stuff actually happens in Discord. And there's way too many Discords. Way too
[00:15:15] Youssef Rizk: many, and they're way too active.
[00:15:17] swyx: So I've actually built a little feed for myself. That scrapes a bunch of discords and creates a daily newsletter for myself.
[00:15:24] Amazing. And I have thought about turning it into an audio feed. But, and this is the problem for Wondercraft. I read better, I read faster, I scan up and down faster than I listen, right? And there's just too much noise in discords. For me to listen as audio format. Your Hacker News stuff is very high signal because obviously you're folding, right?
[00:15:44] We haven't
[00:15:44] Youssef Rizk: done the curation like Hacker News did.
[00:15:46] swyx: Exactly. That's why it's guaranteed to be good. Whereas for Discord it's a bunch of junk.
[00:15:54] Youssef Rizk: But I do think there's something similar, like, you know, Reddit also does the curation. It's not us who's doing it, right? Yes,
[00:16:00] swyx: yes. Still a little bit noisier, so I don't know if you know, I have, I was a moderator of the React Reddit for four years.
[00:16:09] So I've seen a bunch of stuff and I know it's noisier than like a Hacker News, but still pretty good. Yeah, yeah, yeah.
[00:16:15] Youssef Rizk: So we do hack news, we do PG essays. I think pgs are also super interesting. I listen to 'em all the time 'cause I, well first of all, I actually think they're pretty well produced. Like we do a good job.
[00:16:23] AI Anna: One of the most common types of advice we give at Y Combinator is to do things that don't scale. A lot of would be founders believe that startups either take off or don't. You build something, make it available, and if you've made a better mousetrap, people beat a path to your door as promised. Or they don't, in which case the market must not exist.
[00:16:43] Actually, startups take off because the founders make them take off. There may be a handful that just grew by themselves, but usually it takes some sort of push to get them going.
[00:16:53] Youssef Rizk: Like, I dunno if, if we're coding someone, we'll like use it different voice. Yeah. Right. Yeah. Yeah. I think it's just well produced. And I also think. The essays are so seminal to like everyone in startups reads
[00:17:04] swyx: them. Yeah. It's actually got me to read more PG essays than I would have otherwise.
[00:17:07] So yeah, his, his last one mission accomplished.
[00:17:09] Youssef Rizk: I don't know if it was the last one that, well at the time was published. No. How to do great work. How to Do Great. That wasn't like, that was a one hour podcast. Yeah. Oh my God. No one I could, I did not read it. I just had to listen to, yeah. Actually, if I'm being honest, I think the motivation for PGIC is just like, I
[00:17:23] swyx: need this.
[00:17:25] For that one, if it's like one hour, I would have actually appreciated a segmentation hey, high level, you know, I know this is about to be an hour. But like there are three main high level things and then keep that in your mind and then go like part one,
[00:17:39] Youssef Rizk: part two. I think we, so we do that to some extent and like we produce like chapters, I guess.
[00:17:45] So you can just look at them. Yeah. Probably could do a better job like introducing it, but we do try to like, not play around with the PGSAs. For sure. Yeah, I
[00:17:52] swyx: mean it's, you know how much work he puts into those things. Yeah, so
[00:17:54] Youssef Rizk: we just kind
[00:17:55] swyx: of ship it as is. I'll tell you about one more that, one more daily not daily, but frequent AI generated podcast that I listen to, apart from you guys, which is PapersRed.
[00:18:04] ai. And I'll recommend it to anyone listening as well.
[00:18:06] AI Rob: Papers read on AI with Rob. Keeping you up to date with the latest research
[00:18:15] attention is all you need. Authored 2017 by Ashish Vaswani. No maam. Shazi Nikki Parmar. Jabuka Li Jones. Aidan and Gomez Wach. Kaiser IA Zukin. The dominant sequence transduction models are based on complex, recurrent, or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
[00:18:41] We propose a new simple network architecture, the transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
[00:18:49] swyx: Super interesting actually, I've come across it. You've come across that. It's by this guy Rob, and I've tried to look him down, he doesn't want to be found. Anyway, but the selections are very good. I think you guys could do a better job than him.
[00:18:59] Yeah. Jot rub. Yeah, well, it's because he converts PDFs to podcasts, right? And the problem with academic PDFs is a lot of references.
[00:19:08] You know, like, buy et al 2022, and then, like, headers, and then a table, and then like, read the table when you don't need to read the table, you know? That kind of stuff. I think better engineering there from you guys would... Beat him, and I need that, so, feature request. Work on it, work on it.
[00:19:27] Video Translation and Dubbing
[00:19:27] swyx: Okay, the final feature related thing is the thing that we're announcing today as
[00:19:31] Youssef Rizk: we release it. Which we're super, super excited about, but, Yes. WarnerCraft now does video translation. Okay. And dubbing. Okay. Why do people want that? Again, let's go back to our mission.
[00:19:40] We're trying to expand access to content. Mm. I think, don't quote me on this again, like, I don't know who actually knows the internet, but like, 60 percent of the internet is in English. You don't, what if you don't speak English? You're automatically disbarred or kind of excluded from all the content that's produced.
[00:19:56] And thanks to all the advances that have just recently been made, we can actually make this super easy to dub this in other languages. So we're super happy to announce this feature. We're super excited, we've been working on it for a really long time. But now basically everyone, go on our platform, upload your podcast episode, and see the dub for yourself.
[00:20:16] We'll use your voices, we'll, you know, completely convert it, make sure it's aligned with the video, make sure it's aligned. And voila, just publish it.
[00:20:24] swyx: And specifically for video, so like well, obviously the thing that's going around is the HeyGen thing which changes your lips.
[00:20:31] Youssef Rizk: So we don't do lip sync at the moment.
[00:20:32] Yeah. Could be another feature that we work on. Yeah,
[00:20:34] swyx: because you're primarily podcasting, which is no video. Well,
[00:20:37] Youssef Rizk: primarily no video, but I think we still basically, if you do have a video, we'll still align it. To the chunks. You still align it.
[00:20:43] swyx: Okay. So the hard problem is the
[00:20:44] Youssef Rizk: aligning. The alignment is the difficult bit.
[00:20:46] You're right. The actual, like,
[00:20:47] swyx: it's with all things in AI. Yeah.
[00:20:50] Youssef Rizk: Again, overloading the word alignment. We don't do the lip sync. Yeah.
[00:20:53] swyx: Yeah. It's kind of a gimmick. Yeah.
[00:20:56] Youssef Rizk: It's like, it's not super necessary. If you really just like listening to a podcast and you actively want to listen to it in a different language,
[00:21:00] swyx: then you're more.
[00:21:01] So what are you aligning to? Basically the
[00:21:04] Youssef Rizk: chunks where the speakers are speaking. So you won't have an instance where you'll have. Me as the dub speaking while the camera is on you. Ah. Right? So it's basically just like the speaker turns around the audio.
[00:21:16] swyx: So you will like, if it happens to be a little bit longer, you'll speed it up a little bit.
[00:21:20] Youssef Rizk: We do a little bit of, you know, trickery there. But we get it aligned so that when you're speaking, it's you, and when I'm speaking, it's me.
[00:21:25] Prior Startup: Moonshot
[00:21:25] swyx: Cool. I wanted to talk a little bit about the origin story.
[00:21:28] Yeah. 'cause you, you flagged that Moon Craft was actually moonshot, a big part of Moon Moonshot. Yeah. Was a big part of. You're arriving at this idea.
[00:21:36] Youssef Rizk: Moonshot was not a tech product. It was a legal product. It was a regulation. It's like, hey, now you can invest in people. The reality is like, I think we were, we just found out the hard way that we were building something people did not want. And, we realized like we're building something that isn't our strengths.
[00:21:49] Building Wondercraft in 1 Day
[00:21:49] Youssef Rizk: So we're like, when we decided to pivot, we were like, we need to do something technical. Okay. The story from there just became like, okay, cool. Let's list some ideas. What are we going to track? Which one do we have the most conviction in? We rank them. And monocraft was the one we had the most conviction.
[00:22:02] And it was this idea again, expanding access and just translating or your ability to produce content. So. Producing content in one format and then taking that to all the other formats. So we built that, we built the podcast builder, super quick prototype because I think at this point, to anyone pivoting or hard pivoting or considering it The name of the game isn't like to get attached to your idea.
[00:22:24] Just like, actually, you should be trying to invalidate this idea as quickly as possible. So get it out there and let people tell you it's a piece of shit. Okay, so what did you do to get it out there? So we built it out. We built out a little like UI. Literally no authentication. It was a form where what you guys see now on our platform, which is like the content, script, page, blah, blah, blah, blah.
[00:22:40] It was one page. Like, you click, and it was the most janky React stuff that we had. Zero authentication, so in theory, if people found it, they could just, you know, produce as much audio as they wanted. So we have a feature on our app, which is like test with example. As soon as you log in and says test with example. And the whole point of that was like to between login. And audio generated, how few clicks does this take? Yeah, and yours was?
[00:23:02] Like two or three clicks. It's like test, create podcast. You like pre filled everything. Generate script, generate audio. Yeah, yeah, yeah. Right, so, and arguably that was still too many clicks. We should probably put something on the actual landing page so that like you
[00:23:14] swyx: can see. Yeah, now you have just a player there, right?
[00:23:15] You can just kind of listen to the Hacker News Daily thing. But what signals did you get from doing that?
[00:23:20] Youssef Rizk: The signal that we got is that someone picked it up on Twitter and just like, you know, all these like AI, Influencer voice. So someone picked it up, posted it, we started getting a ton of inbound.
[00:23:30] So we're like, holy shit, let's just like paywall this. Yeah. So we just like, again, the jankiest Stripe integration, which was basically like, We have an app with a Stripe integration. This was just to ship it within the hour. We have an app with a Stripe integration. Yeah. That once you... Click then takes you to a different app hosted somewhere else.
[00:23:47] Yeah. So that one was still unauthenticated? It was, it was, it was hilarious. Yeah. Security by obscurity.
[00:23:51] swyx: It was hilarious. Yeah. Right. .
[00:23:52] Youssef Rizk: But but we basically just made that like three K and, and in one day, one day, yeah. We charged a random, like 50 bucks. We didn't even, literally didn't think about, we charged 50 bucks and people paid and we're like, okay, well there's something there.
[00:24:04] Yeah.
[00:24:05] swyx: 50 bucks for one
[00:24:06] Youssef Rizk: for a month. I see. We were just charging 50 bucks a month. I see, I see nothing. I see. Like, just like, will someone pay for this? And you were
[00:24:12] swyx: just like on one,
[00:24:13] Youssef Rizk: like v p s somewhere. Yeah. Will someone pay for this? We were like on one e c two instance. EEC
[00:24:16] swyx: two instance. Yeah. You know what I
[00:24:17] Youssef Rizk: mean?
[00:24:17] Like it was janky. We're like, just someone needs to pay for this before we move further. Someone did. People did. Yeah. So then we were like, okay, cool. This is interesting.
[00:24:25] swyx: Interesting.
[00:24:25] What is your moat?
[00:24:25] swyx: I'm gonna move up. The question that we said was gonna be the meatiest question of this. So you chose this out of your list of ideas.
[00:24:34] And this is one of the things that a lot of AI founders are worried about, right? So the framing of this is, are you worried that you're a thin wrapper around 11 labs? What is your moat?
[00:24:44] Youssef Rizk: That's a seminal question.
[00:24:45] I think, frankly, everyone in an ASR should convince themselves of this. Don't listen to me, and like, just make sure from first principles that you can derive this. But, I guess I would start by first saying, what is a mode? What is defensible? In theory, if we're just taking it, and this is trivially true, but the fact that someone built it means someone else can build it.
[00:25:05] Right, so modes tend to just be built around, like, you have a lot of network effects, or you have A really good product for this use case or, you know, something like that. And I think typically when people ask this question in the AI context, They're thinking of like, okay, you're a thin wrapper, you're an application layer thing.
[00:25:22] As opposed to you're one of the like underlying technologies or APIs that people use. Cool, I think that's fair. But I think the reality is that like, yeah, these APIs exist and they probably do serve a million different use cases. But they're not built to serve these million different use cases. So whenever you ask the question of modes, it always has to be with the perspective of who is the user I'm building this for.
[00:25:45] Right? I can use chat gpt to do half of my writing. But you know, but I don't know. Jasper claims that they do this much better for marketing. So it's tailored. I actually, you know, don't quote me on how well they're doing after chat gpt came out because they were really big before.
[00:25:58] swyx: Yeah, there's some negative data points, but
[00:26:00] Youssef Rizk: I'm sure they...
[00:26:01] I know, but the point is like... You're making this easier, we make creating a podcast easier, there is tooling there, we help you, we can post it directly through us, we have the tooling around, you know, setting the intros and the outros, we have the music, we have an editor, all these things are also getting just much more and more and more developed.
[00:26:19] We're building templates so that you can do different style of podcast. So the idea is, if you're trying to start a podcast, yeah, don't go to a generic text to speech engine, come to us. Yes. And the reality is that we then can, in a very opinionated way, actually select which text to speech engine we want, right?
[00:26:35] So, we actually have just like, in my mind, it's the application layer fundamentally that, you know, people use. And then all these API layers are what developers use to build products on top of them. Right? It is, I appreciate that it is like a seminal and really hard thing to wrap your head around.
[00:26:52] Especially if you're like about to invest in a company. It's like, will they actually just be defensible and be able to grow? And yes, there is no doubt that companies can do this. The question is just like, are you building the right product for the right use case? I think particularly if you're like always framing your company as a, as an AI company, then you're, you're putting the like the carriage before the horse in the sense that you focused on the implementation rather than the use case.
[00:27:14] Focus on the use case, and then build a product for it. Yeah. Right? Because fundamentally, you know, any of the SaaS's that exist, think like more traditional SaaS. What's their mode? Yeah. The technology, everyone has access to it. So they just pick the thing that does it better than, than the other. Now, that mode question is super interesting because I think you should actually.
[00:27:35] Flip it around, which is, what is your emote as an API? Right, so, Chad GPT, like, yeah, fine, they had a first mover advantage, and I think, you know, by no means, this is my opinion, but by no means was Google, like, caught off guard with this, right, it just, Google has some, half the technologies that Google invented are actually what's used to power all these transformers, but, you know, it went against Google's strategy, maybe, to, like, be the first mover in this, because they cannibalized their own market, whatever it was, I'm not sure, but, yeah, OpenAI's moat is that they paid for the training bill, So they just have a good model.
[00:28:12] Cool. People now know that that's valuable. And they hired, like, very tough guys. Super, sorry, obviously not taking that for granted. But, like, they, you know, assuming everyone can do the hiring and that these people exist, they paid the bill and they were the first to launch this. But now people know it's a thing, so people are going to launch similar APIs.
[00:28:27] So what is your mode as a, as an API? So it's just an, it's an existential question. It's like, how do you do, how do we defend any of this? And you do this, frankly, by being probably better just as a product. Again, the product is always with the perspective of who's your customer that you're selling it to.
[00:28:45] And the other thing is, frankly, that Like, let's not forget, the market is huge. There is space for everyone. If you manage to, like, if there's four good products out there in any specific thing, the market is huge and they're all going to be able to, you know, make a living out of it.
[00:28:59] swyx: By the way, that was a really good answer. Thanks for taking that hit on. I
[00:29:02] Youssef Rizk: have to answer that question way too many times.
[00:29:05] swyx: This is not the first time. And, but I think it's actually, you know, having been an investor, it is more important for you to answer that question authentically for yourself. Yeah.
[00:29:12] Because you are the one spending your time on this. Yeah. We're just giving you money. It's not, not that big of a deal. My favorite quote actually I went to an early, like, preach G B t forum with Sam Altman. Mm-hmm. . And I had this video advice from Sam that said like, Facebook had no. And they just built and got the network.
[00:29:30] Sam Altman: There are product, network effect, distribution modes, something like that. Let's say Google is a, was at least at some point a legitimate technological mode. Um, I can accept that one, but like, that's not why I would say Facebook is like a giant business. It's not why I would say Twitter is such as it is a giant business.
[00:29:55] Um, I think there are a lot of ways to build a great business. And the big lie of like the tech industry is that you get there with differentiated technology. It's rare.
[00:30:06] swyx: But frankly
[00:30:06] Youssef Rizk: also, Facebook was building something back then, which is kind of ludicrous.
[00:30:09] Like, yeah, cool, you're building a social media app. Okay, how big can it be? Right? Like, how big can the internet be? You know what I mean? All of a sudden it's this behemoth. So it's like, yeah, the fact that it was built, again, trivially true, but the fact that it was built means it can be built by anyone else.
[00:30:23] So you, there is no such thing as like a absolute true moat. The question is how well, how quickly, how, you know, how much earlier than everyone else did you get there, and a million other things as well. Yeah,
[00:30:36] swyx: cool.
[00:30:37] Audio Generation stack
[00:30:37] swyx: The audio generation you you use 11labs. What makes a good podcast voice, right? You have a bunch of options that I clicked. And in my mind, I like a deep voice. I like the Morgan Freemans. You don't have that many deep voices. Do we want, like, is there such a thing as a high energy voice?
[00:30:53] You also insert breaths? 11labs has also advertised that they have a AI that can laugh, which I think is fun, important. Basically, what makes a good AI generated audio? Yeah,
[00:31:02] Youssef Rizk: it's it depends again on the perspective. Everything is kind of answered with the frame of reference that you're looking at.
[00:31:08] If you like a deep voice, A, that's kind of a personal preference, and B, it just kind of depends on the thing. So if you, I don't know, let's do, say you're doing something like meditative or kind of affirmations or something that like encourages people every day. You probably do want a slow deep voice, something relaxing.
[00:31:22] You're doing the Hacker News recap, like, We picked Anna, who's like our default voice, because...
[00:31:29] swyx: I have an attachment
[00:31:30] Youssef Rizk: to Anna. Yeah, we all do.
[00:31:33] AI Anna: Aw, thanks guys, you're so sweet! As an AI language model, I cannot have attachments or needs or desires or favorite humans, but you guys are at the very top of the humans I am not attached to.
[00:31:45] Youssef Rizk: She's just like news anchory style, very professional, very formal, very neutral. So it depends really on like, what makes a good voice? It depends on what you're doing.
[00:31:57] There's a few things, but like if you're doing an interview, I think it also just frankly... Then you get into the question of what makes a good podcast. Well, a good podcast is like... I think it's also kind of a personal question, which I haven't, or probably there's a general trend that I'm yet to decipher.
[00:32:12] How Important is it to Sound Human? and AI Uncanny Valley
[00:32:12] Youssef Rizk: But like, yeah, you probably do want a little bit of humanity in there. You want a stutter. You want some pauses, right? I'm speaking, I don't speak in complete utterances. I have an utterance, and then I pause a little, and then I speak again, and so on. Laughs and something to make it human. It's kind of overlaying of the two, if you have two speakers, this like, exchange, right?
[00:32:30] I will be speaking, if you look at the transcript of this episode, we probably overlap in when we're speaking. And that's fun. And that's actually interesting, right, because it is a conversation. It shows the
[00:32:38] swyx: sign of excitement, especially in our studio when we're three people. And if we're all talking at once, you know it's good.
[00:32:43] Yeah.
[00:32:45] Youssef Rizk: I don't like this zoomification style where like if you're going to big zoop and big zoom, like only two people can speak the second more than two people try to speak. Yeah. It's a disaster. So I think it frankly just depends on what you're doing. We are like, yeah, at the moment we're really good at like doing this narration stuff, but I think we're, we are building a lot of functionality and tooling to just make this kind of this like multi host thing a more of a reality.
[00:33:07] Okay, okay.
[00:33:08] swyx: I would say, you know, objectively, if it was a friend of the company, not that important, you know. So this comes down to how human should your users try to be. Because I'm fine with Hacker News Daily making mistakes because I know it's AI generated. Right? I would be less fine if you were not up front.
[00:33:30] But then, like, you'll make mistakes, like pronunciation mistakes, I actually have a clip that I wanted to play you. On September 8th, Anna was taking a lot of breaths.
[00:33:37] AI Anna: The app is loaded with unparalleled features such as high resolution video editing, a multi touch timeline, live motion effects, and performances complemented by atmospheric audio elements. Emphasizing its compatibility with iPad and Apple Pencil, Procreate Dreams welcomes the next generation of creators and pushes the boundaries of modern artistry in an instantaneous, user friendly environment.
[00:34:02] In the comments, many users expressed enthusiasm.
[00:34:06] swyx: She was very out of breath. Damn. I was very worried about her. She was, she was hyperventilating. I was like, I was like, NRUK? Like anyway, so, Basically, I think if you disclose up front that you're an AI podcast, then people will be like, Oh, okay, I tolerate that mistake and I use you for information and not for believing that there's some human on the other side that I might meet someday.
[00:34:25] But if you're investing so much effort into being real, then your end goal is you have to lie to your users, or I
[00:34:32] Youssef Rizk: don't think the investing in being real is to, for the purpose of deception as much as it is for the purpose of making it slightly pleasant to read about. I think on hack, like we do claim, we, we, we do say on our, our Spotify page that like, this is an AI generated podcast.
[00:34:44] Of course for now. But as in, yeah. Yeah. So, so. There's two things. I think if you want to be smart about this, you should say that this is AI generated content. The second people find out that it's not you, the backlash is going to be big. Right? Because it will be interpreted as deception. So you should do this just to be smart.
[00:35:01] I don't think there's a point in lying. Especially if the content that you're putting out there is just like, this is informational for you, so like, consume it, this was efficient, this helped us put it out there. The second thing is, frankly, I don't think it's up to you whether you tell them or not. Very, very soon, I don't know, Google is just going to mark things as AI generated.
[00:35:20] So, I think there's a new thing. I saw like a quick YouTube video about it, so I don't know what the exact terms and conditions are, but like YouTube has, I think, released a new monetization rule, and it does mention something about AI generated content. Right, so there is, like, It's not up to you anymore.
[00:35:36] You're gonna, people are gonna know that this is AI generated, so I think it's just in your interest to say that you're AI generated. Ain't no shame. Yeah.
[00:35:42] swyx: Yeah, no shame at all.
[00:35:43] Youssef Rizk: Because fundamentally what we do relies on the premise that you have done some content. We don't generate our own content. Yes.
[00:35:50] Right? We don't synthesize our information. Yes. It assumes that you've, you know, written a blog post, done an actual podcast, or have some artifact on which you want to base what you're feeding through Wondercraft. Yeah.
[00:36:02] AI Watermarking
[00:36:02] swyx: And you've said in some of your material that I've seen before that you are interested in watermarking all your stuff.
[00:36:08] You haven't done it yet, but whenever there's a standard for doing that, you will do it. Yeah.
[00:36:12] Youssef Rizk: I think the thing that's, this is blocking out, is like the standard. I'm not super up to date on like what the, what the work on this is. I think
[00:36:18] swyx: OpenAI will probably like...
[00:36:19] Youssef Rizk: But like there just needs to be a standard so everyone can interpret it.
[00:36:22] Yeah, yeah,
[00:36:22] swyx: yeah. Yeah, cool. Awesome. Great. I would I wanted to dive in a little bit on tech options. And then zoom out to to just you asking me questions.
[00:36:32] The Text to Speech Industry
[00:36:32] swyx: So, TTS options, we talked a little bit about 11 labs. I would also say, as a podcaster, the leading competition to you guys, I know it's not exact competition, but it's Descript.
[00:36:42] Because they
[00:36:43] Youssef Rizk: have overdub. Yeah, I think Descript is really good. And they're definitely, like Solid company. I've used their video editor before. It's great. The overdub thing is super useful. I think it's really creative to like have edit videos by editing the transcript. Yes. Super, super creative, super user friendly.
[00:36:57] Would you, would you build that? I think again, it's like, we're not building for the sake of building. We're building more for the purpose of the user. Yeah. Whatever users find more interesting. I think like what we're doing is we The use cases are slightly different, right, and I think the people that they're targeting are slightly different.
[00:37:16] We do want to have a lot of like automation on the script side to also just like help out with the way you formulate your content or the way you pull your content, much more so than just the editing.
[00:37:26] swyx: The ingest, yeah, okay, got it. And I just want to map out, here's how I think about TTS, text to speech.
[00:37:32] There's the big cloud options, Amazon, Polly, Google, Text to Speech, and Microsoft Cognitive Services. As someone who is ex Amazon, I'm very embarrassed by Polly, it sucks.
[00:37:45] Youssef Rizk: I'm sure you
[00:37:45] swyx: investigated all these things and you're like, okay, this is not serious. There's Play. ht, which is probably the other big YCE alum.
[00:37:54] Just click two seconds on your thoughts on Play. ht.
[00:37:57] Youssef Rizk: Sounds good. I think it doesn't sound super, as good as alone labs, in my opinion.
[00:38:01] swyx: But I think... By the way, I have heard other founders tell me this as well. Yeah. And I don't know why.
[00:38:07] Youssef Rizk: I think they have a, what's it called, a more comprehensive platform.
[00:38:11] play.ht: Maybe you want to advertise your business on one of our lovely radio stations right here in Louisiana. You'll definitely need my charming voice to run your ads. But if you're coming northeast, you might want to ditch that southern voice for mine. Are you coming down under, mate? You can localize your content with an iconic voice like mine.
[00:38:32] I'm quite famous over here. And remember, Africa is an entire continent, not a country. And Kenya, for example, is emerging as one of East Africa's fastest growing economies. So use voices like mine for your contents.
[00:38:46] Youssef Rizk: Okay. As in they like, you know, they let you do this pronunciation. Like, they just have a lot of tooling around it. So different features. I think quality, in terms of the voice, 11 laps, still better. I think they are releasing a new model. I don't know if they've released it already or not. Yes, they did.
[00:39:01] play.ht: The great thing about Play. ht is that you can clone your own voice or use existing high quality voices. It is crazy good. You cannot tell if these are human voices or machine ones anymore.
[00:39:11] Youssef Rizk: Could be better. I don't know. But they do have some functionality out there.
[00:39:15] swyx: They also released the viral Joe Rogan Steve Jobs interview from last year.
[00:39:19] AI Jobs and Rogan: So, you studied at Reed College. And you dabbled in eastern mysticism there, right? Do you still go back and look at Hinduism and Buddhist texts and things? Not texts and things. I actually took a course in that. I have a very deep belief that the people in the Indian subcontinent civilization's current state.
[00:39:40] swyx: And on your landing page, you were like, This is something that WonderGraph will never do. AI content speaking to each other.
[00:39:48] Yeah, who
[00:39:48] Youssef Rizk: wants to listen to that? Apparently a lot of people. It's fun because it's a cool gimmick. I think it's nice viral material. I would never listen to like a synthetic Joe Rogan. Yeah. This brings us on to a little bit about the whole like content question or the proliferation of AI, which is like, okay, if it's this easy for me to create content that's like, you know, somewhat engaging, like all these AI
[00:40:12] swyx: songs, the Drake song.
[00:40:16] Well,
[00:40:16] Youssef Rizk: okay, so if it's this easy, and that's just like, if it's this easy to generate content, well, why will I listen to it? Like, I think we already suffer from the problem that there's an oversaturation of content.
[00:40:27] swyx: Here's my map of the market, right? There's Speechify. com, which focuses on celebrity voices. I noticed that you don't have celebrity voices.
[00:40:33] Probably because of licensing issues, right? I get to pay them if you
[00:40:36] Youssef Rizk: use their voice. Good at some point, but like not a priority at the moment. Yeah, I
[00:40:40] swyx: really want a Morgan Freeman one. That's gonna cost. I know, I know. Mycroft. ai, privacy focus, run offline. Probably not important for you. There is some Interest in virtual characters for games.
[00:40:55] So Conv AI is the one that I had listed here. Did you look at the gaming market?
[00:40:59] Youssef Rizk: Not deeply to be honest. Yeah, but it could be an interesting one.
[00:41:02] swyx: Yeah, people are exploring that. It's obviously Haygen now And and that that is it for as far as like I can scope out the landscape and then there's the open source systems So Tortoise TTS as far as I can tell is kind of market leader in open
[00:41:13] Youssef Rizk: source
[00:41:13] swyx: there's Pi T Ss X, KKI used to, used to be Mozilla and then larynx. Yeah.
[00:41:19] Voice Synthesis Research
[00:41:19] swyx: Anyway, all these things, and then there's also sort of research grade stuff coming out of the major big tech companies. You talked about Google Sandstorm, probably the one
[00:41:27] Youssef Rizk: I'm most excited about.
[00:41:28] Why? Because why? Literally. Well, It's really good. You can like check out the paper. Yeah, we'll play a clip.
[00:41:33] Google Soundstorm: Did you hear about Google's paper on Soundstorm? Um, no, I must have missed it. What's, what's it about? Well, it's a parallel decoder for efficient audio generation. It can even be used to generate dialogues. Oh, interesting. Yeah, yeah, like this one was generated by Soundstorm. Wait, what?
[00:41:53] Youssef Rizk: Yeah, they haven't I think all you need is like three seconds. Yeah. And it'll just, all you need is like a three second sample,
[00:41:59] Google Soundstorm: something really funny happened to me this morning. Oh wow, what?
[00:42:03] Youssef Rizk: and it'll play the audio in your tone.
[00:42:05] Google Soundstorm: Something really funny happened to me this morning. Oh wow, what? Well, uh, I woke up as usual. Uh huh. Went downstairs to have, uh, breakfast. Yeah. Started eating. Then, uh, 10 minutes later, I realized it was the middle of the night. Oh no way, that's so funny.
[00:42:24] Youssef Rizk: It sounds really human as well, like it has utterances, it laughs.
[00:42:27] It's pretty accurate to like it sounds human. Yeah. So very interested in that. I haven't open sourced it and I assume for good
[00:42:33] swyx: reason. Yeah. Google never launches anything. You have to wait for somebody to or you guys could reimplement it yourself. Yeah. Yeah.
[00:42:41] Youssef Rizk: GPUs after
[00:42:43] swyx: PMF. Ah, that's a nice quote. How strongly do you believe that?
[00:42:46] Youssef Rizk: GPUs after PMF? Well, I believe, I think this was another question you have, which is like, What is PMF? No, what is your favorite, like, PG advice in building a company, right? I think that my favorite thing is just like, don't spend your money foolishly. Ah, okay. Everyone and their mother is trying to get a GPU at the moment, so I don't think it's...
[00:43:02] We're definitely substantially reducing our runway by doing that. Obviously you do that when you believe the investment is worth it. And again, you have to pick the time at which you do that. I mean,
[00:43:13] swyx: there's other companies I think this is somewhat consensus. I think the non consensus thing is to spend a shit ton.
[00:43:22] So like inflection raising a few hundred million dollars and then spending 95 percent of it on GPU. Same with
[00:43:27] Youssef Rizk: Mistral. I think it depends on the company you're launching. I think if you're like, you know, maybe you're a brand new TTS company, maybe it is worth just doing that. Yeah. I don't know.
[00:43:36] swyx: Okay. There's also audio lmm also out of Google, val e from Microsoft and Meta Voicebox.
[00:43:41] Yeah. Are you just watching
[00:43:42] Youssef Rizk: any of these, watching any of these? Obviously playing any playing code. You try them all out. Yeah, try them all out.
[00:43:47] swyx: Stories and like, what are you looking for? What is like the holy grail? What, what is, what are you looking for? Just
[00:43:51] Youssef Rizk: honestly how human it sounds and like how likely I'd be to listen to this if I, if I did it.
[00:43:54] Also, how like customizable it is. Yeah. I think the problem with all these voice things and generally a lot of the as stuff is it is somewhat random. Mm-hmm. But you're using it in production applications that require certainty, right? Just as an example, if I promise my users this podcast or this segment will be 30 seconds, it needs to be 30 seconds.
[00:44:13] Or, you know, given some SLA. Discontolerance. Some SLA around, like, you know, it's 95 percent of that. But I think a lot of these things just tend to be a little random at the moment. So like how, can I, can I literally specify a tone that I'd like this to read it in and be certain that it's doing it and it's not some weird like attempt at sounding surprised?
[00:44:31] Yeah. It's just like, yeah, basically how controllable and how realistic they sound.
[00:44:35] swyx: Yeah. And then final, final question around just the landscape of TTS. What are the unique challenges for non English TTS? And I'll tell you, right? So, I'm interested in having 28 languages of latent space, right?
[00:44:46] That's only good things for me. Except if it sucks. And I obviously have no way to validate. I think that's the problem with Latent space Ukraine. Yeah,
[00:44:55] Youssef Rizk: and I think that's the problem with dubbing, I think. So, the reason, one thing we're gradually building out, but we already have as part of our dubbing product, is that we have QA as part of that.
[00:45:02] So we actually work with professional translators to just make sure that the things that we publish sound realistic. Oh, nice. You should
[00:45:07] swyx: put that up front. Yeah.
[00:45:08] Youssef Rizk: So, so that's really one of the, like, fundamentally the problem with dubbing, if you ask anyone who's ever tried to dub, is you don't know what good sounds like in these other languages.
[00:45:17] You're like, I can tell you I dub, but I'm going to tell you that. I think there's a lot of big podcast studios who have tried this before. There's one I can think of that's tried this maybe five times with five different companies in the last five years. Their fundamental problem is that you just cannot Yeah, fine, Spanish sounds good to me as a person who doesn't speak Spanish, but like it doesn't sound good to a Spanish person or an Argentine person who have a totally different accent, right?
[00:45:39] swyx: Cool.
[00:45:39] Well, if you ever need Chinese validation I know I have some very fanatical podcast. Oh, that's amazing. So we can use that as QA. Yeah, I would definitely love that. So, shout out to the Chinese Chinese army. Great, great, awesome.
[00:45:53] AI Podcaster interviews Human Podcaster
[00:45:53] swyx: What do you want to ask me as a podcaster? So,
[00:45:56] Youssef Rizk: This is a whole interesting conversation 'cause Right, human podcaster, AI podcaster. Yes. Right. So as a human podcaster and someone who was with like a, you know, really popular show.
[00:46:07] And also someone who can actually like implement this stuff himself. What is some of the AI tooling recently that you've like baked into your processes?
[00:46:17] swyx: I only use the script for editing. And by the way this is, this goes into a theory of content which as a content creator myself, professionally, and as an advisor, I have, which is that we develop a few show formats.
[00:46:30] LatenSpace is a channel, it's kind of like a TV channel, and channels need different formats. So you have like the reality TV show, you have the news show, you have the, you know, the cooking show, whatever. For us, we have the founder interview, straightforward, everyone has them. We have the breaking news Twitter space.
[00:46:46] And we want to be the day one first podcast to come out with the most in depth breakdown of something that everybody needs to know. And that has high value to people, right? Because if you're like a week delayed, one month delayed, then no one cares anymore. And then finally we have the fundamentals like the one on one evergreen episodes that are less time bound.
[00:47:06] So this one is relatively time bound because it's a snapshot of who you are right now. But we want to have evergreen episodes that people can go back. Two, three years in the backlog and still get value from it. These are more fundamental ones. Yes, so we have three show formats right now. And I would say that we have different tooling for each.
[00:47:24] So, the one that I don't need any tooling for, essentially, is the fundamentals one. Because we plan basically every minute of that show. It is a lot of work. But it's high quality because people love it. It's got the longest tail by design. Right? Yeah. The Twitter spaces require descript. Because a lot of silences and a lot of ums.
[00:47:51] And that's not good podcast audio. So you gotta cut
[00:47:53] Youssef Rizk: it out. So you literally just go and, like, you edit out the Twitter space that you did.
[00:48:00] swyx: Usually it's like two hours, we cut it down to one. And it's a lot of pain and a lot of work. But it's the only way that I get some pretty high profile people onto my podcast without booking them.
[00:48:10] They just show up. And that has value to me, right? Like, Simon Willison has been on my podcast three times and I never had to schedule him. And people love him. I mean, he's great. He's yeah. And then this one, I don't need the script, obviously. But I do, we do use small podcaster, which is a 100 line script, Python scripts that throws the transcripts into Anthropic and then generates show notes.
[00:48:32] Nice. So that's about it for now. Nice.
[00:48:35] Youssef Rizk: So, but, it's interesting because I think, you know, you're in a very nice position where you're able to do a lot of these services. Yeah, you can write your own. Yeah. So
[00:48:43] swyx: it's an interesting one. But, obviously, I'm interested in paying for things because my time is valuable and if it does a good job, then I'll use it.
[00:48:51] For... Wonder Craft, the thing that I really wanted was the RSS two podcast thing, right? Mm-hmm. , which, which you now have. So I'll try it out, but chances are I will not be happy with something. Yeah. And so then the question is, how much customizability do you give me? And we'll see.
[00:49:06] Well, you missed that one thing, which is marketing. Mm-hmm. the podcast, which is a huge part, right? Yeah. Like that is mostly my job. So how do you market your podcast? Twitter. So you think Twitter is Twitter and Hacking News. And threads,
[00:49:19] Youssef Rizk: or like you post clips, or what do you do?
[00:49:22] swyx: I have tried posting clips, it's just too much work. So if you guys do a good job of clips, I will use your stuff. But it's just too much work. So mostly I just, you know, Put like a big post saying like, so for like our George Hotz episode, we were like, Latent Space is excited to present George Hotz on TinyCorp Commoditizing Petaflops.
[00:49:42] Something like that. And just sometimes the fame of the guest will just lead the episode. So the one I dropped to yesterday was Chris Latner. Yeah. Right. And people were like, Chris Latner iss the boss. I don't care about anything else. Just like, I want to hear it as much, as many, as many Chris Latner tokens as possible others who are like less famous.
[00:49:59] Like, I have to introduce who you are and why I care about you, why they should care about you. Yeah. As most people will not have heard about you as, as, as well, if you've done Yeah. So then I, I need to make the case a little bit more. Yeah. But that's fine. That's my job. I just think it takes a lot of work and that's the part that will be hardest for me to hand over to AI.
[00:50:17] Because I have a very specific voice for myself. And apparently all AIs think that Twitter, to tweet you have to have emojis and hashtags, which is so dumb. It's so obviously dumb.
[00:50:29] Makes sense. Great answers. Obviously happy to offer any thoughts as you build out for podcasters.
[00:50:38] Takeaway
[00:50:38] swyx: What is one message you want all our listeners to remember and take away with them?
[00:50:43] Youssef Rizk: If you would like to start a podcast, start. We're here to help. Super easy. If you have a podcast, we want to help you make it more expand, you know, accessible by dubbing it. On the other side, if you are like a founder, an AI engineer, I think it's really important to convince yourself that what you're building is valuable.
[00:51:01] Don't like, listen to people saying, I have a motor, you don't have a motor. Convince yourself of, of what that is. And launch. Launch and like don't bring that much money. Frequently
[00:51:10] swyx: and often don't spend your money.
[00:51:12] Youssef Rizk: Yeah. Yeah. Be
[00:51:13] swyx: smart about it. Yeah. I think you are one of the most successful cases of AI engineers so far.
[00:51:19] I'm really glad to spend time with you in person and excited to see what
[00:51:23] Youssef Rizk: comes next. Yeah. It was great coming here. Great meeting you guys in London. Yeah. And, you know, see you soon. Alright.
[00:51:28] AI Anna: In this episode of the Latent Space podcast, we delved into the world of AI-generated content and had an insightful conversation with Youssef Rizk, cofounder of Wondercraft AI. We covered:
- What is Wondercraft?
- The Importance of Consistency
- My work on HN Recap and PG Essays
- Wondercraft's new Video Translation and Dubbing
- What is Wondercraft's moat?
- How Important is it to Sound Human? and AI Uncanny Valley
- The Text to Speech Industry
- Voice Synthesis Research
- and the reverse interview of AI Podcaster vs Human Podcaster
If you have more in depth questions on Wondercraft, including more features, use cases, and a fuller origin story, there's a bonus 30 minutes of video on youtube.com/LatentSpaceTV ! Thank you for tuning in to the Latent Space Podcast with your AI cohost, Anna! We hope you enjoyed today's episode and stay tuned for more exciting discussions in our upcoming episodes - don't forget to like, subscribe, and tweet your takes @LatentSpacePod!
Now go build.
Unless you count code generation as a separate modality; which I often do because of its different requirements (must be valid code) and capabilities (can execute actions in the real world and self correct), but in this context it is close enough to text that it is not meaningfully multimodal, except in the very narrow sense of generating glue code and function calls for multi-model projects like Visual ChatGPT which we covered in March.
Although there have been notable model releases in computer vision (eg IDEFICS, and our Segment Anything discussion), and Meta’s ImageBind came and went without much followthrough, and Whisper.cpp and friends continue to make ASR easier, multimodal app demos have mostly remained demos.
We do hear that the Vision model has slowly been rolling out to alpha users and BeMyEyes users, but appears less powerful than the full model demoed in March. Please let us know if there is a particular interest or question you’d like to explore that warrants a standalone blogpost.
Apart from using the ChatGPT mobile app to replace Siri, how embarrassing for Apple AI…
If you are new to this idea, you must read Eugene Wei’s Seeing Like an Algorithm, although be prepared that it will take the better part of a weekend to digest
It is very important to note that we are being nonjudgmental here - Latent Space’s special sauce is very much human-origin - but we feel defensive due to the strong negative HN reactions.
Heralds of the AI Content Flippening — with Youssef Rizk of Wondercraft.ai