    State Space Models and Real-time Intelligence with Karan Goel and Albert Gu from Cartesia

    June 27, 2024

    Podcast Summary

    • Cartesia's rebellion against Transformers: Co-founders Karan Goel and Albert Gu challenge Transformers' dominance with their fast models S4 and Mamba, showing significant potential in low-latency applications like gaming and voice agents.

      Karan Goel and Albert Gu, the co-founders of Cartesia, are leading a rebellion against the dominant Transformer architecture with their models S4 and Mamba. They have seen significant excitement about their technology in gaming and voice agents, where low latency is crucial. Their fast text-to-speech engine, Sonic, has already shaved roughly 150 milliseconds off typical response times, and the next goal is to claw back the remaining 600 milliseconds. Both Karan and Albert have research backgrounds from Stanford, where they worked on sequence modeling and alternative recurrent models. They became interested in these models because of their elegance and effectiveness across a range of applications, and their most recent model, Mamba, has shown impressive results in language modeling. Albert is now a professor at Carnegie Mellon University, where his research lab continues to explore these questions academically, while Cartesia puts the technology into production. Karan grew up in India in a family of engineers and initially aimed to be a doctor before switching to engineering. He started his PhD with no clear focus, eventually settling on reinforcement learning. His PhD advisor, Chris Ré, was skeptical of the field, leading to an interesting transition period during which they explored various projects together.

    • Google Cloud project collaboration: Collaborating on a Google Cloud project led the speaker and their collaborator to a shared interest in State Space Models, resulting in advancements in their field.

      The speakers' working relationship began with a mundane infrastructure task: managing and expanding disk space on Google Cloud for their lab, where one side's insistence on adding only a terabyte at a time caused some frustration. After that project, the speaker joined Albert's push on S4, initially just to help out ahead of a NeurIPS deadline. The speaker then explained the concept of State Space Models (SSMs), which had drawn their interest because of their sequential processing nature: an SSM processes data one element at a time, updating a belief state as new information arrives. The speaker found this approach fundamental and was inspired by its connections to other dynamical systems. SSMs can be applied to many types of data; the first models they worked on were particularly effective at modeling perceptual signals such as audio and other continuous sensor data, and different variants of SSMs offer different advantages for different kinds of data. What began as a project that simply required the two to work together grew into a shared interest in SSMs, which they went on to apply across data types, leading to advances in the field.
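
      To make the "update a belief state one element at a time" idea concrete, here is a minimal sketch of a linear state space recurrence in NumPy. The dimensions and matrices A, B, and C below are made up for illustration; actual S4 and Mamba layers use learned, carefully structured parameterizations.

      import numpy as np

      # Hypothetical sizes; real S4/Mamba layers use learned, structured parameters.
      state_dim, input_dim = 16, 4
      rng = np.random.default_rng(0)
      A = 0.9 * np.eye(state_dim)                              # state transition: how old information decays
      B = 0.1 * rng.standard_normal((state_dim, input_dim))    # how a new input enters the state
      C = 0.1 * rng.standard_normal((input_dim, state_dim))    # how the state is read out

      def ssm_step(h, x):
          # One recurrent update: fold input x into the fixed-size belief state h.
          h_next = A @ h + B @ x
          y = C @ h_next
          return h_next, y

      h = np.zeros(state_dim)
      signal = rng.standard_normal((100, input_dim))           # stand-in for a stream of sensor samples
      for x in signal:
          h, y = ssm_step(h, x)                                # constant cost per step, regardless of history length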

    • Model architecture selection: The choice of machine learning model architecture depends on the nature of the data and the desired trade-offs. Transformers excel at text but struggle with raw data and scale quadratically with sequence length. Alternative models like state space models can be efficient but lack exact retrieval capabilities. Consider factors like natural fit, computational efficiency, and trade-offs.

      The choice of machine learning model architecture depends on the nature of the data and the desired trade-off between modeling capability and computational efficiency. The discussion highlighted the strengths and limitations of transformers and other models on different types of data, such as raw waveforms, raw pixels, and text. Transformers excel at modeling text because they capture long-range dependencies and context, but they struggle with raw signals and scale quadratically with sequence length, which makes them expensive on long inputs. Alternative models like state space models scale linearly and can serve as effective fuzzy compressors of past context, but they lack the exact retrieval capabilities of transformers. The development of these models has been interconnected with advances in data processing techniques such as tokenization, which can change how well a given modeling assumption works. When weighing architectures, it is essential to consider natural fit for the data type, computational efficiency, and the trade-off between modeling capability and processing speed.
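
      A toy NumPy comparison of the scaling behaviors described above (not any production implementation): the attention-style pass builds an L-by-L interaction matrix, so its cost grows quadratically with sequence length and any past position can be looked up exactly, while the recurrent pass compresses history into one fixed-size state and touches each position only once.

      import numpy as np

      L, d = 1024, 64
      rng = np.random.default_rng(0)
      x = rng.standard_normal((L, d))

      # Attention-style pass: every position interacts with every other -> O(L^2 * d) work.
      scores = x @ x.T                                   # (L, L) pairwise interaction matrix
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)
      attn_out = weights @ x                             # any past position can be retrieved exactly

      # Recurrent SSM-style pass: one fixed-size state per step -> O(L * d) work.
      decay = 0.95                                       # scalar stand-in for a learned transition
      state = np.zeros(d)
      ssm_out = np.empty_like(x)
      for t in range(L):
          state = decay * state + x[t]                   # lossy, "fuzzy" summary of everything seen so far
          ssm_out[t] = state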

    • Audio technology and edge computing: Audio applications built on SSMs are gaining commercial importance thanks to their real-time capabilities, and edge computing is expected to make the technology more accessible and cost-effective by reducing the need for large data centers and expensive GPUs.

      Audio technology is gaining significant importance in commercial applications, particularly voice agents and gaming, because of its real-time handling of signals and sensor data. SSMs are making strides here by offering efficient models that can run on smaller hardware and be pushed toward the edge. This shift toward edge computing is expected to reduce the need for large data centers and expensive GPUs, making the technology more accessible and cost-effective. The second wave of development in the field is focused on efficiency, following an initial wave of exploration and discovery. Companies like Apple are already demonstrating the potential of running large models on devices, and SSMs are poised to accelerate this trend by making models more edge-friendly.

    • On-device AI models: The future of AI lies in making models more capable and accessible on-device, enabling new applications and improved quality without the cost and ongoing compute overhead of the cloud.

      The future of AI models lies in making them more capable while keeping costs low enough to use them at scale. Current small models have potential but are not yet capable enough. The challenge with Transformers is their high resource consumption, which is a concern even in data centers. Making intelligence accessible everywhere requires infrastructure and technology that can run models on-device, which would enable new applications and improved quality without the cost and ongoing compute overhead of the cloud. For an investor, the assumption that models will run well on local hardware opens up a world of opportunities for full-stack companies: for the same dollars, they can do far more intelligent computation, leading to different applications and a focus on quality. For instance, a music model running on-device could act as a personal musician that generates music from user input without ever going to the cloud. Cartesia's recent launch of its text-to-speech product showcases impressive performance and fast shipping. The company is focused on building efficient systems for tasks like voice and audio generation using fairly general models, an approach that allows conditioning on various inputs, such as text transcripts, and paves the way for further advances.

    • Text-to-speech challenges: Despite advancements, creating high-quality, engaging text-to-speech systems remains challenging due to the need for efficient model stacks, effective training methods, emotional depth, and nuanced speech patterns. Multimodal models with language understanding capabilities are necessary to perfect TTS and adapt it to various roles and contexts.

      While significant progress has been made in text-to-speech (TTS) technology, it is not yet solved. The efficiency and real-time capabilities of audio generation are crucial, requiring robust model stacks and effective training methods. The goal is to create a high-quality, engaging experience that can hold a user's attention for more than 30 seconds. Current TTS systems lack the emotional depth and nuanced speech patterns found in human interaction. The evaluation of TTS systems is challenging due to their qualitative nature and the subjective perception of users. Emotion and the embodiment of societal roles through speech are particularly difficult to replicate. Even basic evaluations, such as recognizing words or pronouncing them correctly, require a more comprehensive understanding of language. To truly perfect TTS, multimodal models with language understanding capabilities are necessary. The ultimate goal is to create a system that can adapt to various roles and contexts, mimicking the unique speech patterns of professions and regions. This would significantly enhance user engagement and bring TTS technology closer to the ceiling of its potential.

    • Real-time multimodal model improvement: The team is improving their real-time audio model, Sonic, and exploring other modalities, aiming to create versatile building blocks for various applications, with a focus on enabling conversational capabilities and real-time processing on devices.

      The team is focused on improving their real-time audio model, Sonic, while also exploring other modalities. Multimodality, specifically the combination of speech and text, has presented new challenges but has not been the primary motivation for their work; instead, they aim to create versatile, general building blocks for various applications. The team is excited about Sonic's potential to generate an audio signal in real time, streaming a response as it is produced. They plan to keep improving Sonic while also addressing the need for real-time processing on devices such as laptops. They also aim to enable conversational capabilities with these models, allowing intelligent responses to user input and reasoning over data and context. The long-term goal is a large-scale multimodal language model, and the team is developing its own techniques to make that a reality. Overall, the focus is on creating a powerful multimodal model that is easy to run on devices.
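
      As a rough, hypothetical sketch of what "generating a signal in real time" means for a recurrent model (the function, chunk size, and state handling below are invented for illustration and are not Cartesia's API): because the model carries a fixed-size state between steps, it can emit audio chunk by chunk, and playback can begin as soon as the first chunk is ready rather than after the whole utterance is generated.

      from typing import Optional, Tuple

      CHUNK_SAMPLES = 320  # hypothetical: one short chunk of audio per decoding step

      def decode_chunk(state: Optional[dict], remaining_text: str) -> Tuple[bytes, Optional[dict], str]:
          # Hypothetical stand-in for one constant-time decoding step of a recurrent TTS model.
          audio = bytes(CHUNK_SAMPLES)   # placeholder silence; a real model would synthesize audio here
          new_state = state              # a real model would update its recurrent state here
          return audio, new_state, remaining_text[1:]

      def stream_speech(text: str, play) -> None:
          # Emit audio chunk by chunk so playback can start almost immediately.
          state = None
          while text:
              chunk, state, text = decode_chunk(state, text)
              play(chunk)                # the listener hears this chunk while the rest is still being generated

      stream_speech("Hello there", play=lambda chunk: None)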

    • Unified and elegant state space models: The future of these models lies in unified, efficient systems with improved audio quality and reduced latency, guided by mathematical and computational elegance, with the eventual aim of a single model that handles all speech synthesis tasks and runs closer to the device and the edge.

      The future of state space models (SSMs) for speech lies in making them more unified, efficient, and elegant. The focus should be on improving audio quality and reducing latency by minimizing the number of separate models and the orchestration between them. The speaker expresses a preference for simple, elegant solutions to complex problems, drawing inspiration from mathematics and computer science; this aesthetic has driven the development of SSMs, and the goal is to keep finding elegant solutions while addressing the engineering challenges. The eventual goal is a single, unified model that can handle all speech synthesis tasks, making today's elaborate multi-model systems and the engineering around them unnecessary. The speaker also emphasizes the importance of bringing SSMs closer to on-device and edge computing for better accessibility and performance.

    • Trust in local data processing technology: The team behind SSMs advocates for trusting their local real-time data processing technology, which offers the same feature set as the cloud version, emphasizing the importance of faith in the system's capabilities and likening it to the mathematician Erdős's belief in "The Book" rather than a deity. They are expanding their team and encourage engagement with their community.

      The team behind SSMs, now building Cartesia, a company specializing in real-time data processing, emphasizes the importance of trusting their technology, which provides the same feature set as the cloud version but runs locally. Invoking Erdős, the Hungarian mathematician who spoke of believing in "The Book" rather than in a deity, they highlight the need for faith in the system's capabilities. The team, consisting of 15 members and several interns, is actively hiring for modeling roles to expand its modeling team and contribute to the future of the technology, which they playfully describe as "overthrowing the empire" or "the rebellion." Cartesia's offerings include real-time audio streaming, and the team encourages engagement with its community through social media channels and its website.

    Recent Episodes from No Priors: Artificial Intelligence | Machine Learning | Technology | Startups

    State Space Models and Real-time Intelligence with Karan Goel and Albert Gu from Cartesia

    This week on No Priors, Sarah Guo and Elad Gil sit down with Karan Goel and Albert Gu from Cartesia. Karan and Albert first met as Stanford AI Lab PhDs, where their lab invented State Space Models, or SSMs, a fundamental new primitive for training large-scale foundation models. In 2023, they founded Cartesia to build real-time intelligence for every device. One year later, Cartesia released Sonic, which generates high-quality and lifelike speech with a model latency of 135ms, the fastest for a model of this class. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @krandiash | @_albertgu Show Notes:  (0:00) Introduction (0:28) Use Cases for Cartesia and Sonic  (1:32) Karan Goel & Albert Gu’s professional backgrounds (5:06) State Space Models (SSMs) versus Transformer-Based Architectures  (11:51) Domain Applications for Hybrid Approaches  (13:10) Text to Speech and Voice (17:29) Data, Size of Models and Efficiency  (20:34) Recent Launch of Text to Speech Product (25:01) Multimodality & Building Blocks (25:54) What’s Next at Cartesia?  (28:28) Latency in Text to Speech (29:30) Choosing Research Problems Based on Aesthetic  (31:23) Product Demo (32:48) Cartesia Team & Hiring

    Can AI replace the camera? with Joshua Xu from HeyGen

    AI video generation models still have a long way to go when it comes to making compelling and complex videos, but the HeyGen team is well on its way to streamlining the video creation process by using a combination of language, video, and voice models to create videos featuring personalized avatars, b-roll, and dialogue. This week on No Priors, Joshua Xu, the co-founder and CEO of HeyGen, joins Sarah and Elad to discuss how the HeyGen team broke down the elements of a video and built or found models to use for each one, the commercial applications for these AI videos, and how they’re safeguarding against deep fakes.  Links from episode: HeyGen McDonald’s commercial Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @joshua_xu_ Show Notes:  (0:00) Introduction (3:08) Applications of AI content creation (5:49) Best use cases for HeyGen (7:34) Building for quality in AI video generation (11:17) The models powering HeyGen (14:49) Research approach (16:39) Safeguarding against deep fakes (18:31) How AI video generation will change video creation (24:02) Challenges in building the model (26:29) HeyGen team and company

    How the ARC Prize is democratizing the race to AGI with Mike Knoop from Zapier

    The first step in achieving AGI is nailing down a concise definition, and Mike Knoop, the co-founder and Head of AI at Zapier, believes François Chollet got it right when he defined general intelligence as a system that can efficiently acquire new skills. This week on No Priors, Mike joins Elad to discuss the ARC Prize, a multi-million-dollar non-profit public challenge looking for someone to beat the Abstraction and Reasoning Corpus (ARC) evaluation. In this episode, they also get into why Mike thinks LLMs will not get us to AGI, how Zapier is incorporating AI into their products and the power of agents, and why it’s dangerous to regulate AGI before discovering its full potential.  Show Links: About the Abstraction and Reasoning Corpus Zapier Central Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @mikeknoop Show Notes:  (0:00) Introduction (1:10) Redefining AGI (2:16) Introducing ARC Prize (3:08) Definition of AGI (5:14) LLMs and AGI (8:20) Promising techniques to developing AGI (11:0) Sentience and intelligence (13:51) Prize model vs investing (16:28) Zapier AI innovations (19:08) Economic value of agents (21:48) Open source to achieve AGI (24:20) Regulating AI and AGI

    The evolution and promise of RAG architecture with Tengyu Ma from Voyage AI

    After Tengyu Ma spent years at Stanford researching AI optimization, embedding models, and transformers, he took a break from academia to start Voyage AI, which allows enterprise customers to have the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week’s episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. And while fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval.  They also discuss methods for growing context windows and managing latency budgets, how Tengyu’s research has informed his work at Voyage, and the role academia should play as AI grows as an industry.  Show Links: Tengyu Ma Key Research Papers: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training Non-convex optimization for machine learning: design, analysis, and understanding Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss Larger language models do in-context learning differently, 2023 Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning On the Optimization Landscape of Tensor Decompositions Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @tengyuma Show Notes:  (0:00) Introduction (1:59) Key points of Tengyu’s research (4:28) Academia compared to industry (6:46) Voyage AI overview (9:44) Enterprise RAG use cases (15:23) LLM long-term memory and token limitations (18:03) Agent chaining and data management (22:01) Improving enterprise RAG  (25:44) Latency budgets (27:48) Advice for building RAG systems (31:06) Learnings as an AI founder (32:55) The role of academia in AI

    How YC fosters AI Innovation with Garry Tan

    Garry Tan is a well-known founder-turned-investor who is now running one of the most prestigious accelerators in the world, Y Combinator. As the president and CEO of YC, Garry has been credited with reinvigorating the program. On this week’s episode of No Priors, Sarah, Elad, and Garry discuss the shifting demographics of YC founders and how AI is encouraging younger founders to launch companies, predicting which early-stage startups will have longevity, and making YC a beacon for innovation in AI companies. They also discuss the importance of building companies in person and whether San Francisco is, in fact, back.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @garrytan Show Notes:  (0:00) Introduction (0:53) Transitioning from founder to investing (5:10) Early social media startups (7:50) Trend predicting at YC (10:03) Selecting YC founders (12:06) AI trends emerging in YC batch (18:34) Motivating culture at YC (20:39) Choosing the startups with longevity (24:01) Shifting YC founder demographics (29:24) Building in San Francisco  (31:01) Making YC a beacon for creators (33:17) Garry Tan is bringing San Francisco back

    The Data Foundry for AI with Alexandr Wang from Scale

    Alexandr Wang was 19 when he realized that gathering data would be crucial as AI becomes more prevalent, so he dropped out of MIT and started Scale AI. This week on No Priors, Alexandr joins Sarah and Elad to discuss how Scale is providing infrastructure and building a robust data foundry that is crucial to the future of AI. While the company started out working with autonomous vehicles, it has expanded by partnering with research labs and even the U.S. government.   In this episode, they get into the importance of data quality in building trust in AI systems, a possible future with better self-improvement loops, AI in the enterprise, and how human and AI intelligence will work together to produce better outcomes.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @alexandr_wang (0:00) Introduction (3:01) Data infrastructure for autonomous vehicles (5:51) Data abundance and organization (12:06) Data quality and collection (15:34) The role of human expertise (20:18) Building trust in AI systems (23:28) Evaluating AI models (29:59) AI and government contracts (32:21) Multi-modality and scaling challenges

    Music consumers are becoming the creators with Suno CEO Mikey Shulman

    Mikey Shulman, the CEO and co-founder of Suno, can see a future where the Venn diagram of music creators and consumers becomes one big circle. The AI music generation tool trying to democratize music has been making waves in the AI community ever since it came out of stealth mode last year. Suno users can make a song, complete with lyrics, just by entering a text prompt, for example, “koto boom bap lofi intricate beats.” You can hear it in action as Mikey, Sarah, and Elad create a song live in this episode.  In this episode, Elad, Sarah, and Mikey talk about how the Suno team took their experience making a transcription tool and applied it to music generation, how the Suno team evaluates aesthetics and taste because there is no standardized test you can give an AI model for music, and why Mikey doesn’t think AI-generated music will affect people’s consumption of human-made music.  Listen to the full songs played and created in this episode: Whispers of Sakura Stone  Statistical Paradise Statistical Paradise 2 Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @MikeyShulman Show Notes:  (0:00) Mikey’s background (3:48) Bark and music generation (5:33) Architecture for music generation AI (6:57) Assessing music quality (8:20) Mikey’s music background as an asset (10:02) Challenges in generative music AI (11:30) Business model (14:38) Surprising use cases of Suno (18:43) Creating a song on Suno live (21:44) Ratio of creators to consumers (25:00) The digitization of music (27:20) Mikey’s favorite song on Suno (29:35) Suno is hiring

    Context windows, computer constraints, and energy consumption with Sarah and Elad

    This week on No Priors, hosts Sarah and Elad catch up on the latest AI news. They discuss the recent developments in AI music generation, and if you’re interested in generative AI music, stay tuned for next week’s interview! Sarah and Elad also get into device-resident models, AI hardware, and ask just how smart smaller models can really get. They compare these hardware constraints to the hurdles AI platforms continue to face, including compute constraints, energy consumption, context windows, and how best to integrate these products into apps that users are familiar with.  Have a question for our next host-only episode or feedback for our team? Reach out to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil  Show Notes:  (0:00) Intro (1:25) Music AI generation (4:02) Apple’s LLM (11:39) The role of AI-specific hardware (15:25) AI platform updates (18:01) Forward thinking in investing in AI (20:33) Unlimited context (23:03) Energy constraints

    Cognition’s Scott Wu on how Devin, the AI software engineer, will work for you

    Scott Wu loves code. He grew up competing in the International Olympiad in Informatics (IOI) and is a world-class coder, and now he’s building an AI agent designed to create more, not fewer, human engineers. This week on No Priors, Sarah and Elad talk to Scott, the co-founder and CEO of Cognition, an AI lab focused on reasoning. Recently, the Cognition team released a demo of Devin, an AI software engineer that can increasingly handle entire tasks end to end. In this episode, they talk about why the team built Devin with a UI that mimics looking over another engineer’s shoulder as they work and how this transparency makes for a better result. Scott discusses why he thinks Devin will make it possible for there to be more human engineers in the world, and what will be important for software engineers to focus on as these roles evolve. They also get into how Scott thinks about building the Cognition team and that they’re just getting started.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ScottWu46 Show Notes:  (0:00) Introduction (1:12) IOI training and community (6:39) Cognition’s founding team (8:20) Meet Devin (9:17) The discourse around Devin (12:14) Building Devin’s UI (14:28) Devin’s strengths and weaknesses  (18:44) The evolution of coding agents (22:43) Tips for human engineers (26:48) Hiring at Cognition

    OpenAI’s Sora team thinks we’ve only seen the "GPT-1 of video models"

    AI-generated videos are not just leveled-up image generation; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI’s recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips that are up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn’t yet available for public use, but the examples of its work are very impressive. However, the team believes we’re still in the GPT-1 era of AI video models and is focused on a slow rollout to ensure the model is in the best place possible to offer value to users and, more importantly, that all possible safety measures are applied to avoid deep fakes and misinformation. They also discuss what they’re learning from implementing diffusion transformers, why they believe video generation takes us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.  Show Links: Bling Zoo video Man eating a burger video Tokyo Walk video Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @_tim_brooks | @billpeeb | @model_mechanic Show Notes:  (0:00) Sora team Introduction (1:05) Simulating the world with Sora (2:25) Building the most valuable consumer product (5:50) Alternative use cases and simulation capabilities (8:41) Diffusion transformers explanation (10:15) Scaling laws for video (13:08) Applying end-to-end deep learning to video (15:30) Tuning the visual aesthetic of Sora (17:08) The road to “desktop Pixar” for everyone (20:12) Safety for visual models (22:34) Limitations of Sora (25:04) Learning from how Sora is learning (29:32) The biggest misconceptions about video models