    The evolution and promise of RAG architecture with Tengyu Ma from Voyage AI

    June 06, 2024

    Podcast Summary

    • Tengyu Ma's research: Tengyu Ma, a Stanford professor and Voyage CEO, researches improving the efficiency of training large language models and enhancing their reasoning abilities. His optimizer Sophia can train models up to 2x faster.

      Tengyu Ma, a computer science assistant professor at Stanford and the CEO of Voyage, has a diverse research agenda that spans from theoretical deep learning to practical applications such as large language models and optimizers. His work focuses on improving the efficiency of training large language models and enhancing their reasoning abilities, areas he believes will become increasingly important given limited data and compute resources. Ma's research includes early work on matrix completion and optimization, sentence embeddings, transformers, and contrastive learning. Recently, he and his team developed the optimizer Sophia, which can improve the training efficiency of large language models by up to 2x. Ma's entrepreneurial spirit led him to start a company last year while on leave from Stanford, and his work has already shown significant impact, with Facebook reporting a 1.6x improvement in training efficiency at large scale.

    • AI commercialization: The maturity of AI and machine learning technologies, combined with the ease and affordability of implementation, makes this an optimal time for commercialization. Retrieval Augmented Generation (RAG) systems are being used to improve the quality of retrieval systems and reduce hallucination rates, leading to more accurate and contextually relevant AI applications in industry.

      We are witnessing a maturing of AI and machine learning technologies, making this an opportune time for commercialization. Seven years ago, applying AI in industry involved a complex, multi-step process; now, with the rise of foundation models, it has been simplified significantly. Companies like Voyage are focusing on improving the quality of retrieval systems, currently identified as a bottleneck in deploying AI in industry, and Retrieval Augmented Generation (RAG) systems are being used to address it. These systems involve a retrieval step, where relevant information is embedded as vectors and retrieved, followed by a generation step in which a large language model uses that information to generate accurate and relevant responses. By reducing the hallucination rate, RAG systems provide more accurate and contextually relevant answers, improving the overall effectiveness of AI applications in industry.
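
      To make the retrieve-then-generate flow concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than Voyage's actual stack: the fake `embed` stands in for a real embedding model, and a production `generate` would send the prompt to an LLM instead of returning it.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model: returns a deterministic unit
    vector per text. A real system would call an embedding API here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Retrieval step: embed the query and rank documents by cosine
    similarity, returning the top-k as context."""
    q = embed(query)
    return sorted(corpus, key=lambda d: -float(embed(d) @ q))[:k]

def generate(query: str, context: list[str]) -> str:
    """Generation step: ground the answer in retrieved context. The
    prompt template is illustrative; a real system would pass it to
    an LLM rather than returning it."""
    return ("Answer using only this context:\n"
            + "\n".join(context)
            + f"\n\nQuestion: {query}")
```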

    • RAG architecture: RAG architecture converts data into vectors and stores them in a vector database for efficient handling and semantic-based search. It's a cost-effective solution compared to long context transformers, which are currently expensive due to memory requirements.

      RAG (Retrieval-Augmented Generation) architecture is an emerging approach that allows for efficient handling and search of various types of data, including documents, videos, and code, by converting them into vectors and storing them in a vector database. This enables semantic-based search and applies to industries such as finance, legal, and chemistry, as well as individual use cases. RAG is considered easier to implement than fine-tuning and is debated against alternative architectures like agent chaining and long context transformers. While long context transformers have the potential to process vast amounts of data, they are currently expensive because all activations, or intermediate computations, must be kept in memory, making RAG the more practical and cost-effective solution in the near term. The architecture is relatively new and has gained popularity in recent years, though there is ongoing debate about whether it is necessary for working with proprietary data.
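
      As a rough illustration of the vector-database side of this architecture, the toy store below indexes normalized vectors and answers nearest-neighbor queries by cosine similarity. The class and its methods are hypothetical; real vector databases add approximate indexes such as HNSW or IVF to search at scale.

```python
import numpy as np

class VectorStore:
    """A toy in-memory vector database: store (id, vector) pairs and
    answer nearest-neighbor queries by cosine similarity."""
    def __init__(self, dim: int):
        self.dim = dim
        self.ids: list[str] = []
        self.vecs = np.empty((0, dim))

    def add(self, doc_id: str, vec: np.ndarray) -> None:
        vec = vec / np.linalg.norm(vec)  # normalize so dot = cosine
        self.ids.append(doc_id)
        self.vecs = np.vstack([self.vecs, vec])

    def search(self, query: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        q = query / np.linalg.norm(query)
        sims = self.vecs @ q                 # cosine similarity to all docs
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]

store = VectorStore(dim=4)
store.add("doc-1", np.array([1.0, 0.0, 0.0, 0.0]))
store.add("doc-2", np.array([0.0, 1.0, 0.0, 0.0]))
print(store.search(np.array([0.9, 0.1, 0.0, 0.0]), k=1))  # doc-1 ranks first
```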

    • RAG vs. long context transformers: Advancements in technology and growing data sizes make cost-effective, efficient RAG a more viable solution for LLMs than long context transformers, with RAG potentially serving as the LLM's long-term memory.

      The use of RAG, a more cost-efficient and hierarchical approach, is predicted to become more prevalent than long context transformers due to advancements in technology and the increasing size of data sources for Large Language Models (LLMs). The discussion also touched on framing context as short-term memory and retrieval as long-term memory, with RAG potentially managing the long-term memory for LLMs. The cost difference between managing a 1,000,000-token context window and a 100,000,000-token context window is significant: the latter is 100 times more expensive. For many companies this cost difference is unacceptable, making more cost-effective solutions like RAG essential. Agent chaining, where LLMs are used to manage data, is an emerging area of research. The discussion emphasized efficiency both in cost and in hallucination management, which further supports RAG and its hierarchical approach: it offers a more manageable and efficient way to process large amounts of data, making it a promising direction for the future of LLMs.
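
      The 100x figure follows directly from per-token pricing that is roughly linear in context length, as this back-of-the-envelope sketch shows (the dollar rate is a made-up placeholder):

```python
# Back-of-the-envelope: with roughly linear per-token pricing, a 100x
# larger context costs ~100x more per query. The rate is hypothetical.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # placeholder dollars

def context_cost(context_tokens: int, queries: int) -> float:
    """Cost of re-sending the full context with every query."""
    return queries * context_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(context_cost(1_000_000, 100))    # $100 for a 1M-token window
print(context_cost(100_000_000, 100))  # $10,000 for a 100M-token window
```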

    • Agent training vs RAG systems: Agent training and RAG systems serve different purposes in managing knowledge systems. Agent training involves a multi-step system with large and small models, while RAG systems focus on iterative retrieval. Improving these systems involves enhancing large language models, prompting, and retrieval, with heavy data-driven training best suited for companies.

      Agent training and Retrieval-Augmented Generation (RAG) systems are orthogonal approaches to managing knowledge systems. While both involve embedding models, large language models, and iterative retrieval, they serve different purposes. Agent training can be seen as a multi-step, retrieval-augmented system where some parts are managed by large language models, some by small models, and some by embedding models. The motivation for agent chaining is similar to RAG's in terms of efficiency: managing a large knowledge system with a very large language model can be inefficient, making smaller models more suitable for some parts. Another axis of debate is iterative retrieval versus retrieving everything at once. Iterative retrieval is beneficial given the current limitations of embedding models, but in the long run, as models become more capable, fewer rounds of retrieval may be necessary. Improving a RAG system involves enhancing the large language model (LLM), the prompting, and the retrieval. Improving the LLM requires heavy data-driven training and fine-tuning for specific use cases, a task best suited for companies rather than end users. The long-term vision is that software engineering layers on top of networks will become less necessary as the networks themselves become more capable.
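
      A minimal sketch of the iterative-retrieval idea, under the assumption that each round reformulates the query based on what has already been found. The keyword-overlap `retrieve` here is a crude stand-in for an embedding search, and in a real agent the query refinement and the final answer would be LLM calls:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Stand-in for an embedding search: rank documents by keyword overlap."""
    words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: -len(words & set(d.lower().split())))[:k]

def iterative_retrieve(question: str, corpus: list[str], rounds: int = 3) -> list[str]:
    """Each round retrieves with a query expanded by what was already
    found; a real agent would use an LLM to rewrite the query and to
    produce the final answer from the gathered context."""
    query, gathered = question, []
    for _ in range(rounds):
        for hit in retrieve(query, corpus):
            if hit not in gathered:
                gathered.append(hit)
        query = question + " " + " ".join(gathered)  # crude query expansion
    return gathered  # pass to the LLM as context for final generation
```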

    • Long context embedding models: Advanced long context embedding models can reduce the need for data truncation and text conversion for text-based models, significantly improve performance in specific domains, and cut latency through smaller embedding dimensions.

      As context windows become longer and more advanced long context embedding models emerge, the need for data truncation and converting data into text for text-based models will decrease. This is because these advanced models will be able to understand and process longer contexts effectively. Additionally, fine-tuning and domain-specific embedding models can significantly improve performance in specific domains. This is important because the number of parameters in embedding models is limited, and customizing models to specific domains allows for the efficient use of these parameters. The improvements in performance can range from 5% to over 20%, depending on the domain and the amount of data available. For example, in the code domain, where deep understanding is required, significant improvements of up to 20% have been seen. In contrast, in the legal domain, where the baseline performance is higher, improvements of 5% to 15% have been observed. It's important to note that the latency cost comes not only from the search system processing the query but also from the query being translated into an embedding and compared to the embeddings of the knowledge base. The dimension of the vectors produced for the embeddings also affects the latency for vector-based search. Smaller dimensions result in faster search times. Voyage, for instance, produces embeddings with dimensions that are 3x to 4x smaller than some competitors.
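
      The relationship between embedding dimension and search latency is easy to see empirically: a brute-force similarity scan does work proportional to the dimension, so vectors 4x smaller mean roughly 4x less compute per query. The corpus size and dimensions below are arbitrary illustrations, not Voyage's numbers:

```python
import time
import numpy as np

def scan_time(dim: int, n_docs: int = 50_000, trials: int = 10) -> float:
    """Average time for one brute-force cosine-similarity scan; the
    matrix-vector product does n_docs * dim multiply-adds, so cost
    grows roughly linearly with embedding dimension."""
    rng = np.random.default_rng(0)
    docs = rng.standard_normal((n_docs, dim), dtype=np.float32)
    query = rng.standard_normal(dim, dtype=np.float32)
    start = time.perf_counter()
    for _ in range(trials):
        _ = docs @ query  # similarity scores for every document
    return (time.perf_counter() - start) / trials

print(f"dim=1024: {scan_time(1024)*1e3:.2f} ms per scan")
print(f"dim= 256: {scan_time(256)*1e3:.2f} ms per scan")  # ~4x less work
```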

    • RAG systems improvements: Balancing efficiency and domain-specific improvements in RAG systems involves using efficient embedding models and fine-tuning on proprietary data. Companies can invest in these components early on and simplify RAG systems as LLMs improve.

      Creating effective Retrieval-Augmented Generation (RAG) systems involves balancing efficient embedding models, with a limited number of parameters and dimensions, against fine-tuning on proprietary data for domain-specific improvements. The degree of improvement varies depending on the starting accuracy level. Companies can begin investing in these components from the early stages, focusing on evaluating retrieval quality and identifying bottlenecks. As Large Language Models (LLMs) continue to improve, RAG systems are predicted to become simpler, with fewer components requiring less engineering effort. From Ma's experience as a founder, starting a company as an academic involves a significant shift from academic research to practical application, requiring a clear focus on market needs and a willingness to adapt to new technologies and trends.
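
      One simple way to evaluate retrieval quality, as suggested above, is recall@k over a small labeled query set. This generic sketch uses made-up document IDs:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries for which at least one known-relevant
    document appears in the top-k retrieved results."""
    hits = sum(1 for res, rel in zip(results, relevant)
               if rel & set(res[:k]))
    return hits / len(results)

# Example: 2 of 3 queries surface a relevant doc in the top 5.
results  = [["d1", "d7"], ["d3"], ["d9", "d2"]]
relevant = [{"d7"}, {"d4"}, {"d2"}]
print(recall_at_k(results, relevant, k=5))  # 0.666...
```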

    • Efficiency in Academia and Industry: Both academia and industry require efficiency and a focus on long-term innovations to complement each other in the age of AI scaling. Academia should focus on challenging research areas, while industry prioritizes near-term impact.

      Starting a company involves wearing multiple hats and learning from various resources, even the most basic books, to minimize mistakes and maximize efficiency. Academia, on the other hand, plays a crucial role in long-term innovations, focusing on questions that industry may not prioritize due to resource constraints. Ma emphasizes the importance of efficiency in both academia and industry, and encourages researchers to think about breakthroughs that could significantly impact the landscape within 3-5 years. In academia, the focus should be on challenging research areas like reasoning tasks, where the scaling laws may not be enough to achieve superhuman performance or prove complex mathematical conjectures. By improving efficiency and focusing on long-term innovations, academia and industry can complement each other in the age of AI scaling.

    • Mathematician's knowledge: Relying on Common Crawl web data isn't enough to become a good mathematician; universities and labs are essential for deeper innovations and long-term research.

      Relying solely on Common Crawl data from the web may not be sufficient for becoming a good mathematician. The field requires deeper innovations and long-term research, which is what universities and labs are working on. This is an inspiring reminder of the vast amount of knowledge still to be discovered. Thanks for tuning in to the No Priors podcast. Connect with us on Twitter, subscribe to our YouTube channel, and follow us on Apple Podcasts, Spotify, or wherever you listen for a new episode every week. Don't forget to sign up for emails or check out transcripts for every episode at no-priors.com.

    Recent Episodes from No Priors: Artificial Intelligence | Machine Learning | Technology | Startups

    State Space Models and Real-time Intelligence with Karan Goel and Albert Gu from Cartesia
    This week on No Priors, Sarah Guo and Elad Gil sit down with Karan Goel and Albert Gu from Cartesia. Karan and Albert first met as Stanford AI Lab PhDs, where their lab invented State Space Models (SSMs), a fundamental new primitive for training large-scale foundation models. In 2023, they founded Cartesia to build real-time intelligence for every device. One year later, Cartesia released Sonic, which generates high-quality, lifelike speech with a model latency of 135ms, the fastest for a model of this class. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @krandiash | @_albertgu Show Notes:  (0:00) Introduction (0:28) Use Cases for Cartesia and Sonic  (1:32) Karan Goel & Albert Gu’s professional backgrounds (5:06) State Space Models (SSMs) versus Transformer Based Architectures  (11:51) Domain Applications for Hybrid Approaches  (13:10) Text to Speech and Voice (17:29) Data, Size of Models and Efficiency  (20:34) Recent Launch of Text to Speech Product (25:01) Multimodality & Building Blocks (25:54) What’s Next at Cartesia?  (28:28) Latency in Text to Speech (29:30) Choosing Research Problems Based on Aesthetic  (31:23) Product Demo (32:48) Cartesia Team & Hiring

    Can AI replace the camera? with Joshua Xu from HeyGen
    AI video generation models still have a long way to go when it comes to making compelling and complex videos, but the HeyGen team is well on its way to streamlining the video creation process by using a combination of language, video, and voice models to create videos featuring personalized avatars, b-roll, and dialogue. This week on No Priors, Joshua Xu, the co-founder and CEO of HeyGen, joins Sarah and Elad to discuss how the HeyGen team broke down the elements of a video and built or found models to use for each one, the commercial applications for these AI videos, and how they’re safeguarding against deep fakes.  Links from episode: HeyGen McDonald’s commercial Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil |  @joshua_xu_ Show Notes:  (0:00) Introduction (3:08) Applications of AI content creation (5:49) Best use cases for HeyGen (7:34) Building for quality in AI video generation (11:17) The models powering HeyGen (14:49) Research approach (16:39) Safeguarding against deep fakes (18:31) How AI video generation will change video creation (24:02) Challenges in building the model (26:29) HeyGen team and company

    How the ARC Prize is democratizing the race to AGI with Mike Knoop from Zapier
    The first step in achieving AGI is nailing down a concise definition, and Mike Knoop, the co-founder and Head of AI at Zapier, believes François Chollet got it right when he defined general intelligence as a system that can efficiently acquire new skills. This week on No Priors, Mike joins Elad to discuss the ARC Prize, a multi-million-dollar non-profit public challenge looking for someone to beat the Abstraction and Reasoning Corpus (ARC) evaluation. In this episode, they also get into why Mike thinks LLMs will not get us to AGI, how Zapier is incorporating AI into their products and the power of agents, and why it’s dangerous to regulate AGI before discovering its full potential.  Show Links: About the Abstraction and Reasoning Corpus Zapier Central Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @mikeknoop Show Notes:  (0:00) Introduction (1:10) Redefining AGI (2:16) Introducing ARC Prize (3:08) Definition of AGI (5:14) LLMs and AGI (8:20) Promising techniques to developing AGI (11:0) Sentience and intelligence (13:51) Prize model vs investing (16:28) Zapier AI innovations (19:08) Economic value of agents (21:48) Open source to achieve AGI (24:20) Regulating AI and AGI

    The evolution and promise of RAG architecture with Tengyu Ma from Voyage AI
    After Tengyu Ma spent years at Stanford researching AI optimization, embedding models, and transformers, he took a break from academia to start Voyage AI, which allows enterprise customers to have the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week’s episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. And while fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval.  They also discuss methods for growing context windows and managing latency budgets, how Tengyu’s research has informed his work at Voyage, and the role academia should play as AI grows as an industry.  Show Links: Tengyu Ma Key Research Papers: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training Non-convex optimization for machine learning: design, analysis, and understanding Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss Larger language models do in-context learning differently, 2023 Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning On the Optimization Landscape of Tensor Decompositions Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @tengyuma Show Notes:  (0:00) Introduction (1:59) Key points of Tengyu’s research (4:28) Academia compared to industry (6:46) Voyage AI overview (9:44) Enterprise RAG use cases (15:23) LLM long-term memory and token limitations (18:03) Agent chaining and data management (22:01) Improving enterprise RAG  (25:44) Latency budgets (27:48) Advice for building RAG systems (31:06) Learnings as an AI founder (32:55) The role of academia in AI

    How YC fosters AI Innovation with Garry Tan
    Garry Tan is a well-known founder-turned-investor who is now running one of the most prestigious accelerators in the world, Y Combinator. As the president and CEO of YC, Garry has been credited with reinvigorating the program. On this week’s episode of No Priors, Sarah, Elad, and Garry discuss the shifting demographics of YC founders and how AI is encouraging younger founders to launch companies, predicting which early stage startups will have longevity, and making YC a beacon for innovation in AI companies. They also discuss the importance of building companies in person and whether San Francisco is, in fact, back.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @garrytan Show Notes:  (0:00) Introduction (0:53) Transitioning from founder to investing (5:10) Early social media startups (7:50) Trend predicting at YC (10:03) Selecting YC founders (12:06) AI trends emerging in YC batch (18:34) Motivating culture at YC (20:39) Choosing the startups with longevity (24:01) Shifting YC founder demographics (29:24) Building in San Francisco  (31:01) Making YC a beacon for creators (33:17) Garry Tan is bringing San Francisco back

    The Data Foundry for AI with Alexandr Wang from Scale
    Alexandr Wang was 19 when he realized that gathering data will be crucial as AI becomes more prevalent, so he dropped out of MIT and started Scale AI. This week on No Priors, Alexandr joins Sarah and Elad to discuss how Scale is providing infrastructure and building a robust data foundry that is crucial to the future of AI. While the company started working with autonomous vehicles, they’ve expanded by partnering with research labs and even the U.S. government.   In this episode, they get into the importance of data quality in building trust in AI systems and a possible future where we can build better self-improvement loops, AI in the enterprise, and where human and AI intelligence will work together to produce better outcomes.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @alexandr_wang (0:00) Introduction (3:01) Data infrastructure for autonomous vehicles (5:51) Data abundance and organization (12:06)  Data quality and collection (15:34) The role of human expertise (20:18) Building trust in AI systems (23:28) Evaluating AI models (29:59) AI and government contracts (32:21) Multi-modality and scaling challenges

    Music consumers are becoming the creators with Suno CEO Mikey Shulman
    Mikey Shulman, the CEO and co-founder of Suno, can see a future where the Venn diagram of music creators and consumers becomes one big circle. The AI music generation tool trying to democratize music has been making waves in the AI community ever since they came out of stealth mode last year. Suno users can make a song complete with lyrics, just by entering a text prompt, for example, “koto boom bap lofi intricate beats.” You can hear it in action as Mikey, Sarah, and Elad create a song live in this episode.  In this episode, Elad, Sarah, and Mikey talk about how the Suno team took their experience making a transcription tool and applied it to music generation, how the Suno team evaluates aesthetics and taste because there is no standardized test you can give an AI model for music, and why Mikey doesn’t think AI-generated music will affect people’s consumption of human-made music.  Listen to the full songs played and created in this episode: Whispers of Sakura Stone  Statistical Paradise Statistical Paradise 2 Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @MikeyShulman Show Notes:  (0:00) Mikey’s background (3:48) Bark and music generation (5:33) Architecture for music generation AI (6:57) Assessing music quality (8:20) Mikey’s music background as an asset (10:02) Challenges in generative music AI (11:30) Business model (14:38) Surprising use cases of Suno (18:43) Creating a song on Suno live (21:44) Ratio of creators to consumers (25:00) The digitization of music (27:20) Mikey’s favorite song on Suno (29:35) Suno is hiring

    Context windows, computer constraints, and energy consumption with Sarah and Elad
    This week on No Priors, hosts Sarah and Elad catch up on the latest AI news. They discuss the recent developments in AI music generation, and if you’re interested in generative AI music, stay tuned for next week’s interview! Sarah and Elad also get into device-resident models, AI hardware, and ask just how smart smaller models can really get. They compare these hardware constraints to the hurdles AI platforms continue to face, including compute constraints, energy consumption, context windows, and how best to integrate these products into apps users are familiar with.  Have a question for our next host-only episode or feedback for our team? Reach out to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil  Show Notes:  (0:00) Intro (1:25) Music AI generation (4:02) Apple’s LLM (11:39) The role of AI-specific hardware (15:25) AI platform updates (18:01) Forward thinking in investing in AI (20:33) Unlimited context (23:03) Energy constraints

    Cognition’s Scott Wu on how Devin, the AI software engineer, will work for you
    Scott Wu loves code. He grew up competing in the International Olympiad in Informatics (IOI) and is a world-class coder, and now he's building an AI agent designed to create more, not fewer, human engineers. This week on No Priors, Sarah and Elad talk to Scott, the co-founder and CEO of Cognition, an AI lab focusing on reasoning. Recently, the Cognition team released a demo of Devin, an AI software engineer that can increasingly handle entire tasks end to end. In this episode, they talk about why the team built Devin with a UI that mimics looking over another engineer’s shoulder as they work and how this transparency makes for a better result. Scott discusses why he thinks Devin will make it possible for there to be more human engineers in the world, and what will be important for software engineers to focus on as these roles evolve. They also get into how Scott thinks about building the Cognition team and how they’re just getting started.  Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ScottWu46 Show Notes:  (0:00) Introduction (1:12) IOI training and community (6:39) Cognition’s founding team (8:20) Meet Devin (9:17) The discourse around Devin (12:14) Building Devin’s UI (14:28) Devin’s strengths and weaknesses  (18:44) The evolution of coding agents (22:43) Tips for human engineers (26:48) Hiring at Cognition

    OpenAI’s Sora team thinks we’ve only seen the "GPT-1 of video models"
    AI-generated videos are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI’s recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips that are up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn’t yet available for public use, but the examples of its work are very impressive. However, they believe we’re still in the GPT-1 era of AI video models and are focused on a slow rollout, both to ensure the model offers as much value to users as possible and, more importantly, to apply all possible safety measures against deep fakes and misinformation. They also discuss what they’re learning from implementing diffusion transformers, why they believe video generation is taking us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.  Show Links: Bling Zoo video Man eating a burger video Tokyo Walk video Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @_tim_brooks | @billpeeb | @model_mechanic Show Notes:  (0:00) Sora team Introduction (1:05) Simulating the world with Sora (2:25) Building the most valuable consumer product (5:50) Alternative use cases and simulation capabilities (8:41) Diffusion transformers explanation (10:15) Scaling laws for video (13:08) Applying end-to-end deep learning to video (15:30) Tuning the visual aesthetic of Sora (17:08) The road to “desktop Pixar” for everyone (20:12) Safety for visual models (22:34) Limitations of Sora (25:04) Learning from how Sora is learning (29:32) The biggest misconceptions about video models