Podcast Summary
Founder's fascination with AI image technology leads to a third company: Doshi started Playground AI out of fascination with AI image generation, a gap in user-friendly tooling, and a desire to make a meaningful impact
Suhail Doshi, the founder of Playground AI, was inspired by advances in AI image generation and editing to start his third company. He was intrigued by the potential of building a user-friendly interface for these models, which at the time were mostly run from Google Colab notebooks. Doshi had previously considered working in music out of personal interest, but he couldn't envision an application that would be useful to the public. He instead saw an opportunity in images because of their built-in distribution and the way they combine creativity with tooling. Surveying the competitive landscape, Doshi acknowledged the large number of language-model companies and the significant funding they had raised; he wanted to work on something where he could make a meaningful impact while building on his existing experience with creative tools.
Unlocking the full potential of text-to-image models: Playground aims to advance text-to-image models by investing in research and development, creating high-quality, versatile, and practical applications
Current text-to-image models, such as Stable Diffusion, have not yet reached their full potential in practicality and utility. While many people are creating art with these models, advanced capabilities are still missing, such as editing, blending real and synthetic imagery, or stylizing existing images. Moreover, most companies are not investing significantly in improving the models themselves. To address this gap, Playground decided to build its own models rather than fine-tune existing ones. The team trained these models from scratch and recently launched version 2.5, which produces high-quality, beautiful imagery. To achieve this, Playground assembled a dedicated team and approached the project with a long-term focus. By investing in research and development, they aim to unlock the full potential of text-to-image models and create more versatile, practical applications.
Improving AI models like DALL·E 2 and DALL·E 3 involves more than proven architectures and enough compute: The team aimed to surpass open-source model performance by optimizing color and contrast with an EDM formulation, but balancing hand-tuning with what the model learns during training remains a challenge.
Creating advanced AI models like DALL·E 2 or DALL·E 3 involves more complexity than just using a proven architecture, gathering data, and applying enough compute. The team's goal was to push existing architectures, such as Stable Diffusion XL's UNet, CLIP, and VAE, beyond the performance of the open-source model. They found that improving color and contrast was crucial for aesthetically pleasing results, which led them to adopt an EDM formulation to optimize this aspect. However, the balance between hand-tuning parameters and letting the model learn during training is an ongoing challenge: these models have many dimensions, and optimizing each one requires careful consideration.
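To make the EDM reference concrete, here is a minimal sketch of the preconditioning and noise-level sampling from the EDM paper (Karras et al., 2022). It only illustrates the general formulation mentioned above; Playground's actual training code is not public, so the constants (sigma_data, P_mean, P_std) are the paper's defaults, not theirs.

```python
import torch

SIGMA_DATA = 0.5            # assumed std of the (latent) data distribution
P_MEAN, P_STD = -1.2, 1.2   # log-normal noise-level distribution from the paper

def sample_sigma(batch_size: int) -> torch.Tensor:
    """Draw per-sample noise levels sigma ~ exp(N(P_mean, P_std))."""
    return torch.exp(P_MEAN + P_STD * torch.randn(batch_size))

def edm_precondition(sigma: torch.Tensor):
    """Scaling factors that keep the network's inputs and targets near unit variance."""
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / torch.sqrt(sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / torch.sqrt(sigma**2 + SIGMA_DATA**2)
    c_noise = 0.25 * torch.log(sigma)
    return c_skip, c_out, c_in, c_noise

def denoise(model, x_noisy: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """D(x; sigma) = c_skip * x + c_out * F(c_in * x; c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_precondition(sigma)
    c_skip, c_out, c_in = (c.view(-1, 1, 1, 1) for c in (c_skip, c_out, c_in))
    return c_skip * x_noisy + c_out * model(c_in * x_noisy, c_noise)
```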
Combining techniques and curating high-quality data for effective AI: Effective AI involves a mix of techniques, curated data, and user understanding for optimal performance. Continuously striving for better evaluations and keeping up with new methods is crucial.
While there are various techniques and tricks in the field of AI, particularly in image and language processing, the most effective approach often involves a combination of these methods and a great deal of meticulous work, especially in the final stages of supervised fine-tuning. This includes curating high-quality data and applying good taste and judgment. However, evaluating what constitutes good taste can be challenging, as not everyone has the same aesthetic sensibilities. The industry's evaluations are often flawed and may not accurately reflect what users truly value. For instance, large language models may excel at tasks related to academic homework due to the nature of the evaluations used. Therefore, it's essential to continually strive for better evaluations that better align with user needs and preferences. Additionally, the field is constantly evolving, with new tricks and techniques emerging regularly. For example, Power EMA, EDM, offset noise, and DPO are just a few of the many approaches that can lead to significant improvements in performance. Overall, success in AI requires a combination of technical expertise, creativity, and a deep understanding of user needs and preferences.
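As one concrete example of the tricks listed above, here is a minimal sketch of offset noise: a small per-image, per-channel constant added to the Gaussian noise used in diffusion training, which lets the model shift overall brightness and helps with very dark or very bright images. The 0.1 scale is a commonly cited value, not a figure from Playground.

```python
import torch

def offset_noise(latents: torch.Tensor, offset_scale: float = 0.1) -> torch.Tensor:
    b, c, _, _ = latents.shape
    noise = torch.randn_like(latents)                          # standard diffusion noise
    offset = torch.randn(b, c, 1, 1, device=latents.device)    # one constant per channel
    return noise + offset_scale * offset
```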
Improving image-generating AI models through rigorous evaluation and user feedback: Rigorous evaluation of AI models through examination of thousands of images and user feedback is crucial for improving image-generating AI models. Playground, a top text-to-art platform, focuses on editing capabilities and has a complex data curation strategy.
While some evaluations for image-generation models lack sufficient coverage, particularly around judgment and taste, the overall goal is to make these evaluations stronger. This is done through rigorous examination of thousands of images across various grids and checkpoints. There is still room to make better use of feedback gathered inside the product itself, such as voting schemes or user studies. Playground stands out in the market by focusing on editing capabilities, letting users tweak and customize images they find or create. Despite the simplicity of the user interface, the data curation strategy behind Playground is complex, with a sophisticated process for collecting and ranking user data. Playground is currently number 2 in text-to-art, but it is expected to diverge from competitors because of its emphasis on editing.
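A hypothetical sketch of the kind of voting scheme discussed above: pairwise "which image is better?" votes from a user study aggregated into per-checkpoint scores with an Elo-style update. This is only an illustration, not a description of Playground's evaluation pipeline.

```python
from collections import defaultdict

K = 32  # rating update step size

def expected(r_a: float, r_b: float) -> float:
    """Probability that a beats b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate_checkpoints(votes):
    """votes: iterable of (winner_checkpoint, loser_checkpoint) pairs."""
    ratings = defaultdict(lambda: 1000.0)
    for winner, loser in votes:
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - e)
        ratings[loser] -= K * (1 - e)
    return dict(ratings)

# Example: three user votes between two model checkpoints.
print(rate_checkpoints([("v2.5", "v2.0"), ("v2.5", "v2.0"), ("v2.0", "v2.5")]))
```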
Creating a large vision model for images: The team is developing a model to create, edit, and understand images, with potential applications in robotics and advanced technologies.
The company is focusing on scaling pixels and building a large vision model for images, with the long-term goal of a multitask vision model capable of creating, editing, and understanding visual content. The team is prioritizing images over other media types, such as video and 3D, which currently offer lower utility per unit of compute. The ultimate vision is a model that can create, edit, and understand visual content, potentially leading to applications in robotics and other advanced technologies. The team is in the early stages of this effort, focused primarily on image generation and understanding. The motivation is to deliver higher utility with less effort for users, moving beyond simply posting images on social media; for example, the team is exploring ways to help users place their own logos or images into new contexts. While there are other active areas of research in vision and pixels, such as 3D content and video, the team believes images offer the most promising opportunities for efficient progress and practical applications.
Combining transformers and multimodal models for future AI image generation: Transformers and multimodal models are expected to shape the future of AI image generation by combining strengths, allowing for better long context handling and knowledge reasoning, while incorporating interpretable knowledge from language and image models.
The future of AI models, particularly those focused on image generation, is likely to involve transformer-based architectures and multimodal models that effectively marry text and image understanding. Traditional diffusion-model architectures have their merits, but transformers are seen as the right direction because of their ability to handle long context and knowledge reasoning. There is also a need to incorporate interpretable knowledge from language models and image models, which can be achieved through approaches like DiT (Diffusion Transformers) and related variants. The ultimate goal is a truly multimodal general model that can handle any modality, not just language or images but also audio and more. The architecture of these models is expected to change significantly in the coming year, with a focus on combining the strengths of various model types. DiT is just one approach, and transformers are likely to play a crucial role in this evolution. Ultimately, the goal is a model that can understand and generate outputs across modalities, providing a more comprehensive and versatile AI system.
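For reference, here is a minimal sketch of a single DiT-style block in the spirit of Peebles & Xie (2023): a standard transformer block whose LayerNorm scale/shift and residual gates are produced from a conditioning embedding (e.g. the diffusion timestep). The hyperparameters and the simplified conditioning projection are illustrative assumptions, not the design of any model discussed above.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # adaLN: conditioning embedding -> per-block shifts, scales, and gates
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patch tokens; cond: (batch, dim) timestep/class embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```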
The Interconnection of Language and Vision: Language and vision, two distinct forms of data, are interconnected due to technology advancements. Vision has a vast amount of data but can be challenging to filter, while language has potential to control things but may have a lower ceiling.
Language and vision, two different forms of data, are increasingly interconnected as the technology advances. Language, as a form of compressed information, has the ability to control things, but it may have a lower ceiling than vision, which is rich in data and can be expanded by collecting more pixel data. The Internet is a vast source of vision data, but it may not be sufficient, and filtering and cleaning that data is challenging. Audio, another form of data, is also enormous and has potential in areas like music production. Elad, who has experience in both vision and music, believes audio will be significant, and points to initiatives like ElevenLabs as interesting developments in the field. To get a better sense of where these technologies are heading, he focuses on using them as a user, which gives a stronger feel for their potential applications and directions.
Exploring Music Creation with AI Tools: AI tools can generate instrumental music, but high-quality vocals and emotional depth remain a challenge. Human touch is crucial in creating emotionally resonant music.
While creating instrumental music with AI tools is relatively easy, finding and working with singers and obtaining high-quality lyrics and vocals remains a significant challenge. The speaker shares his experience using AI tools like Suno and others to create instrumentals and extract lyrics, but notes that the true value lies in the human element of music: the emotional depth and flow of the vocals. He also mentions that songs produced this way still contain errors, though it's an exciting new way to explore music creation. Overall, the conversation highlights the potential of AI in music production while emphasizing the importance of the human touch in creating emotionally resonant music.