Podcast Summary
Arc benchmark: The Arc benchmark tests machine intelligence by requiring only core knowledge and resisting memorization; it aims to measure an AI system's ability to adapt to novelty and learn new skills efficiently.
The Arc benchmark, created by Francois Chollet from Google and intended as a test for machine intelligence, is designed to be resistant to memorization and requires only core knowledge, making it challenging for large language models (LLMs). The benchmark consists of novel puzzles that cannot be solved by relying on memorized information alone. While some progress has been made using approaches like discrete program search and program synthesis, these methods still rely on some overlap between the tasks they've been trained on and the new, unseen tasks. The ultimate goal is to observe an AI system that can adapt to novelty on the fly and pick up new skills efficiently, demonstrating general intelligence. This is crucial because the world is constantly changing, and no model can be pre-trained on everything it might encounter at test time. Instead, humans possess the ability to learn and adapt to new situations, making us unique. The Arc benchmark serves as an important step towards understanding the capabilities of AI systems and identifying when we might be on the path to achieving artificial general intelligence (AGI).
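The discrete program search approach mentioned above can be sketched in a few lines. The snippet below is a minimal, illustrative Python version (the primitive names and the tiny DSL are hypothetical, not an actual Arc solver): it enumerates compositions of grid transformations until one is consistent with every training pair, then applies that program to a new input.

```python
# Minimal sketch of discrete program search over a tiny hypothetical DSL
# of grid transformations. Grids are tuples of tuples of ints.
from itertools import product

def identity(g):
    return g

def flip_h(g):
    # Mirror each row left-to-right.
    return tuple(row[::-1] for row in g)

def flip_v(g):
    # Mirror the grid top-to-bottom.
    return g[::-1]

def rotate90(g):
    # Rotate the grid 90 degrees clockwise.
    return tuple(tuple(col) for col in zip(*g[::-1]))

PRIMITIVES = [identity, flip_h, flip_v, rotate90]

def compose(fns):
    # Build a program that applies the primitives in sequence.
    def program(g):
        for f in fns:
            g = f(g)
        return g
    return program

def search(train_pairs, max_depth=2):
    """Enumerate compositions of primitives, shortest first; return the
    first program consistent with every (input, output) training pair."""
    for depth in range(1, max_depth + 1):
        for fns in product(PRIMITIVES, repeat=depth):
            prog = compose(fns)
            if all(prog(x) == y for x, y in train_pairs):
                return prog
    return None

# Toy task: the hidden rule is "flip horizontally".
train = [(((1, 2), (3, 4)), ((2, 1), (4, 3)))]
prog = search(train)
print(prog(((5, 6), (7, 8))))  # -> ((6, 5), (8, 7))
```

The obvious weakness, as discussed in the episode, is combinatorial explosion: the search space grows exponentially with program depth, which is why guiding the search matters.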
Arc tasks and LLMs: Arc tasks, with their small, 2D grids and unfamiliar nature, challenge LLMs' ability to generate new solution programs, limiting their performance in real-world scenarios and impacting benchmarking measures.
While Large Language Models (LLMs) excel at processing sequential data, they struggle with unfamiliar tasks that require on-the-fly program synthesis, such as Arc. The reason for this lies in the nature of Arc tasks, which involve small, 2D grids with only 10 possible symbols, making them easy to flatten and process. However, the unfamiliarity of each new task requires LLMs to generate new solution programs, which they currently lack the ability to do effectively. This is in contrast to humans, who can easily understand and solve Arc tasks, even with limited knowledge. The ongoing debate revolves around the need to add active inference or adaptive compute to LLMs to address this limitation. The importance of this issue goes beyond a technical detail, as it directly impacts the performance measures used for benchmarking LLMs and the questions we ask of these models. The current benchmarks, which are primarily memorization-based, favor LLMs' strengths in memorizing static programs, but do not accurately reflect their ability to reason and synthesize new programs on the fly. This is a critical limitation that needs to be addressed to unlock the full potential of LLMs.
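The point that small 10-symbol grids are "easy to flatten" for a sequence model can be made concrete. The sketch below shows one plausible serialization; the exact delimiter scheme is an illustrative assumption, not a standard Arc encoding.

```python
# Illustrative flattening of an Arc-style grid (cell values 0-9) into a
# token string a sequence model could consume; '|' marks row boundaries.
def flatten_grid(grid):
    """Serialize a 2D grid row by row into a single string."""
    return "|".join("".join(str(cell) for cell in row) for row in grid)

def unflatten_grid(s):
    """Invert flatten_grid back into a list of lists of ints."""
    return [[int(c) for c in row] for row in s.split("|")]

grid = [[0, 1, 2], [3, 4, 5]]
encoded = flatten_grid(grid)
print(encoded)  # -> 012|345
assert unflatten_grid(encoded) == grid
```

Serialization is the easy part; the hard part, as the paragraph above notes, is that producing the correct output grid for an unfamiliar task amounts to synthesizing a new solution program, not recalling a memorized one.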
Memorization vs intelligence: Memorization improves performance on benchmarks but does not increase intelligence. Intelligence requires the ability to adapt, learn new tasks efficiently, and synthesize new programs on the fly.
While increasing the size of a model's database can improve its performance on memorization-based benchmarks, it does not necessarily increase the intelligence of the system. Skill and intelligence are not the same. Memorization allows a system to recall and apply stored knowledge, but it does not enable the system to adapt and learn new skills or tasks on its own. The confusion arises because, as we add more knowledge to a system, it becomes more skillful and capable, but it does not become more intelligent in the sense of being able to approach any problem and quickly master it using very little data. Intelligence requires the ability to adapt, learn efficiently on the fly, and synthesize new programs to solve unfamiliar tasks. This is the definition of generality, and it cannot be achieved solely through memorization or the scaling up of specific skills; it requires the ability to pick up new tasks efficiently, which, some argue, large language models are beginning to demonstrate through their ability to learn new languages and adapt to new tasks with minimal training.
Human vs LLM adaptation abilities: Humans can adapt to new situations and navigate novel environments, while LLMs rely heavily on their training data and have limited capabilities for generalization and synthesizing new knowledge.
While large language models (LLMs) like Gemini 1.5 have shown impressive in-context language learning abilities, human intelligence is not limited to memorization and pattern matching. Humans possess the ability to adapt to new situations and navigate novel environments, which is essential for daily life and complex problem-solving. LLMs, on the other hand, rely heavily on their extensive training data and have limited capabilities for generalization and for synthesizing new knowledge. For instance, a human programmer cannot be entirely replaced by an LLM: programmers face novel challenges every day, and the job requires more than memorization and pattern matching. While LLMs can generate code snippets based on their training, they cannot develop software to solve entirely new problems. In conclusion, while LLMs have made significant strides in language understanding, they are not yet capable of the extreme generalization and adaptation abilities that humans possess.
Creativity and AI limitations: While larger models can learn more efficiently, they still struggle with truly novel concepts and true reasoning. Models may be performing a form of program synthesis, but this remains unproven. The Arc benchmark tests a model's ability to solve novel tasks, and even solving 80% of it might not equal AGI. Intelligence can be seen as a pathfinding algorithm over possible future situations, and current models lack the ability to anticipate change and adapt to new information.
Creativity is not just interpolation in a higher dimension for larger models, but rather a combination of pattern matching, memorization, and true reasoning. While larger models can learn more efficiently thanks to reusable building blocks, they still have limitations when encountering truly novel concepts. The discussion also touched on the idea that models might be using a form of program synthesis to combine and reason over inputs, but this is still a theory and not yet proven. The Arc benchmark, which tests a model's ability to solve novel tasks, was brought up as a challenge for models, and it was suggested that even if a model could solve 80% of Arc tasks, this might not equate to AGI. The conversation also touched on the idea of intelligence as a pathfinding algorithm in the space of possible future situations, and the limitations of models that cannot anticipate changes and adapt to new information. Ultimately, the discussion highlighted the differences and similarities between human intelligence and AI, and the ongoing challenges in understanding and replicating the full range of human cognitive abilities.
Merging LLMs and Discrete Models: The future of AI might involve merging the strengths of parametric curves (LLMs) and discrete program search (discrete models) to create a hybrid system capable of handling both memorization and generalization effectively, leading to a more intelligent AI system.
While Large Language Models (LLMs) can automate various tasks and generate significant economic value, they are not yet capable of true intelligence. LLMs excel in memorization and can generalize to some extent due to compression and regularization. However, they struggle with dealing with novelty and uncertainty, which requires intelligence. The future may involve merging the strengths of parametric curves (LLMs) and discrete program search (discrete models) to create a hybrid system that can handle both memorization and generalization effectively. This could lead to a more intelligent AI system that can adapt to new situations and learn from limited data.
Deep learning-guided search: Deep learning models will guide the program search process and supply common sense knowledge, resulting in an efficient and effective problem-solving system.
The future of AI development lies in combining discrete program search with deep learning for more efficient and effective problem-solving. Deep learning models, which can intuit the rough shape of a solution, will guide the search process. The actual search will not be done through brute force but by asking one deep learning model for suggestions and using another deep learning model for feedback. Deep learning will also supply common sense knowledge and knowledge in general, resulting in an on-the-fly synthesis engine that can adapt to new situations. The key is to make discrete program search dramatically more efficient through the use of deep learning. The importance of architecture in intelligence was also emphasized, with memory and intelligence being separate components, and the debate continues over the degree to which scaling versus architectural improvements are necessary for achieving human-level intelligence. Zapier co-founder Mike Knoop launched the prize with Francois Chollet out of their shared interest in advancing AI research and pushing the boundaries of what is possible with deep learning.
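The "guided search" idea above can be sketched as a best-first search in which a learned model scores candidate branches so promising ones are explored first. In this minimal Python sketch, a hand-written cell-overlap scorer stands in for a trained neural network, and all primitive names are hypothetical:

```python
# Illustrative deep-learning-guided program search: a (stubbed) scorer
# ranks candidate next states so the search explores promising branches
# first, instead of brute-forcing every composition of primitives.
import heapq

def flip_h(g): return tuple(row[::-1] for row in g)
def flip_v(g): return g[::-1]
def transpose(g): return tuple(zip(*g))

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def stub_scorer(state, target):
    # Stand-in for a learned model: fraction of matching cells.
    if len(state) != len(target) or len(state[0]) != len(target[0]):
        return 0.0
    same = sum(a == b for row_s, row_t in zip(state, target)
               for a, b in zip(row_s, row_t))
    return same / (len(target) * len(target[0]))

def guided_search(start, target, max_steps=4):
    """Best-first search over primitive sequences, ordered by the scorer;
    returns the list of primitive names that maps start to target."""
    frontier = [(-stub_scorer(start, target), start, [])]
    seen = {start}
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == target:
            return path
        if len(path) >= max_steps:
            continue
        for name, fn in PRIMITIVES.items():
            nxt = fn(state)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    frontier,
                    (-stub_scorer(nxt, target), nxt, path + [name]))
    return None

print(guided_search(((1, 2), (3, 4)), ((2, 1), (4, 3))))  # -> ['flip_h']
```

Replacing the stub with an actual trained model is of course the hard research problem; the sketch only shows where such a model would plug into the search loop.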
Arc competition progress: Despite significant resources invested in AI research, the lack of progress towards solving the Arc competition indicates that the trend of closing frontier research and focusing on LLMs may hinder progress towards AGI. An open competition with an open-source model could help uncover potential solutions and latent capabilities.
The Arc competition, a unique evaluation for measuring progress towards artificial general intelligence (AGI), has not gained widespread recognition, owing to its difficulty and the fact that it requires genuinely new ideas to solve. The speaker, who was first introduced to the Arc puzzles during the COVID-19 pandemic, was shocked by the lack of progress made towards solving it since its release, despite the significant resources being invested in AI research. The speaker believes that the trend of closing off frontier research and the focus on Large Language Models (LLMs) have hindered progress towards AGI, and that an open competition requiring open-source solutions could help uncover new approaches and latent capabilities in existing models. The Arc competition offers a prize pool of over a million dollars for solving the benchmark or achieving the best-scoring solutions.
Arc competition: The Arc competition, with a $500,000 prize for reaching 85% accuracy and requirements to make solutions public, aims to push boundaries in AI research and potentially reveal new insights into human-level AI capabilities.
The Arc competition, co-founded by Zapier's Mike Knoop, aims to solve a benchmark on which current approaches score only around 35%, with the ultimate goal of reaching 85% accuracy. The competition offers a $500,000 prize for the first team to reach the 85% benchmark, though this is expected to take several years. To encourage progress and knowledge sharing, there is also a $100,000 progress prize and a $50,000 paper award. Unlike typical contests, winners are required to make their solutions public to foster collaboration and advance the field. The competition is intriguing because it challenges current machine learning models and may reveal new insights into human-level AI capabilities. Jack Cole's approach, which involves active inference and shallow search, is unique and may offer a different perspective compared to traditional discrete program search methods. The optimal solution likely lies somewhere between shallow and deep search, leveraging both memorization and deep learning. The Arc competition is expected to push the boundaries of AI research and provide valuable insights, regardless of the outcome.
AI model efficiency evaluation: The Arc competition evaluates AI model efficiency and effectiveness with limited resources, including a 12-hour runtime limit and 100 test tasks, while acknowledging the potential of larger models and planning to make a private test set available for researchers.
The ongoing competition, which tests AI models on the Arc benchmark, is designed to evaluate the efficiency and effectiveness of models under limited resources. With a 12-hour runtime limit and only 100 test tasks, each task demands significant compute efficiency. The competition's organizers are interested in seeing how much progress can be made within these constraints. However, they also acknowledge the open question of what larger models could accomplish on Arc, and plan to make a private test set available for researchers to explore. The public test set, which is available on GitHub, may introduce uncertainty, since models whose training data includes GitHub may have memorized its tasks. The competition aims to evolve the Arc dataset and eventually release a new version, while maintaining security and preventing unintended training on the data. The competition's high reward is intended to attract new ideas and approaches to solving the benchmark, and the organizers expect to learn important insights about the current state of AI research.
Program synthesis and AGI: Program synthesis, enabling the creation of complex programs from a few examples, is a significant step towards artificial general intelligence and a new paradigm for software development.
The ongoing research in program synthesis represents a significant step towards artificial general intelligence (AGI) by enabling the creation of complex programs from just a few examples, a new paradigm for software development. This approach allows problem-solving programs to be synthesized without explicit programming, and it is expected to be a major milestone on the path to AGI. Using a code interpreter and fine-tuning are valid approaches, but it is important to avoid brute-force solutions and memorization systems that cheat the benchmark. The solution to the Arc competition will likely combine the deep learning and discrete program search paradigms. Core knowledge, which can be acquired and learned, plays a crucial role in intelligence, and humans acquire most of it during the first few years of life. The goal of the competition is to accelerate progress towards AGI by ensuring that any meaningful progress is shared publicly; it will continue until a reproducible open-source solution is available. The competition also serves as a way to test the actual limits of compute power.
ARC Prize: A $1 million reward is available for advancements in artificial intelligence at arcprize.org. The prize is committed to continuous improvement and welcomes feedback and adjustments.
The ARC Prize, with a $1 million reward, is now live at arcprize.org. During this podcast, we discussed the importance of continuous improvement and learning in the field of artificial intelligence. Both the host and the guest expressed their commitment to evolving the prize and making it the best it can be. Some initial decisions may be arbitrary, but they will be updated and adjusted based on feedback and progress. If you're interested in participating, visit arcprize.org and give it a try. The conversation also touched on the excitement and surprise of delving deeper into the topic of intelligence and gaining a new perspective. So, whether you're an expert or just starting out, this is an opportunity to be part of something innovative and potentially groundbreaking. Good luck!