Podcast Summary
Foundry's innovative cloud infrastructure for AI: Foundry, a cloud built for AI, aims to improve economics by 12-20x compared to existing solutions and address issues of low utilization rate due to hardware failures and idle time between workloads, making advanced AI resources more accessible to a wider audience.
Foundry, a public cloud built specifically for AI, aims to make advanced computational resources more accessible and affordable to a broader audience. Jared Davis, Foundry's CEO, was inspired by what small teams with significant computational power have achieved, such as DeepMind's AlphaFold 2. Foundry's mission is to reimagine cloud infrastructure from the ground up for AI workloads, improving economics by 12 to 20x compared to existing solutions. The utilization of GPU clouds is often far lower than expected because of frequent hardware failures and idle time between workloads, a problem that spans every category of user, from hyperscalers to operators of large clusters to individuals. Foundry aims to close this gap and, in doing so, increase the frequency of groundbreaking AI developments. Its primary offerings are infrastructure as a service and tools for seamless access to state-of-the-art systems.
Complexity of GPUs and machine learning systems: The complexity of GPUs and machine learning systems, consisting of thousands to tens of thousands of components, leads to more frequent failures in newer, advanced systems and requires specialized tooling and orchestration to keep them running smoothly.
GPUs, often thought of as just chips, are actually complex systems consisting of thousands to tens of thousands of components. NVIDIA's DGX and HGX systems compress an entire data center's worth of infrastructure into a single box, but with so many components in a large supercomputer built from interconnected GPUs, the probability of running for long without any failure is essentially zero. This complexity means that newer, more advanced systems fail more often. The large-scale regime in machine learning requires orchestrating a cluster of GPUs to perform a single synchronous calculation, so many components must collaborate on one job, and a single component failing can degrade or halt the entire run. The hyperscalers, such as Amazon Web Services (AWS), have made particular assumptions about the depreciation cycle of these large, complex systems, and the definition of "cloud" has drifted from its original sense of a service, rather than mere co-location of hardware. The cloud as we know it today took shape in the early 2000s, with AWS launching its core services in 2006, and it took time for the model to catch on because its value proposition was initially unclear. The complexities of these large-scale systems require specialized tooling and orchestration to keep them running smoothly.
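To make the failure-at-scale point concrete, here is a minimal sketch with assumed per-component failure rates (not figures from the episode) showing how quickly the chance of a fault-free run collapses as component counts grow:

```python
# Minimal sketch with assumed numbers (not figures from the episode): probability
# that a synchronous job sees no component failure, assuming independent failures.

def prob_no_failure(num_components: int, per_component_failure_prob: float) -> float:
    """Chance that every component survives the run."""
    return (1.0 - per_component_failure_prob) ** num_components

p_fail = 1e-4  # assume each component has a 0.01% chance of failing during the run
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} components -> P(no failure) ~ {prob_no_failure(n, p_fail):.4g}")
# ~0.90 at 1,000 components, ~0.37 at 10,000, ~0.000045 at 100,000
```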
Cloud Elasticity: Cloud Elasticity allowed users to access on-demand compute resources and pay for only what they used, a significant departure from the traditional model of buying and maintaining physical servers, leading to cost savings and making the cloud a critical component of modern technology infrastructure.
The cloud computing revolution, which started around 2007, was not immediately recognized as a game-changer by everyone. Early adopters, particularly startups and early cloud service providers, saw the value in the elasticity and cost savings the cloud offered, but it took time for enterprises, especially those in regulated industries, to embrace it fully. One of the cloud's key advantages was making compute resources available on demand, with users paying only for what they used, a concept known as elasticity and a significant departure from the traditional model of buying and maintaining physical servers. Elasticity also made speed effectively free: for the same cost, a workload could be spread across many more machines and finish much sooner. Realizing that potential was not trivial, though; it required reshaping workloads to parallelize and having the necessary capacity available in the cloud. The cloud's elasticity and cost savings have since proven to be a major advantage, making it a critical component of modern technology infrastructure. Today it is hard for younger engineering teams to imagine a world without the cloud, but its adoption was not an overnight success.
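As a minimal sketch of the elasticity point (the $2/hour rate below is a hypothetical price, not one quoted in the episode), a perfectly parallelizable job costs the same whether it runs on one machine or a thousand; only the wall-clock time changes:

```python
# Hypothetical numbers for illustration: a job needing 1,000 machine-hours of work
# at an assumed $2 per machine-hour costs the same at any degree of parallelism.

HOURLY_RATE = 2.0            # assumed price per machine-hour (not a real quote)
TOTAL_MACHINE_HOURS = 1_000  # total work the job requires

for machines in (1, 10, 100, 1_000):
    hours = TOTAL_MACHINE_HOURS / machines
    cost = machines * hours * HOURLY_RATE
    print(f"{machines:>5} machines x {hours:>7.1f} h each = ${cost:,.0f}")
# Every configuration costs $2,000; the 1,000-machine run just finishes ~1,000x sooner.
```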
AI hardware resources challenges: AI hardware resources present significant challenges for companies due to upfront capital requirements, inflexibility, and risk management. Innovative business models and technical solutions aim to create a more efficient and flexible system.
The current state of AI hardware resources presents significant challenges for companies, particularly around upfront capital requirements, inflexibility, and risk management, because of the long-term commitments and high costs of purchasing and maintaining hardware for AI workloads. The situation is reminiscent of a traditional parking lot business, where customers can either pay as they go or reserve a spot, with the latter requiring significant money upfront. Addressing these challenges calls for business model and technical innovations that enable more efficient use of resources and provide greater flexibility. One such innovation is letting pay-as-you-go users occupy reserved spots while they sit empty, a win-win for both parties; implementing it, however, requires a convenient and seamless system, akin to a valet service, to manage the handover of reserved spots. The goal is a more efficient and flexible system that reduces the upfront capital requirements and risks associated with AI hardware resources.
GPU spot management: Foundry's spot capacity offering on the Foundry Cloud Platform automates and optimizes GPU usage, benefiting companies that use GPUs for training and inference; the scale of GPU usage across applications is substantial.
Foundry has launched a product on the Foundry Cloud Platform called spot, a mechanism for managing and automating the use of spot GPU capacity. Continuing the parking-lot analogy, it works like a valet with sensors: when the reserving customer returns, the system detects it, moves the pay-as-you-go "car" to another spot, and hands the reserved spot back, creating more effective use of space and better economics for everyone. Spot capacity is particularly beneficial for companies using GPUs for training and inference; in AWS's experience, companies have used spot mechanisms of this kind quite extensively for these purposes, and it opens up interesting conversations about the different classes of workloads and their needs. Making spot capacity usable through automation is a significant trend in the industry. To put the scale of GPU capacity in context, training GPT-3 used roughly 10,000 V100 GPUs for about 14.6 days. At Ethereum's peak, there were around 10 to 20 million V100-equivalent GPUs in use, running continuously. The Ethereum network relied on GPUs to a far greater degree than Bitcoin, where NVIDIA GPUs numbered only in the tens of thousands and contributed less than 1% of global hash power. These numbers give an idea of the vast amount of GPU power in use across applications.
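A quick back-of-the-envelope comparison of those two figures, treating the episode's rough estimates at face value and ignoring differences in per-GPU capability:

```python
# Back-of-the-envelope arithmetic using the rough figures quoted above; device
# counts only, ignoring differences in per-GPU capability.

gpt3_gpu_days = 10_000 * 14.6   # ~146,000 GPU-days for the whole training run
ethereum_gpus = 15_000_000      # midpoint of the 10-20 million estimate

print(f"GPT-3 training, in total:  ~{gpt3_gpu_days:,.0f} GPU-days")
print(f"Ethereum at peak, per day: ~{ethereum_gpus:,} GPU-days")
print(f"Ratio: ~{ethereum_gpus / gpt3_gpu_days:.0f}x a GPT-3 run's worth of GPU time, every day")
```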
Compute power utilization in AI: Despite an abundance of high-end compute power, utilization rates are low due to factors like healing buffers and market dynamics. The future of AI may involve a shift towards smaller, smart models and distributed training across multiple data centers.
While there is an abundance of compute power in the world, with even consumer devices like the iPhone 15 Pro delivering roughly 35 trillion operations per second of AI compute, utilization of that power is quite low. According to some sources, utilization of even high-end H100 systems is at most 20-25%. Even during pre-training runs, utilization can fall to around 80% because of healing buffers, capacity held in reserve to replace failed nodes, and across the broader fleet idle time pushes it far lower. Tools like Mars, which provide monitoring, alerting, resiliency, and security, can help boost GPU availability and uptime. Market dynamics make access to the largest, most interconnected clusters a premium, but paradigms are also emerging that don't require such clusters, including the idea of "pumpkin AI systems": smaller but extremely smart models that can be trained on smaller clusters. Google, for example, has been experimenting with training models across multiple data centers. The future of AI may look less like everything requiring large clusters and more like a shift towards these new paradigms: abundant compute, better utilized, without every workload needing a single giant interconnected cluster.
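A minimal sketch of how those utilization numbers can compound; the reserve, availability, and scheduling fractions below are assumptions chosen to illustrate the 80% and 20-25% figures mentioned above, not measurements:

```python
# Illustrative assumptions, not measured values: how a healing buffer, node
# downtime, and idle time between workloads multiply into low overall utilization.

healing_buffer = 0.20   # assumed fraction of a cluster held in reserve for failed nodes
availability   = 0.90   # assumed fraction of time a node is healthy
scheduled      = 0.30   # assumed fraction of time a healthy node has work queued

during_run = 1 - healing_buffer
fleet_wide = (1 - healing_buffer) * availability * scheduled

print(f"During a dedicated pre-training run: ~{during_run:.0%}")  # ~80%
print(f"Across a fleet with idle gaps:       ~{fleet_wide:.0%}")  # ~22%
```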
Scaling up AI models: Researchers are exploring new ways to scale up AI models by generating candidate responses and filtering down to the best one, utilizing synthetic data generation, compound systems, batch inference, and horizontally scalable workflows, and prioritizing verifiability for high performance.
Researchers are exploring new ways to scale up AI models by making more efficient use of existing resources and parallelizing workloads. This includes generating a large number of candidate responses from a model and filtering down to the best one, as demonstrated in the Chinchilla paper. This approach is becoming more common as systems like AlphaGeometry, which utilize synthetic data generation and compound systems, gain popularity. Additionally, the cost of training and inference is becoming a more significant consideration, leading to a shift towards batch inference and horizontally scalable workflows. A paper Davis recently authored delves deeper into this concept of compound AI systems, where many calls to a model are composed into a network of networks. The principle of verifiability, which refers to problems where it's easier to check an answer than to generate one, can guide the architecture of these systems, resulting in high performance. This approach to scaling up AI models is gaining traction as a cost-effective and efficient alternative to traditional methods.
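A minimal sketch of the generate-then-filter pattern described above; `call_model` is a hypothetical stand-in rather than a real model API, and the factoring task is chosen only because it is cheap to verify:

```python
import random

# Sketch of generate-many-then-filter (best-of-n). `call_model` is a hypothetical
# stand-in for an LLM call; factoring is used because checking an answer is cheap.

TARGET = 91  # we want two factors greater than 1 whose product is 91

def call_model(prompt: str) -> tuple[int, int]:
    """Hypothetical model call: returns a candidate answer (here, a random guess)."""
    return random.randint(2, 30), random.randint(2, 30)

def verify(candidate: tuple[int, int]) -> bool:
    """Verifiability: checking a candidate is far cheaper than producing a good one."""
    a, b = candidate
    return a * b == TARGET

# Fan out many candidate generations (easily parallelized), then filter down.
candidates = [call_model(f"factor {TARGET}") for _ in range(5_000)]
verified = [c for c in candidates if verify(c)]
print(verified[0] if verified else "no verified answer in this batch")  # usually (7, 13) or (13, 7)
```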
Massive language models: Combining multiple language models in a massive network can significantly improve performance on parallelizable tasks, such as code generation and neural network design, and is expected to become a common approach in the future.
A new approach that uses pre-training and composes massive networks out of multiple language models could significantly improve performance on various tasks, especially those that are more parallelizable. This was demonstrated in a recent paper, where a 3% improvement was achieved on the MMLU benchmark, a notable gap over previous best models. The idea is to have each stage in the network draw on the best of multiple language models, making millions of calls to answer a question and then choosing the best response. While this may seem far-fetched now, it is expected to become common sense in the future for tasks like code generation, design, and neural network design. The hope is that the community will explore this further, as it appears applicable to many downstream tasks and could reduce the need for large, interconnected clusters for cutting-edge work.
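A toy sketch of the fan-out-and-select idea; the three model stubs are hypothetical stand-ins for real language models, and majority voting stands in for whatever selection mechanism the paper actually uses:

```python
import random
from collections import Counter

# Toy sketch: one stage of a "network of networks" fans a question out to several
# models (stubbed here) many times, then selects the most common answer. The stubs
# and the majority-vote selector are illustrative assumptions, not the paper's method.

def model_a(question: str) -> str: return "4" if random.random() < 0.8 else "5"
def model_b(question: str) -> str: return "4" if random.random() < 0.7 else "3"
def model_c(question: str) -> str: return "4" if random.random() < 0.6 else "6"

def stage(question: str, models, samples_per_model: int = 16) -> str:
    """Fan out the question, collect all answers, and pick the most common one."""
    answers = [m(question) for m in models for _ in range(samples_per_model)]
    return Counter(answers).most_common(1)[0][0]

print(stage("What is 2 + 2?", [model_a, model_b, model_c]))  # almost always "4"
```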