    At scale, anything that could fail definitely will

    September 03, 2024

    Podcast Summary

    • Web3 with A16Z crypto podcast: The Web3 with A16Z crypto podcast provides insights from top tech leaders on the latest trends and innovations in the tech industry, valuable for coders, business leaders, and tech enthusiasts alike.

      The Web3 with A16Z crypto podcast is a valuable resource for those interested in the future of the internet, offering insights from top developers, scientists, and creators on the latest trends and innovations in tech. Ben Popper, the host of the Stack Overflow podcast, recommended this podcast for coders seeking more ownership of their work, business leaders trying to prepare for the future, and anyone curious about the next tech trends. Pradeep Vincent, the SVP and Chief Technical Architect of Oracle Cloud Infrastructure, shared his background in software engineering and how he transitioned from hands-on work to leadership roles, including at Amazon and Oracle. He emphasized the importance of understanding enterprise customers and markets to disrupt the cloud industry. Additionally, Ben and Ryan discussed the topic of incentivizing high-quality engineering while maintaining developer velocity in the context of large cloud platforms.

    • Cloud platform foundation: A strong foundation for a large-scale cloud platform includes engineering culture, a skilled team, focus on handling scale, failures, and trade-offs, and prioritizing security to address customer concerns.

      When building a large-scale cloud platform, it's crucial to have a strong foundation based on engineering culture and a team with prior experience. Oracle's Chief Technical Architect shared his experience of moving from AWS and Azure, forming a dream team, and launching Oracle Cloud Infrastructure (OCI) in 2017. They focused on handling scale, failures, and trade-offs, such as architecting for resilience and minimizing blast radii. However, they identified a key difference in their approach: security. Enterprise customers had concerns about trust and visibility in a multi-tenanted cloud, so OCI prioritized providing control and transparency to address these concerns. This emphasis on security has been a significant factor in Oracle's success in attracting enterprise customers to the cloud.

    • OCI Security Measures: Oracle Cloud Infrastructure prioritizes enterprise trust through robust security measures in their Gen2 architecture, distributed cloud strategy, and continuous improvement against multi-tenancy attacks.

      Oracle Cloud Infrastructure (OCI) prioritizes earning the trust of enterprise customers by implementing robust security measures. This is evident in their Gen2 architecture, which treats even their own VM and network layers as untrusted and adds further layers of protection. OCI's distributed cloud strategy also plays a role in building trust by allowing customers to have more control over their regions and software deployments. Although multi-tenancy attacks such as Rowhammer and side-channel attacks are still valid concerns, OCI continues to improve security measures and offers options like single-tenanted chips to further solidify trust. In essence, OCI's approach to security and trust is multi-faceted, with a strong emphasis on layers of defense and customer control.

    • Large-scale architecture failures and handling: To minimize the impact of failures in large-scale architectures, use common networks and regional services spread across multiple failure domains, and implement software architecture for detecting and managing failovers with trade-offs between quick failover and minimizing customer impact.

      When designing large-scale architectures, ensuring seamless failovers and handling failures effectively is crucial. Two approaches to mitigate side-channel attacks are providing customers limited access or implementing an abstracted interface with enough confidence. However, the industry hasn't eliminated side-channel attacks, so continued protection is necessary. When designing within a region, the goal is to minimize the impact of failures. Failure domains, such as data centers, have distinct power, cooling, and network sources. To build a high-level seamless layer, a common network and regional services are used. These services, like load balancers and databases, are spread across multiple failure domains and have software architecture to detect and manage failovers. The architecture's implications include the trade-off between quick failover and minimizing customer impact. While we want to fail over as quickly as possible, we also want to minimize the chaos caused by mass migrations. To address this, various throttling and slowdown mechanisms are used. Overall, handling failures effectively in large-scale architectures is a complex problem to solve.
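
      The trade-off between failing over quickly and avoiding a chaotic mass migration can be sketched in a few lines. The following is a minimal illustration, not OCI's implementation: the `FailureDomain` class, the `plan_throttled_failover` function, and the `max_moves_per_tick` parameter are all hypothetical names chosen for this example. It moves workloads out of a failed domain in capped batches and spreads them round-robin across the remaining healthy domains.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailureDomain:
    """Hypothetical failure domain, e.g. a data center with its own power and cooling."""
    name: str
    healthy: bool = True
    workloads: list = field(default_factory=list)


def plan_throttled_failover(domains, failed_name, max_moves_per_tick):
    """Return a list of ticks; each tick is a list of (workload, target) moves.

    Workloads leave the failed domain in batches of at most
    max_moves_per_tick and are spread round-robin over healthy domains.
    """
    failed = next(d for d in domains if d.name == failed_name)
    failed.healthy = False
    targets = [d for d in domains if d.healthy]
    if not targets:
        raise RuntimeError("no healthy failure domain left to fail over to")

    pending = deque(failed.workloads)
    failed.workloads = []
    ticks = []
    moved = 0
    while pending:
        batch = []
        for _ in range(min(max_moves_per_tick, len(pending))):
            workload = pending.popleft()
            target = targets[moved % len(targets)]
            target.workloads.append(workload)
            batch.append((workload, target.name))
            moved += 1
        ticks.append(batch)
    return ticks


# Assumed toy topology: three failure domains, five workloads in the first.
domains = [
    FailureDomain("fd-1", workloads=[f"db-{n}" for n in range(5)]),
    FailureDomain("fd-2"),
    FailureDomain("fd-3"),
]
schedule = plan_throttled_failover(domains, "fd-1", max_moves_per_tick=2)
for tick, batch in enumerate(schedule):
    print(tick, batch)
```

      A larger `max_moves_per_tick` shortens the time workloads spend on the failed domain but puts more simultaneous load on the survivors; a smaller value does the opposite. A production system would also need failure detection and capacity checks, which this sketch leaves out.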

    • Oracle Cloud Infrastructure and AI: Oracle Cloud Infrastructure prioritizes software independence and isolation, uses autonomous databases, cross-region replication, and automatic propagation for seamless integration, and is an essential partner in the AI industry's growth due to its security, data richness, and massive infrastructure requirements.

      Oracle Cloud Infrastructure (OCI) prioritizes software independence and isolation across multiple availability domains and regions, while balancing the need for seamless integration. This approach involves using services like autonomous databases, cross-region replication, and automatic propagation of identity rules and object store buckets. OCI also intentionally avoids a central backbone connecting regions, leading to DNS-based migration during application failovers. The last year or two has seen the continued growth of cloud computing and data, as well as the rise of AI and machine learning services. OCI views itself as an enabler for the AI industry by providing the necessary infrastructure and capacity for innovators and startups to develop new models without having to build their own capacity from scratch. The security, data richness, and the massive infrastructure requirements of AI models have made cloud providers like OCI essential partners in the AI industry's growth.

    • AI workloads and cloud providers: Cloud providers face unique challenges in handling AI workloads due to the tight clustering of GPUs required for syncing model weights, but are expected to evolve and adapt to lessen this dependency in the future.

      The rapid growth of AI workloads, specifically those requiring large clusters of GPUs, presents both a significant opportunity and a notable challenge for cloud providers. The opportunity lies in the potential for increased productivity gains, but the challenge comes from the engineering standpoint, particularly in dealing with availability, change management, and scale. Traditional cloud systems have evolved to handle failure and loose coupling, but with the return of large, tightly clustered GPU systems, new engineering practices are required. For instance, change management in large clusters is typically done in a big way, rather than the staggered approach used in traditional systems. The reason for the tight clustering of GPUs is due to the nature of AI training algorithms, which require GPUs to sync model weights across various runs. However, this tight coupling is expected to lessen in the next five years as algorithms evolve and infrastructure adapts. In conclusion, the opportunity presented by AI workloads is immense, but it comes with unique challenges that require cloud providers to adapt their engineering and operational practices. The tight clustering of GPUs is a necessity for now, but it's expected to loosen up in the future. Cloud providers need to meet the application halfway and help drive this evolution.
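
      Why weight syncing couples the GPUs so tightly can be seen in a toy version of synchronous data-parallel training. This is a simplified sketch under stated assumptions, not a description of any provider's stack: the workers are plain Python lists rather than GPUs, `all_reduce_mean` is a stand-in for a real collective operation, and the model is a single weight fitted to `y = 3x`. The point is that every step needs a gradient from every worker before anyone can proceed.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for the toy model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def synchronous_step(weights, shards, lr):
    """One training step: each worker computes a gradient, then all workers sync."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)  # barrier: needs a result from every worker
    return [w - lr * g for w, g in zip(weights, synced)]


# Four hypothetical workers, each holding one shard of data drawn from y = 3x.
shards = [[(1.0, 3.0)], [(2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
weights = [0.0] * len(shards)
for _ in range(50):
    weights = synchronous_step(weights, shards, lr=0.01)
print(weights)
```

      Because the averaged gradient depends on all workers, a single slow or failed worker stalls the whole step, which is the opposite of the loosely coupled, failure-tolerant design that traditional cloud services rely on.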

    • Cloud adaptability and resilience: The need for cloud applications to be adaptable and resilient to transient failures is crucial, and the Stack Overflow community plays a vital role in sharing knowledge and answering common questions.

      The episode closed on the importance of adaptability and resilience in cloud computing. Ben Popper and his guests discussed the challenges of running applications in the cloud and the need for those applications to become more accommodating of transient failures. They also acknowledged the growth of the Stack Overflow community, where users like shantanu share their knowledge and help answer common questions. If you have any questions or suggestions for the show, you can reach out to Ben Popper on Twitter or email podcast@stackoverflow.com. Ryan Donovan, who edits the blog, can be found on LinkedIn. Pradeep Vincent, the SVP and Chief Technical Architect for Oracle Cloud Infrastructure (OCI), encourages those interested in learning more about OCI to check out their blog series called "OCI First Principles," which is available online and will be linked in the show notes. Overall, the conversation emphasized the continuous learning and problem-solving nature of cloud engineering and the importance of collaboration and community in addressing these challenges.

    Recent Episodes from The Stack Overflow Podcast

    The world’s largest open-source business has plans for enhancing LLMs

    Red Hat Enterprise Linux may be the world’s largest open-source software business. You can dive into the docs here.

    Created by IBM and Red Hat, InstructLab is an open-source project for enhancing LLMs. Learn more here or join the community on GitHub.

    Connect with Scott on LinkedIn.  

    User AffluentOwl earned a Great Question badge by wondering How to force JavaScript to deep copy a string?

    The evolution of full stack engineers

    From her early days coding on a TI-84 calculator, to working as an engineer at IBM, to pivoting over to her new role in DevRel, speaking, and community, Mrina has seen the world of coding from many angles. 

    You can follow her on Twitter here and on LinkedIn here.

    You can learn more about CKEditor here and TinyMCE here.

    Congrats to Stack Overflow user NYI for earning a great question badge by asking: 

    How do I convert a bare git repository into a normal one (in-place)?

    The Stack Overflow Podcast
    September 10, 2024

    At scale, anything that could fail definitely will

    Pradeep talks about building at global scale and preparing for inevitable system failures. He talks about extra layers of security, including viewing your own VMs as untrustworthy. And he lays out where he thinks the world of cloud computing is headed as GenAI becomes a bigger piece of many companies’ tech stacks.

    You can find Pradeep on LinkedIn. He also writes a blog and hosts a podcast over at Oracle First Principles.

    Congrats to Stack Overflow user shantanu, who earned a Great Question badge for asking: 

    Which shell I am using in mac?

     Over 100,000 people have benefited from your curiosity.

    The Stack Overflow Podcast
    September 03, 2024

    Mobile Observability: monitoring performance through cracked screens, old batteries, and crappy Wi-Fi

    You can learn more about Austin on LinkedIn and check out a blog he wrote on building the SDK for OpenTelemetry here.

    You can find Austin at the CNCF Slack community, in the OTel SIG channel, or the client-side SIG channels. The calendar is public on opentelemetry.io. Embrace has its own Slack community to talk all things Embrace or all things mobile observability. You can join that by going to embrace.io as well.

    Congrats to Stack Overflow user Cottentail for earning an Illuminator badge, awarded when a user edits and answers 500 questions, both actions within 12 hours.

    Where does Postgres fit in a world of GenAI and vector databases?

    For the last two years, Postgres has been the most popular database among respondents to our Annual Developer Survey. 

    Timescale is a startup working on an open-source PostgreSQL stack for AI applications. You can follow the company on X and check out their work on GitHub.

    You can learn more about Avthar on his website and on LinkedIn.

    Congrats to Stack Overflow user Haymaker for earning a Great Question badge. They asked: 

    How Can I Override the Default SQLConnection Timeout?

    Nearly 250,000 other people have been curious about this same question.

    Ryan Dahl explains why Deno had to evolve with version 2.0

    If you’ve never seen it, check out Ryan’s classic talk, 10 Things I Regret About Node.js, which gives a great overview of the reasons he felt compelled to create Deno.

    You can learn more about Ryan on Wikipedia, his website, and his GitHub page.

    To learn more about Deno 2.0, listen to Ryan talk about it here and check out the project’s GitHub page here.

    Congrats to Hugo G, who earned a Great Answer Badge for his input on the following question: 

    How can I declare and use Boolean variables in a shell script?

    Battling ticket bots and untangling taxes at the frontiers of e-commerce

    You can find Ilya on LinkedIn here.

    You can listen to Ilya talk about Commerce Components here, a system he describes as a "modern way to approach your commerce architecture without reducing it to a (false) binary choice between microservices and monoliths."

    As Ilya notes, “there are a lot of interesting implications for runtime and how we're solving it at Shopify. There is a direct bridge there to a performance conversation as well: moving untrusted scripts off the main thread, sandboxing UI extensions, and more.” 

    No badge winner today. Instead, user Kaizen has a question about Shopify that still needs an answer. Maybe you can help! 

    How to Activate Shopify Web Pixel Extension on Production Store?

    Scaling systems to manage the data about the data

    Coalesce is a solution to transform data at scale. 

    You can find Satish on LinkedIn.

    We previously spoke to Satish for a Q&A on the blog: AI is only as good as the data: Q&A with Satish Jayanthi of Coalesce

    We previously covered metadata on the blog: Metadata, not data, is what drags your database down

    Congrats to Lifeboat winner nwinkler for saving this question with a great answer: Docker run hello-world not working