
    Podcast Summary

    • Combining Machine Learning and PostgreSQL
      Montana Low and Jeremy Stanley integrated machine learning models directly into PostgreSQL, making predictions more accessible to data scientists through simpler solutions.

      In many business contexts, machine learning applications often involve working with relational data and making predictions based on it. Contrary to the hype around complex deep learning models, linear regression and XGBoost are often sufficient for most predictions. Montana Low and Lev shared the story of how they found themselves combining machine learning with a popular database, PostgreSQL. Montana, who joined Instacart seven years ago, had a background in data science and experience with databases and search systems. When Instacart grew and needed a more scalable architecture, Montana was tasked with separating the product catalog data from the single Postgres database and fronting it with Elasticsearch. As data science became more important at Instacart, Montana led initiatives to distribute the architecture and stitch systems together. Instacart brought on a VP of engineering, Jeremy Stanley, who asked for help making machine learning more accessible to data scientists. Together, they began to explore how to integrate machine learning models directly into PostgreSQL, leading to the creation of PostgresML. This practical application of machine learning in a database system demonstrates that simpler solutions are often more effective than the flashy, complex models that dominate the headlines.
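
      The claim that a plain linear model often suffices for relational predictions can be illustrated in a few lines of stdlib Python. The rows below are hypothetical (an invented items-per-order vs delivery-minutes table), standing in for a numeric column you might predict.

```python
# Minimal sketch: closed-form simple linear regression, pure Python.
# Illustrates that a plain linear model can predict a numeric column
# from tabular rows without any deep learning machinery.

def fit_linear(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical rows: (items_in_order, delivery_minutes)
rows = [(1, 12.0), (2, 14.1), (3, 16.0), (4, 17.9), (5, 20.1)]
slope, intercept = fit_linear([r[0] for r in rows], [r[1] for r in rows])
predicted = slope * 6 + intercept  # predict minutes for a 6-item order
```

A real deployment would of course use more features and a library implementation, but the shape of the problem (tabular in, one column out) is the same.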

    • Managing machine learning models at Instacart: complex challenges and solutions
      Instacart overcame complexities in managing machine learning models through real-time data processing using Elasticsearch, despite dealing with large and rapidly growing clusters and unique business constraints.

      Building and productionizing machine learning models at Instacart was a complex process involving dependency management, real-time data processing, and scaling large Elasticsearch clusters. Initially, Instacart built various tools and libraries in-house to handle these challenges, but as the ecosystem evolved, they found that existing solutions were better suited for their needs. The heart of Instacart's data architecture became Elasticsearch, with all data, including machine learning feature data, being stored and accessed in real-time. However, this led to large and rapidly growing Elasticsearch clusters, as Instacart dealt with millions of products and thousands of stores, each with unique data. The real-time nature of Instacart's business added to the complexity, as rapid and online responses were required to ensure customer satisfaction. Despite these challenges, Instacart's unique constraints drove the need for advanced data architecture solutions, making it a fascinating and complex problem in both technology and business domains.

    • Discovering the importance of performing joins at read time instead of index time
      The Instacart team learned that performing joins at read time, rather than at index time, eliminates a large amount of unnecessary indexing work and improves system performance.

      Scaling Elasticsearch for Instacart's growing business presented challenges due to incremental update penalties and tight time constraints. The team discovered that many documents were being joined and indexed unnecessarily, creating a significant amount of work in the system, and realized that performing joins at read time rather than at index time could eliminate a large portion of it. They considered PostgreSQL, which offers full-text search and can be sharded, as an alternative. After building a prototype, they found that most of the work was happening at the application layer, in complex joins across the data stores of various microservices. The team also uncovered several bugs in their data pipeline and in the PostgreSQL implementation. They continued to work on the project until the pandemic hit, at which point they had to prioritize more urgent issues. Overall, the experience taught the team the importance of understanding the nature of their system and exploring alternative solutions to scaling challenges.
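
      The index-time vs read-time trade-off can be sketched with SQLite standing in for Postgres (the schema and numbers are illustrative, not Instacart's actual ones): denormalizing at index time means one price change rewrites every pre-joined document, while a normalized schema absorbs it as a single-row update that a read-time join picks up immediately.

```python
import sqlite3

# Normalized tables: a price change is one UPDATE, and the join is
# performed at read time, so no pre-joined documents need re-indexing.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE store_prices (product_id INTEGER, store_id INTEGER, price REAL);
""")
db.execute("INSERT INTO products VALUES (1, 'milk')")
db.executemany("INSERT INTO store_prices VALUES (1, ?, ?)",
               [(s, 3.50) for s in range(1000)])  # 1000 stores carry milk

# One row updated; with index-time joins this would have meant
# rebuilding one joined document per store.
db.execute("UPDATE store_prices SET price = 3.75 "
           "WHERE product_id = 1 AND store_id = 7")

# The read-time join returns the fresh value:
row = db.execute("""
    SELECT p.name, sp.price
    FROM products p JOIN store_prices sp ON sp.product_id = p.id
    WHERE sp.store_id = 7
""").fetchone()
```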

    • Instacart's Database Challenges During the Pandemic Surge
      Instacart learned the importance of horizontally scalable databases and quick adaptation to unexpected traffic patterns during the pandemic, but faced challenges integrating deep learning models due to compatibility issues.

      During the early stages of the pandemic, Instacart faced a sudden surge in traffic that put their primary database, Elasticsearch, under immense load and risked missed SLAs. To mitigate this, they quickly shifted the load to a secondary database, Postgres, which initially caused high CPU usage, and even a fire, but eventually allowed them to serve traffic again. This incident highlighted the importance of having a horizontally scalable database system and the ability to adapt quickly to unexpected traffic patterns. The team learned that their other databases, including Redis and Cassandra, were also struggling with the growth and lacked horizontal scalability. They resolved to identify the database with the highest CPU usage and move its data to their new horizontally scalable Postgres cluster. Despite some challenges and missed optimizations, they managed to keep up with rapid week-over-week growth for several weeks. However, they were unable to fully integrate deep learning models into the system due to compatibility issues with the MADlib library.

    • Dealing with complex distributed systems
      Transitioning from Elasticsearch to PostgreSQL on raw SSDs led to significant performance improvements, highlighting the value of simplifying complex systems and leveraging advances in technology.

      The complexity of large-scale systems can lead to inefficiencies and lengthy development processes. Montana's experience at Instacart involved dealing with a distributed system using Elasticsearch, which required numerous teams and resources to function. However, the process was problematic and time-consuming, leading Montana to appreciate the benefits of a more centralized system, like PostgreSQL. Lev also shared his experience of transitioning from Elasticsearch to PostgreSQL, driven by the need for faster workloads and the limitations of network disks in managed database services like RDS. The adoption of PostgreSQL on raw SSDs resulted in significant improvements in performance, allowing for larger tables to be scanned quickly. Both Montana and Lev's experiences demonstrate the importance of simplifying complex systems and leveraging advancements in technology to enhance performance and efficiency.

    • Integrating Machine Learning into Databases like PostgreSQL
      By integrating machine learning directly into the database, businesses can reduce latency and complexity in the system and improve application performance and efficiency.

      For many business applications of machine learning, the data is primarily relational and the goal is to predict a single column. Instead of dealing with complex data architectures and expensive operations at the application layer, the speakers suggest keeping things in the data layer using PostgreSQL. They found that simple models like linear regression and XGBoost often sufficed, and that most of the system's latency comes from extracting data from the database, serializing it, and sending it to the application layer for processing. The speakers were inspired by the idea of cutting out this complexity and latency, and Lev even managed to implement deep learning in PostgreSQL while Montana was on vacation. This discussion highlights the potential benefits of integrating machine learning directly into databases like PostgreSQL, which could lead to faster and more efficient machine learning applications.

    • PostgresML: Bringing Machine Learning to PostgreSQL
      PostgresML is an emerging project that integrates machine learning capabilities into PostgreSQL, with challenges like sharding and load balancing addressed through PgCat.

      PostgresML is an emerging technology that aims to bring machine learning capabilities directly into the PostgreSQL database. Its creators were initially exploring the idea of a database architecture without the need for ETL or ELT, but they soon realized there were other challenges to address, such as sharding and load balancing. While Postgres itself doesn't have built-in sharding capabilities, they developed a solution called PgCat, a PostgreSQL connection pooler that handles sharding, load balancing, and failover at the infrastructure layer, allowing clients to access data seamlessly. PostgresML is still in its early stages, and there are ongoing debates about the benefits and drawbacks of performing machine learning operations directly on the primary data store. Despite the challenges, the potential simplicity and expertise that come with using a single data store have led the creators to pursue PostgresML and PgCat together as a full-time company venture.

    • Perform machine learning tasks directly within PostgreSQL using Python libraries
      PostgresML lets businesses integrate ML into their workflows by defining SQL or PL/Python functions that call out to popular libraries, with automatic hyperparameter search to deploy the best model.

      PostgresML brings machine learning (ML) directly into PostgreSQL. This beta project allows users to perform ML tasks within the database, using their favorite Python libraries. The experience involves defining SQL or PL/Python functions that call out to libraries like scikit-learn or XGBoost, making training a simple function call. Businesses can treat ML as a black box, focusing on the inputs and outputs, while PostgreSQL handles the data curation and feature engineering. A key feature is automatic hyperparameter search, which tests different configurations and deploys the best model. The data manipulation capabilities of PostgreSQL make it ideal for cleaning and curating data, a crucial aspect of data science work. A simple workflow involves selecting data, defining and training a model using the provided functions, and then deploying the best model directly into the database. This approach allows businesses to leverage the power of ML without needing to understand the underlying algorithms or math.
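
      The train-search-deploy loop described above can be sketched in plain Python. The toy model and its single hyperparameter below are stand-ins of my own, not PostgresML's internals; the point is only the shape of the loop, where each candidate configuration is trained, scored on held-out data, and the best one is "deployed".

```python
# Hedged sketch of automatic hyperparameter search: try each
# configuration, score it on a holdout set, keep the winner.

def train(data, smoothing):
    """Toy model: exponentially smoothed running mean, with one
    hyperparameter standing in for a real algorithm's knobs."""
    est = data[0]
    for x in data[1:]:
        est = smoothing * x + (1 - smoothing) * est
    return est

def validation_error(model_value, holdout):
    """Mean squared error of a constant prediction on held-out data."""
    return sum((x - model_value) ** 2 for x in holdout) / len(holdout)

train_split = [10.0, 12.0, 11.0, 13.0, 12.5]
holdout = [12.0, 12.4, 12.2]

# "Hyperparameter search": evaluate every candidate, deploy the best.
deployed_smoothing, deployed_error = min(
    ((s, validation_error(train(train_split, s), holdout))
     for s in (0.1, 0.3, 0.5, 0.9)),
    key=lambda pair: pair[1],
)
```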

    • Perform machine learning tasks directly in PostgreSQL with pgml
      PostgresML enables data scientists to train and use machine learning models within PostgreSQL, simplifying workflows and ensuring data security and consistency.

      PostgresML (pgml) allows data scientists to perform machine learning tasks directly within the PostgreSQL database, making it easier to store, manage, and use trained models for making predictions. This is achieved by creating a view of the training data, keeping it append-only so that later changes can't corrupt what the model was trained on, and then using that view to train models with various algorithms. Once trained, the model is stored in the database and can be used for making predictions through the pgml.predict function. For those new to machine learning, PostgresML offers a user-friendly dashboard with a wizard that guides users through selecting an algorithm and training it on their data. The delta between what one knows in the PostgreSQL world and what it takes to be productive with PostgresML is relatively small, especially with the dashboard's assistance. Currently, PostgresML supports supervised learning, which can be either classification or regression; the choice between the two depends on whether you're trying to predict a class or a continuous value. Overall, PostgresML simplifies the process of integrating machine learning into existing PostgreSQL workflows, making it a powerful tool for data scientists and developers alike.
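
      The classification-vs-regression distinction comes down to the type of the label column. A toy 1-nearest-neighbour "model" (illustrative only, with invented data; PostgresML trains real algorithms inside the database) makes the difference concrete: the same lookup predicts a class when the label is categorical and a number when it is continuous.

```python
# Sketch: one feature, two kinds of label. Classification predicts a
# class; regression predicts a continuous value.

def nearest(train_rows, x):
    """Return the training row whose feature is closest to x."""
    return min(train_rows, key=lambda row: abs(row[0] - x))

# (feature, class label) vs (feature, continuous label):
classify_rows = [(1.0, "produce"), (5.0, "dairy"), (9.0, "frozen")]
regress_rows = [(1.0, 2.5), (5.0, 7.1), (9.0, 12.0)]

predicted_class = nearest(classify_rows, 4.2)[1]  # classification
predicted_value = nearest(regress_rows, 8.0)[1]   # regression
```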

    • PostgreSQL's machine learning capabilities simplify data manipulation and model selection
      PostgreSQL's machine learning features enable users to clean, transform, and prepare data, and to choose the best model without needing to understand complex differences; vector operations support the data manipulations and transformations these processes require.

      PostgreSQL's machine learning capabilities allow users to evaluate various algorithms and choose the best one without needing to understand the underlying differences. The platform includes vector operations, which are essential for many data manipulations and transformations commonly used in machine learning processes. These operations enable users to clean, transform, and prepare their data before feeding it into the chosen model. However, there is still room for improvement, and feedback from data scientists and machine learning engineers is encouraged to expand the functionality and support additional feature engineering tasks. Overall, PostgreSQL's machine learning features aim to simplify the process of data manipulation and model selection for users, allowing them to focus on gaining valuable insights from their data.
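
      One common example of such a transformation is standardising a numeric feature column before training. The sketch below is plain Python with invented prices; inside the database, the analogous elementwise vector operation would run as a SQL-callable function.

```python
# Sketch: z-score standardisation, an elementwise vector operation
# used to prepare a feature column (zero mean, unit variance).

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n
    std = variance ** 0.5
    return [(x - mean) / std for x in column]

prices = [2.0, 4.0, 6.0, 8.0]  # hypothetical raw feature values
scaled = standardize(prices)
```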

    • Bringing simplicity to machine learning with PostgresML
      PostgresML simplifies machine learning deployment by abstracting and streamlining workflows, allowing smaller teams to focus on machine learning instead of complex deployments.

      PostgreSQL and machine learning are coming together to simplify workflows and make machine learning more accessible to smaller teams. Montana Low, a guest on the Practical AI podcast, shared his excitement about this development, which he believes will bring simplicity and better ergonomics to machine learning deployment. At Instacart, Montana noted, it was unrealistic to expect one person to have all the necessary skills across data engineering, data science, machine learning engineering, infrastructure operations, and software engineering: the ML deployment checklist was long, and covering all the bases required a team. Montana expressed frustration at dealing with complicated Python code and PhD-level mathematicians who didn't know how to launch their models into production, and wished it were as simple as running a query and deploying everything immediately. This is where PostgresML comes in. By abstracting and simplifying much of the work, it lets smaller teams get back into production at a high level of quality. Montana is motivated by making people's lives easier and allowing machine learning engineers to focus on machine learning instead of figuring out how to deploy their models. The impact of this development could make a significant difference for many people in the field.

    Recent Episodes from Practical AI: Machine Learning, Data Science

    Stanford's AI Index Report 2024

    We’ve had representatives from Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) on the show in the past, but we were super excited to talk through their 2024 AI Index Report after such a crazy year in AI! Nestor from HAI joins us in this episode to talk about some of the main takeaways, including that AI makes workers more productive, that the US is sharply increasing regulation, and that industry continues to dominate frontier AI research.

    Apple Intelligence & Advanced RAG

    Daniel & Chris engage in an impromptu discussion of the state of AI in the enterprise. Then they dive into the recent Apple Intelligence announcement to explore its implications. Finally, Daniel leads a deep dive into a new topic - Advanced RAG - covering everything you need to know to be practical & productive.

    The perplexities of information retrieval

    Daniel & Chris sit down with Denis Yarats, Co-founder & CTO at Perplexity, to discuss Perplexity’s sophisticated AI-driven answer engine. Denis outlines some of the deficiencies in search engines, and how Perplexity’s approach to information retrieval improves on traditional search engine systems, with a focus on accuracy and validation of the information provided.

    Using edge models to find sensitive data

    We’ve all heard about breaches of privacy and leaks of private health information (PHI). For healthcare providers and those storing this data, knowing where all the sensitive data is stored is non-trivial. Ramin, from Tausight, joins us to discuss how they deploy edge AI models to help companies search through billions of records for PHI.

    Rise of the AI PC & local LLMs

    We’ve seen a rise in interest recently and a number of major announcements related to local LLMs and AI PCs. NVIDIA, Apple, and Intel are getting into this along with models like the Phi family from Microsoft. In this episode, we dig into local AI tooling, frameworks, and optimizations to help you navigate this AI niche, and we talk about how this might impact AI adoption in the longer term.

    AI in the U.S. Congress

    At the age of 72, U.S. Representative Don Beyer of Virginia enrolled at GMU to pursue a Master’s degree in C.S. with a concentration in Machine Learning. Rep. Beyer is Vice Chair of the bipartisan Artificial Intelligence Caucus & Vice Chair of the NDC’s AI Working Group. He is the author of the AI Foundation Model Transparency Act & a lead cosponsor of the CREATE AI Act, the Federal Artificial Intelligence Risk Management Act & the Artificial Intelligence Environmental Impacts Act. We hope you tune into this inspiring, nonpartisan conversation with Rep. Beyer about his decision to dive into the deep end of the AI pool & his leadership in bringing that expertise to Capitol Hill.

    Full-stack approach for effective AI agents

    There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach; from hardware innovations through to user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.

    Private, open source chat UIs

    We recently gathered some Practical AI listeners for a live webinar with Danny from LibreChat to discuss the future of private, open source chat UIs. During the discussion we hear about the motivations behind LibreChat, why enterprise users are hosting their own chat UIs, and how Danny (and the LibreChat community) is creating amazing features (like RAG and plugins).

    Related Episodes

    When data leakage turns into a flood of trouble

    Rajiv Shah teaches Daniel and Chris about data leakage, and its major impact upon machine learning models. It’s the kind of topic that we don’t often think about, but which can ruin our results. Raj discusses how to use activation maps and image embedding to find leakage, so that leaking information in our test set does not find its way into our training set.

    Stable Diffusion (Practical AI #193)

    The new stable diffusion model is everywhere! Of course you can use this model to quickly and easily create amazing, dream-like images to post on twitter, reddit, discord, etc., but this technology is also poised to be used in very pragmatic ways across industry. In this episode, Chris and Daniel take a deep dive into all things stable diffusion. They discuss the motivations for the work, the model architecture, and the differences between this model and other related releases (e.g., DALL·E 2). (Image from stability.ai)

    AlphaFold is revolutionizing biology

    AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment, and is accelerating research in nearly every field of biology. Daniel and Chris delve into protein folding, and explore the implications of this revolutionary and hugely impactful application of AI.

    Zero-shot multitask learning (Practical AI #158)

    In this Fully-Connected episode, Daniel and Chris ponder whether in-person AI conferences are on the verge of making a post-pandemic comeback. Then on to BigScience from Hugging Face, a year-long research workshop on large multilingual models and datasets. Specifically they dive into the T0, a series of natural language processing (NLP) AI models specifically trained for researching zero-shot multitask learning. Daniel provides a brief tour of the possible with the T0 family. They finish up with a couple of new learning resources.