    Podcast Summary

    • The Advent of Large Language Models Brings New Challenges for AI Development Teams: LLMs introduce new challenges, including the need for new workflows, collaboration between non-technical and technical team members, versioning and managing prompts, and subjective performance measurement.

      The rapid advancement of Large Language Models (LLMs) is revolutionizing how we approach AI development, but it also introduces new challenges. Raza Habib, CEO and co-founder of HumanLoop, shared his journey into this field, having worked on it for years with his co-founders. They initially helped companies fine-tune smaller models but became convinced that the rate of progress of larger models would soon surpass everything else in performance and usability. This new way of customizing AI models, however, comes with its own set of challenges. Instead of hand-annotating datasets, users now write natural-language instructions for the models to follow. This shift requires new workflows and collaboration between engineers and non-technical team members, such as product managers. The prompts, which affect the end application much as code does, need to be versioned, managed, and treated with the same rigor as code. Moreover, measuring performance with LLMs is subjective, as there are no clear-cut metrics or ground-truth datasets as in traditional coding or machine learning. In summary, the advent of LLMs brings exciting opportunities but also new challenges, requiring a shift in mindset and workflows for AI development teams.
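
      If prompts deserve the same rigor as code, the first practical step is making every prompt version identifiable and diffable. Below is a minimal sketch in Python of that idea; the PromptVersion structure, field names, and example values are our illustration, not HumanLoop's actual format.

      ```python
      import hashlib
      import json
      from dataclasses import dataclass, asdict

      @dataclass(frozen=True)
      class PromptVersion:
          """One immutable version of a prompt, identified by a content hash."""
          name: str          # e.g. "support-triage" (hypothetical project name)
          template: str      # natural-language instructions, with {placeholders}
          model: str         # the model this prompt was written against
          temperature: float

          @property
          def version_id(self) -> str:
              # Hash the full content so any edit yields a new, traceable version.
              payload = json.dumps(asdict(self), sort_keys=True).encode()
              return hashlib.sha256(payload).hexdigest()[:12]

      v1 = PromptVersion(
          name="support-triage",
          template="Classify the message as billing, bug, or other.\n\n{message}",
          model="gpt-4o",
          temperature=0.0,
      )
      print(v1.name, v1.version_id)  # a stable id you can log, diff, and roll back to
      ```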

    • Managing and evaluating large language models presents unique challenges: Specialized tools like HumanLoop offer a collaborative, interactive development environment for managing prompts and evaluating model performance, bridging the gap between technical and non-technical teams and leading to better model performance and improved productivity.

      Managing and evaluating generative AI models, especially large language models (LLMs), presents unique challenges. As models become more complex, measuring performance and managing the versioning of prompts become harder. Companies often start with simple prototypes built on publicly available models but soon run into issues with collaboration, versioning, and quality management. These challenges have led to the development of specialized tools like HumanLoop, which provides a collaborative, interactive development environment for managing prompts and evaluating model performance. This allows a more seamless transition from development to production and bridges the gap between technical and non-technical teams, enabling domain experts and data scientists to work together more effectively, which in turn leads to better model performance and higher overall productivity.

    • Collaboration between non-technical and technical teams in AI development: Product managers and domain experts collaborate more closely with engineers to implement AI applications, with product teams increasingly leading the development process, potentially broadening participation and reshaping the economics of Internet services.

      The collaboration between non-technical and technical teams is changing with the advent of large language models (LLMs). Traditionally, product managers (PMs) and subject matter experts distill problems and produce specifications, and engineers then implement them. The ability to program in natural language, however, makes it possible for PMs and experts to be directly involved in implementation: the product manager contributes domain expertise while the engineer builds the bulk of the application. At the same time, dedicated machine learning or AI experts are involved less often, with product teams increasingly leading development. Some organizations even have linguists write the prompts and bar engineers from editing them. The dynamic resembles tools like Figma, where multiple stakeholders come together to iterate on a design. This shift is exciting because it expands who can participate in building AI applications, potentially changing the economics of Internet services and opening their creation to a much broader range of people.

    • Collaborative platform for developing, evaluating, and improving language models and AI applications: HumanLoop gives domain experts and technical teams a user-friendly interface to test, compare, save, and deploy preferred prompts and models, supporting multiple evaluation methods and capturing end-user feedback for continuous improvement.

      The HumanLoop platform is a collaborative environment designed to facilitate the development, evaluation, and improvement of LLM applications. It offers a user-friendly interface where domain experts and technical teams can interactively test and compare different prompts and models, save preferred versions, and deploy them to various stages of production. HumanLoop also supports multiple evaluation methods, including traditional metrics as well as model-as-judge and human evaluation for more subjective assessments. This is particularly valuable for AI applications where correct answers are not always definitive, as the platform captures end-user feedback and implicit signals that can be used for debugging and fine-tuning the model over time. Ultimately, HumanLoop helps developers build and refine AI applications that better meet the needs of their users.
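
      Model-as-judge evaluation, mentioned above, simply means asking a strong model to grade another model's output against a rubric. Here is a rough sketch using the OpenAI Python client; the rubric, the 1-5 scale, and the judge function are our own illustration, not HumanLoop's API.

      ```python
      from openai import OpenAI

      client = OpenAI()  # assumes an API key in the environment

      def judge(question: str, answer: str) -> int:
          """Return a 1-5 quality score for `answer` (rubric is our invention)."""
          rubric = (
              "Rate the assistant's answer from 1 (unusable) to 5 (excellent) for "
              "factual accuracy and helpfulness. Reply with the digit only.\n\n"
              f"Question: {question}\nAnswer: {answer}"
          )
          response = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": rubric}],
              temperature=0,
          )
          # In practice you would validate the reply; models occasionally add text.
          return int(response.choices[0].message.content.strip())

      score = judge(
          "What is retrieval augmented generation?",
          "It combines a document retriever with a text generator.",
      )
      ```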

    • Understanding Prompt Engineering vs. Fine-Tuning: Prompt engineering and fine-tuning are distinct concepts in AI. Prompt engineering alters the instructions and data context, while fine-tuning adjusts the model's weights. Prompt engineering is faster and more common, but fine-tuning helps with latency, cost, and enforcing specific output constraints.

      While the terms "prompt engineering" and "fine-tuning" are often used interchangeably, they have distinct meanings. Prompt engineering means changing the instructions given to a language model or the data context it works with; fine-tuning means training the model further by adjusting its weights on new data. Many teams new to AI assume they will be fine-tuning their models when, in reality, they are only adjusting prompts or data, which can lead to confusion about the capabilities and limitations of their systems. Prompt engineering is typically the first optimization step because it is fast and easy, yet it can have a significant impact on output quality and data requirements. Fine-tuning, on the other hand, is useful for improving latency, reducing cost, or enforcing specific output constraints. There has been less fine-tuning in the industry than anticipated, perhaps because prompt engineering proved surprisingly powerful and the focus shifted toward getting factual context into models. As more teams adopt AI and tooling matures, fine-tuning may increase. In essence, understanding the difference between the two is crucial for optimizing AI systems effectively and avoiding misconceptions.
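
      To make the distinction concrete, here is a hedged sketch using the OpenAI Python client: the first call only edits instructions (prompt engineering), while the second starts a weight-updating fine-tuning job. The file name and model ids are examples; check your provider's current documentation.

      ```python
      from openai import OpenAI

      client = OpenAI()

      # Prompt engineering: edit the instructions, call the same model.
      # No weights change, and you can iterate in seconds.
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[
              {"role": "system", "content": "Answer in exactly one sentence."},
              {"role": "user", "content": "Why is the sky blue?"},
          ],
      )

      # Fine-tuning: upload labeled examples (JSONL of chat transcripts) and
      # train a new model variant whose weights are actually adjusted.
      training_file = client.files.create(
          file=open("examples.jsonl", "rb"),  # hypothetical training data
          purpose="fine-tune",
      )
      job = client.fine_tuning.jobs.create(
          training_file=training_file.id,
          model="gpt-4o-mini-2024-07-18",  # a fine-tunable base; availability varies
      )
      # The result is a new model id, useful for latency, cost, or output
      # constraints, per the discussion above.
      ```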

    • Combining Information Retrieval with Generation for Company-Specific Data: Retrieval Augmented Generation (RAG) improves LLMs' access to private or up-to-date company information by combining information retrieval with generation. Fine-tuning on company-specific data remains an option, but RAG is usually the more efficient and effective way to adapt a model to a specific use case.

      While large language models (LLMs) are powerful tools for generating information, they have limitations when it comes to private or up-to-date information about specific companies. A solution is Retrieval Augmented Generation (RAG), which combines information retrieval with generation. Fine-tuning LLMs on company data is still an option, but it is mainly used to optimize cost, latency, and tone of voice rather than to adapt the model to a specific use case, and it is more complex and time-consuming than prompt engineering. HumanLoop's decision to support both closed proprietary models and open models is driven by customer demand, since many organizations use a mix of models across use cases, including ones with privacy or latency constraints. The performance gap between the best closed and open models is closing, but there are still economic considerations in hosting and using open models, particularly at high data volumes or with real-time requirements. Open source models are especially useful for companies that cannot or will not send data to third parties, and for edge deployments with low-latency requirements. Vanna.AI, a Python RAG framework, is an example of a tool in this space: it lets users chat with relational databases by using RAG to generate accurate SQL queries, and it works with any LLM.
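
      The RAG pattern described here is easy to see in a few lines: embed the question, retrieve the most relevant company document, and hand it to the model as context. This is a minimal sketch with a toy in-memory index; "Acme" and its documents are invented, and a real system would use a vector database and document chunking.

      ```python
      import numpy as np
      from openai import OpenAI

      client = OpenAI()

      # Toy "knowledge base" of private company data the base model has never seen.
      docs = [
          "Acme's refund window is 30 days from delivery.",
          "Acme support is available 9am-5pm GMT on weekdays.",
      ]

      def embed(texts: list[str]) -> np.ndarray:
          out = client.embeddings.create(model="text-embedding-3-small", input=texts)
          return np.array([d.embedding for d in out.data])

      doc_vecs = embed(docs)

      def answer(question: str) -> str:
          q = embed([question])[0]
          # Retrieve: cosine similarity against every stored document.
          scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
          context = docs[int(scores.argmax())]
          # Generate: the model answers from retrieved data, not from its weights.
          chat = client.chat.completions.create(
              model="gpt-4o",
              messages=[{
                  "role": "user",
                  "content": f"Context: {context}\n\nQuestion: {question}",
              }],
          )
          return chat.choices[0].message.content

      print(answer("How long do I have to return an order?"))
      ```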

    • Collaborative System for Domain Experts and Engineers: HumanLoop lets domain experts and engineers work together, producing question answering systems that are both technically sound and domain-appropriate through iteration, evaluation, and collaboration.

      HumanLoop enables domain experts and engineers to build and refine complex question answering systems together. For a domain expert, the process begins with trying out different models and prompts in a playground environment, iterating until answers are factually correct and appropriate. Once satisfied, they move on to rigorous evaluation and, if necessary, debugging. Engineers, for their part, are responsible for integrating data sources, setting up evaluation, and orchestrating the code; they also log inputs, outputs, and user feedback to HumanLoop for analysis. By combining the expertise of both groups, this workflow ensures that question answering systems are technically sound and aligned with the specific requirements of the domain.
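
      The engineer-side logging described above can be pictured as follows. We are not reproducing HumanLoop's SDK; log_generation, log_feedback, and the record fields are our illustration of what such instrumentation typically captures.

      ```python
      import time
      import uuid

      LOG: list[dict] = []  # stand-in for sending records to an observability backend

      def log_generation(prompt_version: str, inputs: dict, output: str) -> str:
          """Record one model call, tied back to the exact prompt version used."""
          record_id = str(uuid.uuid4())
          LOG.append({
              "id": record_id,
              "ts": time.time(),
              "prompt_version": prompt_version,
              "inputs": inputs,
              "output": output,
              "feedback": None,  # filled in later from end-user signals
          })
          return record_id

      def log_feedback(record_id: str, signal: str) -> None:
          """Attach explicit or implicit end-user feedback to an earlier record."""
          for rec in LOG:
              if rec["id"] == record_id:
                  rec["feedback"] = signal  # e.g. "thumbs_up", "copied", "edited"

      rid = log_generation("qa-prompt@v7", {"question": "Where is my refund?"}, "...")
      log_feedback(rid, "thumbs_up")
      ```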

    • Evaluating Large Language Models for Application Development: Effective evaluation is essential for maintaining system stability and functionality when adopting new LLM models or upgrades. It involves testing model output on a diverse set of examples, logging feedback, and refining prompts to ensure the desired performance.

      Effective evaluation is crucial in the development of LLM applications. It helps prevent regressions, ensuring that improvements in one part of the system do not break previously working behavior, and it makes model upgrades manageable by letting developers assess a new model's impact on existing use cases. A typical LLM application framework includes data sources, logging of calls within the application, and a decision-making phase in which domain experts test and refine prompts. Those prompts can then be integrated into the code or application, giving domain experts control over the prompting process; feedback from the running system is logged, prompts are iterated on, and the revised prompts are pushed back into production. The components of an LLM application - a base model, a prompt template, a data collection strategy, and tools for data retrieval - each carry significant design choices, producing a combinatorially large space of decisions. Evaluation is what keeps the system stable and effective as these choices change: by testing model output on a wide range of examples, developers can spot unintended consequences or behavioral differences between the old and new models, maintaining the desired performance and functionality of the application.
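
      A regression check for a model upgrade can be as simple as running the old and new models over the same fixed example set and comparing pass rates. In this sketch, run_model and the pass criterion are placeholders for your own generation call and whatever check fits the application (exact match, judge score, and so on).

      ```python
      EXAMPLES = [
          {"input": "What is 2 + 2?", "expected": "4"},
          {"input": "Capital of France?", "expected": "Paris"},
      ]

      def run_model(model: str, text: str) -> str:
          raise NotImplementedError("call your model provider here")

      def passes(output: str, expected: str) -> bool:
          # Simplest possible check; swap in a judge model for subjective tasks.
          return expected.lower() in output.lower()

      def pass_rate(model: str) -> float:
          results = [passes(run_model(model, ex["input"]), ex["expected"])
                     for ex in EXAMPLES]
          return sum(results) / len(results)

      def safe_to_upgrade(old: str, new: str, tolerance: float = 0.02) -> bool:
          # Block the upgrade if the candidate regresses beyond the tolerance.
          return pass_rate(new) >= pass_rate(old) - tolerance
      ```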

    • Effective machine learning model evaluation stages: Fast feedback during the interactive stage, regression testing when making changes, and monitoring in production all improve model evaluation, while collaboration systems streamline communication and workflow between teams.

      Effective evaluation of machine learning models involves different stages of development, each with its own testing methods. During the interactive stage, fast feedback matters most, so eyeballing examples and adversarial testing are helpful. Regression testing is crucial when making changes, ensuring performance doesn't deteriorate. Finally, monitoring is essential in production, alerting teams when performance drops below a threshold. Collaboration systems, similar to those developed for code version control, can significantly improve this process by enabling seamless communication and workflow between domain experts and technical staff. One surprising example was a publicly listed company that ran into performance differences caused by whitespace discrepancies introduced while sharing prompts over Teams; another is the growing complexity of building assistant agents on top of existing software. Such collaboration systems don't necessarily enable new behaviors, but they make existing workflows faster and less error-prone, saving both mistakes and time.
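
      The production-monitoring stage mentioned above often amounts to a rolling window of quality scores with an alert threshold. This sketch shows that shape; the window size, threshold, and alert hook are placeholders to tune per application.

      ```python
      from collections import deque

      class QualityMonitor:
          """Rolling average of quality scores with a simple alert threshold."""

          def __init__(self, window: int = 100, threshold: float = 0.8):
              self.scores: deque[float] = deque(maxlen=window)
              self.threshold = threshold

          def record(self, score: float) -> None:
              # Scores could come from user feedback or a model-as-judge evaluator.
              self.scores.append(score)
              if len(self.scores) == self.scores.maxlen and self.average() < self.threshold:
                  self.alert()

          def average(self) -> float:
              return sum(self.scores) / len(self.scores)

          def alert(self) -> None:
              # Placeholder: page the team, post to Slack, open an incident, etc.
              print(f"quality dropped to {self.average():.2f} (threshold {self.threshold})")

      monitor = QualityMonitor(window=50, threshold=0.75)
      monitor.record(0.9)
      ```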

    • The Future of AI: Human-AI Collaboration. AI models still need human intervention for complex use cases. Companies automate workflows with sophisticated tooling, but a lack of proper tooling can lead to costly mistakes. HumanLoop aims to become more proactive in suggesting improvements and reducing costs, signaling a future where human expertise and AI capabilities intersect.

      While AI models have advanced significantly, they still require human intervention to function effectively, especially in complex use cases. Companies like Ironclad have automated workflows by building sophisticated tooling, such as Rivet, to leverage existing infrastructure and improve accuracy. Without proper tooling, however, costly mistakes can occur, such as running duplicate annotation jobs. Looking ahead, there is excitement about the potential of agent use cases and multimodal models in production. HumanLoop aims to become more proactive, suggesting improvements to applications and potentially reducing costs. The future of AI lies at the intersection of human expertise and AI capabilities, with systems that can act autonomously and adapt to new situations.

    Recent Episodes from Practical AI: Machine Learning, Data Science

    Apple Intelligence & Advanced RAG
    Daniel & Chris engage in an impromptu discussion of the state of AI in the enterprise. Then they dive into the recent Apple Intelligence announcement to explore its implications. Finally, Daniel leads a deep dive into a new topic - Advanced RAG - covering everything you need to know to be practical & productive.

    The perplexities of information retrieval
    Daniel & Chris sit down with Denis Yarats, Co-founder & CTO at Perplexity, to discuss Perplexity’s sophisticated AI-driven answer engine. Denis outlines some of the deficiencies in search engines, and how Perplexity’s approach to information retrieval improves on traditional search engine systems, with a focus on accuracy and validation of the information provided.

    Using edge models to find sensitive data
    We’ve all heard about breaches of privacy and leaks of private health information (PHI). For healthcare providers and those storing this data, knowing where all the sensitive data is stored is non-trivial. Ramin, from Tausight, joins us to discuss how they have deployed edge AI models to help companies search through billions of records for PHI.

    Rise of the AI PC & local LLMs
    We’ve seen a rise in interest recently and a number of major announcements related to local LLMs and AI PCs. NVIDIA, Apple, and Intel are getting into this along with models like the Phi family from Microsoft. In this episode, we dig into local AI tooling, frameworks, and optimizations to help you navigate this AI niche, and we talk about how this might impact AI adoption in the longer term.

    AI in the U.S. Congress
    At the age of 72, U.S. Representative Don Beyer of Virginia enrolled at GMU to pursue a Master’s degree in C.S. with a concentration in Machine Learning. Rep. Beyer is Vice Chair of the bipartisan Artificial Intelligence Caucus & Vice Chair of the NDC’s AI Working Group. He is the author of the AI Foundation Model Transparency Act & a lead cosponsor of the CREATE AI Act, the Federal Artificial Intelligence Risk Management Act & the Artificial Intelligence Environmental Impacts Act. We hope you tune into this inspiring, nonpartisan conversation with Rep. Beyer about his decision to dive into the deep end of the AI pool & his leadership in bringing that expertise to Capitol Hill.

    Full-stack approach for effective AI agents
    There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach, from hardware innovations through to user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.

    Private, open source chat UIs
    We recently gathered some Practical AI listeners for a live webinar with Danny from LibreChat to discuss the future of private, open source chat UIs. During the discussion we hear about the motivations behind LibreChat, why enterprise users are hosting their own chat UIs, and how Danny (and the LibreChat community) is creating amazing features (like RAG and plugins).

    Mamba & Jamba
    First there was Mamba… now there is Jamba from AI21. This is a model that combines the best non-transformer goodness of Mamba with good ol’ attention layers. The result is a highly performant and efficient model that AI21 has open sourced! We hear all about it (along with a variety of other LLM things) from AI21’s co-founder Yoav.

    Related Episodes

    When data leakage turns into a flood of trouble
    Rajiv Shah teaches Daniel and Chris about data leakage, and its major impact upon machine learning models. It’s the kind of topic that we don’t often think about, but which can ruin our results. Raj discusses how to use activation maps and image embedding to find leakage, so that leaking information in our test set does not find its way into our training set.

    Stable Diffusion (Practical AI #193)
    The new stable diffusion model is everywhere! Of course you can use this model to quickly and easily create amazing, dream-like images to post on twitter, reddit, discord, etc., but this technology is also poised to be used in very pragmatic ways across industry. In this episode, Chris and Daniel take a deep dive into all things stable diffusion. They discuss the motivations for the work, the model architecture, and the differences between this model and other related releases (e.g., DALL·E 2). (Image from stability.ai)

    AlphaFold is revolutionizing biology
    AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment, and is accelerating research in nearly every field of biology. Daniel and Chris delve into protein folding, and explore the implications of this revolutionary and hugely impactful application of AI.

    Zero-shot multitask learning (Practical AI #158)
    In this Fully-Connected episode, Daniel and Chris ponder whether in-person AI conferences are on the verge of making a post-pandemic comeback. Then it’s on to BigScience from Hugging Face, a year-long research workshop on large multilingual models and datasets. Specifically, they dive into T0, a series of natural language processing (NLP) models trained for researching zero-shot multitask learning. Daniel provides a brief tour of what’s possible with the T0 family. They finish up with a couple of new learning resources.