
    Podcast Summary

• Managing large volumes of data in data science: Automating workflows and utilizing project management platforms can help data scientists manage their time effectively and focus on higher-level priorities, increasing efficiency and productivity.

Data preparation is a significant challenge for data scientists because of the vast amount of data they must manage. This manual process often prevents them from working to their full potential, leading to inefficiency and wasted time. Mark Christiansen, the CEO of Zellex, shares his expertise in data labeling and custom data processes, having come from a background in managing healthcare data at scale. Zellex specializes in managing large volumes of data, specifically in dictation and transcription. In healthcare, they transcribed audio recordings from healthcare providers into completed clinical notes using skilled medical transcriptionists. A few years ago, they met with an NLP company owner who was impressed with their platform and saw a need for it in managing training data for NLP workflows. This encounter led Mark and his team to explore applying their platform in the data science field. Automation tools and project management platforms can help data scientists refocus their energies on higher-level priorities, allowing software to automate workflows and enabling other team members to manage a larger share of the work. This not only increases efficiency but also allows data scientists to make the most of their talents and skills.

• Healthcare's data security requirements impact AI data labeling: Healthcare companies prioritize data security, leading to investment in well-trained, long-term labelers for AI projects to maintain data quality.

The healthcare industry's strict data security and compliance requirements have significantly influenced the approach to data labeling for AI projects. This was evident in the development of Zellex's AI offering, where the company identified a strong fit because the existing healthcare workflow already emphasized data security and audit trails. Data labeling is a common challenge for data scientists, particularly for specialized use cases or new language modeling. Smaller projects may attempt to handle labeling in-house for quality control, but larger projects require a combination of in-house and external resources. Companies like Zellex have chosen to invest in training and compensating labelers to build long-term relationships and maintain consistent data quality, rather than commoditizing these roles. Some clients have attempted crowdsourcing for data labeling but have found it less effective, especially for more complex projects. The importance of data quality, and the difficulty of maintaining it, has driven an industry shift towards greater investment in labelers and longer-term relationships.

• Understanding the importance of well-trained data labelers: Clearly communicating project goals and use cases to data labelers improves data quality, reduces costs, and helps complete projects on time and on budget.

      The commoditization of data labeling workforce can lead to significant quality issues in data aggregation projects. Unclear instructions and varying motivations of labelers can result in inconsistent data, requiring extensive iteration and increasing costs. Training and upskilling data labelers is crucial for producing high-quality data. This involves sharing project descriptions and use cases with labelers to help them better understand the project's goals and increase their vested interest in the work. For instance, a project description might involve training a software application to assist call center agents and increase their efficiency. By ensuring labelers have a clear understanding of the project, companies can improve data quality, reduce costs, and ultimately, complete data aggregation projects on time and on budget.

• Automating call center interactions with NLP: The market is shifting towards off-the-shelf models for data labeling, but bespoke tuning is still needed for specialty applications like medical documentation, sentiment analysis, and business intelligence.

      Automating call center interactions using NLP-driven process automation is becoming increasingly important for businesses. This involves translating English language scripts into target languages to enable practical applications, giving translators and editors a clear understanding of their role in the project. The market is currently seeing a shift towards off-the-shelf models for data labeling and annotation, which are improving and often used as is or with in-house tuning. However, there are still specialty applications where unique vocabularies and highly customer-specific language require bespoke model tuning. Examples of such applications include medical documentation labeling, sentiment and intent projects, and gathering business intelligence from call center interactions. The landscape is rapidly changing, and businesses need to adapt to these advancements to remain efficient and competitive. From a process perspective, it's crucial to have a solid understanding of the data labeling and annotation landscape to effectively manage workflows and ensure the accuracy of off-the-shelf models or the success of bespoke projects.

• Managing data labeling challenges in the NLP industry: The NLP industry faces challenges in managing data labeling workflows, including the need for customized models and languages, high data volumes, and manual processes. Zellex addresses these issues by focusing on production processes, providing workflow platforms, and managing skilled labor.

The data labeling process in the NLP industry faces several challenges, particularly on the workflow management side. Data scientists spend significant time and resources on manual data labeling tasks that could be automated through workflow platforms. The industry is also seeing increased demand for customized models and labeling in multiple languages, but there is a lack of tooling to manage these hybrid labeling approaches in a cohesive way, resulting in manual aggregation and one-off coding to unify results. Another challenge is the sheer volume of data requiring labeling, which often leaves highly skilled, highly paid data scientists managing projects manually, an inefficient use of their time and talent. Zellex addresses these challenges by focusing on production processes, providing workflow platforms to move teams off of manual processes and spreadsheets, and managing skilled labor to meet deliverables on time and at expected quality levels.

• Effective communication and transparency in data science projects: Clear communication and transparency between stakeholders, enabled by simplified workflows and project monitoring tools, are vital for successful data science projects. Hiring and training in-house teams for new languages or project areas ensures quality and control.

      Effective communication and transparency are crucial for successful data science projects, especially when dealing with multiple stakeholders. Training data services companies, like the one discussed, contribute significantly by simplifying complex workflows and enabling stakeholders to monitor project progress. This visibility extends to various teams involved, including sales, operations, procurement, and quality assurance, allowing them to ensure projects remain on budget, on time, and meet quality standards. However, outsourcing projects to third-party vendors without proper planning and control can lead to poor quality data, delayed projects, and additional correction costs. Therefore, it's essential for service providers to invest in hiring and training their own teams when entering new languages or project areas to maintain quality, control, and ultimately, project success.

• Maximizing Data Labeling Project Success: Investing in workflow management and team dynamics upfront leads to higher-quality data and on-time deliverables, while using third-party vendors or a black-box approach can result in delays and issues that ultimately cost more.

Investing more time and money upfront in managing a data labeling project's workflow and team dynamics can lead to higher-quality data and on-time deliverables, despite initial cost sensitivities. Using third-party vendors and working in a black box can result in delays and issues that may ultimately cost more. Managing an online workforce for data labeling projects comes with inherent challenges, but a well-developed, robust workflow application can mitigate these issues through centralized controls and real-time visibility into the status of data objects and progress towards deliverables. When setting up QA workflows for data labeling, proper communication, clear instructions, and regular reviews help ensure accuracy and consistency. Zellex prioritizes these investments to deliver high-quality data and meet turnaround times for its clients.

• Measuring consistency and identifying improvement areas in data labeling: Effective data labeling involves establishing ground truth, measuring the distance between it and editor work, and using multi-level QA workflows to ensure consistency and improve processes.

Effective data labeling involves establishing a ground-truth version of each data object and measuring the distance between it and the work of editors to generate various metrics. This helps ensure consistency and identify areas for improvement. Additionally, a multi-level QA workflow automatically routes work, and an error script dynamically checks against known errors to recycle items through the workflow. Using multiple layers of judgments is also crucial. Looking ahead, Mark is excited about the potential of data labeling tools and expertise being applied to other major world languages and developing economies, where AI can significantly impact customer and employee experience data in unstructured formats. This could lead to numerous benefits and advancements across industries. The hosts thank Mark for sharing his insights on data labeling challenges and workflows at Zellex and look forward to continuing the conversation.
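The ground-truth comparison described above can be sketched as a simple agreement metric: compute an edit distance between a reference label sequence and an editor's labels, then normalize it into a score. The snippet below is a minimal illustration of that idea, not Zellex's actual implementation; the function names and label values are hypothetical.

```python
# Sketch: score an editor's label sequence against a ground-truth reference
# using token-level Levenshtein (edit) distance. Names are illustrative.

def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions
    needed to turn label sequence a into label sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def agreement(ground_truth, editor_labels):
    """1.0 means the editor matched the ground-truth version exactly."""
    dist = levenshtein(ground_truth, editor_labels)
    return 1.0 - dist / max(len(ground_truth), len(editor_labels), 1)

truth  = ["GREETING", "INTENT", "INTENT", "CLOSING"]
edited = ["GREETING", "INTENT", "SENTIMENT", "CLOSING"]
print(agreement(truth, edited))  # one substitution out of four labels -> 0.75
```

A QA workflow could track this score per editor over time and automatically route objects that fall below a threshold back through review, in the spirit of the multi-level workflow described above.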

• Sharing Knowledge and Supporting the Community: Pay it forward by sharing Practical AI with others to spread valuable AI knowledge and insights. Appreciate the partnerships and sponsors that make the show possible.

The key takeaway from this episode of Practical AI is the importance of sharing knowledge and supporting the community. The hosts expressed their gratitude to listeners for tuning in and encouraged those who have benefited from the show to pay it forward by sharing it with others. Word of mouth is the primary way new listeners discover podcasts, and by sharing Practical AI, you're helping to spread valuable AI knowledge and insights. The episode also highlighted the importance of partnerships and sponsors in making the show possible. Fastly and Fly.io were specifically mentioned for their support in hosting the show's static and dynamic assets, respectively, and Breakmaster Cylinder provided the beats to keep the show lively. Overall, this episode emphasized the value of community, knowledge sharing, and partnerships in the world of AI. So, if you've learned something new from Practical AI, consider sharing it with a friend or colleague. And if you're a business looking to partner with a high-quality AI podcast, reach out to Practical AI.

    Recent Episodes from Practical AI: Machine Learning, Data Science

Stanford's AI Index Report 2024
We’ve had representatives from Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) on the show in the past, but we were super excited to talk through their 2024 AI Index Report after such a crazy year in AI! Nestor from HAI joins us in this episode to discuss some of the main takeaways, including how AI makes workers more productive, how the US is sharply increasing regulation, and how industry continues to dominate frontier AI research.

Apple Intelligence & Advanced RAG
    Daniel & Chris engage in an impromptu discussion of the state of AI in the enterprise. Then they dive into the recent Apple Intelligence announcement to explore its implications. Finally, Daniel leads a deep dive into a new topic - Advanced RAG - covering everything you need to know to be practical & productive.

The perplexities of information retrieval
    Daniel & Chris sit down with Denis Yarats, Co-founder & CTO at Perplexity, to discuss Perplexity’s sophisticated AI-driven answer engine. Denis outlines some of the deficiencies in search engines, and how Perplexity’s approach to information retrieval improves on traditional search engine systems, with a focus on accuracy and validation of the information provided.

Using edge models to find sensitive data
We’ve all heard about breaches of privacy and leaks of protected health information (PHI). For healthcare providers and those storing this data, knowing where all the sensitive data is stored is non-trivial. Ramin, from Tausight, joins us to discuss how they deploy edge AI models to help companies search through billions of records for PHI.

Rise of the AI PC & local LLMs
    We’ve seen a rise in interest recently and a number of major announcements related to local LLMs and AI PCs. NVIDIA, Apple, and Intel are getting into this along with models like the Phi family from Microsoft. In this episode, we dig into local AI tooling, frameworks, and optimizations to help you navigate this AI niche, and we talk about how this might impact AI adoption in the longer term.

AI in the U.S. Congress
    At the age of 72, U.S. Representative Don Beyer of Virginia enrolled at GMU to pursue a Master’s degree in C.S. with a concentration in Machine Learning. Rep. Beyer is Vice Chair of the bipartisan Artificial Intelligence Caucus & Vice Chair of the NDC’s AI Working Group. He is the author of the AI Foundation Model Transparency Act & a lead cosponsor of the CREATE AI Act, the Federal Artificial Intelligence Risk Management Act & the Artificial Intelligence Environmental Impacts Act. We hope you tune into this inspiring, nonpartisan conversation with Rep. Beyer about his decision to dive into the deep end of the AI pool & his leadership in bringing that expertise to Capitol Hill.

Full-stack approach for effective AI agents
There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach, from hardware innovations through to user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.

Private, open source chat UIs
    We recently gathered some Practical AI listeners for a live webinar with Danny from LibreChat to discuss the future of private, open source chat UIs. During the discussion we hear about the motivations behind LibreChat, why enterprise users are hosting their own chat UIs, and how Danny (and the LibreChat community) is creating amazing features (like RAG and plugins).

    Related Episodes

When data leakage turns into a flood of trouble
    Rajiv Shah teaches Daniel and Chris about data leakage, and its major impact upon machine learning models. It’s the kind of topic that we don’t often think about, but which can ruin our results. Raj discusses how to use activation maps and image embedding to find leakage, so that leaking information in our test set does not find its way into our training set.

Stable Diffusion (Practical AI #193)
The new Stable Diffusion model is everywhere! Of course you can use this model to quickly and easily create amazing, dream-like images to post on Twitter, Reddit, Discord, etc., but this technology is also poised to be used in very pragmatic ways across industry. In this episode, Chris and Daniel take a deep dive into all things Stable Diffusion. They discuss the motivations for the work, the model architecture, and the differences between this model and other related releases (e.g., DALL·E 2). (Image from stability.ai)

AlphaFold is revolutionizing biology
    AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment, and is accelerating research in nearly every field of biology. Daniel and Chris delve into protein folding, and explore the implications of this revolutionary and hugely impactful application of AI.

Zero-shot multitask learning (Practical AI #158)
In this Fully-Connected episode, Daniel and Chris ponder whether in-person AI conferences are on the verge of a post-pandemic comeback. Then it's on to BigScience from Hugging Face, a year-long research workshop on large multilingual models and datasets. Specifically, they dive into T0, a series of natural language processing (NLP) models trained for research on zero-shot multitask learning. Daniel provides a brief tour of what's possible with the T0 family. They finish up with a couple of new learning resources.