    Podcast Summary

    • Discovering the importance of language documentation: Linguist Sarah Moeller shares her journey into language documentation, highlighting the significance of preserving minority languages and the impact on communities.

      There is a crucial need to document and preserve minority languages before they disappear. Sarah Moeller, an assistant professor in the Department of Linguistics at the University of Florida, shared her personal journey of becoming a linguist and her experiences with minority-language speakers. She began teaching English as a foreign language as a teenager when her family moved to Russia, and later discovered that she enjoyed studying the theoretical side of languages more than teaching them. While working as a translator in Siberia, she encountered speakers of minority languages facing social problems tied to the loss of their languages. This experience led her to the field of language documentation and description, which records, analyzes, and preserves languages that have not been studied scientifically. The resulting records give communities the resources to revitalize their languages if they choose to do so, underscoring the value of linguistic research and the need to preserve linguistic diversity.

    • The loss of indigenous languages can lead to emotional trauma: Language loss can result in depression, suicide, and substance abuse. Efforts to document and preserve endangered languages can help prevent this trauma.

      The loss of a language, particularly when it's not by choice, can lead to high rates of depression, suicide, and substance abuse. Moeller saw this firsthand during a summer spent in Siberia, where indigenous languages were disappearing rapidly. The field of language documentation and description arose in response: it records and analyzes languages that have barely been studied scientifically so they can be preserved for future generations. For individuals whose languages are disappearing, the impact can be traumatic, affecting their identity and sense of heritage. Some communities have successfully preserved their languages, while others have not. In the past, forced assimilation through education and punishment for speaking native languages left deep emotional wounds, as seen in the experiences of Indigenous communities in North America and Australia. Revitalizing endangered languages presents an interesting challenge, with some communities having recorded dictionaries or other resources to work from. The stories of language loss and the resulting trauma are particularly poignant in communities that were forcibly separated from their languages and families.

    • Language revitalization programs linked to improved physical health: Language preservation can lead to better health outcomes and stronger cultural connections.

      The preservation and documentation of languages have far-reaching impacts beyond scientific interest. A study conducted in a Canadian community found a correlation between language revitalization programs and improved physical health, such as lower rates of diabetes. The connection may run through a sense of identity and self-worth: when a language and its speakers are devalued or lost, individuals may feel insignificant and neglect their own well-being. Language documentation itself involves practices such as capturing stories and other natural speech with recording equipment. For lesser-known or unstudied languages, researchers often start by visiting the community, recording stories, and then transcribing and translating the recordings. The benefits go beyond the individual: documentation also helps preserve cultural heritage and maintain connections to family histories, which are crucial for emotional and social well-being.

    • Building relationships in language documentation: Language documentation goes beyond learning a language; it's about building relationships and trust within a community, using techniques like stories, speeches, and surveys to facilitate the process, and transcribing audio recordings for linguistic analysis.

      Language documentation is a deeply personal and cultural experience that goes beyond just learning the language itself. It's about building relationships and trust within the community, and creating a natural and comfortable environment for people to speak their language. The process of language documentation starts with learning the techniques to make people feel comfortable speaking, which can involve using stories, speeches, or even translating common words. For someone new to a community, this experience can vary greatly, but building relationships is a crucial aspect. Linguists often work with existing community members to help facilitate this process and create a focus for their work. This can involve creating surveys or conducting interviews to understand which parts of the language and culture are disappearing. Once audio recordings are made, the next step is to transcribe the spoken language into written form for linguistic analysis. This involves breaking down words into their smallest meaningful parts, known as morphemes, and analyzing their meanings and how they are built. Computing plays a significant role in this process, from transcribing audio to analyzing morphology and even creating digital resources for language preservation. However, the human connection and relationship-building aspects of language documentation cannot be replicated by machines and remain a crucial component of this important work.
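
      As a rough illustration of the morphological analysis described above, the Python sketch below segments a word into morphemes by greedy longest-match against a tiny hand-built lexicon. The toy Turkic-style word, the lexicon entries, and the segment_word function are illustrative assumptions, not a description of any specific tool used in Moeller's work.

      # A minimal sketch of morpheme segmentation, assuming a tiny hand-built
      # lexicon of stems and affixes; real documentation projects rely on far
      # richer analyses and context.
      LEXICON = {
          "kitab": "book",      # stem (illustrative)
          "lar": "PL",          # plural suffix
          "im": "1SG.POSS",     # first-person possessive suffix
          "da": "LOC",          # locative case suffix
      }

      def segment_word(word: str) -> list[tuple[str, str]]:
          """Greedily match the longest known morpheme at each position."""
          segments = []
          i = 0
          while i < len(word):
              match = None
              for j in range(len(word), i, -1):   # try the longest substring first
                  piece = word[i:j]
                  if piece in LEXICON:
                      match = piece
                      break
              if match is None:
                  segments.append((word[i:], "???"))  # unanalyzed residue
                  break
              segments.append((match, LEXICON[match]))
              i += len(match)
          return segments

      # "kitablarimda" ~ "in my books" in a Turkic-style toy example.
      print(segment_word("kitablarimda"))
      # [('kitab', 'book'), ('lar', 'PL'), ('im', '1SG.POSS'), ('da', 'LOC')]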

    • Documenting endangered languages: Preserving endangered languages involves painstaking manual work, including transcription, word analysis, and translation. It's a long process, but essential for future generations. Software tools help, but human input is crucial.

      Documenting endangered languages involves crucial steps such as transcription, basic word analysis, morphological analysis, and rough translations. These steps are essential for preserving the language for future generations, as without them, the language may become as incomprehensible as Egyptian hieroglyphics without the Rosetta Stone. Although software tools have existed since the eighties or early nineties to help with this process, the majority of the work is still done by hand. For instance, a linguist recently finished analyzing and translating a 30,000-word corpus of a language in the Solomon Islands after working on it for 20 to 30 years. With approximately 7,000 languages in the world, around 40% of which are in danger of disappearing, there is a significant need for more linguists to invest their time and effort into documenting these languages. However, it's important to note that understanding the basics of a language can be achieved in a few years with a relatively small corpus of text, while fully understanding its intricacies might take longer. This highlights the importance of documenting as much data as possible in the initial stages. While machine learning tools can aid in the process, they currently do not replace the need for human input and analysis.
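
      To make the shape of that output concrete, here is a minimal Python sketch of how one analyzed utterance might be stored as an interlinear glossed text (IGT) record, with tiers for transcription, morpheme segmentation, glosses, and a rough translation. The field names, the toy example, and the metadata keys are illustrative assumptions, not the format of any particular archive or tool.

      # A minimal sketch of an interlinear glossed text (IGT) record; actual
      # corpora vary widely in structure and metadata.
      from dataclasses import dataclass, field

      @dataclass
      class IGTRecord:
          transcription: str    # the utterance as transcribed
          morphemes: list[str]  # each word split into morphemes
          glosses: list[str]    # a gloss line matching the morpheme line
          translation: str      # rough free translation
          metadata: dict = field(default_factory=dict)  # speaker, recording, ...

      # Hypothetical entry (toy data, not from a real corpus).
      record = IGTRecord(
          transcription="kitablarimda",
          morphemes=["kitab-lar-im-da"],
          glosses=["book-PL-1SG.POSS-LOC"],
          translation="in my books",
          metadata={"speaker": "S1", "recording": "session_03.wav"},
      )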

    • Leveraging commonalities in learning multiple languages: Linguists can build on previous knowledge to identify complications and focus efforts effectively, but annotating spoken data remains a challenge and requires community trust and relationships.

      While studying different languages, even dissimilar ones, there are commonalities that can be leveraged to make the learning process more efficient. Linguists can build on their previous understanding to identify complications and focus their efforts effectively. However, the actual annotation of spoken data remains a challenge and requires building trust and relationships with communities to ensure accurate recordings and translations. The importance of technology in language preservation is evident, but cultural acceptance and sensitivity are crucial factors in implementing solutions. Ultimately, the goal is to create a comprehensive linguistic record for future generations, but the unpredictability of people and cultures necessitates a flexible and collaborative approach.

    • Exploring the intersection of linguistics and technology: Linguists can bridge the gap between traditional linguistic methods and modern technological advancements by exploring the intersection of linguistics, computational linguistics, NLP, machine learning, and AI. This can lead to advancements in tasks like transcription, speech-to-text, named entity recognition, and morphological analysis.

      The intersection of language documentation with computational linguistics, NLP, machine learning, and AI presents a largely untapped opportunity to bridge traditional linguistic methods and modern technological advances. During her journey through language documentation and academia, Moeller saw the potential for computers to augment and assist the process, but it wasn't until a job opportunity at an NLP company that she was introduced to computational linguistics and AI. She initially believed the technology applied only to well-resourced languages, then recognized its potential during fieldwork: while using software for transcription, translation, and linguistic analysis, she saw how much faster and more efficiently computers could process the data. That realization led her to pursue a PhD combining computational linguistics and endangered languages, a combination that was not easy to find at the time. Work at this intersection can advance tasks such as transcription, speech-to-text, named entity recognition, and morphological analysis, but a significant gap between the fields remains. Moeller encourages linguists to explore this intersection and to use programming skills to contribute to the growing field.

    • Intersection of linguistics and computer science for endangered languages: By combining human expertise and machine learning, we can efficiently and effectively make strides in understanding and preserving complex, endangered languages. Morphological analysis is crucial for languages with complex word structures, but computers will inevitably encounter unique aspects requiring human intervention.

      The intersection of linguistics and computer science, specifically in the context of transcribing, translating, and analyzing complex, endangered languages, presents a significant opportunity for innovation and collaboration. The speaker described their personal journey of learning to program and addressing the bottleneck of manually transcribing, morphologically analyzing, and translating vast amounts of data for linguistic research and machine learning. This process is crucial for making these languages accessible to their communities and for advancing NLP research. Morphological analysis, in particular, is essential for languages with complex word structures, as it allows the computer to understand the parts of a word and learn from them. However, the speaker emphasized that computers will inevitably encounter new, unique aspects of a language that they haven't seen before, necessitating human intervention. This intersection of human expertise and machine learning can be described as an active learning or human-in-the-loop process. Linguists are needed to focus on unique, interesting aspects of the language that the computer may not be able to learn on its own, making the process more efficient and effective. By combining the strengths of both humans and computers, we can make significant strides in understanding and preserving complex, endangered languages.
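
      A minimal Python sketch of that human-in-the-loop idea appears below: the machine analyzes the forms it is confident about and routes novel or irregular forms to a linguist. The model object, its predict method, and the confidence threshold are assumptions for illustration, not part of any specific system discussed in the episode.

      # A minimal human-in-the-loop sketch, assuming a model whose predict()
      # returns an (analysis, confidence) pair for a word.
      def analyze_corpus(words, model, review_queue, threshold=0.8):
          """Accept confident machine analyses; route the rest to a linguist."""
          analyses = {}
          for word in words:
              analysis, confidence = model.predict(word)
              if confidence >= threshold:
                  analyses[word] = analysis   # the model handles routine cases
              else:
                  review_queue.append(word)   # a human looks at the unusual ones
          return analyses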

    • Continuous feedback loop between NLP methods and traditional linguistic analysis: By combining modern NLP methods and traditional linguistic analysis, humans can efficiently guide machine learning models to improve their accuracy and ultimately preserve and promote linguistic diversity.

      Improving the accuracy of machine learning models that process human language involves a continuous feedback loop between modern NLP methods and traditional linguistic analysis. Moeller discussed the challenge of finding and correcting a model's errors in analyzing word meanings, especially when working from a large dataset of unannotated text. The goal is to identify and correct the most impactful errors so the model learns more effectively and its overall accuracy improves. Linguistic knowledge can sharpen this process, for example by focusing annotation effort on low-confidence predictions and selecting corrected examples in a way that targets the model's weaknesses. In this way, humans can efficiently guide the model's learning and help it reach higher levels of accuracy. The intersection of NLP methods and traditional linguistic analysis is a virtuous cycle: more accurate models make more data available for analysis, which in turn supports the preservation and promotion of lesser-known languages. That benefits not only technological advancement but also cultural preservation, mental and emotional health, and overall linguistic diversity.
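
      One way to realize that feedback loop is uncertainty sampling, sketched in Python below: each round, the examples the model is least confident about are sent to a human for correction, added to the training data, and the model is retrained. The model interface (confidence, retrain), the annotate callback, and the batch size are assumptions for illustration.

      # A minimal uncertainty-sampling sketch for one round of the loop.
      def active_learning_round(unlabeled, labeled, model, annotate, batch_size=50):
          """Annotate the least-confident examples, then retrain the model."""
          ranked = sorted(unlabeled, key=model.confidence)  # least confident first
          for example in ranked[:batch_size]:
              labeled[example] = annotate(example)          # human correction
              unlabeled.remove(example)
          model.retrain(labeled)                            # learn from the fixes
          return model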

    • Exploring the intersection of linguistics and AI: Advancements in linguistics and AI can lead to better language understanding, preservation of endangered languages, and practical applications like language learning tools.

      The intersection of linguistics and AI is a rapidly evolving field with numerous possibilities. Providing more training data, particularly for multilingual models, can lead to better understanding of and learning from diverse languages. Techniques developed for low-resource languages can also be applied to high-resource languages in specialized contexts where data is limited. Looking toward the future, the possibilities are exciting, especially for preserving endangered languages through NLP, though the human element remains crucial: we cannot rely solely on data or data augmentation methods. Advances in this field can also lead to practical applications such as language-learning apps, spell checkers, and other tools that make a real difference in people's lives. The growth of computational morphology and the convergence of endangered-language documentation with NLP are just a few examples of the momentum building in this area. Ultimately, the potential for positive change and the creation of valuable technology for many communities make this an exciting and worthwhile pursuit.

    • Making NLP accessible and inclusive for linguists and native speakers: The future of NLP lies in making it accessible and inclusive for linguists and native speakers from minority communities, involving them in development and implementation to ensure cultural sensitivity and accuracy.

      The future of Natural Language Processing (NLP) lies in making it accessible and inclusive for linguists and native speakers from minority communities. This realization is crucial for improving the effectiveness and accuracy of NLP tools. The goal is to bring these communities into the loop, creating interfaces that are accessible to them, and making NLP tools more than just a domain of computer science departments and big companies. This trend is exciting as it has the potential to democratize NLP technology and make it a powerful tool for communities that can benefit the most from it. It's essential to involve linguists and native speakers in the development and implementation of NLP tools to ensure they are culturally sensitive and accurate. This perspective is gaining recognition, and it's an exciting time for those passionate about languages and computers. This week's conversation with Sarah, a linguist, confirmed this belief, and it's inspiring to see more people recognizing the importance of this trend. If you're new to Practical AI, please subscribe to our show at practicalai.fm or search for Practical AI in your favorite podcast app. And if you're a long-time listener, please share the show with your friends to help us grow. Thanks for tuning in, and we'll talk to you again next time.

    Recent Episodes from Practical AI: Machine Learning, Data Science

    Stanford's AI Index Report 2024
    We’ve had representatives from Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) on the show in the past, but we were super excited to talk through their 2024 AI Index Report after such a crazy year in AI! Nestor from HAI joins us in this episode to talk about some of the main takeaways including how AI makes workers more productive, the US is increasing regulations sharply, and industry continues to dominate frontier AI research.

    Apple Intelligence & Advanced RAG
    Daniel & Chris engage in an impromptu discussion of the state of AI in the enterprise. Then they dive into the recent Apple Intelligence announcement to explore its implications. Finally, Daniel leads a deep dive into a new topic - Advanced RAG - covering everything you need to know to be practical & productive.

    The perplexities of information retrieval
    Daniel & Chris sit down with Denis Yarats, Co-founder & CTO at Perplexity, to discuss Perplexity’s sophisticated AI-driven answer engine. Denis outlines some of the deficiencies in search engines, and how Perplexity’s approach to information retrieval improves on traditional search engine systems, with a focus on accuracy and validation of the information provided.

    Using edge models to find sensitive data
    We’ve all heard about breaches of privacy and leaks of protected health information (PHI). For healthcare providers and those storing this data, knowing where all the sensitive data is stored is non-trivial. Ramin, from Tausight, joins us to discuss how they deploy edge AI models to help companies search through billions of records for PHI.

    Rise of the AI PC & local LLMs
    We’ve seen a rise in interest recently and a number of major announcements related to local LLMs and AI PCs. NVIDIA, Apple, and Intel are getting into this along with models like the Phi family from Microsoft. In this episode, we dig into local AI tooling, frameworks, and optimizations to help you navigate this AI niche, and we talk about how this might impact AI adoption in the longer term.

    AI in the U.S. Congress
    At the age of 72, U.S. Representative Don Beyer of Virginia enrolled at GMU to pursue a Master’s degree in C.S. with a concentration in Machine Learning. Rep. Beyer is Vice Chair of the bipartisan Artificial Intelligence Caucus & Vice Chair of the NDC’s AI Working Group. He is the author of the AI Foundation Model Transparency Act & a lead cosponsor of the CREATE AI Act, the Federal Artificial Intelligence Risk Management Act & the Artificial Intelligence Environmental Impacts Act. We hope you tune into this inspiring, nonpartisan conversation with Rep. Beyer about his decision to dive into the deep end of the AI pool & his leadership in bringing that expertise to Capitol Hill.

    Full-stack approach for effective AI agents
    There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach, from hardware innovations through to user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.

    Private, open source chat UIs
    We recently gathered some Practical AI listeners for a live webinar with Danny from LibreChat to discuss the future of private, open source chat UIs. During the discussion we hear about the motivations behind LibreChat, why enterprise users are hosting their own chat UIs, and how Danny (and the LibreChat community) is creating amazing features (like RAG and plugins).

    Related Episodes

    When data leakage turns into a flood of trouble
    Rajiv Shah teaches Daniel and Chris about data leakage, and its major impact upon machine learning models. It’s the kind of topic that we don’t often think about, but which can ruin our results. Raj discusses how to use activation maps and image embedding to find leakage, so that leaking information in our test set does not find its way into our training set.

    Stable Diffusion (Practical AI #193)
    The new stable diffusion model is everywhere! Of course you can use this model to quickly and easily create amazing, dream-like images to post on twitter, reddit, discord, etc., but this technology is also poised to be used in very pragmatic ways across industry. In this episode, Chris and Daniel take a deep dive into all things stable diffusion. They discuss the motivations for the work, the model architecture, and the differences between this model and other related releases (e.g., DALL·E 2). (Image from stability.ai)

    AlphaFold is revolutionizing biology
    AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment, and is accelerating research in nearly every field of biology. Daniel and Chris delve into protein folding, and explore the implications of this revolutionary and hugely impactful application of AI.

    Zero-shot multitask learning (Practical AI #158)
    In this Fully-Connected episode, Daniel and Chris ponder whether in-person AI conferences are on the verge of making a post-pandemic comeback. Then on to BigScience from Hugging Face, a year-long research workshop on large multilingual models and datasets. Specifically they dive into the T0, a series of natural language processing (NLP) AI models specifically trained for researching zero-shot multitask learning. Daniel provides a brief tour of the possible with the T0 family. They finish up with a couple of new learning resources.