What's up, DocQuery?

October 12, 2022

Practical AI: Machine Learning, Data Science

Podcast Summary

Creating a solution for unstructured data using databases and machine learning: Impira uses a combination of database and machine learning technology to help users work with unstructured data, allowing for easy verification of machine learning predictions and quick training through user feedback.
Impira, a company founded by Ankur Goyal, aims to make it easy for users to work with unstructured data using a combination of database and machine learning technology. The motivation behind the company came from speaking with customers who struggled to use relational databases for complex, unstructured data. Initially, the team believed the biggest challenge would be helping companies understand image and video content. They later learned that the bottleneck was the creation of this content rather than its understanding. Goyal, who has a background in relational databases, saw the potential of machine learning to help people work with any kind of data, no matter how messy or complicated. Impira's approach lets users see at a glance whether a machine learning prediction is correct and adjust it if not; each adjustment feeds back into the model and incrementally trains it. The result is a very lightweight machine learning approach that trains and evaluates quickly.
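The feedback loop described above can be illustrated with a toy sketch. This is not Impira's model: the `FeedbackModel` class, its count-based prediction rule, and the field names are all hypothetical, chosen only to show how a user's confirmation or correction can update a model incrementally, without a separate retraining step.

```python
from collections import Counter, defaultdict


class FeedbackModel:
    """Toy incremental learner (hypothetical): maps the label text
    printed next to a value to the field name users assigned to it."""

    def __init__(self):
        # context text -> counts of field names users confirmed for it
        self.counts = defaultdict(Counter)

    def predict(self, context: str):
        seen = self.counts.get(context.lower())
        if not seen:
            return None  # no feedback recorded for this context yet
        return seen.most_common(1)[0][0]

    def feedback(self, context: str, correct_field: str) -> None:
        """Record a user confirmation or correction as a single count update."""
        self.counts[context.lower()][correct_field] += 1


model = FeedbackModel()
first_guess = model.predict("Amount due")    # nothing learned yet
model.feedback("Amount due", "total")        # the user labels the value
second_guess = model.predict("amount due")   # the same context is now recognized
```

Because every correction is a small update, the model can be re-evaluated immediately after each piece of feedback, which is the property the summary attributes to the lightweight design.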
Discovering the Potential of Unstructured Data Processing: Through a fortunate discovery, a team recognized the potential of machine learning models in processing unstructured data, leading to the development of accessible and powerful solutions for businesses.
The team discovered an opportunity to work with unstructured data, specifically invoices and other documents, through a happy coincidence involving a machine learning model that used optical character recognition (OCR). The discovery showed that there was significant potential in helping businesses process and analyze unstructured data of a kind not typically considered in machine learning applications. At first, the challenge did not concern the team, as they had confidence in their ability to find solutions. They leaned on computer vision to reason about PDF files because PDFs are a hybrid of text and visual elements. They soon realized that they could work with many document formats, including emails, HTML files, scanned images, and pictures taken with phones. Uploaded files were preprocessed into a consistent data structure, normalized into pixels, text, and bounding boxes. OCR and reading data from invoices were not new concepts at the time. However, most businesses did not take advantage of these solutions because user-friendly options were lacking. The team identified this gap and aimed to create an accessible and powerful solution to help businesses work with unstructured data more efficiently.
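A minimal sketch of such a normalized structure follows. The type names (`Word`, `Page`), the field layout, and the `page_from_ocr` helper are assumptions made for illustration; the episode summary only states that files are normalized into pixels, text, and bounding boxes, not how Impira represents them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (left, top, right, bottom) in page pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    bbox: BBox


@dataclass
class Page:
    width: int
    height: int
    pixels: bytes  # rendered raster of the page, whatever the source format
    words: List[Word] = field(default_factory=list)

    def text(self) -> str:
        """Join the words in the order they were stored."""
        return " ".join(w.text for w in self.words)


def page_from_ocr(width, height, pixels, ocr_tokens):
    """Build a Page from (text, bbox) pairs, e.g. the output of any OCR step."""
    return Page(width, height, pixels, [Word(t, b) for t, b in ocr_tokens])


page = page_from_ocr(
    800, 1000, b"",
    [
        ("Invoice", (50, 40, 160, 70)),
        ("Total:", (80, 900, 160, 950)),
        ("$42.00", (170, 900, 250, 950)),
    ],
)
```

Once an email, an HTML file, or a phone photo has been reduced to this one shape, downstream models only need to handle a single input format.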
Challenges with pre-trained OCR models: While convenient, pre-trained OCR models may struggle with low-quality images or handwriting and may not extract all necessary fields, leading to inconsistent data extraction and the need for additional processing.
While pre-trained OCR models like Textract offer convenience by eliminating the need for template definition, they come with their own set of challenges. These models may struggle with low-quality images or handwriting, and may not extract all the necessary fields from a document. Moreover, if a document contains fields that are consistently missed, there is no way to instruct the model to improve. These issues can lead to inconsistent data extraction and the need for additional processing to normalize the results. This is the gap Impira's product targets, offering a more accurate and customizable extraction experience in which users define their own templates and machine learning continually improves the extraction process.
Addressing unexpected schema differences in document processing: To enhance document processing with AI models, address unexpected schema differences by catering to both beginner and advanced Excel users, allowing complex expressions and formulas, and simplifying data extraction with natural language queries.
When documents are processed with AI models, unexpected schema differences between uploaded documents and pretrained models can create a manual and time-consuming process for users. This issue was not initially considered, but it meant users had to implement their own machine learning models to translate the inferred schema back to the intended one. The process was laborious and required significant manual effort, even with available tooling. Initially, the focus was on OCR and visual models, but the models were not pretrained and learned solely from the user's documents. The challenges included handling white space and dealing with inferred schemas that did not match what the user intended. To improve the experience, the team set constraints for a self-service product: it had to support any document schema and cater to both beginner and advanced Excel users. Through user research, they discovered that most users were either basic or advanced Excel users. As a result, Impira allowed users to create complex expressions and formulas, making the product more accessible and user-friendly for non-technical users. This approach ultimately led to the development of DocQuery, which simplifies the process of extracting data from documents using natural language queries.
Designing a machine learning approach for document processing with user feedback: The team built Impira around a lightweight model for document processing with quick training, and later introduced DocQuery to handle manual judgment and interpretation tasks, enhancing overall efficiency.
The team aimed to create a machine learning approach that allows users to easily check the accuracy of predictions and provide feedback for incremental training. This design resulted in a lightweight, quick-training model. Users primarily interact with documents by integrating information from them into their workflows and asking analytical questions. These tasks often involve manual judgment and interpretation, and Impira's product initially did not address these needs. The team later introduced DocQuery to tackle them. Overall, the goal is to make document processing more efficient and less manual, allowing users to focus on higher-level tasks.
Impira addresses pain points of nonprofits and small organizations with time-consuming data labeling tasks: Impira simplifies the data labeling process for nonprofits and small organizations using text-based question answering models, reducing the need for manual labeling and improving handling of various formats.
Impira, a data extraction tool, identified the pain points of users, particularly in nonprofits and small organizations, who often handle administrative tasks that are time-consuming and not their areas of expertise. These tasks include providing labels for models to learn from, which can be tedious and time-consuming, especially when dealing with a wide variety of formats. In response, Impira aimed to improve the user experience by enabling users to work with any field they want and creating a simpler labeling process. They explored text-based question answering models, like those offered by Hugging Face, which proved to be surprisingly accurate even without specific training or context. This discovery led Impira to believe they could achieve better results with a little more effort, ultimately reducing the need for manual labeling and improving the tool's ability to handle a wide range of formats.
Impira's new question answering tool, DocQuery, was inspired by the potential of question answering frameworks: Impira recognized the potential of question answering frameworks and developed DocQuery to address the generalization problem and improve existing solutions.
The development of DocQuery, Impira's new question answering tool, was inspired by the open-ended possibilities of the question answering framework, which aligned well with the company's product philosophy. The realization came when a model that had never seen documents like those being pasted into the text box performed exceptionally well, suggesting it could solve the generalization problem. The breakthrough happened during a memorable car ride and late-night hotspot sessions. The announcement, on September 1st, also involved an integration with Hugging Face and its pipeline abstraction. Impira had been working on large language models, as mentioned on Twitter, and had collaborated with Hugging Face on this problem. A pipeline abstracts away the complex machinery, making it easier for non-experts to work with models. The question answering pipeline in particular caught the team's attention because it works with any model that fits the question answering framework. They were also aware of Microsoft's LayoutLM, a language model that takes both text and bounding boxes as input, which adds the geometric information relevant to their problem. However, they could not find a question answering pipeline that worked with LayoutLM, which convinced them there was an opportunity to improve on the existing solution.
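To make the "text and bounding boxes" input concrete, the sketch below prepares word boxes the way LayoutLM-style models expect them: each pixel box is rescaled to a 0 to 1000 coordinate range so that pages of different sizes share one coordinate system. The sample words, page size, and variable names are made up for illustration, and the snippet covers only input preparation, not the question answering pipeline itself.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


def normalize_box(box: Box, page_width: int, page_height: int) -> List[int]:
    """Scale a pixel box to the 0-1000 range used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for a page that is 800 x 1000 pixels.
words = ["Total:", "$42.00"]
pixel_boxes = [(80, 900, 160, 950), (240, 900, 400, 950)]

# One normalized box per word; the model consumes the words and boxes together.
model_boxes = [normalize_box(b, 800, 1000) for b in pixel_boxes]
```

The geometry is what lets a model tell that "$42.00" sits directly to the right of "Total:", information that is lost when a document is flattened to plain text.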
Open-source initiatives foster innovation and opportunities: Open-source projects can lead to wider access, innovation, and business opportunities. Teams can benefit from personal drive, potential distribution, and confidence in their unique value proposition.
Open-source initiatives can bring significant benefits to both the community and the entrepreneur. In this case, a team identified a gap in document-based question answering and collaborated with Hugging Face to create an easy-to-use solution. They open-sourced their contribution, motivated by personal innovation drive, potential distribution opportunities, and confidence in their proprietary strategy. Open-source distribution not only allows for wider access and innovation but also exposes the company to potential customers and builds credibility. The team's confidence comes from their unique value proposition: real-time data flywheel, ease of use, and advanced integrations, which are challenging to build and engineer independently. By open-sourcing their solution, they can still thrive with their proprietary product's core features. This story demonstrates that open-source initiatives can foster innovation, build community, and create business opportunities.
Lack of expertise as a strength for non-experts starting a business in tech: Starting a tech business as a non-expert can provide a unique perspective and identify market gaps. Being open-minded and user-focused can lead to innovative solutions and a successful business.
Having a fresh perspective as a non-expert in a field can be an asset when starting a business, particularly in technology. Ankur Goyal, the founder of Impira, the company behind DocQuery, shared how his lack of deep learning expertise initially gave him a unique perspective on making complex models more accessible to non-experts. He emphasized the importance of being naive and open-minded, which allowed him to identify gaps in the market and understand user needs. Goyal also discussed the benefits of open-source business models, which can lead to higher adoption rates due to increased accessibility and ease of use. Looking ahead, the team plans to continue developing user-friendly tools and expanding their offerings in the natural language processing space. This approach of combining user needs with technical expertise can lead to innovative solutions and a successful business.
Expanding DocQuery's capabilities for complex document queries: DocQuery, a question answering framework, is enhancing its features to identify document types, extract table data, and query across multiple documents. The team is confident in its progress towards these goals.
DocQuery, a question answering framework developed by Impira, is making significant strides in expanding its capabilities to answer more complex questions about documents. Currently, users are asking for the ability to identify document types and extract information from tables, which the team plans to address in the near term. Additionally, the team is working on enabling users to ask natural language questions across multiple documents, a feature that is currently in the training phase. The team is confident in their ability to expand the question answering framework to support these features due to its flexibility and the success they've had with similar functionalities in Impira's product. The ultimate goal is to enable users to ask complex queries over a pile of documents, such as finding all invoices due next month or identifying the most relevant invoice from a vendor for a contract. While the team is making progress, there are still a few moving parts to figure out before this feature becomes widely available.
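As a rough illustration of what a query such as "find all invoices due next month" reduces to once fields have been extracted, the sketch below filters a small set of records by document type and due date. The records, field names, and helper functions are hypothetical; the summary does not describe how DocQuery will implement cross-document queries, and the harder part (turning a natural language question into such a filter) is not shown.

```python
from datetime import date

# Hypothetical extracted fields, one dict per document.
documents = [
    {"type": "invoice", "vendor": "Acme", "due_date": date(2022, 11, 15)},
    {"type": "invoice", "vendor": "Globex", "due_date": date(2022, 12, 1)},
    {"type": "contract", "vendor": "Acme", "due_date": None},
]


def next_month(today: date):
    """Return (year, month) of the calendar month after `today`."""
    if today.month == 12:
        return (today.year + 1, 1)
    return (today.year, today.month + 1)


def invoices_due_next_month(docs, today: date):
    """Keep invoices whose due date falls in the month after `today`."""
    target = next_month(today)
    return [
        d for d in docs
        if d["type"] == "invoice"
        and d["due_date"] is not None
        and (d["due_date"].year, d["due_date"].month) == target
    ]


due = invoices_due_next_month(documents, date(2022, 10, 12))
```

The filter itself is simple; it depends on the document type and due date having been identified reliably for every file in the pile, which is why document classification and table extraction come first on the roadmap described above.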
Exploring Commanding Data with DocQuery: The DocQuery team is developing a way for users to type actions related to documents, called "commanding data", which has the potential to make interacting with data more intuitive and powerful. They plan to open-source parts of the project to engage the community and reach a larger audience.
The team behind DocQuery, a document querying system, is exploring the idea of enabling users to type actions related to documents, rather than just asking questions. This approach, which they call "commanding data," has the potential to make interacting with data more intuitive and powerful. They plan to open-source parts of DocQuery to engage the community and tap into different use cases and domains. The team sees this as a significant shift in how people work with data, and they believe that open-sourcing the project will help them reach a larger audience and achieve greater impact. DocQuery's vision is to make it simple for anyone to ask anything of any data and easily sequence the parts together. Despite the challenges, they are excited about the possibilities and the potential benefits for a wide range of users.

Recent Episodes from Practical AI: Machine Learning, Data Science

Stanford's AI Index Report 2024

We’ve had representatives from Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) on the show in the past, but we were super excited to talk through their 2024 AI Index Report after such a crazy year in AI! Nestor from HAI joins us in this episode to talk about some of the main takeaways including how AI makes workers more productive, the US is increasing regulations sharply, and industry continues to dominate frontier AI research.

Practical AI: Machine Learning, Data Science

July 02, 2024

Apple Intelligence & Advanced RAG

The perplexities of information retrieval

Using edge models to find sensitive data

Rise of the AI PC & local LLMs

AI in the U.S. Congress

First impressions of GPT-4o

Full-stack approach for effective AI agents

Autonomous fighter jets?!

Private, open source chat UIs

Related Episodes

When data leakage turns into a flood of trouble

Stable Diffusion (Practical AI #193)

AlphaFold is revolutionizing biology

The nose knows

Zero-shot multitask learning (Practical AI #158)