Podcast Summary
Data Lakes vs Data Warehouses: Unique Use Cases: Data lakes are optimized for unstructured data and operational AI use cases, while data warehouses are optimized for analytics workflows and query patterns. Each can technically do what the other does, but industry trends suggest SQL data warehouses may replace data lakes for structured and semi-structured data in the future.
Data lakes and data warehouses serve distinct purposes and are optimized for different use cases. Martin Casado, an a16z general partner and pioneer of software-defined networking, argued that data lakes, which store tabular data in open-source file formats like Parquet or ORC in public cloud object storage, are better suited for unstructured data and compute-intensive operational AI use cases. Data warehouses, on the other hand, also use object storage and thereby gain some of the advantages of data lakes, but are optimized for analytics workflows and query patterns. Although each technology can technically do what the other does, the industry is making decisions based on the primary use cases each is built for. Bob Muglia, the former CEO of Snowflake, believes that five years from now data will primarily sit behind a SQL prompt, and that SQL data warehouses will replace data lakes for storing structured and semi-structured data. Martin, however, sees the operational AI use cases growing faster and argues that over time data lakes may end up consuming everything. This debate highlights the importance of understanding the unique strengths and limitations of different data architectures and choosing the right one for a specific use case.
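As a rough illustration of the storage pattern Casado describes, the sketch below writes and reads a small Parquet table in cloud object storage with PyArrow; the bucket, path, columns, and region are hypothetical placeholders, not anything discussed on the podcast.

```python
# Minimal sketch of the data-lake storage pattern: tabular data written
# as open-format Parquet files directly into cloud object storage.
# Bucket name, key, and columns are made-up placeholders.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

events = pa.table({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "purchase"],
})

s3 = fs.S3FileSystem(region="us-east-1")

# Write the table as a Parquet file in the bucket.
pq.write_table(events, "example-bucket/lake/events/part-0.parquet", filesystem=s3)

# Any engine that speaks Parquet (Spark, Trino, DuckDB, a warehouse's
# external tables) can read the same file back without copying it.
events_back = pq.read_table("example-bucket/lake/events/part-0.parquet", filesystem=s3)
print(events_back.to_pandas())
```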
Handling complex data types in data warehouses: Data warehouses may not effectively manage complex data types like images, videos, and documents, but this functionality is expected to be added in the future. SQL relational data warehouses are predicted to dominate data retrieval, but processing complex data will likely require specialized tools and approaches.
While cloud SQL data warehouses are sufficient for handling structured and semi-structured data, they currently lack the capability to effectively manage complex data types such as images, videos, and documents. However, this functionality is expected to be added in the next 2-3 years. SQL relational data warehouses have historically dominated data retrieval, but the technology required for processing complex data is fundamentally different. Although SQL may eventually win in data processing as well, it is predicted to take 8-10 years for this to occur. The speaker argues that organizations will store complex data in a data lake only once, rather than maintaining separate copies in both the data lake and data warehouse. Overall, the evolution of data management systems will continue to favor relational databases for data retrieval, but the processing of complex data will likely require specialized tools and approaches.
The Future of Data Warehouses and Data Lakes: Both data warehouses and data lakes are evolving to support diverse access patterns, SQL, and procedural operations. The AI/ML domain is driving the growth of data lakes, emphasizing open formats and interoperability. The future may see a convergence of these technologies, with a focus on optimizing for specific use cases.
As data usage evolves, the distinction between data warehouses and data lakes is becoming less clear. Both systems will need to support various access patterns, SQL, and procedural operations to cater to diverse use cases. The future may see a convergence of these technologies, with companies like Snowflake and Databricks offering both declarative and procedural approaches. The data lake is gaining traction particularly in the AI/ML domain, where complex models are being built and served. Use cases driving the technology include analytics, dashboarding, and building complex models for applications like wait-time prediction, fraud detection, and dynamic pricing. The growth in the AI/ML space suggests that this domain will increasingly shape the underlying technical architecture. Another key point is the importance of open formats and interoperability across use cases. Open-source file formats, indexing, and metadata are essential for both data warehouses and data lakes, and the ability to input, output, and convert formats easily is crucial for handling the diverse operations required in the data processing landscape. In the coming years, we'll likely see a continued evolution of these systems, with a focus on optimizing for specific use cases; ultimately, though, the use case itself will dictate the technology as the industry moves towards a converged point where declarative and procedural approaches coexist.
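To make the coexistence of declarative and procedural access concrete, here is a minimal sketch in which the same open-format file is queried with SQL and loaded into Python for procedural work; DuckDB is used only as a convenient local SQL engine, and the file and column names are invented for the example.

```python
# Sketch: declarative and procedural access over the same open-format
# file. File name and columns are hypothetical.
import duckdb
import pyarrow.parquet as pq

# Declarative: SQL directly over the Parquet file.
daily = duckdb.sql("""
    SELECT event, count(*) AS n
    FROM 'events.parquet'
    GROUP BY event
""").arrow()

# Procedural: the same file loaded as an Arrow table for Python-side
# feature engineering or model code, with no export/import step.
table = pq.read_table("events.parquet")

print(daily.to_pandas())
print(table.num_rows)
```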
Convergence of Data Science and Analytics: Data lakes will evolve, with a unifying layer for querying and serving data, while notebooks facilitate the combination of data, code, and visualizations. Machine learning and analytics communities are converging, with Google BigQuery leading the way by integrating machine learning into SQL.
The data science and analytics communities are continuing to evolve and converge, with specialized stacks becoming more prevalent due to resource constraints. The data lake will remain important for storing various data types, but its structure and usage will change as we gain a better understanding of which data is truly valuable. The movement of data will decrease, and there will be a need for a unifying layer at the top to facilitate querying and serving information. Notebooks, with their language-agnostic approach and ability to combine data, code, and visualizations, are well suited for this role. Another major topic discussed was the convergence of the machine learning and analytics worlds. Despite drawing on the same data sources, these two communities have remained largely separate due to tooling inconveniences. There are three visions for bringing these worlds together: integrating machine learning into SQL, putting SQL into machine learning environments, or creating a new, unified platform. Google BigQuery is currently leading the charge in integrating machine learning into SQL. Ultimately, the goal is to make it easier for these communities to work together and access the same data, enabling more effective and efficient data analysis and machine learning applications.
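As a hedged illustration of the "machine learning inside SQL" direction attributed to BigQuery, the sketch below submits a BigQuery ML CREATE MODEL statement and a prediction query from Python. The dataset, tables, and columns are invented for the example, and it assumes a configured Google Cloud project; the statements follow BigQuery ML's documented pattern rather than anything quoted from the episode.

```python
# Sketch of training and using a model entirely through SQL, driven
# from Python. Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

train_model = """
CREATE OR REPLACE MODEL demo_dataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM demo_dataset.customers
"""
client.query(train_model).result()  # train the model inside the warehouse

predict = """
SELECT *
FROM ML.PREDICT(MODEL demo_dataset.churn_model,
                (SELECT tenure_months, monthly_spend, support_tickets
                 FROM demo_dataset.new_customers))
"""
for row in client.query(predict).result():
    print(row)
```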
Future of Data Processing and Machine Learning: Heterogeneous and Fragmented Systems: The future of data processing and machine learning involves a heterogeneous, fragmented system with multiple systems interacting through common formats like Arrow, while addressing technical challenges of interoperability and creating efficient workflows for data engineers and data scientists.
The future of data processing and machine learning is likely to involve a heterogeneous, fragmented landscape in which multiple systems interact through common formats. According to the discussion, there are different visions for this future, including SQL integrated with Python or Scala, Arrow as an interchange format, and specialization in deep learning versus predictive models. While Arrow is seen as a significant step forward in providing a consistent in-memory layout for advanced analytics, it doesn't completely solve the technical challenges of interoperability, such as egress fees and cloud servers sitting in different locations. The use cases and personas of data engineers and data scientists require different skills, and the current infrastructure, which separates data prep and feature engineering from machine learning model training, introduces significant technical slowdowns. The existence of multiple languages and systems is not a matter of one being Turing complete and another not, but of people building their workflows around them. Therefore, open interfaces and common formats like Arrow are crucial for enabling efficient data processing and machine learning in a heterogeneous and fragmented system.
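The following sketch shows the kind of role Arrow plays as an interchange format, assuming a pandas-based data-prep step handing a table to a downstream consumer via Arrow's IPC stream; the data and column names are illustrative, not taken from the discussion.

```python
# Sketch: the same in-memory columnar table handed between a data-prep
# step and an ML-side consumer through Arrow's IPC stream, avoiding a
# per-system serialization format. Columns and values are made up.
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc

# "Data engineering" side: prepare features in pandas.
features = pd.DataFrame({"user_id": [1, 2], "sessions_7d": [4, 9]})
table = pa.Table.from_pandas(features)

# Serialize to an Arrow IPC stream, e.g. to hand to another process
# or a service that understands Arrow.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()

# "ML" side: reconstruct the table with no row-by-row conversion.
received = ipc.open_stream(payload).read_all()
print(received.to_pandas())
```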
Integrating relational databases with predictive analytics using knowledge graphs: Hybrid systems combining relational and predictive capabilities will dominate, knowledge graphs will be essential for business modeling and predictive analytics in the 2030s, and data mesh is gaining traction as a solution to current data challenges.
The future of data systems lies in the integration of relational databases with predictive analytics using knowledge graphs. Hybrid systems, combining both predictive and relational capabilities, will dominate for the next few years. However, as SQL reaches its limits in data modeling and transformation, knowledge graphs will become essential for business modeling and predictive analytics in the 2030s. Additionally, the concept of data mesh, which involves decentralizing data processing and analytics into individual business units, is gaining traction as a potential solution to current challenges such as cost, data quality, and lack of business context. Data mesh aims to put technology where the data and knowledge reside, and it offers valuable ideas for organizing and managing data across large enterprises. Despite the potential of streaming data and in-flight transforms, one speaker argues that data is not purely streaming and that a more comprehensive approach is needed for effective data management and analysis.
Maintaining data consistency in the modern data stack: The modern data stack prioritizes flexibility and handling various use cases, while ensuring data consistency through a unified architecture, rather than relying on traditional streaming-based solutions that neglect consistency for distribution.
Data consistency is crucial when managing data, and transactional data from business systems is an essential source that cannot be overlooked. Traditional streaming-based solutions often neglect this aspect, prioritizing distribution over consistency. However, building fully distributed architectures can lead to inefficiencies and to fragmented toolsets for administration, processing, and access. The term "data mesh" can be misleading, as it connotes full distribution, when the goal should be to support a distributed way of working on top of a unified architecture. The modern data stack, with its flexibility and capability to handle various use cases, is expected to continue absorbing new applications, such as complex data from predictive analytics and medical fields, in the coming years. Ultimately, the aim is to have one clean, well-understood data set that supports performance, large batch analytical processing, and data science, while accommodating specialized use cases.
The Future of Data Processing: Modern Data Apps and Latency Challenges: Modern data apps will revolutionize business decision-making with real-time data processing, but designers must consider latency and throughput trade-offs.
The future of data processing lies in the modern data app, which can autonomously make business decisions using data from various systems. However, building such data apps comes with challenges, particularly around latency. While some believe data apps should be separate systems that pull data from data warehouses, others argue for natively built data apps. Regarding latency, while some applications require instant response, most can work with a minute or two of delay. The trade-off between latency and throughput is a complex issue, and designers must consider it case by case. Despite these challenges, the future of data processing will involve more automation and real-time decision-making, making the development of modern data apps a crucial endeavor.
Data platforms trade-offs between latency and throughput: New data platforms may offer a combination of streaming and batch processing, allowing users to choose the best solution for their specific needs based on latency requirements.
As data platforms continue to evolve, there will be ongoing trade-offs between latency and throughput. While throughput-optimized architectures like Snowflake are expected to reach lower latencies than many anticipate, they may not be the best solution for applications requiring extremely low latency. The future may bring new major data platforms alongside the current players like Snowflake, Databricks, Google, AWS, and Azure. These new platforms could offer a combination of streaming and batch processing, allowing users to choose the best solution for their specific needs. However, it's important to remember that architectural choices affect each platform's latency characteristics differently; Snowflake's architecture, for instance, differs from MemSQL's in terms of latency. The conversation also touched on the potential for Lambda-style architectures that don't require additional tools, which could offer the benefits of both streaming and batch processing. In conclusion, the data platform landscape will continue to evolve, and users will need to make informed decisions based on their specific use cases and latency requirements.
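As a purely conceptual sketch, not tied to any of the vendors mentioned, the snippet below shows the Lambda-style idea of merging a throughput-optimized batch view with a latency-optimized speed view at query time; all names and numbers are made up for illustration.

```python
# Conceptual Lambda-style sketch: a batch view recomputed periodically
# over full history, plus a small speed view of recent events, merged
# when a query is served. Data is invented for the example.
from collections import Counter

# Batch layer: high throughput, minutes-to-hours of staleness.
batch_counts = Counter({"checkout": 10_000, "signup": 2_500})

# Speed layer: only events that arrived since the last batch run.
recent_events = ["checkout", "checkout", "signup"]
speed_counts = Counter(recent_events)

def serve_count(event_type: str) -> int:
    """Merge the batch and speed views to answer a query."""
    return batch_counts[event_type] + speed_counts[event_type]

print(serve_count("checkout"))  # 10002
```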