Podcast Summary
From Data Warehousing to Data Lakes and Back: After decades of evolution from data warehouses to data lakes, businesses are now moving subsets of their data back into traditional warehouses in the cloud for better management and insights.
The data industry has seen significant evolution over the decades, from data warehousing to data lakes, and the current landscape involves a complex system of tools and technologies. In the 1980s, data warehousing emerged as a solution to help businesses gain insights from their data, leading to a market worth over $20 billion. However, as data types diversified and the need for machine learning and AI grew, challenges arose. Data lakes were introduced around a decade ago as a cost-effective storage solution, but managing and making sense of data in these lakes proved difficult. Today, businesses are moving subsets of their data back into traditional data warehouses in the cloud to improve data management and gain better insights. The modern data stack is a complex ecosystem, and it's essential to understand the history and evolution of these tools and technologies to effectively extract value from data.
Merging BI and ML/AI through the lakehouse design pattern: The lakehouse design pattern enables BI, reporting, data science, and machine learning directly on data lakes, improving the efficiency of data analysis and decision-making.
We are witnessing the convergence of business intelligence (BI) with machine learning and artificial intelligence (ML/AI) through a new design pattern called the "lakehouse." This design pattern allows BI, reporting, data science, and machine learning to run directly on data lakes. The two fields are similar in that they need the same data, with machine learning additionally requiring metadata for optimal results. The differences lie in the personas and where they sit in the organization: traditional BI and analytics are typically used by data analysts and business analysts, while machine learning is used by data scientists, machine learning engineers, and machine learning scientists. Some argue that simple regressions can be done in a traditional data warehouse with SQL, but a research project at UC Berkeley that attempted to augment an existing relational model with machine learning did not produce satisfactory results. Overall, the lakehouse design pattern represents the emerging merger of the BI and ML/AI markets, allowing for more efficient and effective data analysis and decision-making.
Integrating ML and Data Science with BI systems: Integrate ML and data science with BI systems to minimize redundancy and keep data consistent, rather than maintaining two separate copies of the data.
Integrating machine learning and data science with traditional business intelligence (BI) systems can be challenging due to the technical differences between the two. Machine learning algorithms are iterative and recursive, making it difficult to implement them on top of data warehousing systems. However, the emergence of data frames as a lingua franca for data scientists has made it possible to marry the worlds of data science and machine learning with SQL and BI. Despite this progress, many enterprises still maintain two separate copies of their data – one in a data lake for machine learning and data science, and another in a data warehouse for SQL and BI. This architectural redundancy comes with a hefty price tag. The question is, do we really need two copies of the data and the associated maintenance costs, or can we do it all in one place? While the market for AI and machine learning is large and valuable, it's important to remember that there is also a significant existing workflow around BI. The answer to this challenge lies in finding a way to effectively integrate machine learning and data science with traditional BI systems while minimizing redundancy and maintaining data consistency.
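The "one copy versus two" question above can be made concrete with a small sketch. This is an illustrative example, not anything from the episode: an in-memory SQLite database stands in for the shared store, a SQL aggregation stands in for the BI path, and row-level Python code stands in for what a DataFrame API would express, both reading the same single copy of the data.

```python
# Sketch: one copy of the data serving both a SQL/BI query and a
# DataFrame-style computation, rather than maintaining separate
# warehouse and lake copies. sqlite3 stands in for the shared store.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 2.5)])

# BI path: declarative SQL aggregation.
sql_result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Data-science path: row-level access to the same copy, aggregated in
# code (a stand-in for what a DataFrame API would express concisely).
totals = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] += amount

assert sql_result == dict(totals)  # same data, same answer, one copy
```

Both paths agree because they read the same store; the redundancy and reconciliation cost only appears once each workload keeps its own copy.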
Transforming data lakes into relational storage systems: By adding transactionality, schemas, and a SQL layer, data lakes become structured relational storage that supports fast analytical and OLAP queries through DataFrame and SQL APIs, handling varied data processing needs within a single system.
Data lakes can now support various data processing needs, including OLAP queries, by turning them into structured relational storage systems. This transformation is achieved by building transactionality into data lakes and adding schemas, quality metrics, and a SQL layer on top. By doing so, data can be reasoned about as structured data in tables, enabling fast analytical queries through DataFrame and SQL APIs. This development caters to the market trend of processing data at different speeds, addressing both batch and streaming analytics use cases with their varying latency requirements, while matching the performance of the fastest MPP (Massively Parallel Processing) engines on structured data. The importance of this development lies in its ability to handle varied data processing needs within the same data lake, providing flexibility and efficiency for businesses.
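The core mechanism described here, a transaction log that makes a pile of files behave like a table, can be sketched in a few lines. This is a deliberately tiny illustration in the spirit of systems like Delta Lake, not their real API: the class name, file layout, and schema check are all invented for this example.

```python
# Minimal sketch of a transactional layer over a file-based data lake,
# loosely in the spirit of Delta Lake's transaction log. All names and
# formats here are illustrative, not the real Delta Lake protocol.
import json
import os
import tempfile

class TinyTableLog:
    """An append-only JSON log recording which data files form a table."""

    def __init__(self, root, schema):
        self.root = root
        self.schema = schema  # column name -> expected Python type
        self.log_path = os.path.join(root, "_log.jsonl")
        os.makedirs(root, exist_ok=True)

    def append(self, rows):
        # Enforce the schema before the write can become visible.
        for row in rows:
            for col, typ in self.schema.items():
                if not isinstance(row.get(col), typ):
                    raise TypeError(f"column {col!r} expects {typ.__name__}")
        # Write the data file first, then commit it by logging it.
        fd, path = tempfile.mkstemp(dir=self.root, suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"add": os.path.basename(path)}) + "\n")

    def read(self):
        # Readers see only files that were committed to the log.
        rows = []
        if not os.path.exists(self.log_path):
            return rows
        with open(self.log_path) as log:
            for line in log:
                entry = json.loads(line)
                with open(os.path.join(self.root, entry["add"])) as f:
                    rows.extend(json.load(f))
        return rows
```

Because readers consult only the log, a half-finished write is invisible until its commit line lands, which is the transactionality the paragraph describes; the schema check is the "schemas and quality" layer in miniature.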
Streaming systems simplify data processing: Streaming systems can reduce latency, eliminate manual reconciliation, and potentially make batch processing obsolete, making them a valuable addition to a modern data stack.
The use of streaming systems in data processing can significantly reduce latency and simplify data operations, even though the common obsession with sub-5-millisecond latency is unnecessary for most use cases. The weakest link in the system, often an upstream process, can dictate latency, making it essential to ensure that data is loaded as quickly as possible. Streaming systems can handle all data operations, eliminating the need for manual reconciliation, joining tables, and dealing with late or inconsistent data. This can simplify data processing and potentially make batch data processing obsolete. A modern data stack, which is still evolving, likely includes a combination of streaming and batch processing systems to address various use cases. If given the freedom to build a data infrastructure from scratch in a large company, focusing on cloud-based solutions for both analytics and AI/ML would be a wise choice to avoid political battles and the complexity of on-premises solutions.
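The claim that streaming can subsume batch rests on a simple property: for many aggregations, folding each event into running state as it arrives yields exactly what a batch recompute over the full dataset would. A minimal sketch (invented example, not any particular streaming engine's API):

```python
# Sketch: an incremental (streaming-style) aggregation that matches a
# batch recompute over the full dataset, illustrating why a streaming
# path can subsume the batch one for this class of query.
from collections import defaultdict

def batch_totals(events):
    """Batch: recompute per-user totals from scratch over all events."""
    totals = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)

class StreamingTotals:
    """Streaming: fold each event into running state as it arrives."""

    def __init__(self):
        self._totals = defaultdict(float)

    def on_event(self, user, amount):
        self._totals[user] += amount  # O(1) update per event

    def snapshot(self):
        return dict(self._totals)
```

The streaming version pays a constant cost per event instead of rescanning everything, which is where the latency reduction comes from; late data is just another event folded into the same state rather than a reconciliation job.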
Embrace the cloud's invisible networks for data processing: In the cloud, land data directly in a data lake without a predefined schema, build a transactional layer on top, and use interactive data science environments to extract insights.
When transitioning to a cloud-native architecture, it's essential not to replicate the on-premises model. Instead, embrace the invisible networks in the cloud that allow for high-speed communication between machines and storage systems. This changes the game for data processing: you can send data directly into a data lake without deciding on a schema upfront. However, to make sense of the data, you need to build a structured, transactional layer on top of it. Additionally, an interactive data science environment is necessary for gaining insights from the data. This environment typically includes notebook solutions with technologies like Spark and can lead to operational machine learning platforms for training, tracking, and moving machine learning models into production.
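"Sending data into the lake without deciding on a schema upfront" is usually called schema-on-read, and it can be sketched briefly. This is an illustrative stdlib-only example, with invented function names, rather than any real ingestion tool: raw records land as JSON lines, and a schema is applied only when the data is read for analysis.

```python
# Sketch of schema-on-read: land raw records with no upfront schema,
# then apply structure only when the data is read for analysis.
import io
import json

def land_raw(sink, records):
    """Ingest: dump records as JSON lines; no schema decided yet."""
    for rec in records:
        sink.write(json.dumps(rec) + "\n")

def read_with_schema(source, schema):
    """Read: project and coerce each record against a schema chosen
    later. Records that don't fit are routed to a rejects list."""
    rows, rejects = [], []
    for line in source:
        raw = json.loads(line)
        try:
            rows.append({col: cast(raw[col]) for col, cast in schema.items()})
        except (KeyError, ValueError):
            rejects.append(raw)
    return rows, rejects
```

The point of the design is that ingestion never blocks on schema decisions, while the quality gate (here, the rejects list) moves to read time, which is exactly the gap the transactional layer described above is meant to close.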
Bridging the gap between data scientists and IT in machine learning: Effective machine learning requires a platform that connects data scientists and IT, with data pipelines and DAG tools transforming raw data and Delta Lake/Iceberg enabling BI tool integration.
For organizations looking to maximize the value of machine learning, it's crucial to have a machine learning platform that effectively bridges the gap between data scientists and IT. The data pipeline and DAG tools play a significant role in this process by transforming raw data into a format suitable for machine learning. However, the challenge lies in connecting traditional Business Intelligence (BI) tools to the data lake. With the emergence of technologies like Delta Lake and Iceberg, it's now possible to connect BI tools directly to the transactional layer of the data lake, making the migration to a data lake more accessible for companies without significant legacy issues. Success stories, such as Uber, demonstrate the competitive advantage of effective machine learning predictions, making the investment in a comprehensive machine learning platform worthwhile. Despite the challenges, the future of machine learning lies in seamless collaboration between data scientists, IT, and advanced data management solutions.
Leveraging modern technologies for effective pricing in ride-sharing services: To remain competitive, enterprises should ensure their data stack is multi-cloud, based on open standards, and uses machine learning and data science to extract valuable insights from raw data in a data lake.
Companies utilizing modern technologies and approaches, such as those discussed in the conversation around ride-sharing services, are leveraging machine learning and multi-cloud solutions to effectively meet the demands of surge pricing and provide accurate pricing to consumers. These companies, which are relatively new and unburdened by legacy systems, have built their stacks specifically for this use case, creating a significant competitive advantage. For enterprises looking to build their data strategy, it's essential to ensure their stack is multi-cloud and based on open standards and open-source technology. This approach provides the flexibility to adapt to changing technologies and avoids lock-in to a specific stack. Additionally, storing data in raw format in a data lake is crucial as the amount of data being collected continues to grow. Machine learning and data science should also be prioritized as first-class citizens within the stack to extract valuable business insights from the data. While the exact shape of machine learning platforms may change, the core disciplines of machine learning and data science will likely remain. By focusing on these areas, enterprises can effectively turn their data into valuable business insights.