As the data lake becomes the first destination for a growing volume and variety of data, data teams are under pressure to deliver access to that data for analytics. Cloud data lakes provide businesses with an economical and efficient means of storing data, but that data has remained notoriously hard to leverage for analytics, particularly for Business Intelligence (BI) and reporting workloads.
Fortunately, new technologies for managing and providing access to data in the data lake are quickly gaining momentum, including table formats that combine many of the data management and governance capabilities of the data warehouse with the scalability and flexibility of the data lake.
One such technology is Apache Iceberg, an open table format for cloud data lakes that was originally developed at Netflix. After outgrowing its Hive-based architecture, Netflix built Iceberg to efficiently manage tables at modern data volumes and scale.
In this blog post, we will explain what Iceberg is and why data teams should consider leveraging it to streamline data operations and deliver high performance BI and reporting directly on data in the data lake.
What is Iceberg?
Iceberg is an open table format that gives data engineers and data architects many of the tools they need to manage data directly on the data lake. Comparable solutions include Hive, Hudi, and Delta Lake.
One of Iceberg’s key improvements over Hive is the shift from folder-level to file-level data tracking, which eliminates many of Hive’s performance and consistency issues. Because a table’s state is an explicit list of committed files rather than whatever happens to sit in a directory, file management within a table is standardized.
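To make the difference concrete, here is a small Python sketch, not the Iceberg API; all paths and names are invented for illustration. A Hive-style reader must list a directory and guess which files are valid, while an Iceberg-style reader trusts an explicit manifest that only ever contains committed files.

```python
# Toy illustration (not the real Iceberg API) of folder-level vs. file-level tracking.
# Hive-style: the table is "whatever files are in the directory right now",
# so a reader can observe half-written files mid-job.
hive_style_directory = [
    "warehouse/events/dt=2024-01-01/part-000.parquet",
    "warehouse/events/dt=2024-01-01/part-001.parquet.inprogress",  # partial write!
]

# Iceberg-style: the table is an explicit list of committed files.
iceberg_style_manifest = [
    "warehouse/events/dt=2024-01-01/part-000.parquet",
]

def hive_readable_files(listing):
    # Readers must infer validity from file names alone -- fragile by design.
    return [f for f in listing if not f.endswith(".inprogress")]

def iceberg_readable_files(manifest):
    # Readers trust the manifest; uncommitted files are simply never listed.
    return list(manifest)

print(hive_readable_files(hive_style_directory))
print(iceberg_readable_files(iceberg_style_manifest))
```

Both calls return the same single committed file here, but only the manifest-based approach guarantees it: the directory listing depends on a naming convention every writer must honor.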
Iceberg consists of three distinct layers:
- The top level is the Iceberg catalog. It stores a pointer to the current metadata file for each table, pinpointing where to read or write data for that table.
- The middle layer is the metadata layer. It consists of metadata files, which track a table’s schema, partition spec, and snapshots; manifest lists, which record the manifest files that make up each snapshot; and manifest files, which track data files along with their statistics.
- The final layer is the data layer: the data files themselves, described in the manifests by partition membership, record counts, and the lower and upper bounds of column values. Data can be stored in multiple open file formats, including Apache Parquet, ORC, and Avro.
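The three layers above can be sketched as a chain of lookups. The following is a minimal toy model in Python, with invented file names and a simplified structure, not real Iceberg metadata: the catalog points at the current metadata file, the metadata file names the current snapshot, the snapshot’s manifest list names manifest files, and those manifests list the data files with their statistics.

```python
# Toy model of Iceberg's three layers (illustrative names, not real Iceberg files).

# Catalog layer: table name -> current metadata pointer.
catalog = {"db.events": "metadata/v3.metadata.json"}

# Metadata layer: schema, partition spec, and snapshots.
metadata_files = {
    "metadata/v3.metadata.json": {
        "schema": ["event_id: long", "dt: date"],
        "partition_spec": ["dt"],
        "current_snapshot": "snap-2",
        "snapshots": {"snap-2": "metadata/snap-2.manifest-list.avro"},
    }
}

# One manifest list per snapshot, naming the manifest files in that snapshot.
manifest_lists = {
    "metadata/snap-2.manifest-list.avro": ["metadata/manifest-a.avro"],
}

# Manifest files track data files plus per-file statistics.
manifests = {
    "metadata/manifest-a.avro": [
        {"path": "data/dt=2024-01-01/part-000.parquet",
         "record_count": 1000,
         "bounds": {"event_id": (1, 1000)}},
    ]
}

def data_files(table_name):
    """Walk catalog -> metadata file -> manifest list -> manifests -> data files."""
    meta = metadata_files[catalog[table_name]]
    manifest_list = manifest_lists[meta["snapshots"][meta["current_snapshot"]]]
    return [entry["path"] for m in manifest_list for entry in manifests[m]]

print(data_files("db.events"))  # -> ['data/dt=2024-01-01/part-000.parquet']
```

Note that a query planner never has to list storage directories: every layer is an explicit pointer, which is what makes atomic commits (swap one pointer) and fast planning (read stats from manifests) possible.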
Iceberg was purpose-built for scale and interoperability. It can manage exabyte-level volumes of data, and it supports multiple engines so data teams can use the right tool for the right workload.
Advantages of Iceberg
Iceberg provides several advantages for data teams:
- Schema evolution: With legacy tools and architectures, simple column adds, drops, and renames looked straightforward on the surface but often forced table rewrites or produced incorrect results. With Iceberg, schema changes are metadata-only operations, so data teams can adjust the schema as data continues to grow without rewriting existing files.
- Partition evolution: Iceberg facilitates the modification of partition layouts without forcing a rewrite of the entire table.
- Version rollback: Data teams can quickly correct problems by resetting the tables to a known “good” state.
- Transactional consistency: Iceberg avoids partial or uncommitted changes by tracking atomic transactions with ACID properties.
- Optimized queries: Iceberg’s hidden partitioning applies partition filters automatically, so users cannot accidentally write the slow, full-scan queries that misused partition columns cause in Hive.
- Time travel: One of our favorite features, time travel gives users access to previous table versions for comparison and reproduction of queries.
- Increased performance: Iceberg automatically prunes files using partition data and column-level statistics, so queries read only the files they need.
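Two of the advantages above, time travel and version rollback, both fall out of the same design: every commit produces a new immutable snapshot, and the table is just a pointer to the current one. This toy Python sketch (again, not the Iceberg API; all names are invented) shows why rollback is a pointer move rather than a data rewrite.

```python
# Toy sketch of snapshot-based time travel and rollback (not the real Iceberg API).
snapshots = {}   # snapshot_id -> immutable tuple of data files
history = []     # commit order
current = None   # the "table" is just this pointer

def commit(snapshot_id, files):
    """Each commit adds a new immutable snapshot and advances the pointer."""
    global current
    snapshots[snapshot_id] = tuple(files)
    history.append(snapshot_id)
    current = snapshot_id

def read(as_of=None):
    """Time travel: read any retained snapshot, not just the latest."""
    return snapshots[as_of if as_of is not None else current]

def rollback(snapshot_id):
    """Version rollback: reset the pointer to a known-good snapshot."""
    global current
    assert snapshot_id in snapshots
    current = snapshot_id

commit("snap-1", ["part-000.parquet"])
commit("snap-2", ["part-000.parquet", "part-001.parquet"])  # suppose this load was bad
rollback("snap-1")
print(read())                # -> ('part-000.parquet',)
print(read(as_of="snap-2"))  # the bad snapshot is still queryable for debugging
```

Because old snapshots are immutable and retained, rolling back never loses the “bad” version: analysts can still time-travel to it to diagnose what went wrong.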
For data consumers, Iceberg unlocks analytics on the data lake. End users always have a correct and consistent view of a table, a “single source of truth.” Iceberg delivers improved query performance and scales with growing data volumes, concurrent users, and applications.
Most importantly, from an analytics perspective, Iceberg significantly reduces the time to insight for the newest and fastest-growing volumes of customer and operational data. Previously, many data teams had to rely on complex Extract, Transform & Load (ETL) pipelines to move semi-structured and unstructured data into proprietary data warehouse platforms. Iceberg brings data management and data governance directly to the data lake, eliminating the need for complicated workarounds to provide data access.
Dremio: Get Started with Iceberg Today
Dremio is the easy and open lakehouse platform. It combines Dremio Arctic, an intelligent metastore built on Apache Iceberg that automates and optimizes data management and data governance, and Dremio Sonar, a SQL query engine, query accelerator, and semantic layer built on Apache Arrow, to deliver the data management and data analytics capabilities of the data warehouse directly on data lake storage.
Dremio is one of the easiest ways to get started with Iceberg tables, and Dremio Sonar is a powerful analytics platform that allows data consumers of all skill levels to quickly and easily access and start exploring their data.
Capitalize Can Help
Capitalize Consulting has the experience and skills to help your organization get the most out of Dremio and Iceberg tables. If you’d like to speak with one of our experts, please reach out to us at info@capitalizeconsulting.com.