As data is growing to be a substantial part of your business, creating the right storage strategy is crucial
As data becomes a substantial part of your business processes, the right storage strategy is crucial if your organisation is to benefit from what that data has to offer. The market offers a growing number of solutions, ranging from data lakes to data hubs and data warehouses. Each of these may fulfil your business requirements in a different way. The choice is yours, but where to start? In this article, we’ll dive a bit deeper into the differences between these solutions and what to keep in mind when defining your software architecture strategy.
THE DATA WAREHOUSE
A known data solution since the early 90s, this type of data storage is optimised for repeatable processes like reporting and data analysis. The main goal of a data warehouse is to answer enterprise questions based on historical data, mainly coming from operational systems. To do so, a data warehouse is optimised to work with processed data. This allows for efficient storage and serves a large enterprise audience of business analysts.
To us, the key features are that data is (1) stored in a structured, relational form, (2) structured on write and (3) highly curated. Enterprises opt for data warehouses when they want data coming from operational systems readily available for analysis. Data warehouses are expensive storage solutions, but when used smartly they bring true value to your business. They are one-directional solutions optimised for analytics and hardly cover any other use case.
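To make "structured on write" more concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table, columns and sample rows are hypothetical and only serve the illustration: rows that do not fit the schema are rejected at load time, so the reporting query can rely on clean, curated data.

```python
import sqlite3

# Hypothetical table and column names, chosen only for this illustration.
SCHEMA = """
CREATE TABLE sales (
    order_id   INTEGER PRIMARY KEY,
    region     TEXT NOT NULL,
    amount_eur REAL NOT NULL CHECK (amount_eur >= 0)
)
"""


def build_warehouse(rows):
    """Load rows into an in-memory table, rejecting any that break the schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales (order_id, region, amount_eur) VALUES (?, ?, ?)",
                row,
            )
        except sqlite3.IntegrityError:
            # The schema is enforced on write: bad rows never reach the table.
            rejected.append(row)
    conn.commit()
    return conn, rejected


def revenue_by_region(conn):
    """A repeatable reporting query over the curated table."""
    cursor = conn.execute(
        "SELECT region, SUM(amount_eur) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


rows = [
    (1, "north", 120.0),
    (2, "south", 80.0),
    (3, "north", 30.0),
    (4, None, 50.0),      # violates NOT NULL on region
    (5, "south", -10.0),  # violates the CHECK constraint
]
conn, rejected = build_warehouse(rows)
report = revenue_by_region(conn)
print(report)
print(rejected)
```

The cost of this approach is that every source has to be cleaned and mapped to the schema before loading, which is part of why warehouses are curated and comparatively expensive.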
Recommended solutions to look into are Amazon Redshift, Teradata, Cloudera, Panoply and BigQuery.
THE DATA LAKE
The main goal of a data lake is to serve exploration and analysis beyond the existing context of the enterprise. This solution aims to facilitate machine learning and AI on all enterprise data. It is built for flexible processing, variable semantics and generic use cases. To achieve this, a data lake is built to be a single store of all enterprise data. This store allows all types of data to be imported: structured, unstructured, relational or even raw binary data. This makes ingestion easy, but it also creates a lot of challenges around data extraction, governance and data infusion for further use.
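The contrast between easy ingestion and hard extraction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real data lake: the "lake" is a plain list, and the payloads, sources and field names are hypothetical.

```python
import json

# The "lake" is just a list of raw payloads stored exactly as received.
lake = []


def ingest(payload: bytes, source: str) -> None:
    """Ingestion is trivial: keep the bytes untouched, plus minimal metadata."""
    lake.append({"source": source, "raw": payload})


def read_meter_values(entries):
    """Extraction is where the work is: interpret each payload at read time."""
    values = []
    skipped = 0
    for entry in entries:
        try:
            record = json.loads(entry["raw"].decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            skipped += 1  # e.g. binary data this reader cannot interpret
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        # Different sources name the same field differently.
        value = record.get("kwh", record.get("consumption_kwh"))
        if value is None:
            skipped += 1
            continue
        values.append(float(value))
    return values, skipped


ingest(b'{"meter": "A1", "kwh": 12.5}', source="billing")
ingest(b'{"meter_id": "B7", "consumption_kwh": "7.25"}', source="crm")
ingest(b"\x89PNG\r\n\x1a\n", source="scans")  # raw binary
ingest(b'{"note": "customer called"}', source="support")

values, skipped = read_meter_values(lake)
print(values, skipped)
```

Nothing is rejected on the way in, so every reader has to carry its own knowledge of formats and field names. That is the governance and extraction burden described above.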
As main benefits, we consider a data lake low in storage cost and, if used correctly, better suited for machine learning and AI. Enterprises adopt data lake solutions when they look for a flexible system that allows for advanced data mechanics.
A lot of data lake projects fail at the last step of implementation. The hardest part is not building the lake or putting data in it, but making good use of mainly unstructured data. Without that, data lakes often turn into data swamps. The key is to build the expertise to prepare data, extract features and perform machine learning. When this is not done properly, it is often a true showstopper, leading to a poor ROI and a lot of hidden costs.
The best-known data lake solutions on the market are Snowflake, Azure Data Lake and Apache Hadoop.
THE DATA HUB
A ‘new’ concept in the data storage industry, the hub positions itself as the go-to place for operational data within an enterprise. Where data warehouses and data lakes serve as endpoints for data collection, data hubs serve more as points of mediation and data sharing. The main goal of a data hub is to organise data efficiently, store it in a cost-efficient manner and expose it to key business functions. It excels at easy integration and enables de-duplication, security, quality and data standardisation.
As main benefits, we find that the data hub shines when it is leveraged to enable data processing activities with the end use case in mind, and that it typically comes with governance capabilities. Although operationally focused, it can be trusted as an analytical data source. A data hub truly delivers value throughout the enterprise when used to reduce system integration costs by automating ETL jobs and data pipelines, and by relying on pre-packaged connectors and standardised data ingestion interfaces.
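Standardisation and de-duplication, two of the hub capabilities mentioned above, could look roughly like this in Python. The source systems, field names and the choice of email address as the matching key are assumptions made for the sketch. A real hub would use richer matching rules.

```python
def standardise(record: dict) -> dict:
    """Map source-specific fields onto one shared shape (hypothetical fields)."""
    email = (record.get("email") or record.get("Email") or "").strip().lower()
    name = (record.get("name") or record.get("full_name") or "").strip()
    return {"email": email, "name": name}


def deduplicate(records):
    """Keep the first record seen per email address, in arrival order."""
    seen = {}
    for record in map(standardise, records):
        if record["email"] and record["email"] not in seen:
            seen[record["email"]] = record
    return list(seen.values())


# Two hypothetical source systems describing partly the same customers.
crm = [{"Email": " Ann@Example.com ", "full_name": "Ann Lee"}]
billing = [
    {"email": "ann@example.com", "name": "A. Lee"},
    {"email": "bob@example.com", "name": "Bob Roy"},
]

customers = deduplicate(crm + billing)
print(customers)
```

Because the mapping lives in one place, each new source only needs its own fields added to `standardise` rather than a separate integration with every consumer.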
Existing data hub solutions are Apache Hadoop, Google MapReduce, Cloudera CDH and our very own utility-specific Gorilla data hub.
DATA WAREHOUSE, DATA LAKE OR DATA HUB?
According to Gartner, enterprises using a cohesive strategy incorporating data hubs, lakes and warehouses will support 30% more use cases than competitors. When designing your digital solution architecture, we recommend mapping the different types of use cases, the required flexibility and the existing ecosystem. As you might have already noticed, meeting those requirements is not a matter of choosing one of the three options. Data warehouses, lakes and hubs reinforce each other and are each built with a clear purpose in mind. The solution lies in finding the right balance between the three.
- Apply data hubs to reinforce your architecture and create an agile system that collects and connects data sources and consumers. Set yourself up for success by adopting tools with industry knowledge to enforce top-level data quality.
- Use data lakes for generic use cases such as exploration and innovation. Data lakes are best placed to discover the unknown. A data lake can be integrated with one or more data hubs to serve as a research base for your data scientists.
- Install data warehouses for data analytics, use-case optimisation and broader consumption. Enable your data analysts to gain key business insights based on data coming from data hubs.
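The combination recommended in the list above can be sketched as a hub that connects sources to both a lake and a warehouse. Everything here is a simplified, in-memory illustration: the `DataHub` class, the topic name and the record fields are hypothetical and do not represent any specific product.

```python
from collections import defaultdict


class DataHub:
    """Hypothetical in-memory hub: sources publish, consumers subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, record):
        # Every consumer registered for the topic receives the same record.
        for handler in self._subscribers[topic]:
            handler(record)


lake = []                       # keeps every record for exploration
warehouse = defaultdict(float)  # keeps a curated aggregate for reporting


def to_lake(record):
    lake.append(record)


def to_warehouse(record):
    warehouse[record["region"]] += record["kwh"]


hub = DataHub()
hub.subscribe("consumption", to_lake)
hub.subscribe("consumption", to_warehouse)

hub.publish("consumption", {"region": "north", "kwh": 10.0})
hub.publish("consumption", {"region": "north", "kwh": 5.5})
hub.publish("consumption", {"region": "south", "kwh": 3.0})

print(len(lake), dict(warehouse))
```

In this arrangement the sources only know about the hub, so adding another consumer is a single `subscribe` call rather than a new point-to-point integration.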
Drive innovation with a solution designed for continuous platform evolution. With Gorilla we’ve created the very first utility-focused data hub, with built-in support for consumption, pricing, billing and customer data.