Deciphering Data Architectures

04.04.24

Introduction

This is a summary of the book Deciphering Data Architectures, which can be found here: https://www.amazon.com/Deciphering-Data-Architectures-Warehouse-Lakehouse/dp/1098150767. It's a great book for getting to grips with the data architectures out there and how to pick the optimal one to get the most out of big data, which is becoming ever more important in the age of AI.

Key Takeaways

Big data is increasingly important to businesses: deployed correctly, it can save millions by yielding insights that predict what will happen in the future. It is therefore crucial to choose the right architecture for storing that data so its use can be optimized.

The relational data warehouse is the oldest data storage solution: a relational database stores all the data the business needs. While defining the schema at write time brings performance benefits, it leaves no flexibility in what can be stored, so a relational warehouse can't keep up with modern demands where data is unstructured and comes from a variety of sources such as IoT devices.
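To make schema-on-write concrete, here is a minimal Python sketch using the standard library's sqlite3 module as a stand-in for a real warehouse database (the table and values are made up): the schema is fixed before any data arrives, and a record that doesn't match it is rejected at write time.

```python
import sqlite3

# Schema is defined before any data arrives (schema-on-write).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product TEXT, amount REAL)")

# A row matching the schema is accepted.
conn.execute("INSERT INTO sales VALUES (1, 'widget', 9.99)")

# A row with an extra field (say, a new IoT sensor timestamp) is rejected:
# the schema would have to be altered before this data could be stored.
try:
    conn.execute("INSERT INTO sales VALUES (2, 'widget', 4.99, '2024-04-04T12:00')")
except sqlite3.OperationalError as e:
    print(f"Write rejected: {e}")
```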

In a data lake, data is stored in its raw format and the schema is defined only when a user reads from the lake. This is an inexpensive way to store data in any format. A data lake is normally used with another architecture such as a data warehouse, acting as a staging storage solution before the data is cleaned and ingested into the warehouse. This adds flexibility to a traditional data warehouse while keeping the advanced querying capabilities a warehouse offers. This design, a data lake used together with a data warehouse, is called the modern data warehouse.
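By contrast, here is a minimal schema-on-read sketch using PySpark, a common engine for querying data lakes (the lake path and field names are hypothetical): the raw JSON files land in the lake as-is, and a schema is only imposed at the moment someone reads them.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("lake-read").getOrCreate()

# The lake just holds raw files; nothing was validated when they were written.
# A schema is applied only now, at read time (schema-on-read).
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("recorded_at", TimestampType()),
])

# Hypothetical lake path; any raw format (JSON, CSV, logs) could sit here.
events = spark.read.schema(schema).json("s3://my-lake/raw/iot-events/")
events.createOrReplaceTempView("iot_events")
spark.sql("SELECT device_id, avg(reading) FROM iot_events GROUP BY device_id").show()
```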

A data fabric is a more advanced modern data warehouse that is integrated across all the business processes to provide continuous analytics, using existing, discoverable, and inferred metadata to support the utilization of all the data collected. Central to this architecture is the metadata catalog, which stores information about the data collected and provides a centralised way to manage and discover assets. The reason to switch from a modern data warehouse to a data fabric is flexibility: the architecture scales more easily and is future-proofed as data demands change, because adding a new data source only requires updating the catalog, not a whole database.
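The book describes the catalog conceptually; as a loose illustration (my own sketch, not the book's implementation), the Python below models a catalog as a simple registry where onboarding a new source is just a new entry, leaving the pipelines that discover sources through the catalog untouched.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str            # discoverable asset name
    location: str        # where the data physically lives
    fmt: str             # e.g. "parquet", "json"
    owner: str           # who to contact about this asset
    tags: list = field(default_factory=list)

class MetadataCatalog:
    """Central registry: pipelines discover sources here instead of hard-coding them."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def discover(self, tag: str) -> list:
        return [e for e in self._entries.values() if tag in e.tags]

catalog = MetadataCatalog()
catalog.register(CatalogEntry("sales", "s3://fabric/sales/", "parquet",
                              "finance", ["transactions"]))
# Adding a new source later is one catalog entry, not a database redesign:
catalog.register(CatalogEntry("iot-events", "s3://fabric/iot/", "json",
                              "ops", ["sensors"]))
print(catalog.discover("sensors"))
```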

A data lakehouse is software built on top of a data lake that gives the lake capabilities like those of a traditional data warehouse. Delta Lake is the most popular implementation of a data lakehouse. Delta Lake works by storing all ingested data in the open source Parquet file format (more information can be found here: https://www.databricks.com/glossary/what-is-parquet). A transaction log is stored alongside the Parquet files, keeping track of all changes made to the data and enhancing its capabilities. The drawback is that any software used with a Delta Lake must support its capabilities or it won't work, although most does. Because of the Parquet file format and the transaction log, advanced features such as compression algorithms, compaction algorithms that merge small files into one, and the ability to time travel are all available in Delta Lake.
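A minimal sketch of this using the open source delta-spark package (assuming it is installed and the Spark session is configured with the Delta extensions; the table path is made up): each write becomes a new version in the transaction log, which is what makes time travel possible.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured for Delta Lake
# (e.g. via the delta-spark package).
spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

path = "/tmp/delta/events"  # hypothetical table location

# Version 0: initial write, stored as Parquet files plus a _delta_log directory.
spark.createDataFrame([(1, "start")], ["id", "status"]) \
     .write.format("delta").save(path)

# Version 1: overwrite; the transaction log records the change instead of losing it.
spark.createDataFrame([(1, "done")], ["id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```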

The data mesh is the newest architecture and the only one that is decentralised. There are four key components to a data mesh.

1) Domain ownership — The data is managed by the people who are closest to it, which allows quicker reads and makes the architecture more scalable; however, it can lead to more security issues, as more assets need to be protected.

2) Data as a product — Treat the data as a product and the people who use it as customers. The data becomes a valuable asset rather than just a byproduct of the customers' actions, which leads to data that is more accessible, trustworthy, and insightful.

3) Self-serve data — This component relates to automating deployment, so there is less chance of human error during deployment and fewer inconsistencies between the environments in the domain.

4) Federated computational governance — Have a central team enforce some global rules so that domains aren't reinventing the wheel, and so that data can be shared between domains easily because everyone follows the same global principles (a sketch of one such computational check follows below).
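As a loose illustration of the "computational" part of that governance (my own sketch, not from the book), the central team could ship a shared check that every domain runs before publishing a data product, so the global rules are enforced in code rather than in meetings. The metadata fields here are hypothetical.

```python
# Hypothetical global rules shipped by the central governance team;
# every domain runs them before publishing a data product.
REQUIRED_METADATA = {"owner", "description", "schema_version", "contains_pii"}

def check_data_product(metadata: dict) -> list:
    """Return a list of global-rule violations (empty list = compliant)."""
    violations = [f"missing metadata field: {f}"
                  for f in REQUIRED_METADATA - metadata.keys()]
    if metadata.get("contains_pii") and not metadata.get("retention_days"):
        violations.append("PII products must declare retention_days")
    return violations

# A domain team validates its product locally before sharing it:
product = {"owner": "sales-team", "description": "daily orders",
           "schema_version": "1.2", "contains_pii": False}
print(check_data_product(product))  # [] -> compliant, safe to publish
```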

Summary

Overall, this was a very good book on data architectures, and there is a lot more detail in it, such as data virtualisation, data marketplaces, and so on. The book also goes into the business side of architectures, like the meetings that need to be in place to find the correct architecture for a business and things to look out for when implementing it, so I highly recommend it for anyone who wants to be a data architect. For more articles like this, please check out my blog: https://blog.blackcoat.co.uk/.


Tags: oreilly, Data, Book Review, Data Science