Data Platform Guide
This is a collection of books and courses I can personally recommend. They are great for every data engineering learner, and I have used or owned these books in my professional work.
The goal is to implement a robust Data Platform Design framework that combines Data Engineering and Automation for Data Platform Operations and Analytics.
Data Engineering Fundamentals
What is Data Platform Design?
The Data Platform Design Framework goes beyond traditional data scaling.
Data Platform Design is a set of practices and processes for managing the data lifecycle, from data ingestion to processing and analysis, in a way that ensures high quality and reliability.
The Data Platform Design framework provides a variety of tools to manage the data lifecycle, automate data processing and analysis, and maintain high data quality. It helps companies reduce the effort of data operations and take advantage of data insights.
Goals: Improve collaboration between data professionals, enhance data quality, and speed up data-related tasks.
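The lifecycle described above (ingestion, then processing, then analysis, with a quality check along the way) can be sketched in a few lines of Python. This is a minimal illustration only: the `Record` shape, the function names, and the validation rule are all assumptions made for the example, not part of DataPods or any specific tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Record:
    """Hypothetical event shape used only for this illustration."""
    user_id: str
    amount: float


def ingest(raw_rows):
    """Ingestion: parse raw dicts into typed records, skipping malformed rows."""
    records = []
    for row in raw_rows:
        try:
            records.append(Record(user_id=str(row["user_id"]), amount=float(row["amount"])))
        except (KeyError, TypeError, ValueError):
            continue  # a real platform would route these rows to a dead-letter store
    return records


def validate(records):
    """Quality gate: keep only records that pass basic checks (assumed rule)."""
    return [r for r in records if r.user_id and r.amount >= 0]


def process(records):
    """Processing: aggregate amounts per user."""
    totals = {}
    for r in records:
        totals[r.user_id] = totals.get(r.user_id, 0.0) + r.amount
    return totals


def analyze(totals):
    """Analysis: derive a simple summary from the processed data."""
    return {"users": len(totals), "avg_total": mean(totals.values()) if totals else 0.0}


def run_pipeline(raw_rows):
    """Chain the lifecycle stages end to end."""
    return analyze(process(validate(ingest(raw_rows))))
```

Keeping each stage a plain function makes it easy to test the quality gate in isolation and to swap one stage for a real tool later.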
Set of expectations for data platform design:
- Highly available, redundant configuration services run within the platform.
- Zero-downtime capability with granular monitoring.
- Auto-scaling across services.
- All services are maintained and governed by Governance, which is the backbone of the platform.
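To make the auto-scaling expectation concrete, here is a hedged sketch of a proportional scaling rule, modeled on the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)). The function name, the replica bounds, and the default values are assumptions chosen for illustration; a real platform would delegate this decision to its orchestrator.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return a replica count using a proportional scaling rule.

    All defaults here are illustrative assumptions, not platform settings.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the metric is close enough to the target,
    # which avoids flapping on small fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas running at 90% CPU against a 60% target would scale out to 6, while a reading of 62% stays at 4 because it falls within the tolerance band.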
I describe the Data Platform Design Framework in 5 layers to help readers conceptualize and contextualize it.
I’ve started DataPods - Open Source Data Platform Ops to help readers understand the Data Platform Design Framework and how to implement it. DataPods is a comprehensive starter kit that provides:
1. Production-like configurations
2. Easy deployment options (K8s/Docker)
3. Best-in-class open source tools
Resources: From Internet
- Designing Data-Intensive Applications - Legit
- Building the Data Warehouse, Bill Inmon - Legit
- Data Engineering Nanodegree (Udacity) - Overview, Demo
- Big Data Specialization (Coursera)
- Learning Spark
- The Data Warehouse Lifecycle Toolkit by Ralph Kimball and Laura Reeves - Legit
- Data Engineering
- Patterns of DE: Online Data Engineering Design Pattern by Simon
- Use cases of DE: [Vu Trinh Substack](https://vutr.substack.com/)
- Open Modern Data Platform: Starburst Galaxy
- Open Source Data Stack Summary
- Summary of books I have read: DEH-Books
Design Data Platform
Fact: Every data platform I have designed has had the 5 essential components I describe in this book: Data Engineering Handbook.
For data scaling techniques, I focus on Data Warehouse Scaling and Data Pipeline Scaling. For intake, we have to cover:
Another book in which I cover the Data Platform Design Framework and a guide to establishing DataOps is Serverless Data Platform (WIP). It will cover how the platform works, with examples of the AWS services used in a data platform.
AWS, Azure, and GCP are service providers for centralizing the control, maintenance, operation, and management of the data platform and data infrastructure.
Note: This is the list of books, blogs, and courses that I have personally finished or am still working through. I highly recommend them to everyone.
Updated 2024-12-12: I have added notes on Platform Ops alongside Data Engineering.
Overview
- How Open Source Applications Work
- Serverless Data Pipeline
- Specification of Designing Data Pipeline
- Building the Data Warehouse, Bill Inmon
- Data Modeling with Snowflake, Serge Gershkovich
- The Data Engineering Cookbook, Andreas Kretz
- Data Engineering Patterns on the Cloud, Bartosz Konieczny
- C4 Architecture
- Introduction to Data Engineering, Daniel Beach
- Data With Rust - Re-write Data Engineering in Rust, Karim Jedda
- Data Pipelines Pocket Reference, James Densmore
- Designing Data-Intensive Applications
- DAMA-DMBOK: Data Management Body of Knowledge
- Streaming Systems, Tyler Akidau, Slava Chernyak, Reuven Lax
- High Performance Spark, Holden Karau, Rachel Warren
- Data Pipelines with Apache Airflow
- Fundamentals of Data Observability, Andy Petrella
- Scaling Machine Learning with Spark, Adi Polak
- Deciphering Data Architectures, James Serra
- Architecture Patterns with Python
- Learning Spark, Brooke Wenig, Denny Lee, Tathagata Das, Jules Damji
- The Unified Star Schema, Bill Inmon
- Data Engineering Book by Oleg Agapov: Accumulated knowledge and experience in the field of Data Engineering
Papers
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
- Bolt-on causal consistency by Peter Bailis, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica
- BigQuery: Creating a table definition file for an external data source
- FSST: Fast random access string compression by Peter Boncz, Thomas Neumann, and Viktor Leis
- Building a database on S3 by Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, and Tim Kraska
- Data validation for machine learning by Eric Breck, Marty Zinkevich, Neoklis Polyzotis, Steven Whang, and Sudip Roy
- No silver bullet – essence and accidents of software engineering by Fred Brooks
- The Snowflake elastic data warehouse by Benoit Dageville et al.
- GraphFrames: An integrated API for mixing graph and relational queries by Ankur Dave et al.
- How to move beyond a monolithic data lake to a distributed data mesh by Zhamak Dehghani
- The BigDAWG polystore system by Jeff Duggan et al.
- White-box compression: Learning and exploiting compact table representations by Bogdan Ghita et al.
- Feature store: The missing data layer in ML pipelines? by Klas Hammar and Jim Dowling
- Here are my data files. Here are my queries. Where are my results? by Stratos Idreos et al.
- Enabling and optimizing non-linear feature interactions in factorized linear algebra by Shangyu Li et al.
- MLlib: Machine learning in Apache Spark by Xiangrui Meng et al.
- A computer oriented geodetic data base; and a new technique in file sequencing by G. M. Morton
- Starling: A scalable query engine on cloud functions by Matthew Perron et al.
- Missing the forest for the trees: End-to-end AI application performance in edge data centers by Daniel Richins et al.
- Presto: SQL on everything by Raghav Sethi et al.
- Why the ‘data lake’ is really a ‘data swamp’ by Michael Stonebraker
- Skipping-oriented partitioning for columnar layouts by Li Sun et al.
- Hive - a petabyte scale data warehouse using Hadoop by Ashish Thusoo et al.
- SparkR: Scaling R programs with Spark by Shivaram Venkataraman et al.
- Accelerating the machine learning lifecycle with MLflow by Matei Zaharia et al.
- Dimensions based data clustering and zone maps by M. Ziauddin et al.
Case studies
Practical Examples for Data Engineering
- Big Data Framework
- Kappa Data Pipeline aka Realtime using AWS
- Data Modeling and Analytic Engineering
- Data pipeline with Open Source Mage AI and ClickHouse
- AWS Ingestion Pipeline
- Azure Data Pipeline in 1 hour
- Design ETL Pipeline for Interview Assessment
- How to do everything
- https://www.ssp.sh/blog/open-data-stack-core-tools/
Check out the documentation: Hands-on with Data Open Source
Bonus
Additional Recommendations:
- Certifications: Consider certifications like AWS Certified Data Engineer Associate, Google Cloud Professional Data Engineer, or Microsoft Azure Data Engineer Associate.
- Open-source projects: Contribute to open-source data engineering projects to gain practical experience.
- Online communities: Engage with data engineering communities on platforms like Stack Overflow, Reddit, and LinkedIn.
- Networking: Build relationships with other data engineers to learn from their experiences.
- Remember: This is a general roadmap. The specific courses, books, and practices may vary depending on your experience level, industry, and technology stack.
My universe
TODO:
- [ ] Set up DNS / domain name / sub-domain name
- longdatadevlog.com & longdatadevlog.com/brain: My knowledge hub and digital garden
- : Professional profile and LinkedIn presence
- https://de-book.longdatadevlog.com/datacamping/index.html & https://de-book.longdatadevlog.com/04-HandsOnCourse/index.html : Coding projects and GitHub portfolio
- github.com/longbuivan/dotfile: My development environment setup
- https://www.longdatadevlog.com/brain : Curated insights and updates
- https://datapods-oss.vercel.app/ & https://stringx.longdatadevlog.com/category/start-here: Showcasing my creative projects
- payhip.com/longdatadevlog & use.longdatadevlog.com:
- de-book.longdatadevlog.com: An unfinished book about timeless practices
- longdatadevlog.com/brain: The Thoughts section is what I’m currently focused on