Specification of Designing Data Warehouse
For the schema design of the data warehouse, we will use a star schema approach, which consists of one or more fact tables referencing multiple dimension tables. This design is efficient for querying and reporting purposes. → This is the motivation for data warehouse design specifications
1. Schema Design
People have confused about the name of schema and how to put the tables/views in to exact schema… Lets simplify those by functionality split.
For the schema design of the data warehouse, we will use a star schema approach, which consists of one or more fact tables referencing multiple dimension tables. This design is efficient for querying and reporting purposes. Example schemas include:
ETL functional
- Landing: External tables, ingested from sources
- Staging: Cleansing data and re-formatted data, data-transit
- Transforming: Applying business, core data transfer
- Housing: Fact and Dimension structure, aggregation
- Publishing: Sharing, optimizing, access control
- Secret (Rarely existing): it is for customers/clerk data with HIGHEST restriction.
Audit funtional
- Archived/Backup: cold data for historical loads, backfilling
- Failed: for re-processing, investigation
- Validation: for pipeline monitoring, data validation
- Logging: for recording all activities of pipelines
System and Metadata
- Config: storing configuration tables, data of data
- Info: information of data, pipeline, catalog for building lineage.
This is how end-to-end warehouse structure. Yes I know the life is not easy. Dig into it and you will see the beautiful picture.
2. Partitioning Strategy
To optimize query performance and manage large volumes of data efficiently, we will implement partitioning based on time intervals (e.g., monthly or quarterly partitions). This will allow for faster data retrieval and facilitate data pruning during archiving processes.
There are 4 kind of partition I almost seen:
- Datetime: Analytics, preprocessing
- Source: Transformation
- Type: Transformation
- Business: Querying objects for analysis
3. Roles and Users (including Permission Access Control)
Following the Identical Access Management (IAM) access control, there are 3 type of concepts related to permission: user, group, role. It is the core concept of IAM in Cloud.
Spit by the role, we have:
-
Data Stewards: Responsible for managing data quality, ensuring data integrity, and defining data governance policies.
-
Data Engineers: Tasked with designing, developing, and maintaining the data warehouse infrastructure, including ETL processes, data pipelines, and database administration.
- Business Analysts: Utilize the data warehouse to perform ad hoc queries, generate reports, and derive insights for decision-making purposes.
- Data Scientists: Extract actionable insights from the data warehouse through advanced analytics, machine learning, and predictive modeling techniques.
4. Data Lifecycle Management
-
Data Acquisition: Ingest raw data from various sources, including transactional databases, external APIs, and streaming platforms. Data Transformation: Cleanse, enrich, and transform raw data into a structured format suitable for analysis and reporting.
-
Data Storage: Store processed data in the data warehouse, ensuring proper indexing, partitioning, and compression for optimal storage and query performance.
-
Data Usage: Enable users to query and analyze data using BI tools, SQL queries, or custom applications, facilitating data-driven decision-making across the organization.
-
Data Archiving: Periodically archive historical data to lower-cost storage solutions or data lakes, ensuring compliance with data retention policies and regulations.
-
Data Purging: Implement data purging mechanisms to remove obsolete or redundant data from the data warehouse, freeing up storage space and improving query performance.
-
Monitoring and Auditing: Continuously monitor data warehouse performance, usage patterns, and data quality metrics to ensure adherence to SLAs and regulatory requirements. Conduct regular audits to verify data integrity and compliance with data governance policies.
Conclusion
By adhering to these criteria for schema design, partitioning strategy, roles and users, and data lifecycle management, the data warehouse will be well-equipped to support the organization’s analytical and reporting needs while ensuring data quality, integrity, and governance.