Data Exploration Dictionary

Data Discovery

List of data dictionaries

This dictionary serves as a comprehensive resource, providing definitions, explanations, and clarifications of essential terms, jargon, and concepts used in data engineering, analytics, and related fields.

Starting with discovery questions makes you curious about what the client actually has and needs, and that is a good starting point for everything that follows. If you want to learn more about how this work begins, first win the proposal that gets you started on a data project, whether it is freelance, outsourced, an in-house product, or something else.

Watch this video: Prepared and built Data Project as Data Engineer / Data Consultant

| Component | Type | Discovery Question |
| --- | --- | --- |
| Data Platform | Pain Points | What are the objectives, key challenges, and pain points related to data management and accessibility that you want to address? |
| Data Platform | DevOps | What is the current DevOps process, including source code management, version control, and release management? Is CI/CD implemented? What Agile practices does the project follow? |
| Data Platform | Data Volume | What is the volume of data in the Data Platform? What is the expected monthly and yearly growth in data volume? |
| Data Platform | Data Source | Can you provide more details about source connectivity, format, refresh cycle, Market Listener, and Fraud Detector? |
| Data Platform | Data Consumers | How many concurrent and peak end users use the Data Platform (ad-hoc users/analysts, scheduled queries/reports, ML training/advanced analytics queries)? |
| Data Platform | DR | What backup and disaster recovery strategies need to be in place for the data lake? |
| Data Platform | Data Flow | Can you provide an overview of the data flow through the data pipelines? |
| Data Platform | Architecture | What steps are being taken to comply with personal data protection requirements? What percentage of data is stored in the local cloud or elsewhere for replication? |
| Data Platform | Architecture | Are you considering changes to the current architecture during migration, or do you only want to lift-and-shift? |
| Data Platform | Data Type | What are the types of source data, the file formats in use, the data type requirements for the future state, and the key challenges in processing any particular type of data? |
| Data Platform | Data Volume | What are the current data set size, future data growth, storage and management requirements, and size and transfer constraints? |
| Data Platform | Data Frequency | What is the schedule of the data ingestion and processing pipelines? |
| Data Platform | Tools/Services | What tools and services make up the Data Platform? |
| Data Platform | Tools/Services | Have you installed Elasticsearch and Redis on Kubernetes on GCP, and MySQL on Compute Engine? What is Cloud SQL being used for other than PostgreSQL, as we notice that its cost is considerably high? |
| Data Platform | Tools/Services | What is the configuration of the Kafka cluster (number of brokers, hardware resources, retention settings, cluster management)? |
| Data Platform | Budget | What is your budget for data processing and analytics platforms? How important is cost optimization to your business? |
| Data Platform | Support | Are there any issues with the support currently being provided? Are you getting adequate support? |
| Data Platform | Users | What is the current team structure? |
| Data Platform | Users | How many users access the Data Platform? |
| Data Platform | Query Performance | What is the SLA for query response time? Are you still facing any performance issues in reporting or data processing? |
| Data Platform | Data Warehouse | Are you using flat-rate or on-demand pricing for service usage? Were any changes made after the 25% increase in rates? |
| Data Platform | Scalability | Are there any issues with the scalability of the platform? |
| Data Engineering | Data Pipelines | Are there any plans to unify the streaming processing pipelines now that Pub/Sub supports schema evolution? Has the change to reduce Kafka partitions been implemented? Were there any performance issues after the change? |
| Data Engineering | Data Pipelines | Can we get more details on the workload? |
| Data Engineering | Data Ingestion | How frequently do you want to update/ingest data (batch/real-time ingest intervals)? Are you considering any changes to the frequency of data processing? |
| Data Engineering | Alerts | How are you monitoring loads? What is the alert mechanism? |
| Data Engineering | Data Ingestion | Are there any issues related to Debezium CDC that need to be addressed? |
| Data Engineering | Data Processing | What types of data processing are being done? Which programming language is used for Dataflow jobs: Python, Scala, or Java? |
| Data Engineering | Data Processing | What key challenges are you facing with respect to data processing? |
| Data Governance | Data Security | Are there any specific data privacy or compliance regulations? How is security and access control management implemented? How is PII data classified for data privacy? |
| Data Governance | Data Retention | What are the specific data retention and archiving requirements? |
| Data Governance | Data Quality | Is any data quality tool used in the current data warehouse? |
| Machine Learning | Scope | What are the main goals of your machine learning projects? Can you provide the number of projects (use cases) and models developed? Are you focused on batch processing, real-time predictions, or both? |
| Machine Learning | Tools | Are any of the tools used for machine learning tightly coupled with GCP? |
| Machine Learning | Tools | What tools are you using to pre-process the data for training? |
| Machine Learning | Frameworks | What common frameworks are being used for ML? |
| Machine Learning | AutoML | Are you using any GCP AutoML services, such as AutoML Natural Language or AutoML Translation? |
| Machine Learning | Managed ML | What are your plans for moving to managed ML services for experimenting, training, metrics logging, model deployment, and monitoring? |
| Reporting | Reporting | How many reports overall have been developed in Power BI? Is it self-service analytics, or is there a dedicated reporting team? What report embedding requirements do you have? |
| Reporting | Reporting | What are the data analytics and reporting requirements across different teams or departments within your organization? Are there any dedicated BSAs? |
| Data Warehouse | New Solution | What are the primary goals and requirements for your new data warehouse solution? Are you considering any specific AWS services (e.g., Redshift, Athena)? |
| Data Warehouse | Pain Points | What challenges or limitations with your current data setup are driving the decision to create a new data warehouse? |
| Data Warehouse | Performance | What are your performance expectations for the new data warehouse, particularly in terms of query response times and data processing capabilities? |
| Data Warehouse | Scalability | How do you anticipate your data volumes and user concurrency to grow? What scalability features are most important for your new data warehouse? |
| Data Warehouse | Cost Optimization | What is your budget for the new data warehouse solution? Are there specific cost optimization strategies or pricing models you’re interested in exploring? |
| Data Lake | Integration | How is your current data lake integrated with your data warehouse? Are you using AWS Lake Formation or S3 for your data lake? |
| Data Lake | Optimization | What are the main pain points in your current data lake setup? Are there any specific areas you’d like to optimize? |
| Data Lake | Data Format | What file formats are you currently using in your data lake (e.g., Parquet, ORC, JSON)? Are you considering any changes to optimize for analytics? |
| Access Control | Current Setup | How are you currently managing access control for your data warehouse and data lake? Are you using AWS IAM, Lake Formation permissions, or other solutions? |
| Access Control | Requirements | What are your specific requirements for data access control? Do you need row-level security, column-level security, or other fine-grained access controls? |
| Access Control | Compliance | Are there any specific compliance requirements (e.g., GDPR, HIPAA) that impact your access control needs? |
| Data Processing | ETL/ELT | What is your current approach to data processing and transformation? Are you using AWS Glue, EMR, or other services? |
| Data Processing | Streaming | Do you have any real-time or near-real-time data processing requirements? Are you using services like Kinesis or MSK? |
| Data Quality | Current Practices | How do you currently ensure data quality in your warehouse and lake? Are you using any AWS or third-party tools for data validation and cleansing? |
| Data Governance | Metadata | How do you manage metadata for your data assets? Are you using AWS Glue Data Catalog or considering other cataloging solutions? |
| Data Governance | Lineage | Do you have requirements for data lineage tracking? How do you currently manage this? |
| Analytics | BI Tools | What business intelligence or analytics tools are you currently using with your data warehouse? Are they well-integrated with AWS services? |
| Analytics | ML/AI | Do you have any machine learning or AI initiatives that need to be supported by the data warehouse? Are you using Amazon SageMaker or other ML platforms? |
| Migration | Strategy | If migrating from another platform, what is your preferred migration strategy (e.g., lift-and-shift, re-architecting)? What are your main concerns about the migration process? |
| Budget | Constraints | What is your budget for the data warehouse optimization project? Are there any specific cost-saving targets? |
| Timeline | Project Schedule | What is your expected timeline for implementing changes or optimizations to your data warehouse and lake? |
| Team | Skills | What is the skill set of your current data team? Are there any specific AWS technologies or services they are particularly experienced with or interested in learning? |
| Future Plans | Roadmap | What are your long-term plans for data analytics and warehousing? Are there any upcoming projects or initiatives that might impact the data warehouse design? |
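
One way to work with a list like this during a client workshop is to hold each row as a small record and check which components still have open questions. The following is only a minimal sketch: the `DiscoveryQuestion` class, the `unanswered_by_component` helper, and the sample answer are hypothetical names and values chosen for illustration, not part of any existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscoveryQuestion:
    """One row of the discovery list (hypothetical structure)."""
    component: str                 # e.g. "Data Platform"
    question_type: str             # e.g. "Data Volume"
    question: str
    answer: Optional[str] = None   # filled in once the client responds


def unanswered_by_component(questions):
    """Group the types of still-unanswered questions by component."""
    pending = defaultdict(list)
    for q in questions:
        # Treat a missing or whitespace-only answer as unanswered.
        if not (q.answer and q.answer.strip()):
            pending[q.component].append(q.question_type)
    return dict(pending)


# Three sample rows; the answer text is made up for illustration.
questions = [
    DiscoveryQuestion("Data Platform", "Data Volume",
                      "What is the volume of data in the Data Platform?",
                      answer="Illustrative answer: roughly 40 TB"),
    DiscoveryQuestion("Data Platform", "DR",
                      "What backup and disaster recovery strategies need to be in place?"),
    DiscoveryQuestion("Data Governance", "Data Retention",
                      "What are the specific data retention and archiving requirements?"),
]

print(unanswered_by_component(questions))
# {'Data Platform': ['DR'], 'Data Governance': ['Data Retention']}
```

Grouping by component makes it easy to see which area of the engagement still lacks input before a proposal is drafted.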

These questions help assess the current state of the client’s data warehouse and data lake, identify areas for optimization, and guide the proposal for an improved solution using AWS services. They place particular emphasis on access control, integration with the data lake, and optimization.
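
If the answers to the Data Volume and Scalability questions come back as a current size plus a monthly growth rate, a quick projection can help frame the sizing discussion. The sketch below assumes simple compound growth at a constant monthly rate, which real platforms may not follow; the function names and the sample figures (10 TB, 5% per month) are illustrative assumptions only.

```python
def project_volume(current_tb: float, monthly_growth_rate: float, months: int) -> float:
    """Project data volume forward, assuming constant compound monthly growth."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return current_tb * (1 + monthly_growth_rate) ** months


def yearly_rate(monthly_growth_rate: float) -> float:
    """Convert a compound monthly growth rate into the equivalent yearly rate."""
    return (1 + monthly_growth_rate) ** 12 - 1


# Illustrative figures: 10 TB today, growing 5% per month.
print(f"{project_volume(10, 0.05, 12):.2f} TB after 12 months")  # 17.96 TB after 12 months
print(f"{yearly_rate(0.05):.1%} equivalent yearly growth")       # 79.6% equivalent yearly growth
```

Note that a 5% monthly rate compounds to about 80% per year, not 60%, which is why it helps to ask for both the monthly and the yearly figures and reconcile them.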