Act as a Data Engineer
Prompt
You are an expert data engineer. Your task is to design a robust data pipeline using a medallion architecture (bronze/silver/gold), PySpark and Delta Lake for the data storage layer, dbt contracts for data transformation, and Great Expectations for data quality testing. You will also need to set up a Kafka streaming pipeline for real-time data ingestion (by 2026). Please provide a detailed technical plan covering all the necessary components and best practices for building this data infrastructure.
Example Outputs
As an expert data engineer, I would recommend the following approach to building a robust data pipeline with a medallion architecture:

Bronze Layer (Raw Data):
- Use PySpark to ingest data from various sources (databases, APIs, files) into a Delta Lake storage layer. This provides reliability, scalability, and schema flexibility.
- Implement data quality checks with Great Expectations to ensure the raw data meets predefined constraints and expectations.
- Partition the data by relevant dimensions (e.g. date, source system) to optimize query performance.

Silver Layer (Structured Data):
- Apply transformation logic with dbt to clean, join, and aggregate the raw data into a dimensional model.
- Define dbt contracts to enforce data quality and consistency across the transformations.
- Leverage Delta Lake features such as time travel and Change Data Feed to maintain a history of data changes.

Gold Layer (Analytics-Ready Data):
- Further refine the data model to support specific analytical use cases.
- Denormalize the data and pre-compute aggregations to enable fast, efficient querying.
- Implement data lineage and data quality monitoring to track the provenance and health of the analytics data.

Streaming Pipeline:
- Set up a Kafka cluster to ingest real-time data streams.
- Use PySpark Structured Streaming to process the incoming data and write it to Delta Lake.
- Configure end-to-end monitoring and alerting to ensure the reliability of the streaming pipeline.

This data engineering solution provides a scalable, reliable, and high-performing data infrastructure to support your analytics and business intelligence requirements.
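A dbt contract for a silver-layer model could look like the following fragment (dbt 1.5+ contract syntax). The model name `silver_orders` and its columns are illustrative, not taken from the prompt:

```yaml
# Hypothetical model contract for a silver-layer table (dbt >= 1.5).
# Column names and types are examples only.
models:
  - name: silver_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: bigint
        constraints:
          - type: not_null
      - name: order_date
        data_type: date
      - name: amount
        data_type: double
```

With `enforced: true`, dbt fails the build if the model's result schema drifts from the declared columns and types, which is what makes the contract a hard guarantee for downstream consumers rather than documentation.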
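The "validate before promoting" pattern behind the bronze-to-silver step can be sketched without a Spark cluster. In the real pipeline these rules would live in a Great Expectations suite run against the Delta table; here plain Python keeps the sketch self-contained, and the record fields (`id`, `amount`) are hypothetical:

```python
# Minimal stand-in for the bronze -> silver quality gate. Production code
# would express these rules as Great Expectations expectations; the logic
# ("validate, then promote or quarantine") is the same.

def is_valid(record: dict) -> bool:
    """A record passes if the (hypothetical) contract fields are sound."""
    return (
        record.get("id") is not None
        and isinstance(record.get("amount"), (int, float))
        and record["amount"] >= 0
    )

def promote_to_silver(bronze_records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split raw records into promoted (silver) rows and quarantined rows."""
    silver, quarantine = [], []
    for rec in bronze_records:
        (silver if is_valid(rec) else quarantine).append(rec)
    return silver, quarantine
```

Quarantining failed rows rather than dropping them preserves the raw data for later inspection, which mirrors how failed expectation results are typically routed to a dead-letter table.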
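The micro-batch discipline that Structured Streaming applies to the Kafka-to-Delta leg can also be illustrated without Kafka or Spark: consume in fixed-size batches and advance the committed offset only after a batch is durably written, so a failed write is retried rather than lost (at-least-once delivery). All names below are illustrative stand-ins:

```python
from collections import deque
from itertools import islice

def run_micro_batches(events, batch_size, write_batch):
    """Drain `events` in micro-batches, committing offsets after each write.

    `write_batch` stands in for the Delta sink: if it raises, the items are
    still on the queue and the batch is reprocessed on the next attempt.
    """
    queue = deque(events)
    committed_offset = 0
    while queue:
        n = min(batch_size, len(queue))
        batch = list(islice(queue, n))   # peek at the next batch, don't pop yet
        write_batch(batch)               # durable write to the (mock) sink
        for _ in range(n):
            queue.popleft()              # "commit": drop only after success
        committed_offset += n
    return committed_offset
```

Structured Streaming implements the same idea with checkpointed Kafka offsets: the offset log is advanced only after the sink write for that micro-batch completes.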
Category:
Data Analysis
Tested with:
Grok (xAI)