Act as a Data Engineer

By admin
April 15, 2026

Prompt

You are an expert data engineer. Your task is to design a robust data pipeline using a medallion architecture (bronze/silver/gold), with PySpark and Delta Lake for the storage layer, dbt contracts for data transformation, and Great Expectations for data quality testing. You will also need to set up a Kafka streaming pipeline for real-time data ingestion, to be in place by 2026. Please provide a detailed technical plan covering all the necessary components and best practices for building this data infrastructure.

Example Outputs

As an expert data engineer, I would recommend the following approach to building a robust data pipeline with a medallion architecture:

Bronze Layer (Raw Data):
- Use PySpark to ingest data from various sources (databases, APIs, files) into a Delta Lake storage layer, which provides reliability, scalability, and schema flexibility.
- Implement data quality checks using Great Expectations to ensure the raw data meets predefined constraints and expectations.
- Partition the data by relevant dimensions (e.g. date, source system) to optimize query performance.

Silver Layer (Structured Data):
- Apply transformation logic using dbt to clean, join, and aggregate the raw data into a dimensional model.
- Define dbt contracts to ensure data quality and consistency across the transformations.
- Leverage Delta Lake features such as time travel and change data feed to maintain a history of data changes.

Gold Layer (Analytics-Ready Data):
- Further refine the data model to support specific analytical use cases.
- Denormalize the data and pre-compute aggregations to enable fast, efficient querying.
- Implement data lineage and data quality monitoring to track the provenance and health of the analytics data.

Streaming Pipeline:
- Set up a Kafka cluster to ingest real-time data streams.
- Use PySpark Structured Streaming to process the incoming data and write it to Delta Lake tables.
- Configure end-to-end monitoring and alerting to ensure the reliability of the streaming pipeline.

This comprehensive data engineering solution will provide a scalable, reliable, and high-performing data infrastructure to support your analytics and business intelligence requirements.
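A Silver-layer dbt contract could look like the following schema file. This is a sketch under the assumption of dbt 1.5+ contract support; the model and column names are hypothetical. With the contract enforced, dbt fails the build if the model's compiled SQL does not produce exactly the declared columns and data types.

```yaml
# models/silver/schema.yml -- model and column names are illustrative.
version: 2

models:
  - name: silver_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: bigint
        constraints:
          - type: not_null
      - name: customer_id
        data_type: bigint
      - name: order_total
        data_type: numeric
      - name: order_date
        data_type: date
```

Enforcing the contract turns schema drift in upstream transformations into a build-time error rather than a silent downstream breakage.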
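The Bronze-layer ingestion described above can be sketched in PySpark. This is a minimal sketch, not a definitive implementation: the lake path, source-system name, and CSV source are illustrative, and it assumes a SparkSession already configured with the delta-spark package. The path-building helper is kept Spark-free so partitioning logic can be reasoned about on its own.

```python
from datetime import date


def bronze_path(base: str, source_system: str, ingest_date: date) -> str:
    """Build the partitioned Bronze-layer path for a given source and date.

    Partitioning by source system and ingest date (as recommended above)
    keeps raw files organized and prunes reads on common filters.
    """
    return f"{base}/bronze/{source_system}/ingest_date={ingest_date.isoformat()}"


def ingest_to_bronze(spark, source_csv: str, target: str) -> None:
    """Land raw CSV data in a Delta table at `target`, appending as-is.

    `spark` is an active SparkSession with Delta Lake configured.
    Bronze stores data unmodified except for lineage columns recording
    when and from which file each row arrived.
    """
    from pyspark.sql import functions as F  # deferred so the helper above stays Spark-free

    (
        spark.read.option("header", "true").csv(source_csv)
        .withColumn("_ingested_at", F.current_timestamp())
        .withColumn("_source_file", F.input_file_name())
        .write.format("delta")
        .mode("append")
        .save(target)
    )


# Example: compute a partition path without touching Spark.
print(bronze_path("s3://lake", "crm", date(2026, 1, 15)))
# → s3://lake/bronze/crm/ingest_date=2026-01-15
```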
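The data quality checks on the Bronze layer would in practice be Great Expectations expectation suites; the library-free sketch below only illustrates the shape of that pattern — named expectations evaluated per row, with a validation result reporting failure counts. Column names and the allowed currency set are made up for the example.

```python
from typing import Any, Callable

# Each expectation is a (name, predicate) pair over a row, in the spirit
# of Great Expectations suites. Names and thresholds are illustrative.
Expectation = tuple[str, Callable[[dict[str, Any]], bool]]

EXPECTATIONS: list[Expectation] = [
    ("order_id is not null", lambda r: r.get("order_id") is not None),
    ("amount is non-negative", lambda r: r.get("amount", 0) >= 0),
    ("currency in allowed set", lambda r: r.get("currency") in {"USD", "EUR", "GBP"}),
]


def validate(rows: list[dict[str, Any]]) -> dict[str, int]:
    """Return a failure count per expectation over all rows."""
    failures = {name: 0 for name, _ in EXPECTATIONS}
    for row in rows:
        for name, check in EXPECTATIONS:
            if not check(row):
                failures[name] += 1
    return failures


rows = [
    {"order_id": 1, "amount": 10.0, "currency": "USD"},
    {"order_id": None, "amount": -5.0, "currency": "USD"},
]
print(validate(rows))
```

In the real pipeline, failed expectations would quarantine rows or fail the run before data is promoted to Silver.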
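The streaming leg can be sketched with Structured Streaming's Kafka source writing into a Bronze Delta table. This is a sketch, not a production job: the topic, bootstrap servers, and paths are placeholders, and it assumes a SparkSession with the Kafka and Delta packages on the classpath. The JSON-decoding helper is separated out so the payload logic is testable without a broker.

```python
import json


def parse_event(raw: bytes) -> dict:
    """Decode a Kafka message value (JSON bytes) into a dict.

    In the stream below the payload is kept as a string column; a helper
    like this (or from_json with a schema) would structure it in Silver.
    """
    return json.loads(raw.decode("utf-8"))


def start_bronze_stream(spark, bootstrap_servers: str, topic: str, target: str):
    """Continuously land a Kafka topic in a Bronze Delta table.

    `spark` is a SparkSession with the kafka and delta packages available.
    The checkpoint location is what gives the stream exactly-once
    delivery into Delta across restarts.
    """
    from pyspark.sql import functions as F  # deferred: keeps parse_event importable without Spark

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
        .select(
            F.col("key").cast("string"),
            F.col("value").cast("string").alias("payload"),
            F.col("timestamp").alias("_kafka_ts"),
        )
    )
    return (
        events.writeStream.format("delta")
        .option("checkpointLocation", f"{target}/_checkpoints")
        .outputMode("append")
        .start(target)
    )


print(parse_event(b'{"order_id": 42, "amount": 10.5}'))
```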
Category: Data Analysis
Tested with: Grok (xAI)
