Source Code (GitHub)
This project showcases my ability to design, build, and operate a full data mart system using modern data engineering tools.
Project Overview
When working with retail or fintech transaction data, businesses need robust data pipelines that can clean, transform, validate, and deliver data for reporting and analytics.
In this project, I built a complete OLAP-style data mart pipeline using open-source tools, fully containerized for production-like deployment.
Tech Stack
- Orchestration: Apache Airflow (Dockerized)
- Data Transformation: dbt (Data Build Tool)
- Data Storage: PostgreSQL (OLAP-style data mart)
- Dashboard & BI: Metabase
- Containerization: Docker Compose
- Cloud Readiness: S3-ready ingestion logic for future extensibility
- Data Quality: dbt tests (accepted range, uniqueness, referential integrity)
Architecture Summary
- Source: UCI Online Retail Dataset (transaction log format)
- ETL Flow:
  - Raw ingestion → PostgreSQL
  - Staging models → dbt transformations
  - Fact and dimension models → star schema design
  - Monthly aggregations → `fct_monthly_sales` table
  - Data quality checks → dbt tests for production readiness
- Orchestration with Airflow (a DAG sketch follows below):
  - Modular DAGs: `ingestion_dag`, `dbt_pipeline_dag`, `full_etl_dag`
  - Easy to extend and schedule for recurring batch jobs
Data Mart Design
- `stg_card_transactions`: staging layer with data cleansing
- `dim_customers`, `dim_products`: dimension tables
- `fct_transactions`: full transaction-level fact table
- `fct_monthly_sales`: monthly aggregated fact table for BI (aggregation logic sketched below)
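For illustration, the pandas sketch below shows the grain of the `fct_monthly_sales` roll-up: one row per customer, product, and month. In the project this logic lives in a dbt SQL model; the column names used here (`invoice_date`, `invoice_no`, `quantity`, `unit_price`, `customer_key`, `product_key`) are assumptions rather than the exact schema.

```python
# Illustrative only: the project builds fct_monthly_sales as a dbt SQL model.
import pandas as pd


def build_monthly_sales(fct_transactions: pd.DataFrame) -> pd.DataFrame:
    """Roll transaction-level facts up to one row per customer, product, and month."""
    df = fct_transactions.copy()

    # Truncate each invoice timestamp to the first day of its month
    df["order_month"] = df["invoice_date"].dt.to_period("M").dt.to_timestamp()
    df["revenue"] = df["quantity"] * df["unit_price"]

    monthly = (
        df.groupby(["order_month", "customer_key", "product_key"], as_index=False)
          .agg(
              total_revenue=("revenue", "sum"),
              total_quantity=("quantity", "sum"),
              order_count=("invoice_no", "nunique"),
          )
    )
    return monthly
```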
Data Quality Controls
Ensured production-grade integrity with dbt tests (equivalent checks are sketched in code after this list):
- Not Null Checks
- Accepted Range Tests (for amount & quantity)
- Unique Keys on surrogate primary keys
- Referential Integrity between fact & dimension models
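dbt declares these tests in the models' YAML schema files. Since that config isn't reproduced here, the sketch below expresses the same four rules as plain SQL assertions run from Python; the table names, column names, range bounds, and connection URL are placeholders.

```python
# Illustrative sketch only: the project enforces these rules as dbt tests.
from sqlalchemy import create_engine, text

# Each query counts rows that violate a rule; zero means the check passes.
CHECKS = {
    "not_null_customer_key": """
        SELECT COUNT(*) FROM fct_transactions WHERE customer_key IS NULL
    """,
    "accepted_range_quantity": """
        SELECT COUNT(*) FROM fct_transactions WHERE quantity <= 0 OR quantity > 10000
    """,
    "unique_surrogate_key": """
        SELECT COUNT(*) FROM (
            SELECT transaction_key FROM fct_transactions
            GROUP BY transaction_key HAVING COUNT(*) > 1
        ) dupes
    """,
    "referential_integrity_customers": """
        SELECT COUNT(*) FROM fct_transactions f
        LEFT JOIN dim_customers c USING (customer_key)
        WHERE c.customer_key IS NULL
    """,
}


def run_checks(database_url: str = "postgresql://user:pass@localhost:5432/mart") -> None:
    engine = create_engine(database_url)
    with engine.connect() as conn:
        for name, sql in CHECKS.items():
            violations = conn.execute(text(sql)).scalar_one()
            status = "PASS" if violations == 0 else f"FAIL ({violations} rows)"
            print(f"{name}: {status}")


if __name__ == "__main__":
    run_checks()
```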
Example Dashboards
Built fully automated dashboards with Metabase:
- Revenue Trends (6-month & current month)
- Average Order Value
- Top Selling Products
- Customer Spending Trends
- Anomaly Detection (suspicious transactions; an example flagging rule is sketched below)
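As a rough illustration of the anomaly-detection idea, the sketch below flags transactions whose amount sits far above a customer's typical spend. The Metabase card itself is built from SQL questions against the mart; the `amount` and `customer_key` columns and the 3-sigma threshold are assumptions.

```python
# Illustrative only: the dashboard is driven by Metabase SQL questions, not this script.
import pandas as pd


def flag_suspicious(fct_transactions: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Return transactions whose amount is unusually high for that customer."""
    df = fct_transactions.copy()

    # Per-customer spending profile
    stats = df.groupby("customer_key")["amount"].agg(["mean", "std"])
    df = df.join(stats, on="customer_key")

    # Distance of each transaction from the customer's typical amount
    df["z_score"] = (df["amount"] - df["mean"]) / df["std"]

    # NaN z-scores (e.g. single-transaction customers) fail the comparison and are dropped
    return df[df["z_score"] > z_threshold].sort_values("z_score", ascending=False)
```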
Pipeline Orchestration
- Dockerized deployment using Docker Compose
- Modular Airflow DAGs for ingestion and transformation
- Fault-tolerant design for batch processing pipelines
dbt Documentation & Lineage
- dbt docs generated with full model documentation
- Column-level metadata and lineage graphs
Key Skills Demonstrated
- Full-stack batch data pipeline architecture
- Data mart design using dbt
- Docker-based orchestration of Airflow, dbt, Metabase, PostgreSQL
- Data quality monitoring using dbt tests
- Automated BI dashboards (Metabase)
- Production-grade engineering mindset: modular, scalable, fault-tolerant
Takeaway
This project simulates real-world batch processing pipelines you'd expect in production data platforms. It demonstrates:
- My ability to own the full pipeline from ingestion to reporting
- My understanding of data validation and observability
- My hands-on experience with modern data stack tools: Airflow, dbt, Docker, Metabase