Source Code (GitHub)
This project simulates a fintech payment platform by building a real-time transaction monitoring pipeline.
It integrates streaming, batch processing, data modeling, and BI dashboards into one cohesive architecture, exercising a skill set that applies directly to real-world payment platforms.
Project Overview
Financial institutions process millions of transactions per day. To support fraud detection, revenue tracking, and customer analytics, their data pipelines must combine real-time monitoring with batch aggregations.
In this project, I designed and built a production-style end-to-end data pipeline that demonstrates:
- Streaming ingestion and fraud detection with Kafka + Spark Structured Streaming (producer and consumer sketches appear below)
- Batch modeling and aggregations with dbt
- Orchestration with Airflow
- Data quality validation using dbt tests
- Business dashboards powered by Metabase
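
To make the streaming layer concrete, here is a minimal producer sketch. It uses the kafka-python client, and the topic name (`transactions`), broker address, and event fields are illustrative assumptions rather than the repository's exact schema.

```python
# Minimal transaction generator (illustrative sketch, using kafka-python).
# Topic name, broker address, and event fields are placeholders.
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

MERCHANT_CATEGORIES = ["grocery", "electronics", "travel", "gaming", "fashion"]

def fake_transaction() -> dict:
    """Build one synthetic card/e-commerce transaction event."""
    return {
        "transaction_id": str(uuid.uuid4()),
        "customer_id": random.randint(1, 5_000),
        "merchant_id": random.randint(1, 500),
        "merchant_category": random.choice(MERCHANT_CATEGORIES),
        "amount": round(random.uniform(1, 2_000), 2),
        "currency": "USD",
        "country": random.choice(["US", "GB", "DE", "NG", "IN"]),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    while True:
        producer.send("transactions", fake_transaction())  # assumed topic name
        time.sleep(0.1)  # roughly 10 events per second
```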
Tech Stack
- Streaming Ingestion: Kafka
- Stream Processing: Spark Structured Streaming
- Data Storage: Postgres (warehouse schema with fact/dim tables)
- Data Modeling: dbt (transformations, aggregations, star schema)
- Orchestration: Apache Airflow
- Visualization: Metabase
- Containerization: Docker Compose
- Programming Language: Python
Architecture Summary
Flow of Data:
- Transaction Generator (Kafka Producer): simulates live card/e-commerce transactions
- Kafka Broker: streams transactions in real time
- Spark Structured Streaming: consumes, validates, and applies fraud detection logic (see the sketch after this list)
- Postgres: stores fact and dimension data for analytics
- dbt: transforms and aggregates into business models
- Airflow: orchestrates dbt runs and batch jobs
- Metabase: dashboards for fraud monitoring and KPIs
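
The consumer side can be sketched roughly as a PySpark Structured Streaming job that reads the Kafka topic, applies a placeholder rule-based fraud flag, and appends each micro-batch to the Postgres fact table over JDBC. Topic name, column names, thresholds, connection details, and package versions are assumptions for illustration, not the repository's exact implementation.

```python
# Illustrative Spark Structured Streaming job: Kafka -> fraud flag -> Postgres.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

spark = (
    SparkSession.builder
    .appName("transaction-monitor")
    # Kafka source and Postgres JDBC driver pulled as packages;
    # versions here are examples and must match the installed Spark build.
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,"
            "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("customer_id", IntegerType()),
    StructField("merchant_id", IntegerType()),
    StructField("merchant_category", StringType()),
    StructField("amount", DoubleType()),
    StructField("currency", StringType()),
    StructField("country", StringType()),
    StructField("created_at", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")        # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("created_at", F.to_timestamp("created_at"))
    # Simple rule-based flag as a placeholder for richer fraud logic.
    .withColumn("is_fraud_suspected", F.col("amount") > 1500)
)

def write_to_postgres(batch_df, batch_id):
    """Append each micro-batch into the warehouse fact table via JDBC."""
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/warehouse")  # assumed DSN
        .option("dbtable", "fact_transaction")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save())

(events.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/transactions")
    .start()
    .awaitTermination())
```

In a fuller version the single amount threshold would give way to stateful signals such as per-customer transaction velocity; the structure of the job stays the same.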
Data Model
- Fact Table: fact_transaction
  - Transaction-level events with fraud flags, geolocation, and payment metadata
- Dimension Tables: dim_customer, dim_merchant
  - Customer and merchant attributes for analytics
- Batch Models (dbt):
  - Daily revenue summary (an example model follows this section)
  - Customer KPIs (avg spend, fraud ratio)
  - Merchant analytics (sales, fraud by category/location)
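
As an example of the batch layer, the daily revenue summary could be expressed as a dbt model roughly like the one below. It assumes a dbt source named `payments` pointing at the Spark-loaded `fact_transaction` table and the columns `amount`, `is_fraud_suspected`, and `created_at`; these names are illustrative, not the project's exact definitions.

```sql
-- models/marts/daily_revenue_summary.sql (illustrative sketch)
select
    date_trunc('day', created_at)::date                        as revenue_date,
    count(*)                                                    as transaction_count,
    sum(amount)                                                 as gross_revenue,
    coalesce(sum(case when is_fraud_suspected then amount end), 0) as flagged_amount,
    avg(amount)                                                 as avg_ticket_size
from {{ source('payments', 'fact_transaction') }}
group by 1
order by 1
```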
Data Quality Controls
dbt tests enforce production-grade reliability (an example schema.yml follows this list):
- Unique keys on surrogate primary keys
- Not-null constraints on business-critical fields
- Referential integrity between fact and dimension tables
- Accepted value checks for categorical fields
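
These checks map onto dbt's built-in generic tests (unique, not_null, relationships, accepted_values). A hypothetical schema.yml excerpt, with assumed column names and category values, might look like this:

```yaml
# models/marts/schema.yml (illustrative sketch; column names are assumptions)
version: 2

models:
  - name: fact_transaction
    columns:
      - name: transaction_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: merchant_category
        tests:
          - accepted_values:
              values: ['grocery', 'electronics', 'travel', 'gaming', 'fashion']
```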
Dashboards
Dashboards were built in Metabase to provide real-time and batch views:
- Daily revenue trends
- Fraud detection overview (suspicious transactions flagged)
- Customer spending distribution
- Merchant-level analytics
Pipeline Orchestration
- Airflow DAGs schedule daily dbt runs and enforce retries (a DAG sketch follows this list)
- Modular tasks for ingestion, transformation, and testing
- Fault-tolerant design ensuring late-arriving transactions are captured
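
A minimal DAG expressing this schedule-plus-retries pattern might look like the sketch below (Airflow 2.x assumed; paths, schedule, and retry settings are placeholders rather than the project's exact configuration).

```python
# dags/dbt_daily.py (illustrative sketch; paths and settings are assumptions)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 3,                        # retry transient failures
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="dbt_daily_batch",
    description="Daily dbt transformations and tests over the streaming warehouse",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    dbt_run >> dbt_test  # only test models after they have been rebuilt
```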
Future Improvements
- Deploy pipeline on Kubernetes for scalability
- Integrate Kafka Connect for CDC simulation
- Extend fraud detection with stateful ML models
- Move warehouse layer to BigQuery or Redshift for cloud scalability
- Add monitoring via Prometheus + Grafana
Key Skills Demonstrated
- Designing end-to-end data pipelines (streaming + batch)
- Implementing real-time ingestion with Kafka + Spark
- Building OLAP-style schemas with dbt (fact/dim, star schema)
- Orchestration with Airflow and containerized deployment
- Data quality monitoring with dbt tests
- Business-focused dashboarding with Metabase
- Production-oriented mindset: scalable, modular, fault-tolerant
Takeaway
This project simulates the real-time and batch data integration required in modern fintech and payment platforms. It demonstrates:
- My ability to design and deploy streaming-first architectures
- Hands-on experience with the modern data engineering stack
- A strong focus on data quality, observability, and business outcomes