Nami Kim
Data Engineer 🧡 AI ML

Fintech Batch ETL with Redshift Cloud Integration

A production-style batch ETL pipeline for fintech transactions with cloud-ready architecture.

👉 Source Code (GitHub)

This project demonstrates a cloud-integrated batch ETL pipeline for fintech transaction data, built with Airflow, Spark, dbt, Redshift, and S3.
It simulates how a real financial services company can process, validate, and warehouse millions of daily transactions securely and efficiently.

💡 Project Overview

Financial institutions rely on scalable batch data pipelines to process credit card transactions for analytics such as fraud detection, risk scoring, and customer segmentation.

In this project, I built a production-style batch ETL pipeline that is fully containerized and optimized for cloud integration with AWS.

Key features include:

  • End-to-end orchestration with Airflow (see the DAG sketch after this list)
  • Spark transformations for cleansing, deduplication, and schema evolution
  • AWS S3 for Bronze/Silver/Gold data layers
  • Redshift for staging, fact, dimension, and marts
  • dbt for warehouse modeling and data quality tests
  • Great Expectations for validation of curated data
  • IAM-based security, encryption, and cost optimizations
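
To make the orchestration concrete, here is a minimal Airflow DAG sketch of the task chain. The task IDs, script paths, and schedule are illustrative placeholders rather than the project's actual identifiers, and a recent Airflow 2.x release is assumed:

```python
# Minimal sketch of the end-to-end DAG (illustrative names and paths, not the project's actual ones).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                           # retry transient failures
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="fintech_batch_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                      # one batch per day
    catchup=True,                           # allow backfills for past dates
    default_args=default_args,
) as dag:
    # 1) Land raw Faker-generated transactions in the S3 Bronze zone
    ingest_bronze = BashOperator(
        task_id="ingest_bronze",
        bash_command="python /opt/jobs/generate_and_upload.py --ds {{ ds }}",
    )

    # 2) Spark job: Bronze -> Silver (cleanse, dedupe, write Parquet)
    transform_silver = BashOperator(
        task_id="transform_silver",
        bash_command="spark-submit /opt/jobs/bronze_to_silver.py {{ ds }}",
    )

    # 3) Great Expectations checks on the curated Silver data
    validate_silver = BashOperator(
        task_id="validate_silver",
        bash_command="python /opt/jobs/validate_silver.py --ds {{ ds }}",
    )

    # 4) dbt builds and tests the staging/dim/fact/mart models in Redshift
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    ingest_bronze >> transform_silver >> validate_silver >> dbt_build
```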

🔧 Tech Stack

  • Orchestration: Apache Airflow (Dockerized)
  • Data Transformation: Apache Spark
  • Data Storage: AWS S3 (Bronze, Silver, Gold zones)
  • Data Warehouse: AWS Redshift
  • Modeling & Testing: dbt
  • Data Quality: Great Expectations + dbt tests
  • Visualization: Metabase
  • Containerization: Docker Compose
  • Cloud Security: IAM least privilege, S3 encryption

๐Ÿ—๏ธ Architecture Summary

Data Flow:

  • Bronze (Raw Zone)
    • Faker-generated synthetic transactions stored in S3 (JSON/CSV, partitioned by ingest_date).
    • Immutable and schema-on-read for auditing.
  • Silver (Curated Zone)
    • Spark jobs cleanse, deduplicate, enrich, and partition the data into Parquet (see the PySpark sketch below).
    • Validated with Great Expectations.
  • Gold (Business Zone)
    • dbt models in Redshift implementing data marts.
    • Used by BI tools like Metabase for customer & merchant analytics.

[Workflow diagram]
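
As an illustration of the Silver-zone job, here is a hedged PySpark sketch; the bucket names, columns, and dedup key are assumptions, and the real job's enrichment logic is richer:

```python
# Bronze -> Silver sketch: cleanse, deduplicate, and write partitioned Parquet.
# Bucket names and column names are illustrative assumptions.
import sys

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

ingest_date = sys.argv[1]  # e.g. "2024-06-03", passed in by the Airflow task

# Read raw JSON from the Bronze zone (schema-on-read)
bronze = spark.read.json(f"s3a://fintech-bronze/transactions/ingest_date={ingest_date}/")

# Keep only the latest record per transaction_id; drop rows missing mandatory fields
latest = Window.partitionBy("transaction_id").orderBy(F.col("event_ts").desc())

silver = (
    bronze
    .dropna(subset=["transaction_id", "card_id", "amount"])
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .drop("rn")
    .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
    .withColumn("ingest_date", F.lit(ingest_date))
)

# Write curated Parquet to the Silver zone, partitioned by ingest_date
silver.write.mode("overwrite").partitionBy("ingest_date").parquet(
    "s3a://fintech-silver/transactions/"
)
```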

📊 Data Model

  • Staging (stg_*) → 1:1 mapping from the Silver zone
  • Dimensions (dim_*) → cards, merchants, customers
  • Facts (fact_*) → transaction-level events, deduplicated & enriched
  • Marts (mart_*) → RFM, LTV, and cohort-analysis models for BI (an illustrative RFM sketch follows the lineage figure below)

[dbt lineage graph]
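
The marts themselves are dbt SQL models in Redshift; purely to illustrate the kind of logic a mart like the RFM model encodes, here is a rough PySpark equivalent over assumed column names (not the project's actual implementation):

```python
# Illustrative RFM aggregation; the project implements this as a dbt model in Redshift.
# Column names (customer_id, event_ts, amount) are assumptions for the sketch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rfm_sketch").getOrCreate()

fact_transactions = spark.read.parquet("s3a://fintech-silver/transactions/")

rfm = (
    fact_transactions
    .groupBy("customer_id")
    .agg(
        F.max("event_ts").alias("last_purchase_ts"),           # Recency input
        F.countDistinct("transaction_id").alias("frequency"),  # Frequency
        F.sum("amount").alias("monetary"),                     # Monetary
    )
    .withColumn(
        "recency_days",
        F.datediff(F.current_date(), F.to_date("last_purchase_ts")),
    )
)

rfm.show(5)
```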

✅ Data Quality Controls

  • Great Expectations validation on the Silver zone (null checks, enums, ranges); sketched below
  • dbt tests on Redshift models:
    • Uniqueness
    • Referential integrity
    • Not-null constraints

[dbt test results]
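
A minimal sketch of the Silver-zone checks, using the classic (pre-1.0) Great Expectations dataset API; the column names and accepted values are illustrative assumptions, and the project's actual expectation suite may differ:

```python
# Sketch of Silver-zone validation (classic Great Expectations dataset API assumed).
import great_expectations as ge
import pandas as pd

# In the pipeline this would be the curated Silver Parquet; here, a tiny sample frame.
df = pd.DataFrame({
    "transaction_id": ["t1", "t2"],
    "amount": [42.50, 19.99],
    "currency": ["USD", "EUR"],
    "status": ["approved", "declined"],
})

batch = ge.from_pandas(df)

# Null checks
batch.expect_column_values_to_not_be_null("transaction_id")

# Enum (accepted values) checks
batch.expect_column_values_to_be_in_set("status", ["approved", "declined", "refunded"])

# Range checks
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=100_000)

results = batch.validate()
print(results.success)  # fail the Airflow task if False
```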

⚙️ Pipeline Orchestration

  • Airflow DAGs for ingestion, transformation, validation, and dbt runs
  • Backfill & retry handling for late-arriving transactions (up to 2 days); see the lookback sketch after this list
  • Dockerized deployment with all services runnable locally and ready for the cloud
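
One simple pattern for the late-arrival handling is to have each daily run rebuild a small lookback window of ingest_date partitions; a minimal sketch, where the helper and its wiring into the Spark job are assumptions for illustration:

```python
# Sketch: reprocess a 2-day lookback window so late-arriving transactions are captured.
# The 2-day window and ingest_date partitioning follow the project description;
# the helper itself is illustrative.
from datetime import date, timedelta

def partitions_to_reprocess(run_date: date, lookback_days: int = 2) -> list[str]:
    """Return the ingest_date partitions a daily run should rebuild."""
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days + 1)
    ]

# e.g. for the 2024-06-03 run:
print(partitions_to_reprocess(date(2024, 6, 3)))
# ['2024-06-03', '2024-06-02', '2024-06-01']
```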

🔐 Cloud Integration Highlights

  • AWS S3: Partitioned Parquet storage, encryption at rest
  • AWS Redshift: Sort/dist keys for performance, incremental updates for cost control
  • IAM Security: Role-based access with least privilege
  • Secrets Manager: Secure handling of Redshift & S3 credentials (see the boto3 sketch after this list)
  • Future-Ready: Extendable to Redshift Spectrum, Glue, or Iceberg for hybrid lakehouse
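
To ground the security points, here is a small boto3 sketch of two of them: fetching Redshift credentials from Secrets Manager at runtime and writing objects with server-side encryption. The secret name, bucket, and key are placeholders.

```python
# Sketch: Secrets Manager for credentials + encrypted S3 writes (names are placeholders).
import json

import boto3

session = boto3.Session()  # credentials come from the task's IAM role, not hard-coded keys

# Fetch Redshift credentials at runtime instead of baking them into config files
secrets = session.client("secretsmanager")
redshift_creds = json.loads(
    secrets.get_secret_value(SecretId="fintech/redshift")["SecretString"]
)

# Write curated files with server-side encryption at rest
s3 = session.client("s3")
s3.put_object(
    Bucket="fintech-silver",
    Key="transactions/ingest_date=2024-06-03/part-0000.parquet",
    Body=b"...parquet bytes...",
    ServerSideEncryption="aws:kms",   # or "AES256" for SSE-S3
)
```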

🧰 Key Skills Demonstrated

  • Building cloud-native batch ETL pipelines on AWS
  • Designing Bronze → Silver → Gold data lakehouse architecture
  • dbt-based data modeling and testing in Redshift
  • Orchestration with Airflow and Dockerized deployment
  • Data quality monitoring with Great Expectations
  • Applying cloud security best practices in data engineering

🎯 Takeaway

This project simulates a real-world fintech batch data pipeline that runs as a daily batch and is designed to scale into a full AWS cloud deployment.

It demonstrates:

  • My ability to design, build, and operate cloud-integrated data pipelines
  • Strong skills in data modeling, validation, and orchestration
  • Hands-on experience with AWS services and production-grade engineering

👉 Source Code (GitHub)