Real-Time E-Commerce Analytics Pipeline

Processing 1M+ daily transactions with sub-10 second latency using Apache Kafka and Spark Streaming


Executive Summary

Business Problem: Traditional batch ETL processes introduced 4-8 hour delays in reporting, preventing real-time business decisions on inventory, fraud detection, and customer behavior. This project delivers a production-grade streaming pipeline that processes over 1 million daily transactions with 5-second end-to-end latency.

5s End-to-End Latency
1M+ Transactions/Day
99.9% Uptime SLA
25% Query Improvement

The Challenge

E-commerce platforms generate massive transaction volumes that require real-time processing. The existing batch-based system had critical limitations:

  • Overnight ETL jobs delivered reports 4-8 hours after the fact, so decisions on inventory, fraud detection, and customer behavior were always made on stale data
  • Dashboards showed yesterday's numbers rather than what customers were doing right now
  • The batch infrastructure cost $3,500/month and could not keep pace with growing transaction volume


5-Layer Architecture: From ingestion to monitoring

Key Achievements

Achievement #1: Reduced end-to-end latency from 4 hours to 5 seconds - a 2,880x improvement enabling real-time business decisions.
Achievement #2: Achieved 99.9% uptime SLA with fault-tolerant architecture using 3x replication across Kafka, Spark, and storage layers.
Achievement #3: Improved query performance by 25% through hybrid storage strategy (PostgreSQL + Elasticsearch + Redis caching).
Achievement #4: Reduced infrastructure costs by 48% ($3,500 → $1,800/month) while handling 10x higher throughput.

System Architecture

Designed a 5-layer production architecture handling millions of events with sub-10 second latency:

1️⃣ Streaming Ingestion Layer
Apache Kafka • 3-broker cluster • 12 partitions • 3x replication • Exactly-once semantics
2️⃣ Real-Time Processing Engine
Spark Structured Streaming • Windowed aggregations (1-min, 5-min, 1-hour) • Watermarking • Deduplication
3️⃣ Hybrid Storage Layer
PostgreSQL (aggregated metrics) • Elasticsearch (event logs) • Redis (hot cache with 95% hit rate)
4️⃣ Analytics & Visualization
Grafana dashboards • FastAPI REST endpoints • Sub-2s query response time
5️⃣ Monitoring & Operations
Prometheus (20+ alerts) • AlertManager • PagerDuty integration • 60% faster MTTR
[Data-flow diagram: transactions stream through Kafka topics transactions.raw → transactions.validated → transactions.enriched (<1s ingestion); Spark validates, enriches, deduplicates, and aggregates over 1-min/5-min/1-hour windows (~3s); results land in PostgreSQL (<800ms queries), Elasticsearch (<500ms), and Redis (<200ms, 95% hit rate), feeding Grafana dashboards for 100+ users at ~5s total latency]

Complete data flow from transaction to dashboard in ~5 seconds

Technical Implementation

1. Streaming Ingestion (Apache Kafka)

# Kafka Producer Configuration - Exactly-Once Semantics
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
    enable_idempotence=True,    # Exactly-once delivery
    acks='all',                 # Wait for all replicas
    compression_type='snappy',  # 30% bandwidth reduction
    max_in_flight_requests_per_connection=5
)
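
With the producer above, publishing a validated transaction is a one-liner. A minimal sketch, keying by customer ID so each customer's events stay ordered within a partition; the topic name matches the pipeline's transactions.raw, while the event fields are illustrative assumptions rather than the project's actual schema:

import json

# Illustrative event; these field names are assumptions, not the real schema
event = {
    "transaction_id": "txn-000123",
    "customer_id": "cust-042",
    "country": "GB",
    "revenue": 38.79,
    "event_time": "2024-11-29T10:15:03Z",
}

# Keying by customer_id routes all of a customer's events to the same partition
producer.send(
    "transactions.raw",
    key=event["customer_id"].encode("utf-8"),
    value=json.dumps(event).encode("utf-8"),
)
producer.flush()  # block until the brokers acknowledge (acks='all')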

2. Real-Time Processing (Spark Structured Streaming)

# Spark Streaming Job - Windowed Aggregations
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("transaction-metrics").getOrCreate()

# Read from Kafka
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka-1:9092") \
    .option("subscribe", "transactions.raw") \
    .load()

# Parse the JSON payload into typed columns (schema shown is an assumed example)
schema = ("transaction_id STRING, customer_id STRING, country STRING, "
          "revenue DOUBLE, event_time TIMESTAMP")
parsed_stream = raw_stream \
    .select(from_json(col("value").cast("string"), schema).alias("e")) \
    .select("e.*")

# 1-minute windowed aggregations with watermarking
metrics = parsed_stream \
    .withWatermark("event_time", "2 minutes") \
    .groupBy(
        window(col("event_time"), "1 minute"),
        col("country")
    ) \
    .agg(
        count("*").alias("transaction_count"),
        sum("revenue").alias("total_revenue"),
        avg("revenue").alias("avg_revenue")
    )
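
The aggregates then have to reach the storage layer. Structured Streaming has no streaming JDBC sink, so a common pattern, an assumption about this pipeline rather than a confirmed detail, is foreachBatch, which hands each micro-batch to the regular JDBC writer:

def write_to_postgres(batch_df, batch_id):
    # Each micro-batch is a plain DataFrame, so the standard JDBC writer applies
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/analytics")  # assumed DSN
        .option("dbtable", "metrics_1min")                           # assumed table
        .option("user", "pipeline")
        .option("password", "***")
        .mode("append")
        .save())

query = (metrics.writeStream
    .foreachBatch(write_to_postgres)
    .outputMode("update")
    .option("checkpointLocation", "/checkpoints/metrics_1min")  # enables recovery
    .start())

In production the write would upsert on the (window, country) key rather than blind-append, since watermarked windows can re-emit updated rows within the watermark horizon.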

3. Data Quality Framework
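
The framework sits between parsing and aggregation. As an illustration of the kind of rules it enforces, a minimal PySpark sketch, assuming the column names from the streaming job above and the issue rates reported in the Challenges section (24.9% missing customer IDs, ~2% returns):

from pyspark.sql.functions import col

# Flag rather than drop: events missing a customer_id still count toward
# revenue totals but are excluded from per-customer metrics
flagged = (parsed_stream
    .withColumn("has_customer", col("customer_id").isNotNull())
    .withColumn("is_return", col("revenue") < 0))  # assumed return indicator

clean_sales = flagged.filter(~col("is_return"))   # feeds the aggregations
returns = flagged.filter(col("is_return"))        # routed to a separate stream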

4. Hybrid Storage Strategy

PostgreSQL (TimescaleDB)

  • Time-series aggregated metrics
  • Monthly partitioning for performance
  • Materialized views for dashboards
  • Query time: <800ms (p95)
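
A sketch of how the monthly partitioning and a dashboard materialized view can be set up from Python with psycopg2; table and column names are illustrative assumptions, while create_hypertable and time_bucket are standard TimescaleDB:

import psycopg2

conn = psycopg2.connect("dbname=analytics user=pipeline host=postgres")  # assumed DSN
with conn, conn.cursor() as cur:
    # Aggregates table, chunked into monthly partitions by TimescaleDB
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics_1min (
            bucket        TIMESTAMPTZ NOT NULL,
            country       TEXT,
            txn_count     BIGINT,
            total_revenue NUMERIC
        );
        SELECT create_hypertable('metrics_1min', 'bucket',
                                 chunk_time_interval => INTERVAL '1 month',
                                 if_not_exists => TRUE);
    """)
    # Pre-computed daily rollup behind the dashboard's heaviest panel
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
        SELECT time_bucket('1 day', bucket) AS day, country,
               SUM(total_revenue) AS revenue
        FROM metrics_1min
        GROUP BY day, country;
    """)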

Elasticsearch

  • Full transaction event logs
  • Full-text search capability
  • 90-day retention with ILM
  • Query time: <500ms
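
The 90-day retention maps onto an ILM policy with a delete phase. A sketch via the elasticsearch-py client, with the policy name and rollover age as assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # assumed endpoint

# Roll indices daily; delete them once they are 90 days old
es.ilm.put_lifecycle(
    name="transactions-90d",
    policy={
        "phases": {
            "hot": {"actions": {"rollover": {"max_age": "1d"}}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    },
)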

Redis Cache

  • Hot data with 5-min TTL
  • 95% cache hit rate
  • Top products, real-time KPIs
  • Query time: <200ms
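
The cache is used cache-aside from the API layer: check Redis first, fall back to PostgreSQL, write back with the 5-minute TTL. A minimal sketch; the endpoint path, key name, and the query_top_products_from_postgres helper are hypothetical:

import json

import redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.Redis(host="redis", port=6379, decode_responses=True)  # assumed host

@app.get("/kpis/top-products")
def top_products():
    cached = cache.get("kpi:top_products")
    if cached is not None:  # ~95% of requests, per the hit rate above
        return json.loads(cached)
    result = query_top_products_from_postgres()  # hypothetical fallback query
    cache.set("kpi:top_products", json.dumps(result), ex=300)  # 5-minute TTL
    return result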

Dramatic latency reduction: 4 hours → 5 seconds, plus 48% cost savings

Performance Optimizations

  • Kafka optimization: 140% throughput increase
  • Spark optimization: 67% latency reduction
  • Database optimization: 25% query improvement
  • Caching strategy: 95% cache hit rate
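
Tuning at this layer usually comes down to a handful of knobs. A sketch of the kind involved; every value below is an illustrative assumption, not the project's recorded configuration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("transaction-metrics")
    # Match shuffle parallelism to the 12 Kafka partitions to avoid tiny tasks
    .config("spark.sql.shuffle.partitions", "12")
    .getOrCreate())

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092")
    .option("subscribe", "transactions.raw")
    # Cap how much each micro-batch pulls so traffic bursts degrade gracefully
    .option("maxOffsetsPerTrigger", "50000")
    .load())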

Results & Business Impact

Performance Metrics:

2,880x Faster (4hr → 5s)
10x Throughput Increase
48% Cost Reduction
60% Faster MTTR

Technical Achievements:

  • Exactly-once processing end to end: idempotent Kafka producers plus watermark-based deduplication in Spark
  • 99.9% uptime SLA through 3x replication across Kafka, Spark, and the storage layer
  • 95% Redis cache hit rate and sub-2s dashboard queries

Business Outcomes:

  • Inventory, fraud-detection, and customer-behavior decisions now run on data that is seconds old instead of 4-8 hours old
  • Infrastructure spend fell 48% ($3,500 → $1,800/month) while throughput grew 10x
  • Real-time dashboards serve 100+ concurrent users

[Dashboard panels: today's revenue, transactions in the last hour, active customers, average order value, revenue per minute, top countries, and system health (Kafka lag, processing time, p95 query time, cache hit rate, uptime)]

Live Grafana dashboard with <2s query response and 5s refresh rate

Monitoring & Reliability

Comprehensive Observability: Prometheus scrapes Kafka (via Kafka Exporter), Spark, and the storage layer; Grafana visualizes the metrics, and AlertManager routes pages through PagerDuty. Together these cut mean time to recovery by 60%.

Key Alerts Configured: 20+ alert rules covering consumer lag, end-to-end processing time, p95 query latency, cache hit rate, and component uptime.
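
On the application side, the FastAPI layer can expose its own metrics for Prometheus to scrape. A minimal sketch with the prometheus_client library; the metric names are illustrative, not the project's actual instrumentation:

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names
REQUESTS = Counter("api_requests_total", "API requests served", ["endpoint"])
QUERY_LATENCY = Histogram("api_query_seconds", "Query latency in seconds")

start_http_server(9100)  # serves /metrics for Prometheus to scrape

@QUERY_LATENCY.time()
def top_products_query():
    REQUESTS.labels(endpoint="/kpis/top-products").inc()
    ...  # run the cached lookup shown earlier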

Incident Response Example:

Black Friday Traffic Spike: Consumer lag spiked to 50K messages during peak traffic. Auto-scaling kicked in, adding 3 additional Spark workers. Lag cleared in 12 minutes with zero data loss. This validated our fault-tolerant design under real production stress.

Technology Stack

Streaming & Processing

  • Apache Kafka 3.5
  • Apache Spark 3.5
  • Spark Structured Streaming
  • Kafka Connect

Storage Layer

  • PostgreSQL + TimescaleDB
  • Elasticsearch 8.10
  • Redis 7
  • PgBouncer (connection pooling)

Monitoring & Observability

  • Prometheus
  • Grafana
  • AlertManager
  • Kafka Exporter

Infrastructure & DevOps

  • Docker & Docker Compose
  • Kubernetes (production)
  • Apache Airflow
  • Nginx (load balancing)

Challenges & Solutions

Challenge 1: Kafka Consumer Lag During Peak Hours

Problem: During Black Friday, consumer lag spiked to 50K+ messages, causing dashboard delays.

Solution: Auto-scaling on consumer lag; when the Black Friday backlog hit 50K messages, 3 additional Spark workers were added automatically and the lag cleared in 12 minutes with zero data loss.

Challenge 2: Query Performance Degradation

Problem: Dashboard queries degraded from 800ms to 5+ seconds after 6 months of data accumulation.

Solution: Monthly partitioning of the metrics tables via TimescaleDB plus materialized views for the dashboards' heaviest queries, bringing p95 query time back under 800ms as part of the 25% overall query improvement.

Challenge 3: Data Quality Issues

Problem: 24.9% of transactions had missing customer IDs and roughly 2% were returns, causing inaccurate reporting.

Solution: A validation stage in the Spark job (the data quality framework above) that flags transactions with missing customer IDs and separates returns before aggregation, so reported metrics reflect clean sales data.

Challenge 4: Exactly-Once Processing

Problem: Kafka's at-least-once delivery was producing duplicate transactions in aggregations.

Solution: Idempotent producers on the Kafka side (enable_idempotence=True, acks='all') combined with watermarked deduplication on transaction IDs in Spark, so redelivered events never double-count in aggregations.
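
A sketch of the Spark-side half, assuming each event carries a unique transaction_id as in the parsing sketch above: dropDuplicates combined with a watermark discards redeliveries while bounding the state Spark must retain:

# An event redelivered with the same transaction_id inside the 2-minute
# watermark horizon is dropped; older duplicates fall outside the horizon
# but also outside any still-open aggregation window
deduped = (parsed_stream
    .withWatermark("event_time", "2 minutes")
    .dropDuplicates(["transaction_id", "event_time"]))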

Key Learnings & Skills Developed

Technical Skills: Kafka cluster design and tuning, Spark Structured Streaming (windowing, watermarks, exactly-once semantics), hybrid storage design across PostgreSQL/TimescaleDB, Elasticsearch, and Redis, and production observability with Prometheus, Grafana, and PagerDuty.

System Design Principles:

Collaboration & Communication:

Future Enhancements

Short-term (Next Quarter):

Long-term (Next Year):

Project Resources

Comments & Feedback

Have questions about this project or want to discuss real-time data engineering? I'd love to hear from you!
