Python · scikit-learn · FastAPI · PostgreSQL · Pandas

Fraud Detection System

End-to-end ML pipeline for detecting fraudulent financial transactions using anomaly detection and feature engineering.

Problem

Financial fraud detection means identifying rare, anomalous transactions in highly imbalanced datasets, where fraudulent events represent less than 0.5% of all transactions. Traditional rule-based systems fail to adapt as fraud patterns evolve, which calls for a data-driven approach that learns from the transactions themselves.

Solution

Built a multi-stage pipeline combining unsupervised anomaly detection with supervised classification, designed to handle real-time transaction scoring:

›Temporal feature extraction: rolling transaction velocity, time-of-day patterns
›Behavioral features: deviation from user spending baseline
›Ensemble model combining Isolation Forest and Gradient Boosting
›SMOTE oversampling to handle severe class imbalance
›FastAPI inference endpoint with sub-50ms p99 latency
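
The temporal and behavioral features listed above could be computed along the following lines. This is a minimal sketch, not the project's actual code: the column names (`user_id`, `timestamp`, `amount`), the 60-minute window, and the function name `add_transaction_features` are all assumptions made for illustration.

```python
import pandas as pd


def add_transaction_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Add hour-of-day, 1-hour velocity, and baseline-deviation features.

    Assumes columns: user_id, timestamp (datetime64), amount (float).
    """
    tx = tx.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

    # Time-of-day pattern: hour in which the transaction occurred.
    tx["hour_of_day"] = tx["timestamp"].dt.hour

    # Velocity: the user's transaction count in the trailing 60 minutes,
    # including the current transaction. Rows are already sorted by
    # (user_id, timestamp), so the grouped result lines up positionally.
    velocity = (
        tx.set_index("timestamp")
        .groupby("user_id")["amount"]
        .rolling("60min")
        .count()
    )
    tx["tx_count_1h"] = velocity.to_numpy()

    # Baseline deviation: z-score of the amount against the user's *prior*
    # transactions only (shift(1) keeps the current amount out of its baseline).
    by_user = tx.groupby("user_id")["amount"]
    prior_mean = by_user.transform(lambda s: s.shift(1).expanding().mean())
    prior_std = by_user.transform(lambda s: s.shift(1).expanding().std())
    prior_std = prior_std.where(prior_std > 0)  # undefined or zero spread -> NaN
    tx["amount_zscore"] = ((tx["amount"] - prior_mean) / prior_std).fillna(0.0)
    return tx


# Illustrative data (made up): user "a" has a burst of small payments,
# then one unusually large payment later in the day.
example = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "a", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 10:30",
                "2024-01-01 12:00",
                "2024-01-01 10:05",
            ]
        ),
        "amount": [10.0, 12.0, 11.0, 100.0, 50.0],
    }
)
features = add_transaction_features(example)
print(features[["user_id", "hour_of_day", "tx_count_1h", "amount_zscore"]])
```

Computing the baseline from prior transactions only avoids leaking the current amount into its own reference statistics, which matters most for users with short histories.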

Architecture


# Fraud Detection Pipeline Architecture

Raw Transactions
      │
      ▼
┌─────────────────┐
│  Data Ingestion │  ← Kafka / CSV batch
│  & Validation   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Feature Eng   │  ← Time windows, velocity,
│   Layer         │    user behavior features
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Model Ensemble │  ← Isolation Forest +
│  (Anomaly Det.) │    Gradient Boosting
└────────┬────────┘
         │
      ┌──┴──┐
      │     │
   fraud  legit
      │
      ▼
┌─────────────────┐
│  Fraud Alert    │  ← REST API response
│  API            │    + alerting pipeline
└─────────────────┘
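
The diagram does not specify how the two models in the ensemble stage are combined. One common arrangement, sketched below as an assumption rather than as this project's method, feeds the Isolation Forest anomaly score to the Gradient Boosting classifier as an extra feature. The class name `AnomalyAugmentedClassifier` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest


class AnomalyAugmentedClassifier:
    """Gradient Boosting on the raw features plus an Isolation Forest score."""

    def __init__(self, random_state: int = 0):
        self.iso = IsolationForest(n_estimators=100, random_state=random_state)
        self.gbm = GradientBoostingClassifier(random_state=random_state)

    def _augment(self, X: np.ndarray) -> np.ndarray:
        # score_samples returns lower values for more anomalous points;
        # negate it so that a higher value means "more anomalous".
        anomaly = -self.iso.score_samples(X)
        return np.column_stack([X, anomaly])

    def fit(self, X: np.ndarray, y: np.ndarray) -> "AnomalyAugmentedClassifier":
        self.iso.fit(X)  # unsupervised stage: labels are not used
        self.gbm.fit(self._augment(X), y)  # supervised stage
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # Probability of the positive (fraud) class for each row.
        return self.gbm.predict_proba(self._augment(X))[:, 1]


# Synthetic stand-in data: 2,000 legitimate rows and 20 shifted "fraud" rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (2000, 4)), rng.normal(4.0, 1.0, (20, 4))])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(20, dtype=int)])

model = AnomalyAugmentedClassifier().fit(X, y)
scores = model.predict_proba(X)
print("mean score, legit:", scores[y == 0].mean())
print("mean score, fraud:", scores[y == 1].mean())
```

Scoring the training rows here only checks that the pipeline runs end to end; a real evaluation would use held-out, time-ordered data.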

Tech Stack

›scikit-learn: ML models
›FastAPI: Inference API
›PostgreSQL: Transaction storage
›Pandas: Feature engineering
›imbalanced-learn: SMOTE sampling
›Docker: Service packaging

Challenges

Class Imbalance

Fraud cases made up less than 0.5% of the data. Applied SMOTE, adjusted class weights, and used precision-recall AUC as the primary metric.
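
To illustrate two of these measures, the sketch below trains with balanced sample weights and evaluates with average precision, a standard estimate of the area under the precision-recall curve. SMOTE (from imbalanced-learn) is left out to keep the example to scikit-learn only. The synthetic data, the roughly 0.5% fraud rate, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data with roughly 0.5% positives (100 fraud rows out of 20,100).
rng = np.random.default_rng(42)
n_legit, n_fraud = 20_000, 100
X = np.vstack(
    [rng.normal(0.0, 1.0, (n_legit, 5)), rng.normal(1.5, 1.0, (n_fraud, 5))]
)
y = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# Stratify so the rare class appears in both splits at the same rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# "balanced" weights each row inversely to its class frequency, so the
# few fraud rows carry as much total weight as the many legitimate rows.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train, sample_weight=weights)

pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
baseline = y_test.mean()  # average precision of a random ranking ~ prevalence
print(f"PR AUC (average precision): {pr_auc:.3f}, baseline: {baseline:.4f}")
```

The baseline is the reason PR AUC suits this problem: a random scorer earns an average precision close to the fraud rate, whereas accuracy would already exceed 99% for a model that never flags anything.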

Feature Drift

Fraud patterns evolve over time. Implemented periodic model retraining triggered by distribution shift monitoring.
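
The source does not say which shift statistic drives retraining. One widely used option is the Population Stability Index (PSI), sketched here as an assumption. The function name, the ten quantile bins, and the 0.25 retraining threshold (a common rule of thumb) are illustrative choices, not values taken from the project.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges at the reference quantiles; values outside the
    # reference range fall into the first or last bin.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]

    def bin_fractions(values):
        idx = np.searchsorted(interior, values, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        # Clip so that empty bins do not produce log(0) or division by zero.
        return np.clip(counts / len(values), eps, None)

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


RETRAIN_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff for a major shift

rng = np.random.default_rng(7)
training_amounts = rng.normal(0.0, 1.0, 5000)  # feature at training time
stable_week = rng.normal(0.0, 1.0, 5000)       # same distribution
drifted_week = rng.normal(1.0, 1.0, 5000)      # mean shifted by one std

psi_stable = population_stability_index(training_amounts, stable_week)
psi_drifted = population_stability_index(training_amounts, drifted_week)
print(f"stable: {psi_stable:.4f}, retrain: {psi_stable > RETRAIN_THRESHOLD}")
print(f"drifted: {psi_drifted:.4f}, retrain: {psi_drifted > RETRAIN_THRESHOLD}")
```

In a monitoring job, a check like this would run per feature on a schedule, with retraining triggered when any monitored feature crosses the threshold.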

Latency Requirements

Real-time scoring required a sub-50ms p99 response time. Achieved through model quantization and feature precomputation.
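
Feature precomputation can be pictured as follows: per-user baselines are refreshed offline, so the request path only does a constant-time lookup. This is a sketch under stated assumptions. The in-memory dict stands in for whatever cache or feature store the service uses, and `USER_BASELINES`, `DEFAULT_BASELINE`, `request_features`, and `p99_latency_ms` are hypothetical names. Model quantization is not shown.

```python
import math
import time

# Hypothetical precomputed store: user_id -> (mean_amount, std_amount).
# In a real service this would be refreshed by an offline job.
USER_BASELINES = {"u1": (40.0, 10.0), "u2": (500.0, 120.0)}
DEFAULT_BASELINE = (100.0, 50.0)  # assumed fallback for unseen users


def request_features(user_id: str, amount: float) -> list:
    """Build the request-time feature vector with a single dict lookup."""
    mean, std = USER_BASELINES.get(user_id, DEFAULT_BASELINE)
    return [amount, (amount - mean) / std]


def p99_latency_ms(fn, calls, warmup: int = 100) -> float:
    """Nearest-rank 99th percentile of per-call wall time, in milliseconds."""
    for args in calls[:warmup]:  # warm up before timing
        fn(*args)
    timings = []
    for args in calls:
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    rank = max(0, math.ceil(0.99 * len(timings)) - 1)
    return timings[rank]


calls = [("u1", 70.0), ("u2", 450.0), ("new_user", 100.0)] * 1000
p99 = p99_latency_ms(request_features, calls)
print(f"feature lookup p99: {p99:.4f} ms")
```

The same timing helper can wrap the full scoring function, which is where the 50ms budget actually applies; the measured value depends on the machine and is not a result from this project.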
