Python · Isolation Forest · PySpark

Scalable Financial Anomaly Detection Pipeline

Processing 10M+ daily transactions to identify fraud patterns in real time using unsupervised learning and distributed computing.

  • Daily transactions: 10M+
  • Inference latency: 150 ms
  • Uptime: 99.9%
  • Fraud prevented: ~$1M

The challenge

The legacy rule-based system was failing to catch sophisticated “long-tail” anomalies in corporate transaction data. The replacement had to ingest high-throughput streams from Oracle DB, normalize the data, and flag relative outliers (transactions that are unusual for their context rather than simply large) without blocking legitimate high-value transfers.
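To make "relative outliers" concrete, here is a minimal sketch of one way to score a transfer against its own account's history instead of against a global amount limit. The source does not describe the actual normalization; the log transform, the median/MAD score, the 3.5 cut-off, and the account names are all illustrative assumptions.

```python
import math
from statistics import median

# Hypothetical cut-off; 3.5 is a common rule of thumb for MAD-based scores.
THRESHOLD = 3.5


def robust_z_scores(amounts):
    """Score amounts against their own history using median and MAD on a log scale."""
    logs = [math.log1p(a) for a in amounts]
    med = median(logs)
    mad = median([abs(x - med) for x in logs])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data;
    # the small floor avoids division by zero when all amounts are equal.
    scale = max(1.4826 * mad, 1e-9)
    return [(x - med) / scale for x in logs]


def flag_relative_outliers(history_by_account, threshold=THRESHOLD):
    """Return, per account, the amounts that are unusual for that account."""
    flagged = {}
    for account, amounts in history_by_account.items():
        scores = robust_z_scores(amounts)
        flagged[account] = [a for a, z in zip(amounts, scores) if abs(z) > threshold]
    return flagged


# Made-up histories: a small account and a treasury account (illustrative only).
history = {
    "acct_small": [120, 95, 130, 110, 105, 100, 5000],
    "acct_treasury": [900_000, 1_100_000, 1_000_000, 950_000, 1_050_000, 1_000_000],
}
flagged = flag_relative_outliers(history)
print(flagged)
```

In this toy data the 5,000 transfer is flagged because it is far outside the small account's usual range, while the million-scale transfers pass because they are routine for the treasury account.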

System architecture

[ Kafka → Spark Streaming → Isolation Forest → Redis ]

The pipeline runs on Apache Spark for distributed processing and follows a lambda architecture:

  • Speed layer: ingests transaction logs via Kafka.
  • Batch layer: retrains the Isolation Forest weekly on historical data in S3.
  • Serving: model artifacts served via a scalable FastAPI microservice on Kubernetes.
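The hand-off between the layers above can be sketched in plain Python: the batch layer publishes a freshly trained model, and the speed layer keeps scoring with whatever version is current. This is only a sketch under stated assumptions. `ModelSlot`, the version strings, and the toy scorers are hypothetical stand-ins for the real Spark, S3, and serving components, which the source does not detail.

```python
import threading
from typing import Callable, Sequence, Tuple

# A scorer maps one feature vector to an anomaly score (hypothetical interface).
Scorer = Callable[[Sequence[float]], float]


class ModelSlot:
    """Holds the model used for live scoring; the batch layer replaces it."""

    def __init__(self, scorer: Scorer, version: str) -> None:
        self._lock = threading.Lock()
        self._scorer = scorer
        self._version = version

    def publish(self, scorer: Scorer, version: str) -> None:
        """Called by the batch layer after a retrain (weekly in the source)."""
        with self._lock:
            self._scorer, self._version = scorer, version

    def score(self, features: Sequence[float]) -> Tuple[float, str]:
        """Called by the speed layer for each incoming transaction."""
        with self._lock:
            scorer, version = self._scorer, self._version
        # Score outside the lock so a slow model never blocks a publish.
        return scorer(features), version


# Toy scorers standing in for two weekly model versions.
slot = ModelSlot(lambda f: sum(f), "week-01")
before = slot.score([1.0, 2.0])

slot.publish(lambda f: max(f), "week-02")
after = slot.score([1.0, 2.0])

print(before, after)
```

Returning the version with each score is a design choice that makes it possible to trace any flagged transaction back to the model that produced it.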

The ML solution

We moved away from supervised learning given the lack of labeled fraud data, instead employing an ensemble of Isolation Forests and autoencoders.
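The source does not say how the outputs of the two detectors are combined. One common option, shown here purely as an assumption, is to rank-normalize each detector's scores and average them, which avoids having to put Isolation Forest scores and autoencoder reconstruction errors on the same scale. The function names and sample numbers are illustrative.

```python
def rank_normalize(scores):
    """Map each score to the fraction of scores that are <= it, in (0, 1]."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]


def combine_scores(forest_scores, reconstruction_errors):
    """Average the rank-normalized scores; higher means more anomalous.

    Both inputs must already follow "higher = more anomalous". scikit-learn's
    IsolationForest.score_samples uses the opposite sign (lower = more
    abnormal), so its output would need to be negated first.
    """
    forest_ranks = rank_normalize(forest_scores)
    error_ranks = rank_normalize(reconstruction_errors)
    return [(f + e) / 2 for f, e in zip(forest_ranks, error_ranks)]


# Made-up scores for four transactions (illustrative only).
forest_scores = [0.40, 0.45, 0.70, 0.42]
reconstruction_errors = [0.02, 0.05, 0.90, 0.03]

combined = combine_scores(forest_scores, reconstruction_errors)
most_anomalous = combined.index(max(combined))
print(combined, most_anomalous)
```

Rank averaging only preserves ordering, so it suits ranking transactions for review; a fixed alerting threshold would need calibrated scores instead.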

    from sklearn.ensemble import IsolationForest

    # Model configuration
    clf = IsolationForest(
        n_estimators=100,
        max_samples='auto',
        contamination=0.01,
        random_state=42,
    )
    clf.fit(X_train)
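For context, here is a minimal sketch of how a forest with this configuration could be used to score new records. The training data is synthetic (the source does not show `X_train` or the feature set), so the two-column Gaussian sample and the test points are assumptions made only to keep the example runnable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for X_train: 1,000 "normal" records with two features.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# Same configuration as in the write-up.
clf = IsolationForest(
    n_estimators=100,
    max_samples="auto",
    contamination=0.01,
    random_state=42,
)
clf.fit(X_train)

# One typical record and one far outside the training distribution.
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = clf.predict(X_new)            # 1 = inlier, -1 = anomaly
scores = clf.decision_function(X_new)  # negative values indicate anomalies

for row, label, score in zip(X_new, labels, scores):
    print(row, "anomaly" if label == -1 else "normal", round(float(score), 3))
```

`contamination=0.01` sets the decision threshold so that roughly 1% of the training records fall on the anomalous side. It is an assumed anomaly rate, not something the model learns.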

Business impact

The new system reduced false positives by 40% versus the legacy rule-based engine. It runs in production today, monitoring over $500M in daily transaction volume.