++++

Languages

Mar 2025×10 min read

Anomaly detection is the process of identifying data points that deviate significantly from the

Page 6: Anomaly Detection — Finding What Doesn

Driptanil DattaSoftware Developer

Page 6: Anomaly Detection — Finding What Doesn't Belong 🚨

Anomaly detection is the process of identifying data points that deviate significantly from the "norm." In data analytics, this is critical for spotting fraud, tracking system errors, or identifying massive shifts in user behavior.

🌍

References & Disclaimer

This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.

What SQL Can (and Can't) Do

SQL is excellent for Statistical Outlier Detection but lacks the capability for complex multidimensional pattern recognition (Clustering) that ML models provide.

Surfacing Outliers with Sorting

The simplest form of anomaly detection:

Sort by value descending to find the highest earners or biggest spenders.
Sort by date to find the oldest or newest records.

Using Percentiles and Standard Deviations

A more rigorous approach involves math:

Z-Score: How many standard deviations is a point away from the mean?
Interquartile Range (IQR): Calculating the 25th and 75th percentiles to define a "normal" range.

-- Identifying values outside 3 Standard Deviations
SELECT * FROM metrics
WHERE ABS(val - AVG(val) OVER()) > 3 * STDDEV(val) OVER();

Visualizing Anomalies

Transforming query results into simple flags for BI tools:

CASE WHEN is_anomaly THEN 1 ELSE 0 END as anomaly_flag

Types of Anomalies

Value Anomalies: A single transaction for $1,000,000 in a system where the average is$ 10.
Frequency Anomalies: A user logging in 500 times in 1 minute.
Absence of Data: A critical drop to ZERO in a metric like "Checkout Start" (usually indicates a bug).

Handling Anomalies

Investigation

Check if the anomaly is a data collection error or a real-world event.

Removal

Excluding extreme outliers from your average to avoid skewed results.

Replacement / Rescaling

Capping values at a certain percentile (Winsorization) to maintain the data point without letting it break the model.

🚫

Caution: Never delete raw data. Always flag or filter in your analytical layer.

5. Text Analysis 7. Experiment Analysis