SQL for Data Analytics
Advanced SQL patterns for data analysis and reporting.

Page 5: Text Analysis with SQL β Mining Meaning from Words π
While SQL isn't a replacement for NLP (Natural Language Processing), it is incredibly efficient at exploring and parsing text data at scale before handing it off to more complex models.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Why use SQL for Text Analysis?
- Efficiency: Filter out millions of irrelevant records in milliseconds.
- Direct Access: Work directly on the production data where text (comments, reviews, logs) is stored.
- Pre-processing: Standardize casing and remove special characters before analysis.
When SQL isn't the right tool: Don't use SQL for sentiment analysis, entity recognition, or translation. Use Python libraries like NLTK or SpaCy for those tasks.
Exploring Text Characteristics
- Length Analysis:
LENGTH(field)to find exceptionally short or long entries. - Pattern Frequency: Finding the most common words or substrings.
Wildcard and Exact Matching
- LIKE / ILIKE: Using
%to find patterns (e.g.,'%error%'). - IN / NOT IN: Filtering for specific lists of keywords.
Regular Expressions (Regex) in SQL
Most modern SQL dialects (BigQuery, PostgreSQL) support Regex for powerful pattern matching:
REGEXP_CONTAINS(field, r'pattern')REGEXP_REPLACE(field, r'pattern', 'replacement')REGEXP_EXTRACT(field, r'(\d+)')
Parsing and Reshaping
- Splitting: Using
SPLIT()orSUBSTRING_INDEX()to break apart strings like URLs or email addresses. - Concatenation: Combing fields with
CONCAT()or||to create unique descriptors. - Parsing JSON/XML: Modern SQL can parse structured text stored as strings within rows.
Use Case: Analyzing URL Parameters
Analysts often need to break down a URL to understand where traffic is coming from:
-- Example (Postgres)
SELECT
split_part(url, 'utm_source=', 2) as source,
split_part(url, 'utm_medium=', 2) as medium
FROM web_logs;