Kaviarasan Mani kaviarasanmani

🧪 I don't just test data pipelines — I build the tools that test them.

I'm a Senior SDET (SDET III) at UST specializing in Data Quality Engineering and ETL Automation Testing. With 4+ years of experience validating production-scale PySpark pipelines (10M+ records/day), I sit at the intersection of data engineering and QA — catching the bugs that hide inside your data, not just your code.

In 2026, I published ValidateX, a lightweight Python data quality validation framework, to PyPI. After years of writing the same validation boilerplate across projects, I decided to ship it as a library instead.

pip install validatex

🚀 Featured: ValidateX

A lightweight, production-ready data quality validation framework for Python.

Supports Pandas & PySpark • 25+ built-in expectations • Weighted quality scoring • Modern HTML reports

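import pandas as pd
import validatex as vx

# Load the data to validate (hypothetical file name, for illustration only)
df = pd.read_csv("users.csv")
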
suite = (
    vx.ExpectationSuite("production_data")
    .add("expect_column_to_not_be_null",          column="user_id")
    .add("expect_column_values_to_be_unique",      column="user_id")
    .add("expect_column_values_to_be_between",     column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_match_regex",    column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

result = vx.validate(df, suite)
print(result.summary())          # Data Quality Score: 97/100
result.to_html("report.html")    # Beautiful dark-theme HTML report

Why ValidateX?

	ValidateX	Great Expectations
Setup	`pip install` → validate in 5 lines	Multi-step setup with contexts & stores
Quality Score	✅ Weighted 0–100	❌
Severity Levels	✅ Critical / Warning / Info	❌
CI/CD CLI	✅ Built-in	❌
Learning Curve	Minutes	Hours to days
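To make the "Weighted 0–100" and "Critical / Warning / Info" rows concrete, here is a minimal sketch of how a severity-weighted score could be computed. The weights, the function name, and the input shape are assumptions for illustration only; this is not ValidateX's actual implementation.

```python
# Hypothetical severity weights; ValidateX's real weighting may differ.
SEVERITY_WEIGHTS = {"critical": 3.0, "warning": 2.0, "info": 1.0}


def weighted_quality_score(results):
    """Return a 0-100 score from (severity, passed) pairs.

    Each check contributes its severity weight to the total; only
    passing checks contribute to the earned portion.
    """
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in results)
    if total == 0:
        return 100.0  # nothing was checked, so nothing failed
    earned = sum(SEVERITY_WEIGHTS[severity] for severity, passed in results if passed)
    return round(100.0 * earned / total, 1)


# Two critical checks pass, one warning fails, one info check passes.
checks = [("critical", True), ("critical", True), ("warning", False), ("info", True)]
score = weighted_quality_score(checks)
print(score)
```

With these assumed weights, a failed critical check costs three times as much as a failed info check, so the score reflects impact rather than a raw pass rate.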

📦 PyPI • 💻 GitHub • 📖 Docs

💼 What I Do

┌─────────────────────────────────────────────────────────────┐
│  ETL Testing          →  Validate PySpark pipelines at scale │
│  Data Quality         →  Schema checks, SCD-2, drift detect  │
│  Test Automation      →  Selenium + Robot Framework + pytest │
│  Open Source          →  Building tools the data world needs │
│  CI/CD Integration    →  Jenkins, GitHub Actions, Airflow    │
└─────────────────────────────────────────────────────────────┘
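As an illustration of the SCD-2 checks mentioned in the box above, here is a minimal pandas sketch that tests two common SCD Type 2 invariants: exactly one current row per business key, and no overlapping validity windows. The column names (customer_id, valid_from, valid_to, is_current) and the function name are hypothetical, and open-ended rows are assumed to carry a missing valid_to.

```python
import pandas as pd


def scd2_violations(df, key="customer_id", start="valid_from",
                    end="valid_to", current="is_current"):
    """Return the business keys that break two SCD Type 2 invariants."""
    # Invariant 1: exactly one row per key is flagged as current.
    current_counts = df.groupby(key)[current].sum()
    bad_current = current_counts[current_counts != 1].index.tolist()

    # Invariant 2: within a key, a window must not start before the
    # previous window has ended.
    ordered = df.sort_values([key, start])
    prev_end = ordered.groupby(key)[end].shift()
    overlaps = ordered[prev_end.notna() & (ordered[start] < prev_end)]
    bad_overlap = sorted(overlaps[key].unique().tolist())

    return {"current_flag": bad_current, "overlap": bad_overlap}


# Made-up history: customer 2 has a window starting before the prior one ends.
history = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "valid_from": pd.to_datetime(["2024-01-01", "2024-06-01", "2024-01-01", "2024-03-01"]),
    "valid_to": pd.to_datetime(["2024-06-01", None, "2024-04-01", None]),
    "is_current": [False, True, False, True],
})
report = scd2_violations(history)
print(report)
```

The same two invariants translate directly to PySpark window functions; pandas is used here only to keep the sketch short.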

By the numbers, from 4+ years of production work:

🔴 60% reduction in data quality issues through automated testing frameworks
⚡ 40% reduction in manual reconciliation effort via Python automation
📊 10M+ records/day validated across PySpark ETL pipelines
🧪 96% code coverage on ValidateX (66 tests passing)
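The reconciliation automation mentioned above can be sketched, under assumptions, as a keyed comparison between a source and a target table. The function name, column names, and sample rows below are hypothetical and only illustrate the general technique, not the actual production code.

```python
import pandas as pd


def reconcile(source, target, key, compare_cols):
    """Compare two tables by key and report missing, unexpected, and mismatched keys."""
    merged = source.merge(
        target, on=key, how="outer", suffixes=("_src", "_tgt"), indicator=True
    )
    missing = merged.loc[merged["_merge"] == "left_only", key].tolist()
    extra = merged.loc[merged["_merge"] == "right_only", key].tolist()

    # For keys present on both sides, flag any compared column that differs.
    # Note: NaN != NaN, so rows with nulls on both sides count as mismatches here.
    both = merged[merged["_merge"] == "both"]
    differs = pd.Series(False, index=both.index)
    for col in compare_cols:
        differs |= both[f"{col}_src"] != both[f"{col}_tgt"]
    mismatched = both.loc[differs, key].tolist()

    return {
        "missing_in_target": missing,
        "unexpected_in_target": extra,
        "mismatched": mismatched,
    }


# Made-up sample data for illustration.
source = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
target = pd.DataFrame({"order_id": [2, 3, 4], "amount": [20.0, 35.0, 40.0]})
report = reconcile(source, target, key="order_id", compare_cols=["amount"])
print(report)
```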

🛠️ Tech Stack

Data & ETL

Testing & Automation

Cloud & Storage

📂 Projects

🧪 ValidateX — Published on PyPI

Open-source Python data quality validation framework. Pandas + PySpark support, 25+ expectations, severity scoring, HTML reports, and a CLI for CI/CD integration. Install with: pip install validatex

📈 NSE/BSE Stock Market Data Ingestion Tool

Python ETL pipeline for Indian stock market data — bulk ingestion via CSV/Excel, OHLCV schema normalization, API constraint handling, Streamlit control layer. A hands-on data engineering project focused on ingestion, transformation, and delivery.
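A minimal sketch of what OHLCV schema normalization can look like in pandas. The alias map, the header names in the sample row, and the values are all made up for illustration; real exchange exports use their own headers, and this is not the project's actual code.

```python
import pandas as pd

# Hypothetical alias map from vendor-specific headers to a canonical schema.
COLUMN_ALIASES = {
    "date": "date", "timestamp": "date",
    "symbol": "symbol", "ticker": "symbol",
    "open": "open", "open_price": "open",
    "high": "high", "high_price": "high",
    "low": "low", "low_price": "low",
    "close": "close", "close_price": "close",
    "volume": "volume",
}
OHLCV_COLUMNS = ["date", "symbol", "open", "high", "low", "close", "volume"]


def normalize_ohlcv(raw):
    """Rename headers to the canonical OHLCV schema and coerce column types."""
    cleaned = raw.rename(
        columns=lambda c: COLUMN_ALIASES.get(c.strip().lower().replace(" ", "_"), c)
    )
    missing = [c for c in OHLCV_COLUMNS if c not in cleaned.columns]
    if missing:
        raise ValueError(f"missing OHLCV columns: {missing}")

    out = cleaned[OHLCV_COLUMNS].copy()
    out["date"] = pd.to_datetime(out["date"])
    for col in ["open", "high", "low", "close", "volume"]:
        out[col] = pd.to_numeric(out[col])
    return out


# Made-up row with non-canonical headers and string-typed values.
raw = pd.DataFrame({
    "Timestamp": ["2025-01-02"], "Ticker": ["INFY"],
    "Open Price": ["1880.5"], "High Price": ["1902.0"],
    "Low Price": ["1871.2"], "Close Price": ["1895.4"], "Volume": ["1200000"],
})
bars = normalize_ohlcv(raw)
print(bars.dtypes)
```

Failing fast on missing columns keeps a malformed upload from silently producing a partial table downstream.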

🏅 Certifications

🏆 Databricks — Data Governance Fundamentals (Jan 2026)
🏆 Databricks — Databricks Fundamentals (Nov 2025)
📜 Big Data Analytics with Hadoop & Apache Spark — LinkedIn Learning (Sep 2025)
📜 Selenium WebDriver with Python — Udemy (Apr 2025)
📜 Getting Started in Test Automation Engineering — LinkedIn Learning (Apr 2025)

✍️ Writing

I write about data engineering, ETL automation, and real-world pipeline challenges on Medium.

📝 medium.com/@kavim1996

🤝 Let's Connect

"Bad data is worse than no data — it gives you false confidence."
That's why I build systems that catch it before it reaches your dashboards.
