I'm a Senior SDET (SDET III) at UST specializing in Data Quality Engineering and ETL Automation Testing. With 4+ years of experience validating production-scale PySpark pipelines (10M+ records/day), I sit at the intersection of data engineering and QA, catching the bugs that hide inside your data, not just your code.
In 2026, I published ValidateX, a lightweight Python data quality validation framework, to PyPI. After years of writing the same validation boilerplate across projects, I decided to ship it as a library instead.
pip install validatex

A lightweight, production-ready data quality validation framework for Python.
Supports Pandas & PySpark • 25+ built-in expectations • Weighted quality scoring • Modern HTML reports
import pandas as pd
import validatex as vx

df = pd.read_csv("users.csv")  # hypothetical input with user_id, age, and email columns
suite = (
vx.ExpectationSuite("production_data")
.add("expect_column_to_not_be_null", column="user_id")
.add("expect_column_values_to_be_unique", column="user_id")
.add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
.add("expect_column_values_to_match_regex", column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)
result = vx.validate(df, suite)
print(result.summary()) # Data Quality Score: 97/100
result.to_html("report.html") # Beautiful dark-theme HTML report

Why ValidateX?
| | ValidateX | Great Expectations |
|---|---|---|
| Setup | pip install → validate in 5 lines | Multi-step setup with contexts & stores |
| Quality Score | ✅ Weighted 0-100 | ❌ |
| Severity Levels | ✅ Critical / Warning / Info | ❌ |
| CI/CD CLI | ✅ Built-in | ❌ |
| Learning Curve | Minutes | Hours to days |
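To make the "weighted 0-100" score and the severity levels in the table concrete, here is a minimal sketch of how such a score could be computed. It is not ValidateX's actual scoring code: the `SEVERITY_WEIGHTS` values, the `CheckResult` class, and the `quality_score` function are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical severity weights; ValidateX's real weights may differ.
SEVERITY_WEIGHTS = {"critical": 3.0, "warning": 2.0, "info": 1.0}


@dataclass
class CheckResult:
    name: str
    severity: str  # "critical", "warning", or "info"
    passed: bool


def quality_score(results):
    """Return the weighted share of passing checks, scaled to 0-100."""
    total = sum(SEVERITY_WEIGHTS[r.severity] for r in results)
    if total == 0:
        return 100.0  # no checks ran, so nothing failed
    earned = sum(SEVERITY_WEIGHTS[r.severity] for r in results if r.passed)
    return round(100.0 * earned / total, 1)


results = [
    CheckResult("user_id not null", "critical", True),
    CheckResult("user_id unique", "critical", True),
    CheckResult("age between 0 and 150", "warning", False),
    CheckResult("email matches regex", "info", True),
]
print(quality_score(results))  # 7 of 9 weighted points pass -> 77.8
```

Weighting by severity means a failed critical check pulls the score down more than a failed info check, which a plain pass/fail count would hide.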
PyPI • GitHub • Docs
- ETL Testing: Validate PySpark pipelines at scale
- Data Quality: Schema checks, SCD-2, drift detection
- Test Automation: Selenium + Robot Framework + pytest
- Open Source: Building tools the data world needs
- CI/CD Integration: Jenkins, GitHub Actions, Airflow
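As an illustration of the SCD-2 checks mentioned above, here is a small, framework-free sketch of two common rules for a Type 2 slowly changing dimension: each business key has exactly one current row, and validity windows do not overlap. The row layout, the half-open `[valid_from, valid_to)` convention, and the `scd2_violations` helper are assumptions for this example, not code from a real pipeline.

```python
from collections import defaultdict
from datetime import date

# Hypothetical SCD-2 rows: (business_key, valid_from, valid_to, is_current).
# valid_to is None for the open-ended current version.
rows = [
    ("cust-1", date(2024, 1, 1), date(2024, 6, 1), False),
    ("cust-1", date(2024, 6, 1), None, True),
    ("cust-2", date(2024, 3, 1), date(2024, 9, 1), False),
    ("cust-2", date(2024, 8, 1), None, True),  # overlaps the previous version
]


def scd2_violations(rows):
    """Return a list of (business_key, problem) tuples."""
    by_key = defaultdict(list)
    for key, valid_from, valid_to, is_current in rows:
        by_key[key].append((valid_from, valid_to, is_current))

    problems = []
    for key, versions in sorted(by_key.items()):
        versions.sort(key=lambda v: v[0])  # order versions by valid_from
        # Rule 1: exactly one version is flagged as current.
        if sum(1 for v in versions if v[2]) != 1:
            problems.append((key, "expected exactly one current row"))
        # Rule 2: each version must end on or before the next one starts.
        for prev, nxt in zip(versions, versions[1:]):
            if prev[1] is None or prev[1] > nxt[0]:
                problems.append((key, "overlapping validity windows"))
    return problems


print(scd2_violations(rows))  # [('cust-2', 'overlapping validity windows')]
```

The same two rules translate directly to PySpark with a window partitioned by the business key and ordered by `valid_from`.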
By the numbers from my 4+ years in production:
- 60% reduction in data quality issues through automated testing frameworks
- 40% reduction in manual reconciliation effort via Python automation
- 10M+ records/day validated across PySpark ETL pipelines
- 96% code coverage on ValidateX (66 tests passing)
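For readers unfamiliar with reconciliation automation, the sketch below shows the basic idea in plain Python: compare source and target records by key, then report rows that are missing, unexpected, or different. The `reconcile` function, the field names, and the sample records are hypothetical, and the sketch assumes keys are unique on each side.

```python
def reconcile(source, target, key, compare_fields):
    """Compare two lists of dict records by key and summarize the differences."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}

    mismatched = {}
    for k in sorted(src.keys() & tgt.keys()):
        # Collect the fields whose values differ for this shared key.
        diffs = [f for f in compare_fields if src[k][f] != tgt[k][f]]
        if diffs:
            mismatched[k] = diffs

    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": mismatched,
    }


# Hypothetical source extract and loaded target table.
source = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 75},
]
target = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 205},  # value changed during load
    {"order_id": 4, "amount": 10},   # not present in source
]

report = reconcile(source, target, key="order_id", compare_fields=["amount"])
print(report)
```

At pipeline scale the same comparison is usually expressed as a full outer join on the key rather than in-memory dictionaries.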
Data & ETL
Testing & Automation
Cloud & Storage
ValidateX: Published on PyPI
Open-source Python data quality validation framework. Pandas + PySpark support, 25+ expectations, severity scoring, HTML reports, CLI for CI/CD integration.
pip install validatex
Python ETL pipeline for Indian stock market data: bulk ingestion via CSV/Excel, OHLCV schema normalization, API constraint handling, and a Streamlit control layer. A hands-on data engineering project focused on ingestion, transformation, and delivery.
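OHLCV schema normalization roughly means mapping whatever headers a file arrives with onto one canonical set of columns and types. The sketch below shows one way that step could look using only the standard library. It is not the project's code: the `COLUMN_ALIASES` mapping, the sample headers, and the `normalize_ohlcv` function are assumptions for illustration.

```python
import csv
import io

# Hypothetical header aliases; real exchange or vendor exports will vary.
COLUMN_ALIASES = {
    "date": "date",
    "open": "open", "open price": "open",
    "high": "high", "high price": "high",
    "low": "low", "low price": "low",
    "close": "close", "close price": "close",
    "volume": "volume", "total traded quantity": "volume",
}
PRICE_COLUMNS = ("open", "high", "low", "close")
REQUIRED_COLUMNS = ("date", *PRICE_COLUMNS, "volume")


def normalize_ohlcv(csv_text):
    """Parse CSV text into rows with canonical OHLCV names and numeric types."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            if header is None or value is None:
                continue  # ragged row: skip extra or missing cells
            canonical = COLUMN_ALIASES.get(header.strip().lower())
            if canonical is not None:
                row[canonical] = value.strip()
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        # Strip thousands separators before converting to numbers.
        for col in PRICE_COLUMNS:
            row[col] = float(row[col].replace(",", ""))
        row["volume"] = int(row["volume"].replace(",", ""))
        rows.append(row)
    return rows


sample = (
    "Date,Open Price,High Price,Low Price,Close Price,Total Traded Quantity\n"
    '2024-01-02,"1,500.00","1,520.50","1,495.00","1,510.25","1,200,000"\n'
)
print(normalize_ohlcv(sample))
```

Failing fast on missing columns keeps a malformed upload from silently producing partial rows downstream.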
- Databricks - Data Governance Fundamentals (Jan 2026)
- Databricks - Databricks Fundamentals (Nov 2025)
- Big Data Analytics with Hadoop & Apache Spark - LinkedIn Learning (Sep 2025)
- Selenium WebDriver with Python - Udemy (Apr 2025)
- Getting Started in Test Automation Engineering - LinkedIn Learning (Apr 2025)
I write about data engineering, ETL automation, and real-world pipeline challenges on Medium.
"Bad data is worse than no data: it gives you false confidence."
That's why I build systems that catch it before it reaches your dashboards.