In this article, we look at an often overlooked but crucial aspect of data quality: data lineage. Data lineage records how data flows and how it is transformed throughout its lifecycle, from source to destination. Understanding it is vital for maintaining data integrity and transparency in data processes, which makes lineage an essential part of the data quality workflow.
We explore how to ensure data quality in a Spark Scala ETL (Extract, Transform, Load) job. To do so, we use Deequ, an open-source library built on top of Apache Spark, to define and enforce data quality checks.
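To give a feel for what defining and enforcing a check looks like, here is a minimal sketch of a Deequ verification run. It assumes Spark and Deequ are on the classpath; the `orders` data, its column names, and the `DeequSketch` object with its `checkOrders` helper are hypothetical and only serve as an illustration.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeequSketch {

  // Runs a small set of constraints against a DataFrame.
  // The column names are assumptions made for this sketch.
  def checkOrders(orders: DataFrame): VerificationResult =
    VerificationSuite()
      .onData(orders)
      .addCheck(
        Check(CheckLevel.Error, "orders sanity checks")
          .isComplete("order_id")   // no nulls in order_id
          .isUnique("order_id")     // no duplicate order_id values
          .isNonNegative("amount")) // amount is never below zero
      .run()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deequ-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the output of an ETL step.
    val orders = Seq(
      (1L, "alice", 20.0),
      (2L, "bob", 35.5),
      (3L, "carol", 12.0)
    ).toDF("order_id", "customer", "amount")

    val result = checkOrders(orders)

    // Fail the job if any Error-level constraint was violated.
    if (result.status != CheckStatus.Success) {
      throw new IllegalStateException("Data quality checks failed")
    }

    spark.stop()
  }
}
```

Throwing on a non-successful status is one possible way to enforce the checks: the ETL job stops before bad data reaches the load step. Depending on the pipeline, logging the failed constraints or routing the data to a quarantine location may be a better fit.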