Developing a Framework for Enhanced Data Pipeline Quality Management System
Voropaeva, Anastasiia (2022)
Voropaeva, Anastasiia
2022
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2022060214672
Abstract
This thesis focuses on the quality of a data pipeline. The work was carried out at a company that offers a data pipeline as part of a product for developing autonomous solutions for mobile radio networks. The data pipeline delivers data from various sources to a destination in a form optimized for reading, querying, decision making, and building data products.
The rationality of decisions and the quality of data products are determined by the quality of the data pipeline. The quality monitoring system currently used in the case company provides means for ensuring the quality of the data pipeline and data products. However, as the product developed and more experience was gained from implementing the data pipeline in various environments, it became apparent that the system allowed some flaws to pass unnoticed. This thesis aimed to highlight insufficiently observed stages of the data pipeline and proposed the key components of a quality management system to implement.
The study was performed using the action research methodology and represents applied research conducted on the basis of qualitative data collection and analysis. The theoretical framework discussed the main components of a data pipeline quality management system: notification rules, quality statements, metadata storage, audit storage, an incident management system, and a quality monitoring system.
The thesis focused on the observability problem of a data handling process and proposed the main components required for troubleshooting, monitoring, and understanding the health of the data and of the pipeline itself. Based on the best practices studied in the thesis, the recommended components are an essential part of a trustworthy data pipeline solution at scale. Addressing the observability problem and improving the means for data pipeline quality monitoring will make it possible to ensure the accuracy of data products and provide a solid ground for decision making.