How can data science contribute to a greener world? : an exploration featuring machine learning and data mining for environmental facilities and energy end users

Abstract: Human society has taken many measures to address environmental issues, for example, deploying wastewater treatment plants (WWTPs) to alleviate water pollution and the shortage of usable water, and using waste-to-energy (WtE) plants to recover energy from waste and reduce its environmental impact. However, managing these facilities is demanding because the underlying processes and operations are complex and dynamic. These characteristics hinder a comprehensive and precise understanding of the processes through conventional mechanistic models. On the other hand, with the development of the Fourth Industrial Revolution, large-volume, high-resolution data from automatic online monitoring have become increasingly obtainable. These data usually contain rich, detailed information about process activities that can be utilized for optimizing process control. Similarly, data monitoring is also adopted by resource end users. For example, commercial buildings usually record energy consumption in order to optimize consumption behavior, ultimately saving operating costs and reducing the carbon footprint. With the data recorded and retrieved, appropriate data science methods need to be employed to extract the desired information. Data science is a field that encompasses formulating data-driven solutions, preprocessing data, analyzing data with particular algorithms, and employing the results to support high-level decisions in various application scenarios. The aim of this PhD project is to explore how data science can contribute to a more sustainable world, from the perspectives of both improving the operation of environmental engineering processes and optimizing the activities of energy end users.
The major work and corresponding results are as follows:

(1) (Paper I) An ML workflow consisting of Random Forest (RF) models, Deep Neural Network (DNN) models, Variable Importance Measure (VIM) analyses, and Partial Dependence Plot (PDP) analyses was developed and utilized to model WWTP processes and reveal how operational features affect effluent quality. The case study was conducted on a full-scale WWTP in Sweden with a large dataset (105,763 samples). This paper was the first ML application study to investigate cause-and-effect relationships in full-scale WWTPs. Also, for the first time, time lags between process parameters were treated rigorously so that accurate information could be uncovered. The cause-and-effect findings in this paper can contribute to more sophisticated process control that is both more precise and more cost-effective.

(2) (Paper II) An upgraded workflow was designed to make the WWTP cause-and-effect investigation more precise, reliable, and comprehensive. Besides RF, two further typical tree-based models, XGBoost and LightGBM, were introduced, and two additional metrics were adopted for a more comprehensive performance evaluation. A unified and more advanced interpretation method, SHapley Additive exPlanations (SHAP), was employed to aid model comparison and to interpret the optimal models more deeply. Along with the new local findings, this study delivered two significant general findings for cause-and-effect ML implementations in process industries. First, multi-perspective model comparison is vital for selecting a truly reliable model for interpretation. Second, adopting an accurate and granular interpretation method benefits both model comparison and interpretation.

(3) (Paper III) A novel workflow was proposed to identify the operational factors accountable for boiler failures at WtE plants. In addition to data preprocessing and domain-knowledge integration, it mainly comprised feature-space embedding and unsupervised clustering. Two methods, PCA + K-means and Deep Embedding Clustering (DEC), were carried out and compared. The workflow fulfilled the objective of a case study on three datasets from a WtE plant in Sweden, and DEC outperformed PCA + K-means on all three datasets. DEC was superior due to its unique mechanism, in which the embedding module and K-means are trained simultaneously and iteratively with bidirectional information flow.

(4) (Paper IV) A two-level workflow (data-structure level and algorithm-mechanism level) was put forward to detect imperceptible anomalies in the energy consumption profiles of commercial buildings. The workflow achieved two objectives: it precisely detected the contextual energy anomalies hidden behind the time variation in the case study, and it investigated the combined influence of data structures and algorithm mechanisms on unsupervised anomaly detection for building energy consumption. The overall conclusion was that contextualization resulted in a less skewed estimation of correlations between instances, and that algorithms with more local perspectives benefited more from contextualization.
