Although practitioners have long acknowledged that high-quality event logs are essential for turning such logs into relevant insights, little is known about the precise relationships between specific imperfection patterns and process mining outcomes. Process mining research also tends to pay little attention to the content of the datasets used for benchmarking. Cleaning techniques tailored to such imperfection patterns could therefore benefit practitioners by helping ensure that their data is reliable and accurately represents the real world. This thesis addresses this gap by exploring how the degradation of activity labels in event logs may affect predictive process monitoring results. Our experiments suggest that state-of-the-art LSTM-based predictive process monitoring pipelines may not be affected by such pollution for remaining time prediction, whereas the opposite seems to be true for next activity prediction. These results are then confirmed by comparing them with the outcomes of the same experiments conducted on cleaned versions of the datasets, obtained using several cleaning algorithms developed for this purpose. In addition, we analyzed our bibliography to survey some of the most popular datasets in the process mining literature and to extract practitioners' comments on them. These comments are enriched with profiling of the datasets and comparisons with our own findings, in order to discuss how far conclusions drawn from these datasets generalize and what role the datasets themselves play in shaping those conclusions. The experiments were conducted on a number of datasets drawn from the survey, plus two synthetic datasets.
Although process mining experts have long recognized the paramount importance of using quality event logs to ensure that useful information can be extracted, little is known about the precise relationships between the presence of each specific imperfection in a log used for process mining and the consequences for the analysis results. In general, process mining research attaches little importance to the content of the datasets used for benchmarking. Developing data cleaning techniques tailored to such imperfections could benefit experts by ensuring that their data is reliable and correctly matches reality. This thesis starts from this situation and explores how predictive process monitoring results can be influenced by the degradation of activity names (labels) in event logs. The experiments show that the LSTM-based pipelines widely used in predictive process monitoring are not affected by the imperfections for remaining time prediction, but are affected for next activity prediction. We then confirm the results by comparing them with those of the same experiments run on cleaned versions of the datasets, obtained by applying several purpose-built cleaning algorithms. We also analyzed our bibliography to compile a review of some of the most popular datasets in the scientific literature on process mining, and extracted the experts' comments on these datasets. We further enriched these observations with our own profiling of the datasets and compared them with our findings, in order to assess the generalizability of the conclusions obtained using the datasets and to discuss the influence the datasets may have on those conclusions. The experiments were conducted on some of the datasets drawn from the review, plus two synthetic datasets.
Quality of activity labels and impacts on predictive process monitoring
TIÉDREZ, ROMAN JULES
2024/2025
| File | Size | Format | Access |
|---|---|---|---|
| 2025_07_Tiedrez_Thesis_01.pdf | 11.83 MB | Adobe PDF | Openly accessible online |
| 2025_07_Tiedrez_Executive_Summary_02.pdf | 698.86 kB | Adobe PDF | Openly accessible online |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/239912