Introduction
The idea of process mining is to discover, monitor, and improve fact-based processes by extracting knowledge from the event logs readily available in today's systems. There are two main drivers for the growing interest in process mining. On the one hand, more and more events are being recorded, providing detailed information about the history of processes. On the other hand, there is a need to improve and support business processes in competitive and rapidly changing environments (Van der Aalst et al., 2012). Discovery is the type of process mining that takes unstructured data (an event log) and, without any a-priori information, produces a process model (Van der Aalst, 2011, p. 10). The focus of this paper is to facilitate the discovery endeavor through a clustering approach. The proposed method is useful when a process is expected to contain a large set of unique event classes. This work aims to provide computerized facilitation to process analysts, following the Decision Support Systems paradigm. Facilitation occurs through a) suggesting a way to divide and conquer (Carmona et al., 2009a) the log file, i.e., to define horizontal boundaries for the global process, b) discovering marginal yet meaningful process models by virtually any technique, and c) providing recommendations to merge the marginal models into an overall process model.
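Step (a) above, partitioning an event log along horizontal boundaries defined by clusters of event classes, can be sketched as follows. This is a minimal illustration only: the function `split_log_by_activity_clusters`, the toy log, and the cluster labels are hypothetical and stand in for the paper's actual clustering approach.

```python
from collections import defaultdict

def split_log_by_activity_clusters(event_log, clusters):
    """Project each trace onto each activity cluster, yielding one
    sub-log per cluster (an illustrative sketch, not the paper's
    actual algorithm). Traces are sequences of activity names;
    clusters maps a label to a set of activity names."""
    sub_logs = defaultdict(list)
    for trace in event_log:
        for label, activities in clusters.items():
            # Keep only the events belonging to this cluster.
            sub_trace = [event for event in trace if event in activities]
            if sub_trace:
                sub_logs[label].append(sub_trace)
    return dict(sub_logs)

# Toy event log: two traces over four hypothetical activities.
log = [["a", "b", "x", "y"], ["a", "x", "b", "y"]]
clusters = {"order": {"a", "b"}, "shipping": {"x", "y"}}
print(split_log_by_activity_clusters(log, clusters))
# → {'order': [['a', 'b'], ['a', 'b']], 'shipping': [['x', 'y'], ['x', 'y']]}
```

Each sub-log can then be fed to any discovery technique (step b), and the resulting marginal models merged into an overall model (step c).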
Process discovery is one of the most challenging tasks in process mining. State-of-the-art techniques still have problems dealing with large and/or complex event logs and process models (Van der Aalst, 2012a). Complexity (whether due to a large number of distinct activities comprising the process, or due to a huge number of traces in the event log - the Big Data case) is the main reason why the discovery problem is hard, since most discovery techniques are exponential in the number of activities. These techniques struggle, or even fail, to deliver results in a reasonable time when an event log contains a large number of unique event classes. Moreover, process discovery is hard because the process is typically quite unstructured and/or the log that an analyst can obtain is incomplete. A large number of activities aggravates both problems. Researchers identified early on the need for process mining to keep pace with the Big Data reality, making the decomposition of process mining problems an emerging topic of the field (Van der Aalst, 2012a; Van der Aalst, 2013a; Verbeek & Van der Aalst, 2013; Van der Aalst, 2013b; Munoz-Gama et al., 2013b; Munoz-Gama et al., 2013a).
An additional reason why discovery is hard is that, when producing a process model, one must consider multiple criteria. The need to consider multiple criteria in all process mining tasks has been well documented in the literature (Buijs et al., 2012; Adriansyah et al., 2011b; Rozinat & Van der Aalst, 2008; Van der Aalst et al., 2008). Four main quality criteria have been identified for discovered process models: fitness (be able to replay the observed behavior), precision (do not allow too much additional behavior), generalization (avoid overfitting), and simplicity (do not increase, beyond what is necessary, the number of entities required to explain the behavior). Since these criteria conflict, discovery methods must trade off their requirements. On the other hand, despite its difficulty, the discovery problem remains interesting because organizations need process models that reflect their real processes for documentation, verification, performance analysis, etc. (Lee et al., 2013). These needs are reflected in the growing interest of Business Process Management vendors, consultants, and researchers.
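One common way to operationalize the trade-off among the four criteria is to aggregate them into a single weighted score. The sketch below assumes each criterion has already been normalized to [0, 1]; the function `model_quality` and the particular weights are illustrative assumptions, not a method proposed in this paper or the cited literature.

```python
def model_quality(fitness, precision, generalization, simplicity,
                  weights=(0.4, 0.3, 0.15, 0.15)):
    """Hypothetical weighted aggregate of the four quality criteria.

    Each criterion is assumed to lie in [0, 1]; the weights encode
    how an analyst might prioritize the conflicting requirements
    (here fitness is weighted most heavily, purely for illustration).
    """
    scores = (fitness, precision, generalization, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

# A model that replays the log perfectly but allows much extra
# behavior scores lower on precision, pulling the aggregate down.
print(model_quality(1.0, 0.5, 0.8, 0.9))  # ≈ 0.805
```

Raising one criterion's weight typically favors models that sacrifice the others, which is exactly the conflict the text describes.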