Applicability of Process Mining Techniques in Business Environments

Candidate: Andrea Burattin, XXV Ciclo
Supervisor: Prof. Alessandro Sperduti
April 8, 2013

2 of 45 Table of Contents
Information systems and business processes
Process mining
Application of process mining in business and industrial environments
  – Problems
  – Possible solutions
Conclusions and future work

3 of 45 Current Information Systems
Information systems
  – Usage of information systems is growing in all companies
  – Information systems are moving from single “vertical” functionalities towards “horizontal” business processes
  – All these systems record a lot of data, but exploiting it is typically not straightforward

4 of 45 Business Processes
There is no universally accepted definition of “business process”; however, the shared features are
  – A finite set of activities (and dependencies)
  – Activity performers or originators
  – Output produced as the result of execution
Typical example

5 of 45 Event Logs
Execution traces of business processes are typically recorded in log files

6 of 45 What is Process Mining?
Input
  – An event log
  – Optionally, other a-priori knowledge (e.g., a process model)
Examples of possible outputs
  – A model describing how activities are performed
    • Useful to compare the ideal process with the actual one
  – Relationships among originators
    • Useful to redistribute resources over activities
  – Statistics on the performance of the execution
    • Useful for monitoring purposes

7 of 45 Process Mining Overview
Image source: Christian W. Günther. Process Mining in Flexible Environments. PhD thesis, Technische Universiteit Eindhoven, Eindhoven, 2009.

8 of 45 Theoretical Open Problems
The process mining literature presents several open problems, for example (van der Aalst, Comp. in Ind., 2004)
  – Duplicate tasks
    • Activities with the same name in different positions of the model
  – Exploiting all data available
    • For example, not all algorithms use the available time information to distinguish the start time from the finish time of an event
  – Holistic mining
    • Different perspectives from different sources: not only the control flow but also other perspectives, in order to create a global process description
  – Noise and incompleteness
    • Obtaining a complete log, where all the required information is actually available

9 of 45 Industry-related Open Problems
Our case studies revealed other problems
  – Using process mining tools and configuring algorithms
  – Results interpretation
    • Generating results with an as-readable-as-possible [graphical] representation of the process, so that information is presented in a simple and understandable manner
  – Computational power and storage capacity required
    • Small and medium-sized companies may not be able to cope with the technological requirements of large process mining projects

10 of 45 Possible Industry Scenarios
We characterized four possible scenarios, based on the process awareness of companies and of their systems
  – Process-aware company vs. process-unaware company
  – Process-aware software vs. process-unaware software

11 of 45 Thesis Structure and Organization

12 of 45 Overview – Data Preparation

13 of 45 Problems with Data Preparation
Several problems arise in data preparation, at different complexity and abstraction levels
Key points
  – Adaptation of existing data (“syntax problem”, easy)
  – Construction of all the required information
  – Introduction of new information

14 of 45 Problems with Data Preparation – 2
The fields required by process mining are (activity; process-name; case-id; timestamp; originator)
In our real case, we have
  – A process-aware company
  – Process-unaware information systems
  – A log structured as (activity; timestamp; originator; info_1; …; info_n)

15 of 45 Problems with Data Preparation – 3
The name of the process is not a problem (we can assume that all events belong to the same process)
To extract the case-id from the info fields (idea)
  – Isolation of candidate case-id fields (a-priori knowledge)
  – Construction of “event chains”, binding two events that share at least one field's value (string similarity functions)
  – Selection of the maximal chain (the one with the most activities, or the simplest chain)
Details reported in (Burattin & Vigo, CIDM, 2011)
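
The chain-building idea can be illustrated with a rough Python sketch. It is not the procedure of the cited paper: exact equality stands in for the string similarity functions, the selection rule (prefer the field whose best chain covers the most distinct activities) is only one reading of "maximal chain", and the event layout and field names (`order`, `clerk_note`) are hypothetical.

```python
from collections import defaultdict

def chains_for_field(events, field):
    """Group events that share the same value of one candidate info field."""
    chains = defaultdict(list)
    for event in events:
        value = event["info"].get(field)
        if value is not None:
            chains[value].append(event)
    return dict(chains)

def pick_case_id_field(events, candidate_fields):
    """Assumed rule: prefer the field whose best chain has most distinct activities."""
    def best_chain_size(field):
        chains = chains_for_field(events, field)
        return max((len({e["activity"] for e in chain})
                    for chain in chains.values()), default=0)
    return max(candidate_fields, key=best_chain_size)

# Hypothetical events in the (activity; ...; info_1; ...; info_n) layout.
events = [
    {"activity": "Create order", "info": {"order": "o1", "clerk_note": "n1"}},
    {"activity": "Check order", "info": {"order": "o1", "clerk_note": "n2"}},
    {"activity": "Ship order", "info": {"order": "o1", "clerk_note": "n3"}},
    {"activity": "Create order", "info": {"order": "o2", "clerk_note": "n3"}},
]
case_id_field = pick_case_id_field(events, ["order", "clerk_note"])
print(case_id_field)
```

Here the `order` field binds three different activities into one chain, while `clerk_note` binds at most two, so `order` is the better case-id candidate under the assumed rule.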

16 of 45 Overview – Control-flow Mining

17 of 45 Basic Idea of Control-flow Discovery
The basic idea of control-flow discovery algorithms is the identification of dependencies between activities
If the event log contains sequences such as
  – … A, B …    Notation: A > B
  – … B, A …
The discovered model contains a dependency from A to B: A → B
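
As a small illustration of the A > B notation, the following Python sketch counts direct successions in a log given as a list of traces. The log content and the function name are made up for the example.

```python
from collections import Counter

def direct_successions(traces):
    """Count how often activity a is directly followed by activity b (a > b)."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Toy log: each trace is the ordered list of activities of one case.
log = [["A", "B", "C"], ["A", "B", "D"], ["B", "A", "C"]]
follows = direct_successions(log)
print(follows[("A", "B")], follows[("B", "A")])
```

A discovery algorithm would then decide, from counts like these, whether a dependency from A to B belongs in the model.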

18 of 45 Exploiting Data Available
Events with a duration instead of instantaneous events
Generalization (i.e. same parameters) of a currently available control-flow mining algorithm to exploit this new information
  – Dependencies for time intervals
Dependency measure
AND-measure

19 of 45 Exploiting Data Available – 2
Evaluation of the new algorithm against a synthetic dataset (100 processes, 1000 cases, 12000 events, 10% noise)
Test against a “real” dataset: almost correct mining
Reported in (Burattin & Sperduti, ESANN, 2010), implemented in ProM 5.2

20 of 45 Non-expert Users
Typical users of process mining are “non-expert” users
They are not experts in process mining, but we assume they have some knowledge of process modelling
  – Process mining algorithms provide configurations (parameters) to cope with different scenarios
    • Heuristics Miner parameters are thresholds on measures (e.g., dependency measure and AND-measure)
  – Process mining algorithms are implemented in tools
  – Non-expert users do not understand the algorithms and are not skilled in using the tools

21 of 45 Basic Idea
We defined an approach to discretize the values of each parameter of the algorithm under consideration
  – Idea: the log is finite, therefore only a finite number of significant thresholds exists
Each parameter configuration defines a model
We can recast the problem of configuring the parameters of process mining algorithms as “selecting the best model out of a set”
  – Automatic approach to explore this space
  – User-guided approach
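
Why a finite log yields only finitely many significant thresholds can be seen in a short Python sketch. It assumes the standard Heuristics Miner dependency measure, (|a>b| − |b>a|) / (|a>b| + |b>a| + 1), and treats every distinct value observed in the log as a candidate threshold. The discretization actually used in the thesis may differ; the function names and the toy log are illustrative.

```python
from collections import Counter

def dependency_values(traces):
    """Heuristics Miner dependency measure for every observed pair a > b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    values = {}
    for (a, b), ab in counts.items():
        if a == b:
            continue  # length-one loops use a separate formula; skipped here
        ba = counts.get((b, a), 0)
        values[(a, b)] = (ab - ba) / (ab + ba + 1)
    return values

def significant_thresholds(traces):
    """Only these values can change which dependencies pass the threshold."""
    return sorted(set(dependency_values(traces).values()))

log = [["A", "B", "C"], ["A", "B", "C"], ["A", "C", "B"]]
deps = dependency_values(log)
thresholds = significant_thresholds(log)
print(thresholds)
```

Any threshold chosen strictly between two consecutive values in `thresholds` selects the same set of dependencies, so exploring one value per interval covers all distinct models for this parameter.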

22 of 45 Automatic Discovery
Hill climbing with a maximum number of plateau steps to reach a local optimum
Several random restarts to try to reach the global optimum
Optimization criterion based on the Minimum Description Length (MDL) principle
  – MDL encoding for process mining (Calders et al., SAC, 2009)
  – Heuristics to improve efficiency and performance
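
The search strategy can be sketched generically in Python. This is a toy version under stated assumptions: configurations are indices into a list, the hypothetical `costs` list stands in for the MDL score of the model each configuration produces, and visited configurations are skipped so that plateau moves do not oscillate. It is not the ProM implementation.

```python
import random

def hill_climb(score, neighbours, start, max_plateau_steps):
    """Minimise score; equal-score moves are allowed up to max_plateau_steps."""
    current, current_score = start, score(start)
    visited = {start}
    plateau = 0
    while True:
        candidates = [c for c in neighbours(current) if c not in visited]
        if not candidates:
            return current, current_score
        best = min(candidates, key=score)
        best_score = score(best)
        if best_score < current_score:
            plateau = 0  # strict improvement resets the plateau budget
        elif best_score == current_score and plateau < max_plateau_steps:
            plateau += 1  # sideways move across a plateau
        else:
            return current, current_score
        current, current_score = best, best_score
        visited.add(best)

def random_restarts(score, neighbours, random_start, restarts,
                    max_plateau_steps, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    results = [hill_climb(score, neighbours, random_start(rng), max_plateau_steps)
               for _ in range(restarts)]
    return min(results, key=lambda result: result[1])

# Hypothetical description lengths of ten discretized configurations.
costs = [5, 4, 4, 4, 3, 6, 7, 2, 1, 8]
score = lambda i: costs[i]
neighbours = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]

print(hill_climb(score, neighbours, 0, max_plateau_steps=2))
best = random_restarts(score, neighbours, lambda rng: rng.randrange(len(costs)),
                       restarts=20, max_plateau_steps=2)
print(best)
```

Starting from configuration 0, two plateau steps are enough to cross the flat region and reach the local optimum at index 4; restarts are what give the search a chance of finding the better optimum at index 8.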

23 of 45 Automatic Discovery – Results
Tests against a synthetic dataset: 93 process models, 250 cases for each process
  – Rigorous adaptation with 3 values for α (balancing the model complexity and the data explanation)

24 of 45 Automatic Discovery – Results 2
Performance in terms of the time required to process different models with the two techniques
Reported in (Burattin & Sperduti, CEC, 2010), implemented in ProM 5.2

25 of 45 User-guided Discovery
Model comparison using a model-to-model metric (Burattin et al., BPI, 2011)
Approach based on hierarchical clustering with average linkage
In our implementation, the user can “explore” the hierarchy
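
Average-linkage clustering itself is standard and can be sketched in a few lines of Python. The distance matrix below is invented for the example; in the setting of this slide it would be derived from the model-to-model metric (for instance as 1 minus the similarity, which is an assumption here).

```python
def average_linkage(distance):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [(i,) for i in range(len(distance))]
    merges = []

    def cluster_distance(c1, c2):
        # Average linkage: mean pairwise distance between the two clusters.
        return sum(distance[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(cluster_distance(a, b), a, b)
                 for idx, a in enumerate(clusters)
                 for b in clusters[idx + 1:]]
        d, a, b = min(pairs)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical pairwise distances between four mined models.
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
merges = average_linkage(distance)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

The sequence of merges is the hierarchy a user could explore: early merges group nearly identical models, later ones join increasingly different groups.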

26 of 45 Overview – Results Evaluation

27 of 45 Process Mining Evaluation
It is possible to evaluate the results of process mining using
  – Model-to-model metrics: to compare two given process models (typically the target and the mined one)
    • Structural Similarity
    • Dependency Difference Metric
  – Model-to-log metrics: to quantify the amount of behaviour observed in the log that is allowed by the model
    • Fitness
    • Soundness

28 of 45 Model-to-model Metric
Metric used to cluster processes (Burattin et al., BPI, 2011)
  – Decomposition of complex process models into
    • Allowed / Forbidden behaviour
  – Comparison using Jaccard Similarity
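
The Jaccard similarity of two sets is |A ∩ B| / |A ∪ B|. The Python sketch below applies it to sets of allowed and forbidden relations. Representing behaviour as pairs of activities and averaging the two similarities with equal weight are assumptions made for the example, not necessarily the combination used in the cited paper.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def model_similarity(allowed_1, forbidden_1, allowed_2, forbidden_2):
    """Assumed combination: equal-weight mean of the two Jaccard similarities."""
    return 0.5 * jaccard(allowed_1, allowed_2) + 0.5 * jaccard(forbidden_1, forbidden_2)

# Hypothetical decompositions of two models into allowed / forbidden relations.
allowed_1, forbidden_1 = {("A", "B"), ("B", "C")}, {("C", "A")}
allowed_2, forbidden_2 = {("A", "B"), ("A", "C")}, {("C", "A")}

similarity = model_similarity(allowed_1, forbidden_1, allowed_2, forbidden_2)
print(round(similarity, 3))
```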

29 of 45 Model-to-log Metric
Process model as a set of Declare constraints
Given a Declare constraint π and a trace σ
Healthiness measures
  – Activation sparsity
  – Violation ratio
  – Fulfilment ratio
  – Conflict ratio
Reported in (Burattin et al., EDOC, 2012), implemented in ProM 6.2

30 of 45 Overview – Process Extension

31 of 45 Multi-perspective Mining
Extension of the control-flow model with the social perspective
Input
  – Log with information on originators
  – Process model
Output
  – Process model extended with roles

32 of 45 Multi-perspective Mining – 2
Each role requires a different set of skills
  – We assume that skills are characterized by a set of users
The basic idea is to identify handovers of roles starting from the actual process dependencies
There is no handover of role between a and b if
  – In some cases a and b are performed by the same originator
  – The sets of originators performing a and b are similar
We divide activities into roles according to a threshold τw

33 of 45 Multi-perspective Mining – 3
Identifying handovers is not sufficient
Sometimes we need to merge roles
We merge roles according to a threshold τρ

34 of 45 Multi-perspective Mining – 4
We are able to discretize τw and τρ to their significant values
We use an entropy-based metric to rank the most interesting partitions
Reported in (Burattin et al., CIDM, 2013), implemented in ProM 6.2

35 of 45 Overview – Stream Control-flow Mining

36 of 45 Basic Idea

37 of 45 Stream Mining Peculiarities
Peculiarities of the stream mining problem
  – It is impossible to store the complete stream
  – Backtracking over a data stream is not feasible, so algorithms are required to make only one pass over the data
  – It is important to quickly adapt the model to cope with unusual data values
  – The approach must deal with variable system conditions, such as fluctuating stream rates
The basic idea is that recent observations are more important than older ones

38 of 45 Proposed Approaches
Three approaches for stream process mining
  – A basic approach with two variations (sliding window and periodic reset)
    • Parameter: maximum space occupied
  – The adaptation of an approach for approximate frequency counting (i.e. Lossy Counting) to our problem
    • Parameter: maximum error allowed
  – A new general framework specifically designed for process mining, which supports several update policies
    • Parameters: maximum space occupied, weighting policy
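
For reference, plain Lossy Counting can be sketched in Python as follows, with the maximum allowed error as its only parameter. This is the generic frequency-counting algorithm applied to a made-up stream of direct successions; the adaptation to process mining mentioned above needs additional bookkeeping (for example, the latest activity of each case), which is omitted here. The class and attribute names are illustrative.

```python
import math

class LossyCounter:
    """Approximate frequency counts with undercount at most max_error * n."""

    def __init__(self, max_error):
        self.width = math.ceil(1 / max_error)  # bucket width
        self.n = 0                             # items seen so far
        self.entries = {}                      # item -> [frequency, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, drop the items that are too infrequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

# Hypothetical stream of observed direct successions.
stream = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "B"),
          ("C", "D"), ("A", "B"), ("A", "B"), ("A", "B")]
counter = LossyCounter(max_error=0.25)
for succession in stream:
    counter.add(succession)
print(counter.entries)
```

Infrequent successions are evicted at bucket boundaries, so memory stays bounded while frequent relations keep (approximately) correct counts.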

39 of 45 Weights Update Policies
The framework is based on the continuous update of the frequencies of the Heuristics Miner basic relations
  – Most frequent activities
  – Most frequent direct succession relations
Policy for Stationary Streams
  – Online Heuristics Miner (with error bounds)
Policies for Evolving Streams
  – Online Heuristics Miner with Ageing
  – Online Heuristics Miner with Self-Adapting Ageing
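
One way to make recent observations count more than older ones is to decay all stored frequencies by a factor before each update. The Python sketch below shows this under assumptions: the factor `alpha`, the class name and the event format are illustrative, the space bound of the framework is omitted, and the exact update rules of the Online Heuristics Miner policies may differ.

```python
class AgeingCounts:
    """Frequencies of activities and direct successions with exponential ageing."""

    def __init__(self, alpha):
        self.alpha = alpha        # assumed ageing factor, 0 <= alpha <= 1
        self.activities = {}      # activity -> aged frequency
        self.successions = {}     # (a, b) -> aged frequency
        self.last_activity = {}   # case id -> most recent activity

    def _bump(self, table, key):
        for existing in table:
            table[existing] *= self.alpha  # older observations lose weight
        table[key] = table.get(key, 0.0) + 1.0

    def observe(self, case_id, activity):
        self._bump(self.activities, activity)
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self._bump(self.successions, (previous, activity))
        self.last_activity[case_id] = activity

# Hypothetical stream of (case id, activity) events from two interleaved cases.
counts = AgeingCounts(alpha=0.5)
for case_id, activity in [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "B")]:
    counts.observe(case_id, activity)
print(counts.activities, counts.successions)
```

With `alpha = 1` the counts reduce to plain frequencies, which suits stationary streams; smaller values forget old behaviour faster, which suits evolving streams.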

40 of 45 Stream Miner Results
Tests against 3 realistic datasets and one real dataset
Results on a dataset with 6000 cases and 58783 events
Results in technical report (CoRR abs/1212.6383), implemented in ProM 6.2

41 of 45 Stream Miner Results – 2

42 of 45 Overview

43 of 45 Processes and Logs Generator
Many companies are reluctant to share their data
Stochastic context-free grammar to generate random business processes
  – Rules to simulate a process in order to produce an event log
  – The reference model and log can be used for the evaluation of control-flow mining algorithms
Reported in (Burattin & Sperduti, BPI, 2011)
Implemented as independent software, now part of the official ProM 6.2 distribution

44 of 45 Conclusions
Process mining techniques may have an important impact on companies' work (especially for SMEs)
Several problems prevent the application of process mining techniques in real business environments
  – Data availability
  – Actual mining phase (both in mining algorithms and in human interaction)
  – Results interpretation
Solutions have been proposed for all the stages where problems may arise

45 of 45 Detailed Map of Performed Activities
