Applicability of Process Mining Techniques in Business Environments

This thesis analyses problems related to the applicability, in business environments, of Process Mining tools and techniques.

The first contribution is a presentation of the state of the art of Process Mining and a characterization of companies in terms of their "process awareness". The work then identifies the stages where problems can emerge: data preparation, actual mining, and results interpretation. Further problems are the configuration of parameters by non-expert users and computational complexity.

We concentrate on two possible scenarios: "batch" and "on-line" Process Mining.

Concerning batch Process Mining, we first investigate the data preparation problem and propose a solution for identifying the "case-ids" whenever this field is not explicitly indicated.
After that, we concentrate on problems at mining time and propose the generalization of a well-known control-flow discovery algorithm in order to exploit non-instantaneous events. The use of interval-based recording leads to an important improvement in performance.
Later on, we report our work on parameter configuration for non-expert users. We present two approaches to select the "best" parameter configuration: one is completely autonomous; the other requires human interaction to navigate a hierarchy of candidate models.
Concerning data interpretation and results evaluation, we propose two metrics: a model-to-model metric and a model-to-log metric. Finally, we present an automatic approach for extending a control-flow model with social information, in order to simplify the analysis of these perspectives.

The second part of this thesis deals with control-flow discovery algorithms in on-line settings. We propose a formal definition of the problem and two baseline approaches. We then propose two actual mining algorithms: the first adapts a frequency counting algorithm to the control-flow discovery problem; the second is a framework of models that can be used for different kinds of streams (stationary versus evolving).

Thesis available at http://andrea.burattin.net/publications/2013-phd-thesis

Andrea Burattin

April 08, 2013

Transcript

  1. Applicability of Process Mining Techniques in Business Environments
    Candidate: Andrea Burattin, XXV Ciclo. Supervisor: Prof. Alessandro Sperduti. April 8, 2013
  2. Table of Contents
    - Information systems and business processes
    - Process mining
    - Application of process mining in business and industrial environments
      - Problems
      - Possible solutions
    - Conclusions and future work
  3. Current Information Systems
    - Information systems
      - Usage of information systems is growing in all companies
      - Information systems are moving from single “vertical” functionalities towards “horizontal” business processes
      - All these systems record large amounts of data, but exploiting it is typically not straightforward
  4. Business Processes
    - There is no universally accepted definition of “business process”; shared features, however, are
      - A finite set of activities (and dependencies)
      - Activity performers or originators
      - Output produced as the result of the execution
    - Typical example
  5. Event Logs
    - Execution traces of business processes are typically recorded in log files
  6. What is Process Mining?
    - Input
      - An event log
      - Optionally, other a-priori knowledge (e.g., a process model)
    - Examples of possible outputs
      - A model describing how activities are performed (useful to compare the ideal process with the actual one)
      - Relationships among originators (useful to redistribute resources over activities)
      - Statistics on the performance of the execution (useful for monitoring purposes)
  7. Process Mining Overview
    Image source: Christian W. Günther. Process Mining in Flexible Environments. PhD thesis, Technische Universiteit Eindhoven, Eindhoven, 2009.
  8. Theoretical Open Problems
    - Process mining literature presents several open problems, for example (van der Aalst, Comp. in Ind., 2004)
      - Duplicate tasks: activities with the same name in different positions of the model
      - Exploiting all data available: for example, not all the algorithms use all the time information to distinguish the starting time from the finishing time of an event
      - Holistic mining: different perspectives from different sources, not only the control flow but also other perspectives, in order to create a global process description
      - Noise and incompleteness: obtaining a complete log, where all the required information is actually available
  9. Industry-related Open Problems
    - Our case studies revealed other problems
      - Using process mining tools and configuring algorithms
      - Results interpretation: generating results with an as-readable-as-possible [graphical] representation of the process, where information is represented in a simple and understandable manner
      - Computational power and storage capacity required: small and medium-sized companies may not be able to cope with the technological requirements of large process mining projects
  10. Possible Industry Scenarios
    - We characterized four possible scenarios, based on the process awareness of companies and of their systems
      - Process-aware company vs. process-unaware company
      - Process-aware software vs. process-unaware software
  11. Problems with Data Preparation
    - Several problems with data preparation, at different complexity and abstraction levels
    - Key points
      - Adaptation of existing data (“syntax problem”, easy)
      - Construction of all the required information
      - Introduction of new information
  12. Problems with Data Preparation – 2
    - Fields required by process mining are (activity; process-name; case-id; timestamp; originator)
    - In our real case, we have
      - Process-aware company
      - Process-unaware information systems
      - Log structured as (activity; timestamp; originator; info_1; …; info_n)
  13. Problems with Data Preparation – 3
    - The name of the process is not a problem (we can assume that all events belong to the same process)
    - To extract the case-id from the info fields (idea)
      - Isolation of candidate case-id fields (a-priori knowledge)
      - Construction of “event chains”, binding two events that share at least one field's value (string similarity functions)
      - Selection of the maximal chain (the one with most activities or the simplest chain)
    - Details reported in (Burattin & Vigo, CIDM, 2011)
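The chain-building idea on slide 13 can be sketched as follows. This is a minimal illustration, not the method of (Burattin & Vigo, CIDM, 2011): it binds events by exact equality of a field value rather than by string similarity, and the function names, field names, and scoring rule (average number of distinct activities per chain) are assumptions made for the example.

```python
from collections import defaultdict

def build_chains(events, field):
    """Group events that share the same value of `field` into one chain."""
    chains = defaultdict(list)
    for event in events:
        value = event.get(field)
        if value is not None:
            chains[value].append(event)
    return dict(chains)

def pick_case_id_field(events, candidate_fields):
    """Pick the candidate whose chains cover the most distinct activities on average
    (an assumed scoring rule, used only for illustration)."""
    def score(field):
        chains = build_chains(events, field)
        if not chains:
            return 0.0
        covered = sum(len({e["activity"] for e in chain}) for chain in chains.values())
        return covered / len(chains)
    return max(candidate_fields, key=score)

# Hypothetical log: "order" and "clerk" are candidate case-id fields.
events = [
    {"activity": "create", "order": "o1", "clerk": "c1"},
    {"activity": "approve", "order": "o1", "clerk": "c2"},
    {"activity": "ship", "order": "o1", "clerk": "c1"},
    {"activity": "create", "order": "o2", "clerk": "c1"},
    {"activity": "approve", "order": "o2", "clerk": "c2"},
]
best_field = pick_case_id_field(events, ["order", "clerk"])
```

In this toy log the `order` field produces chains that span more distinct activities than `clerk`, so it is the more plausible case-id.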
  14. Basic Idea of Control-flow Discovery
    - The basic idea of control-flow discovery algorithms is the identification of dependencies between activities
    - If the event log contains sequences such as
      - … A, B … (notation: A > B)
      - … B, A …
    - Discovered model contains a dependency from A to B: A → B
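Counting the direct succession relation A > B over a log can be sketched in a few lines. The trace representation (a list of activity names per case) and the function name are assumptions made for this example.

```python
from collections import Counter

def direct_succession(traces):
    """Count how often activity a is directly followed by activity b (a > b)."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical log with three cases.
traces = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "B"]]
counts = direct_succession(traces)
```

Here `counts[("A", "B")]` is 2 while `counts[("B", "A")]` is 0, which is the kind of evidence a discovery algorithm uses to infer a dependency from A to B.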
  15. Exploiting Data Available
    - Events with a duration instead of instantaneous events
    - Generalization (i.e. same parameters) of a currently available control-flow mining algorithm to exploit this new information
      - Dependencies for time intervals
    - Dependency measure
    - AND-measure
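For reference, the classical Heuristics Miner dependency measure for instantaneous events is (|a > b| − |b > a|) / (|a > b| + |b > a| + 1). The sketch below computes it and adds a simple interval-overlap test. The transcript does not show how the thesis generalizes the measures to intervals, so the overlap check is only an assumed illustration of how start and end times could reveal concurrency.

```python
def dependency(ab, ba):
    """Classical dependency measure from direct succession counts |a > b| and |b > a|."""
    return (ab - ba) / (ab + ba + 1)

def overlaps(first, second):
    """True if two (start, end) intervals overlap in time.
    Overlap as a hint of parallelism is an assumption of this sketch."""
    return first[0] < second[1] and second[0] < first[1]

strong = dependency(9, 0)    # a is followed by b nine times, never the opposite
balanced = dependency(5, 5)  # both orders equally frequent
```

Values close to 1 suggest a dependency from a to b, while values close to 0 suggest that both orders occur and the activities may be parallel.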
  16. Exploiting Data Available – 2
    - Evaluation of the new algorithm against a synthetic dataset (100 processes, 1000 cases, 12000 events, 10% noise)
    - Test against a “real” dataset, almost correct mining
    - Reported in (Burattin & Sperduti, ESANN, 2010), implemented in ProM 5.2
  17. Non-expert Users
    - Typical users of process mining are “non-expert” users
    - They are not experts in process mining, but we assume they are familiar with process modelling
      - Process mining algorithms provide configurations (parameters) to cope with different scenarios
        - Heuristics Miner parameters are thresholds on measures (e.g., dependency measure and AND-measure)
      - Process mining algorithms are implemented in tools
      - Non-expert users do not understand the algorithms and are not skilled in using the tools
  18. Basic Idea
    - We defined an approach to discretize the values of each parameter of the algorithm we are considering
      - Idea: the log is finite, therefore only a finite number of significant thresholds exist
    - Each parameter configuration defines a model
    - We can turn the problem of configuring the parameters of process mining algorithms into “selecting the best model out of a set”
      - Automatic approach to explore this space
      - User-guided approach
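The discretization argument can be illustrated with a small sketch: if a parameter is a threshold on a measure, only the distinct values of that measure observed in the log can change the resulting model. The measure values, names, and the "keep every pair at or above the threshold" rule are assumptions made for the example.

```python
def significant_thresholds(measures):
    """Distinct measure values: the only thresholds that can change the model."""
    return sorted(set(measures.values()))

def model_for(measures, threshold):
    """Model as the set of dependencies whose measure reaches the threshold."""
    return frozenset(pair for pair, value in measures.items() if value >= threshold)

# Hypothetical dependency values computed from a finite log.
measures = {("A", "B"): 0.9, ("B", "C"): 0.75, ("A", "C"): 0.5, ("C", "D"): 0.75}
thresholds = significant_thresholds(measures)
models = {model_for(measures, t) for t in thresholds}
```

Any threshold between two consecutive significant values, such as 0.6 here, yields the same model as the next significant value, so the continuous parameter space collapses to three candidate models.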
  19. Automatic Discovery
    - Hill climbing with maximum plateau steps to reach local optimum
    - Several random restarts to try to achieve the global optimum
    - Optimization criterion based on Minimum Description Length principle
      - MDL encoding for process mining (Calders et al., SAC, 2009)
      - Heuristics to improve efficiency and performance
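The search strategy can be sketched generically as hill climbing with a bounded number of plateau moves plus random restarts. The toy score table below stands in for the MDL-based criterion, which is not reproduced here; all names and values are assumptions, and the sketch maximizes a score where the thesis would minimize a description length.

```python
import random

def hill_climb(score, neighbours, start, max_plateau_steps=5):
    """Move to the best neighbour while it improves the score; allow a bounded
    number of sideways moves on plateaus before stopping."""
    current, current_score, plateau = start, score(start), 0
    while True:
        candidates = neighbours(current)
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score > current_score:
            current, current_score, plateau = best, best_score, 0
        elif best_score == current_score and plateau < max_plateau_steps:
            current, plateau = best, plateau + 1
        else:
            return current, current_score

def random_restarts(score, neighbours, n_states, restarts=10, seed=0):
    """Run hill climbing from several random starting points and keep the best result."""
    rng = random.Random(seed)
    runs = [hill_climb(score, neighbours, rng.randrange(n_states)) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])

# Toy landscape over discretized parameter indices (hypothetical scores).
scores = [1, 3, 2, 2, 2, 5, 4, 0]
score = lambda i: scores[i]
neighbours = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]
best_state, best_score = random_restarts(score, neighbours, len(scores))
```

A single run started at index 3 crosses the plateau and stops at the local optimum (index 1), whereas a run started at index 6 reaches the global optimum (index 5); restarts make it more likely that at least one run starts in the right basin.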
  20. Automatic Discovery – Results
    - Tests against a synthetic dataset: 93 process models, 250 cases for each process
      - Rigorous adaptation with 3 values for α (balancing the model complexity and the data explanation)
  21. Automatic Discovery – Results 2
    - Performance in terms of time required to process different models, with the two techniques
    - Reported in (Burattin & Sperduti, CEC, 2010), implemented in ProM 5.2
  22. User-guided Discovery
    - Model comparison using a model-to-model metric (Burattin et al., BPI, 2011)
    - Approach based on hierarchical clustering with average linkage
    - In our implementation, the user can “explore” the hierarchy
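Agglomerative clustering with average linkage can be sketched in plain Python. In the setting of slide 22 the items would be candidate models and the distance would come from the model-to-model metric; here, as an assumption for brevity, the items are numbers and the distance is their absolute difference.

```python
from itertools import combinations

def average_linkage(items, distance):
    """Repeatedly merge the two clusters with the smallest average pairwise
    distance; return the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(item,) for item in items]
    merges = []

    def average(pair):
        a, b = pair
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average)
        merges.append((a, b, average((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Stand-in data: four "models" placed on a line.
merges = average_linkage([1, 2, 10, 11], lambda x, y: abs(x - y))
```

The list of merges is the hierarchy (dendrogram) that a user could navigate from the root, choosing at each step the branch whose models look closer to the expected process.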
  23. Process Mining Evaluation
    - It is possible to evaluate results of process mining using
      - Model-to-model metrics: to compare two given process models (typically the target and the mined one)
        - Structural Similarity
        - Dependency Difference Metric
      - Model-to-log metrics: to define the amount of behaviour observed in the log that is allowed by the model
        - Fitness
        - Soundness
  24. Model-to-model Metric
    - Metric used to cluster processes (Burattin et al., BPI, 2011)
      - Decomposition of complex process models into
        - Allowed / Forbidden behaviour
      - Comparison using Jaccard Similarity
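A minimal sketch of the comparison step, assuming each model has already been decomposed into a set of allowed and a set of forbidden relations. The relation strings and the equal-weight average of the two Jaccard values are assumptions of this example, not necessarily the combination used in (Burattin et al., BPI, 2011).

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def model_similarity(first, second):
    """Average the Jaccard similarity of allowed and forbidden behaviour
    (equal weights are an assumption of this sketch)."""
    allowed = jaccard(first["allowed"], second["allowed"])
    forbidden = jaccard(first["forbidden"], second["forbidden"])
    return (allowed + forbidden) / 2

# Hypothetical decompositions of two models.
m1 = {"allowed": {"A>B", "B>C"}, "forbidden": {"C>A"}}
m2 = {"allowed": {"A>B", "B>D"}, "forbidden": {"C>A"}}
similarity = model_similarity(m1, m2)
```

One minus this similarity can serve as the distance fed to a clustering procedure.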
  25. Model-to-log Metric
    - Process model as a set of Declare constraints
    - Given a Declare constraint π and a trace σ
    - Healthiness measures as
      - Activation sparsity
      - Violation ratio
      - Fulfilment ratio
      - Conflict ratio
    - Reported in (Burattin et al., EDOC, 2012), implemented in ProM 6.2
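As an illustration, the sketch below classifies the activations of a single Declare `response(a, b)` constraint ("every a is eventually followed by b") on one trace. The ratio definitions used here (fulfilments or violations divided by activations, and sparsity as one minus activations over trace length) are assumptions of this sketch and should be checked against (Burattin et al., EDOC, 2012); the conflict ratio is omitted because a response constraint produces no conflicting activations in this simplified reading.

```python
def response_activations(trace, a, b):
    """Each occurrence of `a` activates response(a, b); it is fulfilled if some
    `b` occurs later in the trace and violated otherwise."""
    fulfilled = violated = 0
    for position, activity in enumerate(trace):
        if activity == a:
            if b in trace[position + 1:]:
                fulfilled += 1
            else:
                violated += 1
    return fulfilled, violated

def healthiness(trace, a, b):
    """Assumed ratio definitions, for illustration only."""
    fulfilled, violated = response_activations(trace, a, b)
    activations = fulfilled + violated
    return {
        "activation_sparsity": 1 - activations / len(trace),
        "fulfilment_ratio": fulfilled / activations if activations else 0.0,
        "violation_ratio": violated / activations if activations else 0.0,
    }

measures = healthiness(["A", "B", "A", "C", "A"], "A", "B")
```

In the example trace only the first A is followed by a B, so one of the three activations is fulfilled and two are violated.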
  26. Multi-perspective Mining
    - Extension of control-flow model with social perspective
    - Input
      - Log with information on originators
      - Process model
    - Output
      - Process model extended with roles
  27. Multi-perspective Mining – 2
    - Each role requires a different set of skills
      - We assume that skills are characterized by a set of users
    - The basic idea is to identify handovers of roles starting from actual process dependencies
    - We do not have a handover of role between a and b if
      - In some cases a and b are performed by the same originator
      - The sets of originators performing a and b are similar
    - We divide activities into roles according to a threshold τ_w
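The partitioning idea can be sketched by treating a dependency a → b as "no handover" when the originator sets of a and b are similar enough, and grouping activities connected by such dependencies into one role. The Jaccard similarity used here, the log format, and the function names are assumptions of this example; the measure defined in (Burattin et al., CIDM, 2013) may differ.

```python
from collections import defaultdict

def originators_by_activity(log):
    """Map each activity to the set of originators that performed it."""
    result = defaultdict(set)
    for activity, originator in log:
        result[activity].add(originator)
    return dict(result)

def roles(log, dependencies, tau_w):
    """Merge the endpoints of every dependency whose originator sets have
    Jaccard similarity >= tau_w (assumed measure); each group is a role."""
    performers = originators_by_activity(log)
    parent = {activity: activity for activity in performers}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in dependencies:
        similarity = len(performers[a] & performers[b]) / len(performers[a] | performers[b])
        if similarity >= tau_w:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for activity in performers:
        groups[find(activity)].add(activity)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical log of (activity, originator) pairs and model dependencies.
log = [("register", "ann"), ("register", "bob"), ("check", "ann"), ("check", "bob"),
       ("approve", "carl"), ("pay", "carl"), ("pay", "dora")]
dependencies = [("register", "check"), ("check", "approve"), ("approve", "pay")]
loose = roles(log, dependencies, tau_w=0.5)
strict = roles(log, dependencies, tau_w=0.8)
```

Raising the threshold splits more dependencies into handovers and therefore produces more, smaller roles.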
  28. Multi-perspective Mining – 3
    - Identifying handovers is not sufficient
    - Sometimes we need to merge roles
    - We merge roles according to a threshold τ_ρ
  29. Multi-perspective Mining – 4
    - We are able to discretize τ_w and τ_ρ to their significant values
    - We use an entropy-based metric to rank the most interesting partitions
    - Reported in (Burattin et al., CIDM, 2013), implemented in ProM 6.2
  30. Stream Mining Peculiarities
    - Peculiarities of the stream mining problem
      - It is impossible to store the complete stream
      - Backtracking over a data stream is not feasible, so algorithms are required to make only one pass over the data
      - It is important to quickly adapt the model to cope with unusual data values
      - The approach must deal with variable system conditions, such as fluctuating stream rates
    - The basic idea is that recent observations are more important than older ones
  31. Proposed Approaches
    - Three approaches for stream process mining
      - A basic approach with two variations (sliding window and periodic reset)
        - Parameter: maximum space occupied
      - The adaptation of an approach for approximate frequency counting (i.e. Lossy Counting) to our problem
        - Parameter: maximum error allowed
      - A new general framework, specifically designed for process mining, which supports several update policies
        - Parameters: maximum space occupied, weighting policy
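The classical Lossy Counting algorithm can be sketched as below, here applied to a stream of direct succession pairs. This shows only the generic frequency-counting scheme; how the thesis adapts it to control-flow discovery (for example, how cases are tracked) is not covered, and the class and attribute names are assumptions.

```python
import math

class LossyCounter:
    """Approximate frequency counting with maximum error `epsilon` (Lossy Counting)."""

    def __init__(self, epsilon):
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> (frequency, maximum undercount)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            frequency, delta = self.entries[item]
            self.entries[item] = (frequency + 1, delta)
        else:
            self.entries[item] = (1, bucket - 1)
        # At each bucket boundary, drop the items that are too infrequent.
        if self.n % self.width == 0:
            self.entries = {key: (f, d) for key, (f, d) in self.entries.items()
                            if f + d > bucket}

# Hypothetical stream of direct succession pairs.
counter = LossyCounter(epsilon=0.25)
stream = [("A", "B")] * 3 + [("B", "C")] + [("A", "B")] * 3 + [("C", "D")]
for pair in stream:
    counter.add(pair)
```

After the eight observations only the frequent pair survives, so memory stays bounded while the stored counts underestimate the true frequencies by at most `epsilon` times the number of items seen.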
  32. Weights Update Policies
    - The framework is based on the continuous update of frequencies of Heuristics Miner basic relations
      - Most frequent activities
      - Most frequent direct succession relations
    - Policy for Stationary Streams
      - Online Heuristics Miner (with error bounds)
    - Policies for Evolving Streams
      - Online Heuristics Miner with Ageing
      - Online Heuristics Miner with Self-Adapting Ageing
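One simple way to make recent observations weigh more than older ones is exponential decay: before each update, every stored frequency is multiplied by a factor between 0 and 1. The sketch below assumes this form of ageing and a fixed factor `alpha`; the exact update rules of the Online Heuristics Miner variants, including the self-adapting one, are not reproduced here.

```python
from collections import defaultdict

class AgeingCounts:
    """Frequencies of relations with exponential ageing (assumed policy)."""

    def __init__(self, alpha):
        self.alpha = alpha                # ageing factor in [0, 1]
        self.counts = defaultdict(float)  # relation -> aged frequency

    def observe(self, relation):
        # Age every stored frequency, then count the new observation.
        for key in self.counts:
            self.counts[key] *= self.alpha
        self.counts[relation] += 1.0

aged = AgeingCounts(alpha=0.5)
for relation in [("A", "B"), ("A", "B"), ("B", "C")]:
    aged.observe(relation)
```

With `alpha` equal to 1 this reduces to plain counting, which suits stationary streams; smaller values forget old behaviour faster, which suits evolving streams.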
  33. Stream Miner Results
    - Tests against 3 realistic datasets and one real dataset
    - Results on a dataset with 6000 cases and 58783 events
    - Results in technical report (CoRR abs/1212.6383), implemented in ProM 6.2
  34. Processes and Logs Generator
    - Many companies are reluctant to share their data
    - Stochastic context-free grammar to generate random business processes
      - Rules to simulate a process in order to produce an event log
      - Reference model and log can be used for the evaluation of control-flow mining algorithms
    - Reported in (Burattin & Sperduti, BPI, 2011)
    - Implemented in independent software, now part of official ProM 6.2 distribution
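The generator idea can be sketched with a tiny stochastic grammar whose productions are "single activity", "sequence", "XOR", and "AND", followed by a simulator that turns the generated model into traces. The production set, the probabilities, and all names are assumptions of this example and do not reproduce the grammar of (Burattin & Sperduti, BPI, 2011).

```python
import random

# (probability, pattern) pairs; the values are hypothetical.
RULES = [(0.5, "activity"), (0.3, "sequence"), (0.1, "xor"), (0.1, "and")]

def generate(rng, depth=0, max_depth=3, counter=None):
    """Expand the grammar recursively into a nested-tuple process model."""
    counter = counter if counter is not None else [0]
    pattern = rng.choices([p for _, p in RULES], weights=[w for w, _ in RULES])[0]
    if pattern == "activity" or depth >= max_depth:
        counter[0] += 1
        return ("activity", f"A{counter[0]}")
    left = generate(rng, depth + 1, max_depth, counter)
    right = generate(rng, depth + 1, max_depth, counter)
    return (pattern, left, right)

def activities(node):
    """List the activity names appearing in a model."""
    if node[0] == "activity":
        return [node[1]]
    return activities(node[1]) + activities(node[2])

def simulate(rng, node):
    """Produce one trace by executing the model."""
    if node[0] == "activity":
        return [node[1]]
    left, right = simulate(rng, node[1]), simulate(rng, node[2])
    if node[0] == "sequence":
        return left + right
    if node[0] == "xor":
        return left if rng.random() < 0.5 else right
    # "and": random interleaving that preserves the order inside each branch.
    merged = []
    while left or right:
        source = left if (left and (not right or rng.random() < 0.5)) else right
        merged.append(source.pop(0))
    return merged

rng = random.Random(42)
model = generate(rng)
log = [simulate(rng, model) for _ in range(5)]
```

Because the reference model is known, the generated log can be mined and the result compared against the model that produced it.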
  35. Conclusions
    - Process mining techniques may have an important impact on companies' work (especially for SMEs)
    - Several problems prevent the application of process mining techniques in real business environments
      - Data availability
      - Actual mining phase (both in mining algorithms and in human interaction)
      - Results interpretation
    - Solutions have been proposed for all the stages where problems may arise