
A Framework for Semi-Automated Process Instance Discovery From Decorative Attributes

Process mining is a relatively new field of research whose ultimate aim is to bridge the gap between data mining and business process modelling. The assumption underpinning this discipline is that data coming from business process executions is available. In business process theory, once a process has been defined, any number of its instances can run at the same time. Process mining techniques usually tell instances apart through a specific "case id" field in the log. The software systems that support the execution of a business process, however, often do not record this information explicitly. This paper presents an approach for dealing with the absence of the "case id" information: we have a set of extra fields, decorating each single activity log, that are known to carry the information on the process instance. A framework based on simple relational algebra notions is proposed to extract the most promising case ids from the extra fields. The work is a generalization of a real business case.

More info: http://andrea.burattin.net/publications/2011-cidm

Andrea Burattin

April 14, 2011

Transcript

  1. A Framework for Semi-Automated Process Instance Discovery from Decorative Attributes

    Andrea Burattin and Roberto Vigo
    Department of Pure and Applied Mathematics, University of Padua, Italy
    SIAV S.p.A.
    April 14th, 2011
    Andrea Burattin and Roberto Vigo Framework Semi-Automated Process Instance Discovery
  2. Slide 2 of 19 Introduction – What is process mining?

    Given a log describing the executions of some activities

    | Case id | Activities       | Execution Time     | Other fields |
    |---------|------------------|--------------------|--------------|
    | 1       | Order received   | mar 21, 2011 12:00 | ...          |
    | 1       | Payment received | mar 22, 2011 09:00 | ...          |
    | 2       | Order received   | mar 23, 2011 15:45 | ...          |
    | 2       | Payment reminder | mar 25, 2011 15:45 | ...          |
    | 2       | Payment received | mar 25, 2011 17:31 | ...          |
    | 1       | Goods available  | mar 26, 2011 08:30 | ...          |
    | 2       | Goods available  | mar 26, 2011 10:00 | ...          |
    | 1       | Shipping         | mar 26, 2011 10:15 | ...          |
    | 2       | Shipping         | mar 26, 2011 12:30 | ...          |
  3. Slide 3 of 19 Introduction – What is process mining?

    Produce a process model (discovery)
    [Figure: process model over the activities Order received, Payment reminder, Payment received, Goods available, Shipping]
    Work with other perspectives
    - Conformance between the execution log and a reference process model
    - Relationships among activity originators, other extensions
  4. Slide 4 of 19 Introduction – What is process mining?

    It is necessary to split the log into a set of traces

    Process Instance 1

    | Activity         | Execution Time     | Other fields |
    |------------------|--------------------|--------------|
    | Order received   | mar 21, 2011 12:00 | ...          |
    | Payment received | mar 22, 2011 09:00 | ...          |
    | Goods available  | mar 26, 2011 08:30 | ...          |
    | Shipping         | mar 26, 2011 10:15 | ...          |

    Process Instance 2

    | Activity         | Execution Time     | Other fields |
    |------------------|--------------------|--------------|
    | Order received   | mar 23, 2011 15:45 | ...          |
    | Payment reminder | mar 25, 2011 15:45 | ...          |
    | Payment received | mar 25, 2011 17:31 | ...          |
    | Goods available  | mar 26, 2011 10:00 | ...          |
    | Shipping         | mar 26, 2011 12:30 | ...          |
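
As an illustration of this step, the following minimal sketch groups the example log by its case id and orders each group by execution time. The function name `split_into_traces` and the ISO-style timestamp strings are assumptions made for the example; they do not come from the slides.

```python
from collections import defaultdict
from datetime import datetime

# Rows taken from the example log: (case id, activity, execution time).
LOG = [
    (1, "Order received", "2011-03-21 12:00"),
    (1, "Payment received", "2011-03-22 09:00"),
    (2, "Order received", "2011-03-23 15:45"),
    (2, "Payment reminder", "2011-03-25 15:45"),
    (2, "Payment received", "2011-03-25 17:31"),
    (1, "Goods available", "2011-03-26 08:30"),
    (2, "Goods available", "2011-03-26 10:00"),
    (1, "Shipping", "2011-03-26 10:15"),
    (2, "Shipping", "2011-03-26 12:30"),
]


def split_into_traces(log):
    """Group events by case id and order each group by execution time."""
    grouped = defaultdict(list)
    for case_id, activity, when in log:
        stamp = datetime.strptime(when, "%Y-%m-%d %H:%M")
        grouped[case_id].append((stamp, activity))
    # One trace (ordered list of activity names) per process instance.
    return {case: [name for _, name in sorted(events)]
            for case, events in grouped.items()}


traces = split_into_traces(LOG)
```

Here `traces[1]` holds the four activities of the first instance and `traces[2]` the five activities of the second one, which is exactly the grouping that becomes impossible when the case id column is missing.
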
  6. Slide 5 of 19 Introduction

    This work is focused on data preparation for process mining on “non standard” business logs
    In particular
    - Our data sources lack an explicit “case id” field
    - This work generalizes a real business case
  8. Slide 6 of 19 Goal of this work

    The goal is to perform log conversions
    (activity, timestamp, originator, $info_1, \dots, info_m$)
    ⇓
    (activity, timestamp, originator, case id[, $info_1, \dots, info_m$])
    Basic assumptions
    - Availability of additional attributes
    - Relations among the values of the infos fields identify relations among the corresponding activities
    Basic approach
    1. Identify relations among infos fields by matching their values
    2. Use those relations to relate different activities
  10. Slide 7 of 19 Similar work in literature

    In the literature there are two types of approaches
    1. Domain specific (e.g. SAP)
    2. Domain agnostic
    Our approach is “hybrid”
    1. The domain is a “parameter”
    2. We only need each activity to be decorated with additional infos fields whose semantics is fixed per activity
  12. Slide 8 of 19 Original log structure

    Basic structure of our log
    (activity, timestamp, originator, $info_1, \dots, info_m$)
    $info_i$ represents the same information for all the log entries with the same activity (e.g. “invoice number”)
    In this work we adopt a relational-algebra point of view because
    - It seems “natural” to see a log as a relation
    - It is close to the implementation
  13. Slide 9 of 19 Basic notation

    Elementary relational-algebra notation, let
    - π be the projection operator
    - σ be the selection operator
    Other symbols, let
    - L be the original log
    - P be the power set
    - I be the set of infos field names
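
To make the notation concrete, here is a small sketch of the two operators over a log stored as a list of dictionaries. The representation, the helper names `select` and `project`, and the sample rows are all hypothetical; the slides only name the operators.

```python
def select(relation, predicate):
    """sigma: keep only the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def project(relation, fields):
    """pi: keep only the given fields; the result is a set, so duplicates collapse."""
    return {tuple(row[f] for f in fields) for row in relation}


# Hypothetical three-entry log L with two infos fields.
L = [
    {"activity": "Invoice issued", "info1": "INV-7", "info2": "Rossi"},
    {"activity": "Invoice issued", "info1": "INV-8", "info2": "Rossi"},
    {"activity": "Invoice paid", "info1": "INV-7", "info2": "Bianchi"},
]

# pi_{info1}(sigma_{activity = "Invoice issued"}(L))
issued_ids = project(select(L, lambda r: r["activity"] == "Invoice issued"), ["info1"])
```

Returning a set from `project` mirrors the set semantics of relational algebra, which is what makes the later counting of shared values meaningful.
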
  16. Slide 10 of 19 Procedure details – Preprocessing

    Basic approach reminder
    1. Identify relations among infos fields by matching their values
    2. Use those relations to relate different activities
    We have a lot of infos fields and a complete search on all their combinations is infeasible
    Reduction of the search space with a priori knowledge
    - Assumptions on the data type (e.g. it would be unlikely that a timestamp represents a case id)
    - Assumptions on the expected features of the case id (e.g. average length bounds, variance, presence/absence of symbols, ...)
    These assumptions are “domain specific parameters”
  19. Slide 11 of 19 Procedure details – Match procedure

    The match procedure for selecting the process instance candidate is

    forall the $(a_i, a_j)$ where $a_i \neq a_j$ do
      forall the $PI_i \in P(I)$ and $PI_j \in P(I)$ do
        if $|PI_i| = |PI_j|$ then
          $k \leftarrow \dfrac{|\pi_{PI_i}(\sigma_{activity=a_i}(L)) \cap \pi_{PI_j}(\sigma_{activity=a_j}(L))|}{|\pi_{PI_i}(\sigma_{activity=a_i}(L)) \cup \pi_{PI_j}(\sigma_{activity=a_j}(L))|}$
          // k is the normalized amount of shared values between $PI_i$ and $PI_j$ for $a_i$ and $a_j$
          if k > threshold then
            // In this case $PI_i$ and $PI_j$ are sets of fields that relate $a_i$ and $a_j$
            // Collect all this information
          end
        end
      end
    end
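
The procedure can be sketched in Python as follows. This is not the authors' implementation: it assumes that k is the ratio between the number of shared value tuples and the number of distinct value tuples (intersection over union), that only pairs of distinct activities are compared, and that field subsets are compared position by position. The `max_size` cap on subset size, the default threshold and the sample log are additions made to keep the example small.

```python
from itertools import combinations


def project(log, activity, fields):
    """pi_fields(sigma_activity(log)) as a set of value tuples."""
    rows = (e for e in log if e["activity"] == activity)
    return {tuple(e[f] for f in fields)
            for e in rows
            if all(e.get(f) is not None for f in fields)}


def match(log, info_fields, threshold=0.5, max_size=2):
    """Return ((a_i, PI_i), k, (a_j, PI_j)) for every pair whose sharing k exceeds the threshold."""
    activities = sorted({e["activity"] for e in log})
    # Non-empty subsets of the infos fields, capped at max_size (assumption).
    subsets = [c for n in range(1, max_size + 1)
               for c in combinations(info_fields, n)]
    found = []
    # Unordered pairs of distinct activities, to avoid symmetric duplicates.
    for a_i, a_j in combinations(activities, 2):
        for pi_i in subsets:
            for pi_j in subsets:
                if len(pi_i) != len(pi_j):
                    continue
                left = project(log, a_i, pi_i)
                right = project(log, a_j, pi_j)
                union = left | right
                if not union:
                    continue
                k = len(left & right) / len(union)
                if k > threshold:
                    found.append(((a_i, pi_i), k, (a_j, pi_j)))
    return found


# Hypothetical log: the order code sits in info1 for orders and in info2 for shipments.
LOG = [
    {"activity": "Order received", "info1": "A1", "info2": "x"},
    {"activity": "Order received", "info1": "A2", "info2": "y"},
    {"activity": "Shipping", "info1": "p", "info2": "A1"},
    {"activity": "Shipping", "info1": "q", "info2": "A2"},
]

relations = match(LOG, ["info1", "info2"])
```

On this toy log the only relation found links `info1` of "Order received" to `info2` of "Shipping" with full sharing. The nested loops over field subsets grow exponentially with the number of infos fields, which is why reducing the candidate fields beforehand matters.
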
  22. Slide 12 of 19 Procedure details – Chains of activities

    With the previous algorithm we produce “tuples” of the form:
    $[a_i, PI_i] \xrightarrow{k} [a_j, PI_j]$
    where $a_i, a_j$ are activities; $PI_i, PI_j$ are the corresponding sets of infos fields; and $k$ is the amount of values shared by $PI_i, PI_j$ on activities $a_i, a_j$
    $PI_i$ and $PI_j$ exhibit close values, thus they could hide process instance information for $a_i, a_j$
    Now
    - We are generally interested in relating more than 2 activities
    - Given the set of all those tuples we can join them in “chains”
  25. Slide 13 of 19 Procedure details – Chains of activities

    Example of chains joining
    $C_1 = [a_1, A] \xrightarrow{k_1} [a_2, B]$
    $C_2 = [a_3, C] \xrightarrow{k_2} [a_4, D]$
    If there exists
    $C_3 = [a_2, B] \xrightarrow{k_3} [a_3, C]$
    We can build
    $C_4 = [a_1, A] \xrightarrow{k_1} [a_2, B] \xrightarrow{k_3} [a_3, C] \xrightarrow{k_2} [a_4, D]$
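
The joining step can be sketched as below. The dictionary layout (`nodes` for the [activity, fields] pairs, `ks` for the sharing values on the arrows), the `join` helper and the numeric sharing values are hypothetical choices made for the example.

```python
def join(left, right):
    """Concatenate two chains when the last node of `left` equals the first node of `right`."""
    if left["nodes"][-1] != right["nodes"][0]:
        return None  # the chains do not share an end point
    return {
        "nodes": left["nodes"] + right["nodes"][1:],
        "ks": left["ks"] + right["ks"],
    }


# The chains of the example, with made-up sharing values k1, k2, k3.
c1 = {"nodes": [("a1", "A"), ("a2", "B")], "ks": [0.9]}
c2 = {"nodes": [("a3", "C"), ("a4", "D")], "ks": [0.7]}
c3 = {"nodes": [("a2", "B"), ("a3", "C")], "ks": [0.8]}

# C1 and C2 cannot be joined directly; C3 bridges them.
c4 = join(join(c1, c3), c2)
```

Requiring the shared node to match on both the activity and its field set keeps a chain consistent: every activity in the chain is tied to exactly one set of infos fields.
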
  28. Slide 14 of 19 Procedure details – Chains handling

    The collection of generated chains can be large
    It is not feasible to analyze them all
    ⇓
    - Cardinality reduction, selecting only the “most interesting”
    - Ordering operator among chains
    If the operator is reflexive, antisymmetric and transitive then we can build a partially ordered set and consider the maximal elements as the most promising ones
  30. Slide 15 of 19 Procedure details – Chains handling

    Example operator. Let $A$ and $B$ be two chains
    $$A \preceq B \iff \begin{cases} A_{PI_1} \geq B_{PI_1} & \text{if } \mathcal{A}(A) = \mathcal{A}(B) \wedge S(A) = S(B) \\ S(A) \leq S(B) & \text{if } \mathcal{A}(A) = \mathcal{A}(B) \wedge S(A) \neq S(B) \\ \mathcal{A}(A) \subseteq \mathcal{A}(B) & \text{otherwise} \end{cases}$$
    - $\mathcal{A}(A)$ is the set of activities of chain $A$
    - $S(A)$ is the average sharing among activities of chain $A$
    - $A_{PI_1}$ is the number of infos fields that define the case id
    This operator is based on observations of our real settings
    It can be a “domain specific parameter”
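
A possible reading of this example operator in code is given below. It assumes that the second case applies when the two chains cover the same activities but have different average sharing, and it summarizes each chain by three values only. The `Chain` class, the `precedes` and `maximal` helpers and the sample chains are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    activities: frozenset  # set of activities covered by the chain
    sharing: float         # average sharing among its activities
    n_fields: int          # number of infos fields that define the case id


def precedes(a, b):
    """True when chain a comes before (or equals) chain b in the ordering."""
    if a.activities == b.activities:
        if a.sharing == b.sharing:
            return a.n_fields >= b.n_fields  # fewer fields ranks higher
        return a.sharing <= b.sharing        # more sharing ranks higher
    return a.activities <= b.activities      # covering more activities ranks higher


def maximal(chains):
    """Chains that are not preceded by any other, different chain."""
    return [c for c in chains
            if not any(c != other and precedes(c, other) for other in chains)]


x = Chain(frozenset({"a1", "a2"}), 0.9, 1)
y = Chain(frozenset({"a1", "a2"}), 0.9, 2)
z = Chain(frozenset({"a1", "a2", "a3"}), 0.6, 1)
w = Chain(frozenset({"a4"}), 0.5, 1)

best = maximal([x, y, z, w])
```

In this toy set `y` is dominated by `x` (same activities and sharing, more fields), both are dominated by `z` (a superset of activities), and `w` is incomparable with the others, so the maximal elements are `z` and `w`.
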
  31. Slide 16 of 19 Procedure details – Final log generation

    Given a chain we build a log suitable for the application of process mining techniques
    For each original log entry
    - Activity, timestamp and originator are the same as in the original log
    - Process name is a constant label
    - The case id for all the entries of the same activity is given by a composition function applied to the infos fields selected for that activity
    We obtain one process log per chain
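
The generation step might look like the sketch below. The slides do not define the composition function, so `compose` (joining the selected values with a separator) is only one possible choice; representing a chain as a mapping from activity to its selected fields, skipping activities the chain does not cover, and the sample entries are likewise assumptions.

```python
def compose(values):
    """Hypothetical composition function: join the selected values into one case id."""
    return "|".join(str(v) for v in values)


def build_log(log, chain, process_name="process"):
    """Emit one entry per original entry whose activity is covered by the chain."""
    out = []
    for entry in log:
        fields = chain.get(entry["activity"])
        if fields is None:
            continue  # activity not covered by this chain (assumption: skip it)
        out.append({
            "process": process_name,  # constant label
            "case_id": compose(entry[f] for f in fields),
            "activity": entry["activity"],
            "timestamp": entry["timestamp"],
            "originator": entry["originator"],
        })
    return out


# Hypothetical entries and chain: the case id is in info1 for orders, info2 for shipments.
LOG = [
    {"activity": "Order received", "timestamp": "2011-03-21 12:00",
     "originator": "anna", "info1": "A1", "info2": "x"},
    {"activity": "Shipping", "timestamp": "2011-03-26 10:15",
     "originator": "bob", "info1": "p", "info2": "A1"},
]
chain = {"Order received": ("info1",), "Shipping": ("info2",)}

result = build_log(LOG, chain)
```

Both output entries receive the case id `A1`, so a process mining tool reading this log would place them in the same trace.
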
  32. Slide 17 of 19 Experimental results

    Work driven by a real business case
    Procedure implementation languages
    - PL/SQL: preprocessing steps (application of a priori knowledge)
    - C#: chains generation and their extension
    Domain specific heuristics
    - Candidate attributes as a string (no timestamp or numbers)
    - Max average length 20 chars
    - Minimum average length 3 chars
    - Maximum variation w.r.t. average length 10
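
The heuristics listed here could be expressed as a filter over the values of one infos field, as in the sketch below. The original preprocessing was written in PL/SQL; this Python version is only illustrative. It interprets "maximum variation w.r.t. average length" as the largest absolute deviation of a value's length from the average, treats the bounds as inclusive, and checks a single timestamp format, all of which are assumptions.

```python
from datetime import datetime
from statistics import mean


def is_plain_string(value):
    """Reject numbers and timestamps; the timestamp format tested is an assumption."""
    if not isinstance(value, str):
        return False
    try:
        float(value)
        return False  # the string parses as a number
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M")
        return False  # the string parses as a timestamp
    except ValueError:
        return True


def is_candidate(values, min_avg=3, max_avg=20, max_dev=10):
    """Apply the length heuristics to all the non-missing values of one infos field."""
    values = [v for v in values if v is not None]
    if not values or not all(is_plain_string(v) for v in values):
        return False
    lengths = [len(v) for v in values]
    avg = mean(lengths)
    deviation = max(abs(n - avg) for n in lengths)
    return min_avg <= avg <= max_avg and deviation <= max_dev
```

For instance, a field holding values such as `"INV-001"` passes the filter, while fields holding numbers, timestamps, very short codes or values of wildly different lengths are discarded before the match procedure runs.
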
  33. Slide 18 of 19 Experimental results – 2

    Logs from document management systems
    - 3 log sources, 7 total logs
    - Log sizes: 10k, 20k, 40k, 60k, 140k, 20k, 30k
    - Number of activities: 13, 39, 47, 2, 4, 12, 16
    - Number of infos fields: 26, 18, 16
    - Running time (chain generation): from 2 seconds to 2 minutes
    Chains discovered

    |          | l1 | l2 | l3 | l4 | l5 | l6 | l7 |
    |----------|----|----|----|----|----|----|----|
    | Maximal  | 2  | 2  | 3  | 1  | 1  | 3  | 3  |
    | Expert’s | 1  | 1  | 2  | 0  | 1  | 1  | 1  |
  35. Slide 19 of 19 Conclusion and future works

    Our environment and approach
    - Incomplete log sources lacking process instance information
    - Case id guessed relying on additional fields decorating the log
    - Case id computed on statistical basis
    - Highly parametric framework (with pros and cons)
    Planned improvements
    - Experimentation on other log sources
    - Organic system for expressing a priori knowledge
    - Exploit semantics of infos fields if available