A Framework for Semi-Automated Process Instance Discovery From Decorative Attributes

Andrea Burattin and Roberto Vigo
Department of Pure and Applied Mathematics, University of Padua, Italy
SIAV S.p.A.
April 14th, 2010

Introduction – What is process mining?

Given a log describing the executions of some activities

Case id  Activity          Execution time      Other fields
1        Order received    mar 21, 2011 12:00  ...
1        Payment received  mar 22, 2011 09:00  ...
2        Order received    mar 23, 2011 15:45  ...
2        Payment reminder  mar 25, 2011 15:45  ...
2        Payment received  mar 25, 2011 17:31  ...
1        Goods available   mar 26, 2011 08:30  ...
2        Goods available   mar 26, 2011 10:00  ...
1        Shipping          mar 26, 2011 10:15  ...
2        Shipping          mar 26, 2011 12:30  ...

Produce a process model (discovery)

[Figure: process model with the activities Order received, Payment reminder, Payment received, Goods available, Shipping]

Work with other perspectives
- Conformance between the execution log and a reference process model
- Relationships among activity originators, other extensions

It is necessary to split the log into a set of traces

Process Instance 1
Activity          Execution time      Other fields
Order received    mar 21, 2011 12:00  ...
Payment received  mar 22, 2011 09:00  ...
Goods available   mar 26, 2011 08:30  ...
Shipping          mar 26, 2011 10:15  ...

Process Instance 2
Activity          Execution time      Other fields
Order received    mar 23, 2011 15:45  ...
Payment reminder  mar 25, 2011 15:45  ...
Payment received  mar 25, 2011 17:31  ...
Goods available   mar 26, 2011 10:00  ...
Shipping          mar 26, 2011 12:30  ...
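The split into traces can be sketched in a few lines of Python. This is only an illustration of the idea when the case id is already known; the tuple layout, the ISO-style timestamps and the function name `split_into_traces` are assumptions made for the example, not part of the original work.

```python
from collections import defaultdict

# Events from the example log: (case id, activity, execution time).
# Timestamps are rewritten in ISO form so that text order equals time order.
events = [
    (1, "Order received", "2011-03-21 12:00"),
    (1, "Payment received", "2011-03-22 09:00"),
    (2, "Order received", "2011-03-23 15:45"),
    (2, "Payment reminder", "2011-03-25 15:45"),
    (2, "Payment received", "2011-03-25 17:31"),
    (1, "Goods available", "2011-03-26 08:30"),
    (2, "Goods available", "2011-03-26 10:00"),
    (1, "Shipping", "2011-03-26 10:15"),
    (2, "Shipping", "2011-03-26 12:30"),
]

def split_into_traces(events):
    """Group events by case id and order each trace by execution time."""
    by_case = defaultdict(list)
    for case_id, activity, when in events:
        by_case[case_id].append((when, activity))
    return {case: [a for _, a in sorted(rows)] for case, rows in by_case.items()}

traces = split_into_traces(events)
```

Without a case id column this grouping key is missing, which is the situation the rest of the deck addresses.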

Introduction

This work is focused on data preparation for process mining on “non standard” business logs

In particular
- Our data sources lack an explicit “case id” field
- This work generalizes a real business case

Goal of this work

The goal is to perform log conversions

(activity, timestamp, originator, info_1, ..., info_m)
⇓
(activity, timestamp, originator, case id[, info_1, ..., info_m])

Basic assumptions
- Availability of additional attributes
- Relations among the values of infos fields identify relations among the corresponding activities

Basic approach
1. Identify relations among infos fields by matching their values
2. Use those relations to relate different activities

Similar work in literature

In the literature there are two types of approaches
1. Domain specific (e.g. SAP)
2. Domain agnostic

Our approach is “hybrid”
1. The domain is a “parameter”
2. We only need each activity to be decorated with additional infos fields whose semantics is fixed per activity

Original log structure

Basic structure of our log

(activity, timestamp, originator, info_1, ..., info_m)

info_i represents the same information for all the log entries with the same activity (e.g. “invoice number”)

In this work we adopt a relational-algebra point of view because
- It seems “natural” to see a log as a relation
- It is close to the implementation

Basic notation

Elementary relational-algebra notation, let
- π be the projection operator
- σ be the selection operator

Other symbols, let
- L be the original log
- P be the power set
- I be the set of infos field names
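To make the notation concrete, here is a minimal Python sketch of the two operators applied to a log stored as a list of dictionaries. The representation, the helper names `select` and `project`, and the sample values are assumptions made for illustration only.

```python
def select(log, **conditions):
    """sigma: keep the entries whose fields equal the given values."""
    return [e for e in log if all(e.get(f) == v for f, v in conditions.items())]

def project(log, fields):
    """pi: keep only the given fields; duplicates collapse, as in a relation."""
    return {tuple(e.get(f) for f in fields) for e in log}

# Hypothetical log L with two infos fields.
L = [
    {"activity": "Order received", "info1": "INV-7", "info2": "x"},
    {"activity": "Order received", "info1": "INV-8", "info2": "x"},
    {"activity": "Shipping", "info1": "y", "info2": "INV-7"},
]

# pi_{info1}(sigma_{activity = "Order received"}(L))
orders = project(select(L, activity="Order received"), ["info1"])
```

Returning a set from `project` mirrors the relational view, where repeated tuples count once.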

Procedure details – Preprocessing

Basic approach reminder
1. Identify relations among infos fields by matching their values
2. Use those relations to relate different activities

We have a lot of infos fields and a complete search on all their combinations is infeasible

Reduction of the search space with a priori knowledge
- Assumptions on the data type (e.g. it would be unlikely that a timestamp represents a case id)
- Assumptions on the case id expected features (e.g. average length bounds, variance, presence/absence of symbols, ...)

These assumptions are “domain specific parameters”
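A quick count shows why the complete search does not scale. The sketch below assumes m infos fields and counts the subsets in P(I), plus the ordered pairs of subsets with equal cardinality (the condition the match procedure imposes); the function names are illustrative and the empty subset is included in both counts.

```python
from math import comb

def subsets(m):
    """Number of subsets of m infos fields, |P(I)| = 2^m."""
    return 2 ** m

def same_size_subset_pairs(m):
    """Ordered pairs (PI_i, PI_j) of subsets with equal cardinality."""
    return sum(comb(m, size) ** 2 for size in range(m + 1))
```

Both quantities grow exponentially in m, and the pairs have to be examined for every pair of activities, so pruning the candidate fields before matching pays off quickly.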

Procedure details – Match procedure

The match procedure for selecting the process instance candidate is

forall the (a_i, a_j) where a_i ≠ a_j do
    forall the PI_i ∈ P(I) and PI_j ∈ P(I) do
        if |PI_i| = |PI_j| then
            k ← |π_{PI_i}(σ_{activity=a_i}(L)) ∩ π_{PI_j}(σ_{activity=a_j}(L))| / |π_{PI_i}(σ_{activity=a_i}(L)) ∪ π_{PI_j}(σ_{activity=a_j}(L))|
            // k is the normalized amount of shared values between PI_i and PI_j for a_i and a_j
            if k > threshold then
                // In this case PI_i and PI_j are sets of fields that relate a_i and a_j
                // Collect all this information
            end
        end
    end
end
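The match procedure can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes the log is a list of dictionaries, reads the normalized sharing k as intersection over union of the projected values, caps the size of the field sets with a hypothetical `max_size` parameter, and visits each unordered pair of activities once because that measure is symmetric.

```python
from itertools import combinations, permutations

def values(log, activity, fields):
    """pi_fields(sigma_activity(L)): distinct value tuples, skipping missing ones."""
    rows = (tuple(e.get(f) for f in fields) for e in log if e["activity"] == activity)
    return {r for r in rows if None not in r}

def field_tuples(fields, max_size, ordered):
    """Field sets up to max_size; ordered variants fix how fields pair up."""
    pick = permutations if ordered else combinations
    return [c for n in range(1, max_size + 1) for c in pick(fields, n)]

def match(log, info_fields, threshold, max_size=1):
    """Return links (a_i, PI_i, k, a_j, PI_j) whose sharing k exceeds threshold."""
    activities = sorted({e["activity"] for e in log})
    links = []
    for a_i, a_j in combinations(activities, 2):          # a_i != a_j
        for pi_i in field_tuples(info_fields, max_size, ordered=False):
            for pi_j in field_tuples(info_fields, max_size, ordered=True):
                if len(pi_i) != len(pi_j):
                    continue
                v_i, v_j = values(log, a_i, pi_i), values(log, a_j, pi_j)
                union = v_i | v_j
                k = len(v_i & v_j) / len(union) if union else 0.0
                if k > threshold:
                    links.append((a_i, pi_i, k, a_j, pi_j))
    return links

# Hypothetical log: the case id sits in info1 for A and in info2 for B.
log = [
    {"activity": "A", "info1": "c1", "info2": "u"},
    {"activity": "A", "info1": "c2", "info2": "v"},
    {"activity": "B", "info1": "p", "info2": "c1"},
    {"activity": "B", "info1": "q", "info2": "c2"},
]
links = match(log, ["info1", "info2"], threshold=0.5)
```

Pruning `info_fields` beforehand with the a priori knowledge keeps the two inner loops small.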

Procedure details – Chains of activities

With the previous algorithm we produce “tuples” of the form:

[a_i, PI_i] --k--> [a_j, PI_j]

where a_i, a_j are activities; PI_i, PI_j are the corresponding sets of infos fields; and k is the amount of values shared by PI_i, PI_j on activities a_i, a_j

PI_i and PI_j exhibit close values, thus they could hide process instance information for a_i, a_j

Now
- We are generally interested in relating more than 2 activities
- Given the set of all those tuples we can join them in “chains”

Example of chains joining

C1 = [a_1, A] --k_1--> [a_2, B]
C2 = [a_3, C] --k_2--> [a_4, D]

If there exists

C3 = [a_2, B] --k_3--> [a_3, C]

We can build

C4 = [a_1, A] --k_1--> [a_2, B] --k_3--> [a_3, C] --k_2--> [a_4, D]
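The joining step can be sketched as below. The chain representation (a list of nodes plus a list of sharing values) and the function name `join` are assumptions made for the example; the joined chain ends at the last node of C2, that is [a4, D].

```python
def join(left, right):
    """Join two chains when the last node of `left` is the first node of `right`."""
    nodes_l, ks_l = left
    nodes_r, ks_r = right
    if nodes_l[-1] != nodes_r[0]:
        return None  # the chains do not share an endpoint
    return (nodes_l + nodes_r[1:], ks_l + ks_r)

# Nodes are (activity, set of infos fields); sharing values are kept symbolic.
C1 = ([("a1", "A"), ("a2", "B")], ["k1"])
C2 = ([("a3", "C"), ("a4", "D")], ["k2"])
C3 = ([("a2", "B"), ("a3", "C")], ["k3"])

C4 = join(join(C1, C3), C2)
```

C1 and C2 cannot be joined directly, so `join(C1, C2)` returns `None`; the bridge C3 is what makes the longer chain possible.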

Procedure details – Chains handling

- The collection of generated chains can be large
- It is not feasible to analyze them all

⇓

- Cardinality reduction, selecting only the “most interesting”
- Ordering operator among chains
- If it is reflexive, antisymmetric and transitive then we can build a partially ordered set and consider the maximal elements as the most promising ones

Procedure details – Chains handling

Example operator. Let A and B be two chains

A ⊑ B ⇔
    A_{PI_1} ≥ B_{PI_1}   if A(A) = A(B) ∧ S(A) = S(B)
    S(A) ≤ S(B)           if A(A) = A(B) ∧ S(A) ≠ S(B)
    A(A) ⊆ A(B)           otherwise

- A(A) is the set of activities of chain A
- S(A) is the average sharing among activities of chain A
- A_{PI_1} is the number of infos fields that define the case id

This operator is based on observations of our real settings
It can be a “domain specific parameter”
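The example ordering and the selection of maximal chains can be sketched as follows. The `Chain` record and the names `precedes` and `maximal` are hypothetical; a chain is reduced here to the three quantities the operator looks at, and the second case is read as applying when the average sharing values differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chain:
    activities: frozenset   # set of activities of the chain
    sharing: float          # average sharing among its activities
    n_fields: int           # number of infos fields that define the case id

def precedes(a, b):
    """True when chain a comes before (or equals) chain b in the example ordering."""
    if a.activities == b.activities:
        if a.sharing == b.sharing:
            return a.n_fields >= b.n_fields   # fewer fields ranks higher
        return a.sharing <= b.sharing          # more sharing ranks higher
    return a.activities <= b.activities        # more activities ranks higher

def maximal(chains):
    """Chains that no other chain strictly dominates."""
    return [c for c in chains
            if not any(precedes(c, d) and not precedes(d, c) for d in chains)]

x = Chain(frozenset({"a1", "a2"}), 0.9, 1)
y = Chain(frozenset({"a1", "a2", "a3"}), 0.7, 1)
z = Chain(frozenset({"a1", "a2", "a3"}), 0.8, 2)
best = maximal([x, y, z])
```

Swapping `precedes` for another comparison is how a different domain would plug in its own notion of "most interesting".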

Procedure details – Final log generation

Given a chain we build a log suitable for the application of process mining techniques

For each original log entry
- Activity, timestamp and originator are the same as in the original log
- Process name is a constant label
- The case id for all the entries of the same activity is given by a composition function applied to the infos fields selected for that activity

We obtain one process log per chain
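A minimal sketch of this generation step is shown below. It assumes the chain is available as a mapping from each activity to the infos fields selected for it, uses string concatenation with a separator as the composition function, and skips entries whose activity the chain does not cover; all three choices, and the name `build_process_log`, are assumptions for illustration.

```python
def build_process_log(log, chain, process_name="process"):
    """Rewrite log entries with a case id derived from the chain's fields."""
    out = []
    for e in log:
        fields = chain.get(e["activity"])
        if fields is None:
            continue  # activity not covered by this chain (assumption: skip it)
        out.append({
            "process": process_name,                               # constant label
            "case_id": "|".join(str(e[f]) for f in fields),        # composition
            "activity": e["activity"],
            "timestamp": e["timestamp"],
            "originator": e["originator"],
        })
    return out

# Hypothetical log and chain: the case id sits in info1 for A, info2 for B.
log = [
    {"activity": "A", "timestamp": "t1", "originator": "u1", "info1": "c1", "info2": "x"},
    {"activity": "B", "timestamp": "t2", "originator": "u2", "info1": "y", "info2": "c1"},
    {"activity": "Z", "timestamp": "t3", "originator": "u3", "info1": "q", "info2": "r"},
]
chain = {"A": ("info1",), "B": ("info2",)}
result = build_process_log(log, chain, "P")
```

Running the function once per selected chain yields one process log per chain.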

Experimental results

Work driven by a real business case

Procedure implementation languages
- PL/SQL: preprocessing steps (application of a priori knowledge)
- C#: chains generation and their extension

Domain specific heuristics
- Candidate attributes as a string (no timestamp or numbers)
- Maximum average length 20 chars
- Minimum average length 3 chars
- Maximum variation w.r.t. average length 10
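The heuristics above could be expressed as a field filter along these lines. The original steps were written in PL/SQL; this Python sketch is only an illustration and makes several assumptions: bounds are inclusive, "variation" is read as the largest deviation of a value's length from the average length, a field is rejected as soon as one value parses as a number, and timestamp detection is left out.

```python
from statistics import mean

def is_number(text):
    """True when the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def is_candidate(values, min_avg=3, max_avg=20, max_variation=10):
    """Decide whether a field's values look like case id material."""
    texts = [v for v in values if isinstance(v, str) and v]
    if not texts or len(texts) != len(values):
        return False          # no values, or some empty or non-string values
    if any(is_number(t) for t in texts):
        return False          # plain numbers are discarded
    lengths = [len(t) for t in texts]
    avg = mean(lengths)
    if not (min_avg <= avg <= max_avg):
        return False          # average length outside the accepted bounds
    return max(abs(n - avg) for n in lengths) <= max_variation
```

Fields that fail the filter never enter the match procedure, which is where most of the search-space reduction comes from.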

Experimental results – 2

Logs from document management systems
- 3 log sources, 7 total logs
- Log sizes: 10k, 20k, 40k, 60k, 140k, 20k, 30k
- Number of activities: 13, 39, 47, 2, 4, 12, 16
- Number of infos fields: 26, 18, 16
- Running time (chain generation): from 2 seconds to 2 minutes

Chains discovered

          l_1  l_2  l_3  l_4  l_5  l_6  l_7
Maximal   2    2    3    1    1    3    3
Expert’s  1    1    2    0    1    1    1

Conclusion and future work

Our environment and approach
- Incomplete log sources lacking process instance information
- Case id guessed relying on additional fields decorating the log
- Case id computed on a statistical basis
- Highly parametric framework (with pros and cons)

Planned improvements
- Experimentation on other log sources
- Organic system for expressing a priori knowledge
- Exploit semantics of infos fields if available
