DEXA 2012 talk

Emir Muñoz
September 03, 2012

Paper: "Performance Analysis of Algorithms to Reason about XML Keys"


Transcript

  1. Performance Analysis of Algorithms to
    Reason about XML Keys
    Emir Muñoz Jiménez
    University of Santiago de Chile
    Yahoo! Labs LatAm
    Joint work with F. Ferrarotti, S. Hartmann, S. Link, M. Marin
    DEXA 2012 @ Vienna, Austria, 3rd September 2012
    Introduction RW & Motivation XML Keys Results Conclusion 1/24


  2. Contribution
    An efficient implementation of an algorithm that decides the
    implication problem for a tractable and expressive class of
    XML keys.
    Performance Analysis:
    Implication problem over large sets of XML Keys.
    Non-redundant covers of XML keys and validation of XML
    documents.
    Our experiments show that reasoning about expressive notions
    of XML keys can be done efficiently in practice and scales well.

  3. Outline
    1 Introduction
    2 Related Work & Motivation
    3 Reasoning about XML Keys
    4 Results
    5 Conclusion

  4. Introduction
    Keys
    XML: de-facto standard for Web data exchange and
    integration.
    XML ↔ data storage ↔ data processing.
    Both industry and academia have long since recognized the
    importance of keys in XML data management.
    Several notions of XML keys have been proposed. The most
    influential is due to Buneman et al., who defined keys on the
    basis of an XML tree model similar to the one suggested by
    DOM and XPath.

  5. Introduction
    XML Tree Example (1/2)
    Example (An XML tree representing an XML document)
    [Figure: XML tree rooted at db, with E, A and S marking element, attribute and text nodes. Each project element has a pname child and team children; each team has a tname child and employee children; each employee has a rol attribute (Team Leader or Team Member) and a name child with fname and lname.]

  7. Introduction
    XML Tree Example (2/2)
    Some Constraints: (gathered as business rules)
    (a) A project node is identified by pname, no matter where the
    project node appears in the document. (Absolute key)
    (b) A team node can be identified by tname relative to a
    project node. (Relative key)
    (c) Within any given subtree rooted at team, an employee node
    is identified by name. (Relative key)

  8. Related Work & Motivation
    Keys are important in many perennial tasks in database
    management.
    The hope is that keys will turn out to be equally beneficial for
    XML.
    In practice, expressive yet tractable notions of XML keys have
    been ignored so far.
    We initiate an empirical study of the fragment of XML keys
    with nonempty sets of simple key paths.
    One of the most fundamental questions on keys is:
    Can we decide if a new key holds given a set of known keys?
    (Logical Implication Problem.)

  10. Related Work & Motivation
    Key implication example
    Some Constraints: (gathered as business rules)
    (a) A project node is identified by pname, no matter where the
    project node appears in the document.
    (b) A team node can be identified by tname relative to a
    project node.
    (c) Within any given subtree rooted at team, an employee node
    is identified by name.
    Now consider a further key:
    (d) A project node can be identified in the document by its child
    nodes pname and team.
    Key (a) actually implies (d), because (d) is a superkey of (a).

  11. Keys for XML
    Definitions (1/2)
    Definition (Value equality)
    u is value equal to v (written u =_v v) if the trees rooted at u and v are
    isomorphic by an isomorphism that is the identity on string values.
    (e.g., the employee element nodes for “Dexter Cooper”)
    [Figure: the XML tree of the db document from the introduction, repeated.]
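The definition of value equality can be sketched as a recursive comparison of two subtrees. This is a minimal illustration, not the authors' implementation: the `Node` class, the `text` and `employee` helpers, and the choice to compare children in document order are all assumptions made for the example.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "E" element, "A" attribute, "S" text
    label: str                     # element or attribute name ("S" for text)
    value: str = ""                # string value of an A or S node
    children: list[Node] = field(default_factory=list)


def value_equal(u: Node, v: Node) -> bool:
    """True if the subtrees rooted at u and v agree in shape, labels and strings."""
    if (u.kind, u.label, u.value) != (v.kind, v.label, v.value):
        return False
    if len(u.children) != len(v.children):
        return False
    # Assumption: children are compared pairwise in document order.
    return all(value_equal(a, b) for a, b in zip(u.children, v.children))


def text(label: str, s: str) -> Node:
    """Hypothetical helper: an element holding a single text node."""
    return Node("E", label, children=[Node("S", "S", s)])


def employee(fname: str, lname: str) -> Node:
    """Hypothetical helper: employee/name/(fname, lname)."""
    name = Node("E", "name", children=[text("fname", fname), text("lname", lname)])
    return Node("E", "employee", children=[name])


print(value_equal(employee("Dexter", "Cooper"), employee("Dexter", "Cooper")))  # True
print(value_equal(employee("Dexter", "Cooper"), employee("William", "Bell")))   # False
```

The two "Dexter Cooper" nodes are distinct objects yet value equal, which is the distinction between node identity and value equality that key semantics rely on.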

  12. Keys for XML
    Definitions (2/2)
    Some Constraints: (in class K of XML keys)
    (a) A project node is identified by pname, no matter where the
    project node appears in the document.
    (ε, (project, {pname}))
    (b) A team node can be identified by tname relative to a
    project node.
    (project, (team, {tname}))
    (c) Within any given subtree rooted at team, an employee node is
    identified by name.
    (_∗.team, (employee, {name}))
    (d) A project node can be identified by its child nodes pname
    and team.
    (ε, (project, {pname, team}))
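One possible in-memory representation of the keys (a) to (d) is a triple of context path, target path and a set of key paths. The sketch below is an assumption for illustration only: the class name `XMLKey`, the helpers `path`, `key` and `show`, and the ASCII spelling `_*` for the wildcard are not taken from the talk.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class XMLKey:
    context: tuple[str, ...]                   # Q  (empty tuple stands for ε)
    target: tuple[str, ...]                    # Q'
    key_paths: frozenset[tuple[str, ...]]      # {P1, ..., Pk}

    @property
    def is_absolute(self) -> bool:
        # Assumption: a key is absolute when its context path is ε.
        return self.context == ()


def path(expr: str) -> tuple[str, ...]:
    """Split a dotted path such as 'issue.articles' into its labels."""
    return tuple(expr.split(".")) if expr else ()


def key(context: str, target: str, *key_paths: str) -> XMLKey:
    return XMLKey(path(context), path(target), frozenset(path(p) for p in key_paths))


def show(k: XMLKey) -> str:
    ctx = ".".join(k.context) or "ε"
    paths = ", ".join(sorted(".".join(p) for p in k.key_paths))
    return f"({ctx}, ({'.'.join(k.target)}, {{{paths}}}))"


KEY_A = key("", "project", "pname")
KEY_B = key("project", "team", "tname")
KEY_C = key("_*.team", "employee", "name")
KEY_D = key("", "project", "pname", "team")

for k in (KEY_A, KEY_B, KEY_C, KEY_D):
    print(show(k), "absolute" if k.is_absolute else "relative")
```

With this encoding, the superkey relationship between (a) and (d) is a plain subset test on `key_paths` for keys that share context and target.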

  13. Deciding XML Key Implication
    Notions
    We say that Σ implies ϕ, denoted by Σ |= ϕ, iff every finite
    XML tree T that satisfies all σ ∈ Σ also satisfies ϕ.
    Implication Problem for a class C of XML keys: given any
    Σ ∪ {ϕ} in C, decide whether Σ |= ϕ.
    E.g., let Σ = {(a), (b), (c)} and ϕ = (d); then Σ |= ϕ holds.
    E.g., let Σ = {(a), (b)} and ϕ = (c); then Σ |= ϕ does not hold.
    Hartmann & Link characterize the implication problem for the
    class K of XML keys in terms of the reachability problem
    for fixed nodes in a suitable digraph.

  14. Deciding XML Key Implication
    Mini-trees and Witness Graphs (Hartmann & Link, 2009)
    [Figure: (a) Mini-tree T_{Σ,ϕ} and (b) witness graph G_{Σ,ϕ}, drawn over nodes labelled db, public, national, project, pname and year, with the nodes q_ϕ and q′_ϕ marked, for the key]
    ϕ = (ε, (public._∗.project, {pname.S, year.S}))

  15. An Efficient Implementation
    Data structures and implementation
    Suitable data structures for Mini-tree & Witness-graph.
    [Figure: (c) Witness graph G_{Σ,ϕ}. (d) Adjacency list for G_{Σ,ϕ}: a list L of node records (Node 0 (db), Node 1 (public), Node 2 (national), Node 3 (project), ...), each record holding vertexEle, nodeEle and one or more edgeEle entries.]
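An adjacency-list digraph of the kind named on this slide can be sketched as follows. The slide gives the field names vertexEle, nodeEle and edgeEle but not their exact meaning, so this example does not reproduce them; the class `Digraph`, its methods and the edges added below are hypothetical.

```python
class Digraph:
    """Directed graph stored as adjacency lists, indexed by node id."""

    def __init__(self):
        self.labels = []   # labels[i] is the label of node i
        self.adj = []      # adj[i] lists the successor ids of node i

    def add_node(self, label):
        """Append a node and return its id."""
        self.labels.append(label)
        self.adj.append([])
        return len(self.labels) - 1

    def add_edge(self, src, dst):
        """Add the edge src -> dst once."""
        if dst not in self.adj[src]:
            self.adj[src].append(dst)

    def successors(self, node):
        """Return (id, label) pairs for the successors of a node."""
        return [(n, self.labels[n]) for n in self.adj[node]]


g = Digraph()
db = g.add_node("db")
public = g.add_node("public")
national = g.add_node("national")
project = g.add_node("project")
g.add_edge(db, public)
g.add_edge(public, national)
g.add_edge(national, project)
g.add_edge(project, db)        # hypothetical extra edge, for illustration only

print(g.successors(national))  # [(3, 'project')]
```

Adjacency lists keep both edge insertion and successor enumeration cheap, which suits a graph that is built once per implication instance and then searched.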

  16. XML Key Reasoning for Document Validation
    Applications
    Fast algorithms for the validation of XML documents against
    keys are crucial to ensure the consistency and semantic
    correctness of data stored or exchanged between applications.
    We use our implementation of the implication problem to
    compute non-redundant covers for sets of XML keys,
    which in turn can be used to significantly speed up the
    validation of XML documents against sets of XML
    keys.

  17. XML Key Reasoning for Document Validation
    Cover Sets for XML Keys
    Fact
    Σ is non-redundant if there is no key ψ ∈ Σ such that
    Σ − {ψ} |= ψ.
    Algorithm 1: Non-redundant Cover for XML keys
    Input : Finite set Σ of XML keys
    Output: A non-redundant cover for Σ
      Θ = Σ;
      foreach key ψ ∈ Σ do
        if Θ − {ψ} |= ψ then
          Θ = Θ − {ψ};
      return Θ;
    This set can be computed in O(|Σ| × (max{|ψ| : ψ ∈ Σ})^2) time.
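Algorithm 1 can be sketched with the implication test passed in as a function. The oracle used here, `superkey_implies`, is a deliberately simplified stand-in (it only recognises superkeys with the same target) and is not the witness-graph procedure from the talk; the tuple encoding of keys is likewise an assumption.

```python
def non_redundant_cover(sigma, implies):
    """Drop every key that is implied by the keys still remaining.

    sigma   -- iterable of hashable keys
    implies -- function (list_of_keys, key) -> bool deciding implication
    """
    theta = list(dict.fromkeys(sigma))        # de-duplicate, keep order
    for psi in list(theta):
        rest = [k for k in theta if k != psi]
        if implies(rest, psi):
            theta = rest                      # psi is redundant
    return theta


def superkey_implies(keys, psi):
    """Toy oracle (assumption): psi follows if some key with the same
    target uses a subset of psi's key paths."""
    target, paths = psi
    return any(t == target and p <= paths for t, p in keys)


a = ("project", frozenset({"pname"}))
d = ("project", frozenset({"pname", "team"}))
b = ("team", frozenset({"tname"}))

cover = non_redundant_cover([a, d, b], superkey_implies)
print(cover == [a, b])  # True: d is a superkey of a and is discarded
```

Swapping in a full implication procedure for the toy oracle leaves the loop unchanged, since the algorithm only needs a yes/no answer per key.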

  18. Experimental Results
    Performance Analysis
    We analyze:
    (i) The scalability of the implication problem.
    (ii) The viability of computing non-redundant cover sets to speed
    up the validation of XML documents against XML keys.
    For (i) we generated large sets of XML keys in the following
    two systematic ways:
    1. using a manually defined set of 5 to 10 XML keys as seeds, we
    computed new implied keys by successively applying the
    inference rules from the axiomatization presented by
    Hartmann & Link (2009).
    2. we defined some non-implied XML keys, adding witness edges
    while keeping q_ϕ not reachable from q′_ϕ.
    For (ii) we generated sets of keys following the same strategy.

  19. Experimental Results – Datasets (1/2)
    Table: XML Documents.
    Doc ID  Document          No. of Elements  No. of Attributes  Size    Max. Depth  Average Depth
    Doc1    321gone.xml                   311                  0  23 KB            5        3.76527
    Doc2    yahoo.xml                     342                  0  24 KB            5        3.76608
    Doc3    dblp.xml                   29,494              3,247  1.6 MB           6        2.90228
    Doc4    nasa.xml                  476,646             56,317  23 MB            8        5.58314
    Doc5    SigmodRecord.xml           11,526              3,737  476 KB           6        5.14107
    Doc6    mondial-3.0.xml            22,423             47,423  1 MB             5        3.59274

  20. Experimental Results – Datasets (2/2)
    SigmodRecord (seed keys)
    (ε, (issue, {volume.S, number.S}))
    (ε, (_∗.issue, {volume.S, number.S}))
    (issue, (articles, {article.title.S}))
    (issue.articles, (article, {title.S}))
    (issue, (articles.article, {initPage.S, endPage.S}))
    (ε, (issue.articles.article.authors.author, {position}))
    Interaction rule
    From (Q, (Q′, {P.P1, ..., P.Pk})) and (Q.Q′, (P, {P1, ..., Pk}))
    infer (Q, (Q′.P, {P1, ..., Pk})).
    Example conclusion: (issue, (articles.article, {title.S}))
    Sigmod.xml (excerpt, text values only)
    13
    2
    Deadlock Detection is Cheap.
    19
    34
    Rakesh Agrawal
    Michael J. Carey
    David J. DeWitt
    ...

  21. Experimental Results
    Strategies to generate implied and non-implied XML keys
    Applying the interaction rule to
    - (issue, (articles, {article.title.S}))
    - (issue.articles, (article, {title.S}))
    yields the implied key
    - (issue, (articles.article, {title.S}))
    We applied the interaction,
    context-target, subnodes, context-path
    containment, target-path containment,
    subnodes-epsilon and prefix-epsilon rules
    whenever possible.
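The interaction-rule step above can be sketched as a purely syntactic function on keys. Keys are encoded here as (context, target, key paths) triples of label tuples, which is an assumption for the example; the function name `interaction` is hypothetical, and the sketch only matches premises whose paths are already split as Q, Q′ and P.

```python
def interaction(k1, k2):
    """From (Q, (Q', {P.P1..P.Pk})) and (Q.Q', (P, {P1..Pk}))
    derive (Q, (Q'.P, {P1..Pk})); return None if the premises do not match."""
    q, q_target, paths1 = k1
    context2, p, paths2 = k2
    if context2 != q + q_target:
        return None                     # second context must be Q.Q'
    if paths1 != frozenset(p + pi for pi in paths2):
        return None                     # first key paths must be P.Pi
    return (q, q_target + p, paths2)


k1 = (("issue",), ("articles",), frozenset({("article", "title", "S")}))
k2 = (("issue", "articles"), ("article",), frozenset({("title", "S")}))

derived = interaction(k1, k2)
print(derived)
# (('issue',), ('articles', 'article'), frozenset({('title', 'S')}))
```

Repeatedly applying rule functions of this shape to a set of seed keys is one way to grow a large set of implied keys for benchmarking.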
    Non-implied keys
    [Figure: tree with nodes db, conference, issue, articles, article, author, first and last, with q′_ϕ marked and ℓ0 = content.]

  22. Results for Implication Problem
    Deciding implication of XML keys
    [Figure: Performance of the Algorithm for the Implication of XML Keys (Σ |= ϕ). (e) XML key implication, all cases: time [ms] (0 to 1.8) against size of Σ (0 to 140), series abs-abs, abs-rel, rel-abs, rel-rel, mix-abs, mix-rel. (f) Effect of wildcard presence: time [ms] (0 to 2.5) against size of Σ (0 to 140), series mix-rel and wildcard.]

  23. Results for Validation Problem
    Document Validation (1/2)
    (a) Non-redundant Cover Sets.
    XML doc.                       Processed keys  Discarded keys  Cover set  Time [ms]
    321gone & yahoo (Doc1 & Doc2)              23              15          8      3.458
    DBLP (Doc3)                                36              24         12     12.757
    nasa (Doc4)                                35              28          7       9.23
    SigmodRecord (Doc5)                        24              19          5      5.294
    mondial (Doc6)                             26              16         10      4.342
    (b) Validation Against Cover Sets.
    [Plot: running time [ms] (0 to 35000) for Doc1 to Doc6, Full-set vs. Cover-set; data labels: 952ms, 1329ms, >30min, >25min, 1.88min, 5min.]
    Figure: Non-redundant Cover Sets of XML keys and Validation of XML
    Documents

  24. Results for Validation Problem
    Document Validation (2/2)
    There are some very expensive XPath queries.
    SigmodRecord
    (issue, (articles.article, {title.S})): the XPath query
    “//issue/articles/article” returns 1503 nodes.
    (ε, (issue.articles.article.authors.author, {S})): the XPath
    query “//issue/articles/article/authors/author” returns 3737 nodes.
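Node counts like the ones above can be reproduced with any XPath-capable library. The sketch below uses Python's standard `xml.etree.ElementTree` on a tiny inline document; the root element name, the second article and the `position` values are made up for illustration, and `count_nodes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a SigmodRecord-style document.
SAMPLE = """
<SigmodRecord>
  <issue>
    <volume>13</volume>
    <number>2</number>
    <articles>
      <article>
        <title>Deadlock Detection is Cheap.</title>
        <initPage>19</initPage>
        <endPage>34</endPage>
        <authors>
          <author position="00">Rakesh Agrawal</author>
          <author position="01">Michael J. Carey</author>
          <author position="02">David J. DeWitt</author>
        </authors>
      </article>
      <article>
        <title>A made-up second article</title>
        <authors><author position="00">A. N. Author</author></authors>
      </article>
    </articles>
  </issue>
</SigmodRecord>
"""

root = ET.fromstring(SAMPLE.strip())


def count_nodes(tree_root, dotted_path):
    """Count the nodes selected by //a/b/c for a dotted path a.b.c."""
    xpath = ".//" + dotted_path.replace(".", "/")
    return len(tree_root.findall(xpath))


print(count_nodes(root, "issue.articles.article"))                 # 2
print(count_nodes(root, "issue.articles.article.authors.author"))  # 4
```

Each selected node has to be compared against the others during validation, so target paths that select thousands of nodes dominate the running time.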

  25. Main conclusion
    This work was motivated by two objectives:
    1 To demonstrate that there are expressive classes of XML keys that can be
    reasoned about efficiently.
    2 To show that our observations on the problem of deciding implication have
    immediate consequences for other perennial tasks in XML database
    management.
    [1.1] We studied a fragment of XML keys with nonempty sets of simple key
    paths.
    [1.1.1] Presenting an efficient implementation for the implication problem.
    [1.1.2] Showing through experiments that the proposed algorithm runs fast in
    practice and scales well.
    [2.1] We studied the problem of validating an XML document against a set of
    XML keys.
    [2.2] Presenting an optimization method for this validation via the
    computation of non-redundant cover sets of XML keys.
    [2.3] Our experiments show that enormous time savings can be achieved in
    practice.

  26. Questions?
    THANKS!
    Emir Muñoz Jiménez – [email protected]

  27. Axiomatization
    Let PL be the path language generated by the grammar Q → ℓ | ε | Q.Q | _∗,
    where ℓ ∈ L is a label, ε is the empty word, “.” is the concatenation
    operator, and “_∗” is the wildcard.
    For nodes v and v′ of an XML tree T, the value intersection
    of v[[Q]] and v′[[Q]] is given by
    v[[Q]] ∩_v v′[[Q]] = {(w, w′) | w ∈ v[[Q]], w′ ∈ v′[[Q]], w =_v w′}
    We define the semantic closure by Σ* = {ϕ ∈ C | Σ |= ϕ}
    and the syntactic closure by Σ+_ℜ = {ϕ | Σ ⊢_ℜ ϕ}.
    A set of rules ℜ is sound (complete) if Σ+_ℜ ⊆ Σ* (Σ* ⊆ Σ+_ℜ).
    A sound and complete set of rules is called an axiomatization.

  28. Deciding XML Key Implication
    Mini-trees and Witness Graphs (2/2)
    Theorem (Hartmann & Link, 2009)
    Let Σ ∪ {ϕ} be a finite set of keys in the class K. We have Σ |= ϕ
    if and only if q_ϕ is reachable from q′_ϕ in G_{Σ,ϕ}.
    Algorithm 2: XML key implication in K
    Input : finite set of XML keys Σ ∪ {ϕ} in K
    Output: yes, if Σ |= ϕ; no, otherwise
      Construct G_{Σ,ϕ} for Σ and ϕ;
      if q_ϕ is reachable from q′_ϕ in G_{Σ,ϕ} then return yes; else return no; end if
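The reachability test in the second step of Algorithm 2 can be sketched with a breadth-first search. Constructing the witness graph from Σ and ϕ is not shown; the graph is assumed to be given as a dictionary of successor lists, and the node names in the toy example are hypothetical.

```python
from collections import deque


def reachable(adj, source, target):
    """Breadth-first search: is target reachable from source in adj?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def implied(adj, q_phi, q_prime_phi):
    """Answer 'yes' exactly when q_phi is reachable from q'_phi."""
    return reachable(adj, q_prime_phi, q_phi)


# Toy graphs (assumption): one intermediate node between q'_phi and q_phi.
with_path = {"q_prime": ["x"], "x": ["q"], "q": []}
without_path = {"q_prime": ["x"], "x": [], "q": []}

print(implied(with_path, "q", "q_prime"))     # True
print(implied(without_path, "q", "q_prime"))  # False
```

Breadth-first search visits every node and edge at most once, so this step is linear in the size of the graph.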