Topic Modeling of Child Fatalities Using Latent Dirichlet Allocation in a Geographical Information Systems Framework

Elise_gia
August 21, 2017

Machine Learning (ML) and Natural Language Processing (NLP) techniques were used to semantically and geographically map, extract and synthesize the unstructured text for the purpose of classification and identification of risk factors for fatal child maltreatment and victimization. Using the resulting vectors as input data, a Self-Organizing Map (SOM) consisting of 12 x 8 dimensions was used to separate the document collection into identifiable clusters based on the similarity relationship between documents. The clusters were then linked to the spatial coordinates associated with each child fatality. The study has specific implications for understanding child death types and their shared modifiable risk factors as well as broader implications associated with how we collect, conceptualize and analyze data across different forms of interpersonal violence.

Transcript

  1. Topic Modeling of Child Fatalities Using Latent Dirichlet Allocation in a GIS Framework

    By Gia Elise Barboza
  2. Introduction & Motivation for Study: Practical

    • One of the main challenges facing the child welfare system is the detection and classification capabilities of surveillance systems
    • Some of the difficulties associated with recognizing and classifying child deaths include:
      • Narrowly focused and inconsistent legal and medical diagnoses and definitions
      • Variation in reporting requirements and definitions
      • Lack of consideration of the broader circumstances associated with child welfare
      • Limited coding options for child deaths, especially those due to neglect or negligence
    • Query: Can we create a more accurate way to classify child death types for the purpose of prevention?
      • Develop a typology of risk factors that incorporates more contextual features
      • Explore the characteristics of and relationship between each typology
      • Accurately predict and classify new cases based on our knowledge of existing typologies
  3. Introduction & Motivation for Study: Analytical

    • It’s relatively easy to give answers to well-specified questions using existing data sources and standard statistical analysis
      • Latent class analysis: risk profiles of early child adversity
      • Spatial models: clustering of gun violence near schools
      • Parallel process models: co-occurrence of psychological symptoms over time
    • When there is no structure to the data and the problem is not well-specified, standard analyses don’t work
    • Unstructured data makes up 80-90% of all data produced by organizations (Holzinger, Stocker et al., 2013), including police reports, court documents and records from human service organizations
      • What patterns exist in complaints filed in protection from abuse orders?
      • What patterns exist in child fatality reports?
    • Machine learning approaches such as text mining and topic modeling can help give meaning to narrative, unstructured text
  4. Background: A Significant Public Health Issue

    • Homicide is the 3rd and 4th leading cause of death among children aged 0-4 and 5-9, respectively (Karch et al., 2009)
    • An estimated 1,750 children (2.36 per 100,000) died from maltreatment in 2016, up 7.4% from 2012 (Child Maltreatment, 2016)
    • In the United States, about 5 children a day die due to fatal child maltreatment
  5. Background: Risk Factors

    • The majority of child deaths occur as a result of maltreatment in the familial environment (Schnitzer & Ewigman, 2005)
    • Most children are killed by family members (Finkelhor & Ormerod, 2001)
    • Most fatal child maltreatment cases involve children under the age of 6 (Finkelhor, 2001)
    • Gender differences in perpetration (Welch & Bonner, 2013)
    • Injuries are believed more likely caused by ‘strong arm’ methods and not weapons (Kunz & Bahr, 1996)
    • History of CPS involvement (?) (Damashek et al., 2013; Anderson et al., 1983)
  6. Background: Individual and Contextual Characteristics of Individuals and Families

    Individual level characteristics
    • Parents’ lack of understanding of children’s needs and child development
    • Negative perception of one’s own situation
    • Lack of personal and social resources
    • An acceptance of physical violence
    • Parental stress and distress, including depression or other mental health conditions

    Relational characteristics
    • Social isolation of families
    • Parents’ history of domestic abuse
    • Disabilities in children that may increase caregiver burden

    (Wallace, 1986; Strang, 1996; Wilczynski, 1995)
  7. Background: An emergent typology of child fatality among children under the age of 18

    • Neonaticide (3%): The maltreatment-related death of an infant within the first 24 hours of life
    • Fatal child abuse (39%): An assault or series of assaults resulting from being punched, hit, kicked, shaken or thrown resulting in significant injuries
    • Family/Domestic homicide (35%): The killing of one’s whole family generally after a relationship breakdown followed by suicide of perpetrator
    • Intentional homicide (14%): A deliberate attempt to murder a child resulting from the severe mental illness of the mother
    • Fatal sexual assault (9%): Killing a child following a sexual assault
    • Fatal neglect: Failure to provide for a child adequate food, clothing, medical attention or lodging, or failure to exercise supervision and/or control

    Lawrence, R. (2004). Understanding fatal assault of children: a typology and explanatory theory. Children and Youth Services Review, 26(9), 837-852.
  8. Limitations of Current Research

    • Few attempts to statistically validate typologies of child deaths
      • But see Roach and Bryant (2015), who examined over 1,000 child homicides in England between 1996 – 2013 and found 4 victim clusters: 3 clusters of fatal child maltreatment distinguished by race, age and gender of child; one cluster describing fatal stabbings resulting from neighborhood ‘feuds’
    • Studies to date have examined child fatality types as if they comprise a single homogeneous group
      • Possibly identifiable subgroups based on contextual characteristics
    • Circumstances surrounding death are socially and legally relevant
      • Many parameters change the social and legal meaning of the event
      • These parameters may change over time (e.g. role of substance abuse and/or domestic violence in child death)
    • Little research has connected child homicide risk with the broader temporal and contextual/neighborhood risk factors of child fatality
  9. Research Goals

    • Data Collection & Processing:
      • To construct novel data sources to explore child deaths
    • Data Analysis:
      • To analyze unstructured narrative text to gain new insight about child deaths and create a child death typology (i.e. look at old data in a new way)
    • Validation:
      • To predict new cases of child deaths for automatic categorization and eventually hierarchical organization
    • Engage community in prevention:
      • To facilitate data sharing so that communities can make use of the data in idiosyncratic ways
  10. 14

  11. Corpus Construction & Preprocessing

    • In general, a text analysis protocol involves:
      • Eliminating nuisance words, errors and inconsistencies
      • Pre-processing procedures
      • A stemming algorithm (“bruises”, “bruised” and “bruising” reduce to one stem)
      • Examining the result

        library(tm)         # corpus handling and text transformations
        library(SnowballC)  # Porter stemmer used by stemDocument
        docs <- tm_map(docs, removeNumbers)
        # Base functions such as tolower must be wrapped so the corpus structure is kept
        docs <- tm_map(docs, content_transformer(tolower))
        docs <- tm_map(docs, removePunctuation)
        docs <- tm_map(docs, removeWords, stopwords("english"))
        # Corpus-specific nuisance words
        docs <- tm_map(docs, removeWords, c("apparent", "nfd", "yearold", "decedent", "report", "baby"))
        docs <- tm_map(docs, stripWhitespace)
        docs <- tm_map(docs, stemDocument)
        docs[[1]]$content[1]  # inspect the first cleaned narrative

      Before: APPARENT FULL TERM FETUS FOUND IN TRASH DUMPSTER, IN PLASTIC BAG. UNKNOWN MECHANISM OF INJURY THE DECEDENT IS A NEWBORN. IT IS BELIEVED SHE WAS DELIVERED IN THE BATHROOM OF A BOAT AND PUT IN THE TRASH. 2/15/2003.
      After: appar full term fetus trash dumpster plastic bag unknown mechan injuri believ deliv bathroom boat put trash

    • Matrix representation (TDM and DTM)

        tdm <- TermDocumentMatrix(docs)  # rows are terms, columns are documents
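  12. Term-Document Matrix

    Term     | 2005-06226 | 2005-06226 | 2005-06226 | 2005-06226 | 2005-06226 | 2005-06226 | 2005-06226
    abdomen  | 0 | 1 | 0 | 1 | 0 | 1 | 0
    abdomin  | 0 | 1 | 0 | 0 | 0 | 0 | 1
    abras    | 1 | 0 | 1 | 0 | 0 | 1 | 0
    belt     | 1 | 0 | 1 | 0 | 0 | 0 | 1
    bruis    | 0 | 1 | 0 | 0 | 0 | 0 | 0
    care     | 0 | 1 | 0 | 0 | 0 | 0 | 0
    dumpster | 0 | 0 | 0 | 1 | 0 | 0 | 0
    hematoma | 1 | 0 | 0 | 0 | 1 | 0 | 0
    inflict  | 0 | 0 | 0 | 0 | 0 | 0 | 0
    pariet   | 0 | 0 | 1 | 0 | 0 | 1 | 0
    pregnanc | 0 | 0 | 1 | 1 | 0 | 0 | 1
    trash    | 0 | 1 | 0 | 1 | 0 | 0 | 1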
  13. Zipf’s Law

    THE FREQUENCY OF ANY WORD IS INVERSELY PROPORTIONAL TO ITS RANK

    • Some words have no discriminatory power
    • Words useful for detecting dominant patterns
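
One quick way to check the rank-frequency relationship on a corpus is to multiply each word's rank by its count: under an idealized Zipf distribution the product stays roughly constant. The sketch below assumes a hypothetical token stream constructed to follow count = 12 / rank exactly; real corpora only approximate this.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) tuples, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(r, w, c, r * c) for r, (w, c) in enumerate(ranked, start=1)]

# Hypothetical tokens: counts 12, 6, 4, 3 for ranks 1 to 4.
tokens = ["injuri"] * 12 + ["head"] * 6 + ["trash"] * 4 + ["boat"] * 3

for rank, word, count, product in rank_frequency(tokens):
    print(rank, word, count, product)
```

Words at the very top of such a ranking tend to appear in nearly every document, which is why they carry little discriminatory power.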
  14. Step 3 Data Analysis

    • Exploratory Data Analysis (Descriptive)
    • Latent Dirichlet Allocation (topic modeling & dimensionality reduction)
    • Self-Organizing Maps (Clustering and Visualization)
  15. Topic Modeling

    • Topic modeling aims to automatically discover the hidden thematic structure in a large corpus of text documents
    • Topic modeling is an unsupervised text mining approach
    • Once the number of topics is ascertained, topic modeling can be used to:
      • Uncover the meaning of topics in a corpus (q1)
      • Estimate the prevalence of each topic (q2)
      • Describe the relationships between topics (q3)
      • Discover how topics change over time (i.e. identify ‘hot’ or ‘cold’ topics) (q4)
      • Estimate posterior distributions and merge them with metadata for further analysis (q5)
  16. Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003; Millar, Peterson & Mendenhall, 2009)

    • LDA is a generative probabilistic latent variable model that describes how documents in a dataset are created
      • Each document exhibits multiple topics
      • A topic is a multinomial distribution over words
      • A document is a multinomial distribution over latent topics
    • The machine learning problem is that we need to infer the latent topic structure
      • The model uses observed documents and words to infer hidden structure by estimating p(topic|document) and p(word|topic)
      • LDA “learns” a set of thematic topics or clusters from co-occurring words in documents
    • Assumptions
      • Mixed membership model: the topics are fixed, but how much each document exhibits each topic changes
      • Word order doesn’t matter
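
The two conditional distributions named on this slide combine into a mixture. As a sketch in the notation of Griffiths and Steyvers (the symbols θ for p(topic|document) and φ for p(word|topic) are assumed here, since this slide does not name them), the probability of word w in document d sums over the T topics:

```latex
p(w \mid d) \;=\; \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d)
           \;=\; \sum_{t=1}^{T} \phi_{t,w}\, \theta_{d,t},
\qquad \sum_{w} \phi_{t,w} = 1, \quad \sum_{t=1}^{T} \theta_{d,t} = 1 .
```

Because the sum ignores where a word sits in the document, this expression also makes the "word order doesn't matter" assumption explicit.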
  17. Plate diagram nodes: α, θd, zd,n, wd,n, φ; plates: M, N, T

    • φ: per-topic word distribution
    • θd: per-document topic distribution
    • zd,n: the topic from which a particular word is drawn

    1. Choose values for the hyperparameters α and β and the number of topics, T.
    2. For each document:
      a. Randomly choose a distribution over topics (a multinomial of length T)
      b. For each word:
        (i) Probabilistically draw one of the T topics from the distribution over topics
        (ii) Probabilistically draw one of the W words from the per-topic word distribution
    3. Perform document clustering by:
      a. Inferring the latent structure, for fixed α, β and T, using Gibbs sampling (Griffiths and Steyvers, 2004)
      b. Estimating the per-document topic distribution for each document (this is the input for the SOM)
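
The generative story in steps 1 and 2 can be simulated directly. The following is a minimal sketch using only the Python standard library; the vocabulary, corpus size and function names are hypothetical, and symmetric Dirichlet priors are assumed for both the topic and the word distributions.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha, 1) draws."""
    # The floor guards against a draw underflowing to exactly zero.
    draws = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, vocab, T, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic, drawn once for the whole corpus.
    phi = [dirichlet([beta] * len(vocab), rng) for _ in range(T)]
    corpus = []
    for _ in range(n_docs):
        theta = dirichlet([alpha] * T, rng)               # step 2a
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(T), weights=theta)[0]   # step 2b(i)
            w = rng.choices(vocab, weights=phi[z])[0]     # step 2b(ii)
            words.append(w)
        corpus.append(words)
    return phi, corpus

# Hypothetical vocabulary of stems; the hyperparameter values are illustrative.
vocab = ["trash", "newborn", "gun", "shot", "bathtub", "unattend"]
phi, corpus = generate_corpus(n_docs=3, doc_len=20, vocab=vocab,
                              T=2, alpha=0.1, beta=0.01)
print(corpus[0])
```

Small values of α and β make each document concentrate on few topics and each topic on few words, which is the behaviour inference then tries to recover in reverse.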
  18. Self-Organizing Maps

    • An unsupervised learning algorithm useful to identify clusters of related documents based on the similarity between them.
    • SOM consists of a fixed lattice of neurons connected to adjacent neurons
    • Each neuron is associated with a prototype vector that is the same dimension as the input space
    • During training, the processing element with the shortest distance to the input vector is identified
    • The prototype vector and all other elements within a neighborhood move in the direction of the input vector
    • Over time, the map converges to a 2-dimensional representation of the original sample space
    • Geo-SOM is similar but spatial dependence is incorporated into the learning process
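
The training loop described in these bullets can be sketched as follows. This is an illustrative simplification, not the software used in the study: it assumes a rectangular rather than hexagonal lattice, a Gaussian neighborhood, linearly decaying learning rate and radius, and hypothetical three-topic input vectors.

```python
import math
import random

def best_matching_unit(protos, x):
    """Grid position of the neuron whose prototype is closest to x."""
    return min(protos, key=lambda node: math.dist(protos[node], x))

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, radius0=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One prototype vector per neuron, same dimension as the inputs.
    protos = {(r, c): [rng.random() for _ in range(dim)]
              for r in range(rows) for c in range(cols)}
    for step in range(n_iter):
        decay = 1.0 - step / n_iter            # shrinks from 1 towards 0
        lr = lr0 * decay
        radius = max(radius0 * decay, 1e-3)
        x = rng.choice(data)
        bmu = best_matching_unit(protos, x)
        for node, w in protos.items():
            # Neurons near the winner on the grid move more than distant ones.
            h = math.exp(-math.dist(node, bmu) ** 2 / (2 * radius ** 2))
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])
    return protos

# Hypothetical per-document topic proportions (each row sums to 1).
data = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05],
        [0.05, 0.05, 0.90], [0.10, 0.05, 0.85]]
som = train_som(data, rows=3, cols=2)
print(best_matching_unit(som, data[0]), best_matching_unit(som, data[2]))
```

After training, each document is assigned to its best-matching neuron, and documents that share a neuron or neighboring neurons can be read as a cluster.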
  19. Summary of Approach

    • Pre-process the data and encode as word histograms
    • Using LDA with Gibbs sampling, compute posterior distributions
    • Estimate the per-document topic distributions and per-topic word distributions
    • Label, visualize and interpret the topics
    • Merge results with metadata for further analysis
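
The Gibbs sampling step can be illustrated with a compact collapsed sampler in the style of Griffiths and Steyvers (2004). This is a teaching sketch under stated assumptions, not the implementation used for the results: the function name, the toy documents and the iteration count are hypothetical, and no burn-in or convergence checking is included.

```python
import random

def lda_gibbs(docs, T, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    ndt = [[0] * T for _ in docs]                       # document-topic counts
    ntw = [dict.fromkeys(vocab, 0) for _ in range(T)]   # topic-word counts
    nt = [0] * T                                        # tokens per topic
    z = []                                              # topic of every token
    for d, doc in enumerate(docs):                      # random initialisation
        z.append([])
        for w in doc:
            t = rng.randrange(T)
            z[d].append(t)
            ndt[d][t] += 1
            ntw[t][w] += 1
            nt[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                             # remove current token
                ndt[d][t] -= 1
                ntw[t][w] -= 1
                nt[t] -= 1
                weights = [(ndt[d][k] + alpha) * (ntw[k][w] + beta)
                           / (nt[k] + V * beta) for k in range(T)]
                t = rng.choices(range(T), weights=weights)[0]
                z[d][i] = t                             # add it back
                ndt[d][t] += 1
                ntw[t][w] += 1
                nt[t] += 1
    theta = [[(ndt[d][k] + alpha) / (len(doc) + T * alpha) for k in range(T)]
             for d, doc in enumerate(docs)]
    phi = [{w: (ntw[k][w] + beta) / (nt[k] + V * beta) for w in vocab}
           for k in range(T)]
    return theta, phi

# Hypothetical stemmed documents.
toy_docs = [["trash", "newborn", "bag", "trash"],
            ["gun", "shot", "wound", "shot"],
            ["newborn", "bag", "plastic"]]
theta, phi = lda_gibbs(toy_docs, T=2, alpha=0.1, beta=0.01)
print([round(p, 2) for p in theta[0]])
```

The rows of `theta` are the per-document topic distributions that the approach passes on to the SOM, and each entry of `phi` is a per-topic word distribution that can be sorted to label the topic.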
  20. Step 3a Results: EDA & Visualization

    R, Python & SPAWNN
    Term Frequency, Importance, Distribution, Correlation and Similarity
  21. Results

    • More than half (54.2%) of all children who died were less than 1 year old
    • Most common cause was listed as other (28.9%) followed by blunt force injury (19.2%) and gunshot (12.6%)
    • Almost half of the children who died were of Latino origin (49.6%) and a slight majority were male (53.5%)

    Descriptive Statistics of Coroner’s Data of Child Deaths 2000 - 2017

    Variable | Category | % | N
    Age | 0 | 54.2 | 246
    Age | 1 | 16.5 | 75
    Age | 2 | 11.7 | 53
    Age | 3 | 7.5 | 34
    Age | 4 | 6.2 | 28
    Age | 5 | 4 | 18
    Cause | Blunt-force | 19.2 | 87
    Cause | Gun Shot | 12.6 | 57
    Cause | Other | 28.9 | 131
    Cause | Stabbing | 3.1 | 14
    Cause | Strangled | .9 | 4
    Cause | Pending | .2 | 1
    Gender | Male | 53.5 | 243
    Gender | Female | 43.8 | 199
    Gender | Unk | 2.6 | 12
    Race | Asian | 3.7 | 17
    Race | Black | 28.6 | 130
    Race | Latino | 49.6 | 225
    Race | White | 10.6 | 48
    Race | Unk
  22. Corpus Word Cloud

    • Someone (e.g. boyfriend)
    • Harmed (e.g. suffocated)
    • Child’s person (e.g. head, abdomen)
    • By acting or failing to act (e.g. hitting)
    • In some location (e.g. bathtub)
    • Provoking a response (e.g. police, DCFS)
  23. Shoot Wound Gun Vehicle Drove Multiple Stab Bag Plastic

    Newborn Wrap Cord Trash Care Left Fell Batter Bruise Boyfriend Bath Drown Unattend Bathtub Water
  24. Initial Values

    • LDA
      • Applied to the cleaned corpus representing 454 documents consisting of 668 distinct words from 2000 – 2018
      • T = 13; initially α = 0.1, β = 0.01
    • LDA-SOM
      • Applied to a hexagonal grid of 12 x 8 = 96 neurons, radius = 1 and 100,000 training cycles
      • The prototype vector is from the LDA
      • Captures topical content
      • Clusters documents based on their topical representations
  25. Step 3b Results: Topic Modeling & Visualization

    R, Python & SPAWNN
    LDA (topic modeling) & GEO-SOM (Visualization)
  26. Per-Topic Word Distributions for 15 words for 13 topics

    • Fatal Neglect
    • Maternal Homicide/Suicide
    • Unknown Mechanism
    • Family/Domestic Violence
    • Co-sleeping/Sleeping
    • Acute Child Abuse
    • Asphyxiation
    • Fatal fall/injury
    • Stabbing
    • Chronic Child Abuse
    • Community Violence
    • Neonaticide
    • Multiple Traumatic Injuries
  27. Topic Prevalence

    TOPIC | TOPIC LABEL | % TOPIC (COHERENCE) | % CR
    1 | Fatal Neglect | 6.4 (0.365) | --
    2 | Maternal Homicide/Suicide/Assault | 9.8 (0.145) | --
    3 | Unknown Mechanism (undetermined) | 3.2 (0.086) | 29.3
    4 | Family/Domestic Violence | 3.8 (0.232) | --
    5 | Co-sleeping/Sleeping | 10.8 (0.085) | --
    6 | Acute Child Abuse | 10.3 (0.203) | --
    7 | Asphyxiation/Strangulation | 5.5 (0.234) | 0.9
    8 | Fatal Fall/Injury | 13.8 (0.392) | --
    9 | Stabbing | 6.9 (0.322) | 3.2
    10 | Chronic Child Abuse | 7.6 (0.100) | --
    11 | Community Violence (gun shot) | 7.8 (0.694) | 12.6
    12 | Neonaticide | 9.8 (0.388) | --
    13 | Multiple Traumatic Injury (blunt-force) | 4.3 (0.133) | 19.2
    | Blank | -- | 34.8
  28. Topic 11: Community Violence

    Topic 12: Neonaticide
    Topic 6: Acute Child Abuse
    Topic 1: Fatal Neglect
  29. 44

  30. Topic Label | Common Offenders | Top n Terms, per-topic word distribution (φ) | Antecedents/Circumstance | Common Risk | Location

    Fatal neglect: Grandmother; babysitter left, minutes, unresponsive, unattended, water, abdomen, bathtub, fell, bumped, burn, grandmother, dirty, pounds, starve, urine Left unattended minutes Physical injury; Distracted caregiver, mental illness, DCFS; Bathtub or swimming pool
    Homicidal-Suicidal Mother: Mother Deliv, fetus, week, birth, gestat, born, prematur Intentional drowning, stabbing History of depression Home
    Unknown Mechanism: Mother Infant, injuri, floor, mouth, put, blood, check Feeding, cough, DCFS Home, hotel
    Family Violence/Domestic homicide: Father Famili, sister, brother, suicide, domest, life, murder, neighbor Homicide-suicide; history of abuse prison time, Arson (intentional) Domestic abuse; divorce, custody dispute, Mental illness; murder suicide, argument Residence, bedroom
    Co-Sleeping or Sleeping: Parents Histori, medic, unrespons, bed, known, play, night, recent, sleep, lie, ill, nose, good, health, cold, drug, pillow Suffocation; Ingest drugs, wine or “medicine”; toxicity Fever, Drug use (meth); alcohol use; multiple siblings in bed, unsafe sleep area Bed, sofa, crib
    Fatal child abuse: Father, mother’s boyfriend, child care provider; foster parents Abus, shaken, foster, syndrome, hit, rib, aunt, prior Shaken baby syndrome; trauma; blunt force trauma; battered child, restrained; striking child with hand; Developmental delay or illness; history of abuse, foster care; child crying Residence, Closet
    Asphyxiation: Caregiver, staff, mom Arriv, emerg, resuscit, breath, cardiac, vomit, staff, fell, stop, seizure, feed Vomiting, seizure, some health issue Neglect No pattern detected
    Injury/Fall: Father Injuri, hemorrhag, head, subdur, fractur, retin, hematoma, brain, skull, fall, bilater, acut, edema, traumat Petechia; hyper/hypothermia; injury from fall; Get child to sleep; child ill, no known history; good health Home
    Multiple Stab: Female Fire, work, stab, weapon, kill, throat No pattern detected No pattern detected No pattern detected
    Multiple Abuse: Boyfriend, step-father Bruis, care, bodi, abdomen, assault, batter, burn, sexual, physic, choke, dcfs, pain, abdomen, forehead, cheek, kit, lacer Physical and sexual abuse No pattern detected No pattern detected
    Community Violence: Non-family member; rival gang (known or unknown), uncle shot, head, arm, back, chest, multiple, male, weapon, wounds, front, standing Intrauterine fetal demise secondary to mother’s gunshot or knife wound Drive by shooting, gang affiliation car; street; home
    Neonaticide: Mother Bag, appear, place, trash, plastic, newborn, term, still, unknown, wrap, dumpster, neck, box, umbel, towel asphyxia Mother was unaware of pregnancy; no prenatal care Trash, closet
    Multiple Injury (not identified as abusive): Siblings, male Head, multiple, unrespons, blunt, forc, brain, abdomen, sever, Investigation pending No pattern detected Home
  31. Demographic Characteristics and Common Antecedents of LDA Topics

    Characteristic | Community Violence | Domestic Violence | Homicidal-Suicidal Mother | Fatal Child Neglect | Fatal Child Abuse | Neonaticide
    Race: White | 4.3 | 11.8 | 20.0 | 17.0 | 6.5 | 12
    Race: Black | 27.7 | 26.5 | 23.3 | 38.3 | 32.3 | 24
    Race: Latino | 57.4 | 47.1 | 53.3 | 40.0 | 57.0 | 40
    Race: Asian | 6.4 | 8.8 | 3.3 | 17.0 | 1.1 | 8
    Gender: Female | 46.8 | 47.1 | 43.3 | 34.0 | 40.9 | 56
    Gender: Male | 44.7 | 50.0 | 56.7 | 66.0 | 59.1 | 36
    Age (mean) | 1.43 (1.67) | 1.59 (1.74) | 1.50 (1.80) | .98 (1.34) | 1.33 (1.35) | 0.0 (0.00)
    Antecedents: DCF involvement | - | 8.8 | 3.3 | 12 | 9.7 | 3
    Antecedents: Substance Use | - | 2.9 | 0 | 4.3 | 6.5 | -
    Antecedents: Mental Illness | - | 5.6 | 19.1 | 13.3 | 9.7 | 8
    Antecedents: Weapon Use | 100 | 32.4 | 33.3 | 23.4 | 32.3 | 28

    Neighborhoods: Compton, Inglewood Wilmington Long Beach, Lancaster, Palmdale Long Beach --
  32. Documents within each of the seven clusters share similar topics, and neighboring clusters may have one or more topics in common

    Map labels: Co-sleeping, Community Violence, Neonaticide, Fatal Neglect, Mechanism Unknown, Mechanism Unknown, Family Violence, Family Violence, Homicide Mother, Injury/Fall, Stabbing, Stabbing, Fatal Abuse
  33. Fatal Neglect Hom_mo Mech_unk Fam_vio Co-sleep Fatal Abuse Asphy

    Inj_fall M_stab M_injuries Comm_vio Neonaticide
  34. Step 4 Prediction & Validation

    Can supervised ML predict new cases of child death types?
    https://github.com/elisegia/coroner_data/blob/master/Predictive%20Analytics%20of%20Coroner%20Data.ipynb
  35. UNIGRAMS | BIGRAMS | TRIGRAMS (top terms per death type)

    Fatal injury/fall: play . pillow . unresponsive . recent foul play . unresponsive crib . unresponsive bed . known history foul play suspected . subdural hematoma retinal . hematoma retinal hemorrhages . multiple blunt force
    Child maltreatment: battered . abuse . shaken . Bruises retinal subdural retinal hemorrhages . suspected abuse . physical abuse multiple bruises. shaken syndrome declared brain dead . resuscitated cardiac arrest . sexual assault kit . bilateral retinal hemorrhages
    Family violence: siblings . brother . evening . self suicide sharp force . self inflicted inflicted gunshot family residence multiple stab wounds. self inflicted gunshot . inflicted gunshot wound
    Fatal neglect: bathtub . neglect . left . grandmother left unattended . infant unresponsive . petechial hemorrhages . death scene left unattended bathtub . resuscitated emergency room . blunt force head . progressed brain death
    Community violence: shot . wound . shooting . uncle drive shooting . gunshot wound . shot head gunshot wound head . multiple stab wounds . multiple blunt force . multiple gun shot
    Neonaticide: fetus . newborn . plastic . bag plastic bag . newborn infant . weeks gestation . trash dumpster unknown mechanism injury . intrauterine fetal demise . blunt force head . presented emergency room
    Asphyxiation: . cough . sids . pillow . signs stomach bottle sleeping bed bed father foul play suspected . history good health
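
Unigram, bigram and trigram features of the kind listed on this slide can be extracted by sliding a window over the token sequence. The sketch below is illustrative only; the narrative is a hypothetical string assembled from terms on the slide, and the linked notebook may build its features differently.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned narrative (stop words already removed).
narrative = "left unattended bathtub infant unresponsive left unattended".split()

for n in (1, 2, 3):
    print(n, Counter(ngrams(narrative, n)).most_common(2))
```

Counting these n-grams per document gives a feature matrix that a supervised classifier can be trained on.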
  36. 55

  37. Predicting New Cases of Child Death Types

    Topic Label | Precision | Recall | F1-Score | Support
    Child maltreatment | 0.64 | 0.95 | 0.76 | 100
    Asphyxiation | 0.63 | 0.40 | 0.49 | 42
    Family violence | 0.60 | 0.20 | 0.30 | 15
    Fatal neglect | 0.83 | 0.50 | 0.62 | 48
    Community violence | 0.88 | 0.67 | 0.76 | 52
    Neonaticide | 0.71 | 0.76 | 0.74 | 61
    Ave/Total | 0.72 | 0.70 | 0.68 | 318

    Neglect predicted as child maltreatment: BABY IS DIRTY AND THIN. HE HAS SCRATCHES AND SMALL SCARS TO BODY. RED MARK TO RIGHT NECK AND POSSIBLE BRUISES TO TORSO. THE CHILD'S HEALTH HISTORY IS UNKNOWN. PARAMEDICS REPORTED THAT THE CONVERTED GARAGE APARTMENT WAS DIRTY AND CLUTTERED.
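
The F1-score column is the harmonic mean of precision and recall, so it can be re-derived from the other two columns. The sketch below copies the reported precision, recall and support values from the table; treating the "Ave/Total" row as a support-weighted mean is an assumption, although the precision average is consistent with it.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) as reported on the slide.
reported = {
    "Child maltreatment": (0.64, 0.95, 100),
    "Asphyxiation": (0.63, 0.40, 42),
    "Family violence": (0.60, 0.20, 15),
    "Fatal neglect": (0.83, 0.50, 48),
    "Community violence": (0.88, 0.67, 52),
    "Neonaticide": (0.71, 0.76, 61),
}

total = sum(s for _, _, s in reported.values())
weighted_precision = sum(p * s for p, _, s in reported.values()) / total

for label, (p, r, s) in reported.items():
    print(f"{label:<20} F1 = {f1_score(p, r):.2f}")
print("support total:", total, "weighted precision:", round(weighted_precision, 2))
```

Recomputed values can differ from the slide in the last digit because the reported precision and recall are themselves rounded. The low recall for family violence (0.20) shows why F1 is more informative than precision alone for the smaller classes.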
  38. Limitations

    • Bag-of-words assumption: Does word order matter? (Bigram TM?)
      • The mother of the child couched the abuse and lied
      • The child of the mother couched after the abuse and lied
    • Difficulty and arbitrariness of labeling and deriving topics and categories
      • Some words were very common across all categories
      • These words are not discriminatory but may be important (e.g. ‘mother’ & ‘trauma’)
    • Neglect was hard to distinguish from physical injury; in most neglect cases children died from some traumatic injury
      • When the mechanism of injury was ‘unknown’ or when the story was inconsistent and unresolved
      • The child death types attributed to ‘injury/fall’ which are associated with ‘retinal hemorrhages’ may be diagnostic of child abuse in the absence of a documented history of major trauma, consistent with previous research
    • Questionable cases involved subjectivity (e.g. when a child drowns is it an accident, negligence or neglect?)
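
The bag-of-words limitation is easy to demonstrate: two sentences with opposite meanings can have identical word counts and only become distinguishable once word pairs are counted. The sentences below are hypothetical illustrations, not taken from the slide or the data.

```python
from collections import Counter

def bag(text, n=1):
    """Multiset of n-grams (as tuples) in a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "the mother struck the child"
b = "the child struck the mother"

print(bag(a) == bag(b))        # unigram counts are identical
print(bag(a, 2) == bag(b, 2))  # bigram counts differ
```

A bigram topic model would see these two sentences as different, at the cost of a much larger and sparser vocabulary.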
  39. Summary of Main Findings/Observations

    • Potentially new methods of data acquisition and analysis coupled with coordinated efforts may lead to more targeted prevention efforts
    • Confirms and extends existing typologies
    • Clustering of deaths in general suggests shared underlying risk in distinct parts of the county
      • Even in distinct clusters children are exposed to multiple risk factors
      • Why are these cases nonexistent in wealthy/white neighborhoods? Is this a civil rights issue?
      • Are parenting classes effective without systems change? Is systems change effective without understanding psychopathology?
    • Significant hot (multiple traumatic injuries) and cold trends (neonaticide, fatal child abuse) were identified
      • Perhaps changes to the law are responsible for reduction (e.g. Safe Surrender (2001); changes to CANRA (2006))
      • Nevertheless, underlying causes must be acknowledged in context
    • Fatal child maltreatment is the most common death type; physical harm may mask the nature of the injury in neglect cases (some of which may be preventable and unintentional)
    • Weapons are frequently used in child fatality even in child maltreatment cases
      • In many cases, this was fetal demise -- are we pro-life in these cases?
  40. Policy implications

    • Child death should be defined as a community or structural problem
      • Draw attention to structural vulnerability as well as socioeconomic/behavioral vulnerability, particularly those community characteristics associated with unsafe environments for children (alcohol outlets – zoning fix)
    • Stop using foster care and CJ system for prevention
      • Lancaster and Palmdale have among the highest reports of child maltreatment and a greater concentration of foster homes than any other area in LA county (Chronicle of Social Change, 2016).
    • Examine spatial clusters
      • Policy should consider the broader context in which child deaths occur and their effect on communities (of color, siblings of murdered children, etc.)
    • Prevention must consider multiple forms of adversity in context
    • Unique risk factors for child death types
      • Ex. Substance use was commonly associated with death due to co-sleeping; leaving a child unattended was most often associated with a bathtub or swimming pool
    • Target more of the manner and circumstances surrounding death
      • Different perpetrators (grandma, foster care, babysitters), different causes (guns, knives), in certain locations (foster homes, day cares, neighborhoods), with different antecedents (mental illness, divorce, DCF involvement)
  41. Future Directions

    • Text Mining Twitter for Attitudes Towards @metoo
    • GEO-SOM of Social Disorganization and the Likelihood of Reporting Child Abuse
    • LDA on DV Related Fatalities in Arizona (!) https://www.acesdv.org/fatality-reports/