Slide 1

Slide 1 text

hackistry, or hacking chemical data for knowledge extraction
Łukasz Mentel
June 19, 2015
Catalysis Group, UiO

Slide 2

Slide 2 text

About me
Ph.D. studies at Vrije Universiteit in Amsterdam and POSTECH, South Korea, with Prof. E. J. Baerends and Dr. O. V. Gritsenko

Slide 5

Slide 5 text

About me
Thesis: Reduced Density Matrix Inspired Approaches to Electronic Structure Theory

Approximate the solutions of the Schrödinger equation

$$\hat{H}\Psi = E\Psi, \qquad E[\Psi] = \langle \Psi | \hat{H} | \Psi \rangle$$

as a functional of the One-Body Reduced Density Matrix

$$E[\gamma] = \langle \hat{H} \rangle_{\gamma}$$

where the Density Matrix is defined as

$$\gamma(x_1, x_1') = N \int dx_2 \cdots dx_N \, \Psi(x_1, x_2, \ldots, x_N)\, \Psi^{*}(x_1', x_2, \ldots, x_N)$$
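The density matrix definition can be checked numerically on a toy model. The sketch below is an illustration only, not taken from the thesis: it assumes a discrete system with M one-particle states and N = 2 fermions, so the integral over the second coordinate becomes a sum over an index. All names (`psi`, `gamma`, `M`) are chosen here for the example.

```python
import numpy as np

# Assumption: M discrete one-particle states, N = 2 fermions,
# so the wavefunction is an M x M array psi[i, j].
rng = np.random.default_rng(0)
M, N = 4, 2

# Random amplitudes, antisymmetrised and normalised to unity.
a = rng.normal(size=(M, M))
psi = a - a.T
psi /= np.linalg.norm(psi)

# Discrete analogue of the definition:
# gamma[i, k] = N * sum_j psi[i, j] * conj(psi[k, j])
gamma = N * psi @ psi.conj().T

trace = np.trace(gamma)                  # should equal the particle number N
occupations = np.linalg.eigvalsh(gamma)  # natural occupation numbers
```

In this toy model the trace of `gamma` recovers the particle number, and the eigenvalues (natural occupation numbers) stay between 0 and 1, as expected for fermions.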

Slide 6

Slide 6 text

introduction

Slide 8

Slide 8 text

The Jargon

Key concepts
∙ programmatic data access/manipulations
∙ data mining/data exploration

Toolbox
∙ elements
∙ zefram
∙ colorcif

Slide 11

Slide 11 text

math·e·ma·ti·cian (n)
A mathematician is a device for turning coffee into theorems.
-- Alfréd Rényi

pro·gram·mer (n)
An organism capable of converting caffeine into code.
-- Urban dictionary

che·mist (n)
Knows how to make the caffeine.

Slide 12

Slide 12 text

Scientist

    class Scientist(object):

        def __init__(self, first_name, last_name, age):
            # Attributes
            self.first_name = first_name
            self.last_name = last_name
            self.age = age

        # Methods
        def full_name(self):
            return self.first_name + ' ' + self.last_name

        def get_older(self):
            self.age = self.age + 10

Slide 13

Slide 13 text

Scientist cont'd

Create an instance

    doc = Scientist("Peter", "Strangelove", 45)

Attribute access

    >>> doc.first_name
    'Peter'
    >>> doc.last_name
    'Strangelove'
    >>> doc.age
    45

Method access

    >>> doc.full_name()
    'Peter Strangelove'
    >>> doc.get_older()
    >>> doc.age
    55

Slide 14

Slide 14 text

elements

Slide 15

Slide 15 text

Element properties
∙ annotation ∙ atomic number ∙ atomic radius ∙ atomic volume ∙ block ∙ boiling point ∙ covalent radius ∙ density ∙ description ∙ dipole polarizability ∙ electron affinity ∙ electronegativity ∙ electronic configuration ∙ electrons ∙ evaporation heat ∙ exact mass ∙ fusion heat ∙ group ∙ ionic radii ∙ ionization energies ∙ isotopes ∙ lattice constant ∙ lattice structure ∙ mass ∙ mass number ∙ melting point ∙ name ∙ neutrons ∙ oxidation states ∙ period ∙ protons ∙ series ∙ specific heat ∙ symbol ∙ thermal conductivity ∙ vdw radius

Slide 18

Slide 18 text

Accessing the properties

To get started we need an access method

    >>> from elements import element

The element function can be used to retrieve elements from the database by symbol, name or atomic number

    >>> h = element('H')
    >>> h.name
    'Hydrogen'
    >>> si = element('Silicon')
    >>> si.mass
    28.0855
    >>> al = element(13)
    >>> al.electronegativity
    1.61
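How a single access function can accept a symbol, a name or an atomic number may be easier to see in a stand-alone sketch. The code below is not the implementation of the `elements` package; `Element`, `_TABLE` and `lookup` are hypothetical names, and the three records are just the elements used in the session above.

```python
from collections import namedtuple

# Hypothetical record type and a three-row stand-in for the real database.
Element = namedtuple("Element", "atomic_number symbol name")

_TABLE = [
    Element(1, "H", "Hydrogen"),
    Element(13, "Al", "Aluminum"),
    Element(14, "Si", "Silicon"),
]


def lookup(key):
    """Return the record matching an atomic number, a symbol or a name."""
    if isinstance(key, int):
        field = "atomic_number"
    elif isinstance(key, str):
        # Treat the string as a symbol if one matches, otherwise as a name.
        is_symbol = any(e.symbol == key for e in _TABLE)
        field = "symbol" if is_symbol else "name"
    else:
        raise TypeError("key must be an int or a str")

    for record in _TABLE:
        if getattr(record, field) == key:
            return record
    raise KeyError(key)


hydrogen = lookup("H")
silicon = lookup("Silicon")
aluminum = lookup(13)
```

The dispatch on the type of `key` is one possible design; a real database-backed version would translate the same decision into a query on the matching column.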

Slide 25

Slide 25 text

Querying/Quiz
1. the smallest atomic radius: He
2. the largest density: Os
3. lowest first ionization energy: Cs
4. highest second ionization energy: Li
5. easiest to remove two electrons from: Ba
6. how many elements with oxidation state VII: 7

Slide 27

Slide 27 text

Tabulated elements

    >>> import pandas as pd
    >>> from elements import get_engine
    >>> engine = get_engine()
    >>> ptable = pd.read_sql('elements', engine)

    AN   AR    DPol     EA     ENEG   block   mass    ...
    1    79    4.51     0.75   2.20   s       1.01    ...
    2    NaN   1.38     0.00   NaN    s       4.00    ...
    3    155   164.00   0.62   0.98   s       6.94    ...
    4    112   37.71    0.00   1.57   s       9.01    ...
    5    98    20.53    0.28   2.04   p       10.81   ...
    6    91    20.53    1.26   2.55   p       12.01   ...
    ...

AN: atomic number, AR: atomic radius, DPol: dipole polarizability, EA: electron affinity, ENEG: electronegativity
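Once the data sits in a DataFrame, quiz-style questions ("which element has the largest ...") reduce to one-line queries. The sketch below is self-contained and makes several assumptions: an in-memory SQLite database stands in for the engine returned by `get_engine()`, only the six rows shown in the table are loaded, and the long column names are guesses modelled on the property list rather than the package's confirmed schema.

```python
import sqlite3

import numpy as np
import pandas as pd

# The six rows shown on the slide (values copied as displayed).
rows = pd.DataFrame({
    "atomic_number": [1, 2, 3, 4, 5, 6],
    "atomic_radius": [79.0, np.nan, 155.0, 112.0, 98.0, 91.0],
    "dipole_polarizability": [4.51, 1.38, 164.00, 37.71, 20.53, 20.53],
    "electron_affinity": [0.75, 0.00, 0.62, 0.00, 0.28, 1.26],
    "electronegativity": [2.20, np.nan, 0.98, 1.57, 2.04, 2.55],
    "block": ["s", "s", "s", "s", "p", "p"],
    "mass": [1.01, 4.00, 6.94, 9.01, 10.81, 12.01],
})

# Stand-in database: write the rows to SQLite, then read them back.
conn = sqlite3.connect(":memory:")
rows.to_sql("elements", conn, index=False)
ptable = pd.read_sql("SELECT * FROM elements", conn)
conn.close()

# idxmax returns the row label of the largest non-missing value.
most_electronegative = ptable.loc[
    ptable["electronegativity"].idxmax(), "atomic_number"]
most_polarizable = ptable.loc[
    ptable["dipole_polarizability"].idxmax(), "atomic_number"]

# Atomic numbers of rows with no tabulated atomic radius.
missing_radius = ptable.loc[
    ptable["atomic_radius"].isna(), "atomic_number"].tolist()
```

Missing values (NaN) are skipped by `idxmax`, so an element without a tabulated value can never be returned as the answer; checking for gaps with `isna()` before trusting a minimum or maximum is a sensible habit.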

Slide 28

Slide 28 text

Remark
For simplicity we'll restrict the focus to just 5 properties, i.e. reduce the number of columns in the table to 5

    >>> properties = ['atomic_number', 'atomic_radius',
    ...               'dipole_polarizability', 'electron_affinity',
    ...               'electronegativity']
    >>> ptable = ptable[properties]

Slide 30

Slide 30 text

Descriptive statistics

    >>> ptable.describe()

            AN       AR       DPol     EA      ENEG
    count   118.00   88.00    106.00   86.00   93.00
    mean    59.50    169.40   100.58   0.78    1.69
    std     34.21    49.81    80.76    0.83    0.62
    min     1.00     79.00    1.38     -0.07   0.70
    25%     30.25    137.00   37.41    0.29    1.27
    50%     59.50    160.00   67.75    0.50    1.61
    75%     88.75    181.00   158.50   1.07    2.04
    max     118.00   299.00   401.00   3.61    3.98

AN: atomic number, AR: atomic radius, DPol: dipole polarizability, EA: electron affinity, ENEG: electronegativity

Slide 32

Slide 32 text

Pearson's correlation coefficient

    >>> ptable.corr()

           AN      AR      DPol    EA      ENEG
    AN     1.00    0.57    0.28    0.04    -0.35
    AR     0.57    1.00    0.63    -0.23   -0.63
    DPol   0.28    0.63    1.00    -0.31   -0.81
    EA     0.04    -0.23   -0.31   1.00    0.70
    ENEG   -0.35   -0.63   -0.81   0.70    1.00

AN: atomic number, AR: atomic radius, DPol: dipole polarizability, EA: electron affinity, ENEG: electronegativity
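To see what each entry of the correlation matrix means, it may help to compute one coefficient by hand. The sketch below is an illustration only: it uses the atomic radius and electronegativity of the first six elements as displayed earlier in the talk, and the helper name `pearson` is made up for this example. Rows where either value is missing are dropped, which matches how `DataFrame.corr` treats each pair of columns.

```python
import numpy as np
import pandas as pd

# First six elements, values as displayed in the tabulated-elements slide.
df = pd.DataFrame({
    "atomic_radius": [79.0, np.nan, 155.0, 112.0, 98.0, 91.0],
    "electronegativity": [2.20, np.nan, 0.98, 1.57, 2.04, 2.55],
})


def pearson(x, y):
    """Pearson's r from its definition, ignoring rows with missing values."""
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson(df["atomic_radius"].to_numpy(),
                   df["electronegativity"].to_numpy())
r_pandas = df.corr().loc["atomic_radius", "electronegativity"]
```

Both routes give the same number. For this small subset the coefficient is negative, in line with the sign of the AR/ENEG entry in the full table, although the magnitude for six elements need not match the value for the whole periodic table.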

Slide 33

Slide 33 text

visualization

Slide 34

Slide 34 text

Different radii

Slide 35

Slide 35 text

Ionization Energies

Slide 36

Slide 36 text

Polarizability small multiples

Slide 37

Slide 37 text

Boiling point heatmap

Slide 39

Slide 39 text

zefram

Slide 40

Slide 40 text

Framework properties
∙ a ∙ access. area ∙ access. area m2pg ∙ access. volume ∙ access. volume pct ∙ alpha ∙ atoms ∙ b ∙ beta ∙ c ∙ cages ∙ channel dim ∙ channels ∙ cif ∙ code ∙ connections ∙ framework density ∙ gamma ∙ isdisordered ∙ isinterrupted ∙ junctions ∙ lcd ∙ maxdsd a ∙ maxdsd b ∙ maxdsd c ∙ maxdsi ∙ name ∙ occup. area ∙ occup. area m2pg ∙ occup. volume ∙ occup. volume pct ∙ pld ∙ portals ∙ spacegroup ∙ specific access. area ∙ specific occup. area ∙ tatoms ∙ td10 ∙ topological density ∙ tpv abs ∙ tpv pct

Slide 42

Slide 42 text

Classification and reduction of data

Suppose we want to find an efficient way of evaluating the framework "area"
∙ group the frameworks together into clusters based on the selected properties using the KMeans algorithm
∙ find new cumulative variables in the reduced space (PCA)

Select the properties associated with area (6D)

    >>> area_props = ['tpv_abs', 'accessible_area', 'channel_dim',
    ...               'maxdsi', 'occupiable_area',
    ...               'specific_accessible_area']
    >>> zeolites = zeolites[area_props]
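The two steps named above, clustering and dimensionality reduction, can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code behind the talk: the data are synthetic (three random blobs in 6 dimensions standing in for the six area-related columns), k-means is written as plain Lloyd iterations, PCA is done through a singular value decomposition, and the function names `kmeans` and `pca` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 6 selected columns: three blobs of 50 rows each.
centers = rng.normal(scale=5.0, size=(3, 6))
X = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

# Standardise so that no single property dominates the distances.
Z = (X - X.mean(axis=0)) / X.std(axis=0)


def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    gen = np.random.default_rng(seed)
    centroids = data[gen.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every row to its nearest centroid.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


def pca(data, n_components):
    """PCA via SVD of the centred data; returns (scores, explained ratio)."""
    centred = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_components].T
    ratio = (s ** 2) / (s ** 2).sum()
    return scores, ratio[:n_components]


labels, centroids = kmeans(Z, k=3)
scores, explained = pca(Z, n_components=2)
```

`labels` assigns each row to one of three clusters, while `scores` holds the coordinates of each row along the first two principal components, the "new cumulative variables". `explained` reports the fraction of the total variance each of those components captures.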

Slide 43

Slide 43 text

Classes histogram

Slide 45

Slide 45 text

Principal component analysis (PCA)

Slide 46

Slide 46 text

colorcif

Slide 48

Slide 48 text

Visualize symmetry-unique T and O atoms
MFI [010]
TON [001]

Slide 57

Slide 57 text

Summary
∙ programmatic data handling offers huge advantages
∙ efficiency/time saving
∙ modelling
∙ exploration
∙ visualization
∙ storage
∙ sharing
∙ join in
∙ bother me

Slide 58

Slide 58 text

Questions?

Slide 59

Slide 59 text

Useful links
∙ Python www.python.org
∙ Numpy www.numpy.org
∙ Scipy www.scipy.org
∙ Pandas www.pandas.org
∙ matplotlib www.matplotlib.org
∙ seaborn www.seaborn.org
∙ ipython www.ipython.org
∙ elements www.bitbucket.org/lukaszmentel/elements
∙ zefram www.bitbucket.org/lukaszmentel/zefram
∙ colorcif www.bitbucket.org/lukaszmentel/colorcif