Slide 1

Parallel Knowledge Discovery in very large datasets
Monday, November 5, 12

Slide 2

Objective
• Understanding basic concepts of knowledge discovery.
• Understanding basic concepts of distributed and parallel processing.
• Basis for future developments.

Slide 3

Knowledge Discovery Definitions
• Deep understanding of variation in large datasets.
• “Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.”
• “Deriving knowledge and creating abstractions of the input data.”
• “Process of analyzing data to identify patterns or relationships.”
• Predicting future events and behaviors, estimating values, etc.

Slide 4


Slide 5

The process of extracting patterns from large data sets.

Slide 6

Knowledge Discovery Applications
• Astronomy
• Biology
• Business
• Internet
• Government
• Religion

Slide 7

Knowledge Discovery Example Datasets
• Eyes: 324–576 megapixels
• Ears: stereo audio, 20–20,000 Hz
• Nose: 10,000 chemical compounds
• Mouth: 5–6 flavors
• Skin: temperature / pressure / texture
• Memory: 2.5 petabytes

Slide 8

Knowledge Discovery History
• 1960s: Data Collection
• 1980s: Data Access (Disk)
• 1990s: Data Warehousing (SQL)
• Now: Knowledge Discovery

Slide 9

Knowledge Discovery Typical Structure

Slide 10

Knowledge Discovery Typical Structure - Datasets
• Rows & Columns
• Rows: objects
• Columns: a single object’s attributes

Slide 11

Knowledge Discovery Typical Structure - Outputs
• Concepts: predicates; the algorithms are usually transparent.
• Model parameters: built classifiers, mostly opaque.

Slide 12

Knowledge Discovery Typical Structure - Processing Steps
• Association Rule Learning
• Clustering
• Classification
• Modeling
• Visualization

Slide 13

Knowledge Discovery Application Types
• Category One: find explanations for the most variable elements of the data set, that is, find and explain the outliers. Example: find the unexpected stellar object in this sky sweep (essentially parallel).
• Category Two: understand the variations of the majority of the data set elements, with little interest in the outliers. Example: understand the buying habits of most of our customers (optionally parallel).

Slide 14

Knowledge Discovery Algorithms

Slide 15

Knowledge Discovery Algorithms
• 1700s: Bayes’ Theorem
• 1800s: Regression Analysis
• 1940s: Neural Networks
• 1950s: Genetic Algorithms
• 1960s: Decision Tree Learning
• 1990s: Support Vector Machines

Slide 16

Knowledge Discovery Neural Networks
• Connected neurons.
• Learn through training.
• Resemble biological networks in structure.
• Not easy to use or to understand (opaque).
• Cannot deal with missing data.
• The dataset is fed to the network many times (epochs).
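To make the training-over-epochs point concrete, here is a toy sketch (invented for illustration, not from the slides): a single perceptron repeatedly fed the same four examples until it learns the logical AND function.

```java
// Hypothetical sketch: a single perceptron trained over several epochs.
// All names and parameters here are illustrative, not from the slides.
public class PerceptronSketch {
    double[] w = new double[3]; // bias weight + two input weights

    // Step activation: output 1 if the weighted sum is non-negative.
    public int predict(double x1, double x2) {
        double s = w[0] + w[1] * x1 + w[2] * x2;
        return s >= 0 ? 1 : 0;
    }

    // One epoch: a full pass of the perceptron learning rule over the data.
    void epoch(double[][] xs, int[] ys, double lr) {
        for (int i = 0; i < xs.length; i++) {
            int err = ys[i] - predict(xs[i][0], xs[i][1]);
            w[0] += lr * err;
            w[1] += lr * err * xs[i][0];
            w[2] += lr * err * xs[i][1];
        }
    }

    // Feed the same tiny dataset to the network many times (epochs).
    public static PerceptronSketch trainAnd(int epochs) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] ys = {0, 0, 0, 1}; // logical AND
        PerceptronSketch p = new PerceptronSketch();
        for (int e = 0; e < epochs; e++) p.epoch(xs, ys, 0.1);
        return p;
    }

    public static void main(String[] args) {
        PerceptronSketch p = trainAnd(20);
        System.out.println(p.predict(1, 1)); // learned AND(1,1)
    }
}
```

After about six epochs the weights stop changing; AND is linearly separable, so the perceptron convergence theorem guarantees this.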

Slide 17

Knowledge Discovery Genetic Algorithm
• Analogy to Darwinian evolution.
• Various evolutionary algorithms:
  • Genetic Algorithms
  • Ant Colony
  • Swarm Intelligence
• Replicate, mutate, and cross over.
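The replicate / mutate / cross-over loop can be sketched on the classic OneMax toy problem (maximize the number of 1-bits in a genome). Everything below, including the population size, mutation rate, and tournament selection, is an illustrative assumption, not the slides' algorithm.

```java
import java.util.Random;

// Hypothetical genetic-algorithm sketch on OneMax: evolve bit strings
// toward all-ones using selection, one-point crossover, and mutation.
public class GaSketch {
    static final Random RNG = new Random(42);

    public static int fitness(boolean[] g) {
        int f = 0;
        for (boolean b : g) if (b) f++;
        return f;
    }

    // One-point crossover: child takes a prefix from a, suffix from b.
    static boolean[] crossover(boolean[] a, boolean[] b) {
        int cut = RNG.nextInt(a.length);
        boolean[] child = new boolean[a.length];
        for (int i = 0; i < a.length; i++) child[i] = i < cut ? a[i] : b[i];
        return child;
    }

    // Mutation: flip each bit with a small probability.
    static void mutate(boolean[] g, double rate) {
        for (int i = 0; i < g.length; i++) if (RNG.nextDouble() < rate) g[i] = !g[i];
    }

    // Tournament selection: the fitter of two random individuals replicates.
    static boolean[] tournament(boolean[][] pop) {
        boolean[] a = pop[RNG.nextInt(pop.length)], b = pop[RNG.nextInt(pop.length)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static boolean[] evolve(int bits, int popSize, int generations) {
        boolean[][] pop = new boolean[popSize][bits];
        for (boolean[] g : pop) for (int i = 0; i < bits; i++) g[i] = RNG.nextBoolean();
        for (int gen = 0; gen < generations; gen++) {
            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++) {
                boolean[] child = crossover(tournament(pop), tournament(pop));
                mutate(child, 0.01);
                next[i] = child;
            }
            pop = next;
        }
        boolean[] best = pop[0];
        for (boolean[] g : pop) if (fitness(g) > fitness(best)) best = g;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(fitness(evolve(32, 40, 60))); // close to 32
    }
}
```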

Slide 18

Knowledge Discovery Association Rules & Semantic Methods
• Rule induction:
  • Extraction of if/then/else rules from data based on statistical significance.
• Output: how likely it is that certain patterns of attributes occur with other attributes in dataset objects (transparent).
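The statistical significance of an "if A then B" rule is conventionally measured by support and confidence. A minimal sketch, with an invented toy transaction table:

```java
// Hypothetical sketch: support and confidence of an "if A then B" rule,
// the two statistics usually used to decide whether a rule is significant.
public class RuleSketch {
    // support(A -> B) = fraction of rows containing both A and B.
    public static double support(boolean[][] rows, int a, int b) {
        int both = 0;
        for (boolean[] r : rows) if (r[a] && r[b]) both++;
        return (double) both / rows.length;
    }

    // confidence(A -> B) = fraction of rows with A that also have B.
    public static double confidence(boolean[][] rows, int a, int b) {
        int hasA = 0, both = 0;
        for (boolean[] r : rows) {
            if (r[a]) { hasA++; if (r[b]) both++; }
        }
        return hasA == 0 ? 0 : (double) both / hasA;
    }

    public static void main(String[] args) {
        // Columns: 0 = bread, 1 = butter (made-up transactions).
        boolean[][] rows = {
            {true, true}, {true, true}, {true, false}, {false, true}
        };
        System.out.println(support(rows, 0, 1));    // 2/4 = 0.5
        System.out.println(confidence(rows, 0, 1)); // 2/3
    }
}
```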

Slide 19

Knowledge Discovery Clustering Algorithms

Slide 20

Knowledge Discovery Clustering Algorithms
• K-Means
• MK-Means
• Canopy
• Self-Organizing Map (SOM)
• DBSCAN / OPTICS

Slide 21


Slide 22


Slide 23

K-Means Voronoi Diagram
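K-Means induces exactly such a Voronoi partition: each point is assigned to its nearest center. A minimal 1-D sketch of Lloyd's iteration (illustrative only, with made-up data):

```java
// Hypothetical K-Means sketch in one dimension: assign points to the
// nearest center, then move each center to the mean of its points.
public class KMeansSketch {
    // One Lloyd iteration over all points.
    public static double[] step(double[] points, double[] centers) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double p : points) {
            int best = 0; // index of the nearest center (Voronoi cell)
            for (int c = 1; c < centers.length; c++)
                if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) best = c;
            sum[best] += p;
            count[best]++;
        }
        double[] next = centers.clone();
        for (int c = 0; c < centers.length; c++)
            if (count[c] > 0) next[c] = sum[c] / count[c]; // recenter
        return next;
    }

    public static double[] run(double[] points, double[] centers, int iters) {
        for (int i = 0; i < iters; i++) centers = step(points, centers);
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.3, 8.7};
        double[] c = run(points, new double[]{0.0, 5.0}, 10);
        System.out.println(c[0] + " " + c[1]); // centers settle at 1.0 and 9.0
    }
}
```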

Slide 24

DBSCAN
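A compact 1-D sketch of the DBSCAN idea: points with at least minPts neighbors within eps are core points, clusters grow by density-reachability, and leftovers are noise. The class name, the 1-D distance, and the convention of counting a point in its own neighborhood are assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical 1-D DBSCAN sketch. Returns labels[i] = cluster id (0, 1, ...)
// or -1 for noise.
public class DbscanSketch {
    public static int[] cluster(double[] pts, double eps, int minPts) {
        int[] labels = new int[pts.length];
        Arrays.fill(labels, -2); // -2 = unvisited, -1 = noise
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (labels[i] != -2) continue;
            List<Integer> nbrs = neighbors(pts, i, eps);
            if (nbrs.size() < minPts) { labels[i] = -1; continue; } // not core
            labels[i] = cluster;
            ArrayDeque<Integer> queue = new ArrayDeque<>(nbrs);
            while (!queue.isEmpty()) { // expand via density-reachability
                int j = queue.poll();
                if (labels[j] == -1) labels[j] = cluster; // noise becomes border
                if (labels[j] != -2) continue;
                labels[j] = cluster;
                List<Integer> jn = neighbors(pts, j, eps);
                if (jn.size() >= minPts) queue.addAll(jn); // j is core too
            }
            cluster++;
        }
        return labels;
    }

    // Indices within eps of point i (includes i itself).
    static List<Integer> neighbors(double[] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++)
            if (Math.abs(pts[j] - pts[i]) <= eps) out.add(j);
        return out;
    }

    public static void main(String[] args) {
        double[] pts = {0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 100.0};
        // Two dense groups become clusters; the isolated point is noise.
        System.out.println(Arrays.toString(cluster(pts, 0.5, 3)));
    }
}
```

Unlike K-Means, no cluster count is chosen in advance, and the isolated point is explicitly labeled noise rather than forced into a cluster.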

Slide 25


Slide 26

Knowledge Discovery Limitations
• Complex algorithms and high computing power.
• Complex, dynamic datasets and large volumes of data.
• Curse of Dimensionality.
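The curse of dimensionality can be shown directly: for uniformly random points, the relative gap between the nearest and farthest neighbor shrinks as the number of dimensions grows, eroding the contrast that distance-based methods rely on. A small hypothetical demo:

```java
import java.util.Random;

// Hypothetical demo of distance concentration in high dimensions:
// contrast = (max - min) / min over distances from one query point.
public class CurseSketch {
    public static double contrast(int dim, int n, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[n][dim];
        for (double[] p : pts)
            for (int d = 0; d < dim; d++) p[d] = rng.nextDouble();
        double min = Double.MAX_VALUE, max = 0;
        for (int i = 1; i < n; i++) { // Euclidean distances from point 0
            double s = 0;
            for (int d = 0; d < dim; d++) {
                double diff = pts[0][d] - pts[i][d];
                s += diff * diff;
            }
            double dist = Math.sqrt(s);
            min = Math.min(min, dist);
            max = Math.max(max, dist);
        }
        return (max - min) / min; // relative nearest/farthest contrast
    }

    public static void main(String[] args) {
        System.out.println(contrast(2, 500, 1));    // large in 2-D
        System.out.println(contrast(1000, 500, 1)); // small in 1000-D
    }
}
```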

Slide 27

Knowledge Discovery Parallelization Strategies
• Independent search
• Parallelized sequential algorithms
• Replication

Slide 28

ESA Gaia
• Objective: to create the largest and most precise three-dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars.
• Launch date: 2013
• Mission end: 2018 (5 years)

Slide 29


Slide 30


Slide 31


Slide 32


Slide 33


Slide 34


Slide 35

ESA Gaia CCDs (Dimensions)
• 3 columns of 4 CCDs for the Radial Velocity Spectrometer (RVS)
• 7 CCDs for the Red Photometer
• 1 column of 7 CCDs for the Blue Photometer
• 9 columns of 7 CCDs forming the Astrometric Field
• 2 columns of 7 CCDs for the Sky Mapper
• 1 column with 3 CCDs: two Basic Angle Monitors and one Wave Front Sensor

Slide 36


Slide 37

Limitations
• Sloan Digital Sky Survey: around 53 million unique objects.
• Hipparcos: around 10^5 unique objects.
• GAIA Mission: 10^9 unique objects, 10% variable objects (10^8), 13 dimensions.

Slide 38

Gaia Dataset Attributes (Dimensions)

Clustering algorithm complexities: mrkd-EM O(n log(n)), DBSCAN O(n log(n)), DENCLUE O(n), DBCLASD O(n log(n)).

TABLE 2. Dataset attribute names [3]
• log-f1: log of the first frequency
• log-f2: log of the second frequency
• log-af1h1-t: log amplitude, first harmonic, first frequency
• log-af1h2-t: log amplitude, second harmonic, first frequency
• log-af1h3-t: log amplitude, third harmonic, first frequency
• log-af1h4-t: log amplitude, fourth harmonic, first frequency
• log-af2h1-t: log amplitude, first harmonic, second frequency
• log-af2h2-t: log amplitude, second harmonic, second frequency
• log-crf10: amplitude ratio between harmonics of the first frequency
• pdf12: phase difference between harmonics of the first frequency
• varrat: variance ratio before and after first frequency subtraction
• B-V: color index
• V-I: color index

Create some templates of given classes of variable stars and run the scanning law.

Slide 39

Types of Parallel Computing: Hardware-Oriented
• Flynn’s taxonomy:
  • SISD: Single Instruction, Single Data
  • SIMD: Single Instruction, Multiple Data
  • MISD: Multiple Instruction, Single Data
  • MIMD: Multiple Instruction, Multiple Data

Slide 40

Types of Parallel Computing: Software-Oriented
• Data-parallel: same operations on different data
• Task-parallel: different programs, different data
• Dataflow: pipelined parallelism
• MIMD: different programs, different data
• SPMD: same program, different data
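The data-parallel pattern, the same operation applied to different pieces of data, can be sketched with Java's parallel streams (a shared-memory stand-in chosen for this illustration, not a cluster example):

```java
import java.util.stream.IntStream;

// Hypothetical data-parallel sketch: the same operation (squaring) is
// applied to different elements concurrently, then the results are summed.
public class DataParallelSketch {
    public static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()                 // split the range across workers
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100)); // 338350
    }
}
```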

Slide 41

Parallel Processing

Slide 42

Types of Parallel Computing
• Divide-and-Conquer (fork/join parallelism)
• Hierarchical
• Applications: parallel rendering, SAT solving, VLSI routing, N-body simulation, multiple sequence alignment, grammar-based learning
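Divide-and-conquer maps directly onto Java's fork/join framework. An illustrative sketch (not from the slides): an array sum split in half recursively, with halves running in parallel and partial results joined on the way back up.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork/join sketch: parallel array sum by recursive halving.
public class ForkJoinSketch extends RecursiveTask<Long> {
    static final int THRESHOLD = 1000; // below this, just sum sequentially
    final long[] data;
    final int lo, hi;

    ForkJoinSketch(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ForkJoinSketch left = new ForkJoinSketch(data, lo, mid);
        ForkJoinSketch right = new ForkJoinSketch(data, mid, hi);
        left.fork();                          // run left half in another worker
        return right.compute() + left.join(); // compute right here, then join
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool()
                .invoke(new ForkJoinSketch(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 50005000
    }
}
```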

Slide 43

Message Passing Interface Features
• HPC Message Passing Interface, standardized in the early 1990s by the MPI Forum.
• API for communication between nodes of a distributed-memory parallel computer (typically a workstation cluster).
• Bindings for Fortran, C, and C++.
• Low-level parts of the API:
  • Fast transfer of data from the user program to the network.
  • Support for the multiple modes of message synchronization available on HPC platforms.
• Higher-level parts of the API:
  • Organization of process groups and the kind of collective communications seen in typical parallel applications.

Slide 44

MPI Message Passing Interface
• An API for sending and receiving messages.
• General platform for Single Program Multiple Data (SPMD) parallel computing on distributed-memory architectures.
• Directly comparable with the PVM (Parallel Virtual Machine) environment.
• Introduced the important abstraction of a communicator: an object something like an N-way communication channel, connecting all members of a group of cooperating processes.
  • Introduced partly to support using multiple parallel libraries without interference.
• Introduced a novel concept of datatypes, used to describe the contents of communication buffers.
  • Introduced partly to support “zero-copying” message transfer.

Slide 45

MPI Message Passing Interface
• The Comm class represents an MPI communicator. All communication operations ultimately go through instances of the Comm class.
• A communicator defines two things:
  • a group of processes (the participants in some kind of parallel task or subtask), and
  • a communication context.
• The idea is that the same group of processes might be involved in more than one kind of “ongoing activity”.
• We don’t want these distinct “activities” to interfere with one another.
• We don’t want messages sent in the context of one activity to be accidentally received in the context of another; that would be a kind of race condition.
• Messages sent on one communicator can never be received on another.

Slide 46

MPI Message Passing Interface
• A process group in MPI is a fixed set of processes that never changes in the lifetime of the group.
• The number of processes in the group associated with a communicator can be found with the Size() method of Comm.
• Each process in a group has a unique rank within the group, an integer value between 0 and Size() – 1. This value is returned by the Rank() method.

Slide 47

MPI Message Passing Interface
• The basic point-to-point communication methods are members of the Comm class.
• Send and receive members of Comm:
  • void Send()
  • Status Recv()

Slide 48

MPI Message Passing Interface: Java Implementations
• JavaMPI
• mpiJava
• DOGMA MPIJ
• jMPI
• JavaPVM
• MPJ
• CCJ
• MPJava
• JOPI

Slide 49


Slide 50

MPI and OpenMP Examples
(side-by-side code panels: MPI, OpenMP)

Slide 51

AllReduce
(diagram: ranks 0, 1, 2 hold send buffers A, B, C; after AllReduce, each rank’s receive buffer holds R = A op B op C)
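The AllReduce semantics, every rank contributing a value and every rank receiving the same combined result R = A op B op C, can be imitated with plain Java threads. This is a shared-memory stand-in for the MPI collective; all names here are invented for the sketch, and op is fixed to addition.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical thread-based sketch of AllReduce with op = +: each "rank"
// contributes its send value, and after the barrier every rank reads R.
public class AllReduceSketch {
    public static long[] allReduceSum(long[] send) {
        int ranks = send.length;
        long[] recv = new long[ranks];
        AtomicLong acc = new AtomicLong();
        CyclicBarrier done = new CyclicBarrier(ranks);
        Thread[] workers = new Thread[ranks];
        for (int r = 0; r < ranks; r++) {
            final int rank = r;
            workers[r] = new Thread(() -> {
                acc.addAndGet(send[rank]); // reduce phase: contribute A/B/C
                try {
                    done.await();          // wait until every rank contributed
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                recv[rank] = acc.get();    // broadcast phase: everyone reads R
            });
            workers[r].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return recv;
    }

    public static void main(String[] args) {
        long[] recv = allReduceSum(new long[]{1, 2, 3});
        System.out.println(java.util.Arrays.toString(recv)); // every rank gets 6
    }
}
```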

Slide 52

MPI Message Passing Interface

import mpi.*;

class Hello {
    static public void main(String[] args) {
        MPI.Init(args);
        int myrank = MPI.COMM_WORLD.Rank();
        if (myrank == 0) {
            char[] message = "Hello there".toCharArray();
            MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99);
        } else {
            char[] message = new char[20];
            MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99);
            System.out.println("received:" + new String(message) + ":");
        }
        MPI.Finalize();
    }
}

Slide 53

Europe GridLab
• 5 cities, 40 CPUs in total
• Different architectures
• Latencies: 0.2–210 ms daytime, 0.2–66 ms at night
• Bandwidth: 9 KB/s – 11 MB/s
• 80% efficiency

Slide 54

Grid5000
• 3 clusters in France (Grid5000), 120 nodes
• Latency: 4–10 ms
• Bandwidth: 200–1000 Mbps
• Ran a VLSI routing application
• 86% efficiency

Slide 55


Slide 56

Parallel KMeans

Slide 57

Parallel Self-Organizing Map
(diagrams: parallel SOM, serial SOM)

Slide 58

Parallel MKMeans

Slide 59


Slide 60

?