
Parallel Knowledge Discovery - Intro

Presented at 2nd Scientific Workshop, Abrar University, Tehran, Iran.

Morteza Ansarinia

November 05, 2012

Transcript

  1. Objective • Understanding basic concepts of knowledge discovery. • Understanding

    basic concepts of distributed and parallel processing. • Basis for future developments. Monday, November 5, 12
  2. Knowledge Discovery Definitions • Deep Understanding of Variation in Large

    Datasets • “Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.” • “Deriving knowledge and creating abstractions of the input data.” • “Process of analyzing data to identify patterns or relationships.” • Predicting future events and behaviors, estimating values, etc.
  3. Knowledge Discovery Applications • Astronomy • Biology • Business •

    Internet • Government • Religion
  4. Knowledge Discovery Example Datasets • Eyes 324–576 megapixels

    • Ears Stereo Audio 20–20,000 Hz • Nose 10,000 Chemical Compounds • Mouth 5–6 Flavors • Skin Temperature / Pressure / Texture • Memory 2.5 Petabytes
  5. Knowledge Discovery History • 1960s Data Collection • 1980s Data

    Access (Disk) • 1990s Data Warehousing (SQL) • Now Knowledge Discovery
  6. Knowledge Discovery Typical Structure - Datasets • Rows & Columns

    • Rows: Objects • Columns: A Single Object’s Attributes
  7. Knowledge Discovery Typical Structure - Outputs • Concepts: predicates, algorithms

    are usually transparent. • Model Parameters: built classifiers, mostly opaque.
  8. Knowledge Discovery Typical Structure - Processing Steps • Association Rule

    Learning • Clustering • Classification • Modeling • Visualization
  9. Knowledge Discovery Application Types • Category One: find explanations for the

    most variable elements of the data set, that is, find and explain the outliers. Example: find the unexpected stellar object in this sky sweep (essentially parallel). • Category Two: understand the variations of the majority of the data set elements, with little interest in the outliers. Example: understand the buying habits of most of our customers (optionally parallel).
  10. Knowledge Discovery Algorithms • 1700s Bayes’ Theorem • 1800s Regression

    Analysis • 1940s Neural Networks • 1950s Genetic Algorithms • 1960s Decision Tree Learning • 1990s Support Vector Machines
  11. Knowledge Discovery Neural Networks • Connected neurons, • Learn through

    training, • Resemble biological networks in structure, • Not easy to use or to understand (opaque), • Cannot deal with missing data, • Feed the dataset to the network many times (epochs).
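A minimal sketch of the "learn through training over many epochs" idea from the slide, not taken from the deck itself: a single perceptron (the simplest neural unit) trained on the AND function. The learning rate, initial weights, and toy dataset are all invented for illustration.

```java
// A single perceptron trained on AND, illustrating epochs of training.
public class PerceptronSketch {
    static double[] w = {0.0, 0.0};   // weights (illustrative initial values)
    static double b = 0.0;            // bias
    static final double LR = 0.1;     // learning rate (assumed)

    public static int predict(int x1, int x2) {
        return (w[0] * x1 + w[1] * x2 + b) > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Rows: x1, x2, target = x1 AND x2
        int[][] data = {{0, 0, 0}, {0, 1, 0}, {1, 0, 0}, {1, 1, 1}};
        for (int epoch = 0; epoch < 20; epoch++) {     // feed the dataset many times
            for (int[] row : data) {
                int err = row[2] - predict(row[0], row[1]);
                w[0] += LR * err * row[0];             // perceptron update rule
                w[1] += LR * err * row[1];
                b    += LR * err;
            }
        }
        System.out.println(predict(1, 1) + " " + predict(1, 0)); // prints "1 0"
    }
}
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the loop settles on correct weights; the "opaque" nature the slide mentions is visible even here, since the learned numbers in `w` and `b` carry no direct explanation.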
  12. Knowledge Discovery Genetic Algorithm • Analogy to Darwinian evolution. •

    Various evolutionary algorithms: • Genetic Algorithms • Ant Colony • Swarm Intelligence • Replicate, mutate, and cross over
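A minimal sketch of the replicate/mutate/cross-over loop, not from the deck: a genetic algorithm maximizing the number of 1-bits in a bit string (the classic "one-max" toy problem). Population size, mutation rate, and tournament selection are illustrative choices.

```java
import java.util.Random;

// Genetic algorithm on "one-max": fitness = number of true bits.
public class OneMaxGA {
    static Random rng;
    static final int BITS = 16, POP = 20, GENERATIONS = 60;

    static int fitness(boolean[] g) {
        int f = 0;
        for (boolean bit : g) if (bit) f++;
        return f;
    }

    // Tournament selection: replicate the fitter of two random individuals.
    static boolean[] select(boolean[][] pop) {
        boolean[] a = pop[rng.nextInt(POP)], b = pop[rng.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static int run() {
        rng = new Random(42);                          // fixed seed for reproducibility
        boolean[][] pop = new boolean[POP][];
        for (int i = 0; i < POP; i++) {
            pop[i] = new boolean[BITS];
            for (int j = 0; j < BITS; j++) pop[i][j] = rng.nextBoolean();
        }
        for (int gen = 0; gen < GENERATIONS; gen++) {
            boolean[][] next = new boolean[POP][];
            for (int i = 0; i < POP; i++) {
                boolean[] p1 = select(pop), p2 = select(pop);
                boolean[] child = new boolean[BITS];
                int cut = rng.nextInt(BITS);           // one-point cross-over
                for (int j = 0; j < BITS; j++) child[j] = j < cut ? p1[j] : p2[j];
                if (rng.nextInt(10) == 0)              // occasional mutation
                    child[rng.nextInt(BITS)] ^= true;
                next[i] = child;
            }
            pop = next;
        }
        int best = 0;
        for (boolean[] g : pop) best = Math.max(best, fitness(g));
        return best;
    }

    public static void main(String[] args) {
        System.out.println("best fitness: " + run());
    }
}
```

The same loop structure underlies the other evolutionary methods the slide lists; ant colony and swarm methods replace cross-over with pheromone or velocity updates but keep the replicate-and-vary pattern.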
  13. Knowledge Discovery Association Rules & Semantic Methods • Rule induction:

    • Extraction of if/then/else rules from data based on statistical significance. • Output: how likely it is that certain patterns of attributes occur with other attributes in dataset objects (transparent).
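A minimal sketch of how "statistical significance" of a rule is usually measured, not from the deck: the standard support and confidence metrics computed over a toy transaction dataset. The items and transactions are invented for illustration.

```java
import java.util.*;

// Support/confidence for an if -> then rule over a toy transaction set.
public class RuleStats {
    static List<Set<String>> transactions = Arrays.asList(
        new HashSet<>(Arrays.asList("bread", "butter", "milk")),
        new HashSet<>(Arrays.asList("bread", "butter")),
        new HashSet<>(Arrays.asList("bread", "jam")),
        new HashSet<>(Arrays.asList("milk", "jam")));

    // support(X) = fraction of transactions containing every item of X
    public static double support(Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(if -> then) = support(if ∪ then) / support(if)
    public static double confidence(Set<String> ifPart, Set<String> thenPart) {
        Set<String> both = new HashSet<>(ifPart);
        both.addAll(thenPart);
        return support(both) / support(ifPart);
    }

    public static void main(String[] args) {
        System.out.println("support({bread, butter}) = "
            + support(new HashSet<>(Arrays.asList("bread", "butter"))));   // 0.5
        System.out.println("confidence(bread -> butter) = "
            + confidence(Collections.singleton("bread"),
                         Collections.singleton("butter")));               // 0.666...
    }
}
```

This is the transparent output the slide refers to: the rule "if bread then butter" comes with explicit numbers (holds in 50% of transactions, and in 2 of the 3 transactions containing bread).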
  14. Knowledge Discovery Clustering Algorithms • K-Means • MK-Means • Canopy

    • Self-Organizing Map (SOM) • DBSCAN/OPTICS
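A minimal sketch of the first algorithm on the list, not from the deck: Lloyd's K-Means on one-dimensional data with k = 2. The data points, initial centers, and iteration count are invented for illustration.

```java
import java.util.Arrays;

// Lloyd's K-Means in 1-D: alternate assignment and center-update steps.
public class KMeans1D {
    public static double[] cluster(double[] data, double[] centers, int iters) {
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double x : data) {                  // assignment: nearest center
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) best = c;
                sum[best] += x;
                count[best]++;
            }
            for (int c = 0; c < centers.length; c++) // update: mean of assigned points
                if (count[c] > 0) centers[c] = sum[c] / count[c];
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};  // two obvious groups
        double[] centers = cluster(data, new double[]{0.0, 10.0}, 10);
        System.out.println(Arrays.toString(centers));     // converges to [1.0, 9.0]
    }
}
```

Density-based methods on the same list (DBSCAN/OPTICS) replace the "nearest center" step with a neighborhood-density test, which is why they can find non-spherical clusters that K-Means misses.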
  15. Knowledge Discovery Limitations • Complex algorithms requiring high computing power

    • Complex, dynamic datasets and large volumes of data • Curse of Dimensionality
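A small worked illustration of the curse of dimensionality, not from the deck: the fraction of a unit cube occupied by its inscribed ball collapses as the dimension grows, so in high dimensions almost no data lands "near" any given point and distance-based methods degrade.

```java
// Fraction of the unit cube covered by the inscribed ball, per dimension.
public class CurseOfDimensionality {
    // Volume of a d-ball of radius r via the recurrence V_d = V_{d-2} * 2*pi*r^2 / d,
    // seeded with V_1 = 2r and V_2 = pi*r^2.
    public static double ballVolume(int d, double r) {
        double v = (d % 2 == 1) ? 2 * r : Math.PI * r * r;
        for (int k = (d % 2 == 1) ? 3 : 4; k <= d; k += 2)
            v *= 2 * Math.PI * r * r / k;
        return v;
    }

    public static void main(String[] args) {
        for (int d : new int[]{2, 5, 10, 20})
            // Inscribed ball has radius 0.5; the unit cube has volume 1.
            System.out.printf("d=%d fraction=%.6f%n", d, ballVolume(d, 0.5));
    }
}
```

At d = 2 the fraction is π/4 ≈ 0.785; by d = 10 it is already below 0.3%, which is the practical meaning of the slide's bullet for datasets such as Gaia's 13-dimensional attribute space.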
  16. ESA Gaia • Objective: To create the largest and most

    precise three-dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars. • Launch Date: 2013 • Mission End: 2018 (5 years)
  17. ESA Gaia CCDs (Dimensions) • 3 columns of 4 CCDs

    for the Radial Velocity Spectrometer (RVS), • 7 CCDs for the Red Photometer, • 1 column of 7 CCDs for the Blue Photometer, • 9 columns of 7 CCDs forming the Astrometric field, • 2 columns of 7 CCDs for the sky mapper, • 1 column with 3 CCDs: two basic angle monitors, and one wave front sensor.
  18. Limitations • Sloan Digital Sky Survey: • around 53 million

    unique objects. • Hipparcos: • around 10^5 unique objects. • GAIA Mission: • 10^9 unique objects, • 10% variable objects (10^8), • 13 dimensions.
  19. Gaia Dataset Attributes (Dimensions) • Clustering algorithm complexities (fragment

    of a table): mrkd-EM O(n log n), DBSCAN O(n log n), DENCLUE O(n), DBCLASD O(n log n). • TABLE 2. Dataset attribute names [3]:

    Attribute name | Meaning
    log-f1         | log of the first frequency
    log-f2         | log of the second frequency
    log-af1h1-t    | log amplitude, first harmonic, first frequency
    log-af1h2-t    | log amplitude, second harmonic, first frequency
    log-af1h3-t    | log amplitude, third harmonic, first frequency
    log-af1h4-t    | log amplitude, fourth harmonic, first frequency
    log-af2h1-t    | log amplitude, first harmonic, second frequency
    log-af2h2-t    | log amplitude, second harmonic, second frequency
    log-crf10      | amplitude ratio between harmonics of the first frequency
    pdf12          | phase difference between harmonics of the first frequency
    varrat         | variance ratio before and after first frequency subtraction
    B-V            | color index
    V-I            | color index

    • Create some templates of given classes of variable stars and run the scanning law.
  20. Types of Parallel Computing Hardware-Oriented • Flynn’s taxonomy:

    • SISD: Single Instruction, Single Data • SIMD: Single Instruction, Multiple Data • MISD: Multiple Instruction, Single Data • MIMD: Multiple Instruction, Multiple Data
  21. Types of Parallel Computing Software-Oriented • Data-parallel: Same operations on

    different data • Task-parallel: Different programs, different data • Dataflow: Pipelined parallelism • MIMD: Different programs, different data • SPMD: Same program, different data
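A minimal sketch of the data-parallel style from the list above, not from the deck: the same operation (squaring) applied to different data elements in parallel, here via Java's parallel streams rather than an explicit SPMD runtime.

```java
import java.util.stream.IntStream;

// Data-parallel sum of squares: the runtime partitions the index range
// across worker threads, applies the same map to each chunk, and combines.
public class DataParallelSketch {
    public static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n)
                        .parallel()                 // split the data across threads
                        .mapToLong(i -> (long) i * i)
                        .sum();                     // associative combine step
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100));      // prints 338350
    }
}
```

The task-parallel and MIMD entries in the list differ precisely in that each worker would run a different operation; here every worker runs the identical `i * i` map, which is what makes the split safe and automatic.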
  22. Types of Parallel Computing • Divide-and-Conquer (fork/join parallelism) • Hierarchical

    • Applications: parallel rendering, SAT solvers, VLSI routing, N-body simulation, multiple sequence alignment, grammar-based learning
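A minimal sketch of divide-and-conquer (fork/join) parallelism, not from the deck: a range sum is split recursively, the halves run as forked subtasks, and the results are joined, using Java's standard `ForkJoinPool`. The threshold and range are illustrative.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join range sum: divide until small, solve sequentially, join results.
public class ForkJoinSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 1_000;  // below this, stop dividing (assumed cutoff)
    final long lo, hi;                   // sums the integers in [lo, hi)

    ForkJoinSum(long lo, long hi) { this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // base case: sequential sum
            long s = 0;
            for (long i = lo; i < hi; i++) s += i;
            return s;
        }
        long mid = (lo + hi) / 2;              // divide
        ForkJoinSum left = new ForkJoinSum(lo, mid);
        ForkJoinSum right = new ForkJoinSum(mid, hi);
        left.fork();                           // run the left half asynchronously
        return right.compute() + left.join();  // conquer: combine both halves
    }

    public static void main(String[] args) {
        long total = ForkJoinPool.commonPool().invoke(new ForkJoinSum(1, 1_000_001));
        System.out.println(total);             // prints 500000500000
    }
}
```

The applications the slide lists (rendering, SAT, N-body, and so on) follow this same hierarchical shape: an irregular problem is split until subproblems are cheap, and partial results are merged on the way back up.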
  23. Message Passing Interface Features • HPC Message Passing Interface standardized

    in the early 1990s by the MPI Forum. • API for communication between nodes of a distributed memory parallel computer (typically a workstation cluster). • Fortran, C, and C++. • Low-level parts of API: • Fast transfer of data from user program to network, • Supporting multiple modes of message synchronization available on HPC platforms. • Higher level parts of the API: • Organization of process groups and providing the kind of collective communications seen in typical parallel applications.
  24. MPI Message Passing Interface • An API for sending and

    receiving messages. • General platform for Single Program Multiple Data (SPMD) parallel computing on distributed memory architectures. • Directly comparable with the PVM (Parallel Virtual Machine) environment. • Introduced the important abstraction of a communicator, an object somewhat like an N-way communication channel connecting all members of a group of cooperating processes. • Introduced partly to support using multiple parallel libraries without interference. • Introduced a novel concept of datatypes, used to describe the contents of communication buffers. • Introduced partly to support “zero-copy” message transfer.
  25. • The Comm class represents an MPI communicator. All communication

    operations ultimately go through instances of the Comm class. • A communicator defines two things: • Group of processes—the participants in some kind of parallel task or subtask • A communication context. • The idea is that the same group of processes might be involved in more than one kind of “ongoing activity”. • We don’t want these distinct “activities” to interfere with one another. • We don’t want messages that are sent in the context of one activity to be accidentally received in the context of another. This would be a kind of race condition. • Messages sent on one communicator can never be received on another. MPI Message Passing Interface
  26. • A process group in MPI is defined as a

    fixed set of processes, which never changes in the lifetime of the group. • The number of processes in the group associated with a communicator can be found by the Size() method of the Comm. • Each process in a group has a unique rank within the group, an integer value between 0 and Size() – 1. This value is returned by the Rank() method. MPI Message Passing Interface
  27. • The basic point-to-point communication methods are members of the

    Comm class. • Send and receive members of Comm: • void Send() • Status Recv() MPI Message Passing Interface
  28. • JavaMPI • mpiJava • DOGMA MPIJ • jMPI •

    JavaPVM • MPJ • CCJ • MPJava • JOPI MPI Message Passing Interface
  29. AllReduce • Diagram: ranks 0, 1, and 2 each hold a send buffer (A, B, C);

    after AllReduce, every rank’s receive buffer holds R = A op B op C.
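A minimal sketch simulating the AllReduce semantics of the diagram in plain Java, not from the deck: every rank contributes one send value, and afterwards every rank's receive buffer holds the same reduction R = A op B op C (here op = +). A real MPI AllReduce achieves this with messages between processes; this only models the outcome.

```java
import java.util.Arrays;

// Single-process model of AllReduce: reduce all send buffers, then
// deliver the identical result R to every rank's receive buffer.
public class AllReduceSketch {
    public static int[] allReduceSum(int[] sendBuffers) {
        int r = 0;
        for (int v : sendBuffers) r += v;     // R = A op B op C with op = +
        int[] recvBuffers = new int[sendBuffers.length];
        Arrays.fill(recvBuffers, r);          // every rank receives the same R
        return recvBuffers;
    }

    public static void main(String[] args) {
        int[] send = {1, 2, 3};               // ranks 0, 1, 2 contribute A, B, C
        System.out.println(Arrays.toString(allReduceSum(send))); // [6, 6, 6]
    }
}
```

The requirement that op be associative (as + is here) is what lets real implementations reduce in a tree and keep the cost logarithmic in the number of ranks.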
  30. MPI Message Passing Interface: a minimal mpiJava send/receive example.

    import mpi.*;

    class Hello {
      static public void main(String[] args) {
        MPI.Init(args);
        int myrank = MPI.COMM_WORLD.Rank();
        if (myrank == 0) {
          char[] message = "Hello there".toCharArray();
          MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99);
        } else {
          char[] message = new char[20];
          MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99);
          System.out.println("received:" + new String(message) + ":");
        }
        MPI.Finalize();
      }
    }
  31. Europe GridLab • Europe GridLab: 5 cities • 40 CPUs

    in total • Different architectures • Latencies: 0.2–210 ms daytime, 0.2–66 ms at night • Bandwidth: 9 KB/s – 11 MB/s • 80% efficiency
  32. Grid5000 • 3 clusters in France (Grid5000), 120 nodes •

    Latency: 4–10 ms • Bandwidth: 200–1000 Mbps • Ran a VLSI routing app • 86% efficiency