Slide 1

Slide 1 text

The influence of input data standardization methods on the prediction accuracy of genetic programming generated classifier
Amaal R. Al Shorman (1), Hossam Faris (1), Pedro A. Castillo (2), J.J. Merelo (2), Nailah Al-Madi (3)
(1) Department of Business Information Technology, University of Jordan, Amman, Jordan
(2) Department of Computer Architecture and Computer Technology, ETSIIT and CITIC, University of Granada, Granada, Spain
(3) Department of Computer Science, Princess Sumaya University for Technology, Amman, Jordan
International Joint Conference on Computational Intelligence, 2018

Slide 2

Slide 2 text

Outline
1 Introduction
2 Motivation
3 Research objectives and questions
4 Experiments and results
    Data sets description
    Experiments environment
    Results
    Results - Scenario I
    Results - Scenario II
    Results - Scenario III
    Results - Overall
5 Conclusions and future works
    Conclusions
    Future works

Slide 7

Slide 7 text

Introduction - data standardization - Genetic programming
Data classification
Data classification techniques deal with creating classifiers which assign labels to data vectors.
→ Use the existing data ⇒ build a classifier ⇒ apply it to new, unseen data.
Various techniques have been applied to data classification, including:
→ Statistical methods.
→ Evolutionary algorithms such as genetic programming.

Slide 8

Slide 8 text

Introduction - Data standardization
Data standardization
Data standardization is one of the most important pre-processing steps in machine learning.
Its purpose is to unify the scale of all input features so that they contribute equally to the model.

Slide 9

Slide 9 text

Methods applied to data sets
Taken from [Kaftanowicz and Krzemiński, 2015, Zavadskas and Turskis, 2008, Altman, 1968]:
Vector standardization
\[ A_i = \frac{A_{oi}}{\sqrt{\sum_{i=1}^{n} (A_{oi})^2}} \tag{1} \]
Manhattan standardization
\[ A_i = \frac{A_{oi}}{\sum_{i=1}^{n} |A_{oi}|} \tag{2} \]

Slide 10

Slide 10 text

Methods applied to data sets (2)
Maximum linear standardization
\[ A_i = \frac{A_{oi}}{\max A_{oi}} \tag{3} \]
Weitendorf's linear standardization
\[ A_i = \frac{A_{oi} - \min A_{oi}}{\max A_{oi} - \min A_{oi}} \tag{4} \]

Slide 11

Slide 11 text

Methods applied to data sets (and 3)
Peldschus' nonlinear standardization
\[ A_i = \left( \frac{A_{oi}}{\max A_{oi}} \right)^2 \tag{5} \]
Altman Z-score standardization
\[ A_i = \frac{A_{oi} - \bar{E}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (A_{oi} - \bar{E})^2}} \tag{6} \]
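The six methods in equations (1) to (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes each method is applied to one feature column at a time, with i running over the data points, and it does not guard against zero denominators (for example a constant column). The function names are hypothetical; Weitendorf's linear method appears to be the one labelled Min-Max in the result tables.

```python
import math
from statistics import mean, stdev


def vector(col):
    """Eq. (1): divide by the Euclidean norm of the column."""
    norm = math.sqrt(sum(x * x for x in col))
    return [x / norm for x in col]


def manhattan(col):
    """Eq. (2): divide by the sum of absolute values."""
    total = sum(abs(x) for x in col)
    return [x / total for x in col]


def maximum_linear(col):
    """Eq. (3): divide by the column maximum."""
    hi = max(col)
    return [x / hi for x in col]


def weitendorf(col):
    """Eq. (4): shift by the minimum, divide by the range."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]


def peldschus(col):
    """Eq. (5): square of the maximum-linear value."""
    hi = max(col)
    return [(x / hi) ** 2 for x in col]


def z_score(col):
    """Eq. (6): subtract the mean, divide by the sample standard deviation (n - 1)."""
    mu, sd = mean(col), stdev(col)
    return [(x - mu) / sd for x in col]


# Hypothetical feature column, used only to exercise the functions.
feature = [2.0, 4.0, 4.0, 6.0]
for method in (vector, manhattan, maximum_linear, weitendorf, peldschus, z_score):
    print(method.__name__, [round(v, 3) for v in method(feature)])
```

Only the Z-score and Weitendorf variants shift the data; the other four rescale it, so the sign and zero point of each feature are preserved.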

Slide 14

Slide 14 text

What is Genetic Programming (GP)?
Concept introduced by John Koza.
Symbolic regression or classification method.
GP is an evolutionary algorithm inspired by the principles of Darwinian evolution and natural selection [Koza, 1992].
GP is a domain-independent modeling technique that automatically solves problems without the computer having to be told explicitly how to do it [Koza, 1991].

Slide 16

Slide 16 text

Genetic programming
How does it work?
GP algorithms work iteratively as an evolutionary cycle, evolving a population of computer programs or models.
The evolutionary process of GP is shown in the following figure:
Figure: Main GP loop [Sheta et al., 2014].
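The evolutionary cycle can be illustrated with a toy tree-based GP for symbolic regression. This is a rough sketch under stated assumptions, not HeuristicLab's algorithm: the primitive set, the tuple-based tree encoding, the target function and every function name are hypothetical. The tournament size of 5, the single elite and the 15% mutation probability mirror the GP parameter table in these slides; applying crossover to every child and capping the depth at 8 are simplifications.

```python
import math
import operator
import random

# Hypothetical, deliberately small primitive set.
FUNCTIONS = [operator.add, operator.sub, operator.mul]
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth):
    """Grow a random expression tree encoded as (function, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if isinstance(tree, tuple):
        func, left, right = tree
        return func(evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def depth(tree):
    return 1 + max(depth(child) for child in tree[1:]) if isinstance(tree, tuple) else 0


def fitness(tree, samples):
    """Mean squared error (lower is better); non-finite errors count as infinitely bad."""
    total = 0.0
    for x, y in samples:
        err = evaluate(tree, x) - y
        total += err * err
    mse = total / len(samples)
    return mse if math.isfinite(mse) else float("inf")


def random_subtree(rng, tree):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree


def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return random_subtree(rng, b)
    func, *children = a
    i = rng.randrange(len(children))
    children[i] = crossover(rng, children[i], b)
    return (func, *children)


def mutate(rng, tree):
    """Replace a random subtree with a freshly grown one."""
    return crossover(rng, tree, random_tree(rng, 2))


def tournament(rng, scored, size=5):
    return min(rng.sample(scored, size), key=lambda pair: pair[0])[1]


def evolve(samples, pop_size=50, generations=30, max_depth=8, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Evaluate, then breed the next generation from the scored population.
        scored = sorted(((fitness(t, samples), t) for t in population), key=lambda pair: pair[0])
        history.append(scored[0][0])
        next_gen = [scored[0][1]]  # elitism: carry over the single best individual
        while len(next_gen) < pop_size:
            mother, father = tournament(rng, scored), tournament(rng, scored)
            child = crossover(rng, mother, father)
            if rng.random() < 0.15:
                child = mutate(rng, child)
            next_gen.append(child if depth(child) <= max_depth else mother)
        population = next_gen
    return scored[0][1], history


# Hypothetical target: y = x^2 + 1 on a small grid.
samples = [(x / 2, (x / 2) ** 2 + 1) for x in range(-4, 5)]
best, history = evolve(samples)
print("best error per generation:", [round(h, 3) for h in history])
```

Because the best individual is always copied into the next generation, the best error recorded per generation never increases, which is the behaviour the elitism parameter is meant to guarantee.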

Slide 21

Slide 21 text

Motivation
GP has been used to solve data classification problems and has been successful in producing good classifiers.
GP has the capacity to model very complex problems in the areas of machine learning, data mining and pattern recognition.
GP is a powerful classification technique.
GP is interpretable.
Addressing data classification with genetic programming is not always practical because of the long computation time (hours or days) [Jabeen and Baig, 2010, Faris et al., 2014].

Slide 23

Slide 23 text

Research objectives and questions
Research objectives
The primary objective of this paper is to investigate the influence of input data standardization methods on the performance of genetic programming in the domain of data classification.
Research questions
What is the impact of input data standardization methods on GP?
How do these methods affect the prediction accuracy of GP?

Slide 25

Slide 25 text

Data sets description
Ten binary and nearly balanced data sets were obtained from the University of California at Irvine (UCI) machine learning repository [Dheeru and Karra Taniskidou, 2017].
Dataset                           Classes  Features  Data points  Objects per class  Type
Breast Cancer Wisconsin           2        9         683          444-239            Integer
Ionosphere                        2        34        351          225-126            Integer, Real
Parkinsons                        2        22        195          147-48             Real
Indian Liver Patient              2        8         583          416-167            Integer, Real
Blood Transfusion Service Center  2        4         748          570-178            Real
Haberman's Survival               2        3         306          225-81             Integer
Mammographic Mass                 2        5         830          427-403            Integer
MONK's Problems                   2        6         432          228-204            Categorical
Connectionist Bench               2        60        208          111-97             Real
Australian Credit Approval        2        14        690          383-307            Categorical, Integer, Real

Slide 31

Slide 31 text

Experiments environment
All experiments are conducted on a PC with the Windows 7 Ultimate 64-bit operating system, an Intel(R) Core(TM) i7-4500U CPU and 8 GB of RAM.
HeuristicLab version 3.3 is used to perform all symbolic GP experiments.
A simple split method is used as the training and testing methodology.
Each experiment is repeated 30 times independently in order to obtain statistically meaningful results.
The following table shows the GP parameters, which were determined empirically through trial runs.

Slide 32

Slide 32 text

Experiments environment
Table: GP Parameters.
GP Parameter                              Value
Elites                                    1
Population Size                           50, 100, 200
Maximum Generations                       100, 200, 500
Mutation Probability                      15%
Internal Crossover Point Probability      90%
Maximum Symbolic Expression Tree Depth    15
Maximum Symbolic Expression Tree Length   15
Solution Creator                          Probabilistic Tree Creator
Parent Selection Method                   Tournament selection, size 5
Symbolic Expression Tree Grammar          Addition, Subtraction, Multiplication, Division, Sine, Cosine, Tangent, Exponential, Logarithm, Root, Power, GreaterThan, LessThan, And, Or, Not, IfThenElse

Slide 37

Slide 37 text

Experiments environment
All experiments apply GP to 10 different data sets using six different standardization methods.
The experiments are divided into three scenarios according to the population size and maximum generations parameters of GP.
1. In the first scenario, the population size and maximum generations are set to 50 and 100, respectively.
2. In the second scenario, the population size and maximum generations are changed to 100 and 200, respectively.
3. In the third scenario, the population size and maximum generations are changed to 200 and 500, respectively.
A rank test was used across all three scenarios to provide an overall summary of the influence of the different standardization methods on GP.

Slide 39

Slide 39 text

Results - Scenario I
Table: Accuracy results of different standardization methods for scenario I (Population size=50, Maximum generations=100)
Dataset                           Maximum      Manhattan    Min-Max      Peldschus    Vector       Z-score      Original
Breast Cancer Wisconsin           0.93 ± 0.03  0.91 ± 0.03  0.94 ± 0.02  0.92 ± 0.03  0.94 ± 0.02  0.93 ± 0.02  0.93 ± 0.02
Ionosphere                        0.76 ± 0.07  0.71 ± 0.12  0.78 ± 0.06  0.79 ± 0.04  0.75 ± 0.06  0.77 ± 0.06  0.73 ± 0.12
Parkinsons                        0.87 ± 0.04  0.79 ± 0.12  0.84 ± 0.02  0.82 ± 0.03  0.80 ± 0.04  0.77 ± 0.07  0.78 ± 0.07
Indian Liver Patient              0.66 ± 0.12  0.54 ± 0.20  0.68 ± 0.10  0.70 ± 0.02  0.70 ± 0.02  0.69 ± 0.04  0.70 ± 0.03
Blood Transfusion Service Center  0.76 ± 0.01  0.50 ± 0.25  0.73 ± 0.11  0.70 ± 0.14  0.76 ± 0.01  0.70 ± 0.08  0.74 ± 0.07
Haberman's Survival               0.74 ± 0.01  0.54 ± 0.24  0.61 ± 0.20  0.55 ± 0.23  0.70 ± 0.08  0.68 ± 0.14  0.50 ± 0.25
Mammographic Mass                 0.79 ± 0.01  0.78 ± 0.06  0.78 ± 0.02  0.76 ± 0.01  0.80 ± 0.01  0.83 ± 0.03  0.83 ± 0.01
MONK's Problems                   0.86 ± 0.07  0.73 ± 0.18  0.81 ± 0.06  0.82 ± 0.14  0.74 ± 0.12  0.79 ± 0.05  0.53 ± 0.18
Connectionist Bench               0.67 ± 0.05  0.64 ± 0.10  0.73 ± 0.04  0.61 ± 0.07  0.61 ± 0.07  0.56 ± 0.06  0.68 ± 0.07
Australian Credit Approval        0.82 ± 0.01  0.84 ± 0.08  0.86 ± 0.01  0.88 ± 0.00  0.84 ± 0.05  0.85 ± 0.04  0.86 ± 0.05
(In the original slide, the best values are marked in bold.)

Slide 40

Slide 40 text

Results - Scenario I
Table: Ranks for different standardization methods for scenario I (Population size=50, Maximum generations=100)
Dataset                           Maximum    Manhattan  Min-Max    Peldschus  Vector     Z-score    Original
Breast Cancer Wisconsin           3          7          1          6          2          4          5
Ionosphere                        4          7          2          1          5          3          6
Parkinsons                        1          5          2          3          4          7          6
Indian Liver Patient              6          7          5          3          2          4          1
Blood Transfusion Service Center  1          7          4          5          2          6          3
Haberman's Survival               1          6          4          5          2          3          7
Mammographic Mass                 4          6          5          7          3          1          2
MONK's Problems                   1          6          3          2          5          4          7
Connectionist Bench               3          4          1          5          6          7          2
Australian Credit Approval        7          5          2          1          6          4          3
Rank sum                          31         60         29         38         37         43         42
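The rank-sum row can be reproduced from the per-dataset ranks with a short script. The sketch below is illustrative: `rank_row` is a hypothetical helper that ranks one dataset's mean accuracies (rank 1 is the highest), and breaking ties by column order is an assumption. The slides presumably ranked unrounded accuracies, so ranking the rounded table values would not always match the published ranks where means tie.

```python
METHODS = ["Maximum", "Manhattan", "Min-Max", "Peldschus", "Vector", "Z-score", "Original"]


def rank_row(accuracies):
    """Rank 1 = highest accuracy; ties are broken by column order (an assumption)."""
    order = sorted(range(len(accuracies)), key=lambda j: -accuracies[j])
    ranks = [0] * len(accuracies)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


# Parkinsons row of the scenario I accuracy table (means only, no ties).
parkinsons = [0.87, 0.79, 0.84, 0.82, 0.80, 0.77, 0.78]
print(rank_row(parkinsons))

# Ranks copied from the scenario I rank table, one row per data set.
SCENARIO_I_RANKS = [
    [3, 7, 1, 6, 2, 4, 5],  # Breast Cancer Wisconsin
    [4, 7, 2, 1, 5, 3, 6],  # Ionosphere
    [1, 5, 2, 3, 4, 7, 6],  # Parkinsons
    [6, 7, 5, 3, 2, 4, 1],  # Indian Liver Patient
    [1, 7, 4, 5, 2, 6, 3],  # Blood Transfusion Service Center
    [1, 6, 4, 5, 2, 3, 7],  # Haberman's Survival
    [4, 6, 5, 7, 3, 1, 2],  # Mammographic Mass
    [1, 6, 3, 2, 5, 4, 7],  # MONK's Problems
    [3, 4, 1, 5, 6, 7, 2],  # Connectionist Bench
    [7, 5, 2, 1, 6, 4, 3],  # Australian Credit Approval
]

# Sum each method's column; the lowest rank sum is the best overall method.
rank_sums = [sum(column) for column in zip(*SCENARIO_I_RANKS)]
best = METHODS[rank_sums.index(min(rank_sums))]
print(dict(zip(METHODS, rank_sums)), "best:", best)
```

A lower rank sum means the method placed near the top on more data sets, which is why the method with the smallest sum is reported as the best rank.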

Slide 41

Slide 41 text

Results - Scenario I
Average accuracy
There is a significant difference in average accuracy when using the standardization methods.
→ The accuracy generally decreases when using the Manhattan and Z-score methods.
Rank test
Min-Max obtains the best rank.
→ This confirms the ability of GP based on Min-Max to obtain better accuracy with fewer iterations.

Slide 42

Slide 42 text

Results - Scenario II
Table: Accuracy results of different standardization methods for scenario II (Population size=100, Maximum generations=200)
Dataset                           Maximum      Manhattan    Min-Max      Peldschus    Vector       Z-score      Original
Breast Cancer Wisconsin           0.95 ± 0.02  0.92 ± 0.02  0.95 ± 0.03  0.96 ± 0.03  0.94 ± 0.02  0.94 ± 0.02  0.94 ± 0.02
Ionosphere                        0.81 ± 0.05  0.78 ± 0.07  0.81 ± 0.05  0.59 ± 0.32  0.68 ± 0.18  0.79 ± 0.11  0.75 ± 0.20
Parkinsons                        0.81 ± 0.01  0.81 ± 0.03  0.79 ± 0.02  0.85 ± 0.05  0.85 ± 0.03  0.80 ± 0.08  0.83 ± 0.03
Indian Liver Patient              0.68 ± 0.03  0.67 ± 0.09  0.70 ± 0.06  0.68 ± 0.03  0.69 ± 0.04  0.69 ± 0.05  0.71 ± 0.01
Blood Transfusion Service Center  0.77 ± 0.01  0.69 ± 0.16  0.69 ± 0.16  0.73 ± 0.11  0.75 ± 0.01  0.68 ± 0.10  0.73 ± 0.09
Haberman's Survival               0.72 ± 0.03  0.73 ± 0.10  0.73 ± 0.01  0.67 ± 0.08  0.75 ± 0.01  0.70 ± 0.12  0.71 ± 0.13
Mammographic Mass                 0.83 ± 0.01  0.78 ± 0.03  0.80 ± 0.01  0.79 ± 0.03  0.78 ± 0.02  0.81 ± 0.01  0.78 ± 0.03
MONK's Problems                   0.86 ± 0.10  0.78 ± 0.13  0.83 ± 0.10  0.91 ± 0.08  0.87 ± 0.08  0.81 ± 0.07  0.72 ± 0.19
Connectionist Bench               0.68 ± 0.02  0.66 ± 0.07  0.71 ± 0.07  0.73 ± 0.07  0.73 ± 0.07  0.58 ± 0.09  0.74 ± 0.05
Australian Credit Approval        0.83 ± 0.01  0.85 ± 0.01  0.85 ± 0.00  0.88 ± 0.01  0.87 ± 0.02  0.85 ± 0.01  0.85 ± 0.01
(In the original slide, the best values are marked in bold.)

Slide 43

Slide 43 text

Results - Scenario II
Table: Summary of ranks for different standardization methods for scenario II (Population size=100, Maximum generations=200)
Dataset                           Maximum    Manhattan  Min-Max    Peldschus  Vector     Z-score    Original
Breast Cancer Wisconsin           2          7          3          1          4          6          5
Ionosphere                        2          4          1          7          6          3          5
Parkinsons                        5          4          7          1          2          6          3
Indian Liver Patient              6          7          2          5          3          4          1
Blood Transfusion Service Center  1          6          5          3          2          7          4
Haberman's Survival               4          2          3          7          1          6          5
Mammographic Mass                 1          6          3          4          7          2          5
MONK's Problems                   3          6          4          1          2          5          7
Connectionist Bench               5          6          4          3          2          7          1
Australian Credit Approval        7          6          3          1          2          4          5
Rank sum                          36         54         35         33         31         50         41

Slide 44

Slide 44 text

Results - Scenario II
Average accuracy
The effect of standardization methods on GP was reduced.
→ The accuracy decreases when using Manhattan and Z-score.
Rank test
GP based on Vector obtains the best rank.
→ This confirms the ability of GP based on Vector to obtain better accuracy with fewer iterations.

Slide 45

Slide 45 text

Results - Scenario III
Table: Accuracy results of different standardization methods for scenario III (Population size=200, Maximum generations=500)
Dataset                           Maximum      Manhattan    Min-Max      Peldschus    Vector       Z-score      Original
Breast Cancer Wisconsin           0.94 ± 0.02  0.93 ± 0.02  0.95 ± 0.02  0.94 ± 0.01  0.96 ± 0.02  0.94 ± 0.03  0.94 ± 0.02
Ionosphere                        0.85 ± 0.04  0.83 ± 0.02  0.82 ± 0.04  0.79 ± 0.05  0.79 ± 0.04  0.76 ± 0.20  0.84 ± 0.12
Parkinsons                        0.84 ± 0.02  0.85 ± 0.02  0.87 ± 0.04  0.85 ± 0.04  0.85 ± 0.03  0.84 ± 0.05  0.85 ± 0.04
Indian Liver Patient              0.69 ± 0.03  0.69 ± 0.08  0.72 ± 0.01  0.70 ± 0.01  0.69 ± 0.03  0.70 ± 0.03  0.71 ± 0.03
Blood Transfusion Service Center  0.78 ± 0.01  0.75 ± 0.05  0.75 ± 0.02  0.76 ± 0.03  0.76 ± 0.03  0.74 ± 0.06  0.75 ± 0.05
Haberman's Survival               0.74 ± 0.01  0.75 ± 0.02  0.73 ± 0.02  0.71 ± 0.07  0.76 ± 0.01  0.75 ± 0.02  0.72 ± 0.02
Mammographic Mass                 0.80 ± 0.02  0.81 ± 0.01  0.82 ± 0.02  0.78 ± 0.03  0.83 ± 0.02  0.79 ± 0.02  0.83 ± 0.01
MONK's Problems                   0.91 ± 0.10  0.82 ± 0.08  0.90 ± 0.07  0.94 ± 0.07  0.85 ± 0.09  0.84 ± 0.09  0.91 ± 0.07
Connectionist Bench               0.72 ± 0.03  0.71 ± 0.08  0.72 ± 0.05  0.73 ± 0.08  0.76 ± 0.04  0.64 ± 0.07  0.73 ± 0.03
Australian Credit Approval        0.85 ± 0.03  0.85 ± 0.01  0.87 ± 0.01  0.85 ± 0.01  0.85 ± 0.01  0.85 ± 0.01  0.84 ± 0.01
(In the original slide, the best values are marked in bold.)

Slide 46

Slide 46 text

Results - Scenario III
Table: Summary of ranks for different standardization methods for scenario III (Population size=200, Maximum generations=500)
Dataset                           Maximum    Manhattan  Min-Max    Peldschus  Vector     Z-score    Original
Breast Cancer Wisconsin           4          7          3          5          2          6          1
Ionosphere                        1          3          4          6          5          7          2
Parkinsons                        5          7          1          3          6          4          2
Indian Liver Patient              6          5          1          4          2          7          3
Blood Transfusion Service Center  1          6          5          3          2          7          4
Haberman's Survival               4          3          5          7          1          2          6
Mammographic Mass                 5          4          3          7          1          6          2
MONK's Problems                   2          7          4          1          5          6          3
Connectionist Bench               4          6          5          2          1          7          3
Australian Credit Approval        3          6          1          2          4          5          7
Rank sum                          35         54         32         40         29         57         33

Slide 47

Slide 47 text

Results - Scenario III
Average accuracy
The effect of standardization methods on GP is not noticeable.
→ Using the Manhattan and Z-score standardization methods does not improve the accuracy of GP. The accuracy decreases when using Manhattan and Z-score.
Rank test
GP based on Vector obtains the best rank.
→ This confirms the ability of GP based on Vector to obtain better accuracy with fewer iterations.

Slide 53

Slide 53 text

Results - Overall

The results showed that the GP based on the Vector and Min-Max standardization methods is the best.
→ This confirms the ability of the GP based on Vector and Min-Max to obtain better accuracy with fewer iterations.

The factors that influence the performance of GP at a lower population size and a lower maximum number of generations are:
→ The size of the data set.
→ The standardization method.

GP requires more iterations and a larger population size when no standardization method is applied.

Slide 60

Slide 60 text

Conclusions

The goal of this paper is to investigate the effect of data standardization on the accuracy of GP classification.
→ Three scenarios have been implemented and tested using six different standardization methods on ten datasets.

GP with data standardization can achieve higher accuracy rates than GP without it.
→ By using standardization methods, GP managed to achieve higher accuracy with fewer iterations and a smaller population size.

The best results are obtained when using the Min-Max and Vector methods.

The Manhattan and Z-score methods achieved the worst accuracy results.

Data standardization improves the classification accuracy of the generated GP trees.
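To make the two best-performing methods concrete, here is a minimal sketch of Min-Max and Vector standardization applied to a single feature column. It assumes the common textbook forms, (x - min) / (max - min) and x / sqrt(sum of x squared); the exact formulas used in the study may differ in detail. The function names and the handling of constant or all-zero columns are illustrative assumptions.

```python
import math


def min_max(column):
    """Rescale a feature column to [0, 1] (assumed textbook form)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def vector(column):
    """Divide a feature column by its Euclidean norm (assumed textbook form)."""
    norm = math.sqrt(sum(x * x for x in column))
    if norm == 0.0:
        # Assumption: an all-zero column is returned unchanged.
        return [0.0 for _ in column]
    return [x / norm for x in column]


# Toy feature column, not taken from any of the ten data sets.
column = [3.0, 4.0, 0.0]
scaled_min_max = min_max(column)
scaled_vector = vector(column)
print(scaled_min_max)
print(scaled_vector)
```

Both functions work column by column, so standardizing a data set means applying the chosen function to every input feature before the data is passed to GP.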

Slide 64

Slide 64 text

Future works

Future work includes:
→ Testing the effect of other GP parameters in combination with data standardization.
→ Testing the usage of GP for other types of prediction problems, such as multi-class classification and regression problems.
→ Studying the influence of data standardization methods when GP is applied to higher-dimensional datasets.

Slide 68

Slide 68 text

Thank You! Questions?