The influence of input data standardization methods on the prediction accuracy of genetic programming generated classifier

Paper presentation for IJCCI 2018

Juan Julián Merelo Guervós

September 18, 2018

Transcript

  1. The influence of input data standardization methods on the prediction accuracy of genetic programming generated classifier

    Amaal R. Al Shorman (1), Hossam Faris (1), Pedro A. Castillo (2), J.J. Merelo (2), Nailah Al-Madi (3)
    (1) Department of Business Information Technology, University of Jordan, Amman, Jordan
    (2) Department of Computer Architecture and Computer Technology, ETSIIT and CITIC, University of Granada, Granada, Spain
    (3) Department of Computer Science, Princess Sumaya University for Technology, Amman, Jordan
    International Joint Conference on Computational Intelligence, 2018
    Amaal R. Al Shorman, Hossam Faris, Pedro A. Castillo, J.J. Merelo, Nailah Al-Madi (University of Jordan, Amman, Jordan) The influence of input data standardization methods on the prediction accuracy of genetic International Joint Conference on Computatio / 36
  2. Outline

    1 Introduction
    2 Motivation
    3 Research objectives and questions
    4 Experiments and results
      Data sets description
      Experiments environment
      Results
      Results - Scenario I
      Results - Scenario II
      Results - Scenario III
      Results - Overall
    5 Conclusions and future works
      Conclusions
      Future works
  7. Introduction - data standardization - Genetic programming

    Data classification
    Data classification techniques deal with creating classifiers which assign labels to data vectors.
    → Use the existing data ⇒ to build classifier ⇒ apply it to new unseen data.
    Various techniques have been applied to data classification, including:
    → Statistical methods.
    → Evolutionary algorithms such as genetic programming.
  8. Introduction – Data standardization

    Data standardization
    Data standardization is one of the most important pre-processing steps in machine learning. Its purpose is to unify the scale of all input features so that they contribute equally to the model.
  9. Methods applied to data sets

    Taken from [Kaftanowicz and Krzemiński, 2015, Zavadskas and Turskis, 2008, Altman, 1968]:
    Vector standardization
    $$A_i = \frac{A_{oi}}{\sqrt{\sum_{i=1}^{n} (A_{oi})^2}} \tag{1}$$
    Manhattan standardization
    $$A_i = \frac{A_{oi}}{\sum_{i=1}^{n} |A_{oi}|} \tag{2}$$
  10. Methods applied to data sets (2)

    Maximum linear standardization
    $$A_i = \frac{A_{oi}}{\max A_{oi}} \tag{3}$$
    Weitendorf's linear standardization
    $$A_i = \frac{A_{oi} - \min A_{oi}}{\max A_{oi} - \min A_{oi}} \tag{4}$$
  11. Methods applied to data sets (and 3)

    Peldschus' nonlinear standardization
    $$A_i = \left( \frac{A_{oi}}{\max A_{oi}} \right)^2 \tag{5}$$
    Altman Z-score standardization
    $$A_i = \frac{A_{oi} - \bar{E}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (A_{oi} - \bar{E})^2}} \tag{6}$$
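The six formulas (1) to (6) can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the experiments: it assumes each method is applied to one feature column at a time, that the column is not constant, and that its maximum and sums are non-zero. The function names and the sample column are hypothetical.

```python
import math

def vector(col):
    """Eq. (1): divide each value by the Euclidean norm of the column."""
    norm = math.sqrt(sum(a * a for a in col))
    return [a / norm for a in col]

def manhattan(col):
    """Eq. (2): divide each value by the sum of absolute values."""
    total = sum(abs(a) for a in col)
    return [a / total for a in col]

def maximum_linear(col):
    """Eq. (3): divide each value by the column maximum."""
    top = max(col)
    return [a / top for a in col]

def weitendorf(col):
    """Eq. (4): min-max scaling to the interval [0, 1]."""
    lo, hi = min(col), max(col)
    return [(a - lo) / (hi - lo) for a in col]

def peldschus(col):
    """Eq. (5): square of the maximum-linear value."""
    top = max(col)
    return [(a / top) ** 2 for a in col]

def z_score(col):
    """Eq. (6): subtract the mean, divide by the sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in col) / (n - 1))
    return [(a - mean) / std for a in col]

# Hypothetical feature column, used only to show the different output scales.
feature = [2.0, 4.0, 6.0, 8.0]
for method in (vector, manhattan, maximum_linear, weitendorf, peldschus, z_score):
    print(method.__name__, [round(v, 3) for v in method(feature)])
```

Only the Z-score output is centred on zero; the other five keep the sign of the input, which is one way the methods can present different inputs to the evolved expressions.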
  15. What is Genetic Programming (GP)?

    Concept introduced by John Koza.
    Symbolic regression or classification method.
    GP is an evolutionary algorithm, inspired by the principles of Darwinian evolution theory and natural selection [Koza, 1992].
    GP is a domain-independent modeling technique that automatically solves problems without having to tell the computer explicitly how to do it [Koza, 1991].
  16. Genetic programming

    How does it work?
    GP algorithms work iteratively as an evolutionary cycle, evolving a population of computer programs or models.
    The evolutionary process of GP is shown in the following figure:
    Figure: Main GP loop [Sheta et al., 2014].
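The evolutionary cycle can be illustrated with a toy symbolic-regression loop. This is a minimal sketch and not HeuristicLab's implementation. One elite, tournament size 5 and a 15% mutation probability echo the GP parameter table in these slides; the tiny function set, the target function, the depth limit of 6, the 30 generations and all function names are assumptions made for the example.

```python
import random

# Function set of the toy GP; the real experiments use a much larger grammar.
OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def random_tree(rng, depth):
    """Grow a random expression over the variable x and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.7 else float(rng.randint(-2, 2))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from subtrees(tree[i], path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def depth(tree):
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))

def evolve(samples, pop_size=50, generations=30, seed=0, max_depth=6):
    rng = random.Random(seed)

    def error(tree):  # sum of squared errors, lower is better
        return sum((evaluate(tree, x) - y) ** 2 for x, y in samples)

    def tournament(pop, k=5):
        return min(rng.sample(pop, k), key=error)

    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = min(population, key=error)
        history.append(error(best))
        offspring = [best]  # elitism: the best individual always survives
        while len(offspring) < pop_size:
            mother, father = tournament(population), tournament(population)
            # Subtree crossover: graft a random subtree of father into mother.
            path, _ = rng.choice(list(subtrees(mother)))
            _, donor = rng.choice(list(subtrees(father)))
            child = replace(mother, path, donor)
            # Subtree mutation with probability 0.15.
            if rng.random() < 0.15:
                path, _ = rng.choice(list(subtrees(child)))
                child = replace(child, path, random_tree(rng, 2))
            # Reject children that exceed the depth limit.
            offspring.append(child if depth(child) <= max_depth else mother)
        population = offspring
    best = min(population, key=error)
    history.append(error(best))
    return best, history

# Hypothetical target: y = x^2 + x sampled on nine points.
SAMPLES = [(x / 2.0, (x / 2.0) ** 2 + x / 2.0) for x in range(-4, 5)]
best, history = evolve(SAMPLES)
print(best, history[0], history[-1])
```

Because the elite is copied unchanged into the next generation, the best error recorded in `history` can never increase from one generation to the next.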
  21. Motivation

    GP is used to solve data classification problems and has been successful in producing good classifiers.
    GP has the capacity to model very complex problems in the areas of Machine Learning, Data Mining and Pattern Recognition.
    GP is a powerful classification technique.
    GP is interpretable.
    Addressing data classification with genetic programming is not always practical because of the long computation time (hours or days) [Jabeen and Baig, 2010, Faris et al., 2014].
  23. Research objectives and questions

    Research objectives
    The primary objective of this paper is to investigate the influence of input data standardization methods on the performance of genetic programming in the domain of data classification.
    Research questions
    What is the impact of input data standardization methods on GP?
    How do these methods affect the prediction accuracy of GP?
  25. Data sets description

    Ten binary and nearly balanced data sets were obtained from the University of California at Irvine (UCI) machine learning repository [Dheeru and Karra Taniskidou, 2017].

    | Dataset | No. of classes | No. of features | No. of data points | No. of objects in each class | Dataset type |
    |---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 2 | 9 | 683 | 444-239 | Integer |
    | Ionosphere | 2 | 34 | 351 | 225-126 | Integer, Real |
    | Parkinsons | 2 | 22 | 195 | 147-48 | Real |
    | Indian Liver Patient | 2 | 8 | 583 | 416-167 | Integer, Real |
    | Blood Transfusion Service Center | 2 | 4 | 748 | 570-178 | Real |
    | Haberman's Survival | 2 | 3 | 306 | 225-81 | Integer |
    | Mammographic Mass | 2 | 5 | 830 | 427-403 | Integer |
    | MONK's Problems | 2 | 6 | 432 | 228-204 | Categorical |
    | Connectionist Bench | 2 | 60 | 208 | 111-97 | Real |
    | Australian Credit Approval | 2 | 14 | 690 | 383-307 | Categorical, Integer, Real |
  31. Experiments environment

    All experiments are conducted on a PC with the Windows 7 Ultimate 64-bit operating system, an Intel(R) Core(TM) i7-4500U CPU and 8 GB of RAM.
    HeuristicLab version 3.3 is used to perform all symbolic GP experiments.
    A simple split method is used as the training and testing methodology.
    Each experiment is repeated 30 times independently in order to obtain statistically meaningful results.
    The following table shows the GP parameters, which were determined empirically through trial runs.
  32. Experiments environment

    Table: GP Parameters.

    | GP Parameter | Value |
    |---|---|
    | Elites | 1 |
    | Population Size | 50, 100, 200 |
    | Maximum Generations | 100, 200, 500 |
    | Mutation Probability | 15% |
    | Internal Crossover Point Probability | 90% |
    | Maximum Symbolic Expression Tree Depth | 15 |
    | Maximum Symbolic Expression Tree Length | 15 |
    | Solution Creator | Probabilistic Tree Creator |
    | Parent Selection Method | Tournament selection, size 5 |
    | Symbolic Expression Tree Grammar | Addition, Subtraction, Multiplication, Division, Sine, Cosine, Tangent, Exponential, Logarithm, Root, Power, GreaterThan, LessThan, And, Or, Not, IfThenElse |
  37. Experiments environment

    All experiments apply GP to 10 different data sets using six different standardization methods.
    The experiments are divided into three scenarios according to the population size and maximum generations parameters of GP.
    1 In the first scenario, the population size and maximum generations are set to 50 and 100 respectively.
    2 In the second scenario, the population size and maximum generations are changed to 100 and 200 respectively.
    3 In the third scenario, the population size and maximum generations are changed to 200 and 500 respectively.
    A rank test was used across all three scenarios to provide an overall summary of the influence of the different standardization methods on GP.
  39. Results - Scenario I

    Table: Accuracy results of different standardization methods for scenario I (Population size=50, Maximum generations=100)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 0.93 ± 0.03 | 0.91 ± 0.03 | 0.94 ± 0.02 | 0.92 ± 0.03 | 0.94 ± 0.02 | 0.93 ± 0.02 | 0.93 ± 0.02 |
    | Ionosphere | 0.76 ± 0.07 | 0.71 ± 0.12 | 0.78 ± 0.06 | 0.79 ± 0.04 | 0.75 ± 0.06 | 0.77 ± 0.06 | 0.73 ± 0.12 |
    | Parkinsons | 0.87 ± 0.04 | 0.79 ± 0.12 | 0.84 ± 0.02 | 0.82 ± 0.03 | 0.80 ± 0.04 | 0.77 ± 0.07 | 0.78 ± 0.07 |
    | Indian Liver Patient | 0.66 ± 0.12 | 0.54 ± 0.20 | 0.68 ± 0.10 | 0.70 ± 0.02 | 0.70 ± 0.02 | 0.69 ± 0.04 | 0.70 ± 0.03 |
    | Blood Transfusion Service Center | 0.76 ± 0.01 | 0.50 ± 0.25 | 0.73 ± 0.11 | 0.70 ± 0.14 | 0.76 ± 0.01 | 0.70 ± 0.08 | 0.74 ± 0.07 |
    | Haberman's Survival | 0.74 ± 0.01 | 0.54 ± 0.24 | 0.61 ± 0.20 | 0.55 ± 0.23 | 0.70 ± 0.08 | 0.68 ± 0.14 | 0.50 ± 0.25 |
    | Mammographic Mass | 0.79 ± 0.01 | 0.78 ± 0.06 | 0.78 ± 0.02 | 0.76 ± 0.01 | 0.80 ± 0.01 | 0.83 ± 0.03 | 0.83 ± 0.01 |
    | MONK's Problems | 0.86 ± 0.07 | 0.73 ± 0.18 | 0.81 ± 0.06 | 0.82 ± 0.14 | 0.74 ± 0.12 | 0.79 ± 0.05 | 0.53 ± 0.18 |
    | Connectionist Bench | 0.67 ± 0.05 | 0.64 ± 0.10 | 0.73 ± 0.04 | 0.61 ± 0.07 | 0.61 ± 0.07 | 0.56 ± 0.06 | 0.68 ± 0.07 |
    | Australian Credit Approval | 0.82 ± 0.01 | 0.84 ± 0.08 | 0.86 ± 0.01 | 0.88 ± 0.00 | 0.84 ± 0.05 | 0.85 ± 0.04 | 0.86 ± 0.05 |

    (The best values are marked in bold in the original slides.)
  40. Results - Scenario I

    Table: Ranks for different standardization methods for scenario I (Population size=50, Maximum generations=100)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 3 | 7 | 1 | 6 | 2 | 4 | 5 |
    | Ionosphere | 4 | 7 | 2 | 1 | 5 | 3 | 6 |
    | Parkinsons | 1 | 5 | 2 | 3 | 4 | 7 | 6 |
    | Indian Liver Patient | 6 | 7 | 5 | 3 | 2 | 4 | 1 |
    | Blood Transfusion Service Center | 1 | 7 | 4 | 5 | 2 | 6 | 3 |
    | Haberman's Survival | 1 | 6 | 4 | 5 | 2 | 3 | 7 |
    | Mammographic Mass | 4 | 6 | 5 | 7 | 3 | 1 | 2 |
    | MONK's Problems | 1 | 6 | 3 | 2 | 5 | 4 | 7 |
    | Connectionist Bench | 3 | 4 | 1 | 5 | 6 | 7 | 2 |
    | Australian Credit Approval | 7 | 5 | 2 | 1 | 6 | 4 | 3 |
    | Rank sum | 31 | 60 | 29 | 38 | 37 | 43 | 42 |
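The ranking step can be reproduced with a short Python sketch. It assumes that rank 1 goes to the highest mean accuracy and that rank sums are plain column totals, which matches the scenario I numbers. The tie-breaking rule (column order) is an assumption, since the slides do not say how ties in the rounded accuracies were resolved. The data are copied from the scenario I tables.

```python
METHODS = ["Maximum", "Manhattan", "Min-Max", "Peldschus", "Vector", "Z-score", "Original"]

def rank_row(accuracies):
    """Rank 1 = highest accuracy; ties fall back to column order (assumption)."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0] * len(accuracies)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

# Mean accuracies of the Parkinsons row in scenario I (no ties in this row).
parkinsons = [0.87, 0.79, 0.84, 0.82, 0.80, 0.77, 0.78]
print(rank_row(parkinsons))  # [1, 5, 2, 3, 4, 7, 6]

# Per-dataset ranks from the scenario I rank table.
SCENARIO_I_RANKS = {
    "Breast Cancer Wisconsin": [3, 7, 1, 6, 2, 4, 5],
    "Ionosphere": [4, 7, 2, 1, 5, 3, 6],
    "Parkinsons": [1, 5, 2, 3, 4, 7, 6],
    "Indian Liver Patient": [6, 7, 5, 3, 2, 4, 1],
    "Blood Transfusion Service Center": [1, 7, 4, 5, 2, 6, 3],
    "Haberman's Survival": [1, 6, 4, 5, 2, 3, 7],
    "Mammographic Mass": [4, 6, 5, 7, 3, 1, 2],
    "MONK's Problems": [1, 6, 3, 2, 5, 4, 7],
    "Connectionist Bench": [3, 4, 1, 5, 6, 7, 2],
    "Australian Credit Approval": [7, 5, 2, 1, 6, 4, 3],
}

# Column totals: one rank sum per standardization method (lower is better).
rank_sums = [sum(column) for column in zip(*SCENARIO_I_RANKS.values())]
best = METHODS[rank_sums.index(min(rank_sums))]
print(dict(zip(METHODS, rank_sums)), best)
```

The column totals agree with the rank-sum row of the table, and the lowest total (29) belongs to Min-Max.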
  41. Results - Scenario I

    Average accuracy
    There is a significant difference in average accuracy when using the standardization methods.
    → The accuracy generally decreases when using the Manhattan and Z-score methods.
    Rank test
    Min-Max obtains the best rank.
    → This confirms the ability of GP based on Min-Max to obtain better accuracy with fewer iterations.
  42. Results - Scenario II

    Table: Accuracy results of different standardization methods for scenario II (Population size=100, Maximum generations=200)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 0.95 ± 0.02 | 0.92 ± 0.02 | 0.95 ± 0.03 | 0.96 ± 0.03 | 0.94 ± 0.02 | 0.94 ± 0.02 | 0.94 ± 0.02 |
    | Ionosphere | 0.81 ± 0.05 | 0.78 ± 0.07 | 0.81 ± 0.05 | 0.59 ± 0.32 | 0.68 ± 0.18 | 0.79 ± 0.11 | 0.75 ± 0.20 |
    | Parkinsons | 0.81 ± 0.01 | 0.81 ± 0.03 | 0.79 ± 0.02 | 0.85 ± 0.05 | 0.85 ± 0.03 | 0.80 ± 0.08 | 0.83 ± 0.03 |
    | Indian Liver Patient | 0.68 ± 0.03 | 0.67 ± 0.09 | 0.70 ± 0.06 | 0.68 ± 0.03 | 0.69 ± 0.04 | 0.69 ± 0.05 | 0.71 ± 0.01 |
    | Blood Transfusion Service Center | 0.77 ± 0.01 | 0.69 ± 0.16 | 0.69 ± 0.16 | 0.73 ± 0.11 | 0.75 ± 0.01 | 0.68 ± 0.10 | 0.73 ± 0.09 |
    | Haberman's Survival | 0.72 ± 0.03 | 0.73 ± 0.10 | 0.73 ± 0.01 | 0.67 ± 0.08 | 0.75 ± 0.01 | 0.70 ± 0.12 | 0.71 ± 0.13 |
    | Mammographic Mass | 0.83 ± 0.01 | 0.78 ± 0.03 | 0.80 ± 0.01 | 0.79 ± 0.03 | 0.78 ± 0.02 | 0.81 ± 0.01 | 0.78 ± 0.03 |
    | MONK's Problems | 0.86 ± 0.10 | 0.78 ± 0.13 | 0.83 ± 0.10 | 0.91 ± 0.08 | 0.87 ± 0.08 | 0.81 ± 0.07 | 0.72 ± 0.19 |
    | Connectionist Bench | 0.68 ± 0.02 | 0.66 ± 0.07 | 0.71 ± 0.07 | 0.73 ± 0.07 | 0.73 ± 0.07 | 0.58 ± 0.09 | 0.74 ± 0.05 |
    | Australian Credit Approval | 0.83 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.00 | 0.88 ± 0.01 | 0.87 ± 0.02 | 0.85 ± 0.01 | 0.85 ± 0.01 |

    (The best values are marked in bold in the original slides.)
  43. Results - Scenario II

    Table: Summary of ranks for different standardization methods for scenario II (Population size=100, Maximum generations=200)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 2 | 7 | 3 | 1 | 4 | 6 | 5 |
    | Ionosphere | 2 | 4 | 1 | 7 | 6 | 3 | 5 |
    | Parkinsons | 5 | 4 | 7 | 1 | 2 | 6 | 3 |
    | Indian Liver Patient | 6 | 7 | 2 | 5 | 3 | 4 | 1 |
    | Blood Transfusion Service Center | 1 | 6 | 5 | 3 | 2 | 7 | 4 |
    | Haberman's Survival | 4 | 2 | 3 | 7 | 1 | 6 | 5 |
    | Mammographic Mass | 1 | 6 | 3 | 4 | 7 | 2 | 5 |
    | MONK's Problems | 3 | 6 | 4 | 1 | 2 | 5 | 7 |
    | Connectionist Bench | 5 | 6 | 4 | 3 | 2 | 7 | 1 |
    | Australian Credit Approval | 7 | 6 | 3 | 1 | 2 | 4 | 5 |
    | Rank sum | 36 | 54 | 35 | 33 | 31 | 50 | 41 |
  44. Results - Scenario II

    Average accuracy
    The effect of standardization methods on GP was reduced.
    → The accuracy decreases when using Manhattan and Z-score.
    Rank test
    GP based on Vector obtains the best rank.
    → This confirms the ability of GP based on Vector to obtain better accuracy with fewer iterations.
  45. Results - Scenario III

    Table: Accuracy results of different standardization methods for scenario III (Population size=200, Maximum generations=500)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 0.94 ± 0.02 | 0.93 ± 0.02 | 0.95 ± 0.02 | 0.94 ± 0.01 | 0.96 ± 0.02 | 0.94 ± 0.03 | 0.94 ± 0.02 |
    | Ionosphere | 0.85 ± 0.04 | 0.83 ± 0.02 | 0.82 ± 0.04 | 0.79 ± 0.05 | 0.79 ± 0.04 | 0.76 ± 0.20 | 0.84 ± 0.12 |
    | Parkinsons | 0.84 ± 0.02 | 0.85 ± 0.02 | 0.87 ± 0.04 | 0.85 ± 0.04 | 0.85 ± 0.03 | 0.84 ± 0.05 | 0.85 ± 0.04 |
    | Indian Liver Patient | 0.69 ± 0.03 | 0.69 ± 0.08 | 0.72 ± 0.01 | 0.70 ± 0.01 | 0.69 ± 0.03 | 0.70 ± 0.03 | 0.71 ± 0.03 |
    | Blood Transfusion Service Center | 0.78 ± 0.01 | 0.75 ± 0.05 | 0.75 ± 0.02 | 0.76 ± 0.03 | 0.76 ± 0.03 | 0.74 ± 0.06 | 0.75 ± 0.05 |
    | Haberman's Survival | 0.74 ± 0.01 | 0.75 ± 0.02 | 0.73 ± 0.02 | 0.71 ± 0.07 | 0.76 ± 0.01 | 0.75 ± 0.02 | 0.72 ± 0.02 |
    | Mammographic Mass | 0.80 ± 0.02 | 0.81 ± 0.01 | 0.82 ± 0.02 | 0.78 ± 0.03 | 0.83 ± 0.02 | 0.79 ± 0.02 | 0.83 ± 0.01 |
    | MONK's Problems | 0.91 ± 0.10 | 0.82 ± 0.08 | 0.90 ± 0.07 | 0.94 ± 0.07 | 0.85 ± 0.09 | 0.84 ± 0.09 | 0.91 ± 0.07 |
    | Connectionist Bench | 0.72 ± 0.03 | 0.71 ± 0.08 | 0.72 ± 0.05 | 0.73 ± 0.08 | 0.76 ± 0.04 | 0.64 ± 0.07 | 0.73 ± 0.03 |
    | Australian Credit Approval | 0.85 ± 0.03 | 0.85 ± 0.01 | 0.87 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01 | 0.84 ± 0.01 |

    (The best values are marked in bold in the original slides.)
  46. Results - Scenario III

    Table: Summary of ranks for different standardization methods for scenario III (Population size=200, Maximum generations=500)

    | Dataset | Maximum | Manhattan | Min-Max | Peldschus | Vector | Z-score | Original |
    |---|---|---|---|---|---|---|---|
    | Breast Cancer Wisconsin | 4 | 7 | 3 | 5 | 2 | 6 | 1 |
    | Ionosphere | 1 | 3 | 4 | 6 | 5 | 7 | 2 |
    | Parkinsons | 5 | 7 | 1 | 3 | 6 | 4 | 2 |
    | Indian Liver Patient | 6 | 5 | 1 | 4 | 2 | 7 | 3 |
    | Blood Transfusion Service Center | 1 | 6 | 5 | 3 | 2 | 7 | 4 |
    | Haberman's Survival | 4 | 3 | 5 | 7 | 1 | 2 | 6 |
    | Mammographic Mass | 5 | 4 | 3 | 7 | 1 | 6 | 2 |
    | MONK's Problems | 2 | 7 | 4 | 1 | 5 | 6 | 3 |
    | Connectionist Bench | 4 | 6 | 5 | 2 | 1 | 7 | 3 |
    | Australian Credit Approval | 3 | 6 | 1 | 2 | 4 | 5 | 7 |
    | Rank sum | 35 | 54 | 32 | 40 | 29 | 57 | 33 |
  47. Results - Scenario III

    Average accuracy
    The effect of standardization methods on GP is not noticeable.
    → Using the Manhattan and Z-score standardization methods does not improve the accuracy of GP. The accuracy decreases when using Manhattan and Z-score.
    Rank test
    GP based on Vector obtains the best rank.
    → This confirms the ability of GP based on Vector to obtain better accuracy with fewer iterations.
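One simple way to summarise the three scenarios together is to add up the rank sums reported in the three rank tables. The aggregation below is an assumption made for illustration and is not necessarily the exact rank test the authors used. The numbers are copied from the rank-sum rows of scenarios I, II and III.

```python
METHODS = ["Maximum", "Manhattan", "Min-Max", "Peldschus", "Vector", "Z-score", "Original"]

# Rank-sum rows of the three rank tables (lower is better).
RANK_SUMS = {
    "I": [31, 60, 29, 38, 37, 43, 42],
    "II": [36, 54, 35, 33, 31, 50, 41],
    "III": [35, 54, 32, 40, 29, 57, 33],
}

# Total rank sum per method over the three scenarios.
overall = {
    method: sum(sums[i] for sums in RANK_SUMS.values())
    for i, method in enumerate(METHODS)
}

# Methods ordered from best (lowest total) to worst.
ordered = sorted(overall, key=overall.get)
for method in ordered:
    print(f"{method:10s} {overall[method]}")
```

Under this aggregation Min-Max (96) and Vector (97) come out almost level and ahead of the rest, while Z-score (150) and Manhattan (168) trail by a wide margin.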
  53. Results - Overall

    The results showed that GP based on the Vector and Min-Max standardization methods performs best.
    → This confirms the ability of GP based on Vector and Min-Max to obtain better accuracy with fewer iterations.
    The factors that influence the performance of GP at a lower population size and a lower maximum number of generations are:
    → The size of the data set.
    → The standardization method.
    GP requires more iterations and a larger population size if no standardization method is applied.
  60. Conclusions

    The goal of this paper is to investigate the effect of data standardization on the accuracy of GP classification.
    → Three scenarios have been implemented and tested using six different standardization methods on ten datasets.
    GP with data standardization can achieve higher accuracy rates than GP without it.
    → By using standardization methods, GP managed to achieve higher accuracy with fewer iterations and a smaller population size.
    The best results are obtained when using the Min-Max and Vector methods.
    The Manhattan and Z-score methods achieved the worst accuracy results.
    Data standardization improves the classification accuracy of the generated GP trees.
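For readers who want to experiment with the methods named in the conclusions, here is a minimal per-feature sketch in Python. The formulas are the commonly used textbook forms and are an assumption here; the exact definitions used in the paper may differ. The use of the population standard deviation in `z_score` is also an assumption, and the Peldschus method is left out rather than guessing its form. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

# Each function standardizes one feature column (a list of floats).
# None of them guards against constant or all-zero columns,
# which would cause a division by zero.


def min_max(col):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]


def maximum(col):
    """Divide by the column maximum: x / max."""
    hi = max(col)
    return [x / hi for x in col]


def manhattan(col):
    """Divide by the L1 norm: x / sum(|x|)."""
    total = sum(abs(x) for x in col)
    return [x / total for x in col]


def vector(col):
    """Divide by the Euclidean (L2) norm: x / sqrt(sum(x^2))."""
    norm = math.sqrt(sum(x * x for x in col))
    return [x / norm for x in col]


def z_score(col):
    """Center and scale: (x - mean) / std (population std assumed)."""
    mu, sigma = mean(col), pstdev(col)
    return [(x - mu) / sigma for x in col]


feature = [2.0, 4.0, 6.0, 8.0]  # toy feature column, not taken from the paper
for name, fn in [("Min-Max", min_max), ("Maximum", maximum),
                 ("Manhattan", manhattan), ("Vector", vector),
                 ("Z-score", z_score)]:
    print(f"{name:10s}", [round(v, 3) for v in fn(feature)])
```

In practice each function would be applied column by column to the training data before the GP run, with the scaling constants reused for the test data.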
  64. Future works

    Future work includes:
    Testing the effect of other GP parameters in combination with data standardization.
    Testing the usage of GP for other types of prediction problems, such as multi-class classification and regression problems.
    Studying the influence of data standardization methods when GP is applied to higher-dimensional datasets.
  68. Thank You! Questions?