INTRODUCTION

Database systems have had great success during the past two decades. More and more data are being collected and saved in databases; a database with a petabyte (1024 terabytes) of data is no longer uncommon. Finding useful information in these databases has become an important focus of many enterprises, and more and more attention has turned to data mining as a key component of such information discovery. Data mining algorithms and visualization tools are being used to find important patterns in data and to create useful forecasts. This technology is being applied in virtually all business sectors, including banking, telecommunications, manufacturing, marketing and e-commerce.1

Data mining is not specific to one type of media or data, and should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. A data warehouse is a repository of data collected from multiple data sources (often heterogeneous), and is intended to be used as a whole under the same unified schema. Data warehouses are usually modeled by a multi-dimensional data structure.

KNOWLEDGE DISCOVERY

Knowledge discovery has been defined as the ‘non-trivial extraction of implicit, previously unknown and potentially useful information from data’. It is a process of which data mining forms just one part. Figure 1 shows a slightly idealized version of the complete knowledge discovery process. Data come in, possibly from many sources. They are integrated and placed in some common data store. Some of these data are then taken and pre-processed into a standard format. These ‘prepared data’ are then passed to a data mining algorithm that produces an output in the form of rules or some other kind of ‘patterns’. These are then interpreted to give new and potentially useful knowledge. This brief description makes it clear that although the data mining algorithms are central to knowledge discovery, they are not the whole story. The pre-processing of the data and the interpretation (as opposed to the blind use) of the results are both of great importance.2

Figure 1: Data mining is the core of the knowledge discovery process. Source: Bramer (2007).2

The main purpose of data mining is to extract patterns from the data at hand, increase their intrinsic value and transform the data into knowledge.1 Figure 2 shows that, in the information technology context, knowledge is very distinct from data and information.3

Figure 2: Data, information and knowledge. Source: Turban et al (2010).3

DATA MINING TASKS

Data mining tasks can be grouped into the following main categories:

  • Classification

  • Clustering

  • Association rules

  • Regression

  • Forecasting

RELATED STUDIES

Urbancic et al4 discussed aspects of the international offering of machine learning, data mining and decision support education, based on an investigation of materials available on the World Wide Web. In addition to general findings concerning the materials collected, such as their content, target audience and teaching materials, the aim of the investigation was to provide starting points for an analysis that could be used to enhance the educational offering in this area in general, and knowledge transfer from academia to industry and the business sector in particular.

Behrouz et al5 presented an approach to classifying students in order to predict their final grade based on features extracted from logged data in a web-based education system. They implemented a series of pattern classifiers and compared their performance on an online course dataset; a combination of multiple classifiers led to a significant improvement in classification performance.

Schumann6 demonstrated that the CRoss-Industry Standard Process for Data Mining (CRISP–DM), a non-proprietary data mining process that was developed for and is currently used in the business world, can be transferred to educational settings and provide a start-to-end structure that is capable of producing operationally actionable information to address the student achievement questions of educational leaders.

Hübscher et al7 collected data from interactions with an educational system; such data are only useful if the patterns found can be meaningfully interpreted in the context of the student–system interaction.

Romero et al8 compared different data mining methods and techniques for classifying students based on their Moodle usage data and the final marks obtained in their respective courses. They developed a specific mining tool to make the configuration and execution of data mining techniques easier for instructors, and used real data from seven Moodle courses taken by Cordoba University students.

Pittman9 proposed to expand the understanding of how educational institutions might benefit from including data mining processes and multivariate analysis to inform student retention strategies. Her study applied data mining techniques to student demographic and behavioral data in an institution of higher education, providing a detailed description of the data mining process.

DATA MINING IMPLEMENTATION AND ANALYSIS

There are many data mining techniques that can be used to evaluate a school's performance. The following classification techniques are used in this article:

  • Naïve Bayes Classification;

  • k-Nearest Neighbor Classification;

  • Decision Tree Classification.

CRISP–DM is used in this article;10 it consists mainly of the following phases:11

  • Business Understanding;

  • Data Understanding;

  • Data Preparation;

  • Modeling;

  • Evaluation;

  • Deployment.

Figure 3 shows the phases and tasks for CRISP–DM.

Figure 3: Phases and tasks for CRISP-DM.

It is important for educational leaders and decision makers to classify schools according to their performance, based on several contributing indicators, in order to help school administrations to correct their problems.

The following six indicators, defined in Equations (1)–(6), have been selected and used in this article: SPC, SPT, SRS, PSSS, ERS and STRS.

Figure 4 shows the data mining process diagram that was designed and implemented in this article.

Figure 4: Designed data mining block diagram.

The data collected from the schools grow constantly over time. These data are raw and must be converted into a suitable form before useful knowledge can be extracted from them; this knowledge is essential for decision makers working in the educational field. Data were collected from a data warehouse for the academic year 2007–2008, and data for 104 intermediate schools were used in this article. The data warehouse contains data about staff, schools, exams and students, and its tables were constructed for processing. Figure 5 shows samples of the tables used in building the data warehouse.

Figure 5: Sample of the data warehouse schema.

The data were distributed among 170 tables and stored in a Microsoft Access database. This database is organized into the following categories:

  • Staff data (contain detailed information on employees, their level of education, specialization, university, personal information and training courses).

  • School data (contain detailed information about schools and their geographical location, the school library, laboratories, school supplies and much more information).

  • Exam data (contain detailed information on all of the students’ exams, statistics, analysis of results, graphs for students, and information to assist in the issuance of monthly reports on the student).

  • Student data (contain detailed information about students, their parents, contact information, health, social data, official papers and many other things).

The six indicators from Equations (1)–(6) were used to measure School Performance; these indicators enabled the production of the dataset used for data mining. Table 1 shows a sample of the resulting dataset. These samples of the 104 schools were taken from the warehouse built in this study.

Table 1 The six indicators of the schools

The suggested procedure to classify the schools’ performance is as follows:

  1. The values of the SPC and SPT attributes were sorted in ascending order and the rest of the attributes in descending order.

  2. The dataset was divided into three groups: the first group with 26 readings, the second group with 52 readings and the third group with 26 readings, with the assumption that the values follow a normal distribution.

  3. The period band for each class was identified through the initial and final values of each attribute. For example, in Table 2 the Excellent band (first group) for the attribute SPC is between 11.14 and 20.30; the value 11.14 is the initial value, while 20.30 is the final value. Table 2 shows the period bands for all the attributes.

    Table 2 Period band classification
  4. Classification rules were generated for the dataset by using the previous step. Table 3 shows these rules.

    Table 3 Classification rule for dataset
  5. The dataset was classified according to the rules in Table 3. Table 4 shows a sample of the training dataset after classification.

    Table 4 Sample of training dataset after classification
  6. The band labels were replaced as follows:

    • Good values are replaced by the number 1.

    • Very good values are replaced by the number 2.

    • Excellent values are replaced by the number 3.

  7. The total score was calculated for the six indicators, so that the total score cannot be less than 6 (assuming all attributes have the value 1) and cannot be more than 18 (assuming all attributes have the value 3).

  8. The total score was divided into three bands as follows:

    • Between 6 and 10 points classified as Good.

    • Between 11 and 14 points classified as Very good.

    • Between 15 and 18 points classified as Excellent.

Table 5 shows a sample of the dataset after applying steps 6–8.

Table 5 Sample of dataset classification used in training data mining
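For concreteness, the band-to-score mapping in steps 6–8 can be sketched in a few lines of code. This is an illustrative sketch only: the indicator names are the ones used in this article, but the example band labels are hypothetical and the real band cut-offs come from Tables 2 and 3.

```python
# Sketch of steps 6-8: map each indicator's band to a numeric score,
# sum the six scores and band the total into the final class.
BAND_SCORE = {"Good": 1, "Very good": 2, "Excellent": 3}

def total_score(indicator_bands):
    """indicator_bands: dict mapping each of the six indicators to its band."""
    return sum(BAND_SCORE[band] for band in indicator_bands.values())

def school_performance(indicator_bands):
    score = total_score(indicator_bands)   # always between 6 and 18
    if score <= 10:
        return "Good"        # 6-10 points
    if score <= 14:
        return "Very good"   # 11-14 points
    return "Excellent"       # 15-18 points

# Hypothetical school (the band labels would come from the rules in Table 3):
bands = {"SPC": "Excellent", "SPT": "Very good", "SRS": "Very good",
         "PSSS": "Good", "ERS": "Very good", "STRS": "Very good"}
print(school_performance(bands))   # total score 12 -> 'Very good'
```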

The dataset was divided into two groups: the first group was used to train the model and the second to test it. The first group (training dataset) consists of 80 instances (records), approximately 77 percent of the dataset, and the second group (testing dataset) consists of 24 instances (records), approximately 23 percent of the dataset.
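A minimal sketch of this hold-out split, assuming the 104 classified records are held in a Python list called `dataset` (a hypothetical variable name):

```python
# First 80 records (about 77 percent) for training, last 24 (about 23 percent) for testing.
train_set = dataset[:80]
test_set = dataset[80:]
assert len(train_set) == 80 and len(test_set) == 24   # holds for the 104-school dataset
```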

NAÏVE BAYES CLASSIFICATION

Naïve Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining, and its competitive performance in classification is surprising.12 The probability of an event, for example that a school receives an excellent performance rating, is a number from 0 to 1 inclusive, with 0 indicating ‘impossible’ and 1 indicating ‘certain’. A probability of 0.7 means that, if a long series of trials were conducted, for example if School Performance were recorded yearly for N years, the school would be expected to achieve an excellent performance in about 0.7 × N of those years. The longer the series of trials, the more reliable this estimate is likely to be.2

Given a set of k mutually exclusive and exhaustive classifications c1, c2, …, ck, which have prior probabilities P(c1), P(c2), …, P(ck), respectively, and n attributes a1, a2, …, an, which for a given instance have values v1, v2, …, vn, respectively, the posterior probability of class ci occurring for the specified instance can be shown to be proportional to

P(ci) × P(a1=v1 and a2=v2 … and an=vn | ci)

Making the assumption that the attributes are independent, the value of this expression can be calculated by using the product

P(ci) × P(a1=v1 | ci) × P(a2=v2 | ci) × … × P(an=vn | ci)

The product is calculated for each value of i from 1 to k, and the classification with the largest value is chosen. Three mutually exclusive and exhaustive events can be defined as follows:

  • E1 – School has a good performance.

  • E2 – School has a very good performance.

  • E3 – School has excellent performance.

From the training dataset in Table 5, with the probability of an event indicated as usual by a capital P, the following values are obtained:

  • P(E1)=18/80=0.225

  • P(E2)=54/80=0.675

  • P(E3)= 8/80=0.100

The values 18, 54 and 8 represent the numbers of schools ranked Good, Very good and Excellent, respectively; the number of schools used as the training set is 80. Each of these probabilities is between 0 and 1 inclusive, as it must be to qualify as a probability. They also satisfy a second important condition: the sum of the three probabilities must be 1, because precisely one of the events must always occur. In this case, P(E1)+P(E2)+P(E3)=0.225+0.675+0.100=1.
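A quick check of these priors in code (the class counts 18, 54 and 8 are the ones quoted above; `Fraction` is used only so that the sum is exactly 1):

```python
from fractions import Fraction

# Class counts in the 80-school training set.
counts = {"Good": 18, "Very good": 54, "Excellent": 8}
total = sum(counts.values())

priors = {cls: Fraction(n, total) for cls, n in counts.items()}
print({cls: float(p) for cls, p in priors.items()})
# {'Good': 0.225, 'Very good': 0.675, 'Excellent': 0.1}
print(sum(priors.values()) == 1)   # True: exactly one of the three events must occur
```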

Table 6 shows the School Performance counts obtained by calculating how many times each attribute value occurs with each class (Good, Very good and Excellent). For example, SPC is Very good for 39 schools:

  • Six of them have School Performance=Good.

  • Thirty-two of them have School Performance=Very good.

  • One of them has School Performance=Excellent.

Table 6 Conditional and prior probabilities for dataset

The cells in the table count these occurrences for all possible values of each attribute, together with the corresponding conditional probabilities. The final row of Table 6 (the prior probability row) gives the prior probabilities derived from the total numbers of occurrences of Excellent, Very good and Good, respectively.

As an example, suppose a new school has the values shown in Table 7. How can the School Performance for this school be determined?

  • School Performance=Good: 0.333 × 0.167 × 0.667 × 0.722 × 0.000 × 0.056=0

  • School Performance=Very good: 0.593 × 0.574 × 0.519 × 0.444 × 0.037 × 0.148=0.00043

  • School Performance=Excellent: 0.125 × 0.375 × 0.375 × 0.625 × 0.750 × 0.500=0.004065

  • The largest value is for class=Excellent.

Table 7 School performance for a new school

The target class is determined by these probabilities: the class with the highest value is selected, and this value represents the performance to be determined.
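The decision rule for the new school in Table 7 can be sketched as follows. The conditional probabilities are the six values quoted in the worked example above; multiplying in the prior P(Ei), as in the textbook form of Naïve Bayes, is optional here because it does not change which class wins.

```python
# Naive Bayes decision for the new school in Table 7.
priors = {"Good": 0.225, "Very good": 0.675, "Excellent": 0.100}

# P(observed attribute value | class) for the six indicators (values from Table 6).
conditionals = {
    "Good":      [0.333, 0.167, 0.667, 0.722, 0.000, 0.056],
    "Very good": [0.593, 0.574, 0.519, 0.444, 0.037, 0.148],
    "Excellent": [0.125, 0.375, 0.375, 0.625, 0.750, 0.500],
}

def score(cls, include_prior=True):
    result = priors[cls] if include_prior else 1.0
    for p in conditionals[cls]:
        result *= p
    return result

scores = {cls: score(cls) for cls in priors}
print(max(scores, key=scores.get))   # -> 'Excellent', as in the worked example
```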

Table 8 was calculated using the Naïve Bayes classification algorithm: Table 5 was used to determine the rank in each column of Table 8, and the posterior probabilities for each possible rank were used to test the dataset.

Table 8 Classification by Naïve Bayes algorithm

The predictive accuracy of a model is estimated by the following equation:

p = (C/N) × 100 percent     (10)

where C is the number of correctly classified instances and N is the total number of instances in the test set.
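As a small helper, this calculation can be written as (a sketch):

```python
def predictive_accuracy(correct, total):
    """p = (C / N) x 100 percent, Equation (10)."""
    return correct / total * 100

print(predictive_accuracy(18, 24))   # 75.0, the k-Nearest Neighbor result reported below
```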


K-NEAREST NEIGHBOR CLASSIFICATION

Nearest Neighbor classification is mainly used when all attribute values are continuous, although it can be modified to deal with categorical attributes. The idea is to estimate the classification of an unseen instance by using the classification of the instance or instances that are closest to it. Table 9 shows the dataset used for Nearest Neighbor classification.

Table 9 Dataset for Nearest Neighbor


On the basis of the k-Nearest Neighbor classification algorithm and Table 9, the following question can be asked:

What should the classification be for the following unseen instance?

An unseen instance with attribute values (25, 10) and an unknown classification.

The answer to this question is found by calculating the distance between the point (25, 10) and the other points, and then finding the five nearest points (see Figure 6). There are many possible ways of measuring the distance between two instances with n attribute values, or equivalently between two points in n-dimensional space. Any measure of distance must satisfy the following three conditions:2

  1. The distance of any point A from itself is zero, that is, dist(A, A)=0.

  2. The distance from A to B is the same as the distance from B to A, that is, dist(A, B)=dist(B, A) (the symmetry condition).

  3. The intuitive idea is that ‘the shortest distance between any two points is a straight line’. This condition (the triangle inequality) says that for any three points A, B and Z, dist(A, B)⩽dist(A, Z)+dist(Z, B).

Figure 6: Two-dimensional representation of dataset in Table 9.

There are many possible distance measures, but the most popular is almost certainly the Euclidean distance. For two points (x1, y1) and (x2, y2) in two dimensions this is given by Equation (11):

dist = √((x1−x2)² + (y1−y2)²)

By applying Equation (11) to the point (25, 10) and the values from Table 9, the five nearest points to (25, 10) are found to be classified as Very good, as shown in Table 10.

Table 10 The five nearest points to the point (25, 10)
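A sketch of this 5-Nearest Neighbor step is shown below. The function implements the Euclidean-distance vote described above; the sample points are a small hypothetical stand-in for Table 9, not the article's actual values.

```python
from collections import Counter
from math import sqrt

def five_nearest_class(labelled_points, query):
    """labelled_points: list of ((x, y), class_label).
    Returns the majority class among the five points closest to `query`."""
    by_distance = sorted(
        labelled_points,
        key=lambda item: sqrt((item[0][0] - query[0]) ** 2 +
                              (item[0][1] - query[1]) ** 2))
    votes = Counter(label for _, label in by_distance[:5])
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for the Table 9 data:
data = [((24, 11), "Very good"), ((26, 9), "Very good"), ((23, 12), "Very good"),
        ((27, 10), "Very good"), ((25, 13), "Good"), ((40, 30), "Excellent")]
print(five_nearest_class(data, (25, 10)))   # -> 'Very good'
```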

MIN–MAX NORMALIZATION

Min–Max normalization works by measuring how much greater the field value is than the minimum value min(X) and scaling this difference by the range. That is,

X* = (X − min(X)) / (max(X) − min(X))     (12)

By applying Equation (12) to the dataset, the normalized values can be obtained. Table 11 shows a sample of the normalized dataset.

Table 11 Sample of normalization dataset
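A sketch of Equation (12) applied to one indicator column (the example values are hypothetical):

```python
def min_max_normalize(values):
    """Scale numeric values to [0, 1]: X* = (X - min(X)) / (max(X) - min(X))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # assumes hi > lo

print(min_max_normalize([11.14, 15.20, 20.30]))
# [0.0, 0.443..., 1.0]
```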

The formula for the Euclidean distance between points (a1, a2, …, an) and (b1, b2, …, bn) in n-dimensional space can be generalized as follows:

dist = √((a1−b1)² + (a2−b2)² + … + (an−bn)²)     (13)
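In code, the generalized distance is a direct translation of Equation (13); the two six-value tuples below are hypothetical normalized indicator vectors.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Distance between points a = (a1, ..., an) and b = (b1, ..., bn), Equation (13)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical schools described by six normalized indicators:
print(euclidean_distance((0.20, 0.50, 0.80, 0.10, 0.60, 0.30),
                         (0.30, 0.40, 0.70, 0.20, 0.50, 0.40)))   # ~0.245
```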

By using Equation (13) and the training dataset, the five-Nearest Neighbor results are obtained as shown in Table 12.

Table 12 The five-Nearest Neighbor result

To calculate the predictive accuracy for the k-Nearest Neighbor classification model, equation (10) was used.

  • Correctly classified instances=18

  • Incorrectly classified instances=6

  • Therefore, the Predictive accuracy

  • p=(18/24) × 100=75 percent

  •  1−p=(6/24) × 100=25 percent

DECISION TREE

Decision trees are powerful and popular for both classification and prediction. The attractiveness of tree-based methods is largely a result of the fact that decision trees represent rules. Rules can readily be expressed in English so that humans can understand them; they can also be expressed in a database access language such as SQL to retrieve records in a particular category. Decision trees are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target variable. Because decision trees combine both data exploration and modeling, they are a powerful first step in the modeling process even when building the final model using some other technique.13

Decision trees are an intuitive and widely used type of influence diagram. The basic goal of a decision tree is to find an optimal set of yes-or-no questions that ultimately leads to a correct classification, or probability. The tree must have meaningful criteria for choosing questions, and derive answers from the training dataset.14

Decision tree C4.5 algorithm

The decision tree C4.5 algorithm is Quinlan's extension of his own ID3 algorithm for generating decision trees.15 The C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible.

The C4.5 algorithm uses the concept of information gain, or entropy reduction, to select the optimal split. Suppose that a variable X has k possible values with probabilities p1, p2, …, pk. What is the smallest number of bits, on average per symbol, needed to transmit a stream of symbols representing the values of X observed? The answer is called the entropy of X and is defined as

H(X) = −Σj pj log2(pj)

C4.5 uses this concept of entropy as follows. Suppose that we have a candidate split S, which partitions the training dataset T into several subsets, T1, T2, …, Tk.

The mean information requirement can then be calculated as the weighted sum of the entropies for the individual subsets, as follows:

HS(T) = Σi Pi × H(Ti)

where Pi represents the proportion of records in subset i. The information gain is then defined as

gain(S) = H(T) − HS(T)

that is, the increase in information produced by partitioning the training data T according to this candidate split S. At each decision node, C4.5 chooses as the optimal split the split that has the greatest information gain, gain(S).
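These quantities can be sketched directly in code; the ten-label example split below is hypothetical and only illustrates the calculation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j) over the class proportions in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """gain(S) = H(T) - sum_i P_i * H(T_i) for the partition produced by split S."""
    total = len(parent_labels)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets)
    return entropy(parent_labels) - weighted

# Hypothetical candidate split of ten training labels into two subsets:
parent = ["Good"] * 4 + ["Very good"] * 6
subsets = [["Good"] * 4 + ["Very good"], ["Very good"] * 5]
print(round(information_gain(parent, subsets), 4))   # 0.61 for this toy split
```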

Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:

  • If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with the most frequent class in S.

  • Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2, … according to the outcome for each case, and apply the same procedure recursively to each subset.
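A simplified sketch of this divide-and-conquer growth is shown below. It reuses the `entropy`/`information_gain` helpers from the previous sketch, treats all attributes as categorical, and omits C4.5's gain ratio, pruning and handling of continuous or missing values, so it illustrates the recursion rather than the full algorithm.

```python
from collections import Counter

def grow_tree(records, attributes, min_size=2):
    """records: list of (attribute_value_dict, class_label) pairs.
    Returns a nested dict {attribute: {value: subtree_or_label}} or a leaf label."""
    labels = [label for _, label in records]

    # Leaf: all cases in one class, S is small, or no attributes left to split on.
    if len(set(labels)) == 1 or len(records) <= min_size or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Choose the split with the greatest information gain (helper defined above).
    def gain_of(attr):
        groups = {}
        for rec, label in records:
            groups.setdefault(rec[attr], []).append(label)
        return information_gain(labels, list(groups.values()))

    best = max(attributes, key=gain_of)

    # Partition on the chosen attribute and recurse on each subset.
    partitions = {}
    for rec, label in records:
        partitions.setdefault(rec[best], []).append((rec, label))
    remaining = [a for a in attributes if a != best]
    return {best: {value: grow_tree(subset, remaining, min_size)
                   for value, subset in partitions.items()}}
```

In the article's case, the corresponding first split was on the attribute SPC (Table 13), with SRS at the second split node (Table 14).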

Equations (14) and (15) were applied to the training dataset. Table 13 shows that the optimal split, with the greatest information gain (0.3694), is achieved by splitting on the attribute SPC.

Table 13 Information gain for the first split node in C4.5 decision tree

Table 14 shows the information gain for the second split node, which was the attribute SRS with the maximum information gain value (0.4151). Figure 7 shows the final decision tree.

Table 14 Information gain for the second split node in C4.5 decision tree
Figure 7: The decision tree model.

Table 15 shows the result of testing the last 24 instances (the test dataset) on the C4.5 decision tree model shown in Figure 7.

Table 15 The C4.5 decision tree result

Equation (10) was used to calculate the predictive accuracy of the C4.5 decision tree classification model.

  • Correctly classified instances=16

  • Incorrectly classified instances=8

  • Therefore, the Predictive accuracy

  • p=(16/24) × 100=66.667 percent

  •  1−p=(8/24) × 100=33.333 percent

The C4.5 decision tree model was implemented by using data of 80 schools (out of 104 schools) and applying the following indicators: SPC, SPT, SRS, PSSS, ERS and STRS. The model was then tested by using the 24 schools (out of 104 schools). The results showed that there are 16 schools whose predicted School Performance is identical to the actual school performance, while there are eight schools whose predicted School Performance is different from the actual school performance. Predictive accuracy was 66.667 percent for the C4.5 decision tree model, which is an acceptable percentage in the implementation of data mining projects.

ACCURACY EVALUATION

After testing the three data mining methods (Nearest Neighbor, Naïve Bayes and the C4.5 decision tree), the three methods were evaluated by measuring their accuracy. Weka version 3.6, developed at the University of Waikato in New Zealand, was one of the data mining tools used in the accuracy comparison. The dataset used with Weka is the same set of data used with the three methods, but converted by Equation (12) (min–max normalization) to values between 0 and 1. Figure 8 shows the Weka system test option.

Figure 8: Weka system test option.

Table 16 shows the results of testing each data mining method using different test modes. The results showed that the C4.5 decision tree had the highest accuracy rate (95.1923 percent) compared with the other methods when using the ‘evaluate on training data’ test mode.

Table 16 Accuracy of results by using WEKA

CONCLUSION

Data mining, also called Knowledge Discovery in Databases, is the field of discovering useful information from large amounts of data. In recent years, there has been increasing interest in the use of data mining to investigate scientific questions within educational research, an area of inquiry termed educational data mining. Educational data mining is the area of scientific inquiry centered on the development of methods for making discoveries within the unique kinds of data that come from educational settings, and using those methods to better understand students and the settings that they learn in.

The results were obtained by applying the standard method of analysis using Microsoft 2007. The results showed that the ‘evaluate on training data’ test mode achieved accuracies of 87.5 percent, 85.576 percent and 95.192 percent for the Nearest Neighbor, Naïve Bayes and C4.5 decision tree methods, respectively. The C4.5 decision tree algorithm therefore has the highest accuracy rate compared with the other models and is recommended for use in processing and testing educational data.