INTRODUCTION

Database systems have had great success during the past two decades. More and more data are being collected and saved in databases; a database with a petabyte (1024 terabytes) of data is no longer uncommon. Finding useful information in these databases has become an important focus of many enterprises, and more and more attention has turned to data mining as a key component of such information discovery. Data mining algorithms and visualization tools are being used to find important patterns in data and to create useful forecasts. This technology is being applied in virtually all business sectors, including banking, telecommunications, manufacturing, marketing and e-commerce.1

Data mining is not specific to one type of media or data, and should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. A data warehouse is a repository of data collected from multiple data sources (often heterogeneous), and is intended to be used as a whole under the same unified schema. Data warehouses are usually modeled by a multi-dimensional data structure.

KNOWLEDGE DISCOVERY

Knowledge discovery has been defined as the ‘non-trivial extraction of implicit, previously unknown and potentially useful information from data’. It is a process of which data mining forms just one part. Figure 1 shows a slightly idealized version of the complete knowledge discovery process. Data come in, possibly from many sources. They are integrated and placed in some common data store. Some of these data are then taken and pre-processed into a standard format. These ‘prepared data’ are then passed to a data mining algorithm that produces an output in the form of rules or some other kind of ‘patterns’. These are then interpreted to give new and potentially useful knowledge. This brief description makes it clear that although the data mining algorithms are central to knowledge discovery, they are not the whole story. The pre-processing of the data and the interpretation (as opposed to the blind use) of the results are both of great importance.2

Figure 1: Data mining is the core of the knowledge discovery process. Source: Bramer (2007).2

The main purpose of data mining is to extract patterns from the data at hand, increase their intrinsic value and transform the data into knowledge.1 Figure 2 shows that, in the information technology context, knowledge is very distinct from data and information.3

Figure 2: Data, information and knowledge. Source: Turban et al (2010).3

DATA MINING TASKS

Data mining tasks can be grouped into the following main categories:

  • Classification

  • Clustering

  • Association rules

  • Regression

  • Forecasting

RELATED STUDIES

Urbancic et al4 discussed aspects of the international offering of machine learning, data mining and decision support education, based on an investigation of materials available on the World Wide Web. In addition to general findings concerning the materials collected, such as their content, target audience and teaching materials, the aim of the investigation was to provide starting points for an analysis that could be used to enhance the educational offering in this area in general, and knowledge transfer from academia to industry and the business sector in particular.

Behrouz et al5 presented an approach to classifying students in order to predict their final grade based on features extracted from logged data in a web-based education system. They implemented a series of pattern classifiers and compared their performance on an online course dataset; a combination of multiple classifiers led to a significant improvement in classification performance.

Schumann6 demonstrated that the CRoss-Industry Standard Process for Data Mining (CRISP–DM), a non-proprietary data mining process that was developed for and is currently used in the business world, can be transferred to educational settings and provide a start-to-end structure that is capable of producing operationally actionable information to address the student achievement questions of educational leaders.

Hübscher et al7 collected data from interactions with an educational system; such data are only useful if the patterns found can be meaningfully interpreted in the context of the student–system interaction.

Romero et al8 compared different data mining methods and techniques for classifying students based on their Moodle usage data and the final marks obtained in their respective courses. They developed a specific mining tool to make the configuration and execution of data mining techniques easier for instructors, and used real data from seven Moodle courses taken by Cordoba University students.

Pittman9 proposed to expand the understanding of how educational institutions might benefit from including data mining processes and multivariate analysis to inform student retention strategies. Her study applied data mining techniques to student demographic and behavioral data in an institution of higher education, providing a detailed description of the data mining process.

DATA MINING IMPLEMENTATION AND ANALYSIS

There are many data mining techniques that can be used to evaluate a school's performance. The following classification techniques are used in this article:

  • Naïve Bayes Classification;

  • k-Nearest Neighbor Classification;

  • Decision Tree Classification.

CRISP–DM is used in this article;10 it consists mainly of the following phases:11

  • Business Understanding;

  • Data Understanding;

  • Data Preparation;

  • Modeling;

  • Evaluation;

  • Deployment.

Figure 3 shows the phases and tasks for CRISP–DM.

Figure 3: Phases and tasks for CRISP-DM.

It is important for educational leaders and decision makers to classify schools according to their performance, based on several contributing indicators, in order to help school administrations to correct their problems.

The following six indicators, defined in Equations (1)–(6), have been selected and used in this article: SPC, SPT, SRS, PSSS, ERS and STRS.

Figure 4 shows the data mining process diagram that was designed and implemented in this article.

Figure 4: Designed data mining block diagram.

The data collected from the schools grow constantly over time. These data are raw and must be converted into a suitable form before useful knowledge can be extracted from them; this knowledge is essential for decision makers working in the educational field. Data were collected from a data warehouse for the academic year 2007–2008, and data for 104 intermediate schools were used in this article. The data warehouse contains data about staff, schools, exams and students, and its tables were constructed for processing. Figure 5 shows samples of the tables used in building the data warehouse.

Figure 5: Sample of the data warehouse schema.

The data were distributed among 170 tables and stored in a Microsoft Access database. This database is organized into the following categories:

  • Staff data (contain detailed information on employees, their level of education, specialization, university, personal information and training courses).

  • School data (contain detailed information about schools and their geographical location, the school library, laboratories, school supplies and much more information).

  • Exam data (contain detailed information on all of the students’ exams, statistics, analysis of results, graphs for students, and information to assist in the issuance of monthly reports on the student).

  • Student data (contain detailed information about students, their parents, contact information, health, social data, official papers and many other things).

The six indicators from Equations (1)–(6) were used to measure School Performance; these indicators enabled the production of the dataset used for data mining. Table 1 shows a sample of the resulting dataset. These samples of the 104 schools were taken from the warehouse built in this study.

Table 1 The six indicators of the schools

The suggested procedure to classify the schools’ performance is as follows:

  1. The values of the SPC and SPT attributes were sorted in ascending order and the rest of the attributes in descending order.

  2. The dataset was divided into three groups: the first group with 26 readings, the second group with 52 readings and the third group with 26 readings, with the assumption that the values follow a normal distribution.

  3. The period band for each class was identified through the initial and final values of each attribute. For example, in Table 2 the Excellent band (first group) for the attribute SPC is between 11.14 and 20.30; the value 11.14 is the initial value, while 20.30 is the final value. Table 2 shows the period bands for all the attributes.

    Table 2 Period band classification
  4. Classification rules were generated for the dataset by using the previous step. Table 3 shows these rules.

    Table 3 Classification rule for dataset
  5. The dataset was classified according to the rules in Table 3. Table 4 shows a sample of the training dataset after classification.

    Table 4 Sample of training dataset after classification
  6. The band labels were replaced as follows:

    • Good values are replaced by the number 1.

    • Very good values are replaced by the number 2.

    • Excellent values are replaced by the number 3.

  7. The total score was calculated for the six indicators, so that the total score cannot be less than 6 (assuming all attributes have the value 1) and cannot be more than 18 (assuming all attributes have the value 3).

  8. The total score was divided into three bands as follows:

    • Between 6 and 10 points classified as Good.

    • Between 11 and 14 points classified as Very good.

    • Between 15 and 18 points classified as Excellent.

Table 5 shows a sample of the dataset after applying steps 6–8.

Table 5 Sample of dataset classification used in training data mining
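For concreteness, the band-to-score mapping in steps 6–8 can be sketched in a few lines of code. This is an illustrative sketch only: the indicator names are the ones used in this article, but the example band labels are hypothetical and the real band cut-offs come from Tables 2 and 3.

```python
# Sketch of steps 6-8: map each indicator's band to a numeric score,
# sum the six scores and band the total into the final class.
BAND_SCORE = {"Good": 1, "Very good": 2, "Excellent": 3}

def total_score(indicator_bands):
    """indicator_bands: dict mapping each of the six indicators to its band."""
    return sum(BAND_SCORE[band] for band in indicator_bands.values())

def school_performance(indicator_bands):
    score = total_score(indicator_bands)   # always between 6 and 18
    if score <= 10:
        return "Good"        # 6-10 points
    if score <= 14:
        return "Very good"   # 11-14 points
    return "Excellent"       # 15-18 points

# Hypothetical school (the band labels would come from the rules in Table 3):
bands = {"SPC": "Excellent", "SPT": "Very good", "SRS": "Very good",
         "PSSS": "Good", "ERS": "Very good", "STRS": "Very good"}
print(school_performance(bands))   # total score 12 -> 'Very good'
```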

The dataset was divided into two groups: the first group was used to train the model and the second to test it. The first group (training dataset) consists of 80 instances (records), approximately 77 percent of the dataset, and the second group (testing dataset) consists of 24 instances (records), approximately 23 percent of the dataset.
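A minimal sketch of this hold-out split, assuming the 104 classified records are held in a Python list called `dataset` (a hypothetical variable name):

```python
# First 80 records (about 77 percent) for training, last 24 (about 23 percent) for testing.
train_set = dataset[:80]
test_set = dataset[80:]
assert len(train_set) == 80 and len(test_set) == 24   # holds for the 104-school dataset
```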

NAÏVE BAYES CLASSIFICATION

Naïve Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining, and its competitive performance in classification is surprising.12 The probability of an event, for example that a school receives an excellent performance rating, is a number from 0 to 1 inclusive, with 0 indicating ‘impossible’ and 1 indicating ‘certain’. A probability of 0.7 means that, if a long series of trials were conducted, for example if School Performance were recorded yearly for N years, the school would be expected to achieve an excellent performance in about 0.7 × N of those years. The longer the series of trials, the more reliable this estimate is likely to be.2

Given a set of k mutually exclusive and exhaustive classifications c1, c2, …, ck, which have prior probabilities P(c1), P(c2), …, P(ck), respectively, and n attributes a1, a2, …, an, which for a given instance have values v1, v2, …, vn, respectively, the posterior probability of class ci occurring for the specified instance can be shown to be proportional to

P(ci) × P(a1=v1 and a2=v2 … and an=vn | ci)

Making the assumption that the attributes are independent, the value of this expression can be calculated by using the product

P(ci) × P(a1=v1 | ci) × P(a2=v2 | ci) × … × P(an=vn | ci)

The product is calculated for each value of i from 1 to k, and the classification with the largest value is chosen. Three mutually exclusive and exhaustive events can be defined as follows:

  • E1 – School has a good performance.

  • E2 – School has a very good performance.

  • E3 – School has excellent performance.

From the training dataset in Table 5, with the probability of an event indicated as usual by a capital P, the following values are obtained:

  • P(E1)=18/80=0.225

  • P(E2)=54/80=0.675

  • P(E3)= 8/80=0.100

The values 18, 54 and 8 represent the numbers of schools ranked Good, Very good and Excellent, respectively; the number of schools used as the training set is 80. Each of these probabilities is between 0 and 1 inclusive, as it must be to qualify as a probability. They also satisfy a second important condition: the sum of the three probabilities must be 1, because precisely one of the events must always occur. In this case, P(E1)+P(E2)+P(E3)=0.225+0.675+0.100=1.
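A quick check of these priors in code (the class counts 18, 54 and 8 are the ones quoted above; `Fraction` is used only so that the sum is exactly 1):

```python
from fractions import Fraction

# Class counts in the 80-school training set.
counts = {"Good": 18, "Very good": 54, "Excellent": 8}
total = sum(counts.values())

priors = {cls: Fraction(n, total) for cls, n in counts.items()}
print({cls: float(p) for cls, p in priors.items()})
# {'Good': 0.225, 'Very good': 0.675, 'Excellent': 0.1}
print(sum(priors.values()) == 1)   # True: exactly one of the three events must occur
```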

Table 6 shows the School Performance counts obtained by calculating how many times each attribute value occurs with each class (Good, Very good and Excellent). For example, SPC is Very good for 39 schools:

  • Six of them have School Performance=Good.

  • Thirty-two of them have School Performance=Very good.

  • One of them has School Performance=Excellent.

Table 6 Conditional and prior probabilities for dataset

The cells in the table count these occurrences for all possible values of each attribute, together with the corresponding conditional probabilities. The final row of Table 6 (the prior probability row) gives the prior probabilities derived from the total numbers of occurrences of Excellent, Very good and Good, respectively.

As an example, suppose a new school has the values shown in Table 7. How can the School Performance for this school be determined?

  • School Performance=Good: 0.333 × 0.167 × 0.667 × 0.722 × 0.000 × 0.056=0

  • School Performance=Very good: 0.593 × 0.574 × 0.519 × 0.444 × 0.037 × 0.148=0.00043

  • School Performance=Excellent: 0.125 × 0.375 × 0.375 × 0.625 × 0.750 × 0.500=0.004065

  • The largest value is for class=Excellent.

Table 7 School performance for a new school

The target class is determined by these probabilities: the class with the highest value is selected, and this value represents the performance to be determined.
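The decision rule for the new school in Table 7 can be sketched as follows. The conditional probabilities are the six values quoted in the worked example above; multiplying in the prior P(Ei), as in the textbook form of Naïve Bayes, is optional here because it does not change which class wins.

```python
# Naive Bayes decision for the new school in Table 7.
priors = {"Good": 0.225, "Very good": 0.675, "Excellent": 0.100}

# P(observed attribute value | class) for the six indicators (values from Table 6).
conditionals = {
    "Good":      [0.333, 0.167, 0.667, 0.722, 0.000, 0.056],
    "Very good": [0.593, 0.574, 0.519, 0.444, 0.037, 0.148],
    "Excellent": [0.125, 0.375, 0.375, 0.625, 0.750, 0.500],
}

def score(cls, include_prior=True):
    result = priors[cls] if include_prior else 1.0
    for p in conditionals[cls]:
        result *= p
    return result

scores = {cls: score(cls) for cls in priors}
print(max(scores, key=scores.get))   # -> 'Excellent', as in the worked example
```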

Table 8 was calculated using the Naïve Bayes classification algorithm: Table 5 was used to determine the rank in each column of Table 8, and the posterior probabilities for each possible rank were used to test the dataset.

Table 8 Classification by Naïve Bayes algorithm

The predictive accuracy of a model is estimated by the following equation:

p = (C/N) × 100 percent     (10)

where C is the number of correctly classified instances and N is the total number of instances in the test set.
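As a small helper, this calculation can be written as (a sketch):

```python
def predictive_accuracy(correct, total):
    """p = (C / N) x 100 percent, Equation (10)."""
    return correct / total * 100

print(predictive_accuracy(18, 24))   # 75.0, the k-Nearest Neighbor result reported below
```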


K-NEAREST NEIGHBOR CLASSIFICATION

Nearest Neighbor classification is mainly used when all attribute values are continuous, although it can be modified to deal with categorical attributes. The idea is to estimate the classification of an unseen instance by using the classification of the instance or instances that are closest to it. Table 9 shows the dataset used for Nearest Neighbor classification.

Table 9 Dataset for Nearest Neighbor


On the basis of the k-Nearest Neighbor classification algorithm and Table 9, the following question can be asked:

What should the classification be for the following unseen instance?

An unseen instance with attribute values (25, 10) and an unknown classification.

The answer to this question is found by calculating the distance between the point (25, 10) and the other points, and then finding the five nearest points (see Figure 6). There are many possible ways of measuring the distance between two instances with n attribute values, or equivalently between two points in n-dimensional space. Any measure of distance must satisfy the following three conditions:2

  1. The distance of any point A from itself is zero, that is, dist(A, A)=0.

  2. The distance from A to B is the same as the distance from B to A, that is, dist(A, B)=dist(B, A) (the symmetry condition).

  3. The intuitive idea is that ‘the shortest distance between any two points is a straight line’. This condition (the triangle inequality) says that for any three points A, B and Z, dist(A, B)⩽dist(A, Z)+dist(Z, B).

Figure 6: Two-dimensional representation of dataset in Table 9.

There are many possible distance measures, but the most popular is almost certainly the Euclidean distance. For two points (x1, y1) and (x2, y2) in two dimensions this is given by Equation (11):

dist = √((x1−x2)² + (y1−y2)²)

By applying Equation (11) to the point (25, 10) and the values from Table 9, the five nearest points to (25, 10) are found to be classified as Very good, as shown in Table 10.

Table 10 The five nearest points to the point (25, 10)
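A sketch of this 5-Nearest Neighbor step is shown below. The function implements the Euclidean-distance vote described above; the sample points are a small hypothetical stand-in for Table 9, not the article's actual values.

```python
from collections import Counter
from math import sqrt

def five_nearest_class(labelled_points, query):
    """labelled_points: list of ((x, y), class_label).
    Returns the majority class among the five points closest to `query`."""
    by_distance = sorted(
        labelled_points,
        key=lambda item: sqrt((item[0][0] - query[0]) ** 2 +
                              (item[0][1] - query[1]) ** 2))
    votes = Counter(label for _, label in by_distance[:5])
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for the Table 9 data:
data = [((24, 11), "Very good"), ((26, 9), "Very good"), ((23, 12), "Very good"),
        ((27, 10), "Very good"), ((25, 13), "Good"), ((40, 30), "Excellent")]
print(five_nearest_class(data, (25, 10)))   # -> 'Very good'
```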

MIN–MAX NORMALIZATION

Min–Max normalization works by measuring how much greater the field value is than the minimum value min(X) and scaling this difference by the range. That is,

X* = (X − min(X)) / (max(X) − min(X))     (12)

By applying Equation (12) to the dataset, the normalized values can be obtained. Table 11 shows a sample of the normalized dataset.

Table 11 Sample of normalization dataset
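A sketch of Equation (12) applied to one indicator column (the example values are hypothetical):

```python
def min_max_normalize(values):
    """Scale numeric values to [0, 1]: X* = (X - min(X)) / (max(X) - min(X))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # assumes hi > lo

print(min_max_normalize([11.14, 15.20, 20.30]))
# [0.0, 0.443..., 1.0]
```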

The formula for the Euclidean distance between points (a1, a2, …, an) and (b1, b2, …, bn) in n-dimensional space can be generalized as follows:

dist = √((a1−b1)² + (a2−b2)² + … + (an−bn)²)     (13)
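In code, the generalized distance is a direct translation of Equation (13); the two six-value tuples below are hypothetical normalized indicator vectors.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Distance between points a = (a1, ..., an) and b = (b1, ..., bn), Equation (13)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical schools described by six normalized indicators:
print(euclidean_distance((0.20, 0.50, 0.80, 0.10, 0.60, 0.30),
                         (0.30, 0.40, 0.70, 0.20, 0.50, 0.40)))   # ~0.245
```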

By using Equation (13) and the training dataset, the five-Nearest Neighbor results are obtained as shown in Table 12.

Table 12 The five-Nearest Neighbor result

To calculate the predictive accuracy for the k-Nearest Neighbor classification model, equation (10) was used.

  • Correctly classified instances=18

  • Incorrectly classified instances=6

  • Therefore, the Predictive accuracy

  • p=(18/24) × 100=75 percent

  •  1−p=(6/24) × 100=25 percent

DECISION TREE

Decision trees are powerful and popular for both classification and prediction. The attractiveness of tree-based methods is largely a result of the fact that decision trees represent rules. Rules can readily be expressed in English so that humans can understand them; they can also be expressed in a database access language such as SQL to retrieve records in a particular category. Decision trees are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target variable. Because decision trees combine both data exploration and modeling, they are a powerful first step in the modeling process even when building the final model using some other technique.13

Decision trees are an intuitive and widely used type of influence diagram. The basic goal of a decision tree is to find an optimal set of yes-or-no questions that ultimately leads to a correct classification, or probability. The tree must have meaningful criteria for choosing questions, and derive answers from the training dataset.14

Decision tree C4.5 algorithm

The decision tree C4.5 algorithm is Quinlan's extension of his own ID3 algorithm for generating decision trees.15 The C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible.

The C4.5 algorithm uses the concept of information gain, or entropy reduction, to select the optimal split. Suppose that a variable X has k possible values with probabilities p1, p2, …, pk. What is the smallest number of bits, on average per symbol, needed to transmit a stream of symbols representing the values of X observed? The answer is called the entropy of X and is defined as

H(X) = −Σj pj log2(pj)

C4.5 uses this concept of entropy as follows. Suppose that we have a candidate split S, which partitions the training dataset T into several subsets, T1, T2, …, Tk.

The mean information requirement can then be calculated as the weighted sum of the entropies for the individual subsets, as follows:

HS(T) = Σi Pi × H(Ti)

where Pi represents the proportion of records in subset i. The information gain is then defined as

gain(S) = H(T) − HS(T)

that is, the increase in information produced by partitioning the training data T according to this candidate split S. At each decision node, C4.5 chooses as the optimal split the split that has the greatest information gain, gain(S).
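These quantities can be sketched directly in code; the ten-label example split below is hypothetical and only illustrates the calculation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j) over the class proportions in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """gain(S) = H(T) - sum_i P_i * H(T_i) for the partition produced by split S."""
    total = len(parent_labels)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets)
    return entropy(parent_labels) - weighted

# Hypothetical candidate split of ten training labels into two subsets:
parent = ["Good"] * 4 + ["Very good"] * 6
subsets = [["Good"] * 4 + ["Very good"], ["Very good"] * 5]
print(round(information_gain(parent, subsets), 4))   # 0.61 for this toy split
```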

Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:

  • If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with the most frequent class in S.

  • Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2, … according to the outcome for each case, and apply the same procedure recursively to each subset.
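A simplified sketch of this divide-and-conquer growth is shown below. It reuses the `entropy`/`information_gain` helpers from the previous sketch, treats all attributes as categorical, and omits C4.5's gain ratio, pruning and handling of continuous or missing values, so it illustrates the recursion rather than the full algorithm.

```python
from collections import Counter

def grow_tree(records, attributes, min_size=2):
    """records: list of (attribute_value_dict, class_label) pairs.
    Returns a nested dict {attribute: {value: subtree_or_label}} or a leaf label."""
    labels = [label for _, label in records]

    # Leaf: all cases in one class, S is small, or no attributes left to split on.
    if len(set(labels)) == 1 or len(records) <= min_size or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Choose the split with the greatest information gain (helper defined above).
    def gain_of(attr):
        groups = {}
        for rec, label in records:
            groups.setdefault(rec[attr], []).append(label)
        return information_gain(labels, list(groups.values()))

    best = max(attributes, key=gain_of)

    # Partition on the chosen attribute and recurse on each subset.
    partitions = {}
    for rec, label in records:
        partitions.setdefault(rec[best], []).append((rec, label))
    remaining = [a for a in attributes if a != best]
    return {best: {value: grow_tree(subset, remaining, min_size)
                   for value, subset in partitions.items()}}
```

In the article's case, the corresponding first split was on the attribute SPC (Table 13), with SRS at the second split node (Table 14).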

Equations (14) and (15) were applied to the training dataset. Table 13 shows that the optimal split, with the greatest information gain (0.3694), is achieved by splitting on the attribute SPC.

Table 13 Information gain for the first split node in C4.5 decision tree

Table 14 shows the information gain for the second split node, which was the attribute SRS with the maximum information gain value (0.4151). Figure 7 shows the final decision tree.

Table 14 Information gain for the second split node in C4.5 decision tree
Figure 7: The decision tree model.

Table 15 shows the result of testing the last 24 instances (the test dataset) on the C4.5 decision tree model shown in Figure 7.

Table 15 The C4.5 decision tree result

Equation (10) was used to calculate the predictive accuracy of the C4.5 decision tree classification model.

  • Correctly classified instances=16

  • Incorrectly classified instances=8

  • Therefore, the Predictive accuracy

  • p=(16/24) × 100=66.667 percent

  •  1−p=(8/24) × 100=33.333 percent

The C4.5 decision tree model was implemented by using data of 80 schools (out of 104 schools) and applying the following indicators: SPC, SPT, SRS, PSSS, ERS and STRS. The model was then tested by using the 24 schools (out of 104 schools). The results showed that there are 16 schools whose predicted School Performance is identical to the actual school performance, while there are eight schools whose predicted School Performance is different from the actual school performance. Predictive accuracy was 66.667 percent for the C4.5 decision tree model, which is an acceptable percentage in the implementation of data mining projects.

ACCURACY EVALUATION

After testing the three data mining methods (Nearest Neighbor, Naïve Bayes and the C4.5 decision tree), the three methods were evaluated by measuring their accuracy. Weka version 3.6, developed at the University of Waikato in New Zealand, was one of the data mining tools used in the accuracy comparison. The dataset used with Weka is the same set of data used with the three methods, but converted by Equation (12) (min–max normalization) to values between 0 and 1. Figure 8 shows the Weka system test option.

Figure 8: Weka system test option.

Table 16 shows the results of testing each data mining method using different test modes. The results showed that the C4.5 decision tree had the highest accuracy rate (95.1923 percent) compared with the other methods when using the ‘evaluate on training data’ test mode.

Table 16 Accuracy of results by using WEKA

CONCLUSION

Data mining, also called Knowledge Discovery in Databases, is the field of discovering useful information from large amounts of data. In recent years, there has been increasing interest in the use of data mining to investigate scientific questions within educational research, an area of inquiry termed educational data mining. Educational data mining is the area of scientific inquiry centered on the development of methods for making discoveries within the unique kinds of data that come from educational settings, and using those methods to better understand students and the settings that they learn in.

The results were obtained by applying the standard method of analysis using Microsoft 2007. The results showed that the ‘evaluate on training data’ test mode achieved accuracies of 87.5 percent, 85.576 percent and 95.192 percent for the Nearest Neighbor, Naïve Bayes and C4.5 decision tree methods, respectively. The C4.5 decision tree algorithm therefore has the highest accuracy rate compared with the other models and is recommended for use in processing and testing educational data.