BACKGROUND

Classical statistics dictates that the statistician set about dealing with a given problem with a pre-specified procedure designed for that problem. For example, the problem of predicting a continuous target variable (for example, profit) is solved by using the ordinary least squares (OLS) regression model, along with checking the well-known underlying OLS assumptions.1 There are several candidate predictor variables at hand, making it a workable task for the statistician to check assumptions (for example, that the predictor variables are linearly independent). Likewise, the dataset has a practicable number of observations, making it also a workable task to check assumptions (for example, that the errors are uncorrelated). As well, the statistician can perform the highly regarded yet often-discarded technique of exploratory data analysis (EDA), in order to examine, and apply the appropriate remedies for, individual records that contribute to sticky data characteristics (for example, gaps, clumps and outliers). Importantly, EDA allows the statistician to assess whether a given variable, say X, needs a transformation/re-expression (for example, log(X), sin(X) or 1/X). The traditional variable selection methods can achieve neither such transformations nor the a priori construction of new variables from the original variables;2 this inability is a serious weakness of the variable selection methodology.3
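
The kind of re-expression assessment EDA calls for can be illustrated in a few lines of code. The Python sketch below is a minimal illustration on simulated data; the variable roles and the use of the correlation coefficient as a rough straightness indicator are my own assumptions for the example, not a prescribed EDA recipe.

```python
import numpy as np

# Hypothetical data: a profit-like target driven by the log of a skewed predictor.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.8, size=500)        # right-skewed predictor
y = 2.5 * np.log(x) + rng.normal(scale=0.5, size=500)   # target actually tracks log(x)

# Candidate re-expressions the statistician might try during EDA.
candidates = {
    "X": x,
    "log(X)": np.log(x),
    "1/X": 1.0 / x,
    "sqrt(X)": np.sqrt(x),
}

# Rank each re-expression by |correlation| with the target -- a crude but
# serviceable indicator of which form is closest to a straight-line relationship.
for name, values in sorted(candidates.items(),
                           key=lambda kv: -abs(np.corrcoef(kv[1], y)[0, 1])):
    print(f"{name:8s} corr with Y = {np.corrcoef(values, y)[0, 1]: .3f}")
```

In this simulated setting log(X) rises to the top of the ranking, which is exactly the kind of insight the traditional variable selection methods cannot deliver on their own.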

At present, building an OLS regression model or a logistic regression model (LRM, whereby the target variable is binary: yes-no/1–0) is often problematic because of the size of the dataset. Modelers work on big data – consisting of a multitude of variables and a positive army of observations. The workable tasks are no longer feasible. Modelers cannot sure-footedly use OLS regression and LRM on big data, as the two statistical regression models were conceived, tested and experimented with within the small-data settings of their day (over 50 and 205 years ago for LRM and OLS regression, respectively). We argue that fitting big data to a pre-specified small-framed model produces a skewed model with doubtful interpretability and questionable results.

The knowledge and practice of variable selection methods were developed when small data grew into early-size big data circa late 1960s/early 1970s. With only a single bibliographic citation ascribing variable selection methods to unsupported notions, I believe a reasonable scenario of the genesis of the methods was as follows.4 College statistics ‘nerds’ (intelligent thinkers) and computer science ‘geeks’ (intelligent doers) put together the variable selection methodology using a trinity of selection components:

  1. statistical tests (for example, F, chi-square and t tests, and significance testing);

  2. statistical criteria (for example, R-squared, adjusted R-squared, Mallows’ Cp and MSE5); and

  3. statistical stopping rules (for example, P-value flags for variable entry/deletion/staying in a model).

The body of unconfirmed thinking about the variable selection methods was misguidedly developed on the basis of adroitness in computer-automated statistics. This ‘trinity’ distorts the original theoretical and inferential meanings of the component parts. The statistician executing this computer-driven statistical combination in an apparently intuitive, insightful way seemingly provided proof – face validity – that the problem of variable selection, that is, subset selection, was solved (at least to the uninitiated statistician).
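
For concreteness, the two model-size-penalizing criteria named in the second component of the trinity can be computed from quantities any OLS fit supplies. The short Python sketch below uses the textbook formulas; the numbers at the bottom are purely illustrative, not from any real model.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors (excluding the intercept)
    fit on n observations; it penalizes R-squared for model size."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def mallows_cp(sse_p: float, mse_full: float, n: int, p: int) -> float:
    """Mallows' Cp for a subset model with p parameters (including the intercept),
    where sse_p is the subset model's residual sum of squares and mse_full is the
    mean squared error of the full model; values near p suggest little bias."""
    return sse_p / mse_full - n + 2 * p


# Illustrative numbers only (hypothetical, not from any real model).
print(adjusted_r_squared(r2=0.42, n=500, p=6))
print(mallows_cp(sse_p=1180.0, mse_full=2.3, n=500, p=7))
```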

These subset selection methods initially enjoyed wide acceptance with extensive use, and to an extent still do. Statisticians build models whose accuracy and stability are at risk, either by unknowingly using these unconfirmed methods or by knowingly using them because they do not know what else to do. It was not long before these methods’ weaknesses began to generate many commentaries in the literature. I itemize nine weaknesses below for two of the traditional variable selection methods, All-subset and Stepwise. I briefly describe the five most frequently used variable selection methods in the next section.

  1. For All-subset selection with more than 40 variables:4

    a) The number of possible subsets can be huge.

    b) Often, there are several good models, although some are unstable.

    c) The best X variables may be no better than random variables if the sample size is small relative to the total number of variables.

    d) The regression statistics and regression coefficients are biased.

  2. All-subset selection regression can yield models that are too small.6

  3. The number of candidate variables, and not the number in the final model, is the number of degrees of freedom to consider.7

  4. The data analyst knows more than the computer … and failure to use that knowledge produces inadequate data analysis.8

  5. Stepwise selection yields confidence limits that are far too narrow.9

  6. Regarding the frequency of obtaining authentic predictor and noise variables, the degree of correlation among the predictor variables affects the frequency with which authentic predictor variables can find their way into the final model. The number of candidate predictor variables can also affect the number of noise variables that gain entry to the model.10

  7. Stepwise selection will not necessarily produce the best model if there are redundant predictors (a common problem).11

  8. There are two distinct questions here: (a) When is Stepwise selection appropriate? and (b) Why is it so popular?12

  9. As to question (b) above, there are two groups that are inclined to favor its usage. One consists of individuals, with little formal training in data analysis, who confuse knowledge of data analysis with knowledge of the syntax of SAS, SPSS and so on. They seem to feel that ‘if it's there in a program, it's gotta be good – and better than actually thinking about what my data might look like’. They are fairly easy to spot and condemn in a group of well-trained data analysts. However, there is also a second group that is often well trained. Its members believe in statistics, but believe that given any properly obtained database, a suitable computer program can objectively make substantive inferences without active consideration of the underlying hypotheses. Stepwise selection is the parent of this blind data analysis …13

Currently, there is burgeoning research that continues the original efforts of subset selection by shoring up what we can call its ‘pseudo-theoretical’ foundation. It follows a line of examination that adds assumptions and makes modifications to eliminate weaknesses. As traditional methods are being mended, there are innovative approaches with starting points further afield than their traditional counterparts. There are freshly minted methods, such as the enhanced variable selection method built into the GenIQ Model, constantly being developed.14, 15, 16, 17

INTRODUCTION

Variable selection in regression – identifying the best subset among many variables to include in a model – is arguably the hardest part of model building. Many variable selection methods exist because they provide a solution to one of the most important problems in statistics.18, 19 Many statisticians know them, but few know that they produce poorly performing models. The wanting variable selection methods are a miscarriage of statistics because they were developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. They are executed with computer-intensive search heuristics guided by rules of thumb. Each method uses a unique trio of elements, one from each component of the trinity of selection components.20 Different sets of elements typically produce different subsets. The number of variables in common among the different subsets is small, and the sizes of the subsets can vary considerably.

An alternative view of the problem of variable selection examines certain subsets and selects the best subset, which either maximizes or minimizes an appropriate criterion. Two subsets are obvious – the best single variable and the complete set of variables. The problem lies in selecting an intermediate subset that is better than both of these extremes. Therefore, the issue is how to find the necessary variables among the complete set of variables by deleting both irrelevant variables (variables not affecting the dependent variable) and redundant variables (variables not adding anything to the dependent variable).21

I review five frequently used variable selection methods. These everyday methods are found in major statistical software packages.22 The test statistic for the first three methods uses either the F statistic for a continuous dependent variable or the G statistic for a binary dependent variable. The test statistic for the fourth method is either R-squared for a continuous dependent variable or the Score statistic for a binary dependent variable. The last method uses one of the following criteria: R-squared, adjusted R-squared or Mallows’ Cp.

  1. Forward Selection (FS) – This method adds variables to the model until no remaining variable (outside the model) can add anything significant to the dependent variable. FS begins with no variable in the model. For each variable, the test statistic (TS), a measure of the variable's contribution to the model, is calculated. The variable with the largest TS value that is greater than a preset value C is added to the model. The TS is then recalculated for the variables still remaining, and the evaluation process is repeated. Thus, variables are added to the model one by one until no remaining variable produces a TS value greater than C. Once a variable is in the model, it remains there. (A minimal code sketch of this procedure appears after this list.)

  2. Backward Elimination (BE) – This method deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. BE begins with a model that includes all variables. Variables are then deleted from the model one by one until all the variables remaining in the model have TS values greater than C. At each step, the variable showing the smallest contribution to the model (that is, the variable with the smallest TS value that is less than C) is deleted.

  3. Stepwise (SW) – This method is a modification of the forward selection approach, and differs in that variables already in the model do not necessarily stay. As in FS, SW adds variables to the model one at a time. Variables that have a TS value greater than C are added to the model. After a variable is added, however, SW looks at all the variables already included and deletes any variable that does not have a TS value greater than C.

  4. R-squared (R-sq) – This method finds several subsets of different sizes that best predict the dependent variable. R-sq finds the subsets of variables that best predict the dependent variable based on the appropriate TS. The best subset of size k has the largest TS value. For a continuous dependent variable, TS is the popular measure R-squared, the coefficient of multiple determination, which measures the proportion of variance in the dependent variable explained by the multiple regression. For a binary dependent variable, TS is the theoretically correct but less well-known Score statistic.23 R-sq finds the best one-variable model, the best two-variable model, and so forth. However, it is unlikely that one subset will stand out as clearly the best, as TS values are often bunched together; for example, they are equal in value when rounded to, say, the third decimal place.24 R-sq generates a number of subsets of each size, which allows the user to select a subset, possibly using non-statistical conditions.

  5. All-possible Subsets – This method builds all one-variable models, all two-variable models, and so on, until the last all-variable model is generated. The method requires a powerful computer (because many models are produced), and the selection of any one of the criteria: R-squared, adjusted R-squared or Mallows’ Cp.
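
To make the FS procedure concrete, here is a minimal Python sketch. It is an illustration only, not the implementation found in any statistical package; the entry threshold C, the simulated data and the use of the one-degree-of-freedom partial F statistic as the TS for a continuous dependent variable are assumptions made for the example.

```python
import numpy as np

def sse(columns, y):
    """Residual sum of squares of an OLS fit of y on the given columns (intercept included)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_selection(X, y, C=4.0):
    """Forward Selection: repeatedly add the candidate variable with the largest
    one-degree-of-freedom partial F statistic (the TS here), as long as it exceeds
    the entry threshold C. Variables, once entered, stay in the model."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    while remaining:
        sse_current = sse([X[:, j] for j in selected], y)
        best_j, best_f = None, -np.inf
        for j in remaining:
            trial = selected + [j]
            sse_trial = sse([X[:, k] for k in trial], y)
            df_resid = n - len(trial) - 1                 # intercept uses one df
            f_stat = (sse_current - sse_trial) / (sse_trial / df_resid)
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        if best_f <= C:
            break                                         # no remaining variable clears C
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Simulated demonstration: only the 2nd and 5th columns drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(size=300)
print(forward_selection(X, y))    # expected to pick columns 1 and 4 (0-based)
```

BE and SW differ only in direction and in the added deletion step; none of the three, as the next section argues, can construct anything beyond the original columns of X.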

WEAKNESS IN THE STEPWISE METHOD

An ideal variable selection method for regression models would find one or more subsets of variables that produce an optimal model.25 Its objectives are that the resultant models be accurate, stable, parsimonious and interpretable, and that they avoid bias in drawing inferences. Needless to say, the above methods do not satisfy most of these goals. Each method has at least one drawback specific to its selection criterion. In addition to the nine weaknesses mentioned above, I itemize a list of weaknesses of the most popular Stepwise method:26

  1. it yields R-squared values that are strongly biased on the high side;

  2. the F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution;

  3. the method yields confidence intervals for effects and predicted values that are falsely narrow;

  4. it yields P-values that do not have the proper meaning, and their proper correction is a difficult problem;

  5. it gives biased regression coefficients that need shrinking (the coefficients for remaining variables are too large);

  6. it has severe problems in the presence of collinearity;

  7. it is based on methods (for example, F tests) that were intended to be used to test pre-specified hypotheses;

  8. increasing the sample size does not help significantly;

  9. the number of candidate predictor variables affects the number of noise variables that gain entry to the model;

  10. it prevents us from thinking about the problem; and

  11. it uses a lot of paper.

I add to the tally of weaknesses by stating common weaknesses in regression models, as well as those specifically related to the OLS regression model and LRM:

The everyday variable selection methods used in regression modeling typically result in models having too many variables, an indicator of overfitting. The prediction errors, which are inflated by outliers, are not stable, and thus model implementation results in unsatisfactory performance. For ordinary least squares regression, it is well known that, in the absence of normality, in the absence of linearity, or in the presence of outliers in the data, variable selection methods perform poorly. For logistic regression, the computer-automated variable-selection models are unstable and not reproducible: the variables selected as predictors are sensitive to unaccounted-for sample variation in the data.

Given the litany of weaknesses cited, the lingering question is why statisticians use variable selection methods to build regression models. To paraphrase Mark Twain, ‘Get your [data] first, and then you can distort them as you please’.27 My answer is: ‘Modelers use variable selection methods every day because they can’. As a counterpoint to the absurdity of ‘because they can’, I enliven Tukey's solution of a ‘Natural Seven-step Cycle of Statistical Modeling and Analysis’ to define a substantially better-performing regression model. I feel that newcomers to Tukey's EDA need the Seven-step Cycle introduced within the narrative of Tukey's analytic philosophy. Accordingly, I enfold the solution with front and back matter – The Essence of EDA, and The EDA School of Thought, respectively. I delve into Tukey's masterwork, but first I discuss an enhanced variable selection method, for which I might be the only exponent.

ENHANCED VARIABLE SELECTION METHOD

In lay terms, the variable-selection problem in regression can be stated as follows:

Find the best combination of the original variables to include in a model. The variable selection method neither states nor implies that it has an attribute to concoct new variables stirred up by mixtures of the original variables.

The attribute – data mining – is either overlooked, perhaps because it is reflective of the simple-mindedness of the problem-solution at the onset, or is currently sidestepped because the problem is too difficult to solve. A variable selection method without a data mining attribute obviously hits a wall, beyond which the predictiveness of the technique could otherwise be increased. In today's terms, the variable selection methods are without data mining capability. They cannot dig into the data to mine potentially important new variables. (This attribute, which did not surface during my literature search, is a partial mystery to me.) Accordingly, I put forth the following definition of an enhanced variable selection method:

An enhanced variable selection method is one that identifies a subset that consists of the original variables and data-mined variables, whereby the latter are a result of the data mining attribute of the method itself.

The following five discussion points clarify the attribute weakness, and illustrate the concept of an enhanced variable selection method.

  • Consider the complete set of variables X1, X2, …, X10. Any of the current variable selection methods in use finds the best combination of the original variables (say X1, X3, X7, X10), but none can automatically transform a variable (say, transform X1 to log X1) to increase its information content (predictive power). Furthermore, none of the methods can generate a re-expression of the original variables (perhaps X3/X7) if the constructed variable (structure) were to offer more predictive power than its original component variables combined. In other words, current variable selection methods cannot find an enhanced subset, which would need to include, say, transformed and re-expressed variables (possibly X1, X3, X7, X10, log X1, X3/X7). A subset of variables without the potential of new structure offering more predictive power clearly limits the modeler in building the best model.

  • Specifically, the current variable selection methods fail to identify structures of the types discussed here: transformed variables with a preferred shape. A variable selection procedure should have the ability to transform an individual variable, if necessary, to induce a symmetric distribution. Symmetry is the preferred shape of an individual variable. For example, the workhorses of statistical measures – the mean and variance – are based on symmetric distributions. Skewed distributions produce inaccurate estimates of means, variances and related statistics, such as the correlation coefficient. Symmetry facilitates the interpretation of the variable's effect in an analysis. Skewed distributions are difficult to examine because most of the observations are bunched together at one end of the distribution. Modeling and analyses based on skewed distributions typically provide models with doubtful interpretability and questionable results.

  • A variable selection method should also have the ability to straighten non-linear relationships. A linear or straight-line relationship is the preferred shape when considering two variables. A straight-line relationship between independent and dependent variables is an assumption of the popular statistical linear regression models (for example, OLS regression and LRM). (Remember that a linear model is defined as a sum of weighted variables, such as Y=b0+b1*X1+b2*X2+b3*X3).28 Moreover, straight-line relationships among all the independent variables are a desirable property.29 In brief, straight-line relationships are easy to interpret: a unit increase in one variable produces an expected constant increase in a second variable.

  • Constructed variables from the original variables using simple arithmetic functions. A variable selection method should have the ability to construct simple re-expressions of the original variables. Sum, difference, ratio or product variables potentially offer more information than the original variables themselves. For example, when analyzing the efficiency of an automobile engine, two important variables are miles traveled and fuel used (gallons). However, we know that the ratio variable of miles per gallon is the best variable for assessing the engine's performance.

  • Constructed variables from the original variables using a set of functions (for example, arithmetic, trigonometric and/or Boolean functions). A variable selection method should have the ability to construct complex re-expressions with mathematical functions that capture the complex relationships in the data, thus potentially offering more information than the original variables themselves. In an era of data warehouses and the Internet, big data consisting of hundreds of thousands to millions of individual records and hundreds to thousands of variables are commonplace. Relationships among so many variables produced by so many individuals are sure to be complex, beyond the simple straight-line pattern. Discovering the mathematical expressions of these relationships is difficult, but practical guidance exists, and doing so should be the hallmark of a high-performance variable selection method. For example, consider the well-known relationship among three variables: the lengths of the three sides of a right triangle. A powerful variable selection procedure would identify the relationship among the sides, even in the presence of measurement error: the longest side (the hypotenuse) is the square root of the sum of squares of the two shorter sides.

In sum, the attribute weakness implies that a variable selection method should have the ability to generate an enhanced subset of candidate predictor variables.
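
As a rough illustration of the idea, the Python sketch below generates candidate data-mined variables – transforms, ratios and products of hypothetical originals – and ranks them alongside the originals by their absolute correlation with the target. It is a simplified stand-in for an enhanced variable selection method, not the GenIQ Model or any published procedure; the data, variable names and scoring rule are assumptions made for the example.

```python
import itertools
import numpy as np

# Hypothetical originals; the target is secretly driven by the ratio miles/gallons.
rng = np.random.default_rng(2)
n = 400
miles = rng.uniform(50, 500, n)
gallons = miles / rng.uniform(18, 42, n)
engine_temp = rng.normal(90, 8, n)
y = 0.8 * (miles / gallons) + rng.normal(scale=1.0, size=n)

originals = {"miles": miles, "gallons": gallons, "engine_temp": engine_temp}

# Data-mined candidates: simple re-expressions plus pairwise ratios and products.
candidates = dict(originals)
for name, v in originals.items():
    candidates[f"log({name})"] = np.log(v)
for (a, va), (b, vb) in itertools.permutations(originals.items(), 2):
    candidates[f"{a}/{b}"] = va / vb
for (a, va), (b, vb) in itertools.combinations(originals.items(), 2):
    candidates[f"{a}*{b}"] = va * vb

# Rank original and constructed variables together by |correlation| with the target.
ranked = sorted(candidates.items(),
                key=lambda kv: -abs(np.corrcoef(kv[1], y)[0, 1]))
for name, v in ranked[:5]:
    print(f"{name:20s} |corr| = {abs(np.corrcoef(v, y)[0, 1]):.3f}")
```

In this toy setting the constructed ratio miles/gallons outranks every original variable, which is exactly the kind of enhanced subset the traditional methods cannot surface.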

EDA

I present the ‘trinity’ of Tukey's EDA in a form relevant to the topic at hand: (a) The Essence of EDA; (b) The Natural Seven-step Cycle of Statistical Modeling and Analysis, serving as the most appropriate solution to the variable selection problem in regression; and (c) The EDA School of Thought.

(a) The Essence of EDA is best described in Tukey's own words: ‘Exploratory data analysis is detective work – numerical detective work – or counting detective work – or graphical detective work … [It is] about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights’. EDA includes the following characteristics:

  1. Flexibility – techniques with greater flexibility to delve into the data;

  2. Practicality – advice on procedures to analyze data;

  3. Innovation – techniques for interpreting results;

  4. Universality – use all of statistics that apply to analyzing data; and

  5. Simplicity – above all, the belief that simplicity is the golden rule.

The professional statistician has also been empowered by the computational strength of the PC, without which the natural seven-step cycle of statistical modeling and analysis would not be possible. The PC and the analytical cycle comprise the perfect pairing as long as the steps are followed in order and the information obtained from one step is used in the next. Unfortunately, statisticians are human, and often succumb to taking shortcuts through the seven-step cycle. They ignore the cycle and focus solely on the sixth step, identified below. However, careful statistical endeavor requires additional procedures, as described in the seven-step cycle that follows:30

(b) The Natural Seven-step Cycle of Statistical Modeling and Analysis

  1. Definition of the problem. Determining the best way to tackle the problem is not always obvious. Management objectives are often expressed qualitatively, in which case the selection of the outcome or target (dependent) variable is subjectively biased. When the objectives are clearly stated, the appropriate dependent variable is often unavailable, in which case a surrogate must be used.

  2. Determining technique. The technique first selected is often that with which the data analyst is most comfortable, and not necessarily the best technique for solving the problem.

  3. Use of competing techniques. Applying alternative techniques increases the odds that a thorough analysis is conducted.

  4. Rough comparisons of efficacy. Comparing the variability of results across techniques can suggest additional techniques or the deletion of alternative techniques.

  5. Comparison in terms of a precise (and thereby inadequate) criterion. An explicit criterion is difficult to define; therefore, precise surrogates are often used.

  6. Optimization in terms of a precise and similarly inadequate criterion. An explicit criterion is difficult to define; therefore, precise surrogates are often used.

  7. Comparison in terms of several optimization criteria. This constitutes the final step in determining the best solution.

The founding fathers of classical statistics – Karl Pearson and Sir Ronald Fisher – would have delighted in the PC's ability to free them from time-consuming empirical validations of their concepts. Pearson, whose contributions include regression analysis, the correlation coefficient, the standard deviation (a term he coined), and the chi-square test of statistical significance, would have likely developed even more concepts with the free time afforded by the PC. One can further speculate that the PC's functionality would have allowed Fisher's methods of maximum likelihood estimation, hypothesis testing and analysis of variance to have immediate, practical applications.

(c) The EDA School of Thought

Tukey's book is more than a collection of new and creative rules and operations; it defines EDA as a discipline that holds that data analysts fail only if they fail to try many things. It further espouses the belief that data analysts are especially successful if their detective work forces them to notice the unexpected. In other words, the philosophy behind EDA is a combination of attitude and flexibility to do whatever it takes to refine the analysis, and sharp-sightedness to observe the unexpected when it does appear. EDA is thus a self-propagating theory: each data analyst adds his or her own contribution to the discipline, as I hope to accomplish with this article.

The sharp-sightedness of EDA warrants more attention, as it is a very important feature of the EDA approach. The data analyst should be a keen observer of those indicators that are capable of being dealt with successfully, and should use them to paint an analytical picture of the data. In addition to the ever-ready visual graphical displays as an indicator of what the data reveal, there are numerical indicators, such as counts, percentages, averages and the other classical descriptive statistics (for example, standard deviation, minimum, maximum and missing values). The data analyst's personal judgment and interpretation of indicators are not considered a bad thing, as the goal is to draw informal inferences, rather than those statistically significant inferences that are the hallmark of statistical formality.

In addition to visual and numerical indicators, there are the indirect messages in the data that force the data analyst to take notice, prompting responses such as ‘the data look like …’ or ‘it appears to be …’. Indirect messages may be vague, but their importance lies in helping the data analyst to draw informal inferences. Thus, indicators do not include any of the hard statistical apparatus, such as confidence limits, significance tests or standard errors.

With EDA, a new trend in statistics was born. Tukey and Mosteller quickly followed up in 1977 with the second EDA book (commonly referred to as EDA II), Data Analysis and Regression, which recasts the basics of classical inferential procedures of data analysis and regression as an assumption-free, nonparametric approach guided by ‘(a) a sequence of philosophical attitudes … for effective data analysis, and (b) a flow of useful and adaptable techniques that make it possible to put these attitudes to work’.31

In 1983, Hoaglin, Mosteller and Tukey succeeded in advancing EDA with Understanding Robust and Exploratory Data Analysis, which provides an understanding of how badly the classical methods behave when their restrictive assumptions do not hold, and offers alternative robust and exploratory methods to broaden the effectiveness of statistical analysis.32 It includes a collection of methods to cope with data in an informal way, guiding the identification of data structures relatively quickly and easily, and trading off optimization of objective for stability of results.

In 1991, the same authors continued their fruitful EDA efforts with Fundamentals of Exploratory Analysis of Variance.33 They recast the basics of the analysis of variance with the classical statistical apparatus (for example, degrees of freedom, F ratios and P-values) in a host of numerical and graphical displays, which often provide insight into the structure of the data, such as size effects, patterns, interactions and the behavior of residuals.

EDA set off a burst of activity in the visual portrayal of data. Published in 1983, Graphical Methods for Data Analysis (Chambers et al) presents new and old methods – some that require a computer and others that require only paper and pencil – but all are powerful data analysis tools to learn more about data structure.34 In 1986, du Toit et al came out with Graphical Exploratory Data Analysis, providing a comprehensive yet simple presentation of the topic.35 With Statistical Graphics for Visualizing Univariate and Bivariate Data (1997), and Statistical Graphics for Visualizing Multivariate Data (1998), Jacoby carries out his objective of obtaining pictorial representations of quantitative information by elucidating histograms, one-dimensional and enhanced scatterplots and non-parametric smoothing.36, 37 In addition, he successfully transfers graphical displays of multivariate data onto a single sheet of paper, a two-dimensional space.

EDA presents a major paradigm shift in the ways models are built. With the mantra ‘Let your data be your guide’, EDA offers a view that is a complete reversal of the classical principles that govern the usual steps of model building. EDA declares that the model must always follow the data, and not the other way around, as in the classical approach.

In the classical approach, the problem is stated and formulated in terms of an outcome variable Y. It is assumed that the true model explaining all the variation in Y is known. Specifically, it is assumed that all the structures (predictor variables, Xi's) affecting Y and their forms are known and present in the model. For example, if Age affects Y, but the log of Age reflects the true relationship with Y, then log of Age must be present in the model. Once the model is specified, the data are taken through the model-specific analysis, which provides the results in terms of numerical values associated with the structures, or estimates of the true predictor variables’ coefficients. Interpretation is then made for declaring Xi an important predictor, assessing how Xi affects the prediction of Y, and ranking Xi in order of predictive importance.

Of course, the data analyst never knows the true model. Therefore, familiarity with the content domain of the problem is used to explicitly put forth the true surrogate model, from which good predictions of Y can be made. According to Box, ‘all models are wrong, but some are useful’.38 In this case, the model selected provides serviceable predictions of Y. Regardless of the model used, the assumption of knowing the truth about Y sets the statistical logic in motion to cause likely bias in the analysis, results and interpretation.

In the EDA approach, not much is assumed beyond having some prior experience with the content domain of the problem. The right attitude, flexibility and sharp-sightedness are the forces behind the data analyst, who assesses the problem and lets the data guide the analysis, which then suggests the structures and their forms of the model. If the model passes the validity check, then it is considered final and ready for results and interpretation to be made. If not, with the force still behind the data analyst, the analysis and/or data are revisited until new structures produce a sound and validated model, after which final results are found and interpretations made (see Figure 1). Without exposure to assumption violations, the EDA paradigm offers a degree of confidence that its prescribed exploratory efforts are not biased, at least in the manner of the classical approach. Of course, no analysis is bias-free, as all analysts factor their own bias into the equation.

Figure 1: EDA Paradigm.

With all its strengths and determination, EDA as originally developed had two minor weaknesses that could have hindered its wide acceptance and great success. One is of a subjective or psychological nature, and the other is a misconceived notion. Data analysts know that failure to look into a multitude of possibilities can result in a flawed analysis, thus finding themselves in a competitive struggle against the data itself. Thus, EDA can foster in data analysts an insecurity that their work is never done. The PC can assist data analysts in being thorough with their analytical due diligence, but bears no responsibility for the arrogance EDA engenders.

The belief that EDA, which was originally developed for the small-data setting, does not work as well with large samples is a misconception. Indeed, some of the graphical methods, such as the stem-and-leaf plots, and some of the numerical and counting methods, such as folding and binning, do break down with large samples. However, most of the EDA methodology is unaffected by data size. Neither the manner in which the methods are carried out nor the reliability of the results is changed. In fact, some of the most powerful EDA techniques scale up quite nicely, but do require the PC to do the serious number crunching of the big data.39 For example, techniques such as ladder of powers, re-expressing and smoothing are valuable tools for large-sample or big-data applications.
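
As one example, a minimal sketch of the ladder of powers applied to a large simulated sample appears below. The skewness-based selection rule and the simulated income variable are assumptions made for the illustration, not Tukey's exact prescription; the point is that the technique runs as easily on a million rows as on a few dozen.

```python
import numpy as np

def skewness(v):
    """Plain moment-based skewness: roughly 0 for a symmetric distribution."""
    v = np.asarray(v, dtype=float)
    return float(np.mean((v - v.mean()) ** 3) / v.std() ** 3)

def ladder_of_powers(v):
    """Walk a positive, right-skewed variable down the ladder of powers
    (x, sqrt(x), log(x), -1/sqrt(x), -1/x) and keep the rung whose
    re-expression has skewness closest to zero."""
    rungs = {
        "x": v,
        "sqrt(x)": np.sqrt(v),
        "log(x)": np.log(v),
        "-1/sqrt(x)": -1.0 / np.sqrt(v),
        "-1/x": -1.0 / v,
    }
    return min(rungs.items(), key=lambda kv: abs(skewness(kv[1])))

# A big-data-sized simulated example: one million right-skewed observations.
rng = np.random.default_rng(3)
income = rng.lognormal(mean=10.5, sigma=0.9, size=1_000_000)
name, re_expressed = ladder_of_powers(income)
print(name, round(skewness(income), 2), "->", round(skewness(re_expressed), 2))
```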

CONCLUSION

Finding the best possible subset of variables to put in a model has been a frustrating exercise. Many variable selection methods exist. Many statisticians know them, but few know that they produce poorly performing models. The resulting variable selection methods are a miscarriage of statistics because they were developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. I have reviewed the five widely used variable selection methods, itemized some of their weaknesses, and described why they are used. I have then sought to present a better solution to variable selection in regression: the Natural Seven-step Cycle of Statistical Modeling and Analysis. I feel that newcomers to Tukey's EDA need the Seven-step Cycle introduced within the narrative of Tukey's analytic philosophy. Accordingly, I have embedded the solution within the context of EDA philosophy.