Article

Journal of Simulation (2008) 2, 19–27. doi:10.1057/palgrave.jos.4250032

Simulation experiments in practice: statistical design and regression analysis

J P C Kleijnen1

1Tilburg University (UvT), Tilburg, The Netherlands

Correspondence: JPC Kleijnen, Department of Information Management/Center for Economic Research (CentER), Tilburg University (UvT), Post box 90153, Tilburg 5000 LE, The Netherlands. E-mail: kleijnen@uvt.nl

Received 15 January 2007; Accepted 15 October 2007.

Top

Abstract

In practice, simulation analysts often change only one factor at a time, and use graphical analysis of the resulting Input/Output (I/O) data. The goal of this article is to change these traditional, naïve methods of design and analysis, because statistical theory proves that more information is obtained when applying Design Of Experiments (DOE) and linear regression analysis. Unfortunately, classic DOE and regression analysis assume a single simulation response that is normally and independently distributed with a constant variance; moreover, the regression (meta)model of the simulation model's I/O behaviour is assumed to have residuals with zero means. This article addresses the following practical questions: (i) How realistic are these assumptions, in practice? (ii) How can these assumptions be tested? (iii) If assumptions are violated, can the simulation's I/O data be transformed such that the assumptions do hold? (iv) If not, which alternative statistical methods can then be applied?

Keywords:

metamodel, experimental design, jackknife, bootstrap, common random numbers, validation

Top

1. Introduction

Experiments with simulation models should be done with great care; otherwise, the analysts' time used to collect data about the real (non-simulated) system and the computer's time to run the simulation model (computer code) are wasted. (I am giving synonyms because simulation is used in many different areas, each with their own jargon.) In other words, simulation is more than an exercise in computer programming. Therefore, simulation textbooks such as Law (2007) spend many chapters on the statistical aspects of simulation.

My goal with this article is to change traditional, naïve methods of design and analysis, because statistical theory proves that more information is obtained when applying Design Of Experiments (DOE) and linear regression analysis. Because most practitioners are not statisticians, I provide a tutorial overview of methods to improve the application of statistical design and analysis of experiments to discrete-event simulation models. I shall illustrate statistical principles through two simple simulation examples, namely the well-known M/M/1 queuing and the (sS) inventory models (Table 1 defines major acronyms, for example M/M/1). These two models are the building blocks for more complicated simulation models, as is also mentioned in Law (2007). My presentation is guided by 40 years of experience with the application of statistical methodology in manufacturing, supply chains, defence, etc; that is, application of DOE in practice has been demonstrated to be possible. Below, I shall give several references to case studies, hoping that these references convince practitioners of the merits of this statistical methodology.


More specifically, I revisit the classic assumptions for linear regression analysis and their concomitant designs. These classic assumptions stipulate a single (univariate) simulation output (response) and 'white noise' (defined in the next paragraph). In the M/M/1 example, this response may be the average waiting time of all customers simulated during a simulation run; in the inventory example the response may be the costs per time unit estimated by running the simulation and accumulating the inventory carrying, ordering, and stock-out costs.

White noise (say) e is Normally (or Gaussian), Independently, and Identically Distributed (NIID) with zero mean and some variance (say) Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author. As I shall show in the next sections, the white noise assumption implies the following four (sub)assumptions:

  1. The simulation responses are normally distributed.
  2. The simulation experiment does not use Common Random Numbers (CRN).
  3. When the simulation inputs change in the experiment, the expected values (or means) of the simulation outputs may also change—but their variances must remain constant.
  4. The linear regression model (for example a first-order polynomial) is assumed to be a 'valid' approximation of the Input/Output (I/O) behaviour of the underlying simulation model; that is, the residuals of the fitted regression model have zero means.

I shall try to answer the following questions, for each of these four assumptions:

  • How realistic are these assumptions?
  • How can these assumptions be tested if it is not obvious that the assumption is violated? (for example the analysts may test the normality of the simulation outputs; see below.)
  • If an assumption is violated, can the so-called simulation's I/O data be transformed such that the assumption holds? (An example of I/O data is the arrival and service rates in the M/M/1 example, which are input data; the average waiting times are output data.)
  • If such transformations cannot be found, which alternative statistical methods can then be applied?

The remainder of this article is organized in such a way that these questions are answered for each of the four classic assumptions listed above. So, in the next section, I discuss the consequences of having multiple simulation outputs (instead of a single output). Next, I address possible non-normality of the simulation output, including tests of normality, normalizing transformations of simulation I/O data, and jackknifing and bootstrapping as alternative methods that do not assume normality. Then I cover variance heterogeneity (or heteroscedasticity) of simulation outputs. Next I discuss CRN. Then I discuss problems that arise when low-order polynomials are not valid approximations. I conclude with a summary of major conclusions. An extensive list of references enables further research to be carried out easily.

Note that this article is an 'adaptation' of Kleijnen (2006); that is, in the present article I focus on discrete-event simulation (excluding deterministic simulation based on differential equations) and use only elementary mathematical statistics. More statistical details and background information are given in Kleijnen (2007, 2008).

Top

2. Multiple simulation output

The M/M/1 simulation may have the following three outputs: (i) the average waiting time, (ii) the maximum waiting time, and (iii) the average occupation (or 'busy') percentage of the server.

The (sS) simulation may have two outputs: (i) the sum of the holding and the ordering costs, averaged over the simulated periods; (ii) the service (or fill) rate, averaged over the same simulation periods (the service rate is used because the out-of-stock costs are hard to quantify in practice). The precise definitions of these costs and the service rate vary with the applications (see (Law, 2007 and also Angün et al, 2006 and Ivanescu et al, 2006).

A case study concerning a Decision Support System (DSS) for production planning is presented in Kleijnen (1993). Originally, the simulation model had a multitude of outputs. However, to support decision making, it turned out that it sufficed to consider only the following two outputs (DSS criteria, bivariate response): (i) the total production of steel tubes manufactured, which was of major interest to the production manager; (ii) the 90% 'quantile' (also erroneously called 'percentile') of delivery times, which was the sales manager's concern. Anyhow, a single simulation output did not suffice in this case study.

For general usage, I use the following notation for the simulation model itself:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where w is the vector of (say) zgreater than or equal to1 simulation outputs (vectors and matrices are denoted by bold face symbols); s denotes the mathematical function that is defined by the computer code implementing the simulation model; dj denotes the jth factor (input variable) of the simulation model (for example the arrival rate or the service rate of the M/M/1 model). Then D=(dij) is the design matrix for the simulation experiment, with j=1, ..., k and i=1, ..., n, where k denotes the number of factors in the simulation experiment and n the number of combinations of factor levels (or values) in that experiment (these combinations are also called scenarios); p0 is the Pseudo-Random Number (PRN) seed (or initial value).

In the M/M/1 example, the average waiting time may be approximated by the following first-order polynomial if the traffic rate (say) x is 'low': Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author. In general, I assume that the simulation's multivariate I/O function in (1) is approximated by z univariate linear regression (meta)models:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where yh denotes the n-dimensional vector with the regression predictors for the hth type of simulation output; X is the common n times q matrix of explanatory variables; for simplicity, I assume that all z regression metamodels are polynomials of the same order; for example, all regression models are second-order polynomials (if the regression model includes an intercept and q>2, then it is called a 'multiple' regression model); X is determined by D defined below (1) (for example if the regression model is a first-order polynomial, then X=(1D) where 1 denotes a vector with n ones); betah is the q-dimensional vector with the regression parameters for the hth metamodel; eh is the n-dimensional vector with the residuals for the hth metamodel, in the n factor combinations.

The multiple simulation outputs are correlated; for example, in the inventory example, an 'unusual' PRN stream may result in inventory costs that are 'relatively high'—that is, higher than expected—and a relatively high service percentage, so these two outputs are positively correlated. Consequently, it seems that the Ordinary Least Squares (OLS) estimators (say) betacrch of betah in (2) should be replaced by the Generalized Least Squares (GLS) estimator; GLS accounts for the correlations among simulation outputs. Fortunately, GLS reduces to OLS computed per output if the same design matrix is used (as is the case in Equation (2), where X has no subscript h) (see Rao (1959) and the more recent reference Ruud (2000, p 703)). These OLS estimators are

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where wmacrh denotes the simulation output of type h averaged over mgreater than or equal to1 replicates. (m is assumed to be constant over the n factor combinations; otherwise, the n averages per response type would have to be weighted by the number of replicates; see Kleijnen (1987, p 195). Replicates are IID, by definition.)

Because a simulation experiment uses the same design matrix to generate the multiple outputs, the estimator in (3) is the Best Linear Unbiased Estimator (BLUE). Indeed, (3) is a linear estimator, as it uses L=(X'X)-1X, which results in a (deterministic) linear transformation of the random simulation outputs. Furthermore, (3) gives an unbiased estimator if the residuals have zero mean. Finally, (3) gives the 'best' estimator, in the sense that it has minimum variance.

In summary, in case of multiple simulation outputs the simulation practitioners may still use the classic formulas, so they can easily obtain Confidence Intervals (CIs) and statistical tests for the regression parameters per output.

Top

3. Non-normal simulation output

The Least Squares (LS) criterion that was used to derive the regression estimators in (3) is a mathematical criterion, so LS does not assume a normal distribution. Only if the simulation analysts require statistical properties—such as BLUE, CIs, and tests—they usually assume a normal distribution. In the following subsections, I try to answer the four questions formulated in the Introduction.

3.1. Realistic normality assumption?

Simulation responses within a run are autocorrelated (serially correlated) so their covariances are not zero. By definition, a stationary covariance process has a constant mean and a constant variance; its covariances depend only on the lag |t-t'| between the variables wt and wt'. The average of a stationary covariance process is asymptotically normally distributed if the covariances tend to zero sufficiently fast for large lags (see Lehmann, 1999, Chapter 2.8). For example, in inventory simulation the output is often the costs averaged over the simulated periods; I expect this average to be normally distributed. Another output of an inventory simulation may be the service percentage calculated as the fraction of demand delivered from on-hand stock per (say) week, so 'the' output is the average per year computed from these 52 weekly averages. I expect this yearly average to be normally distributed—unless the service goal is 'close' to 100%, in which case the average service rate is cut off at this threshold and I expect the normal distribution to be a bad approximation.

Note that CIs based on Student's t statistic are known to be quite insensitive to non-normality, whereas the lack-of-fit F-statistic (see Equation (27)) is known to be more sensitive to non-normality; see (Kleijnen, 1987) for details including references.

In summary, a limit theorem may explain why simulation outputs are asymptotically normally distributed. Whether the actual simulation run is long enough, is always hard to know. Therefore it seems good practice to check whether the normality assumption holds (see the next subsection).

3.2. Testing the normality assumption

Basic statistics textbooks—but also see the recent article (Arcones and Wang, 2006)—and simulation textbooks—see Kleijnen (1987) and Law (2007)—propose several visual plots and goodness-of-fit statistics to test whether a set of observations comes from a specific distribution type such as a normal distribution. A basic assumption is that these observations are IID. Simulation analysts may therefore obtain 'many' (say, m=100) replicates for a specific factor combination (for example the base scenario) if computationally feasible. However, if a single simulation run takes relatively much computer time, then only 'a few' (say, 2less than or equal tomless than or equal to10) replicates are feasible, so the plots are too rough and the goodness-of-fit tests lack power (so these tests have a high probability of making a type-II error).

Actually, the white noise assumption concerns the metamodel's residuals e in (2)—not the simulation model's outputs w in (1). For simplicity of presentation, I again assume mi=mgreater than or equal to1. Even if the simulation outputs have a constant variance (say) Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author and are independent (no CRN) so Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author, the estimated residuals do not have constant variances and are not independent! More precisely, it can be proven that

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

Nevertheless, analysts such as Ayanso et al (2006) apply visual inspection of residual plots, which are standard output of many statistical packages. For further discussion I refer to Atkinson and Riani (2000).

3.3. Transformations of simulation I/O data, jackknifing, and bootstrapping

The simulation output may be transformed to make it have a more normal distribution. A well-known transformation is the Box–Cox power transformation:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where lambda is the transformation parameter that is estimated from the original simulation output data. A complication is that now the metamodel does not explain the behaviour of the original output, but the behaviour of the transformed output! For details on this transformation, I refer to Atkinson and Riani (2000, p 82) and Freeman and Modarres (2006).

Outliers occur more frequently when the actual distribution has 'fatter' tails than the normal distribution. Robust regression analysis might then be applied, as explained in Atkinson and Riani (2000) and Salibian-Barrera (2006). However, I have not seen any applications of this approach in simulation.

Normality is not assumed by the following two general statistical procedures that use the original simulation I/O data, namely jackknifing and bootstrapping. Both procedures have become popular since powerful and cheap computers have become available to the analysts.

3.4. Jackknifing

In general, jackknifing tries to solve the following two types of problems:

  1. How to compute CIs in case of non-normal observations?
  2. How to reduce possible bias of estimators?

Examples of non-normal observations are the estimated service rate close to 100% in inventory simulations, and extreme quantiles such as the 99.99% point in risk simulations; see the nuclear waste simulation in Kleijnen and Helton (1999). Examples of biased estimators follow below.

Suppose the analysts want a CI for the regression coefficients in case the simulation output has a very non-normal distribution. So the linear regression metamodel is still (2) with z=1. Assume that each factor combination is replicated m>1 times. The original OLS estimator (also see (3)) is then

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

Jackknifing deletes the rth replicate among the m replicates, and recomputes the estimator:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where wmacr-r is the n-dimensional vector with components that are the averages of the m-1 replicates after deleting replicate r:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where the summation runs from 1 to m-1 (not m) in case r equals m.

For ease of presentation, I now focus on betaq (the last of the q regression parameters in the vector beta). Jackknifing uses the pseudovalue (say) J, which is the following weighted average of the original estimator and the qth component of the jackknifed estimator defined in (7)—with the number of observations as weights:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

In (9), both the original and the jackknifed estimators are unbiased, so the pseudovalues also remain unbiased estimators. Otherwise it can be proven that the bias is reduced by the jackknifed point estimator

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

which is simply the average of the m pseudovalues defined in (9).

To compute a CI, jackknifing treats the m pseudovalues as if they were NIID; that is, jackknifing uses

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where tm-1; alpha/2 denotes the upper alpha/2 point of the distribution of Student's t statistic with m-1 Degrees of Freedom (DF), and

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

The interval in (11) may be used to test the null-hypothesis that the true regression parameter has a specific value, for example, zero.

Applications of jackknifing in simulation are numerous. For example, jackknifing gave CIs for Weighted LS (WLS) with weights based on the estimated variances of the simulation responses; see (18) below and Kleijnen et al (1987). In another example, jackknifing reduces the bias and computes CIs for a Variance Reduction Technique called control variates or regression sampling; see Kleijnen et al (1989). A final example concerns jackknifing in the renewal analysis of steady-state simulation; see Kleijnen and Van Groenendaal (1992, pp 202–203).

3.4. Bootstrapping

Bootstrapping is discussed in textbooks such as Davison and Hinkley (1997), Efron and Tibshirani (1993), Good (2005), and Lunneborg (2000); a recent article is Davidson and MacKinnon (2007). Bootstrapping may be used for two types of situations:

  1. The relevant distribution is not Gaussian.
  2. The statistic is not standard.

Sub (i): Reconsider the example used for jackknifing; that is, assume that the analysts want a CI for a regression coefficient in case of non-normal simulation output. Again assume that each factor combination is replicated m>1 times. The original LS estimator was given in (6).

The bootstrap distinguishes between the original observations w (see Equation (1)) and the bootstrapped observations (say) w* (note the superscript). I limit myself to standard bootstrapping, which assumes that the original observations are IID. In the jackknife example, there were m IID original simulated observations per factor combination.

The bootstrap observations are obtained by resampling with replacement from the set of original observations, while the sample size is kept constant, at m. This resampling is executed for each of the n combinations. These bootstrapped outputs w* give the bootstrapped average simulation output wmacr*. Substitution into (6) gives the bootstrapped LS estimator

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

To reduce sampling variation, this resampling is repeated (say) B times; B is known as the bootstrap sample size (typical values for B are 100 and 1000).

I again focus on the single regression parameter betaq. The bootstrap literature gives several procedures for the construction of CIs, but most popular is the following procedure. Determine the Empirical Density Function (EDF) of the bootstrap estimate; that is, sort the B observations from smallest to largest (the EDF is like a histogram). The lower limit of the CI is the alpha/2 quantile of the EDF; obviously, Balpha/2 values are smaller than this quantile. Likewise, the upper limit is the 1-alpha/2 quantile.

Applications of bootstrapping include the validation of trace-driven simulation models in case of serious non-normal outputs; see Kleijnen et al (2001).

Sub (ii): Besides classic statistics such as t- and F-statistics, the simulation analysts may be interested in statistics that have no tables with critical values (these tables are used to determine CIs). For example, the well-known coefficient of determination R2 may be bootstrapped to test the validity of regression metamodels in simulation; see Kleijnen and Deflandre (2006).

Top

4. Heterogeneous simulation output variances

In the following subsections, I try to answer the questions raised in the Introduction—in case the simulation outputs do not have a common variance.

4.1. Common variance in practice?

In practice, the variances of the simulation outputs change when factor combinations change. For example, the M/M/1 simulation not only has mean waiting times that change as the traffic rate changes—the variance of this output changes even more!

4.2. Testing the common variance assumption

Though it may be a priori certain that the variances of the simulation outputs are not constant, the analysts may still hope that the variances are 'nearly' constant in their particular application. Unfortunately, in practice the variances are unknown so they must be estimated. These estimators themselves have high variances. Moreover, there are n factor combinations in the simulation experiment, so n variance estimators need to be compared. This problem may be solved in many different ways, but I recommend the following distribution-free test defined in Conover (1980, p 241).

Subtract the sample average wi of factor combination i (i=1, ..., n) from the m simulation responses wir(r=1, ..., m). Compute the absolute values of these adjusted responses. Rank all resulting nm values from smallest to largest, assuming all combinations give identically distributed responses. To test this assumption, compute the sum of squared ranks per factor combination i. This gives a statistic (detailed in Conover) that is compared with the 1-alpha quantile of a chi-square statistic with n-1 DF.

4.3. Variance stabilizing transformations

The logarithmic transformation in (5) may be used not only to obtain normal outputs but also to obtain outputs with constant variances. A problem may again be that the regression metamodel now explains the transformed outputs instead of the original outputs.

4.4. Weighted least squares (WLS)

In case of heterogeneous variances, the LS criterion still gives an unbiased estimator. The variance of the OLS estimator, however, now is

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where the ith (i=1, ..., n) element on the main diagonal of cov(wmacr) is var(wi)/m. I shall present a simple method to derive CIs for the q individual OLS estimators, when discussing CRN below.

Though the OLS estimator remains unbiased, it is no longer the BLUE. The BLUE is now the WLS estimator

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where I explicitly denote the number of rows N=sumi=1nmi of X, which is an N times q matrix. Obviously, if mi=m, then N=nm. For such a constant number of replicates, the WLS estimator may be rewritten as

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where X is n times q and cov(wmacr)=cov(w)/m. The covariance matrix of this WLS estimator is

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

In practice, however, the matrix cov(w) is unknown so it must be estimated. The ith element on this diagonal matrix is estimated through the classic unbiased variance estimator

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

Substituting the estimated matrix into the classic WLS formula (15) gives the Estimated WLS (EWLS) or Aitken estimator:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

This is a nonlinear estimator! Consequently, the statistical analysis becomes more complicated. For example, the analogue of (16), namely

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

holds only asymptotically (under certain conditions); see, for example, Godfrey (2006) and Kleijnen et al (1985). Classic CIs no longer hold.

Relatively simple solutions for this type of problem have already been presented above, namely jackknifing and bootstrapping. Jackknifing of the EWLS estimator was indeed done in Kleijnen et al (1987), as follows. Delete the rth replicate among the m replicates, and recompute the EWLS estimator (see (7), (18)):

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author consists of the n averages computed after deleting replicate r, and Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author is computed from the same m-1 replicates. The estimator in (20) and the original estimator computed through (18) are used to compute the pseudovalues, which give the desired CI. Bootstrapping of the EWLS estimator is done in Kleijnen and Deflandre (2006).

4.1. Designs for variance heterogeneity

If the variances of the simulation outputs are not constant, classic designs still give unbiased OLS and WLS estimators. The literature pays little attention to the derivation of alternative designs in case of heterogeneous variances. However, Kleijnen and Van Groenendaal (1995) does investigate designs with factor combinations replicated so many times that the estimated variances of the average simulation response per combination are approximately constant. More precisely, var(wi)=sigmai2/mi implies that the number of replicates should satisfy

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where c0 is a positive constant such that the mi become integers. This equation means that the higher the variability of the simulation output for a particular input combination is, the more replicates are simulated. The allocation of the total number of simulation runs Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author according to (21) is not necessarily optimal, but it simplifies the regression analysis and the design of the simulation experiment. Indeed, the regression analysis can now apply OLS to the averages wi to get the BLUE.

In practice, the variances of the simulation outputs must be estimated. A two-stage procedure takes a pilot sample of (say) m0greater than or equal to2 replicates for each factor combination, and estimates the response variances through the analogue of (17):

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

with Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author. Combining (21), (22) implies that the number of additional replicates is mcirci-m0 with

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

after rounding to integer values (so, in the second stage no additional replicates are simulated for the combination with the smallest estimated variance). After the second stage, all mcirci replicates are used to estimate the average output and its variance. OLS is applied to these averages. The covariance matrix of the estimated regression parameters is estimated through (13) with the covariance matrix of the estimated simulation responses estimated through a diagonal matrix with diagonal elements Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author. Finally, CIs are based on the classic t statistic with DF equal to only m0-1.

Because these Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author may still differ considerably, this two-stage approach may be replaced by a purely sequential approach. The latter approach adds one replicate at a time after the pilot stage, until the estimated variances of the average simulation outputs have become approximately constant. This procedure requires fewer simulation responses than the two-stage procedure, but it is harder to understand, program, and implement.

Top

5. Common random numbers (CRN)

In the following subsections, I again try to answer the questions raised in the Introduction—now for the problems created by CRN.

5.1. CRN in practice

In practice, simulation analysts often use CRN, because CRN is the default of many simulation software packages; that is, the software automatically starts each run with the same PRN seed (p0 in Equation (1)). As an example, I consider an M/D/1 simulation; that is, a single server simulation with exponential interarrival times and constant service times. Suppose that a very extreme event occurs, namely all the PRNs happen to be close to one. The interarrival times are then close to zero. So—whatever traffic rate is simulated—the waiting times tend to be higher than expected; that is, the simulation responses for different traffic rates are positively correlated.

In general, CRN implies that the simulation outputs of different factor combinations are positively correlated across these combinations: cov(wiwi')>0 with i, i'=1, ..., n. The goal is to reduce the variances of the estimated factor effects; actually, the variance of the estimated intercept increases when CRN is used. CRN gives better predictions of the output for combinations not yet simulated—provided the higher inaccuracy of the estimated intercept is outweighed by the higher accuracy of all other estimated effects.

5.2. OLS versus GLS

Because CRN makes the simulation outputs correlated, the analysts have two options:

  1. Continue to use OLS
  2. Switch to GLS.

Sub (i): The variance of the OLS estimator is given by (13), but now cov(wmacr) is not a diagonal matrix. I propose the following simple CIs, assuming mgreater than or equal to2 replicates; also see Law (2007, p 627). From replicate r, compute

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

The n components of the vector wr are correlated because of the CRN and may have different variances (see the preceding section on WLS). Yet, the m estimators of (say) the last regression parameter betaq are independent (because they use non-overlapping PRN streams) and have a common standard deviation (say) sigma(betaq) so the following expression has a t distribution with m-1 DF:

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

where

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

Sub (ii): CRN implies that the BLUE is the GLS estimator; see (14) where cov(w) is now not diagonal. In practice, cov(w) is estimated through

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

The resulting Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author is singular if mless than or equal ton; see Dykstra (1970). If m>n, then the analogue of (18) gives Estimated GLS (EGLS). This EGLS estimator can again be analysed through jackknifing and bootstrapping. However, Kleijnen (1992) compares OLS and EGLS, relying on the asymptotic covariance matrix; see (19) with nondiagonal response covariance matrix. But 'bootstrap tests ... yield more reliable inferences than asymptotic tests in a great many cases'; see Davidson and MacKinnon (2006).

In summary, CRN with EGLS may give better point estimates of the factor effects (except for the intercept), but a proper statistical analysis requires m>n replicates. OLS requires only mgreater than or equal to2 replicates.

5.3. Designs for CRN

The literature pays no attention to the derivation of designs that allow for CRN. Sequential procedures are proposed in Kleijnen and Van Beers (2004) and Van Beers and Kleijnen (2006). These two publications select the next factor combination to be simulated, assuming the simulation I/O data are analysed through Kriging (instead of linear regression), which allows the simulation outputs to be correlated.

Top

6. Validation of linear regression metamodel

In the following subsections, I again try to answer the questions raised in the Introduction—in case the fitted linear regression model does not 'adequately' approximate the underlying simulation model.

6.1. Tests for the validity of the linear regression model

A valid regression model implies that it has zero mean residuals, so the following null-hypothesis holds: H0: E(e)=0. To test this hypothesis, the analysts may apply the classic lack-of-fit F-statistic, assuming white noise; see Kleijnen (2008). However, this assumption is not valid if the analysts apply CRN. The analysts may then apply the following variant derived in Rao (1967) and evaluated in Kleijnen (1992):

Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author

with the conditions n>q and m>n; the symbol Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author denotes the EGLS residuals so Unfortunately we are unable to provide accessible alternative text for this. If you require assistance to access this image, please contact help@nature.com or the author. This test also allows EWLS instead of EGLS. Normality of the simulation output is an important assumption for both the classic F-test and Rao's F-test. In case of non-normality, the analysts may apply jackknifing or bootstrapping; bootstrapping of Rao's statistic (and the classic R2 statistic) is indeed done in Kleijnen and Deflandre (2008).

An alternative test uses cross-validation and the t-statistic, which is less sensitive to non-normality than the F-statistics; see Kleijnen (1992). Moreover, this t-statistic requires fewer replications, namely m>1 instead of m>n if EWLS or EGLS is used. For details, I refer to Kleijnen (2008).

Besides these quantitative tests, the analysts may use graphical methods to judge the validity of a fitted metamodel (be it a linear regression model or some other type of metamodel such as a Kriging model). Scatterplots are well known. The panel discussion published in Simpson et al (2004) also emphasizes the importance of visualization; also see Helton et al (2006). If these validation tests reject the null-hypothesis, then the analysts may consider the alternatives discussed in the next subsection.

6.2. Transformations for improved validity of linear regression model

A well-known transformation in queuing simulations combines two simulation inputs—namely, the arrival rate (say) lambda and the service rate mu—into a single independent regression variable—namely, the traffic rate lambda/mu. Another transformation replaces lambda and mu and the regression predictor y by log(lambda), log(mu), and log(y) to make the first-order polynomial metamodel approximate relative changes.

Another simple transformation assumes that the I/O function of the underlying simulation model is monotonic. The original values of the dependent and independent variables are then replaced by their ranks, which results in so-called rank regression; see Conover and Iman (1981) and Saltelli and Sobol (1995). Such a rank regression is applied to find the most important factors in a simulation model of nuclear waste disposal, in Kleijnen and Helton (1999).

Transformations may also be applied to make the simulation output better satisfy the assumptions of normality (see Equation (5)) and variance homogeneity. Unfortunately, different goals of the transformation may conflict with each other; for example, the analysts may apply the logarithmic transformation to reduce non-normality, but this transformation may give a metamodel in variables that are not of immediate interest.

I do not recommend routinely augmenting the metamodel with higher-order terms (for example interactions among triplets of factors) because these terms are hard to interpret. Nevertheless, if the analysts' goal is not to understand the underlying simulation model but to predict the output of a (possibly expensive) simulation model, then high-order terms may be added. Indeed, classic full-factorial designs such as 2k designs enable the estimation of all interactions, including high-order interactions. If more than two levels are simulated per factor, then the following types of metamodels may be considered.

6.3. Alternative metamodels

There are several alternative metamodel types; for example, Kriging and neural network models. These alternatives may give better predictions than low-order polynomials do. However, these alternatives are so complicated that they do not help the analysts better understand the underlying simulation model. Furthermore, these alternative metamodels require alternative design types. This is a completely different issue, so I refer to the extensive literature on this topic—including Kleijnen (2008).

Top

7. Conclusions

In this survey, I discussed the practical implications of the assumptions of classic linear regression analysis and the concomitant statistical designs. I pointed out that multiple simulation outputs may still be analysed through OLS per output type. I addressed possible non-normality of simulation output, including normality tests, normalizing transformations of simulation I/O data, and distribution-free jackknifing and bootstrapping. I presented analysis and design methods for simulation outputs that do not have a common variance. I discussed how to analyse simulation outputs that use CRN. I discussed possible lack-of-fit of low-order polynomial metamodels, and possible remedies. I gave many references for further study of these issues.

I hope that practitioners will be stimulated to apply this statistical methodology to obtain more information from their simulation experiments. Statistical designs can be proven to be much better than designs changing only one factor at a time. Regression models formalize scatter plots and other popular graphical techniques for analysing the simulation model's I/O data, so more objective conclusions become possible.

Finally, I hope that this methodology will be incorporated in future simulation software.

Top

References

  1. Angün E, den Hertog D, Gürkan G and Kleijnen JPC (2006). Response surface methodology with stochastic constrains for expensive simulation. Working Paper, Tilburg University, Tilburg, Netherlands.
  2. Arcones MA and Wang Y (2006). Some new tests for normality based on U-processes. Stat Probab Lett 76: 69–82. | Article |
  3. Atkinson A and Riani M (2000). Robust Diagnostic Regression Analysis. Springer: New York.
  4. Ayanso A, Diaby M and Nair SK (2006). Inventory rationing via drop-shipping in Internet retailing: A sensitivity analysis. Eur J Opl Res 171: 135–152. | Article |
  5. Conover WJ (1980). Practical Nonparametric Statistics, 2nd edn, Wiley: New York.
  6. Conover WJ and Iman RL (1981). Rank transformations as a bridge between parametric and nonparametric statistics. Am Statist 35: 124–133. | Article |
  7. Davidson R and MacKinnon JG (2007). Improving the reliability of bootstrap tests with the fast double bootstrap. Comput Stat Data Anal 51: 3259–3281. | Article |
  8. Davison AC and Hinkley DV (1997). Bootstrap Methods and their Application. Cambridge University Press: Cambridge.
  9. Dykstra RL (1970). Establishing the positive definiteness of the sample covariance matrix. Ann Math Stat 41: 2153–2154. | Article |
  10. Efron B and Tibshirani RJ (1993). An Introduction to the Bootstrap. Chapman & Hall: New York.
  11. Freeman J and Modarres R (2006). Inverse Box Cox: The power-normal distribution. Stat Probab Lett 76: 764–772. | Article |
  12. Godfrey LG (2006). Tests for regression models with heteroskedasticity of unknown form. Comput Stat Data Anal 50(10): 2715–2733. | Article |
  13. Good PI (2005). Resampling Methods: A Practical Guide to Data Analysis; 3rd edn, Birkhäuser: Boston.
  14. Helton JC, Johnson JD, Sallaberry CJ and Storlie CB (2006). Survey of sampling-based methods for uncertainty and sensitivity analysis. Reliab Eng Syst Saf 91: 1175–1209. | Article |
  15. Ivanescu C, Bertrand W, Fransoo J and Kleijnen JPC (2006). Bootstrapping to solve the limited data problem in production control: An application in batch processing industries. J Opl Res Soc 57: 2–9. | Article |
  16. Kleijnen JPC (1987). Statistical Tools for Simulation Practitioners. Marcel Dekker: New York.
  17. Kleijnen JPC (1992). Regression metamodels for simulation with common random numbers: Comparison of validation tests and confidence intervals. Mngt Sci 38: 1164–1185.
  18. Kleijnen JPC (1993). Simulation and optimization in production planning: A case study. Decis Supp Syst 9: 269–280. | Article |
  19. Kleijnen JPC (2006). White noise assumptions revisited: Regression metamodels and experimental designs for simulation practice. In: Perrone LF, Wieland FP, Liu J, Lawson BG, Nicol DM and Fujimoto RM (eds). Proceedings of the 2006 Winter Simulation Conference, New York, ACM, pp 107–117.
  20. Kleijnen JPC (2007). Regression models and experimental designs: A tutorial for simulation analysts. In: Henderson SG, Biller B, Hsieh M-H, Shortle J, Tew JD and Barton RR (eds) Proceedings of the 2007 Winter Simulation Conference, New York, ACM.
  21. Kleijnen JPC (2008). Design and Analysis of Simulation Experiments. Springer Science+Business Media, New York.
  22. Kleijnen JPC, Cheng RCH and Bettonvil B (2001). Validation of trace-driven simulation models: Bootstrapped tests. Mngt Sci 47(11): 1533–1538.
  23. Kleijnen JPC, Cremers P and Van Belle F (1985). The power of weighted and ordinary least squares with estimated unequal variances in experimental designs. Commun Stat Simul Comput 14(1): 85–102. | Article |
  24. Kleijnen JPC and Deflandre D (2006). Validation of regression metamodels in simulation: Bootstrap approach. Eur J Opl Res 170: 120–131. | Article |
  25. Kleijnen JPC and Helton JC (1999). Statistical analyses of scatter plots to identify important factors in large-scale simulations, 1: Review and comparison of techniques. Reliab Eng Syst Saf 65: 147–185. | Article |
  26. Kleijnen JPC, Karremans PCA, Oortwijn WK and Van Groenendaal WJH (1987). Jackknifing estimated weighted least squares: JEWLS. Commun Stat Theory Meth 16: 747–764. | Article |
  27. Kleijnen JPC, Kriens J, Timmermans H and Van den Wildenberg H (1989). Regression sampling in statistical auditing: A practical survey and evaluation (including Rejoinder. Stat Neerl 43: 193–207, 225.
  28. Kleijnen JPC and Van Beers WCM (2004). Application-driven sequential designs for simulation experiments: Kriging metamodeling. J Opl Res Soc 55: 876–883. | Article |
  29. Kleijnen JPC and Van Groenendaal WJH (1992). Simulation: A Statistical Perspective. John Wiley: Chichester.
  30. Kleijnen JPC and Van Groenendaal WJH (1995). Two-stage versus sequential sample-size determination in regression analysis of simulation experiments. Am J Math Mngt Sci 15: 83–114.
  31. Law AM (2007). Simulation Modeling and Analysis, 4th edn, McGraw-Hill: Boston.
  32. Lehmann EL (1999). Elements of Large-Sample Theory. Springer: New York.
  33. Lunneborg CE (2000). Data Analysis by Resampling: Concepts and Applications. Duxbury Press: Pacific Grove, CA.
  34. Rao CR (1959). Some problems involving linear hypothesis in multivariate analysis. Biometrika 46: 49–58.
  35. Rao CR (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability I: 355–372.
  36. Ruud PA (2000). An Introduction to Classical Econometric Theory. Oxford University Press: New York.
  37. Salibian-Barrera M (2006). Bootstrapping MM-estimators for linear regression with fixed designs. Stat Probab Lett 76: 1287–1297. | Article |
  38. Saltelli A and Sobol IM (1995). About the use of rank transformation in sensitivity analysis of model output. Reliab Eng Syst Saf 50: 225–239. | Article |
  39. Simpson TW, Booker AJ, Ghosh D, Giunta AA, Koch PN and Yang R (2004). Approximation methods in multidisciplinary analysis and optimization: A panel discussion. Struct Multidis Optim 27(5): 302–313.
  40. Van Beers WCM and Kleijnen JPC (2006). Customized sequential designs for random simulation experiments: Kriging metamodeling and bootstrapping. Working Paper, Tilburg University, Tilburg, the Netherlands.
Top

Acknowledgements

I thank the editor and an anonymous referee for comments that helped me to improve the presentation in this article.