1. Introduction

In the modern business world, there is high demand for the ability to create and deliver effective and useable simulation models to aid business decisions in uncertain environments. Simulation experts have noted many factors that affect the ability to develop and deploy useful models within an appropriate timescale and within budget. However, it is accepted that one of the serious obstacles to achieving this objective is inefficient data collection (Perera and Liyanage, 2000), with up to 40% of project time being spent on data gathering and validation (Trybula, 1994). Inefficient data collection may also have an impact on the model design. A case in point is the situation where a modeller quickly designs a relatively complex model with limited data (or even without any data) and only then collects the required data. This dramatically increases the possibility of having to alter the original design, because the assumptions and limitations built into the model are significantly affected by the quality and availability of the required data. On the other hand, a modeller could also be trapped into thinking that all data have to be collected before a model can be built. This increases the potential to collect more data than needed, hence wasting scarce resources (time, money and personnel) on a simulation project. Therefore, a good data collection methodology and its place within the overall simulation modelling methodology are important.

The identification and collection of data are closely linked to the simulation’s conceptual model. Although there has been an increase in the number of researchers who address the design of simulation conceptual models, there seems to be a paucity of research into the actual collection of data for use in simulation. The objective of our research is to understand more about the issues surrounding the identification and collection of data for simulation projects.

We define data identification in simulation modelling as a process that highlights the required data sets and their desired properties, such as accuracy, sample period and format, to allow the simulation model to achieve the project objectives. Data collection is therefore defined as a process to obtain the identified data sets that meet the desired properties, whether from a client directly or through direct collection. The main contribution of this paper is to propose a data collection method based on our observation of how consultants at a management consultancy firm, Operational Research in Health Ltd (ORH Ltd), identify and collect data for simulation projects. Our research also reveals variation in the data identification process, which is influenced by the experience of the modellers and their usual role in a simulation project. The research data was collected through participant observation and a mock case scenario.

The rest of this paper is organised as follows. In Section 2, we present the overview of data collection methodology in main simulation textbooks and case-based articles. This is followed by a discussion on the related work in data collection methodology for simulation projects. This section highlights the lack of work in this area. Section 3 explains our research methodology and the results are presented in Section 4. Based on the data collected during the research, we discuss our findings and contributions in Section 5. Finally, we end our paper with the conclusion in Section 6.

2. Literature review

Skoogh and Johansson (2008) note the lack of structured data collection methodologies in simulation and state the need for such methodologies to support the more traditional working procedures followed. This lack is rather unfortunate since data quality and availability have been mentioned as two of the challenging issues in many simulation projects. For example, Onggo et al (2010) and Gunal et al (2008) documented simulation projects in the European Commission and UK police forces, respectively, where data quality had been an issue for project delivery. In this section, we review the existing literature to find out how data has been collected and how data issues have been addressed in simulation projects.

2.1. Data collection in simulation textbooks and case-oriented articles

The main simulation textbooks discuss data collection very briefly and almost always as a sub-section rather than devoting considerable space to deeper discussion. Pidd (2004) only commented on data collection issues and methodologies as a side topic. Law (2007) and Banks et al (2005) devoted two pages or less to data identification and collection. Robinson (2004) was an exception: he dedicated six pages to data requirements and collection. Based on the requirements, he groups data in simulation modelling into three types: contextual data to understand the problem situation, data for model realisation and data for model validation. Based on availability, he groups the data into three categories: available data, data that is unavailable but can be collected, and data that is not available and cannot be collected. Apart from Robinson (2004), this shows the lack of inclusion of data collection methodology within key simulation textbooks.
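The two classifications attributed to Robinson (2004) can be made concrete by tagging each data item with a purpose and an availability status. The following Python sketch is purely illustrative: the class names, field names and example items are assumptions introduced here for exposition and are not taken from Robinson (2004).

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    """Why the data is needed (three types based on requirements)."""
    CONTEXT = "contextual data to understand the problem situation"
    REALISATION = "data for model realisation"
    VALIDATION = "data for model validation"


class Availability(Enum):
    """Whether the data can be obtained (three categories based on availability)."""
    AVAILABLE = "available"
    COLLECTABLE = "unavailable but can be collected"
    UNCOLLECTABLE = "not available and cannot be collected"


@dataclass(frozen=True)
class DataItem:
    name: str  # hypothetical label for a requested data set
    purpose: Purpose
    availability: Availability


def group_by_availability(items):
    """Return a dict mapping each availability category to item names."""
    groups = {status: [] for status in Availability}
    for item in items:
        groups[item.availability].append(item.name)
    return groups


# Hypothetical items, invented only to exercise the sketch.
items = [
    DataItem("call volumes per hour", Purpose.REALISATION, Availability.AVAILABLE),
    DataItem("staff break patterns", Purpose.REALISATION, Availability.COLLECTABLE),
    DataItem("historical queue lengths", Purpose.VALIDATION, Availability.UNCOLLECTABLE),
]
grouped = group_by_availability(items)
```

Keeping purpose and availability as separate tags reflects that the two groupings are independent: an item needed for validation may fall into any of the three availability categories.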

Given the lack of guidelines for data collection in the main simulation textbooks, we turn our attention to case-oriented simulation articles. We search for papers on simulation applications from the INFORMS Interfaces Journal from 2006 to 2011. This journal is chosen because it publishes articles that apply Operational Research techniques to real-world cases. We select articles that have the word ‘simulation’ in their title or abstract. From this list, we choose articles that show the application of simulation as their main theme. Table 1 details the papers reviewed, a description of the objective of the simulation model, and their references to the data collection stage along with any problems identified during the stage. The articles in the table are not meant to be representative of all simulation applications. The table only shows how the authors report on the data collection process and any related issues during the collection of data. Publications in academic journals usually focus on academic contributions; hence, it is understandable that the data collection process and the related issues surrounding the collection of data are not reported in detail. The table shows that most articles do not discuss the data collection process and the related issues in detail. Hence, they do not provide an alternative source of information for readers who would like to learn more about how to collect data in real simulation projects.

Table 1 Articles on simulation applications from Interfaces between 2006 and 2011

2.2. Related work in methods for simulation data collection

Perera and Liyanage (2000) conducted a survey on data collection practice. They highlighted seven major causes of inefficient data collection. Poor data availability was perceived to have the most impact on delivering effective data collection practices. The next major cause was the requirement for high-level model details. However, this could be down to choosing an inappropriate level of detail. These two major causes are related because striving for extremely high model detail may lead people to believe that data availability is the top data issue in simulation. Robinson (2004, p 68) mentions the difficulties in finding quality data when model detail is increased, supporting a possible relationship between availability and model detail issues. The third-ranked data issue involved identifying available data sources within the system being modelled, owing to the existence of multiple sources or the indirect nature of many data categories. This survey shows that data collection is one of the major issues in simulation modelling. A more recent survey also confirmed that data issues were common and had significant impact on project outcomes (Onggo et al, 2013).

Research into data collection methodology in simulation has been dominated by automatic data collection, mainly in manufacturing. Lehtonen and Seppala (1997) looked at data gathering methodologies in a logistics simulation project, although they commented only briefly on the collection and focused on the analysis of data. Perera and Liyanage (2000) carried out work on a methodology for data identification and collection in manufacturing systems, proposing several specific solutions for the industry using IDEF-based methodologies. Robertson and Perera (2002) discussed the use of an automated data collection process for simulation by integrating the data collection system with Enterprise Resource Planning software.

There has been research into a more generic data collection methodology. Skoogh and Johansson (2008) proposed a detailed generic methodology for input data management during the early stages of a discrete event simulation (DES) project life cycle. They produced a detailed process map to describe 13 steps taken throughout the collection and validation of data for use in simulation. The first two steps identify and define relevant parameters and available data. There is a great emphasis on selecting an appropriate level of detail for each parameter to ensure that high system complexity is captured within the model. The third step in the process relates to selecting methods to obtain the required data sets, both for easily available and unavailable data. This step follows the guidelines set by Robinson (2004, p 98) for three ways of collecting unavailable data: discussions with subject matter experts, review of historical data or the use of standardised data from process libraries. This is followed by the fourth step, which decides whether all specified data will be found, either collected or estimated. If it is agreed that all the specified data can be collected, then the next step is to create data sets and start a validation process, checking for sufficient representations before producing final documentation.

The work of Skoogh and Johansson (2008) has been incorporated by Bengtsson et al (2009) when investigating the input data management within DES projects. Bengtsson et al (2009) take the methodology proposed and extend the data input management process into storing and converting data easily for further DES projects.

3. Research methods

In this research, we are interested in understanding two related practices: (1) how simulation practitioners identify data for simulation projects and (2) how simulation practitioners collect data for simulation projects. These two research questions concern how practitioners make sense of the identification and collection of data for simulation projects and aim to extract the tacit knowledge of the practitioners. This ‘sense-making’ process, with the intention of increasing our general understanding of the two practices, lends itself epistemologically to a social constructionist view (Easterby-Smith et al, 2012). Aligned with the constructivist view, this research uses a qualitative research approach. The research was conducted at ORH, a management consultancy firm based in Reading (UK) specialising in healthcare.

3.1. Identification and collection of data for simulation projects

In this study, we used the participant observation method to collect the research data. The observation was carried out over one year. This method allows us to extract the tacit knowledge accumulated by the consultants in identifying and collecting data for simulation projects by observing their practices in a number of simulation projects. The observation was conducted by one of the authors, Hill, who worked as an intern for a year as part of his degree at Lancaster University. This allowed him to interact and build relationships with the consultants at ORH. He participated in a number of consultation projects during the internship. The experience gained from the projects and his observation of how the consultants identify and collect data for simulation projects have given us a good understanding of the data identification and collection practice at ORH.

The next step was to test our understanding by applying it to a real project. We picked a project called ‘Assessing Call-Taker Levels’ that involved recommending new staffing levels for ambulance control rooms to deal with increasing calls and to incorporate the installation of a new triage system that had the effect of increasing call durations. The project was conducted for a UK Ambulance Service. In this project, Hill conducted a full data collection life cycle for a real-world project based on our understanding of ORH’s data collection practice. Hill undertook the roles of analyst and modeller while being overseen by a senior consultant.

During the project, Hill recorded his experience in a diary. The diary is organised into several plan-act-reflect cycles, inspired by the action-research cycles (Lewin, 2005). At the plan stage, we identify the nature of the problem situation, including all interrelated factors, develop a working theory about the situation, and specify actions that should alleviate the situation. This is followed by the act stage, in which we undertake actions in the agreed area of application in line with the plan and establish whether the theoretical effects of the action were realised. Finally, at the reflect stage, we reflect on what has been achieved in terms of both practical outcomes and new knowledge, and whether a new cycle is required. The detailed journal entries can be found in Hill and Onggo (2012).

3.2. Variations in data collection

Although the participant observation method allows us to understand the data identification and collection process, it does not easily allow us to observe any variations in the data identification and collection practice that can be attributed to the consultants’ level of experience. This is because they have been working on similar projects (ie healthcare related), so they seem to switch automatically to the usual data identification and collection process. Furthermore, they usually work in a group of three or four consultants with different levels of experience. Hence, it is difficult to see the effect of their level of experience on their data collection practice.

To collect the data for this research, we conducted an experiment using a mock scenario. In this experiment we observe how the consultants identify the required data for the simulation model. The experiment should allow us to identify any variation in the data identification process (but not the data collection process) that can be attributed to the consultants’ level of experience. The experiment will not show us the variation in the data collection process unless we observe the process over a period of time. This method is not practical since it will use a significant amount of consultants’ working time for a mock project.

When designing the mock scenario, careful consideration was placed on ensuring the situation created was realistic and covered all aspects of a real-world problem along with being feasible when considering the availability of consultants. Therefore, as the consultants need to have an understanding of the model being used in order to create a list of data requirements for any simulation project, the scenario was designed to not only capture the steps taken during the data identification stage but the whole model formulation process. The main benefit from taking this approach is the freedom of thought placed on the consultants, allowing a methodology of their own to be used ensuring no bias is incorporated into the task by providing a model to use.

The scenario used was chosen purposefully to lie outside the expertise of consultants to ensure their personal methodologies were adopted rather than them switching into an automated set of procedures. Despite this, possible topics were chosen which the consultants had sufficient knowledge of to allow the research to be meaningful and representative of a simulation project. As well as the scenario topic, it is crucial to set the difficulty at the right level to allow the consultants’ skills and practices to be used and therefore captured during the session. Consideration was placed on having a high difficulty level while ensuring the time frame required for consultants to complete the task was suitable to the environment in which it was being set. Factors including likely availability, willingness to participate and room availability were all considered when structuring the chosen problem. Taking all these factors into consideration, the following problem was designed:

Otting Reed Hurst train station lies on the Best Western line running from Manchester to London. A major renovation of the train station is planned for March 2012 and ticket administration is one area of particular interest for reducing expenditure and improving customer satisfaction. You are working for Steamline Consultancy Ltd and have been asked by the director of station operations to improve the ticket administration service at Otting Reed Hurst station through the below question. What is the required number of manned ticket desks and running self-service ticket machines at Otting Reed Hurst train station for each hour of the day across each day of the week? Please work as a team to investigate the above problem and create a set of data requirements needed to produce a solution through the use of simulation. When thinking about the required data, please also consider the design of simulation model to be used. Try to work in structured manner thinking about all aspects of the problem.

It can be seen that the scenario had a clear objective, similar to a real simulation project, with sufficient instructions and context. As well as the content of the problem, the structure of the teams who would complete the task had to be considered. Since we are interested in the variations that can be attributed to the consultants’ level of experience (and hence, their usual job role in a simulation project), consultants were divided into three groups based on their experience as follows.

  • Group 1: four consultants with less than 2 years of experience.

  • Group 2: four consultants with between 2 and 5 years of experience.

  • Group 3: five consultants with more than 5 years of experience.

The structure of conducting the mock scenario had to be designed around work commitments and availability of rooms as mentioned above. After evaluating the possible duration of each session, a session length of 60 min was chosen, with 5 min for introducing the task, 5 min for the participants to familiarise themselves with the problem, 40 min to work through the problem and a final 5 min for debriefing and questions. During the 40 min, consultants had the use of note paper and a white board. All notes were collected and used in analysis.

4. Results and analysis

4.1. Results from participant observation

Based on our observation, we produced a flowchart that shows the data identification and collection process used by ORH for its simulation projects (see Figure 1). Typically, ORH will submit a data specification form, outlining the format, the required fields for all data sets, and the sample period. Shortly after receiving the request, the client will comment on the availability of data within their organisation, which influences the approach taken to collect and use representative data. Note that we differentiate between collecting the available data from the client (ie get data from client) and collecting the unavailable or partially available data (ie collect data). The process described in Figure 1 is triggered whenever there is a need for data during a simulation project. When the data identification and collection process ends, ORH will present the findings to the client. If the client agrees with the findings, then the data will be used. Otherwise, they will discuss options to resolve the data issues. Figure 1 also shows that, depending on the availability status of the required data, there are three main process pathways.

Figure 1 Data identification and collection process.

First, if the specified data is fully available and meets all requirements given in the data specification form, ORH will progress to collect the said data and carry out investigative analysis, which is then presented to the client for data validity checks. An example of this was found during the collection of planned and actual call taker numbers for each hour of the day across the week. The client could readily provide the said data for two of its control rooms. However, as evident from the journal entry (Hill and Onggo, 2012), further analysis showed that there was an increase in call taker hours throughout the final few months of the samples. The project team re-analysed the data and checked its validity with the client. They explained to the client how this might affect the simulation result. As a result, with the client’s agreement, the sample size was cut to ensure representative data was being used by removing the outliers.

The second possibility is when data is available but does not meet all requirements. In this case, ORH will assess whether the data can still be used in modelling (although this may lead to altering the model design or adding more assumptions). If not, ORH will weigh the amount of time required to collect the data against the overall benefit of having the data available to use in modelling. An example of this was found when trying to collect call taker hours for one of the three control rooms (Hill and Onggo, 2012). As extreme levels of time and effort would have had to be spent manually collecting and then processing the data, both on the client side and on ORH’s side, it was decided with the client to collect a smaller sample of less granular data. This approach ensured the project was not delayed and sufficient data had been collected to continue modelling with small assumptions in place.

Finally, if the requested data is not available, ORH takes a different approach. Discussions are held with the client to investigate whether any alternative data is available to collect. If so, the iterative cycle described above is started and continues until the client is happy with the findings. However, if no alternative data can be found, two options are available: use assumptions or adapt the modelling approach to deal with less granular data. An example of the use of assumptions was found during the collection of outgoing call duration (Hill and Onggo, 2012). Because the outgoing call data for the requested sample period was unavailable, with the client’s agreement, data from the previous year was used for outgoing calls. In this situation, past project experience was used to aid the development of assumptions that could be used in the simulation model. An example of the use of less granular data was found when data was collected for new call durations under the triage system (Hill and Onggo, 2012). ORH requested raw call data to estimate the distribution of inter-arrival times for incoming calls. It was found that the data was not available. However, less granular data, that is, number of calls in 15-min blocks by call category, was available. It was decided that using the less granular data was more desirable than collecting extra data, due to tight time constraints and report deadlines. It should be noted that if data sources are unavailable or assumptions are used, fewer parameters are available for analysis, optimisation and sensitivity modelling, hence reducing the possible impact of the project. In the project, through not being able to obtain a call taker unavailability factor within Client Y’s operations, a possible optimisation parameter was lost because the factor was used for model validation as a result of poor confidence in the supplied data.
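The three pathways can be summarised as a small decision function. The Python sketch below is a simplified reading of the process, not ORH’s actual procedure: the function name, the boolean parameters and the wording of the returned actions are assumptions chosen for illustration.

```python
def next_action(available: bool,
                meets_requirements: bool = False,
                still_usable: bool = False,
                alternative_exists: bool = False) -> str:
    """Pick the next step for one data request (illustrative only)."""
    if available and meets_requirements:
        # Pathway 1: data fully available and meets the specification.
        return "get data from client, analyse it and validate with client"
    if available:
        # Pathway 2: data available but not meeting all requirements.
        if still_usable:
            return "use data, altering model design or adding assumptions"
        return "weigh collection time against modelling benefit"
    # Pathway 3: requested data not available.
    if alternative_exists:
        return "restart the cycle with the alternative data"
    return "use assumptions or adapt the model to less granular data"


# One call per pathway, using hypothetical availability answers.
pathway_1 = next_action(available=True, meets_requirements=True)
pathway_2 = next_action(available=True, meets_requirements=False)
pathway_3 = next_action(available=False, alternative_exists=False)
```

In every pathway the outcome is still presented to the client for agreement, so the function describes only the choice of route, not the validation loop around it.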

4.2. Results from the experiment

In total, 508 sentences were coded across the three groups, with 140, 177 and 191 sentences captured in groups 1, 2 and 3, respectively. The coding followed the guidelines set out in Willemain (1995), who divided modelling topics into Context, Structure, Realisation, Assessment and Implementation. In model context, modellers identify relevant data that helps them understand the context of the modelling work. In model structure, the data that is required to understand the input, output and structure of the model needs to be identified. The data for model realisation is related to the estimation of model parameters and analysis of model outputs. For model assessment, modellers need to identify the data needed for model validation. These four topics are particularly relevant to the data identification process. Model implementation concerns the use of the model by the client; hence, it is not relevant to the objective of our research.

Where a sentence falls outside the five topics used, it is disregarded and not used in the analysis. An example of this would be a participant discussing the weather or commenting on a topic unrelated to the exercise. It could be argued that by not including these sentences in the total sentence count, a proportion of the participants’ time is being missed. However, as the research concentrates on the time spent discussing the problem, including the ‘irrelevant’ sentences would skew the results and produce inaccurate findings of how experts split their attention.
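The tallying step can be sketched as follows, assuming that each coded sentence counts equally towards the share of attention given to a topic. The function name and the sample labels are hypothetical and are not the study’s data; only the group sentence totals are taken from the text.

```python
from collections import Counter

# The five modelling topics from Willemain (1995), as named in the text.
TOPICS = ("context", "structure", "realisation", "assessment", "implementation")


def topic_proportions(coded_sentences):
    """Share of on-topic sentences per topic; other labels are dropped."""
    counts = Counter(label for label in coded_sentences if label in TOPICS)
    total = sum(counts.values())
    if total == 0:
        return {topic: 0.0 for topic in TOPICS}
    return {topic: counts[topic] / total for topic in TOPICS}


# Hypothetical labels for illustration only.
sample = ["context", "structure", "off-topic", "realisation", "structure"]
shares = topic_proportions(sample)

# Sentence totals per group as reported in the text.
group_totals = {1: 140, 2: 177, 3: 191}
total_coded = sum(group_totals.values())
```

Dropping the off-topic label before dividing is what keeps the shares summing to one over the five topics, which mirrors the decision to exclude unrelated sentences from the total sentence count.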

The result (Table 2) is not significantly different from Willemain (1995) except for model realisation. The difference for model realisation could be explained by the fact that the participants were aware of our interest in the identification of data and in their experience most of the data are used in model realisation. This seems to affect the amount of time that is dedicated to analysing the data for parameter estimation and output analysis. Hence, we need to keep this in mind when interpreting this finding. Our result is more homogeneous than Willemain (1995) because the participants in our experiment are more homogeneous (ie from the same company).

Table 2 Proportion of time for each modelling topic

Figures 2, 3 and 4 show a time series chart for each group describing the sequence in which topics were discussed. From the three charts, it can be noted how each group discussed each topic in a slightly different order. Model structure was considered throughout the task by group 2, compared with groups 1 and 3 who discussed model structure in blocks throughout the task. Group 3 also differs from the other two, when looking at their approach to discussing model implementation, with group 3 not visiting the topic until late in the second quarter, unlike groups 1 and 2 where emphasis was placed throughout the task. When considering the time devoted to model context, all three groups are different; group 1 only considers model context in two blocks, both in the first half of the session. Comparing this with groups 2 and 3, it is seen that their time is spread across the task, with the majority of time spent in the earlier periods. Although it can be said a similarity lies between groups 2 and 3 on their time devoted to model context, it should be noted how group 3 places a high proportion of its time in the earlier stages to model context. In comparison, group 2 has a more balanced approach in the first half of the task with model structure and model realisation then becoming dominant in the latter stages. In summary, the result shows the approaches taken by each group vary which may be influenced by the level of experience and their usual role in a simulation project.

Figure 2 Sequence of topics discussed in group 1.

Figure 3 Sequence of topics discussed in group 2.

Figure 4 Sequence of topics discussed in group 3.

5. Discussion

It should be noted from the outset that we use a qualitative research approach in line with the constructivist research philosophy. Hence, our findings should not be taken as the single absolute truth (as in positivism). However, our findings are useful because they reflect the reality of the consultants participating in this study. In this discussion, we will highlight our contributions.

5.1. Data identification and collection process for simulation projects

This paper contributes to research into simulation practice by producing a guideline for data identification and collection in simulation projects. The guideline, shown in Figure 1, is based on our observation of the practice of consultants at ORH. This practice has been formed through the regular exchange of best practices among the ORH consultants. Being a small firm, ORH has seen high levels of interaction between its staff, which provides a conducive environment for such knowledge sharing.

The guideline is consistent with the three data categories, based on availability, discussed in Robinson (2004). We have also applied the guideline to a real project. Furthermore, the guideline is an extract of what has been considered to be the best practice at ORH. Therefore, it has been used in simulation projects done by ORH consultants. Based on these, we can conclude that the guideline is flexible and robust, allowing for all situations faced in the simulation projects observed during the research. Although this guideline is well tested, consistent with our constructionist view, it may not be the only one that works.

Based on our observation, we can divide the data issues into availability, structure and quality. The data availability status is consistent with the three categories mentioned in Robinson (2004). The structure of data can be an issue where the data is available in a format that cannot be readily used for analysis using computers. This includes mental data (tacit knowledge), qualitative data, quantitative data in hard copy, and raw electronic data that is spread across various systems or must be processed before it can be used. Finally, the quality of the data refers to its validity, completeness and correct level of detail.

Although we can conclude that the guideline is quite robust, the number of cycles during the data collection process could be reduced by introducing a preliminary investigation on how data has been collected and stored. In our example, in cases where the staffing data was not available for one of the control rooms or where multiple internal sources had to be combined to produce the call taker hours, had ORH been aware of the issue prior to submission of a data request, alternative actions could have been taken for the collection of data. The first example (staffing data) shows us that we cannot assume that data is available just because we think the client must have it for its operation. The second example (multiple data sources) teaches us that the process of combining data from multiple sources can take more time than anticipated and can affect the total amount of time spent in the data collection stage. Not only were delays in project progression a consequence, the validity and confidence in data could be reduced as a result of losing validation data in the process of combining data sources.

A preliminary investigation should be able to detect the data availability issue at the expense of adding one more step to the process. However, given that the delay between the submission of the data specification form and the reply from the client usually takes longer than the added time needed for the preliminary investigation, this approach may reduce the amount of time needed for the data collection stage in a simulation project.
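The trade-off can be illustrated with a deliberately simple elapsed-time model. The sketch below assumes that each request cycle costs one client reply delay and that a preliminary investigation removes at least one cycle; the function name and all figures are hypothetical and are not measurements from the project.

```python
def collection_time(cycles: int, client_reply_delay: float,
                    preliminary_time: float = 0.0) -> float:
    """Elapsed time: optional preliminary step plus one delay per cycle."""
    return preliminary_time + cycles * client_reply_delay


# Hypothetical figures in working days, for illustration only.
without_prelim = collection_time(cycles=3, client_reply_delay=10.0)
with_prelim = collection_time(cycles=2, client_reply_delay=10.0,
                              preliminary_time=4.0)
saving = without_prelim - with_prelim
```

Under these assumptions the preliminary investigation pays off whenever its duration is shorter than the reply delays of the cycles it removes, which is the condition described in the paragraph above.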

5.2. Variation in data identification practice

The experiment using a mock case study reveals an interesting result. Given that the coding in our experiment is based on the aggregate of individuals within each group, it should reduce the effect of personal style on the observed differences between groups. Furthermore, our experiment is based on consultants within a small management consulting company with high levels of interaction between each other. Consequently, we would expect them to follow similar approaches to modelling. However, we can still see some variations between groups. This could be explained by the difference in the level of experience and the participants’ usual role when undertaking a simulation project. The more experienced or senior consultants (groups 2 and 3) tend to spend more time understanding the context of the modelling work. This is partly because they usually take more responsibility for the activities related to model context, such as getting to know the clients and their needs and understanding the reasons why the modelling project is needed. The mid-level consultants (group 2) spend more time on model structure, which is consistent with their usual role in a simulation project, that is, data analysis and conceptual modelling. On the other hand, the junior consultants (group 1) tend to spend more time on model implementation than the more senior consultants, which is related to activities such as training the clients on how to use the model and making sure the model is used correctly for decision making.

The detailed analysis of the conversation in each group reveals further differences in terms of the identification of data needed for the simulation. The junior consultants (group 1) concentrated more on what data would be useful and the accuracy requirements of inputs prior to discussing model structure and model inputs. However, after briefly considering the model structure, the emphasis was placed again on the accuracy of data required, but this time with specific model inputs in mind. Because the group visited the accuracy of data requirements twice, once before defining inputs and once after, some conversations and decisions were repeated. Considerable time was then spent on discussing the approaches to collect data with the right accuracy. The discussion was conducted even before the specific data source was identified. Finally, the group discussed how the model and data would be used to aid the client and also questioned whether the right level of detail could be gathered from the chosen sources. However, the group did not discuss alternative approaches to sourcing the required input data in case the preferred route proved impossible.

Group 2, containing consultants with between 2 and 5 years of simulation experience, focused heavily on the requirements and structure of the chosen model before identifying any data requirements. In order to identify the data required, the group defined variables that would be used in the model and then derived a set of data requirements in order for the variables to be modelled at the right accuracy level. The final stages used by the group concentrated on grouping the required data into different categories, based on the level of importance for a solution to be completed. Group 2 was the only group to include a step of checking the proposed model structure with the client before specifying a data list.

Consultants in group 3 have more than 5 years of experience. They took an entirely different approach from the other two groups. The group spent a large proportion of the time discussing what sources of data are available without considering how each data source would be used and more importantly what specific data items would be requested. Variables were then defined after listing all data sources, before methods for collecting the data were chosen. It was only after this stage that the structure of the model was discussed along with the requirements of the output.

5.3. Data validation

From the experience gained in this research, we observed that, although the process was lengthy, ORH managed to ensure that the data were as correct as feasibly possible through good client interaction and their industry knowledge of Emergency Services. For example, when collecting incoming call data across the control rooms, ORH used data available for ambulance incidents to cross check that levels of incoming calls to control rooms were of a realistic proportion, using past projects as benchmarks. This shows that having the know-how and experience to use ‘benchmarking’ style analysis can help to build confidence in data sets and offer alternative data sources. Hence, although we agree with researchers (eg, Sargent, 2005) that ensuring data validity is difficult, we believe that data validity could be improved by building good client interactions and developing sufficient knowledge of the specific domain.
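The ‘benchmarking’ style cross check described above can be sketched as follows. This is a minimal illustration, not ORH’s actual procedure: it assumes that the check compares the ratio of incoming calls to ambulance incidents in each control room against a range observed in past projects. The function name, the control room labels, the volumes and the benchmark range are all hypothetical.

```python
def flag_unrealistic_call_volumes(calls, incidents, benchmark_range):
    """Return control rooms whose calls-per-incident ratio falls outside
    the range seen in past projects (hypothetical benchmark)."""
    low, high = benchmark_range
    flagged = {}
    for room, n_calls in calls.items():
        ratio = n_calls / incidents[room]
        if not (low <= ratio <= high):
            flagged[room] = round(ratio, 2)
    return flagged


# Hypothetical annual volumes for three control rooms.
calls = {"Room A": 52000, "Room B": 91000, "Room C": 30000}
incidents = {"Room A": 40000, "Room B": 35000, "Room C": 24000}

# Hypothetical benchmark: past projects showed 1.1 to 1.6 calls per incident.
flagged = flag_unrealistic_call_volumes(calls, incidents, (1.1, 1.6))
```

In this made-up example only Room B is flagged, which would prompt a follow-up question to the client rather than an automatic rejection of the data.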

5.4. Research methods

Finally, we highlight our contribution from the two research methods we use for this paper. The data identification and collection process takes a significant amount of project time. Hence, the participant observation method is suitable to capture this process. Participant observation over an extended period of time is rarely used in simulation research. This paper shows that the participant observation method can be a useful tool for understanding the use of simulation modelling in practice. The use of an experiment using a mock case study is not new in simulation research (eg, Tako and Robinson, 2010). This paper further highlights the importance of the choice of mock case scenario. In our case, we needed to make sure that the participants did not fall back on an automatic standard procedure through high familiarity with the case. We therefore chose a case outside their usual domain but familiar enough for them to complete the task within the specified duration. This is because we wanted to observe any variation that can be attributed to the participants’ level of experience.

6. Conclusion

Data identification and collection in simulation is an important practical issue, yet the lack of research in this area indicates a gap between research and practice. This calls for closer collaboration between researchers and practitioners to tackle the issues surrounding the identification and collection of data in simulation projects. Robinson (2002) identified the importance of the availability and accuracy of data within his 19 dimensions of simulation project quality, noting that poor-quality data lead to a poor model and ultimately an unsuccessful simulation project. This reinforces the importance of having an effective method for the identification and collection of data. This research is a step that we have taken to study this important practical issue of data identification and data collection in simulation practice.

Our research has shown that ORH Ltd has a logical and robust methodology in place for the collection of specified data. Their methodology allows the data collection stage to work around the specific clients’ needs and varying data availability levels. The data collection methodology has been presented in this paper. It should provide a guideline to practitioners in the data identification and collection for simulation projects. Based on our evaluation using a real project, we propose to add an extra step that identifies the availability of data sources prior to requesting data. This step should avoid many of the problems experienced in the project. We believe that including a preliminary investigation of available data sources in the initial phases would allow more realistic and achievable data to be requested and could reduce the time spent in the data collection stage. Further work will be required to evaluate the cost and benefit of the additional step.

Although all participants were exposed to the same data collection methodology, the experiment using a mock scenario revealed two clear attention profiles during the identification of required data. The first profile is one of high attention to structure with minimal focus across context, assessment and implementation. This approach is heavily set towards continuous reference to structure throughout the process. In comparison, the second profile focuses equally across context, structure and realisation with little attention to assessment and implementation. This result shows the variations in the approaches used by practitioners to identify data. These variations could be influenced by their level of experience and their usual role in a simulation project.
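An attention profile of this kind can be thought of as the share of a group’s coded discussion that falls into each topic. The sketch below is only an illustration of that idea: it assumes the transcript has already been coded into the five topics named above and that each coded segment counts equally, which may differ from the coding unit used in the experiment. The function name and the example counts are hypothetical and do not reproduce the study’s data.

```python
from collections import Counter

# The five modelling topics referred to in the text.
CATEGORIES = ("context", "structure", "realisation", "assessment", "implementation")


def attention_profile(coded_segments):
    """Return the share of coded segments per topic for one group.

    Segments with a code outside CATEGORIES are ignored.
    """
    counts = Counter(coded_segments)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        raise ValueError("no segments coded with a known category")
    return {c: counts[c] / total for c in CATEGORIES}


# Hypothetical coded transcript for a group that keeps returning to structure.
segments = (
    ["structure"] * 12
    + ["context"] * 3
    + ["realisation"] * 3
    + ["assessment"]
    + ["implementation"]
)
profile = attention_profile(segments)
```

Comparing such profiles across groups gives a simple numerical view of where each group’s attention was concentrated.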