Introduction and background

To demonstrate how data mining techniques might be systematically applied to Comparative Effectiveness Research (CER), we obtained data from a large national health insurance company and used the data mining toolset available from SAS Institute's Enterprise Miner software (SAS Institute, Cary, NC). The intention in our example was to see if machine learning could be used to generate new, interesting hypotheses regarding available oral anti-diabetic medications used to treat type 2 diabetes.

It is widely recognized that data mining of anonymized administrative data, such as longitudinally linked medical and pharmacy claims with the associated benefit eligibility records, can and does play an informal role in hypothesis generation for CER and other areas of inquiry (Schneeweiss, 2007; Berger, 2009; Sullivan & Goldmann, 2011). However, these techniques have seldom been applied systematically. The healthcare system as a whole, and CER specifically, can extract greater benefits from these technologies if a more focused and intentional approach is taken.

Type 2 diabetes is a major contributor to the growing problem of healthcare costs. According to the Congressional Budget Office, spending on healthcare as a percentage of GDP roughly tripled between 1960 and 2005 (CBO, 2007). It could more than triple again by 2082 to consume nearly half of GDP unless something is done to rein in costs. Diabetes is one of the leading causes of morbidity and mortality in the United States because of its role in the development of co-morbid conditions such as ophthalmic disease, kidney disease, and cardiovascular disease (National Institute of Diabetes and Digestive and Kidney Diseases [NIDDK], 2011). Adults with diabetes have heart disease and stroke death rates 2–4 times higher than adults without diabetes (NIDDK, 2011). According to the Centers for Disease Control (CDC, 2010), 23.6 million people or 7.8% of the population had diabetes in 2007, and if current trends continue then up to one-third of Americans alive today will develop diabetes before they die. In the same report, the CDC also finds that $116 billion in direct medical costs were expended to treat diabetes in 2007, and the economy lost another $58 billion indirectly due to the negative effects of the disease on economic productivity. But the cost is not economic alone: those who suffer from the illness lose an average of 10–15 years of life, and diabetes is the leading cause of blindness, kidney failure, and amputations (NIDDK, 2011).

Tremendous pressure has been placed on the U.S. government to control costs while still providing quality healthcare outcomes. In 2009 the American Recovery and Reinvestment Act of 2009 (U.S. Congress 2009) was enacted under the Obama Administration, and it included $145.7 billion for healthcare. Most of this funding is being used to support existing government programs that provide healthcare directly, such as Medicare, Medicaid and the Veterans Administration. However, $25.8 billion has been set aside for accelerating the adoption of health IT, with $1.1 billion of these funds directed at CER (Department of Health and Human Services [DHHS], 2010).

The DHHS created an official definition of CER (DHHS, 2009a, 2009b) and has proposed a framework (DHHS, 2009b) for embedding it into the American healthcare system. Their definition is:

Comparative effectiveness research is the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat and monitor health conditions in ‘real world’ settings. The purpose of this research is to improve health outcomes by developing and disseminating evidence-based information to patients, clinicians, and other decision-makers, responding to their expressed needs, about which interventions are most effective for which patients under specific circumstances.

• To provide this information, comparative effectiveness research must assess a comprehensive array of health-related outcomes for diverse patient populations and sub-groups.

• Defined interventions compared may include medications, procedures, medical and assistive devices and technologies, diagnostic testing, behavioral change, and delivery system strategies.

• This research necessitates the development, expansion, and use of a variety of data sources and methods to assess comparative effectiveness and actively disseminate the results.

The scientific community is constantly looking to understand relationships among symptoms, diseases, and treatment. Beginning in the 1960s, there were several attempts at building ‘expert systems’ that codified the diagnostic experience of experts into knowledge representations and algorithms to assist physicians with differential diagnosis (Pople, 1973, 1977; Shortliffe & Buchanan, 1975; Szolovits & Pauker, 1978). Given an initial list of symptoms, such systems engage in a dialog with a diagnostician to hypothesize the diseases that explain a given set of symptoms in a process characterized as abductive logic by Peirce (1883) in the 19th century and mechanized by researchers in early medical diagnosis systems in the 1970s (e.g. Pople). The first medical diagnostic system that covered the entire field of Internal Medicine was INTERNIST/CADUCEUS, which guided expert diagnosticians through an interactive dialog involving a differential diagnosis. This system recognized that diagnosis is a complex art, requiring the physician to combine heuristic medical knowledge and real-world experience into an inductive reasoning process, where the objective is to home in on the correct differential diagnosis with minimal questioning and invasive procedures. Such systems have been shown to be effective in helping with differential diagnosis when there are competing opinions among experts about the causes of symptoms, and also in helping experts avoid tunnel vision or an inappropriate premature diagnostic conclusion.

Businesses including casinos, banks, manufacturers, and retailers (Dhar & Stein, 1996) have been using data mining methods successfully for over a decade now to draw actionable conclusions from the data they collect as a by-product of commerce. Data mining is an exploratory approach to data analysis, where hypotheses can emerge from the data instead of having to be specified beforehand.

To leverage the expanding healthcare data assets in the U.S. to measure and explore the fundamental concepts of CER, the traditional approaches to scientific discovery used in the healthcare arena can benefit immensely when augmented with state-of-the-art data mining and machine intelligence technology. Methods from data mining have the potential to maximize the benefits of the existing and proposed repositories of healthcare information to improve quality and access while reducing costs. Equally important, since patterns often emerge before reasons for them become apparent, this approach can be very useful in providing the research community with early insights into costs and outcomes, thereby bringing attention to these areas. In our study, for example, our findings confirm some of the treatment regimens proposed by the American Diabetes Association (ADA), but also reveal unforeseen patterns that merit further consideration.

Data and methods

Anonymized longitudinally linked medical claims, pharmacy claims, and eligibility records for commercially insured patients from January 2004 through March 2010 were used to construct patient-level records for this analysis. Selected commercially insured patients were required to be 18 years or older and to have had a diagnosis for uncomplicated type 2 diabetes (Appendix A), that is, the ‘Trigger’, between 2004 and 2009 such that there were at least 270 days of medical and pharmacy benefits eligibility prior to the trigger (the ‘Clean’ period) and at least 450 days of medical and pharmacy benefits eligibility following the trigger (the ‘Treatment Selection’ and ‘Outcome Observation’ periods) – see Figure 1.

Figure 1 Graphical representation of study period phases.
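The period arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function names, the convention that the trigger date opens the Treatment Selection period, and the 270-day length of the Outcome Observation period (inferred here as 450 minus 180 days) are assumptions.

```python
from datetime import date, timedelta

# Period lengths in days. The 270-day Clean period, the 450 days of
# follow-up, and the 180-day Treatment Selection period come from the text;
# the Outcome Observation length is inferred (assumption) as 450 - 180.
CLEAN_DAYS = 270
FOLLOW_UP_DAYS = 450
SELECTION_DAYS = 180
OUTCOME_DAYS = FOLLOW_UP_DAYS - SELECTION_DAYS


def study_periods(trigger: date) -> dict:
    """Return (start, end) dates for each sub-period, end exclusive."""
    clean_start = trigger - timedelta(days=CLEAN_DAYS)
    selection_end = trigger + timedelta(days=SELECTION_DAYS)
    outcome_end = trigger + timedelta(days=FOLLOW_UP_DAYS)
    return {
        "clean": (clean_start, trigger),
        "treatment_selection": (trigger, selection_end),
        "outcome_observation": (selection_end, outcome_end),
    }


def is_eligible(trigger: date, coverage_start: date, coverage_end: date) -> bool:
    """True if one continuous coverage span contains the whole study window."""
    periods = study_periods(trigger)
    window_start = periods["clean"][0]
    window_end = periods["outcome_observation"][1]
    return coverage_start <= window_start and coverage_end >= window_end
```

A patient whose trigger falls too close to the start of the data (January 2004) fails the Clean-period requirement, which is why triggers are effectively confined to a narrower range of dates than the raw data span.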

Diabetes is often treated with multiple medications simultaneously. The 180-day Treatment Selection period (Figure 1) was constructed to allow adequate time to observe which oral anti-diabetic medications newly diagnosed type 2 diabetic patients selected. Furthermore, patients must have had no diagnosis for type 1 diabetes (Appendix B) and no insulin (Appendix C) treatments at any time during the entire 720-day study period. They must have had no diabetes-related complications (Appendix D) during the Clean and Treatment Selection periods. And they must have had no insulin and no oral anti-diabetic medications (Appendix E) during the Clean period. As a result of applying these inclusion and exclusion criteria to the data, 66,523 patients were included in the study. The resulting age–gender distribution and geographic distribution are given in Tables 1 and 2, respectively.

Table 1 Age–gender distribution of selected patients (n=66,523)
Table 2 Geographic distribution of selected patients (n=66,523)

Both outcome variables and explanatory variables were measured in the relevant study sub-periods (Table 3) for use in downstream data mining analyses. Outcome variables, measured during the Outcome Observation period, included total charges (including drug charges), medical charges (excluding drug charges), and a binary variable indicating whether or not the patient developed a diabetes-related complication. Explanatory variables included patient age group, patient gender, flags for each type of oral anti-diabetic medication observed during the Treatment Selection period, and two measures of the patient's pre-existing healthcare status: the number of unique 3-digit ICD9 diagnosis groupings during the Clean period, and the number of unique drug subclasses during the Clean period.

Table 3 Means (standard deviations) for selected variables during each of the three study sub-periods (n=66,523)
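The two measures of pre-existing health status amount to counting distinct groupings per patient during the Clean period. A minimal sketch follows, assuming raw ICD-9 codes arrive as strings such as '250.00' and that drug subclasses are already available as labels; the helper names are hypothetical, and the special handling of E-codes (whose category is four characters long) is our assumption about how a 3-digit grouping would treat them.

```python
def icd9_group(code):
    """Map a raw ICD-9 diagnosis code to its category-level grouping.

    Numeric and V-codes are truncated to three characters; E-codes keep
    four (assumption: the category of an E-code is 'E' plus three digits).
    """
    cleaned = code.strip().upper().replace(".", "")
    if not cleaned:
        return None
    width = 4 if cleaned.startswith("E") else 3
    return cleaned[:width]


def count_diagnosis_groups(codes):
    """Number of unique diagnosis groupings among a patient's codes."""
    groups = {icd9_group(c) for c in codes}
    groups.discard(None)
    return len(groups)


def count_drug_subclasses(subclasses):
    """Number of unique drug subclasses among a patient's pharmacy claims."""
    return len({s.strip().lower() for s in subclasses if s.strip()})
```

For example, codes 250.00 and 250.02 fall in the same grouping and count once, so repeated visits for one condition do not inflate the measure.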

For most patients, the ADA recommends that specific therapy combinations be applied in a stepwise approach until the patient's blood A1C (glycated hemoglobin) is measured at less than 7% (Inzucchi, 2012). If A1C cannot be adequately controlled within 3–6 months of diagnosis, then the ADA recommends adding insulin to patients’ therapy regimens (Inzucchi, 2012). We have excluded these patients (3612 patients) to focus on oral anti-diabetic therapies. Adding the complexity of insulin use is left for further research.

The ADA's consensus algorithm is depicted graphically in Figure 2. Each of these therapy regimens was observed in our study population, as were other regimens that are not mentioned in the algorithm. (Note that in this paper we use the term ‘biguanide’ synonymously with the diagram's use of ‘metformin’. Likewise we use the term ‘glitazone’ synonymously with the diagram's use of ‘TZD’ and ‘thiazolidinedione’.)

Figure 2 Graphical representation of the ADA's consensus algorithm. Source: Inzucchi (2012).

Dozens of therapy regimens were observed including monotherapy on a single compound, multiple therapies, and no therapies at all (which we interpreted to be a regimen of lifestyle changes only). A flag was constructed for each regimen seen in at least 0.5% of treated patients (i.e. excluding those with only lifestyle changes) using the drug utilization of the entire Treatment Selection period for each patient. In some cases regimens were grouped together by common components to form large enough categories. All other therapies were included in an ‘other’ category. The results of this classification of patients are presented in Table 4.

Table 4 Classification of patients by mutually exclusive therapy regimens observed during the treatment selection period (n=66,523)
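The classification rule can be illustrated with a short Python sketch. It is not the procedure used to build Table 4: the function names are hypothetical, component names are simply sorted alphabetically to form a label, and the manual grouping of some regimens by common components is not reproduced.

```python
from collections import Counter

LIFESTYLE = "lifestyle changes only"
OTHER = "other therapy NEC"


def regimen_label(drug_classes):
    """Build a regimen label from the drug classes seen for one patient."""
    classes = sorted(set(drug_classes))
    return "+".join(classes) if classes else LIFESTYLE


def classify_regimens(patients, min_share=0.005):
    """Assign each patient to one mutually exclusive regimen.

    patients maps a patient id to the drug classes observed during the
    whole Treatment Selection period. Regimens seen in fewer than
    min_share of treated patients are pooled into OTHER.
    """
    labels = {pid: regimen_label(drugs) for pid, drugs in patients.items()}
    treated = [lab for lab in labels.values() if lab != LIFESTYLE]
    counts = Counter(treated)
    keep = {lab for lab, c in counts.items() if c / len(treated) >= min_share}
    return {
        pid: lab if lab == LIFESTYLE or lab in keep else OTHER
        for pid, lab in labels.items()
    }
```

Note that the 0.5% threshold is applied to treated patients only, so the size of the lifestyle-changes-only group does not affect which medication regimens receive their own flag.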

Results

Medical charges

We start by looking at medical charges during the outcome observation period, which we take to serve as a proxy for the utilization of medical services. Drug charges are excluded because they are highly correlated with whether selected therapies still enjoy patent protection and therefore command higher prices than generically available alternatives.

Means and standard deviations of medical costs during the outcome observation period are presented in Table 5 by treatment regimen. The difference between each possible pair of means was tested using a t-test. Pairs where the null hypothesis that the difference between the means is equal to zero could be rejected at the P=0.05 level of significance are presented in Table 6.

Table 5 Mean annualized medical charges (and standard deviations) during the outcome observation period, by treatment regimen (n=66,523)
Table 6 Differences between mean annualized medical charges during the outcome observation period, by treatment regimen pairs whose differences are significantly greater than zero at the P=0.05 level (n=66,523)
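For readers who wish to reproduce pairwise comparisons of this kind from published means, standard deviations, and group sizes, the following sketch may help. The text does not say whether a pooled-variance or unequal-variance t-test was used; the sketch assumes the unequal-variance (Welch) form, and it approximates the p-value with the normal distribution, which is reasonable only for large groups. The numbers in the usage note are hypothetical and are not taken from Table 5.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value using the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical regimen summaries: (mean charge, standard deviation, n).
t, df = welch_t(12000, 30000, 5000, 10000, 25000, 5000)
reject_at_5_percent = two_sided_p_normal(t) < 0.05
```

Because charges are heavily right-skewed (standard deviations well above the means), a significant difference between means says little about the typical patient in either group.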

One might expect that patients treated with more complex regimens involving simultaneous use of multiple medications (which can be an added burden to patients) also have more severe disease, and that this would lead to higher utilization of the healthcare system and higher medical charges during the outcome observation period, independent of treatment selection. Along the same lines, it should be noted that the ADA algorithm (Inzucchi, 2012) lists weight gain as a disadvantage of both sulfonylurea and glitazone therapy while it lists weight loss as an advantage of GLP1 therapy, and this may drive certain types of patients to certain regimens.

The ‘Other therapy NEC’ category contains patients treated with the most complex (two to three or more different medications) and least frequently used treatment regimens. Tables 5 and 6 show that these patients are associated with higher medical charges during the outcome observation period than patients using any of the other therapy regimens, and the difference is statistically significant in all but four cases (sulfonylurea monotherapy, GLP1 monotherapy, DPP4 monotherapy, and biguanide with one or more other therapy NEC). This may be an example of advanced disease leading to advanced treatment regimens.

However, the data also show examples where more complex treatment regimens are associated with lower medical charges during the outcome observation period than simpler treatment regimens. This is particularly true in the case of sulfonylurea monotherapy. Five more complex regimens are associated with significantly lower medical charges during the outcome observation period than sulfonylurea alone (biguanide+sulfonylurea+glitazone, biguanide+sulfonylurea, biguanide+DPP4, biguanide+glitazone with one or more other therapy NEC, and biguanide+glitazone). While it may be that sulfonylurea monotherapy is given to sicker patients, two of the regimens associated with lower medical charges during the outcome observation period than sulfonylurea monotherapy themselves include sulfonylurea.

In fact, this pattern repeats with the GLP1 monotherapy, DPP4 monotherapy, and glitazone monotherapy regimens. In all three cases, the group of patients who also added biguanide to their regimen had significantly lower medical charges during the outcome observation period than with monotherapy. Of course some patients may not be able to tolerate the side effects of biguanide, but this pattern suggests that biguanide should be included in patients’ treatment regimens whenever possible.

Interestingly, the group having the simplest treatment regimen of all, ‘Lifestyle Changes Only’, had significantly higher medical charges during the outcome observation period than four biguanide-based regimens. Three include biguanide with another medication (biguanide+sulfonylurea, biguanide+DPP4, and biguanide+glitazone), and the fourth is biguanide monotherapy. This underscores the point made previously about including biguanide whenever possible, even if a new patient's disease is controlled enough to consider treating it with lifestyle changes alone.

These univariate results suggest interesting patterns of outcomes associated with different treatment regimens across the population. However, to discover patterns that account for multiple treatment and patient profile variables simultaneously, we turn to more sophisticated data mining techniques that search the space more intelligently.

Medical charges decision tree

Using the mutually exclusive flags indicating each of the 16 treatment regimens as well as gender, age group, census region, year of index, and counts of diagnosis and prescription drug groupings during the clean period, a decision tree model was generated for patients’ medical charges during the outcome observation period. The tree induction algorithm partitions the data recursively to identify subsets of the data where the distribution of the dependent variable is as different as possible from the distribution of the variable in the data as a whole (Breiman et al, 1984). The algorithm constructed a tree with 23 terminal nodes (i.e. mutually exclusive patient groupings) using eight variables – age group, counts of pre-existing diagnosis and prescription drug groupings, and five of the binary treatment regimen flags (sulfonylurea monotherapy, glitazone monotherapy, biguanide+sulfonylurea+glitazone, other therapy NEC, and lifestyle changes only).
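To make the recursive partitioning idea concrete, the sketch below finds a single best split on one numeric explanatory variable by minimizing the within-group sum of squared errors, in the spirit of the least-squares regression trees of Breiman et al (1984). The exact splitting criterion and stopping rules used by the Enterprise Miner tree node are not stated here, so this should be read as an illustration of the technique under that assumption; the function names and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_leaf=1):
    """Return (threshold, total SSE) for the best binary split of y on x.

    Candidate thresholds are midpoints between consecutive distinct x
    values. Returns None if no valid split exists.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Toy data: count of diagnosis groupings vs. medical charges (hypothetical).
groupings = [1, 2, 3, 10, 11, 12]
charges = [100, 110, 90, 1000, 1100, 900]
threshold, score = best_split(groupings, charges)
```

A full tree applies the same search to every candidate variable at every node and then recurses into the two resulting subsets until a stopping rule is met.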

The tree induction algorithm was run with a 10-fold cross validation (Breiman et al, 1984). The resulting tree is too large for practical graphical representation so we show only the terminal nodes (designated with an arbitrary Node ID) in Table 7. Node size (n), and mean and standard deviation of medical charges during the outcome observation period are also given. T-tests were used to detect differences between each possible pair of terminal nodes. Since the tree algorithm is designed to create dissimilar groups, it is not surprising that most of the terminal nodes are statistically significantly different from each other at the P=0.05 level.
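For readers unfamiliar with k-fold cross validation, the partitioning step can be sketched as follows. This shows only how patients might be divided into ten folds; how the software uses the folds to select the final tree is not described here, and the function name and fixed random seed are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Each of the n records appears in exactly one test fold.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each model is fitted on nine folds and evaluated on the held-out tenth, so every patient contributes to exactly one out-of-sample evaluation.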

Table 7 Differences between medical charges during the outcome observation period, by patient groups defined by terminal nodes from a tree algorithm (n=66,523)

The results show several interesting patterns. Notice that the algorithm splits the data by the number of 3-digit ICD9 diagnosis codes and number of drug subclasses prior to diabetes diagnosis. These attributes can be viewed as a surrogate for the patient's ‘health status’ prior to diabetes diagnosis in that a larger number indicates poorer health relative to a lower number. The striking pattern here is that regardless of treatment regimen, patients with relatively poor ‘health status’ (as measured by the counts of diagnosis groupings and prescription drug subclasses) prior to diabetes diagnosis have statistically significantly higher medical charges during the outcome observation period than patients with relatively good health status prior to diabetes diagnosis. This pattern can be seen broadly in nodes 1–7, which are loosely ordered from lowest to highest number of diagnosis groupings. It can also be seen separately for patients under 55 (nodes 8 and 9), and patients over 55 (nodes 10–12).

The nodes including various therapy regimens (nodes 13–23) have higher counts of diagnosis groupings and prescription drug subclasses during the clean period, suggesting treatment selection may be more important for sicker patients. Several of these nodes identified small groups of patients where medical charges and variances are unusually high (e.g. nodes 17, 19, 21, 23). The numbers of cases in these groups are small, so they might just represent unusual situations; however, they could also represent emerging regimens.

Nodes 15 and 16 represent another interesting pattern. They show segments that have identical profiles in terms of counts of diagnosis groupings and prescription drug subclasses prior to diabetes diagnosis but differ in terms of therapy regimen, with node 15 representing the lifestyle changes only regimen. A comparison of these two suggests that patients with the lifestyle changes-only regimen have statistically significantly higher medical charges (mean $36,314, sd $89,311) than patients with any other regimen except for sulfonylurea monotherapy and other therapy NEC (mean $28,137, sd $56,820).

The tree-based analysis further confirms the previous univariate analysis suggesting patients with a medication-based therapy regimen, particularly those that include biguanide, are associated with lower medical charges than patients with lifestyle changes only. It also sheds additional light by highlighting patient subgroups where this is particularly so, as is suggested in nodes 15 and 16.

Medical charges regression

To test the hypothesis that patients’ medical charges during the outcome observation period are a function of the mutually exclusive flags indicating each of the 16 treatment regimens, gender, age group, census region, year of index, and counts of diagnosis and prescription drug groupings during the clean period, a stepwise OLS multiple regression analysis was performed. Levels of F to enter the model and F to remove from the model were both set to correspond to P=0.05. The stepwise procedure produced a model with 14 of the 31 available variables and provided partial confirmation of the hypothesis. While the low R-square (0.032) indicated that a low proportion of the variability in medical charges was fitted by the model, the F ratio (F=155.80, P< 0.0001) indicated that the model has statistically significant predictive capability. Parameter estimates are presented in Table 8.

Table 8 Medical charges regression model parameter estimates (n=66,523)
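A low R-square and a highly significant F ratio are not contradictory when the sample is large. Assuming the model contains an intercept and 14 predictors and was fitted to all 66,523 patients, the overall F ratio can be recovered from R-square as F = (R²/k) / ((1 − R²)/(n − k − 1)). The short check below is our own illustration (the function name is hypothetical); it shows that the reported F=155.80 is consistent with an R-square that rounds to 0.032.

```python
def f_from_r2(r2, n, k):
    """Overall F ratio of an OLS model with intercept, k predictors, n rows."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


N_PATIENTS = 66523
N_PREDICTORS = 14

# R-square is reported to three decimals, so the true value lies roughly
# between 0.0315 and 0.0325; the reported F should fall between the two
# corresponding F ratios.
f_low = f_from_r2(0.0315, N_PATIENTS, N_PREDICTORS)
f_mid = f_from_r2(0.032, N_PATIENTS, N_PREDICTORS)   # roughly 157
f_high = f_from_r2(0.0325, N_PATIENTS, N_PREDICTORS)
consistent = f_low < 155.80 < f_high
```

With more than 66,000 observations the denominator degrees of freedom are so large that even a model explaining about 3% of the variance yields a very large F ratio.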

The parameter estimates from this model show both expected and unexpected patterns in the data. Perhaps expectedly, holding all else equal, the parameter estimates suggest the following:

  • Patients in the three older age groups (45–54, 55–64, 65+) would be expected to have higher medical charges during the outcome observation period (coefficient estimates=$462, $1,172, $1,280, respectively) compared with the patients in the two younger age groups (18–34, 35–44).

  • Patients whose diabetes diagnosis occurred in the earliest year (index year=2005) of the study period had lower medical charges during the outcome observation period (coefficient estimate=−$590) than patients in the later years (2006–2008), likely due to cost increases over time.

  • Each diagnosis group observed prior to a patient's diabetes diagnosis increases the expected medical charges during the outcome observation period by $2,172. Each drug subclass observed prior to a patient's diabetes diagnosis increases the expected medical charges during the outcome observation period by $846. It is not surprising that patients with relatively poor health status (as measured by the counts of diagnosis groupings and prescription drug subclasses) would be expected to have higher healthcare utilization.

  • Patients might be expected to have medical charges positively correlated with the number of medications in the selected regimen, because patients with more advanced disease progression, and hence more utilization of healthcare, might be treated with more advanced regimens. The model shows that patients with biguanide monotherapy would be expected to have lower medical charges during the outcome observation period (coefficient estimate=−$1,072) than patients with biguanide+glitazone (coefficient estimate=−$751), who in turn would be expected to have lower medical charges during the outcome observation period than patients with biguanide+sulfonylurea+glitazone (coefficient estimate=$1,136).

Less expectedly, again holding all else equal, the parameter estimates also suggest the following.

  • Patients in the Midwest and South regions would be expected to have lower medical charges during the outcome observation period (coefficient estimates=−$928 and −$374, respectively) compared with the patients in the West and Northeast. This suggests some regional variation in utilization.

  • Patients using the biguanide+DPP4 regimen are associated with lower expected medical charges (coefficient estimate=−$1,327). Since relatively few patients select this regimen (836), this result suggests taking a closer look at its potential benefits.
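Because the model is additive, the coefficient estimates quoted above can be combined to compare two patient profiles. The sketch below uses only the estimates cited in the text; the intercept and the remaining coefficients appear in Table 8 and are not reproduced, so the function returns an offset relative to the reference profile rather than a predicted charge. The dictionary keys and function name are hypothetical.

```python
# Coefficient estimates (US dollars) quoted in the text.
COEFFICIENTS = {
    "age_45_54": 462,
    "age_55_64": 1172,
    "age_65_plus": 1280,
    "index_year_2005": -590,
    "region_midwest": -928,
    "region_south": -374,
    "biguanide_monotherapy": -1072,
    "biguanide_glitazone": -751,
    "biguanide_sulfonylurea_glitazone": 1136,
    "biguanide_dpp4": -1327,
}
PER_DIAGNOSIS_GROUP = 2172
PER_DRUG_SUBCLASS = 846


def charge_offset(flags, diagnosis_groups=0, drug_subclasses=0):
    """Expected charge difference versus the reference profile, all else equal."""
    total = sum(COEFFICIENTS[name] for name in flags)
    total += PER_DIAGNOSIS_GROUP * diagnosis_groups
    total += PER_DRUG_SUBCLASS * drug_subclasses
    return total


# A 55-64-year-old in the Midwest on biguanide monotherapy with five prior
# diagnosis groupings and three prior drug subclasses.
example = charge_offset(
    ["age_55_64", "region_midwest", "biguanide_monotherapy"],
    diagnosis_groups=5,
    drug_subclasses=3,
)
```

In this example the pre-existing health status terms dominate the offset, which mirrors the pattern seen in the decision tree.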

Diabetes-related complications

While costs are certainly an important component of the current healthcare policy debate, the quality of patients' own outcomes is of critical interest. To compare effectiveness among competing medications along this dimension, we classified patients according to whether or not they developed diabetes-related complications during the outcome observation period.

Proportions of patients developing a diabetes-related complication during the outcome observation period are presented in Table 9 by treatment regimen. The difference between each possible pair of proportions was tested using a Fisher's Exact test, which is appropriate to use when small cell sizes are present and the analysis can be constructed using 2 × 2 contingency tables. In this case, the contingency table is limited to patients taking either of the two regimens in the pair and is constructed by classifying patients as to regimen (defined by the pair) and presence of a complication (yes or no). Pairs where the null hypothesis (that patients on each regimen are equally likely to have complications) could be rejected at the P=0.05 level of significance are presented in Table 10.

Table 9 Proportion with diabetes-related complications during the outcome observation period, by treatment regimen (n=66,523)
Table 10 Differences between proportions with diabetes-related complications during the outcome observation period, by treatment regimen pairs whose differences are statistically significantly greater than zero at the P=0.05 level (n=66,523)
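The pairwise test can be illustrated with a small pure-Python implementation of Fisher's Exact test for a 2 × 2 table. The sketch assumes the common two-sided definition (summing the probabilities of all tables, with the observed margins, that are no more likely than the observed table); the text does not state which variant was used, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact p-value for the table [[a, b], [c, d]].

    Rows are the two regimens in a pair; columns are complication yes/no.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x complications in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small tolerance so ties with the observed table are included.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

Exact enumeration of this kind remains valid for regimens with very few complications, where a chi-square approximation would be unreliable.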

As with the analysis of medical charges, the analysis of the proportion or percentage with complications is complicated by the possibility that patients who are treated with ‘more complex’ treatment regimens also have more severe disease to begin with, which would account for their higher proportion of complications. However, it is interesting to note that even among treatments of ‘similar complexity’ such as biguanide+sulfonylurea (2.86%) and biguanide+glitazone (1.96%), the former has a significantly higher proportion of complications. Since glitazone is a newer class of drugs, this pattern might be indicative that it is more effective than sulfonylurea when used in combination with biguanide.

Diabetes-related complications decision tree

Mutually exclusive flags were used to indicate treatment regimens. These were appended to gender, age, census region, year of index, and counts of pre-existing diagnosis and prescription drug groupings. A decision tree model was generated yielding a tree with 13 terminal nodes. As with the medical charges tree, we show only the terminal nodes in Table 11. Node size (n), and proportion with a diabetes-related complication during the outcome observation period are also given for each node. Fisher's Exact tests were used to detect statistically significant differences (P=0.05) between each possible pair of leaves.

Table 11 Differences between proportions with diabetes-related complications during the outcome observation period, by patient groups defined by terminal nodes from a tree algorithm (n=66,523)

As with the medical charges tree, these results suggest several subgroups (e.g. nodes 10, 11, 14, and 21) where patients with relatively poor ‘health status’, as measured by high counts of prescription drug subclasses prior to diabetes diagnosis, have a statistically significantly higher proportion with diabetes-related complications during the outcome observation period.

Since complications are relatively rare, many terminal nodes represent subgroups with a very small proportion of complications. These might just represent unusual situations, but they might be indicative of an emergent phenomenon worth investigating further. For example, nodes 8 and 10 identified very small groups of patients where proportions of complications were unusually high (22.2% and 23.1%, respectively). However, it is interesting that among 18–64-year-old patients, those with more than 10 drug subclasses prior to diabetes diagnosis who used the sulfonylurea monotherapy regimen did not have a statistically significantly higher rate of complications (8.4%) than those with more than 21 drug subclasses (a quite high count) prior to diabetes diagnosis on any other regimen (9.4%). As with the univariate analysis, this tree analysis is also complicated by the possibility that patients who are treated with more complex regimens have more severe disease, which could lead to a higher proportion of complications. We also wonder whether some patients do not tolerate biguanide well, perhaps due to GI side effects, and something about this kind of patient aside from their treatment selection makes them more likely to develop complications. A more detailed analysis is left for future research.

Diabetes-related complications regression

To test the hypothesis that patients’ chances of developing a diabetes-related complication during the outcome observation period are a function of the mutually exclusive flags indicating each of the 16 treatment regimens, gender, age group, census region, year of index, and counts of diagnosis and prescription drug groupings during the clean period, a stepwise multiple logistic regression analysis was performed. Levels of F to enter and F to remove were both set to correspond to P=0.05. Results of the analysis produced a model with 14 of the 31 available variables and provided partial confirmation of the hypothesis. The area under ROC curve statistic for this model (c=0.62) indicated that it has better than random predictive capability. Parameter estimates are presented in Table 12.

Table 12 Logistic regression model results for diabetes-related complications
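The c statistic reported above is the proportion of (complication, no complication) patient pairs in which the model assigns the higher predicted probability to the patient who developed a complication, with ties counted as one half. A minimal sketch of that definition follows; the function name and the toy scores are hypothetical, and the quadratic pairwise loop is meant for illustration rather than for 66,523 patients.

```python
def c_statistic(scores, outcomes):
    """Area under the ROC curve via pairwise concordance.

    scores are predicted probabilities; outcomes are 1 (complication)
    or 0 (no complication).
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one patient in each outcome class")
    concordant = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


toy_c = c_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so c=0.62 indicates modest discrimination.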

Not unexpectedly, compared with 45–64-year-olds, the model shows lower chances of diabetes-related complications during the outcome observation period for 18–34 (odds ratio=0.68) and 35–44-year-olds (odds ratio=0.70), and higher chances for those 65 and older (odds ratio=1.85). It also shows the chances of diabetes-related complications during the outcome observation period increase by 1.08 times with each additional diagnosis group and by 1.04 times with each additional prescription drug subclass observed prior to diabetes diagnosis. Though it is not clear why, the model also shows those living in the West have higher chances of diabetes-related complications during the outcome observation period (odds ratio=1.17) compared with other regions.
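Odds ratios from a main-effects logistic model multiply, so the per-unit ratios quoted above compound across additional diagnosis groups and drug subclasses. The sketch below illustrates this under the assumption that the model contains no interaction terms; the function names are hypothetical, and the 2% baseline probability in the example is an illustrative value, not a figure from Table 12.

```python
OR_PER_DIAGNOSIS_GROUP = 1.08
OR_PER_DRUG_SUBCLASS = 1.04


def combined_odds_ratio(extra_diagnosis_groups=0, extra_drug_subclasses=0,
                        other_ratios=()):
    """Multiply per-unit odds ratios with any categorical odds ratios."""
    ratio = (OR_PER_DIAGNOSIS_GROUP ** extra_diagnosis_groups
             * OR_PER_DRUG_SUBCLASS ** extra_drug_subclasses)
    for r in other_ratios:
        ratio *= r
    return ratio


def apply_odds_ratio(baseline_probability, odds_ratio):
    """Convert a baseline probability to the probability implied by an odds ratio."""
    odds = baseline_probability / (1 - baseline_probability) * odds_ratio
    return odds / (1 + odds)


# Three additional diagnosis groupings raise the odds by about 26%.
three_groups = combined_odds_ratio(extra_diagnosis_groups=3)

# Hypothetical 2% baseline risk for a 45-64-year-old, applied to someone 65+.
older_risk = apply_odds_ratio(0.02, 1.85)
```

Because complications are rare, odds ratios and risk ratios are close in magnitude here, but they are not interchangeable at higher baseline risks.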

Regarding treatment regimens, the stepwise selection excluded DPP4 monotherapy, biguanide+glitazone, glitazone monotherapy, lifestyle changes only, biguanide monotherapy, GLP1 monotherapy, biguanide with 1 or more other therapy NEC, and biguanide+sulfonylurea with 1 or more other therapy NEC. These excluded regimens can be grouped into three categories: lifestyle changes only, monotherapies other than sulfonylurea, and polytherapies that include biguanide. Hence the treatment regimens that the stepwise selection included in the model, all of them polytherapies except sulfonylurea monotherapy, are considered relative to the excluded regimens. Broadly, this comparison shows that regimens with few medications, with the exception of sulfonylurea monotherapy, are associated with lower chances for diabetes-related complications, confirming both the univariate and tree analyses.

Limitations

As has been pointed out several times already, correlation does not necessarily indicate causation. We cannot tell from the data whether patients who are sicker to begin with use particular therapy regimens, or patients who use particular therapy regimens end up sicker. Also, certain types of patients may be intolerant of certain therapy regimens due to side effects or contraindications and hence have a more limited set of available treatment options. We have attempted to control for pre-existing health status prior to diabetes diagnosis as much as possible, while still highlighting the important patterns revealed in the data.

Furthermore, there are many attributes that one would expect to have an effect on patient outcomes that are either not available from our data or not considered as part of this analysis. Examples include the underlying reasons for treatment selection and outcomes, medication adherence, persistence and compliance, blood test results, height, weight, and other clinical data.

Claims data, such as those used in this analysis, are not collected for research purposes and come with several limitations of their own. These include the following.

  • The data are limited to covered services, and benefit designs such as drug formularies can artificially influence treatment decisions.

  • The coding of medical procedures and diagnoses is motivated by medical billing.

  • Providers may indicate certain medical coding on the basis of reimbursement maximization rather than clinical correctness.

  • The data are not necessarily generalizable to the entire population because patients with other types of insurance, no insurance, or who live outside the U.S. and participate in an entirely different healthcare system may be different from the U.S.-based commercially insured population from which our data were drawn.

Conclusions

Our results are supported in a few important ways by the literature, but they are unique in that they compare multiple therapies and multiple outcomes simultaneously, while other studies found in the literature generally confine themselves to a particular therapy or to the comparison of a specific set of therapies. For example, the literature reveals several studies showing that sulfonylureas may be associated with a higher risk of mortality and of developing comorbidities such as cardiovascular disease and cancer (Johnson et al, 2005; Karter et al, 2005; Nichols et al, 2005; Eurich et al, 2005a, 2005b; Bowker et al, 2006; Simpson et al, 2006; Tzoulaki, 2009). It stands to reason that the poorer clinical outcomes of sulfonylurea users would be associated with increased diabetes-related complication rates compared to diabetics who don’t use sulfonylureas, which is supported by both this and other studies. Another study looked at patients who were initiated on a sulfonylurea and later required the addition of a second therapy, and found that such patients, when augmented with metformin (biguanide), had 33% lower costs than those augmented with glitazones (Kalsekar & Latran, 2007). Still another study found that patients using metformin could be less adherent to their therapy regimens than users of either sulfonylureas or pioglitazone (a glitazone) and still achieve similar cost reductions compared with patients who were fully non-adherent to other therapy regimens (Hansen et al, 2010). None of the literature we reviewed provides as comprehensive a review of real-world therapy regimens in the context of the ADA's consensus algorithm.

Despite the novelty of our findings, the study raises more focused questions for further investigation, such as why costs or complication rates associated with certain treatments are higher than others. Indeed, we consider the strength of a study such as ours to be that it raises questions and focuses further investigation on answering the ‘why’ questions. Some of the specific questions are the following:

  1. Patients treated with sulfonylurea monotherapy are ripe for further investigation. They seem to fare worse than those treated with other regimens in many respects. Delving deeper into this population is a top priority.

  2. How would the results change by conditioning on the ‘A1C’ blood sugar levels? The ADA recommends that A1C levels be kept under 7%. It is possible that many of the treatments prescribed by physicians take into account the specific A1C levels and that higher levels are correlated with complications. This is an obvious area for further inquiry.

  3. Another deeper investigation would be to consider weight levels of individuals and analyze whether costs and complications are related to obesity. Specifically, for individuals who choose ‘lifestyle only’ changes to control diabetes, are there more complications when individuals are overweight?

  4. The area of complications needs further inquiry in general. Specifically, what is the distribution of complications for the different segments? What are the reasons for the differences in complications?

  5. It is worth extending this study to insulin users, an area that is significant in terms of both healthcare costs and complications, and also extending our methodology to other costly and prevalent health conditions (Soni, 2011) such as heart disease, cancer, and mental disorders.

  6. How accurate are the models extracted through the data-mining process when used by physicians for prediction? In other words, given the current ‘state’ of a patient, what is the accuracy of the model in predicting costs and complications of alternative treatments?

A large portion of costs arises from very few cases: 68% of utilization among those newly diagnosed with type 2 diabetes, for example, is incurred by 10% of the population. We believe this is primarily because healthcare providers are focused on treating individuals’ emergent problems – as they should be – rather than identifying broader patterns across large numbers of individuals. The result is that healthcare systems identify who uses up resources now, rather than who will likely use the lion's share soon. There is little financial incentive for providers to identify who is more likely to get very sick since reimbursement is tied to visits and services, not prevention.
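
As a rough illustration of how a concentration figure of this kind can be computed, the sketch below takes a list of per-member utilization amounts and returns the share attributable to the highest-cost fraction of members. The function name `top_share` and the sample amounts in the usage note are hypothetical and are not drawn from the study data.

```python
def top_share(costs, top_fraction=0.10):
    """Return the share of total cost incurred by the costliest members.

    costs: per-member utilization amounts (hypothetical input).
    top_fraction: fraction of members treated as the high-cost group.
    """
    if not costs:
        raise ValueError("costs must not be empty")
    ordered = sorted(costs, reverse=True)
    # Size of the high-cost group, rounded to a whole member (at least one).
    n_top = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n_top]) / total
```

With ten hypothetical members whose amounts are `[70, 10, 5, 5, 2, 2, 2, 2, 1, 1]`, the single costliest member accounts for 70 of the 100 total units, so `top_share` returns 0.70.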

What is needed is a system that takes a bird's eye view of large swaths of healthcare data and identifies patterns that are predictive of the outcomes of interest. In this way, healthcare systems can automatically direct the attention of healthcare providers toward improving the outcomes of those who are most likely to get sick in the future.

From a scientific perspective, the broader impacts of our research are in advancing the state of knowledge about healthcare, and in providing tools and systems that have pragmatic value as part of the health delivery system infrastructure in the U.S. and globally. Data play a big role in the development of theory in virtually all fields of science, particularly in healthcare. Our research will contribute towards developing etiologies for major diseases, but in a way that uses hard evidence from actual use of the healthcare system. Achieving these desired benefits requires an information infrastructure that links databases and predictive analytics seamlessly into healthcare systems. With the increasing prevalence of electronic health records and connectivity, we are beginning to see such an infrastructure emerge.

Prediction is not easy. In domains characterized by a lack of theory, it is particularly challenging. Scientific progress requires careful analysis of problem domains including the peculiar data in each of them to understand which approaches and methods work best. Our research is at the intersection of big data and health sciences. It suggests that pattern detection in large amounts of real world healthcare data can predict groups at risk, enabling more proactive attention and prevention, with huge potential cost savings in healthcare.

The massive volumes of data generated by the healthcare system are much too big to be analyzed by humans alone. The increase in self-monitoring tools available to individuals through advances in information technologies will only increase the flood of data. The U.S. needs a healthcare information infrastructure where data are interpreted automatically by systems that inform the patient and healthcare providers of the specific health risks to individuals. Our research represents a step towards achieving the above objective. At the end of this research program, we should have the basis for the data-driven identification of ‘the future sick’, thereby allowing intelligent targeting for disease management and prevention programs.