Chapter 32

Emerging Business Intelligence Framework for a Clinical Laboratory Through Big Data Analytics

Emad A. Mohammed1; Christopher Naugler2; Behrouz H. Far1    1 Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada
2 Departments of Pathology and Laboratory Medicine and Family Medicine, University of Calgary and Calgary Laboratory Services, Diagnostic and Scientific Centre, Calgary, AB, Canada

Abstract

Modern clinical decision support based on data analytics requires a framework that incorporates distributed processing platforms, sustainable data models, and inference algorithms. The ultimate objective of this chapter is to identify the common components of a user-centered analytics framework that can reason over different clinical historical Big Data. Those components emerge through two case studies that identify potential analytics to support the decisions of laboratory managers. In the case studies, the outputs are visualizations and estimations of clinical test volumes, which lead to optimal purchasing and fiscal planning. We particularly focus on the reusable business intelligence (BI) components that can help run similar business processes from a single manager's perspective, as well as the BI components that can be reused by several managers in a clinical laboratory setting. This is a first attempt at the design and implementation of a user-centered framework for clinical laboratory settings as a BI platform.

Keywords

Big Data analytics

clinical laboratory

business intelligence (BI) platform

architectural framework

time series analysis

visualization method

1 Introduction

Clinical laboratories provide medical test services to a variety of customers. From a business perspective, providing high-quality services at lower costs mandates the efficient utilization of clinical resources (McDonald et al., 1997). Historical and current clinical resource utilization data sets are now digitally available at almost all clinical facilities (Kessel et al., 2014). Improvements in digital equipment facilitate the acquisition and storage of all kinds of clinical data at a rate much faster than what can be processed using traditional processing systems (Peters and Buntrock, 2014), which gives rise to the clinical Big Data era.

Clinical Big Data is the technical term used to describe massive databases characterized by high volume, variety, and velocity [e.g., electronic medical record (EMR) and biometrics data]. These databases present difficulties with storage, processing, and visualization (Rajaraman and Ullman, 2012; Coulouris et al., 2005).

Business intelligence (BI) is defined as the use of data and dedicated processing to support informed decisions in diverse administrative settings (Pine et al., 2011). An intrinsic trait of a BI system is that it integrates data from a diversity of sources, resulting in an effective information framework for clinical laboratory decision makers (Negash, 2004).

A clinical laboratory facility can have many managers with different perspectives (Bennett, 2007) (e.g., clinical section manager, logistics manager, finance manager, etc.). BI can deliver benefits to clinical laboratory directors, including effective utilization of human resources (Crist-Grundman and Mulrooney, 2010), improved process efficiency, and prevention of unnecessary costs (Foshay and Kuziemsky, 2014). BI should help clinical laboratory managers with different roles in describing and diagnosing current performance and predicting future performance. To deal with these challenges, new software programming frameworks that multithread computing tasks have been developed (Coulouris et al., 2005; de Oliveira Branco, 2009; Dean and Ghemawat, 2008). However, many clinical laboratory settings have not yet applied BI systems (Foshay and Kuziemsky, 2014), and there has been very limited research on the factors that contribute to the successful implementation of BI in a health care–specific context (Foshay and Kuziemsky, 2014).

Several BI applications already exist (Anon, 2014e, 2014f, 2014d, 2014b, 2014c, Publishing, 2014); however, they usually suffer from two drawbacks: first, they offer a rigid, inflexible framework that addresses the needs of a specific group of users; second, they adopt the scope of scenario-based simulation rather than Big Data–driven analytics.

New challenges have emerged, especially with respect to the processing of the massive data sets that are produced daily. This drives the need for a new framework that can be used by different clinical laboratory managers to support informed decisions.

In this chapter, we explore a combination of reusable data analytics (i.e., a BI framework) that helps clinical laboratory managers with different roles become more effective by supporting their decisions. Two real-life case studies (namely, clinical laboratory test usage pattern visualization and estimation of clinical test volumes) are designed to identify potential analytics to support the decisions of laboratory managers. The objective is to identify the reusable components of a user-centered framework (i.e., serving clinical laboratory managers with different roles and perspectives) based on Big Data analytics of clinical data sets. The implementation and validation of other framework components/analytics will be explored in future research.

2 Motivation

The case studies emerged from real-life concerns of laboratory managers. Population growth has a striking influence on health care, including laboratory diagnostics. Clinical laboratories in Canada have experienced substantial growth in utilization in recent years. In Alberta, the volumes of all types of laboratory testing are increasing at a rate much faster than population growth, with a 36% increase in chemistry test volumes between 2003 and 2009 (Di Matteo and Di Matteo, 2009). There have been repeated claims that a substantial proportion of laboratory tests are redundant. A recent meta-analysis estimated the proportion of redundant tests to be approximately 30% (Zhi et al., 2013). In Calgary alone, this equates to as many as 8 million potentially unnecessary laboratory tests per year, representing a cost of at least $80 million per year in direct and indirect costs. Indications that additional test utilization does not advance clinical outcomes have been reported in a number of studies that found no association between the volume of tests ordered and clinical outcome (Daniels and Schroeder, 1977; Ashley et al., 1972; Bell et al., 1998; Powell and Hampers, 2003). Redundant laboratory tests not only waste valuable resources, but may also lead to patient harm through false-positive results. If a healthy person is subjected to 10 unrelated (unnecessary) tests, the probability of at least 1 deviant test result is 40% (Axt-Adam et al., 1993). Such a deviant result may lead to unnecessary, far-reaching, expensive, and time-consuming diagnostic examinations (Lewandrowski, 2002).

The driving forces for the efforts illustrated in this chapter are derived from the literature cited above and from the fact that current BI frameworks are based on text and flowchart process simulation for best-practice scenarios and optimization rather than Big Data analytics (Anon, 2014b, 2014c, Publishing, 2014). Text- and flowchart-based approaches are subjective (i.e., users use different synonyms and acronyms, and different flowcharts are created for the same process). On the other hand, Big Data are more representative of the variations in a given process. Moreover, representing a complex process using text and flowcharts may be inaccurate, while complex process traits are encapsulated in the Big Data representation. Furthermore, as more data are collected for the same measurement, the data model becomes increasingly accurate.

Another driving motivation is that there is no efficient statistical method to highlight the overutilization or underutilization of clinical laboratory resources (Naugler, 2013, 2014; MacMillan, 2014). Moreover, physician test ordering patterns have no analysis/feedback mechanism (Plebani, 1999; Plebani et al., 2014; Kiechle et al., 2014), and thus, a significant amount of clinical tests and resources (e.g., medical equipment utilization, technician workload, cost, etc.) are misused. Furthermore, there is no efficient method to measure or visualize a human performance index (Wennberg, 2004; Ashton et al., 1999; Monsen et al., 2008) through analysis of variance. In addition, most managerial decisions are based on descriptive analytics that tell what happened, and few are based on predictive analytics that tell what is going to happen (Davenport, 2013).

3 Material and methods

3.1 Data source

Supervisors at Calgary Laboratory Services (CLS), University of Calgary (Anon, 2014g), provided different types of data sets (i.e., clinical laboratory test volumes and clinical test utilization by physicians).

3.2 MapReduce framework

A commonly implemented programming framework that depends on functional programming for massive/Big Data processing is the MapReduce framework (de Oliveira Branco, 2009; Dean and Ghemawat, 2008; Peyton Jones, 1987). MapReduce is an evolving programming framework for massive data applications proposed by Google. It is based on functional programming (Peyton Jones, 1987), where the designer defines map and reduce tasks to process large sets of distributed data. Applications of MapReduce (Dean and Ghemawat, 2008) enable many of the common tasks on massive data to be implemented on computing clusters in a way that is tolerant of hardware failures.
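As a sketch of the map/reduce decomposition described above, the following fragment aggregates hypothetical clinical test volumes by test code. The chapter's implementations use Java and R on Hadoop; Python is used here only for brevity, and the record values and test codes are illustrative, not taken from the CLS data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (test_code, volume) pairs, as they might arrive from
# distributed LIS log shards (codes and values are illustrative only).
records = [("CH7", 120), ("TNIV", 40), ("CH7", 95), ("DDEL", 15), ("TNIV", 22)]

def map_task(record):
    """Map: emit a (key, value) pair for each input record."""
    test_code, volume = record
    return (test_code, volume)

def reduce_task(key, values):
    """Reduce: aggregate all values that share a key."""
    return (key, sum(values))

# Shuffle/sort: group mapped pairs by key, as the framework does between phases.
mapped = sorted((map_task(r) for r in records), key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}
totals = dict(reduce_task(k, vs) for k, vs in grouped.items())
print(totals)
```

In a real MapReduce job, the map and reduce tasks run on different cluster nodes and the shuffle is performed by the framework; only the two task definitions are written by the designer.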

3.3 The Hadoop Distributed File System

Hadoop (Bryant, 2007; White, 2012; Shvachko et al., 2010) is an open-source implementation of the MapReduce framework for implementing applications on large clusters of commodity hardware (Anon, 2014k). Hadoop is a platform that affords both distributed file system (DFS) and processing capabilities. Hadoop was created to resolve a scalability issue in the Nutch project (Shvachko et al., 2010; Olson, 2010), an open-source crawler engine that uses MapReduce and Bigtable (Olson, 2010). Hadoop is a distributed master-slave architecture that comprises the Hadoop Distributed File System (HDFS) and the MapReduce framework. Characteristics intrinsic to Hadoop are data partitioning and parallel computation of massive data sets. Hadoop storage and processing capabilities scale with the addition of computing machines to the cluster, handling volumes at the terabyte/petabyte level on clusters with thousands of machines.

3.4 The emerging framework

Unlike other works that start with the design of the framework, we selected the empirical route to identify reusable BI analytics components. The methodology is based on conducting the case studies and identifying the commonalities between them and then extracting reusable framework elements. The results are presented next, and details about the case studies will follow.

The framework for clinical Big Data analytics emphasizes the modeling of several interacting processes in a clinical setting (e.g., clinical test utilization pattern, test procedures, specimen collection/handling, etc.). This can be constructed using clusters of commodity hardware and the appropriate open-source tools on top of the cluster to build a convenient processing platform for massive clinical data. This is the basis of future laboratory informatics applications, as laboratory data are increasingly integrated and consolidated.

A main requirement of the framework is that it should adapt to changes in users and their perspectives. Figure 32.1 shows the different perspectives of a clinical laboratory setting. The clinical section managers (i.e., general pathology, clinical biochemistry, microbiology, hematology, and cytopathology) can use the framework to plan for clinical lab workload and demand forecasting. Human resources managers can use the framework to monitor staff key performance indicators (KPIs) and for capacity planning. Rewards analytics can be used by different managers to estimate the trend of reward increases/decreases according to the desired KPIs. Environmental health and safety (EH&S) managers can use the framework to detect disease outbreaks. Planning and new business managers can use the framework analytics to simulate different investment scenarios (e.g., purchase planning, the effect of a certain supply cutoff, etc.). Finance managers can use the framework for many different purposes, such as payroll, rewards, and fraud detection.

Figure 32.1 The different perspectives of a clinical laboratory setting.

In some clinical facilities, the blood/tissue samples are collected by courier services from patient service centers (PSCs)—a typical scenario in the City of Calgary—and sample handling time may play a significant role in the accuracy of test results. The logistics manager can use the system to monitor the performance of the courier service fleet drivers, estimate the sample handling time, and plan driver routes for sample pickup and handling. Information technology (IT) service administration managers can use this framework to estimate the potential utilization of the IT system and infrastructure, and hence plan for system maintenance and expansion.

The identified components of the framework for a clinical laboratory setting are shown in Figure 32.2, which illustrates the different analytics services that can be utilized by clinical laboratory users. Every component consists of a set of MapReduce statistical algorithms that help a lab director with a specific concern to support ongoing decisions in a specific process through the analysis of historical Big Data (e.g., hiring new staff, clinical test workload management, etc.). The design of the system is derived from the clinical laboratory managers' different roles and perspectives illustrated in Figure 32.1.

Figure 32.2 User-centered framework architecture in a clinical laboratory setting.

3.5 Laboratory management system components

The laboratory director/manager is responsible for the overall operation and administration of the laboratory, including the employment of competent qualified personnel (Reller et al., 2001; Bennett, 2007). It is the lab director’s responsibility to ensure that the laboratory develops and uses a quality system approach to provide accurate and reliable patient test results. In the quality system approach, the laboratory focuses on comprehensive and coordinated efforts to achieve accurate, reliable, and timely testing services. The quality system approach includes all the laboratory policies, processes, procedures, and resources needed to achieve consistent and high-quality testing services.

As a result of the many complex lab director/manager responsibilities, there is a pressing need for automation of the overall process through the development of a Big Data analytics framework that can aid a lab director in making informed decisions. In the following sections, the different components of the system are explained.

3.6 Lab management application interface

The lab management application interface is the main screen that a lab director interacts with. It provides different groups of functionalities that process different types of Big Data (e.g., clinical test, human resources, and traffic data for CLS fleet management, etc.) through the Hadoop platform. The programs are developed in MapReduce and coded in Java and the R statistical package.

3.7 Administration services

The laboratory director has many responsibilities related to human resources. This person must ensure that the laboratory has a suitable number of trained staff with adequate supervision to meet the loads of the laboratory service, regulations, and accreditation standards.

3.8 Test procedure services

The laboratory director is responsible for all aspects of testing. The director guarantees the selection of suitable analyzers, reagents, supplies, calibrators, and control materials, so the test methods have performance characteristics that meet the needs of laboratory users (Simpson et al., 2000).

3.9 Operational management services

The critical responsibilities of a lab director are strategic planning, organizational goal setting, capital and operational budgeting, research and development, marketing, and vendor contracting. The director must engage patients and clinicians in negotiations and decisions related to operational management.

3.10 Service Infrastructure-Hadoop platform and Hadoop Enabled Automated Laboratory Transformation Hub (HEALTH) cluster

These services collectively handle connection to the computing infrastructure, as this connection handles the passage of the correct data with the required type of processing (e.g., test procedure service).

3.11 Data warehouse management service

The data warehouse management service is a connection to the required data at the data centers or the associated laboratory information system (LIS).

3.12 Ubuntu Juju as a service orchestration and bundling

Ubuntu Juju is an open-source service orchestration tool (Anon, 2014n). It allows software to be deployed, integrated, and scaled on a cloud service or server. The underlying mechanism behind Juju is known as Charms. Moreover, Juju has an element called Charm Bundles. A Charm Bundle allows a group of charms, their properties, and their relations to be exported into a YAML file, which can be imported into another Juju environment.
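For illustration only, a Charm Bundle is a YAML description of services and their relations. A hypothetical bundle for a small Hadoop deployment might look like the following; the charm names, unit counts, and relation endpoints are assumptions for the sketch, not taken from the chapter:

```yaml
# Hypothetical Juju Charm Bundle for a small Hadoop cluster; charm names,
# unit counts, and relation endpoints are illustrative assumptions only.
services:
  hadoop-master:
    charm: "cs:trusty/hadoop"
    num_units: 1
  hadoop-slave:
    charm: "cs:trusty/hadoop"
    num_units: 3
relations:
  - ["hadoop-master:namenode", "hadoop-slave:datanode"]
```

Importing such a file into another Juju environment recreates the same services and relations, which is what makes bundles useful for packaging reusable analytics infrastructure.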

3.13 Typical framework usage scenario in a clinical laboratory setting

The framework can be used by laboratory managers differently according to each manager's perspective. A laboratory manager who is interested in purchase planning for test consumables (e.g., different test consumables for the next 3 months) can use the "lab management application interface" to set up a perspective for planning future test consumable purchasing. This instructs the framework to search for suitable data sets and analytics algorithms to estimate the test volumes. These data sets and algorithms are retrieved through the orchestration service "Ubuntu Juju," which is facilitated by attaching to every data set in the data warehouse and every analytics algorithm a tag that the service orchestration can use to bundle data sets with the correct analytics algorithms. The data can be acquired from the clinical laboratory repository using suitable data connections between platforms [e.g., the Oracle-to-Hadoop data connector (Anon, 2014i)]. This is followed by bundling the Service Infrastructure-Hadoop platform service to the retrieved data sets and algorithms; the infrastructure Hadoop platform serves as the processing platform for the framework. The framework then provides the available services (functionalities) and data to carry out the user's demands through the "lab management application interface." If neither the right data set nor the right analytics exists, the lab management application interface displays an error message asking the user for other keywords/tags for the requested services.

4 Use-cases

In this section, we present the details of the two use-cases. The first reflects a typical scenario for the use of analytics for visualization of the clinical laboratory test usage problem, and the second employs Big Data analytics for prediction. Many of the typical clinical decision support use-cases fall into either of these categories.

Precisely stated, the purpose of these use-cases is to estimate the clinical test volumes of a given test per unit time—a time series analysis (i.e., Alberta provincial test volume estimation). Moreover, the clinical laboratory test utilization pattern service has a novel method to visualize the individual physician usage pattern for a given test or a test panel (i.e., a group of standard tests). In the following discussion, the clinical laboratory test utilization pattern visualization and provincial laboratory test volume estimation are illustrated and detailed, along with the associated design limitations.

5 Case Study 1: Clinical laboratory test usage patterns visualization

Excessive usage of diagnostic tests is a key challenge in health-care systems. Data sets of clinical test volumes for individual physicians are available from hospitals' laboratory information systems (LISs). The unjustified usage of clinical tests drives the need for a tool to highlight the usage pattern (utilization) of a given clinical test or test panel by a physician among his or her peers. The data sets must be normalized for different physician characteristics (e.g., number of working hours, years of experience, etc.), patient status (e.g., condition of the patient, number of visits, etc.), and department workload (e.g., emergency, outpatient, hospitalized patient, etc.).

6 Data source and methodology

For illustration, we use a simple example in which a set of five different clinical tests with varying test volumes was acquired for 35 physicians over a 3-month time span at one medical facility in the city of Calgary. The data were anonymized to comply with the Personal Information Protection and Electronic Documents Act of Canada (Anon, 2014j).

In this section, a novel graphical tool based on the z-score is used to analyze and visualize the usage pattern of the five recorded clinical tests and to identify extremes of practice variance from laboratory test ordering data. This methodology can be used to assess the efficacy of a test usage control criterion (Murphy and Henry, 1978; Plebani et al., 2014; Bates et al., 1999). Evaluation and visualization of physician usage patterns could yield a beneficial tool for assessing physician utilization of clinical tests. In the following discussion, we describe a novel method for comparing the utilization patterns of different physicians using laboratory data. To use this method, the number of physicians in a given data set must be greater than or equal to 30; otherwise, the t-test is used (Rosner, 2010).

The steps of the proposed method are:

1. Acquire test volumes for a group of physicians for a given test or test panel. The test volumes must be normalized per physician for a given range of physician characteristics, patient status, and department workload, as described earlier in this chapter.

2. Calculate the mean and standard deviation of the test volumes per test as shown by Eqs. (32.1) and (32.2):

\mu_T = \frac{1}{N}\sum_{p=1}^{N} X_p,  (32.1)

where μT is the mean of a given test (T); N is the number of physicians using test (T); and Xp is the volume of test (T) used by physician (p).

\sigma_T = \sqrt{\frac{\sum_{p=1}^{N} (X_p - \mu_T)^2}{N}},  (32.2)

where σT is the standard deviation of a given test (T).

3. Calculate the z-score for each test per physician, as shown by Eq. (32.3):

zscore_{p|T} = \frac{X_p - \mu_T}{\sigma_T},  (32.3)

where zscore_{p|T} is the z-score of test (T), which has test volume X_p, used by physician (p).
The conversion of test volumes into z-scores permits comparison of clinical laboratory tests with different average volumes.

4. Calculate the mean z-score for all tests and the standard deviation of their z-scores per physician, as shown in Eqs. (32.4) and (32.5):

\mu_{zscore_p} = \frac{1}{T}\sum_{i=1}^{T} zscore_{p|T_i}  (32.4)

\sigma_{zscore_p} = \sqrt{\frac{\sum_{i=1}^{T} \left(zscore_{p|T_i} - \mu_{zscore_p}\right)^2}{T}}  (32.5)

5. For each physician, plot the mean z-scores against the standard deviation of z-scores for that physician.

6. Define performance marker lines at the mean z-score and the mean standard deviation of z-scores to divide the z-score space into four regions. The explanation of these regions is described in Table 32.1.

Table 32.1

Categorical Classification of the z-score Space

Region/Group  Explanation
A             Physicians with high overall usage of a specific test and low variance from their peer group
B             Physicians with both high volumes of tests and high disparity from their peer group
C             Physicians with high practice variance but lower overall usage than group B physicians
D             Physicians with both low variations from peers and low overall test volumes
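The steps above can be sketched in a few lines of code. The following fragment computes Eqs. (32.1)-(32.5) and the region labels of Table 32.1 for a toy data set. The chapter's analyses use R; Python is used here for brevity, the physician names and volumes are invented, and the toy set is far smaller than the N ≥ 30 the method requires:

```python
import statistics

# Hypothetical normalized test volumes: one row per physician, one column per test.
volumes = {
    "phys1": [120, 10, 5, 30, 8],
    "phys2": [80, 25, 2, 45, 3],
    "phys3": [100, 15, 9, 20, 6],
}
tests = list(zip(*volumes.values()))  # per-test volume tuples across physicians

# Eqs. (32.1)-(32.2): population mean and standard deviation per test.
mu = [statistics.fmean(t) for t in tests]
sigma = [statistics.pstdev(t) for t in tests]

# Eq. (32.3): z-score of each test per physician.
zscores = {p: [(x - m) / s for x, m, s in zip(xs, mu, sigma)]
           for p, xs in volumes.items()}

# Eqs. (32.4)-(32.5): mean and standard deviation of z-scores per physician.
summary = {p: (statistics.fmean(z), statistics.pstdev(z))
           for p, z in zscores.items()}

# Step 6: marker lines at the peer-group means split the plane into the
# four regions of Table 32.1 (A-D).
mean_mu = statistics.fmean(m for m, _ in summary.values())
mean_sd = statistics.fmean(s for _, s in summary.values())

def region(mz, sz):
    if mz >= mean_mu:
        return "B" if sz >= mean_sd else "A"   # high usage
    return "C" if sz >= mean_sd else "D"       # lower usage

regions = {p: region(mz, sz) for p, (mz, sz) in summary.items()}
```

Plotting each physician's (mean z-score, standard deviation of z-scores) pair then reproduces the scatter of Figures 32.3 and 32.4.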

7 Results and discussion

The data set for 35 physicians utilizing five different tests (i.e., chemistry seven-test panel, CH7; D-dimer, DDEL; throat swab for beta-hemolytic streptococcus, M BETA; urine culture, M URINE; and troponin I, TNIV) was processed via the proposed method using the R statistical package (Anon, 2014m); the z-scores for each test were calculated, as was the average z-score over all five tests. Figure 32.3 illustrates the usage pattern of the CH7 test by the 35 physicians: most of the physicians belonged to group D, a small number were in groups A and B, and only a few belonged to group C. This reveals an optimal degree of usage of this test.

Figure 32.3 Usage pattern of the CH7 for 35 physicians.

The vertical, solid black line dividing the golden rectangle represents the mean usage pattern of the five tests (x-axis). A physician who uses the five tests equally appears as a red dot on this line. A physician who uses all five tests at a lower rate than average appears as a red dot to the left of this line, and vice versa.

The y-axis represents the usage probability of this test (i.e., CH7), represented by its computed z-score. Converting z-scores into probabilities is possible only under the assumption that the test volumes are drawn from a normal distribution. A conversion table to transform z-scores into probabilities can be found in Held and Bove (2014) and Rosner (2010).

The allowed disparity within the test volumes for a given test is represented by the width of the golden rectangle, and the height represents the range of the usage probabilities. The horizontal line crossing the zero of the y-axis represents the usage probability of the mean of the underlying test. The golden rectangle space is divided into four groups by the intersection of the vertical line representing the mean usage pattern of the test and the zero z-score line. A physician with perfectly average usage is represented by a dot at the intersection of these two lines. Points outside the rectangle represent outliers. Many factors may drive a physician to order more tests to better assess patient conditions; thus, Eq. (32.3) must be normalized by these factors, if quantifiable, and the modified equation for computing the z-score becomes

zscoreMod_{p|T} = \frac{X_p - \mu_T}{\sigma_T \cdot NFs}  (32.6)

where zscoreMod is the modified z-score and NFs is the product of all the normalization factors.

Figure 32.4 shows the average usage pattern for all five tests. It illustrates that the most common group is group D, with a wider probability of usage across all five tests.

Figure 32.4 Average usage pattern of all five tests for 35 physicians.

8 Limitations

The visualization model assumes that the data are drawn from normal distributions in order to translate the z-score into a probability of usage. This visualization framework is designed to view the relative usage pattern of physicians utilizing a group of clinical tests. It is used to visualize practice variance, not to provide conclusive proof of redundant usage of clinical tests. The framework has some restrictions, as it does not account for potential influencing parameters, such as limited ranges of practice, patient conditions, and the number of hospital visits per patient, that may affect test usage characteristics, as these parameters are not quantified.

9 Case Study 2: Provincial laboratory clinical test volume estimation

One main characteristic of any clinical laboratory BI platform is the ability to estimate clinical laboratory test volumes (El-Gayar and Timsina, 2014; Ferranti et al., 2010; Ashrafi et al., 2014). Accordingly, test consumables and workload can be optimized by short-term reliable forecasting. Large amounts of clinical test volume data are accessible from a variety of data warehouses in hospitals and clinical settings (Anon, 2014g). In this section, the Holt-Winters method (De Gooijer and Hyndman, 2006; Chatfield and Yar, 1988) is used to estimate the laboratory test volume for the province of Alberta. The results are compared to an estimation using an autoregressive integrated moving average (ARIMA) model (Rosner, 2010).

10 Data source and methodology

A huge volume of data was queried to consolidate a data set of clinical laboratory test volumes for all ordered tests over a 40-month span from all Alberta medical facilities. These data were provided by the provincial laboratory utilization office, CLS, University of Calgary (Anon, 2014g).

10.1 Holt-Winters model

The Holt-Winters method is a statistical method of prediction/estimation applied to time series characterized by the existence of trend and seasonality; it is founded on the exponentially weighted moving average method. This is achieved by separating the data into three parts (i.e., level, trend, and seasonal index). The Holt-Winters method has two variants: one for additive seasonality and the other for multiplicative seasonality. The multiplicative Holt-Winters method is described by the following equations:

Level: L_t = \alpha \frac{Y_t}{S_{t-p}} + (1 - \alpha)(L_{t-1} + b_{t-1})  (32.7)

Trend: b_t = \beta (L_t - L_{t-1}) + (1 - \beta) b_{t-1}  (32.8)

Seasonal index: S_t = \gamma \frac{Y_t}{L_t} + (1 - \gamma) S_{t-p}  (32.9)

Forecast: F_{t+k} = (L_t + k \, b_t) \, S_{t+k-p}  (32.10)

where p is the number of data points in the seasonal cycle and Y_t is the observed value at time t. The smoothing factors are α, β, and γ, where 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and 0 ≤ γ ≤ 1. The seasonal index captures the difference between the current level and the data at the corresponding point in the seasonal cycle.

The additive Holt-Winters method is described by the following equations:

Level: L_t = \alpha (Y_t - S_{t-p}) + (1 - \alpha)(L_{t-1} + b_{t-1})  (32.11)

Trend: b_t = \beta (L_t - L_{t-1}) + (1 - \beta) b_{t-1}  (32.12)

Seasonal index: S_t = \gamma (Y_t - L_t) + (1 - \gamma) S_{t-p}  (32.13)

Forecast: F_{t+k} = L_t + k \, b_t + S_{t+k-p}  (32.14)

The initial values for the level, trend, and seasonal index are estimated by carrying out a simple decomposition into trend and seasonal components using the moving averages model on the first period (i.e., 12 months) (Anon, 2014m).
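The additive recursions of Eqs. (32.11)-(32.14) can be sketched as follows. This is a minimal illustration with fixed smoothing factors and a naive initialization from the first two seasonal cycles, not the chapter's implementation (which uses the R statistical package and optimizes α, β, and γ against the one-step-ahead MSE):

```python
# Additive Holt-Winters, Eqs. (32.11)-(32.14): a minimal sketch.
# y: observed series, p: seasonal cycle length, k: forecast horizon.
def holt_winters_additive(y, p, alpha, beta, gamma, k):
    # Naive initialization from the first two cycles (requires len(y) >= 2p).
    level = sum(y[:p]) / p
    trend = (sum(y[p:2 * p]) - sum(y[:p])) / (p * p)
    season = [y[i] - level for i in range(p)]
    for t in range(p, len(y)):
        last_level = level
        # Eq. (32.11): level update after removing the seasonal index.
        level = alpha * (y[t] - season[t % p]) + (1 - alpha) * (level + trend)
        # Eq. (32.12): trend update from the change in level.
        trend = beta * (level - last_level) + (1 - beta) * trend
        # Eq. (32.13): seasonal index update at this point of the cycle.
        season[t % p] = gamma * (y[t] - level) + (1 - gamma) * season[t % p]
    # Eq. (32.14): k-step-ahead forecast.
    return level + k * trend + season[(len(y) - 1 + k) % p]

# Invented series: linear trend 0.5*t plus a period-4 seasonal pattern.
series = [t % 4 + 0.5 * t for t in range(20)]
forecast = holt_winters_additive(series, p=4, alpha=0.5, beta=0.3, gamma=0.4, k=1)
```

The multiplicative variant differs only in dividing by (rather than subtracting) the seasonal index in the level and seasonal updates, and multiplying it into the forecast.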

The goodness-of-fit of the model is evaluated using the mean square error (MSE) measure (Rosner, 2010), defined by the following equation:

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - F_i)^2,  (32.15)

where Y_i is the observed value at time (i), F_i is the forecast value, and n is the total number of points. The MSE represents the goodness-of-fit of the model to the given data. Moreover, the coefficient of determination (R²) (Rosner, 2010) can be used to further examine the goodness-of-fit of the model. It is defined as the relative enhancement in the prediction/estimation of the model compared to the average value of the observations (i.e., representing the data with the mean model), and it indicates the ability of the model to predict/estimate future values.

A zero value of R² indicates that the model does not improve the prediction/estimation over the mean model, and a value of one indicates perfect prediction/estimation. The R² value can be calculated as

R^2 = \frac{\sum_{i=1}^{n} (F_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2},  (32.16)

where \bar{Y} is the average value of the recorded measurements.
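Both goodness-of-fit measures are straightforward to compute. The following sketch implements Eqs. (32.15) and (32.16); the observed and fitted values are invented for illustration:

```python
# Eq. (32.15): mean square error between observed and forecast values.
def mse(y, f):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

# Eq. (32.16): coefficient of determination relative to the mean model.
def r_squared(y, f):
    y_bar = sum(y) / len(y)
    explained = sum((fi - y_bar) ** 2 for fi in f)
    total = sum((yi - y_bar) ** 2 for yi in y)
    return explained / total

observed = [100, 110, 120, 130]   # invented monthly test volumes
fitted = [102, 108, 122, 128]     # invented model output
print(mse(observed, fitted), r_squared(observed, fitted))
```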

10.2 ARIMA model

ARIMA is a hybrid model consisting of two models and a differencing parameter. The first is the autoregressive (AR) model, in which the value of a variable in one cycle is related to its values in previous cycles. The second is the moving average (MA) model, which accounts for the relationship between a variable and the residual errors from previous cycles. Differencing is a technique that transforms the data from non-stationary to stationary by taking the d-th difference of the data (Rosner, 2010).

The ARIMA model can be described by the following equation:

$y_t = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i},$  (32.17)

where μ is a constant, γ_i is the coefficient of the lagged variable at lag i, ε_t is the error due to the current observation, θ_i is the coefficient of the lagged error term at lag i, and p, q are the time lags associated with the AR and MA models, respectively.

The variable y_t is represented by an ARIMA (p,d,q) model, which designates an ARMA model with p autoregressive lags, q moving-average lags, and differencing of order d.
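Equation (32.17) can be illustrated with a short pure-Python recursion that produces one-step-ahead forecasts for given coefficients. The coefficients here are hypothetical inputs; in practice μ, γ_i, and θ_i are estimated from the data, as the R package does. Residuals before the start of the series are taken as zero.

```python
def arma_one_step(y, mu, ar, ma):
    """One-step-ahead forecasts for an ARMA(p, q) model, Eq. (32.17).

    y: observed series; mu: constant; ar: [gamma_1 .. gamma_p];
    ma: [theta_1 .. theta_q]. Returns the forecast F_t for each t.
    """
    p, q = len(ar), len(ma)
    eps = []        # residuals epsilon_t = y_t - F_t
    forecasts = []
    for t in range(len(y)):
        f = mu
        # AR part: weighted sum of the p previous observations
        f += sum(ar[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # MA part: weighted sum of the q previous residuals
        f += sum(ma[i] * eps[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        forecasts.append(f)
        eps.append(y[t] - f)
    return forecasts
```

With μ = 0, a single AR coefficient of 1, and no MA terms, the recursion reduces to a random walk forecast (each forecast is the previous observation), a useful degenerate case for checking the indexing.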

10.2.1 Holt-Winters and ARIMA model assumption testing

The fundamental assumption of the Holt-Winters and ARIMA models is that the time series under test is stationary (De Gooijer and Hyndman, 2006). Stationary means that the series arises from a stable stochastic process whose parameters (i.e., the mean and variance) are constant over time (Grenander and Rosenblatt, 1957). The augmented Dickey-Fuller (ADF) test (Engle and Granger, 1987) is used to test the time series for stationarity.

10.2.2 Model selection criteria

The Akaike information criterion (AIC) (Akaike, 1974) is a technique based on in-sample fit to estimate the likelihood of a model to predict/estimate future values.

A good model is the one that has minimum AIC among all the other models. The AIC can be used to select between the additive and multiplicative Holt-Winters models.

The Bayesian information criterion (BIC) (Stone, 1979) is another criterion for model selection; it measures the trade-off between model fit and model complexity. A lower AIC or BIC value indicates a better fit.

The following equations are used to estimate the AIC and BIC (Stone, 1979; Akaike, 1974) of a model:

$\mathrm{AIC} = -2 \ln(L) + 2k$  (32.18)

$\mathrm{BIC} = -2 \ln(L) + \ln(N) \cdot k$  (32.19)

where L is the value of the likelihood, N is the number of recorded measurements, and k is the number of estimated parameters.
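Given a model's log-likelihood, Eqs. (32.18) and (32.19) are one-liners; a small Python sketch:

```python
import math

def aic(log_lik, k):
    """Akaike information criterion, Eq. (32.18): -2 ln(L) + 2k."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """Bayesian information criterion, Eq. (32.19): -2 ln(L) + ln(N) * k."""
    return -2 * log_lik + math.log(n) * k
```

Because the BIC penalty grows with ln(N), it punishes extra parameters more heavily than the AIC once the series has more than e² ≈ 7.4 observations, which is why BIC tends to prefer the simpler candidate models in Table 32.3.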

The stationary nature of the provincial laboratory test volume series is tested using the R statistical package (Anon, 2014n), which is also used to model the data with the Holt-Winters and ARIMA models. The Holt-Winters parameters α, β, and γ are calculated by minimizing the one-step-ahead error (i.e., MSE) of each model. A set of ARIMA models is used to compare against the Holt-Winters models in estimating the provincial test volume, with minimum AIC and BIC values as model selection criteria. The optimal model is selected based on the highest R² and the minimum AIC and BIC.
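The idea of choosing smoothing parameters by minimizing the one-step-ahead MSE can be sketched with simple exponential smoothing, which has a single parameter α (rather than the full Holt-Winters triple, to keep the example short). The grid of candidate values is illustrative; R instead uses a numerical optimizer.

```python
def ses_one_step_mse(y, alpha):
    """One-step-ahead MSE of simple exponential smoothing with parameter alpha."""
    level = y[0]                      # seed the level with the first observation
    sse = 0.0
    for t in range(1, len(y)):
        sse += (y[t] - level) ** 2    # forecast for time t is the previous level
        level = alpha * y[t] + (1 - alpha) * level
    return sse / (len(y) - 1)

def best_alpha(y, grid=None):
    """Pick the smoothing parameter that minimizes the one-step-ahead MSE (grid search)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda a: ses_one_step_mse(y, a))
```

On a strongly trending series the search favors α near 1 (little smoothing, so the level tracks the trend), whereas near-zero parameters, as found for β and γ in Table 32.5, correspond to heavy smoothing.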

10.3 Stationary testing

The ADF test is based on a null hypothesis test. The more negative the test statistic, the stronger the rejection of the null hypothesis, which indicates the stationary nature of the time series. Table 32.2 shows the result of the ADF test and the associated ρ value.

Table 32.2

ADF Test Output

Time Series             | Dickey-Fuller Statistic | ρ (significance ≤ 0.01)
Provincial test volumes | −5.6335                 | 0.01

The auto-correlation function (ACF) is a measure of the degree of similarity between the original time series and a time-lagged version of itself. The ACF can be used to evaluate the stationary nature of a time series: high similarity yields high correlation, and a slowly decaying ACF indicates a less stationary series. Figure 32.5 shows the ACF of the provincial test volumes time series, which reflects the stationary nature of the recorded data when correlated with itself at different time lags.

Figure 32.5 ACF for the provincial test volume time series.
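The quantity plotted in Figure 32.5 can be computed directly; a minimal pure-Python version of the sample autocorrelation, using the standard biased 1/n covariance normalization (the same convention as R's acf()):

```python
def acf(y, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(y)
    y_bar = sum(y) / n
    c0 = sum((v - y_bar) ** 2 for v in y) / n          # lag-0 autocovariance
    out = []
    for k in range(max_lag + 1):
        ck = sum((y[t] - y_bar) * (y[t + k] - y_bar)   # lag-k autocovariance
                 for t in range(n - k)) / n
        out.append(ck / c0)                            # normalize to [-1, 1]
    return out
```

A rapidly decaying (or oscillating) ACF like this supports treating the series as stationary; a slow, near-linear decay would instead suggest differencing before fitting an ARMA model.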

10.4 ARIMA model selection

The ADF test suggests the time series is stationary; thus, the candidate ARIMA models reduce to ARMA models with only AR and MA terms with p and q lags, respectively, and no differencing parameter d. The minimum AIC or BIC is used to choose the best model.

Table 32.3 lists the AIC and BIC for the candidate ARIMA models along with the model complexity (i.e., the number of parameters associated with each model). The selected model is ARIMA (1,0,1), and it is used for the provincial test volume time series estimation.

Table 32.3

ARIMA Models Used to Model the Provincial Test Volumes with the Associated AIC and BIC

Model (provincial test volumes) | Calculated Parameters | AIC     | BIC
ARIMA (1,0,1)                   | 2                     | 1125.8  | 1132.5
ARIMA (1,0,2)                   | 3                     | 1128.4  | 1136.9
ARIMA (2,0,1)                   | 3                     | 1127.5  | 1135.9
ARIMA (2,0,2)                   | 4                     | 1145.16 | 1155.3
ARIMA (3,0,3)                   | 6                     | 1128.7  | 1142.2

The models are arranged according to the complexity of the model (i.e., number of parameters to be estimated).

10.5 Performance comparison and model selection

Table 32.4 shows the MSE and R² values recorded for every model. The Holt-Winters multiplicative and additive models have R² = 0.8791 and 0.8619, respectively, which means that they perform better than the mean model; however, they have MSE = 9.05×10¹⁰ and 8.9×10¹⁰, respectively. On the other hand, ARIMA (1,0,1) has a better MSE (5.1×10¹⁰) than both Holt-Winters models. However, the ARIMA model has the smallest R² = 0.49541. The main reason for this is that the ARIMA model tends to memorize the repeating monthly pattern, resulting in a poor estimation of the time series.

Table 32.4

Holt-Winters and ARIMA (1,0,1) Model Performance

Time Series             | MSE (HWA) | MSE (HWM)  | MSE (ARIMA) | R² (HWA) | R² (HWM) | R² (ARIMA)
Provincial test volumes | 8.9×10¹⁰  | 9.05×10¹⁰  | 5.1×10¹⁰    | 0.8619   | 0.8791   | 0.49541

The mean square error (MSE) signifies the goodness of the model to fit the data. The R2 value signifies the goodness of the model to predict the variance in the data (Holt-Winters multiplicative “HWM”, Holt-Winters additive “HWA”, and ARIMA models).

The multiplicative and additive Holt-Winters models adapt the estimation of the provincial test volume time series according to the previous samples, weighted by the exponential smoothing parameters (i.e., α, β, and γ). The smoothing parameters are shown in Table 32.5.

Table 32.5

Smoothing Factors Used in the Holt-Winters Multiplicative and Additive Models

Time Series             | Multiplicative Model (α, β, γ) | Additive Model (α, β, γ)
Provincial test volumes | 0.32598, 0, 0                  | 0.33989, 0, 0

11 Results and discussion

Figures 32.6, 32.7, and 32.8 show the performance of the Holt-Winters multiplicative, Holt-Winters additive, and ARIMA (1,0,1) models in fitting the provincial volume time series and estimating future values with a 95% confidence interval. The observed data are represented by the black curve, the fitted/estimated data are shown in red, and the 95% confidence intervals are shown in blue. The 95% confidence intervals signify the precision of the model.

Figure 32.6 Holt-Winters multiplicative model for the provincial test volume time series.
Figure 32.7 Holt-Winters additive model for the provincial test volume time series.
Figure 32.8 ARIMA (1,0,1) model for the provincial test volume time series.

The smoothing parameters α, β, and γ are estimated by minimizing the one-step-ahead MSE. If these parameters are near zero, the model puts more weight on previous samples, resulting in more smoothing. If they are near 1, the model puts less weight on previous samples and thus applies less smoothing.

The minimum AIC and BIC information criteria are used as model selection criteria whenever the performance parameters (i.e., MSE and R²) are close within the model ensemble. Table 32.6 shows the AIC and BIC values for the Holt-Winters multiplicative and additive models, along with the selected model. When R² ≤ 0.5, it may be better to choose the mean model to fit and estimate the time series (De Gooijer and Hyndman, 2006), which is the case for the ARIMA (1,0,1) model with R² = 0.49541.

Table 32.6

Information Criteria of Holt-Winters Models

Time Series             | Multiplicative Model AIC | Multiplicative Model BIC | Additive Model AIC | Additive Model BIC | Selected Model
Provincial test volumes | 1081.416                 | 1110.127                 | 1075.947           | 1104.658           | Holt-Winters additive

The best model is selected based on the minimum AIC and BIC.

Table 32.6 shows that the Holt-Winters additive model is a better model to fit and estimate the provincial test volume time series. This means that the level, trend, and the error are additive in nature. This is clearly shown in Figure 32.9, where the estimated values by the additive model (i.e., xhat), level, trend, and seasonality show additive behavior.

Figure 32.9 The individual component of the Holt-Winters additive model for fitting the provincial test volume time series. The level and trend of the time series show additive behavior.

Figure 32.10 shows the histogram of the residual errors of the Holt-Winters additive model, with a normal distribution curve fitted to the residuals. The errors are almost centered on a zero mean, which suggests that they are uncorrelated and normally distributed. This supports the choice of the additive Holt-Winters model to fit and estimate the data.

Figure 32.10 The residual error histogram of the additive Holt-Winters model. A normal distribution curve is fitted to the histogram showing that these residual errors are uncorrelated (centered on 0 mean).

12 Limitations

Figure 32.7 shows the 6-month estimation horizon of the additive Holt-Winters model. The 6-month estimates and their upper and lower limits (i.e., precision) are shown in Table 32.7. As the estimation horizon moves forward, the 95% confidence interval becomes wider, resulting in poorer precision. This drawback results from the additive Holt-Winters model being static (i.e., the exponential smoothing parameters are calculated only once from the historical data), a restriction imposed by the limited computational resources as the time series grows over time. A good solution is to implement an adaptive Holt-Winters model that uses the error between the estimated and exact values of future points to update the smoothing parameters of the model; however, this may consume a good deal of computational resources.

Table 32.7

Estimation of the Provincial Test Volume for 6 Months, Starting August 2014, Using the Additive Holt-Winters Model

Date           | Estimated Provincial Test Volume | Upper Limit | Lower Limit
August 2014    | 5,943,146                        | 6,111,962   | 5,774,330
September 2014 | 6,101,442                        | 6,279,742   | 5,923,141
October 2014   | 6,368,317                        | 6,555,622   | 6,181,013
November 2014  | 6,169,616                        | 6,365,512   | 5,973,721
December 2014  | 5,674,618                        | 5,878,744   | 5,470,492
January 2015   | 6,310,749                        | 6,522,786   | 6,098,713

13 Conclusion and future work

This chapter presents a user-centered analytics framework as a BI platform that can be used in a clinical laboratory setting to come to an informed conclusion on questions such as "What is the usage pattern of a clinical test?" and "Can we estimate the clinical test volumes for the next month?". We chose the empirical route in order to understand and describe a BI framework for a clinical laboratory setting. The practice is based on conducting real-life case studies, identifying the commonalities between them, and extracting reusable framework elements that can serve as a BI framework to help clinical laboratory managers support their informed decisions.

This BI platform is composed of multiple services that can be used by clinical laboratory managers (e.g., CLS managers). The BI is based on the MapReduce framework, and an open-source implementation (Hadoop) is used to process the massive clinical laboratory data sets to form the basic Big Data analytics tools. The user-centered analytics framework is designed with clinical laboratory managers in mind, resulting in a BI platform with the necessary functionalities for different laboratory manager perspectives.

The two use-case studies present the value of the BI analytics framework in managing clinical laboratory test volumes (i.e., usage pattern visualization and test volume estimation) utilizing the clinical databases available at the data warehouse. The analytics presented in the two use-cases are tools that can be reused by managers of different clinical laboratory settings to visualize performance and estimate resources (e.g., a logistics manager, who estimates specimen handling time; a financial manager, who estimates employee rewards based on facility-specific key performance indicators).

Big Data tools (Olston et al., 2008; Anon, 2014l; Chen et al., 2014; Hortonworks, 2014; Anon, 2014a; Anon, 2014h) represent a development in Big Data analysis driven by the advent of large-scale databases. The Hadoop platform and the MapReduce programming framework already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing.

This study makes the following contributions to the literature:

 This is the first attempt to address a user-centered analytics framework based on Big Data analytics to process massive clinical data sets.

 The application of a novel visualization tool based on the z-score to visualize the clinical test utilization.

 The application of Holt-Winters method to estimate the provincial test volume of the province of Alberta, Canada. To the best of our knowledge, this is the first attempt to develop demand-estimation analytics for the clinical laboratory test volumes in Alberta.

Future research on the clinical Big Data analytics BI framework should accentuate the modeling of whole-business interacting subprocesses (e.g., clinical test utilization pattern, test procedures, specimen collection/handling, etc.). This can be assembled using inexpensive clusters of commodity hardware and the proper tool to construct a suitable processing framework for handling massive clinical data.

References

Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.

Anon. IBM InfoSphere Warehouse 10.1 [Online]. 2014k. Available: http://www-01.ibm.com/support/docview.wss?uid=swg21585911.

Anon. Laboratory Service Calgary - Laboratory Service Southern Alberta | Calgary Laboratory Services [Online]. 2014l. Available: http://www.calgarylabservices.com/.

Anon. The Platform for Big Data and the Leading Solution for Apache Hadoop in the Enterprise - Cloudera [Online]. 2014m. Available: http://www.cloudera.com/content/cloudera/en/home.html.

Anon. The R Project for Statistical Computing [Online]. 2014n. Available: http://www.r-project.org/.

Anon. AWS | Amazon Elastic Compute Cloud (EC2) - Scalable Cloud Hosting [Online]. 2014a. Available: http://aws.amazon.com/ec2/.

Anon. Discrete Event Simulation & Business Process Modeling Software | Manufacturing, Supply Chain & Healthcare Simulation Software | Arena Simulation [Online]. 2014b. Available: https://www.arenasimulation.com/.

Anon. Discrete Event — AnyLogic Simulation Software [Online]. 2014c. Available: http://www.anylogic.com/discrete-event-simulation.

Anon. IBM - Cognos Business Intelligence [Online]. 2014d. Available: http://www-03.ibm.com/software/products/en/business-intelligence.

Anon. IBM InfoSphere BigInsights Quick Start Edition: Hadoop [Online]. 2014e. Available: http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html.

Anon. MAPR [Online]. 2014f. Available: http://www.mapr.com/products/m3.

Anon. Oracle Big Data Connectors | Oracle Database to Hadoop [Online]. 2014g. Available: http://www.oracle.com/us/products/database/big-data-connectors/overview/index.html.

Anon. Personal Information Protection and Electronic Documents Act [Online]. 2014h. Available: http://laws-lois.justice.gc.ca/eng/acts/p-8.6/.

Anon. The Apache Software Foundation [Online]. 2014i. Available: http://apache.org/.

Anon. Ubuntu Juju - Automate your cloud infrastructure [Online]. 2014j. Available: https://juju.ubuntu.com/.

Ashley T, Pasker P, Beresford J. How much clinical investigation? Lancet. 1972;299:890–893.

Ashrafi N, Kelleher L, Kuilboer J-P. The Impact of Business Intelligence on Healthcare Delivery in the USA. Interdiscipl. J. Inform. Knowl. Manag. 2014;9.

Ashton CM, Petersen NJ, Souchek J, Menke TJ, Yu H-J, Pietz K, Eigenbrodt ML, Barbour G, Kizer KW, Wray NP. Geographic variations in utilization rates in Veterans Affairs hospitals and clinics. N. Engl. J. Med. 1999;340:32–39.

Axt-Adam P, Van Der Wouden JC, Van Der Does E. Influencing behavior of physicians ordering laboratory tests: a literature study. Med. Care. 1993;784–794.

Bates DW, Kuperman GJ, Rittenberg E, Teich JM, Fiskio J, Ma’luf N, Onderdonk A, Wybenga D, Winkelman J, Brennan TA. A randomized trial of a computer-based intervention to reduce utilization of redundant laboratory tests. Am. J. Med. 1999;106:144–150.

Bell DD, Ostryzniuk T, Verhoff B, Spanier A, Roberts DE. Postoperative laboratory and imaging investigations in intensive care units following coronary artery bypass grafting: a comparison of two Canadian hospitals. Can. J. Cardiol. 1998;14:379–384.

Bennett ST. Role and Responsibilities of the Laboratory Director. Laboratory Hemostasis: Springer; 2007.

Bryant RE. Data-Intensive Supercomputing: The Case for Disc. 2007.

Chatfield C, Yar M. Holt-Winters forecasting: some practical issues. The Statistician. 1988;129–140.

Chen W-P, Hung C-L, Tsai S.-J.J., Lin Y-L. Novel and efficient tag SNPs selection algorithms. Biomed. Mater. Eng. 2014;24:1383–1389.

Coulouris GF, Dollimore J, Kindberg T. Distributed Systems: Concepts and Design, Pearson Education. 2005.

Crist-Grundman D, Mulrooney G. Effective workforce management starts with leveraging technology, while staffing optimization requires true collaboration. Nurs. Econ. 2010;29:195–200.

Daniels M, Schroeder SA. Variation among physicians in use of laboratory tests II. Relation to clinical productivity and outcomes of care. Med. Care. 1977;482–487.

Davenport TH. Enterprise Analytics: Optimize Performance, Process, and Decisions Through Big Data. Pearson Education; 2013.

De Gooijer JG, Hyndman RJ. 25 years of time series forecasting. International journal of forecasting. 2006;22:443–473.

De Oliveira Branco M. Distributed Data Management for Large Scale Applications. University of Southampton; 2009.

Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Comm. ACM. 2008;51:107–113.

Di Matteo L, Di Matteo R. The Fiscal Sustainability of Alberta's Public Health Care System. University of Calgary: School of Public Policy; 2009.

El-Gayar O, Timsina P. Opportunities for Business Intelligence and Big Data Analytics in Evidence Based Medicine. In: 47th Hawaii International Conference on System Sciences (HICSS); IEEE; 2014:749–757.

Engle RF, Granger CW. Co-integration and error correction: representation, estimation, and testing. Econometrica: journal of the Econometric Society. 1987;251–276.

Ferranti JM, Langman MK, Tanaka D, Mc Call J, Ahmad A. Bridging the gap: leveraging business intelligence tools in support of patient safety and financial effectiveness. J. Am. Med. Inform. Assoc. 2010;17:136–143.

Foshay N, Kuziemsky C. Towards an implementation framework for business intelligence in healthcare. International Journal of Information Management. 2014;34:20–27.

Grenander U, Rosenblatt M. Statistical Analysis of Stationary Time Series. 1957.

Held L, Bove DS. Applied Statistical Inference: Likelihood and Bayes. Berlin, Heidelberg: Springer Berlin Heidelberg; 2014.

Hortonworks. Hortonworks [Online]. 2014. Available: http://hortonworks.com/.

Kessel KA, Bohn C, Engelmann U, Oetzel D, Bougatf N, Bendl R, Debus J, Combs SE. Five-year experience with setup and implementation of an integrated database system for clinical documentation and research. Comput. Methods Programs Biomed. 2014;114:206–217.

Kiechle FL, Arcenas RC, Rogers LC. Establishing benchmarks and metrics for disruptive technologies, inappropriate and obsolete tests in the clinical laboratory. Clin. Chim. Acta. 2014;427:131–136.

Lewandrowski K. Managing utilization of new diagnostic tests. Clinical leadership & management review: the journal of CLMA. 2002;17:318–324.

Macmillan D. Calculating cost savings in utilization management. Clin. Chim. Acta. 2014;427:123–126.

McDonald CJ, Overhage JM, Dexter P, Takesue BY, Dwyer DM. A framework for capturing clinical data sets from computerized sources. Ann. Intern. Med. 1997;127:675–682.

Monsen A, Gjelsvik R, Kaarbøe O, Haukland H, Sandberg S. Appropriate use of laboratory tests- -medical aspects. Tidsskrift for den Norske laegeforening: tidsskrift for praktisk medicin, ny raekke. 2008;128:810–813.

Murphy J, Henry JB. Effective utilization of clinical laboratories. Hum. Pathol. 1978;9:625–633.

Naugler C. Laboratory test use and primary care physician supply. Can. Fam. Physician. 2013;59:e240–e245.

Naugler C. A perspective on laboratory utilization management from Canada. Clin. Chim. Acta. 2014;427:142–144.

Negash S. Business intelligence. The Communications of the Association for Information Systems. 2004;13:54.

Olson M. Hadoop: scalable, flexible data storage and analysis. IQT Quarterly. 2010;1:14–18.

Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: A Not-So-Foreign Language for Data Processing. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data; ACM; 2008:1099–1110.

Peters SG, Buntrock JD. Big Data and the Electronic Health Record. J. Ambul. Care Manage. 2014;37:206–210.

Peyton Jones SL. The Implementation of Functional Programming Languages (Prentice-Hall International Series in Computer Science). Prentice-Hall, Inc; 1987.

Pine M, Sonneborn M, Schindler J, Stanek M, Maeda JL, Hanlon C. Harnessing the power of enhanced data for healthcare quality improvement: lessons from a Minnesota Hospital Association Pilot Project. Journal of healthcare management/American College of Healthcare Executives. 2011;57:406–418 discussion 419–20.

Plebani M. The clinical importance of laboratory reasoning. Clin. Chim. Acta. 1999;280:35–45.

Plebani M, Zaninotto M, Faggian D. Utilization management: a European perspective. Clin. Chim. Acta. 2014;427:137–141.

Powell EC, Hampers LC. Physician variation in test ordering in the management of gastroenteritis in children. Arch. Pediatr. Adolesc. Med. 2003;157:978–983.

Publishing R. Risk Analysis Software, Monte Carlo Simulation Software, Probabilistic Event Simulation - RENO [Online]. Available: http://www.reliasoft.com/reno/. 2014.

Rajaraman A, Ullman JD. Mining of Massive Datasets. Cambridge University Press; 2012.

Reller LB, Weinstein MP, Peterson LR, Hamilton JD, Baron EJ, Tompkins LS, Miller JM, Wilfert CM, Tenover FC, Thomson RB. Role of clinical microbiology laboratories in the management and control of infectious diseases and the delivery of health care. Clin. Infect. Dis. 2001;32:605–610.

Rosner B. Fundamentals of Biostatistics. Cengage Learning; 2010.

Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop Distributed File System. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST); IEEE; 2010:1–10.

Simpson R, Marichal M, Uccini S. European society of pathology statement on minimal requirements for a pathology laboratory. Virchows Arch. 2000;436:509–526.

Stone M. Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society. Series B (Methodological). 1979;276–278.

Wennberg JE. Practice variations and health care reform: connecting the dots. Health Aff. (Millwood). 2004;23(Suppl. Variation):VAR-140.

White T. Hadoop: The Definitive Guide. O'Reilly Media, Inc; 2012.

Zhi M, Ding EL, Theisen-Toupal J, Whelan J, Arnaout R. The landscape of inappropriate laboratory testing: a 15-year meta-analysis. PLoS One. 2013;8:e78962.
