Mary E. Helander
Data Science Department, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Modern solution methodology offers a set of macro- and micro-practices that help a practitioner systematically maximize the odds of a successful analytics project outcome.
Methodology is all about approach. Every discipline, whether it be applied or theoretical in nature, has methodologies. While there is no one standard analytics solution methodology, common denominators of solution methodologies are the shared purposes of being systematic, creating believable results, and being repeatable. That is to say, a solution methodology helps practitioners and researchers alike to progress efficiently toward credible results that can be reproduced.
Whether we mean a methodology at a macro- or microlevel, analytics practitioners at all stages of experience generally rely on some form of methodology to help ensure successful project outcomes. The goal of this chapter is to provide an organized view of solution methodologies for the analytics practitioner. We begin by observing that, in today's practice, there does not appear to be a shared understanding of what is meant by the word solution.
In its purest form, a solution is an answer to a problem. A problem is a situation in need of a repair, improvement, or replacement. A problem statement is a concise description of that situation. Problem definition is the activity of coming up with the problem statement. Problem-solving, in its most practical sense, involves the collective actions that start with identifying and describing the problematic situation, followed by systematically identifying potential solution paths, selecting a best course of action (i.e., the solution), and then developing and implementing the solution. Problem-solving is, by far, one of the most valuable skills an analytics practitioner can hone, and is even an important life skill!
Most of us first encountered problem-solving as students exposed to mathematics at primary, secondary, and collegiate education levels, where a problem–for example, given two points in a plane, (x1, y1) and (x2, y2), find the midpoint–is more often than not stated explicitly. The solution can be found with some geometry and algebra wrangling. See Eves [1]. If asked to solve this problem for homework or on an exam, we probably did not get full credit unless we showed our work. This shown work we can think of as the solution methodology for the problem. This sample math problem also illustrates that there are different ways to solve a problem: if presented with the same two points in polar coordinates, one may proceed with an entirely different approach, applying methods from trigonometry.
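To make the two solution paths concrete, here is a minimal sketch (in Python, with hypothetical function names) of the midpoint found directly from Cartesian coordinates and, alternatively, from the same two points given in polar form:

```python
import math

def midpoint_cartesian(p1, p2):
    # Points given as (x, y) tuples; average each coordinate.
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)

def midpoint_polar(q1, q2):
    # Points given as (r, theta) tuples; convert to Cartesian via
    # trigonometry, average, and convert the midpoint back to polar.
    x1, y1 = q1[0] * math.cos(q1[1]), q1[0] * math.sin(q1[1])
    x2, y2 = q2[0] * math.cos(q2[1]), q2[0] * math.sin(q2[1])
    xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
    return (math.hypot(xm, ym), math.atan2(ym, xm))
```

Both functions locate the same point; only the solution methodology, the sequence of steps, differs with the form in which the given data are presented.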
Similarly, in analytics practice, the path to a solution is generally not unique. For example, Ref. [2] describes a study of the variation in approach (and results) by 29 independent analytics teams working on the same data and problem statement. The path to a solution may involve a straightforward set of steps, or it may need some clever new twist; the method chosen may depend on the form of the available data, the assumptions, and context. A big difference between the problems that we encounter in school and the problems that we encounter in real life is usually that in real life, we are rarely presented with a clean problem statement with, for example, the given information. Still, writing down the steps we use to get from the problem statement to the solution is generally a good idea. In most cases, we can write down steps that are general enough so that we're able to find solutions to new and challenging problems.
What do we mean by a “solution”? To the purist, a solution is this: The correct answer to a problem. It is what you write down on your exam in response to a problem statement. If you get the answer right, and if you have adequately satisfied the requirement of showing your work, you earn full credit for the solution. In some cases, you may get the wrong answer, but if some of your shown work is okay, you may still earn partial credit. Similarly, in practice, analytics that produce unexpected or flawed results may earn their creators recognition for solid work that has gone into the project, and practitioners may get the opportunity to revise these analytics, just as authors of peer-reviewed papers may get the opportunity to make major revisions to their work during the review process. Without a transparent methodology, however, it is more difficult for evaluators of a project to appreciate the practitioners' findings and effort when they are presented with results that are unexpected or questionable.
Methodological steps are analogous to what we mean more generally by a solution methodology or approach. When we're starting out, the steps give us an approximate roadmap to follow in our analytics project. When we're done, if we've followed a roadmap and have perhaps even documented the steps, then it is easier to trace these steps, to repeat them, and to explain to stakeholders and potential users or sponsors how our solution was derived. It might be that the steps themselves are so innovative that we patent some aspect of the approach, or perhaps we find that publishing about some aspect of the project, the technology, or the outcome is useful for sharing the experience with others and promoting best practices. In any of these cases, having followed some methodology helps tremendously in describing and building credibility into whatever it was that we did to reach the solution.
Experienced analytics professionals already know this all too well: In practice, new projects rarely, if ever, start out with a well-defined problem statement. The precision of a problem statement in the real world will never be as clearly articulated as it was in our math classes in grade school, high school, and college. Indeed, there may be contrasting and even conflicting versions of the underlying problem statement for a complex system in a real-world analytics project, particularly when teams of people with varying experiences, backgrounds, opinions, and observations come together to collaborate. Using our sample math problem to illustrate, this would be equivalent to some people's thinking that the problem is to find a point solution (x, y), while others might think that the solution should be defined by the intersection of two or more lines, or perhaps that it should be defined by a circle with a very small radius that covers the point of intersection, and so on. The point is that the solution can be relevant to the interpretation of the problem, and thus, when the problem is not defined for us precisely–and even sometimes when the problem is–people may interpret it in different ways, which may lead to entirely different solution approaches.
An important message here is that time and effort up-front on a problem statement is time well spent, as it will help clarify a direction and create consistent understanding of the practitioners' end goals.
In today's commercial world of software and services, the word solution may be used to describe a whole collection of technologies that address an entire class of problems. The problems being solved by these commercial technologies may not be specifically defined in the ways we have been used to seeing problems defined in school. For example, a commercial supply chain software provider may have a suite of solutions that claim to address all the needs of a retail business.
In other words, in today's world of commercial software and services, the word solution has become synonymous with the word product. In fact, in some circles, it is not cool to say that the solution solves a problem because this suggests that there is a problem. Problems, at least in our modern Western capitalist culture, are no big deal. Therefore, we don't really have them. However, we do have plenty of solutions, especially when it comes to commercial products. So, we begin this chapter by pointing out the elephant in many project conference rooms: Problems are not sexy, but solutions are! While this line of thinking is indeed the more positive and inspiring outlook, and while it makes selling solutions easier, unfortunately, it often leads to implementing the wrong solutions, or to failing altogether at solution implementation. Why? There are many reasons, but one of the most obvious and common reasons is that ill-defined, poorly understood, or denied problems are difficult–if not impossible–to actually solve.
The previous section, hopefully, has left the reader with a strong impression that recognizing the underlying problem is a first step toward solving it. It is in this spirit that this chapter introduces the notions of macro- and microsolution methodologies for analytics projects and organizes their content around them. Macro-methodologies, as we shall see in a section devoted to their description, provide the more general project path and structure. Four alternative macro-methodologies will be described in that section with this important caveat: Any one of them is good for practitioners to use; the most important thing is for practitioners to follow some macro-methodology, even if it is a hybrid.
Micro-methodology, on the other hand, is the collection of approaches used to apply specific techniques to solve very specific aspects of a problem. For every specific technique, there are numerous textbooks (and sometimes countless papers) describing its theory and detailed application. There is no way we will be able to cover all possible problem-solving techniques, which is not the purpose of this chapter. Instead, this chapter covers an array of historically common techniques that are relevant to INFORMS and to analytics practitioners in order to illustrate micro-solution methodology, that is, to expose, compare–and in some cases, contrast–the approaches used.
Figure 5.1 provides an illustration of the chapter topic breakdown. Note that all solution methodology descriptions in this chapter, both at macro- and microlevels, are significantly biased in favor of operations research and management sciences. This is so because this chapter appears in an analytics book published in affiliation with INFORMS, the international professional organization aimed at promoting operations research and management science. The stated purpose of INFORMS is “to improve operational processes, decision-making, and management by individuals and organizations through operations research, the management sciences, and related scientific methods.” (see the INFORMS Constitution [3].)
With the rise in the use of quantitative methods, particularly OR and MS, to solve problems in the business world, the business analytics community has adopted a paradigm that classifies analytics in terms of descriptive, predictive, and prescriptive categories. These correspond respectively to analytics that help practitioners to understand the past (i.e., describe things), to understand the future (i.e., predict things), and to make recommendations about the present (i.e., prescribe things). The author of this chapter believes that the paradigm originated at SAS [4], one of the most well-known analytics software and solutions companies today.
Granted, many disciplines today are using analytics, and the descriptive–predictive–prescriptive analytics paradigm has no doubt helped evangelize analytics to those disciplines. However, it should be noted that we explicitly have chosen to organize this chapter directly around macro- and micro-methodologies, and within the micro-category, exploratory, data-independent, and data-dependent technique categories. While intending to complement the “descriptive–predictive–prescriptive” analytics paradigm, this organization emphasizes that solution techniques do not necessarily fall neatly into one of the paradigm bins. Instead, techniques in common categories tend to have threads based on underlying problem structure, model characteristics, and relationships to data, as opposed to what the analytics project outcome may drive (i.e., to describe, to predict, or to prescribe). From the perspective of analytics solution methodologies, this can also help avoid an unintentional marginalization of techniques that fall into the descriptive analytics category.
After reading this chapter, a practitioner will
As described in the Introduction, a macro-solution methodology comprises general steps for an analytics project, while a micro-methodology is specific to a particular type of technical solution. In this section, we describe macro-methodology options available to the analytics practitioner.
Since a macro-methodology provides a high-level project path and structure, that is, steps and a potential sequence for practitioners to follow, practitioners can use it as an aid to project planning and activity estimation. Within the steps of a macro-methodology, specific micro-methodologies may be identified and planned, aiding practitioners in the identification of specific technical skills and even named resources that they will need in order to solve the problem.
Four general macro-methodology categories are covered in this section:
We reiterate here that there is some overlap in these methodologies and that the most important message for the practitioner is to follow a macro-solution methodology. In fact, even a hybrid will do.
The scientific research methodology, also known as the scientific method [5], has very early roots in science and inquiry. While formally credited to Francis Bacon, its inspiration likely dates back to the time of the ancient Greeks and the famous scholar and philosopher Aristotle [6].
This methodology has served humankind well over the years, in one form or another, and has been particularly embraced by the scientific disciplines, where theories often are born from interesting initial observations. In the early days, and even until relatively recently (i.e., within the last 20 years–merely a blip in historical time!), digital data were not plentiful for researchers to study; data were a scarce resource and were expensive to obtain. Most data were planned, that is, collected from human observation, and then treated as a limited, valuable resource. Because of its value both to researchers' eventual conclusions and to the generalizations that they are able to make based upon their findings, the scientific methodology related to data collection has evolved into a specialty in and of itself within applied statistics: experimental design. In fact, many modern-day graduate education programs in the United States require that students take a course related to research methodology, either as a prerequisite for graduate admission or as part of their graduate coursework, so that graduate students learn well-established systematic steps for research, sometimes specifically for setting up experiments and handling data, to support their MS or PhD theses. Such requirements are common in the social sciences, education, engineering, mathematics, and computer science–that is, they are not limited strictly to the sciences.
The general steps of the scientific method, with annotations to show their alignment with a typical analytics project, are the following:
As is evident here, the scientific method is a naturally iterative process designed to be adaptive and to support systematic progress that gets more and more specific as new knowledge is learned. When followed and documented, it allows others to replicate a study in an attempt to validate (or refute) its results. Note that reproducibility is a critical issue in scientific discovery and is emerging as an important concern with respect to data-dependent methods in analytics (see Refs [7,8]).
Peer review in research publication often assumes that some derivative of the scientific method has been followed. In fact, some research journals mandate that submitted papers follow a specific outline that coincides closely with the scientific method steps. For example, see Ref. [9], which recommends the following outline: Introduction, Methods, Results, and Discussion (IMRAD). While the scientific method and IMRAD for reporting may not eliminate the problem of false discovery (see, for example, Refs [10,11]), they can increase the chances of a study being replicated, which in turn seems to reduce the probability of false findings as argued by Ioannidis [12].
Because of this relationship to scientific publishing, and to research in general, the scientific method is recommended for analytics professionals who plan eventually to present the findings of their work at a professional conference or who might like the option of eventually publishing in a peer-reviewed journal. This methodology is also recommended for analytics projects that are embedded within research, particularly those where masters and doctoral theses are required, or in any research project where a significant amount of exploration (on data) is expected and a new theory is anticipated. In summary, the scientific method is a solid choice for research-and-discovery-leaning analytics projects as well as any engagement that is data exploratory in nature.
Throughout this chapter, analytics solution methodology is taken to mean the approach used to solve a problem that involves the use of data. It is worth bringing this point up in this section again because, as mentioned in the Introduction, our perspective assumes an INFORMS audience. Thus, we are biased toward these methodology descriptions for analytics projects that will be applying some operations research/management science techniques. While it was natural to start this macro-section with the oldest, most established, mother of all exploratory methodologies (the scientific method of the last section), it is natural to turn our attention next to the macro-method established in the OR/MS practitioner community.
In general, one may find some variant of this project structure in introductory chapters of just about any OR/MS textbook, such as Ref. [13], which is in its fourth edition, or Ref. [14], which was in its seventh edition in 2002. (There have been later editions, which Dr. Hillier published alone and with other authors after the passing of Dr. Lieberman.)
Most generally, the OR project methodology steps include some form of the following progression:
Collecting data is a key part of early OR project methodology, and is intricately coupled with the problem definition step, as noted in Ref. [14]. In modern analytics projects, data collection generally means identifying and unifying digital data sources, such as transactional (event) data (e.g., from an SAP system), entity attribute data, process description data, and so on. Moving data from the system of record and transforming it into direct insights or reforming it for model input parameters are important steps that may be overlooked or underestimated in terms of effort needed.
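As a small illustration of that reforming step, the following sketch (in Python, with hypothetical record fields and a made-up function name) rolls raw transactional event data up into per-SKU daily demand totals, the kind of model input parameter such a step might produce:

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional (event) records, as might be exported
# from a system of record.
transactions = [
    {"day": date(2023, 1, 2), "sku": "A", "qty": 5},
    {"day": date(2023, 1, 2), "sku": "B", "qty": 2},
    {"day": date(2023, 1, 3), "sku": "A", "qty": 7},
]

def daily_demand(records):
    # Reform raw event data into per-(day, SKU) demand totals,
    # a typical model input parameter.
    totals = defaultdict(int)
    for r in records:
        totals[(r["day"], r["sku"])] += r["qty"]
    return dict(totals)
```

Even a transformation this simple repays documentation: it records exactly how the system-of-record data became the model's parameters.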
As noted earlier, we live in a world where “solutions” are sexy and “problems” are not–further adding to the challenge and importance of this step. In comparison with the scientific method of the previous section, this step intersects most closely with the activities and purposes of A.1, A.2, and A.3.
It is not surprising that the OR project method, being exploratory in nature, is somewhat of a derivative of the scientific method. As Hillier and Lieberman point out in the introductory material of Ref. [16], operations research has a fairly broad definition but in fact gets its name from research on operations. The study objects of the research are “operations,” or sometimes “systems.” These operations and systems are often digital in their planning and execution, so abundant data now exist with which to recreate, model, and experiment with them. In other words, these observable digital histories are rich in data that allow modeling to begin very quickly. Unfortunately, the ability to jump right into modeling, analysis, and conclusions often means skipping over early methodological steps, particularly in the area of problem definition.
The cross-industry standard process for data mining [17,18], known as CRISP or CRISP-DM, is credited to Colin Shearer, who is considered to be a pioneer in data mining and business analytics [19]. This methodology heavily influences the current practical use of SPSS (Statistical Package for the Social Sciences), a software package with its roots in the late 1960s that was acquired by IBM in 2009 and that is currently sold as IBM's main analytics “solution” [18].
As an aside, note that SAS and SPSS are commercial packages that were born in about the same era and that were designed to do roughly the same sort of thing–the computation of statistics. SAS evolved as the choice vehicle of the science and technical world, while SPSS got its start among social scientists. Both have evolved into the data-mining and analytics commercial packages that they are today, heavily influencing the field. As mentioned earlier, the “descriptive–predictive–prescriptive” paradigm appears to have its roots in SAS. As noted above, CRISP is heavily peddled as the methodology of choice for SPSS. However, we note that this methodology is a viable one for data-mining methods that use any package, including R and SAS.
The steps of the CRISP-DM macro-methodology, from Ref. [17], are the following:
The CRISP-DM macro-methodology is thought of as an iterative process. In fact, the scientific method and the OR project method can also be embedded in an iterative process. More details of the CRISP-DM macro-methodology can be found in Chapter 7.
Software engineering is relevant to analytics macro-solution methodology because of the frequent expectation of an outcome implemented in a software tool or system. The steps of the most standard software engineering methodology, the waterfall method, are the following:
A number of other software engineering methodologies exist. See, for example, Ref. [21] for descriptions of rapid application development (comprised of data modeling, process modeling, application generation, testing, and turnover), the incremental model (repeated cycles of analysis, design, code, and test), and the spiral model (customer communication, planning, risk analysis, engineering, construction and release, evaluation). When looking more deeply at these steps, one can see that they can also be mapped to the other macro-methodologies–note that Agile, a popular newer form of software development, is very much like the incremental model in that it focuses on fast progress with iterative steps.
Figure 5.2 shows how the four macro-solution methodologies are comparatively related. It is not difficult to imagine any of these macro-methodologies embedded in an iterative process. One can also see, through their relationships, how it can be argued that each one, in some way, is derivative of the scientific method.
Every analytics project is unique and can benefit from following a macro-methodology. In fact, a macro-methodology can literally save a troubled project, can help to ensure credibility and repeatability, can provide a structure to an eventual experience paper or documentation, and so on. In fact, veteran practitioners may use a combination of steps from different macro-methodologies without being fully conscious of doing so. (All fine and good, but, in fact, you veterans could contribute to our field significantly if you documented your projects in the form of papers submitted for INFORMS publication consideration and if, in those papers, you described the methodology that you used.)
The take-home message about macro-methodologies is that it is not necessarily important exactly which one of them you use–it's just important that you use one (or a hybrid) of them. It is recommended that, for all analytics projects, the steps of problem definition and verification and validation be inserted and strictly followed, whether the specific macro-methodology used calls them out directly or not.
In this section, we turn our attention to micro-methodology options available to the analytics practitioner.
In general, for any micro-methodology, two factors are most significant in how one proceeds to “solutioning”:
Modeling approaches vary widely, even within the discipline of operations research. For example, data, numerical, mathematical, and logical models are distinguished by their form; stochastic and deterministic models are distinguished by whether they consider random variables or not; linear and nonlinear models are differentiated by assumptions related to the relationship between variables and the mathematical equations that use them, and so on. We note that micro-solution methodology depends on the chosen modeling approach, which in turn depends on domain understanding and problem definition–that is, some of those macro-methodology steps covered in the previous section. Skipping over those foundational steps becomes easier to justify when the methods that are most closely affiliated with them (e.g., descriptive statistics and statistical inference) are side-lined in a rush to use “advanced (prescriptive) analytics.”
Thus, we begin this micro-solution methodology section by re-stating the importance of following a macro-solution methodology, and by emphasizing that the selection of appropriate micro-solution methodologies–which could even constitute a collection of techniques–is best accomplished when practitioners integrate their selection considerations into a systematic framework that enforces some degree of precision in problem definition and domain understanding, that is, macro-method steps in the spirit of A.1, A.2, A.3, B.1, B.2, C.1, C.2, C.3, and D.1 (see Figure 5.2).
All of this is not to diminish the importance of the form and purpose of the project analytics, that is, the data, in selection of micro-solution methodologies to be used. In fact,
are all consequential in micro-solution methodology. However, it is the model that is our representation of the real world for purposes of analysis or decision-making, and as such it gives the context for the underlying problem and the understanding of the domain in which “solving the problem” is relevant. This is why consideration of (i) the specific modeling approach should always take precedence over (ii) the manner of leveraging the data. Thus, this section is organized around modeling approaches first, while taking their relationship to analytics into account as a close second.
This section presents the micro-solution methodologies in these three general groups:
Note that these groups are not directly aligned with the “descriptive–predictive–prescriptive” paradigm but are intended to complement the paradigm. In fact, depending on the nature of the underlying problem being “solved,” and as this section shall illustrate, a micro-methodology very often draws on two or even all three of the characterizations (i.e., “descriptive,” “predictive,” and “prescriptive”) at a time–sometimes implicitly, and at other times explicitly.
Since it is impractical to cover every conceivable technique, this section covers an array of historically common techniques relevant to INFORMS and analytics practice with the goals of illustrating how and when to select techniques. (Note that we will use the word technique or method to describe a specific micro-solution methodology.) While pointers to references are provided for the reader to find details of specific techniques, we use certain model and solution technique details to expose why choosing an approach is appropriate, how the technique relates to micro (and in some cases, macro)-methodology, and to compare and contrast choices in an effort to help the reader differentiate between concepts. And while there are many, many flavors of models and modeling perspectives (e.g., an iconic model is usually a physical representation of the real world, such as a map or a model airplane), we'll generally stay within the types of models most familiar to the operations research discipline. Further reading on the theory of modeling can be found in the foundational work of Zeigler [22], in introductory material of Law and Kelton [23], and of course in our discipline standards such as Hillier and Lieberman [14,16] and Winston [13]. Others, such as Kutner et al. [24], Shearer [25], Hastie et al. [26], Provost and Fawcett [27], and Wilder and Ozgur [28], expose and contrast the practice and theory of modeling from a data-first perspective. General model building is also the topic of the next chapter of this book.
We turn next to the presentation of each of the above micro-solution methodology groups. Each micro-methodology group is presented using the following framework:
This group of micro-solution methodologies includes everything we do to explore operations, processes, and systems to increase our understanding of them, to discover new information, and/or to test a theory. Sometimes, the real-world system, which is the main object of our study, exists and is operational so that we can observe it, either directly or through a data history (i.e., indirectly). Sometimes, the operation we are interested in does not exist yet, but there are related data that help us understand the environment in which a new system might operate. The important thread for this group involves discovery.
Problems that are addressed by methods in this exploratory group are in this group because they can be generally characterized by, for example, the following questions: How does this work? What is the predominant factor? Are these two things equal? What is the average value? What is the underlying distribution? What proportion of these tests are successful? In fact, it is in this group that the (macro) scientific method has most relevance, because it helps us to formulate research queries and structure the processes of collecting data, estimating, and inferring. Exploration and discovery is often where analytics projects start, both in research and the real world of analytics practice. It is also not uncommon to repeat or return to exploration and discovery steps as a project progresses and new insights are found, even from other forms of micro-solution methodologies. As an example, consider a linear programming model (that will be covered in Group II) that needs cost coefficients for instantiating the parameters of an objective function. In some cases, simple unit costs may exist. In many real-world scenarios, however, costs change over time and have complex dependencies. Thus, estimating the cost coefficients may be considered an exploration and discovery subproblem within a project. In this example, the problems addressed may be finding the valid range for a fixed cost coefficient's value or finding 95% confidence intervals for the cost coefficients. Questioning the assumption that the cost function is indeed linear with respect to its variable for a specified range is another example of a problem here.
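As a sketch of the cost-coefficient subproblem just described, the following Python snippet computes a large-sample (normal-approximation) 95% confidence interval for a mean unit cost. The function name and data are hypothetical, and for small samples a t-based interval would be more appropriate:

```python
import statistics

def cost_ci(observations, confidence=0.95):
    # Large-sample confidence interval for a mean unit cost,
    # using the normal approximation to the sampling distribution.
    n = len(observations)
    mean = statistics.mean(observations)
    se = statistics.stdev(observations) / n ** 0.5  # standard error
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)
    return (mean - z * se, mean + z * se)
```

An interval like this, rather than a single point estimate, makes explicit how much the linear program's objective coefficients might reasonably vary, which in turn motivates sensitivity analysis on the optimization side.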
When considering exploration and discovery, the relevant models are statistical models. Here, we mean statistical models in their most general sense: the underlying distributions, the interplay between the random variables, and so on. In fact, part of the exploration may be to determine the relevant underlying statistical model–for example, determining if an underlying population is normally distributed in some key performance metric, or if a normal-inducing transformation of observations will justify a normality assumption. The importance of recognizing the underlying models formally when doing exploration and discovery is related to the assumptions formed for using subsequent techniques.
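One informal check of a normality assumption, sketched below in plain Python, is to compare the sample skewness of the raw observations with that of a log-transformed copy; the right-skewed sample here is hypothetical, and in practice a formal test (e.g., Shapiro–Wilk) and a normal probability plot would supplement such a check:

```python
import math
import statistics

def skewness(xs):
    # Adjusted sample skewness; values near zero are consistent with
    # symmetry, one informal requirement of a normality assumption.
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    n = len(xs)
    return sum(((x - m) / s) ** 3 for x in xs) * n / ((n - 1) * (n - 2))

# A right-skewed sample often looks far more symmetric after a
# log transform, one common normal-inducing transformation.
raw = [1, 2, 2, 3, 3, 3, 5, 8, 13, 40]
logged = [math.log(x) for x in raw]
```

Comparing `skewness(raw)` with `skewness(logged)` shows the transformation pulling the long right tail in, which is exactly the kind of evidence one weighs before adopting a normality assumption for subsequent techniques.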
Data for the exploration and discovery group may be obtained in a number of ways. In the most classic deployment of the scientific method, data are created specifically to answer the exploration questions, by running experiments, observing, and recording the data. In today's world of digital operations and systems, historical data are often available to enable the exploration and discovery process. Data “collection” in these digital cases may take more of the form of identifying digital data sources, exploring the data elements and characterizing their meaning as well as their quality, and so on, and even “mining” large data sets to zero in on the most pertinent forms of the data. In these cases of already-existing data, it is equally important to consider the research questions, the underlying problem being solved, and the relevant models. For example, one may have a fairly large volume of data to work with (i.e., “Big Data”), but despite the generous amount of data, the data may cover a time period or geography that is not directly relevant to the problem being studied. For instance, if a database contains millions of sales transactions for frozen snacks purchased in Scandinavian countries during the months of January and February, the data may not be relevant to finding the distribution of daily demand for the same population during summer months, or for a population of a different geography at any time, or for the distribution of daily demand for frozen meals (i.e., nonsnacks) for a population of any geography in any time period. In some situations, we may have so much data that we decide to take a representative random sample.
In general, for this group of methods, the problem one wishes to solve and the assumptions related to the statistical models considered are the most important data considerations. In certain cases, practitioners may like to think that their exploration process is so preliminary that a true problem statement (that is sometimes stated as a research question plus hypotheses) and any call out of modeling assumptions are considered unnecessary. However preliminary, exploration can usually benefit by introducing some methodological steps, even if the problem statement and modeling assumptions are themselves preliminary.
Keeping in mind that “solving” a problem related to an exploration and discovery process involves trying to answer an investigational question, it should be no surprise that techniques related to descriptive statistical models are at the core of the micro-solution methodologies for this group. Applied statistical analysis and inference have a traditional place in general scientific research methods related to exploration, and they also support the discovery needed for the data handling and wrangling required by other “advanced” models and solution techniques. In fact, one of the great ironies of our field is that the statistical models and techniques that constitute “descriptive models and techniques” are the oldest and most well formed, in theory and practice, of all solution methodologies related to analytics and operations research. Hence, passing them over in favor of “advanced” (e.g., prescriptive or predictive) techniques deserves at least some skepticism.
This collection of techniques might be, arguably, the most important subset of the micro-solution methodology techniques. Why? Because even prescriptive and predictive techniques depend on them.
Techniques here range from deriving descriptive statistics (mean, variance, percentiles, confidence intervals, histograms, distributions, etc.) from data to advanced model fitting, forecasting, and linear regression. Supporting techniques include experimental design, hypothesis testing, analysis of variance, and more–many of which are disciplines and complete fields of expertise in and of themselves.
The methods of descriptive statistics are fairly straightforward, and most analytics professionals likely have their favorite textbooks to use for reference. For example, coming from an engineering background, one may have used Ref. [29]. Reference [30] is the standard for mathematics-anchored folks. Reference [31] is the usual choice for the serious experimenters. For the most part, all of these methods help us to use and peruse data to gain insights about a process or system under study. Usually, that system is observable, either directly or indirectly (e.g., in the form of a digital transaction history, which is often the case today). While not as old as the scientific method, the field of statistics is old enough to have developed a great amount of rigor–but it also has lived through a transformational period over the past 30+ years, as we've moved from methods that rely on observations that needed to be carefully planned (i.e., experimental design) and took great effort to collect (i.e., sampling theory and observations) to a world in which data are ubiquitous. In fact, many Big Data exploratory methods are based on using statistical sampling techniques–even though we may have available to us, in glorious digital format, an exhaustive data set, that is, the entire population!
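As a minimal illustration of the descriptive statistics mentioned above, the following sketch (on a hypothetical sample of daily demand observations) computes them directly with Python's standard library:

```python
import statistics as st

# Hypothetical sample of daily demand observations (units sold per day)
demand = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14]

mean = st.mean(demand)                   # sample mean
var = st.variance(demand)                # sample variance (n - 1 in the denominator)
median = st.median(demand)               # 50th percentile
q1, q2, q3 = st.quantiles(demand, n=4)   # quartiles

print(f"mean={mean} variance={var:.3f} median={median} IQR=({q1}, {q3})")
```

The same handful of calls, applied early in a project, often surfaces the anomalies and distributional questions that drive the rest of the exploration.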
Histograms, boxplots, scatter plots, and heatmaps (showing the correlation coefficient statistics between pairs of variables) are examples of visualizations that, paired with descriptive statistics and inference, help practitioners to understand data and to check assumptions. See Figures 5.3–5.6, respectively. Histograms and boxplots are powerful means of identifying outliers and anomalies that may lead to avoiding data in certain ranges, identifying missing values, or even spotting evidence of data-transmission errors.
Descriptive statistics are equally powerful for exploring nonquantitative data. Finding the number of unique values of a text field, and finding how frequently these unique values occur in the data, is standard for understanding data. Again, together with scatter plots and heatmaps for data visualization, correlation analysis is usually done during data exploration to help practitioners understand the relationships between different types of data.
Overall, the micro-methodologies formed by the wealth and rigor of statistical analysis provide the analytics professional with tools that are specifically aimed at drawing conclusions in a systematic and fact-based way and at getting the most out of the data available, while also taking into consideration some of the inherent uncertainty of conclusions. For example, computing a confidence interval for an estimated mean not only gives us information about the magnitude of the mean but also provides a direct methodology for deciding if the true mean is actually equal to some value. We can test whether the mean is really zero by checking whether the confidence interval includes the value of zero. By virtue of taking variance and sample size into its calculation, the confidence interval, along with the underlying distributional assumption, gives us a hint about how well we can rely on this type of test.
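To make the zero-inclusion check concrete, here is a small sketch on hypothetical paired differences; since only the standard library is used, the t critical value for 9 degrees of freedom is hardcoded from a standard table rather than computed:

```python
import math
import statistics as st

# Hypothetical paired differences; question: is the true mean zero?
diffs = [0.8, -0.3, 1.2, 0.5, 0.9, -0.1, 0.7, 1.1, 0.4, 0.6]

n = len(diffs)
mean = st.mean(diffs)
se = st.stdev(diffs) / math.sqrt(n)   # standard error of the mean
t_crit = 2.262                        # t_{0.025} for df = 9, from a t-table
lo, hi = mean - t_crit * se, mean + t_crit * se

# If the interval excludes zero, we reject H0: mu = 0 at the 5% level
print(f"95% CI: ({lo:.3f}, {hi:.3f}); zero inside: {lo <= 0 <= hi}")
```

Here the interval lies entirely above zero, so these (hypothetical) data would lead us to reject the hypothesis that the true mean difference is zero.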
Hypotheses tests in general are one of the most powerful and rigorous ways to make very solid conclusions based on fact. The methods of hypotheses testing depend on what type of statistic is being used (mean, variance, proportion, etc.), what the nature of the test is (compared to a value, compared to two or more values that have been statistically estimated, etc.), how the data were derived (sampling assumptions and overall experimental design), and other assumptions, such as that of the underlying population's distribution. In going from the sparse, hard-to-get data of the past to the abundant, sometimes full population data of the present, it seems to be true that many practitioners are sidestepping the rigor and power of statistical inference and losing, perhaps, the ability to gain full credibility and value from their conclusions. In fact, one way to bring this practice back on track is to tie the micro-methods of statistics back into the macro-methodologies, either the scientific method, which has natural hypothesis-setting and testing steps, or macro-methods with steps that are derivatives of it.
Within the myriad of applied statistical techniques for understanding processes and systems through data, an incredibly powerful methodology that should be in every analytics professional's toolbox is ANOVA, which stands for analysis of variance. In its tabular, well-oiled form, ANOVA is the quintessential approach for understanding data by virtue of how it helps analysts organize and explain sources of variance (and error). The method gets its name from the fact that the table is an accounting of variance by attributable source, and one way to think about it is really as a bookkeeping practice for explaining what causes variance. ANOVA tables are natural mechanics for performing statistical tests, such as comparisons of variance to see which source in a system is more significant. A basic extension is the multivariate analysis of variance (MANOVA), which considers multiple dependent variables at once.
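The bookkeeping behind a one-way ANOVA table can be sketched in a few lines. The following is a minimal illustration on hypothetical cycle-time data for three machine settings; in practice the resulting F statistic would be compared against an F-table critical value:

```python
import statistics as st

# Hypothetical samples: cycle times under three machine settings
groups = {
    "A": [10.2, 9.8, 10.5, 10.1],
    "B": [11.0, 11.4, 10.9, 11.2],
    "C": [10.4, 10.6, 10.3, 10.7],
}

all_obs = [x for g in groups.values() for x in g]
grand_mean = st.mean(all_obs)

# Between-group (treatment) and within-group (error) sums of squares
ss_between = sum(len(g) * (st.mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - st.mean(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1              # k - 1
df_within = len(all_obs) - len(groups)    # N - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SSB={ss_between:.3f} SSW={ss_within:.3f} F={f_stat:.2f}")
```

The two sums of squares are exactly the “accounting of variance by attributable source” described above: total variability split between the treatment factor and the residual error.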
Any statistics textbook of worth should have at least one chapter devoted to ANOVA computations and applications, including tests. Reference [32] is a favorite text for analysts who frequently use regression analysis, which is closely tied to the methodology of ANOVA–they basically go hand-in-hand. Regression is the stepping stone for analytics, and in particular for modeling that is derived from data–it is the essential method when one wishes to find a relationship, generally a linear equation, between one or more independent variables and a response variable. The mechanics of this method involve estimating the values of a y-intercept and slope (for a single independent variable). This is called the method of least squares, and it is basically the solution to an embedded optimization problem. Solution methodology for the least squares problem, for example, Ref. [33], is also an illustration showing that the techniques of micro-methodologies often depend on one another–in this case, a statistical modeling technique dependent on an underlying optimization method. Figure 5.7 exhibits a range of observations before applying a transformation to linearize the data and fit a linear regression (see Figure 5.8), illustrating another common pairing of complementary techniques (i.e., mathematical data transformation prior to applying a micro-methodology).
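The least squares mechanics for a single independent variable can be written out directly. The following minimal sketch, on hypothetical observations, uses the closed-form solution of the embedded optimization problem:

```python
# Hypothetical (x, y) observations with a roughly linear relationship
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form solution of the least squares minimization:
# minimize sum_i (y_i - (b0 + b1 * x_i))^2
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(f"y is approximately {b0:.3f} + {b1:.3f} x")
```

Software packages solve exactly this optimization problem in the background when they report a fitted slope and intercept.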
In summary, micro-methodologies for exploration and discovery rely on the following core techniques:
This area of analytics and OR is most closely and traditionally related to the scientific method and to the discovery and research processes in general, and it is not surprising that there are hundreds, maybe thousands, of textbooks devoted to this statistical topic, since virtually every field of study and research in science, social sciences, education, engineering, and technology relies on these methods as the underlying basis for testing research questions and drawing conclusions from data.
An important function of applied statistics in the analytics world today is in preparing data for other methods, for example, creating the parameters for the math programming techniques described in the previous section. In this case, and in the case of methods covered in the subsequent sections, statistical inference is the important methodology for providing the systematic process and rigor behind data-preparation steps, for just about any other method in analytics and OR that relies on any data. Thus, in virtually every analytics project involving data, statistical analysis and particularly inference methods will always have a role.
Next, we consider micro-methodologies whose models and solution techniques are independent of data. Note that this does not mean that the models and techniques do not use data. On the contrary! Here, the assumption of “independence of data” means that we can find a general solution path whether or not we know the data. In other words, we can find a solution and then plug the data in later so that we can then say something about that particular instance of the problem and its solution.
This group is distinguished by the fact that data, that is, our analytics, create an instance of the problem through parameters such as coefficients, right-hand-side values, interarrival time distributions, and so on. Problems of interest in this group are those in which we seek a modeling context that allows for either experimentation (as an alternative to experimenting on the real-world system) or decision support (i.e., optimization). The problem statements that characterize this group are of one of two forms: experimental (i.e., what-if analysis) or prescriptive (e.g., what should I do to optimize?).
As discussed in the Introduction of this chapter, problem statements are often elusive, particularly in the early phases of a real-world project. In that spirit, it is not uncommon to have a problem statement formulated somewhat generally for this group: How can I make improvements to the system (or operation) of interest? Or, how can I build the best new system given some set of operating assumptions?
Some of the modeling options relevant to this group include the following:
Indeed, these modeling options include many viable modeling paths. The most significant factor in determining the modeling path relates back to questions that are fundamental to the problem statement, which may also characterize the analytics project objective: Do I want to model an existing or new system? Am I trying to build a new system or improve an existing one? How complex are the dynamics of the system? Are there clear decisions to be made that can be captured with decision variables and mathematical equations (or inequalities) that constrain the variables and may also be used to drive an objective function that minimizes or maximizes something?
In this group, data serve the purpose of creating parameters for the models. For simulation, probability, and queueing models, this may mean data that help to fit distributions for describing interarrival or service times or any other random variables in a system. For optimization models, we generally seek data for parameterizing right-hand-side values, technical coefficients within constraint equations, objective function cost coefficients, and so on.
Traditionally, operations researchers developed models with scant or hoped-for data. In some cases, practitioners may have compensated for unavailable data by making inferences from logic and/or using sensitivity analysis to test the robustness of solutions with respect to specific parameter input values. Indeed, it is not entirely surprising that such models and solution techniques became the original core of operations research, given the preanalytics-era challenge of data availability.
In today's world of analytics, a new challenge is that the data needed to parameterize models in this class may be too much (versus the old problem of too little). In this case, the micro-methods of Group I come in handy and should be used, for example, for everything from the estimation of point estimates to finding confidence interval estimates that specify interesting ranges for sensitivity analyses to distribution fitting and hypotheses testing.
When random variate generation is used to create, for example, interarrival and service times, these models are considered stochastic. In general, discrete event simulation models rely heavily on statistical and probability models and techniques for preparing inputs. Stochastic simulation models, once implemented in computer code (either a high-level language or a package designed explicitly for simulation), basically form experimental systems in that they attempt to mimic the real-world system (or some scoped portion of it) for the purpose of performing what-if analyses. For example, when simulating an inventory-control system, how are stock-outs impacted if the daily demand doubles but the inventory replenishment and ordering policies stay the same? In simulating the traffic flowing through an intersection between two major roads, what is the impact on the average time waiting for a red light to turn green if the light's cycle time is changed from 45 to 60 seconds? In simulating cashier lanes in a popular grocery store, will five cashier lanes be sufficient to ensure that all check-out lanes have fewer than three customers at least 95% of the time?
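As a minimal sketch of this kind of what-if experiment (the instance and all rates are hypothetical), the following simulates a single FIFO server with exponential interarrival and service times and compares the average waiting time when the arrival rate nearly doubles while the service rate stays the same:

```python
import random

def simulate_queue(arrival_rate, service_rate, n_customers, seed=42):
    """Minimal single-server FIFO queue; returns the average waiting time."""
    rng = random.Random(seed)
    arrival = 0.0        # arrival time of the current customer
    server_free = 0.0    # time at which the server next becomes idle
    total_wait = 0.0
    for _ in range(n_customers):
        arrival += rng.expovariate(arrival_rate)   # next arrival
        start = max(arrival, server_free)          # wait if the server is busy
        total_wait += start - arrival
        server_free = start + rng.expovariate(service_rate)
    return total_wait / n_customers

base = simulate_queue(arrival_rate=0.5, service_rate=1.0, n_customers=50_000)
busy = simulate_queue(arrival_rate=0.9, service_rate=1.0, n_customers=50_000)
print(f"avg wait: base demand={base:.2f}, near-doubled demand={busy:.2f}")
```

For this M/M/1 setting the estimates can be checked against the analytic mean wait λ/(μ(μ−λ)), roughly 1.0 at the base load versus 9.0 at the heavier load, illustrating how nonlinearly congestion grows, and why simulation output itself deserves statistical analysis.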
Simulation modeling is one of the most malleable techniques in our analytics toolbox. It is also one of the easiest to abuse (e.g., when results from unverified or unvalidated simulation models are proclaimed as “right”). From an analytics solution methodology perspective, it is important to note that simulation output data should be statistically analyzed, that is, appropriate statistical techniques should be deployed. In fact, the techniques (and macro- plus micro-solution methodologies) can and should be applied to the output of simulations. A comprehensive treatment of system simulation is provided in Ref. [23]. In general, this subfield of OR has led the way in methodological innovations, as exemplified by the aforementioned work in model verification and validation by Sargent [15].
This collection of techniques includes linear programming, nonlinear programming, integer programming, mixed-integer programming, and discrete and combinatorial optimization. A set of specialty algorithms and methods related to network flows and network optimization is often included with these models and techniques.
These methods all begin similarly: There is a decision to be made, where the decision can be described through values of a number of variable settings (called decision variables). Feasibility (i.e., that at least one solution represented as values of the decision variable settings can be found) is generally determined by a set of mathematical equations or inequalities (thus, the name mathematical programming). The selection of a best solution to the decision variables, if one exists, is guided by one or more equations, usually prefaced by the word maximize or minimize.
Which solution method to choose among these techniques is generally determined by the forms of variables, constraints, and objective function. Thus, some “modeling” (stating what the variables are, describing the decisions, describing the system and decision problem in terms of the variables, that is, the objective and constraint functions) must usually take place in order for practitioners to determine the appropriate micro-solution methodology. For example, if all constraint and objective functions are linear with respect to the decision variables, then linear programming micro-methodologies are appropriate. Linear programming is usually the starting point for most undergraduate textbooks and courses in introductory operations research; see, for example, Ref. [14]. The standard micro-solution methodology for linear programming is the simplex method, which dates back to the early origins of operations research (see Ref. [42]).
The simplex method, invented by George Dantzig (considered to be one of the pioneers of operations research [43]), is a methodology that systematically advances and inspects solutions at corner points of a feasible region, effectively moving along the exterior frame of the region. In April 1985, operations research history was made again when Karmarkar presented the interior point method to a standing-room-only crowd at the ORSA/TIMS conference in Boston, Massachusetts [44,45]. The new method proposed moving through the interior of the feasible region instead of striding along from extreme point to extreme point [46]. It held implications not only for solving linear programming models, but also for solving nonlinear programming models, which are distinguished by the fact that one or more of the constraints or the objective function(s) is nonlinear with respect to decision variables.
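The corner-point idea behind the simplex method can be illustrated, though not its efficient pivoting, by brute-force enumeration of constraint-boundary intersections for a tiny hypothetical two-variable LP; an optimum of a linear objective over a polyhedral feasible region always occurs at such an extreme point:

```python
from itertools import combinations

# Tiny illustrative LP:  maximize 3x + 5y
#   subject to: x <= 4;  2y <= 12;  3x + 2y <= 18;  x, y >= 0
# Each constraint is written as a*x + b*y <= c
cons = [(1, 0, 4), (0, 2, 12), (3, 2, 18), (-1, 0, 0), (0, -1, 0)]

def feasible(x, y, eps=1e-9):
    return all(a * x + b * y <= c + eps for a, b, c in cons)

best = None
# Corner points lie at intersections of pairs of constraint boundaries
for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue  # parallel boundaries: no intersection
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    if feasible(x, y):
        z = 3 * x + 5 * y
        if best is None or z > best[0]:
            best = (z, x, y)

print(f"optimum z={best[0]} at x={best[1]}, y={best[2]}")
```

The simplex method reaches the same optimal corner without visiting every intersection, which is precisely why it scales far beyond toy problems like this one.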
As the number of decision variables and constraints becomes large, large-scale optimization techniques become important to all forms of math programs–these micro-methodologies involve solution strategies such as relaxation (i.e., removing one or more constraints to attempt to make the problem “easier” to solve), decomposition (i.e., breaking the problem up into smaller, easier-to-solve versions), and so on. Finding more efficient techniques for larger problem sizes (i.e., problems that have more variables and constraints, perhaps in the thousands or millions) has become the topic of many research theses and dissertations by graduate students in operations research and management science.
Among the most challenging problems in this space are the models where variables are required to be integers (i.e., integer programming or mixed-integer programming) or discrete (leading to various combinatorial optimization methods). While many specialty techniques exist for integer and mixed-integer (combinatorial/discrete) models, the branch-and-bound technique remains the de facto general standard for attempting to solve the most difficult, that is, NP (nondeterministic polynomial time) decision problems (see Refs [47,48]). Branch and bound is an example of implicit enumeration and, while not as old as the simplex method, is one of the oldest (and perhaps most general) solution techniques in operations research.
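A minimal branch-and-bound sketch for a hypothetical 0/1 knapsack instance shows the two essential ingredients: branching on include/exclude decisions, and pruning with an optimistic bound (here the LP relaxation, i.e., the fractional knapsack):

```python
# Hypothetical 0/1 knapsack instance solved by branch and bound;
# the bound is the LP relaxation (fractional knapsack).
items = [(60, 10), (100, 20), (120, 30)]   # (value, weight)
capacity = 50
items.sort(key=lambda vw: vw[0] / vw[1], reverse=True)  # by value density

best = 0

def bound(i, value, room):
    """Optimistic estimate: take remaining items, the last one fractionally."""
    for v, w in items[i:]:
        if w <= room:
            value, room = value + v, room - w
        else:
            return value + v * room / w   # fractional fill
    return value

def branch(i, value, room):
    global best
    best = max(best, value)
    if i == len(items) or bound(i, value, room) <= best:
        return                              # prune: bound cannot beat incumbent
    v, w = items[i]
    if w <= room:
        branch(i + 1, value + v, room - w)  # branch: include item i
    branch(i + 1, value, room)              # branch: exclude item i

branch(0, 0, capacity)
print(f"best value: {best}")
```

The same implicit-enumeration skeleton, with stronger bounds and smarter branching rules, underlies modern mixed-integer programming solvers.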
To summarize, mathematical programming techniques span the following:
Examining the structure of a nonlinear programming model reveals that there are times when an NLP may be transformed to an LP formulation, which is preferable because of the general availability of off-the-shelf LP packages. However, it should be noted that one of the most common mistakes by practitioners is to try to use an LP solution package outright for an NLP formulation.
Figure 5.11 shows a classic visualization of a feasible region for math programming, in this case a linearized feasible region (with two decision variables) and either a linear or nonlinear objective function. Here, it was possible to achieve a valid linear feasible region by converting a nonlinear inequality (system reliability as a function of the decision variables representing its component failure intensities) using a natural logarithm transformation.
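The natural-logarithm trick can be sketched numerically. For a hypothetical series system whose reliability is the product of component reliabilities, taking logs converts the nonlinear product constraint into a linear sum constraint over the transformed variables:

```python
import math

# Hypothetical series system: reliability is the product of component
# reliabilities, so the constraint prod(r_i) >= R_min is nonlinear.
r = [0.99, 0.97, 0.95]
r_min = 0.90

nonlinear_ok = math.prod(r) >= r_min

# Taking natural logs linearizes it: sum(ln r_i) >= ln R_min,
# which is linear in the transformed variables u_i = ln r_i.
u = [math.log(x) for x in r]
linear_ok = sum(u) >= math.log(r_min)

print(nonlinear_ok, linear_ok)
```

Because the logarithm is monotone, the two constraints are satisfied by exactly the same points, so the linearized model admits LP techniques without changing the feasible region.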
In contrast to linear programming, the methods deployed by nonlinear programming generally follow a chain of if-then-else deductions, where one chooses a micro-solution methodology based on the convexity or concavity (or pseudo- or quasi-) forms of the feasible region and objective function. The best way to determine which micro-methodology to use for a nonlinear program is actually to write down the model variables, constraints, and objective function, then mathematically characterize the forms, and then consult one of the classic textbooks, such as Refs [50,51], as a guide to choosing the most appropriate solution technique.
Some other specialty forms that we will not cover here exist, including dynamic programming, multiobjective or multicriteria programming, and stochastic and constraint programming.
While the specific micro-methodology chosen will depend on the type of problem faced, the assumptions made by the practitioner, and the model selected, the success of the models and techniques in this group hinges on certain macro-methodology steps, particularly business understanding and problem definition (including assumptions). As mentioned earlier, the scientific method and the exploratory micro-methodologies are appropriate for fitting model parameters and testing various assumptions (e.g., linearity, pseudo-convexity). The OR project methodology steps were designed specifically for projects using the micro-methods in this group. However, it should be noted that a few of the CRISP-DM steps can also be applicable; for example, when data are sought for parameter fitting–specifically the data understanding and data preparation steps. In some cases, more advanced transformations of data are needed in preparation for use in these modeling techniques. In fact, sometimes the analytics we would like to introduce as parameters are derived from forecasting–that is, a special class of predictive modeling, which we turn to next.
Historically, the operations research discipline has been a collection of quantitative modeling methodologies that have their roots in logistics and resource planning. Over the past two decades, with the surge in data available for problem-solving, “research on operations” (i.e., operations and systems understanding), and model building, an emphasis of operations research (and management science) has shifted to embrace insights that can be derived directly from data. In this section, many of the traditional OR modeling approaches and their techniques were presented with the main message that these are largely models that have solution techniques that are independent of, but not isolated from, data.
This section considers the final group of micro-methodologies, that is, those where the models involve solution techniques that are not possible to execute unless there are data present. In other words, they are data-dependent. Examples of solutions, in these cases, are the explanation or creation of additional system entity attributes or a prediction about a future event based on a trend that is observable in the data.
This group of micro-methods is most often used in conjunction with data mining. While these problems share the theme of exploration and discovery with Group I, the outcomes tend to be broader in nature and with fewer assumptions (e.g., normality of data). Problems relevant here include the desire to create categories of things according to common or similar features; finding patterns to explain circumstances or phenomena, that is, seeking understanding through common factors; understanding trends in processes and systems over time (and/or space); and understanding the relationships between cause and effect for the purpose of predicting some future outcome given similar circumstances.
Typical examples of problems of interest include understanding which retail items tend to be purchased together; sorting research articles into categories based on similarities in content identified through common keywords, concepts, methodology, or conclusions; determining if the fall in sales revenue is due to a trend in consumer preferences; if a pattern of behavior exists (e.g.: Are referees more likely to give red cards to soccer players of darker skin tone? which was studied in Ref. [2]); and others.
Some of the main models used in this micro-methodology group include the following:
It should be noted that there is intersection with Groups I and II. Specifically, these methods borrow heavily from statistical analysis and even optimization (e.g., by solving an underlying total distance minimization problem).
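For example, k-means clustering alternates assignment and center-update steps to (locally) minimize total squared distance from points to their cluster centers. A minimal sketch on hypothetical two-dimensional points:

```python
import math

# Tiny illustrative k-means: clustering as minimization of total
# squared distance from points to their assigned cluster centers.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
k = 2
centers = [points[0], points[3]]   # simple deterministic initialization

def closest(p):
    return min(range(k), key=lambda c: math.dist(p, centers[c]))

for _ in range(10):                # alternate assignment / center update
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[closest(p)].append(p)
    centers = [
        (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
        for cl in clusters
    ]

print(centers)
```

Each update step solves the embedded optimization exactly: the mean of a cluster is the point minimizing the sum of squared distances to its members, which is the borrowing from optimization noted above.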
By design, this group is most distinguished by the dependency of its model building and solution techniques on data. Furthermore, data for this group of micro-methods are generally assumed to be abundant–for example, digital history of sales transactions, Internet sites visited, searched keywords, and so on. Data are often collected by observing the digital interactions of large numbers of people with systems such as Internet services and applications, via browser connections or mobile devices with passive data collection (e.g., location services) enabled, either intentionally or unintentionally.
A key distinction of these data is that they are not planned for in the way that exploratory methods related to the scientific method of inquiry involve experimental design, observation, and data collection. In fact, for data to be considered “usable” in this group, they often must be interpreted or derived by mining, analyzing, inferring, or applying models and techniques to create meaningful new features.
The following are some of the most common micro-solution techniques for this group:
All of these find common factors while differing in their underlying computational approach. See Ref. [58] for details.
See Refs [26,61,62] for overviews and comments on these and related methods.
See Ref. [55] for a comprehensive treatment of graph models, network-based problems, and an exhaustive accounting of known algorithms. Hastie et al. [26] extend these basic graphical models for statistical machine learning techniques, including neural networks.
While CRISP-DM is likely the most commonly used macro-methodology for this group, analytics projects leveraging data-dependent methods are likely to benefit from any and all macro-methodologies. In fact, because this set of methods is most closely related to evaluation and discovery of complex cause-and-effect relationships, as well as differentiation (through classification and categorization, which are sometimes prone to discrimination that can lead to inequity and unfair treatment of groups of people), practitioners should take utmost care in verifying, validating, and creating project documentation that promotes study replication.
While it may seem that this group of methods is all about the data–because they are data-dependent–that is not really true. Like all analytics solution methods, it really is still all about the problem. Because to have meaning, solutions must solve a problem. Also note that these analytics methods are sometimes referred to as the advanced analytics methods. The author would like to point out that they are, in fact, the newest, least established, and least proven in practice, of all methods in our discipline. This implies that they are the least advanced analytics methods and suggests that we should all be working harder to deepen their theory and rigor–which is actually what we are good at as an INFORMS community.
In summary of micro-methodologies, we emphasize that analytics problems encountered in practice seldom require techniques that fall into only one micro-methodology category. Techniques in one category may build on techniques from another category–for example, as noted earlier, linear regression modeling within data-dependent methodologies relies on solving an underlying optimization problem. Regression modelers who use software packages to fit their data may not be aware that the least squares optimization problem is being solved in the background. However, to truly understand our methods and results, it is important to be aware of the background mechanics and connections. This specific type of dependency is, in fact, common–particularly in the realm of contemporary statistical machine learning.
Projects in practice often leverage methodologies in progression as well–for example, using descriptive statistics to explore and understand a system in the early stages of a project may lead to building an optimization model to support a specific business or operations decision. If the decision needs to be made for a scenario that will take place in the future, then forecasts may be used to specify the optimization model's input parameters. At the same time, it is important to keep in mind that there may be trade-offs to consider when combining different techniques. For instance, in this same example project requiring forecasted parameters of an optimization model, the practitioner has a choice between using a sophisticated predictive technique that yields a more accurate forecast but leads to a complex, difficult-to-solve nonlinear optimization model, or using a simpler predictive approach that sacrifices some forecast accuracy but leads to a simpler, linear optimization model.
The micro-solution methods available to analytics practitioners are many. Making this selection is analogous to being an artist and deciding among watercolor, oil, or acrylic paint; deciding what kind of surface to paint on, for example, canvas, wood, paper, and so on; deciding how big to make the piece, and so on. And it is probably like being the painter in this way as well: You are most likely to pick the method you are most familiar with, just as the watercolor specialist is less likely to choose charcoal for a new painting of the sunset.
A critical success factor in technical projects, particularly where there is any element of exploration and discovery, is project planning. This is no different for analytics projects. In fact, when one adds the expectation of a usable outcome (i.e., a tested and implemented process coded in software, running on real data, complete with a user interface and full documentation, all while providing smashing insights and impactful results), the project risks and odds of failure rise quickly. As mentioned in the macro-methodology section, the macro-methods align nicely with project planning because they provide a roadmap that equates to the high-level set of sequential activities in an analytics project. When macro- and micro-method planning are considered together, the required skills and the details of individual activities are revealed, making task estimation and dependency mapping possible. In fact, one of the traditional applications of network models taught to students of operations research is PERT (program evaluation and review technique)/CPM (critical path method)–a micro-method that practitioners can apply to macro-methodology to help smoothly plan and schedule a complex set of related activities (see Ref. [14]).
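The forward pass of CPM is straightforward to sketch. The following minimal example (with hypothetical activities and durations loosely modeled on analytics project phases) computes each activity's earliest finish time and the overall project duration:

```python
# A minimal CPM forward-pass sketch on a small hypothetical activity network.
# Each activity maps to (duration, list of predecessors).
activities = {
    "A": (3, []),          # e.g., understand the business problem
    "B": (2, ["A"]),       # acquire data
    "C": (4, ["A"]),       # explore data
    "D": (5, ["B", "C"]),  # build model
    "E": (2, ["D"]),       # validate and deploy
}

# Forward pass: earliest finish = (latest predecessor finish) + duration.
# Insertion order here is already a valid topological order; a real
# implementation would topologically sort first.
earliest_finish = {}
for name, (duration, preds) in activities.items():
    earliest_start = max((earliest_finish[p] for p in preds), default=0)
    earliest_finish[name] = earliest_start + duration

project_duration = max(earliest_finish.values())
print(project_duration)  # critical path A -> C -> D -> E: 3 + 4 + 5 + 2 = 14
```

A backward pass over the same network would yield latest start times and slack, identifying which activities are critical–the scheduling insight PERT/CPM provides for planning the macro-level steps of a project.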
When there are expectations of a usable software implementation as the outcome, practitioners can augment their macro-methodology steps with appropriate software engineering steps. The software engineering requirements step is recommended for planning the desired outcome's function, as well as usability needs and assumptions. In fact, complex technical requirements, such as integration into an existing operations environment or data traceability for regulatory compliance, are best considered early, in requirements steps that complement the domain and data understanding steps.
Overall, while prototyping and rapid development often coincide with projects of a more exploratory nature, which analytics projects often are, some project planning and ongoing project management is the best way to minimize the risks of failure, budget overruns, and outcome disappointments.
Most if not all of our analytics projects need some computational support in the form of software and tools. Aside from DIY software, which is sometimes necessary when new methods or new extensions are developed for a project, most micro-solution methods are available in the form of commercial and/or open-source software.
Without intending to endorse any specific software package or brand, a few packages are named here as illustrations, leaving it to the reader to decide which packages are most appropriate for their specific project needs.
For (Group I) exploration, discovery, and understanding methods, popular packages include R, Python, SAS, SPSS, MATLAB, MINITAB, and Microsoft Excel. Swain [64] provides a very recent (2017) and comprehensive survey of statistical analysis software, intended for the INFORMS audience. Most of these packages also include the GLM, factoring, and clustering methods needed to cover (Group III) data-dependent methods as well.
For (Group II), a fairly recent survey of simulation software, again by Swain [65], and a very recent linear programming software survey by Fourer [66] are resources for selecting tools to support these methods, respectively. An older but still useful nonlinear programming software survey by Nash [67] is another resource for practitioners. MATLAB, Mathematica, and Maple continue to provide extensive toolboxes for nonlinear optimization needs. For branch and bound, the IBM ILOG CPLEX toolbox is freely available to academic researchers and educators. COIN-OR, Gurobi, GAMS, LINDO, AMPL, SAS, MATLAB, and XPRESS all provide various toolboxes across the optimization space. More and more, open-source libraries for specific languages, such as Python, offer tools that are ready to use–for example, StochPy is a Python library addressing stochastic modeling methods.
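As an illustration of the kind of stochastic modeling such libraries support, the following sketch uses plain NumPy (not StochPy itself) to simulate a Poisson arrival process by sampling exponential interarrival times, then checks the empirical arrival rate against the specified rate:

```python
import numpy as np

# Simulate a Poisson arrival process with rate lam (arrivals per unit time)
# by drawing exponential interarrival times.
rng = np.random.default_rng(0)
lam = 2.0
interarrivals = rng.exponential(scale=1.0 / lam, size=100_000)
arrival_times = np.cumsum(interarrivals)

# The empirical rate (arrivals per unit of simulated time) should be
# close to lam for a sample this large.
empirical_rate = len(arrival_times) / arrival_times[-1]
print(abs(empirical_rate - lam) / lam < 0.02)
```

Dedicated stochastic modeling libraries wrap this kind of mechanics with richer process models, but the underlying sampling logic is the same.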
As a final note, practitioners using commercial or open-source software packages for analytics are encouraged to use them carefully within a macro-solution methodology. In particular, verification, that is, testing to make sure the package provides correct results, is always recommended.
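One lightweight verification practice is to run the package on a tiny problem whose answer is known by hand before trusting it on a real model. The following sketch does this for SciPy's linprog, using a linear program small enough to solve by inspection:

```python
from scipy.optimize import linprog

# Hand-solvable check problem: maximize x + 2y subject to
#   x + y <= 4,  x <= 3,  x >= 0, y >= 0.
# By inspecting the corner points, the optimum is x = 0, y = 4,
# with objective value 8.
res = linprog(c=[-1, -2],                 # linprog minimizes, so negate
              A_ub=[[1, 1], [1, 0]],
              b_ub=[4, 3],
              bounds=[(0, None), (0, None)],
              method="highs")

hand_optimum = 8.0
assert res.success
assert abs(-res.fun - hand_optimum) < 1e-6
print("solver verified against hand-computed optimum")
```

A small battery of such checks, kept alongside the project's models, turns verification from a one-time activity into a repeatable safeguard whenever the software or its version changes.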
Visualization has always been important to problem-solving. Imagine in high school having to study analytical geometry without 3D sketches of cylinders. Similarly, operations research has a strong history of illustrating concepts through visualization. Some examples include feasible regions in optimization problems, state space diagrams in stochastic processes, linear regression models, various forms of data plots, and network shortest paths. In today's world of voluminous data, sometimes the best way to understand data is to visualize it, and sometimes the only way to explain results to an executive is to show a picture of the data and something illustrating the “solution.”
Other chapters in this book cover the topic of analytics and visualization, for example, see Chapters 3 and 6. The following points regarding visualization from a solution methodology perspective are provided in order to establish a tie with the methods of this chapter:
Many disciplines are using analytics in research and practice. As shown in the macro-methodology section summary, all macro-methodologies are derivatives of the scientific method. In fact, many of our micro-solution methodologies are shared and used across disciplines. As a community, we benefit from and have influenced shared methods with the fields of science, engineering, software development and computer science (including AI and machine learning), education, and the newly evolving discipline of data science. This cross-pollination helps macro- and micro-solution methodologies to stay relevant.
This chapter has presented analytics solution methodologies at both macro- and microlevels. Although it makes no claim to cover all possible solution methodologies comprehensively, hopefully the reader has found it to be a valuable resource and a thought-provoking reference to support the practice of analytics and OR projects. The chapter's goals–clarifying the distinction between macro- and micro-solution methodologies, providing enough detail for a practitioner to design a high-level analytics project plan around a macro-level solution methodology, and offering guidance for assessing and selecting micro-solution methodologies appropriate for a new analytics project–should have come through in the earlier sections. In addition to a few pearls scattered throughout the chapter, we conclude by stating that solution methodologies can help the analytics practitioner, who can in turn help our discipline at large, which can then help more practitioners. That is a scalable and iterative growth process, accomplished through reporting our experiences at conferences and through peer-reviewed publication–which often forces us to organize our thoughts in terms of methodology anyway, so we might as well start with it too! The main barriers to solution methodology seem to be myths. Dispelling some of the myths of analytics solution methodology is covered in these final few paragraphs.
The scientific method may be old, but it is not dead yet. By illustrating its relationship to several macro-solution methodologies in this chapter, we've shown that the scientific method is indeed alive and well. Arguments to use it literally may be futile, however, since the world of technology and analytics practice often places time and resource constraints on projects that demand quick results. Admittedly, it is quite possible that rigor and systematic methodology could lead to results that are contrary to the "desired" outcome of an analytics study. Thus, without intending to do so, our field of practice may be inadvertently missing the discovery of truth and its consequences.
Imagine for a moment that analytics practitioners used systematic solution methodologies to a greater extent, particularly at the macrolevel, and then published their applied case studies following an outline that detailed the steps they had followed. Our published applied literature could then be a living source of experience and practice to emulate, not only for learning best practices and new techniques, but also for learning how to apply and perfect the old standards. More analytics projects might be done faster because they wouldn't have to "start from scratch" and reinvent a process of doing things. Suppose that analytics practitioners, in addition to putting rigor into defining their problem statements, also enumerated their research questions and hypotheses in the early phases of their projects. Would we publish experiences that report rejecting a hypothesis? Does anyone know of at least one published science research paper that reports rejecting a hypothesis, let alone one in the analytics and OR/MS literature?
Research articles on failed projects rarely (probably never) get published, and these could quite probably be the valuable missing links to helping practitioners and researchers in the analytics/OR field be more productive, do higher quality work, and thrive by learning from studies that show what doesn't work. When authentically applied, the scientific method should result in a failed hypothesis every once in a while, reflecting the true nature of exploration and the risks we take as researchers of operations and systems. The modern deluge of data allows us to inquire and test our hunches systematically without the limitations and scarcity of observations we faced in the past. Macro-solution methodologies, either the scientific method or any derivative of it (which is just about all of them), could relieve analytics project cramps not only by giving us efficient and repeatable approaches but also by recognizing that projects sometimes "fail" or reject a null hypothesis–doing so within the structure of a methodology allows it to be reported in an objective, thoughtful manner that others can learn from and that can help practitioners and researchers avoid reinvention.
We've all heard the saying, if all you have is a hammer, then every problem looks like a nail. This general concept, phrased in a number of different ways since first put forward in the mid-1960s, is credited to Maslow [69], who authored the book The Psychology of Science. In our complex world, there are usually many alternate ways to solve a problem. These choices, in analytics projects, may be found among the micro-methodology techniques described in this chapter or elsewhere. Sometimes there are well-established techniques that work just fine, and sometimes a new technique needs to be created. The point is that there are many ways to solve a problem, even though many of us tend to first resort to our favorite ways because those align with our personal experiences and expertise. That is not a bad approach to project work, because experience usually means that we are drawing on other knowledge and lessons learned. However, lurking behind it is the danger of using the wrong micro-solution methodology. In fact, an ill-defined problem can itself lead to overreliance on certain tools–often the most familiar ones. What does this mean? That in our macro-solution methodology, steps such as understanding the business and data, defining the problem, and stating hypotheses are useful in guiding us toward which micro-methodologies to choose from, thus helping us avoid the potential pitfalls of picking the wrong micro-method or overusing a favorite solution method.
In math class, school teachers might make a grading key that lists the right answer to each exam or homework problem. In practice, however, there is no solutions manual or key for checking whether an analytics project outcome is right or wrong. We have steps within various macro-solution methodologies, for example, verification, that help us make the best case for the outcome being considered "right," but for the most part, the correctness of an analytics project outcome is elusive, and projects are usually judged by the perceived results of implementing a solution. In analytics and OR practice, there are cases where the same project can be judged both ways. For example, an analytics/OR project recognized as an INFORMS Edelman award finalist for its contribution to a company's savings of over $1 billion might nonetheless be judged unsuccessful because the company that created the OR solution was unable to commercialize the assets and find practitioners in its ranks to learn and deploy them, and thus to reproduce the solution as a profitable product (see, for example, Ref. [70]).
Documentation of reasons for analytics project failures probably exists, but it is rarely reported as such. Plausible reasons for failure (or, perhaps more accurately, "lack of perceived success") include the following: the solution was implemented, but there was no impact, or it was not used; a solution was developed but never implemented; a viable solution was not found; and so on. Because of the relationship between analytics projects and information technology and software, some insights can be drawn from those more general domains. Reference [71] provides an insightful essay on why IT projects fail that is loaded with examples and experiences, many with analogues and wisdom transferable back to analytics. Software project failures have been studied in the software engineering community for over two decades, with various insights; see, for example, Ref. [72]. The related area of systems engineering offers good general practices and a guide to systematic approaches: one of the most recognized for the field of industrial engineering is by Blanchard and Fabrycky, now in its fifth edition [73].
It is important to remember that in practice ultimate perceived success or failure of an analytics project may not mean “finding the right answer,” that is, finding the right solution. By perceived success, we mean that an analytics solution was implemented to solve a real-world problem with some meaningful impact acknowledged by stakeholders. Conversely, perceived failure means that for one of a number of reasons, the project was deemed not successful by some or all of the stakeholders. Not unlike some micro-solution methodologies of classic operations research, we have necessary and sufficient conditions for achieving success in an analytics project, and they seem to be related to perception and quality. Analytics practitioners need to judge these criteria for their own projects, while perhaps keeping in mind that there have been well-meaning and not-so-well-meaning uses of data and information to create perceptions and influence. See, for example, How to Lie with Statistics by Darrell Huff [74] and the more contemporary writing, which is similar in concept, How to Lie with Maps by Mark Monmonier [75].
The book How to Lie with Analytics has not been written yet, but unfortunately it is likely already practiced. By practicing some form of systematic solution methodologies, macro and micro, in our analytics projects, we may help our field to form an anchoring credibility that is resilient when that book does come out.
Sincere thanks to two anonymous reviewers for critically reading the chapter and suggesting substantial improvements and clarifications; Dr. Lisa M. Dresner, Associate Professor of Writing Studies and Rhetoric at Hofstra University, for proofreading, editing, and rhetoric coaching; and Dr. Joana Maria, Research Staff Member and Data Scientist at IBM Research, for inspiring technical discussions and pointers to a number of relevant articles.