5
Solution Methodologies

Mary E. Helander

Data Science Department, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Modern solution methodology offers a set of macro- and micro-practices that help a practitioner systematically maximize the odds of a successful analytics project outcome.

5.1 Introduction

Methodology is all about approach. Every discipline, whether it be applied or theoretical in nature, has methodologies. While there is no one standard analytics solution methodology, common denominators of solution methodologies are the shared purposes of being systematic, creating believable results, and being repeatable. That is to say, a solution methodology helps practitioners and researchers alike to progress efficiently toward credible results that can be reproduced.

Whether we mean a methodology at a macro- or microlevel, analytics practitioners at all stages of experience generally rely on some form of methodology to help ensure successful project outcomes. The goal of this chapter is to provide an organized view of solution methodologies for the analytics practitioner. We begin by observing that, in today's practice, there does not appear to be a shared understanding of what is meant by the word solution.

5.1.1 What Exactly Do We Mean by “Solution,” “Problem,” and “Methodology?”

In its purest form, a solution is an answer to a problem. A problem is a situation in need of a repair, improvement, or replacement. A problem statement is a concise description of that situation. Problem definition is the activity of coming up with the problem statement. Problem-solving, in its most practical sense, involves the collective actions that start with identifying and describing the problematic situation, followed by systematically identifying potential solution paths, selecting a best course of action (i.e., the solution), and then developing and implementing the solution. Problem-solving is, by far, one of the most valuable skills an analytics practitioner can hone, and is even an important life skill!

Most of us first encountered problem-solving as students exposed to mathematics at primary, secondary, and collegiate education levels, where a problem–for example, given two points in a plane, $(x_1, y_1)$ and $(x_2, y_2)$, find the midpoint–is more often than not stated explicitly. The solution $\left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$ can be found with some geometry and algebra wrangling. See Eves [1]. If asked to solve this problem for homework or on an exam, we probably did not get full credit unless we showed our work. This shown work can be thought of as the solution methodology for the problem. This sample math problem also illustrates that there are different ways to solve a problem: if the same two points are presented in polar coordinates, one may proceed toward the same midpoint by an entirely different route, applying methods from trigonometry.
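To make the two solution paths concrete, the following is a minimal sketch in Python; the sample points, function names, and the choice to route the polar case through a Cartesian conversion are illustrative assumptions rather than part of the original problem statement.

```python
import math

def midpoint_cartesian(p1, p2):
    """Midpoint of two points given as (x, y) pairs."""
    return ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0)

def midpoint_polar(q1, q2):
    """Midpoint of two points given as (r, theta) pairs, theta in radians.

    One possible route: convert to Cartesian form, average, convert back.
    """
    x1, y1 = q1[0] * math.cos(q1[1]), q1[0] * math.sin(q1[1])
    x2, y2 = q2[0] * math.cos(q2[1]), q2[0] * math.sin(q2[1])
    xm, ym = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return (math.hypot(xm, ym), math.atan2(ym, xm))

print(midpoint_cartesian((1.0, 2.0), (3.0, 6.0)))      # (2.0, 4.0)
print(midpoint_polar((1.0, 0.0), (1.0, math.pi / 2)))  # approx (0.707, 0.785): the midpoint of (1, 0) and (0, 1), in polar form
```

Either function answers the same underlying problem; documenting which representation and which steps were used is precisely the "shown work" discussed above.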

Similarly, in analytics practice, the path to a solution is generally not unique. For example, Ref. [2] describes a study of the variation in approach (and results) by 29 independent analytics teams working on the same data and problem statement. The path to a solution may involve a straightforward set of steps, or it may need some clever new twist; the method chosen may depend on the form of the available data, the assumptions, and context. A big difference between the problems that we encounter in school and the problems that we encounter in real life is usually that in real life, we are rarely presented with a clean problem statement that includes, for example, the given information. Still, writing down the steps we use to get from the problem statement to the solution is generally a good idea. In most cases, we can write down steps that are general enough so that we're able to find solutions to new and challenging problems.

What do we mean by a “solution”? To the purist, a solution is this: The correct answer to a problem. It is what you write down on your exam in response to a problem statement. If you get the answer right, and if you have adequately satisfied the requirement of showing your work, you earn full credit for the solution. In some cases, you may get the wrong answer, but if some of your shown work is okay, you may still earn partial credit. Similarly, in practice, analytics that produce unexpected or flawed results may earn their creators recognition for solid work that has gone into the project, and practitioners may get the opportunity to revise these analytics, just as authors of peer-reviewed papers may get the opportunity to make major revisions to their work during the review process. Without a transparent methodology, however, it is more difficult for evaluators of a project to appreciate the practitioners' findings and effort when they are presented with results that are unexpected or questionable.

Methodological steps are analogous to what we mean more generally by a solution methodology or approach. When we're starting out, the steps give us an approximate roadmap to follow in our analytics project. When we're done, if we've followed a roadmap and have perhaps even documented the steps, then it is easier to trace these steps, to repeat them, and to explain to stakeholders and potential users or sponsors how our solution was derived. It might be that the steps themselves are so innovative that we patent some aspect of the approach, or perhaps we find that publishing about some aspect of the project, the technology, or the outcome is useful for sharing the experience with others and promoting best practices. In any of these cases, having followed some methodology helps tremendously in describing and building credibility into whatever it was that we did to reach the solution.

5.1.2 It's All About the Problem

Experienced analytics professionals already know this too well: In practice, new projects rarely, if ever, start out with a well-defined problem statement. A problem statement in the real world will never be as clearly articulated as it was in our math classes in grade school, high school, and college. Indeed, there may be contrasting and even conflicting versions of the underlying problem statement for a complex system in a real-world analytics project, particularly when teams of people with varying experiences, backgrounds, opinions, and observations come together to collaborate. Using our sample math problem to illustrate, this would be equivalent to some people's thinking that the problem is to find a single point solution (the midpoint), while others might think that the solution should be defined by the intersection of two or more lines, or perhaps that it should be defined by a circle with a very small radius that covers the point of intersection, and so on. The point is that the solution depends on the interpretation of the problem, and thus, when the problem is not defined for us precisely–and even sometimes when the problem is–people may interpret it in different ways, which may lead to entirely different solution approaches.

An important message here is that time and effort up-front on a problem statement is time well spent, as it will help clarify a direction and create consistent understanding of the practitioners' end goals.

5.1.3 Solutions versus Products

In today's commercial world of software and services, the word solution may be used to describe a whole collection of technologies that address an entire class of problems. The problems being solved by these commercial technologies may not be specifically defined in the ways we have been used to seeing problems defined in school. For example, a commercial supply chain software provider may have a suite of solutions that claim to address all the needs of a retail business.

In other words, in today's world of commercial software and services, the word solution has become synonymous with the word product. In fact, in some circles, it is not cool to say that the solution solves a problem because this suggests that there is a problem. Problems, at least in our modern Western capitalist culture, are no big deal. Therefore, we don't really have them. However, we do have plenty of solutions, especially when it comes to commercial products. So, we begin this chapter by pointing out the elephant in many project conference rooms: Problems are not sexy, but solutions are! While this line of thinking is indeed the more positive and inspiring outlook, and while it makes selling solutions easier, unfortunately, it often leads to implementing the wrong solutions, or to failing altogether at solution implementation. Why? There are many reasons, but one of the most obvious and common reasons is that ill-defined, poorly understood, or denied problems are difficult–if not impossible–to actually solve.

5.1.4 How This Chapter Is Organized

The previous section, hopefully, has left the reader with a strong impression that recognizing the underlying problem is a first step toward solving it. It is in this spirit that this chapter introduces the notions of macro- and micro-solution methodologies for analytics projects and organizes its content around them. Macro-methodologies, as we shall see in a section devoted to their description, provide the more general project path and structure. Four alternative macro-methodologies will be described in that section with this important caveat: Any one of them is good for practitioners to use; the most important thing is for practitioners to follow some macro-methodology, even if it is a hybrid.

Micro-methodology, on the other hand, is the collection of approaches used to apply specific techniques to solve very specific aspects of a problem. For every specific technique, there are numerous textbooks (and sometimes countless papers) describing its theory and detailed application. There is no way we will be able to cover all possible problem-solving techniques, which is not the purpose of this chapter. Instead, this chapter covers an array of historically common techniques that are relevant to INFORMS and to analytics practitioners in order to illustrate micro-solution methodology, that is, to expose, compare–and in some cases, contrast–the approaches used.

Figure 5.1 provides an illustration of the chapter topic breakdown. Note that all solution methodology descriptions in this chapter, both at macro- and microlevels, are significantly biased in favor of operations research and management sciences. This is so because this chapter appears in an analytics book published in affiliation with INFORMS, the international professional organization aimed at promoting operations research and management science. The stated purpose of INFORMS is “to improve operational processes, decision-making, and management by individuals and organizations through operations research, the management sciences, and related scientific methods.” (see the INFORMS Constitution [3].)


Figure 5.1 A breakdown of analytics solution methodologies (and related topics) covered in this chapter.

5.1.5 The “Descriptive–Predictive–Prescriptive” Analytics Paradigm

With the rise in the use of quantitative methods, particularly OR and MS, to solve problems in the business world, the business analytics community has adopted a paradigm that classifies analytics in terms of descriptive, predictive, and prescriptive categories. These correspond respectively to analytics that help practitioners to understand the past (i.e., describe things), to anticipate the future (i.e., predict things), and to make recommendations about what to do (i.e., prescribe things). The author of this chapter believes that the paradigm originated at SAS [4], one of the most well-known analytics software and solutions companies today.

Granted, many disciplines today are using analytics, and the descriptive–predictive–prescriptive analytics paradigm has no doubt helped evangelize analytics to those disciplines. However, it should be noted that we explicitly have chosen to organize this chapter directly around macro- and micro-methodologies, and within the micro-category, around exploratory, data-independent, and data-dependent technique categories. While intended to complement the “descriptive–predictive–prescriptive” analytics paradigm, this organization emphasizes that solution techniques do not necessarily fall neatly into one of the paradigm bins. Instead, techniques in common categories tend to share threads based on underlying problem structure, model characteristics, and relationships to data, as opposed to what the analytics project outcome is meant to drive (i.e., to describe, to predict, or to prescribe). From the perspective of analytics solution methodologies, this can also help avoid an unintentional marginalization of techniques that fall into the descriptive analytics category.

5.1.6 The Goals of This Chapter

After reading this chapter, a practitioner will

  1. be able to distinguish macro- versus micro-solution methodologies,
  2. be ready to design a high-level analytics project plan according to some macro-level solution methodology,
  3. be better at assessing and selecting microlevel solution methodologies appropriate for a new analytics project, based on a general understanding of the project objectives, the type and approximate amount of data available to the project, and various types of resources (e.g., people and skills, computing, time, and funding),
  4. be armed with a few pearls of wisdom and lessons learned in order to help maximize the success of her or his next analytics project,
  5. understand the significance of methodology to the practice of analytics within operations research and other disciplines.

5.2 Macro-Solution Methodologies for the Analytics Practitioner

As described in the Introduction, a macro-solution methodology is comprised of general steps for an analytics project, while a micro-methodology is specific to a particular type of technical solution. In this section, we describe macro-methodology options available to the analytics practitioner.

Since a macro-methodology provides a high-level project path and structure, that is, steps and a potential sequence for practitioners to follow, practitioners can use it as an aid to project planning and activity estimation. Within the steps of a macro-methodology, specific micro-methodologies may be identified and planned, aiding practitioners in the identification of specific technical skills and even named resources that they will need in order to solve the problem.

Four general macro-methodology categories are covered in this section:

  1. The scientific research methodology
  2. The operations research project methodology
  3. The cross-industry standard process for data mining (CRISP-DM) methodology
  4. The software engineering methodology

We reiterate here that there is some overlap in these methodologies and that the most important message for the practitioner is to follow a macro-solution methodology. In fact, even a hybrid will do.

5.2.1 The Scientific Research Methodology

The scientific research methodology, also known as the scientific method [5], has very early roots in science and inquiry. While formally credited to Francis Bacon, its inspiration likely dates back to the time of the ancient Greeks and the famous scholar and philosopher Aristotle [6].

This methodology has served humankind well over the years, in one form or another, and has been particularly embraced by the scientific disciplines where theories often are born from interesting initial observations. In the early days, and even until more recently (i.e., within the last 20 years–merely a blip in historical time!), a plethora of digital data was not available for researchers to study; data were a scarce resource and were expensive to obtain. Most data were planned, that is, collected from human observation, and then treated as a limited, valuable resource. Because of its value both to researchers' eventual conclusions and to the generalizations that they are able to make based upon their findings, the scientific methodology related to data collection has evolved into a specialty in and of itself within applied statistics: experimental design. In fact, many modern-day graduate education programs in the United States require that students take a course related to research methodology either as a prerequisite for graduate admission or as part of their graduate coursework so that graduate students learn well-established systematic steps for research, sometimes specifically for setting up experiments and handling data, to support their MS or PhD thesis. This type of requirement is common in the social sciences, education, engineering, mathematics, computer science, and so on–that is, it is not limited strictly to the sciences.

The general steps of the scientific method, with annotations to show their alignment with a typical analytics project, are the following:

  1. A.1. Form the Research Question(s). This step is the one that usually kicks off a project involving the scientific method. However, as already noted, these types of projects may be inspired by some interesting initial observation. In applying this step to an analytics project in practice, the research questions may also relate to an underlying problem statement, which typically forms the preface for the project.
  2. A.2. State One or More Hypotheses. In its most specific form, this step may involve stating the actual statistic that will be estimated and tested: for example, $H_0: \mu_1 = \mu_2$, that is, that two treatment means are the same. (Note that a treatment mean is the average of observations from an experiment with a set of common inputs, that is, fixed independent variable values are the treatment.) Interpreted more broadly, the hypotheses to test imply the specific techniques that will be applied. For example, the hypothesis that two means are identical implies that some specific techniques of experimental design, data collection, statistical estimation, and hypothesis testing will be applied. However, one might also consider more general project hypotheses: for example, we suspect that cost, quality of service, and peer pressure are the most significant reasons that cell phone customers change their service providers frequently. These types of hypotheses imply specific techniques in churn modeling. (A minimal worked example of testing a hypothesis of the first kind appears after this list.)
  3. A.3. Examine and Refine the Research Question and Hypotheses. In this step, the investigating team tries to tune up the output of the first two steps of the scientific method. Historically, this is done to make sure that the planning going forward is done in the most efficient and credible way, so that ultimately, the costly manual data collection leads to usable data and scientifically sound conclusions–otherwise, the entire research project becomes suspect and a waste of time (not to mention, the discrediting of any conclusions or general theory that the team is trying to prove). This step is not much different in today's data-rich world: Practitioners should still want to make sure they are asking the right questions, that is, setting up the hypotheses to test so that the results they hope to get will not be challenged, while trying to ensure that this is all done as cost-effectively and in as timely a manner as possible. In today's world, because of the abundance of digital data, this sometimes means exploration on small or representative data sets. This can lead to the identification of additional data needed (including derivatives of the available data), as well as adjustments to the questions and hypotheses based on improved understanding of the underlying problem and the addition of preliminary insights. Notice the carry forward of “problem understanding” that happens naturally in this step. In fact, it is good to consider the acceptable conclusion of this step as one where the underlying problem being addressed can be well enough articulated that stakeholders, sponsors, and project personnel all agree. Some preliminary model building, to support the “examination” aspect of this step, may occur here.
  4. A.4. Investigate, Collect Data, and Test the Hypotheses. In traditional science and application of the scientific method, this meant the actual steps of performing experiments, collecting and recording observations, and actually performing the tests (which were usually statistically based). Applied to analytics projects, this macro-methodology step means preparing the final data, modeling, and observing the results of the model.
  5. A.5. Perform Analysis and Conclude the General Result. In this step, we perform the final analysis. In traditional science, does the analysis support the hypotheses? Can we draw general conclusions such as the statement of a theory? In analytics projects, this is the actual application of the techniques to the data and the drawing of general conclusions.
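To make steps A.2 and A.4 concrete, here is a minimal sketch, assuming two small sets of made-up treatment observations and the SciPy library, of testing $H_0: \mu_1 = \mu_2$ with a two-sample t-test; a real study would, of course, plan the experiment and sample sizes first (A.3).

```python
from scipy import stats

# Hypothetical observations from two treatments (fixed input settings).
treatment_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
treatment_b = [12.6, 12.9, 12.4, 13.0, 12.7, 12.8]

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(treatment_a, treatment_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the treatment means appear to differ.")
else:
    print("Fail to reject H0 at the 5% level.")
```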

As is evident here, the scientific method is a naturally iterative process designed to be adaptive and to support systematic progress that gets more and more specific as new knowledge is learned. When followed and documented, it allows others to replicate a study in an attempt to validate (or refute) its results. Note that reproducibility is a critical issue in scientific discovery and is emerging as an important concern with respect to data-dependent methods in analytics (see Refs [7,8]).

Peer review in research publication often assumes that some derivative of the scientific method has been followed. In fact, some research journals mandate that submitted papers follow a specific outline that coincides closely with the scientific method steps. For example, see Ref. [9], which recommends the following outline: Introduction, Methods, Results, and Discussion (IMRAD). While the scientific method and IMRAD for reporting may not eliminate the problem of false discovery (see, for example, Refs [10,11]), they can increase the chances of a study being replicated, which in turn seems to reduce the probability of false findings as argued by Ioannidis [12].

Because of this relationship to scientific publishing, and to research in general, the scientific method is recommended for analytics professionals who plan eventually to present the findings of their work at a professional conference or who might like the option of eventually publishing in a peer-reviewed journal. This methodology is also recommended for analytics projects that are embedded within research, particularly those where masters and doctoral theses are required, or in any research project where a significant amount of exploration (on data) is expected and a new theory is anticipated. In summary, the scientific method is a solid choice for research-and-discovery-leaning analytics projects as well as any engagement that is data exploratory in nature.

5.2.2 The Operations Research Project Methodology

Throughout this chapter, analytics solution methodology is taken to mean the approach used to solve a problem that involves the use of data. It is worth bringing this point up in this section again because, as mentioned in the Introduction, our perspective assumes an INFORMS audience. Thus, we are biased toward these methodology descriptions for analytics projects that will be applying some operations research/management science techniques. While it was natural to start this macro-section with the oldest, most established, mother of all exploratory methodologies (the scientific method of the last section), it is natural to turn our attention next to the macro-method established in the OR/MS practitioner community.

In general, one may find some variant of this project structure in introductory chapters of just about any OR/MS textbook, such as Ref. [13], which is in its fourth edition, or Ref. [14], which was in its seventh edition in 2002. (There have been later editions, which Dr. Hillier published alone and with other authors after the passing of Dr. Lieberman.)

Most generally, the OR project methodology steps include some form of the following progression:

  1. B.1. Define the Problem and Collect Data. As most seasoned analytics and OR practitioners know, problem statements are generally not crisply articulated in the way we have been used to seeing them in school math classes. In fact, as noted earlier, sponsors and stakeholders may have disparate and sometimes conflicting views on what the problem really is. Sometimes, some exploratory study of existing data, observing the real-world system (if it exists), and interviewing actors and users of the system helps researchers to gain the system and data understanding needed for them to clarify what the problem is that should be solved by the project. The work involved in this step should not be underestimated, as it can be crucial to later steps in the validation and in the acceptance/adoption/implementation of the project's results. It is a good idea to document assumptions, system and data understanding, exploratory analyses, and even conversations with actors, sponsors, and other stakeholders. Finding consensus about a written problem statement, or a collection of statements, can be critical to the success of the project and the study, so it is worth it to spend time on this, review it, and attempt to build broad consensus for a documented problem statement.

    Collecting data is a key part of early OR project methodology, and is intricately coupled with the problem definition step, as noted in Ref. [14]. In modern analytics projects, data collection generally means identifying and unifying digital data sources, such as transactional (event) data (e.g., from an SAP system), entity attribute data, process description data, and so on. Moving data from the system of record and transforming it into direct insights or reforming it for model input parameters are important steps that may be overlooked or underestimated in terms of effort needed.

    As noted earlier, we live in a world where “solutions” are sexy and “problems” are not–further adding to the challenge and importance of this step. In comparison with the scientific method of the previous section, this step intersects most closely with the activities and purposes of A.1, A.2, and A.3.

  2. B.2. Build a Model. There are many options for this step, depending on the type of problem being solved and on the objective behind solving it. For example, if we are seeking improved understanding, the model may be descriptive in nature, and the techniques may be those of statistical inference. If we are trying to support a complex decision, such as where to build a new firehouse and how to staff it, then we may build descriptive models to analyze current urban demand patterns; we may build predictive models that take those outputs to project future demand; and then we may build an optimization model to locate the facility so that future demand is best served. Much of this step is based on available data, as well as on available tools and skills, which sometimes means we choose to build the models that we are most familiar with or that we have the skills to support. This step most intersects with the activities and purposes of A.3, although it is not an exact mapping. (A small, self-contained sketch of this firehouse-siting idea appears after this list.)
  3. B.3. Find and Develop a Solution. In OR, this traditionally has meant the work of solving the equations or doing the math that finds the solution, designing the algorithm, and coming up with a computer code to implement the algorithm. There are many variants of this step today because the models may be derived fully from data or logic, and the micro-methods for finding the solutions can be specific to the technique. However, the common denominator here has to do with the algorithm, or in some cases, the heuristic: It is the recipe for taking the data, assumptions, and so on, and converting them into a usable result, however that is done. Computer code just helps us to do that most efficiently. This step intersects most closely with the activities and purposes of A.4, although the mapping is only partial. As we shall see in a later section, this step interlocks with micro-solution methodologies that can constitute the details of this macro-step.
  4. B.4. Test (Verify) and Validate. This step is actually a whole bunch of activities. Testing and verifying are often used interchangeably in software development, and since we often program (i.e., “implement”) our model solution (algorithm, heuristic, process, model, etc.), the interchange works here in the OR project methodology. The act of testing, or verifying, is making sure that whatever it is you made and are calling the model or solution is actually doing what you think it is doing. This is different from validation, which is making sure a model is representative of whatever you are trying to mimic, for example, a real-world system or process and a decision-making scenario. Validation asks the following question: Does the model behave as if it were the real system? There are entire areas of research devoted to these topics, not just in the analytics and OR fields, but in statistics and software engineering as well. They all are better because of the cross learning that has happened. For example, statistical methods can be used to generate and verify test cases. In validation, statistical methods are often used in rigorous simulation studies–which are basically statistical experiments done with a computer program, and as such lend themselves very nicely to things such as pairwise comparison with historical observations from the true system. Dr. Robert Sargent is one of the pioneers in computer simulation, output analysis and verification, and validation methodologies–the canonical methods he described in his 2007 paper [15] provide valuable lessons not only for simulation modelers, but also for those doing testing, verification, and validation in other types of analytics and OR projects.
  5. B.5. Disseminate, Use, or Deploy. Once the solution is ready to be used, it is rolled out (disseminated, deployed), and the work is still not done! Usually, at this stage, there needs to be training, advocacy, sometimes adjustment, and virtually always maintenance (fixing things that are wrong, or adding new features as the users and stakeholders hopefully become enthralled with the work and have new ideas for it). At this stage, it is usually useful to have baked in some monitoring–that is, if you can think ahead to put in metrics that automatically observe value that is being derived from using the solution, that's awesome foresight. In too many analytics and OR projects, deployment and dissemination merely means a final presentation and report. In some cases, those recommendations are good enough! In others, they might signal that the true solution is not really intended to be “used.” Sometimes, this leads to an iterative process of refinement and redeployment, allowing practitioners to restart this entire step process. In other cases, you write the report, and perhaps an experience paper gets submitted to a peer-reviewed journal or is presented at an INFORMS conference. Whatever the outcome, practitioners need to keep in mind that all projects are worthy learning experiences–even the ones that are not deployed in the manner in which we were hoping.
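To illustrate the model-building and solution-finding steps (B.2 and B.3) on the firehouse example mentioned above, here is a minimal, self-contained sketch; the demand points, projected demands, and candidate sites are hypothetical, and simple enumeration over a short candidate list stands in for whatever optimization technique a real project would apply.

```python
import math

# (x, y) neighborhood centroids with projected demand, e.g., outputs of the
# descriptive and predictive models described in B.2 (values are made up).
demand_points = [((1, 1), 40), ((4, 2), 25), ((2, 5), 35), ((6, 6), 20)]

# Hypothetical feasible parcels for the new firehouse.
candidate_sites = [(2, 2), (3, 4), (5, 5)]

def weighted_cost(site):
    """Total projected-demand-weighted Euclidean distance to all neighborhoods."""
    return sum(w * math.dist(site, p) for p, w in demand_points)

best_site = min(candidate_sites, key=weighted_cost)
print("best candidate:", best_site, "cost:", round(weighted_cost(best_site), 2))
```

The "algorithm" here is trivial enumeration; the point is that the recipe for turning data and assumptions into a usable recommendation is explicit and repeatable.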

It is not surprising that the OR project method, being exploratory in nature, is somewhat of a derivative of the scientific method. As Hillier and Lieberman point out in the introductory material of Ref. [16], operations research has a fairly broad definition, but in fact gets its name from research on operations. The study objects of the research are “operations,” or sometimes “systems.” These operations and systems are often digital in their planning and execution, and so tons of data now exist with which to recreate, model, and experiment with them. In other words, these observable digital histories are rich in data (analytics) that can be used to build models very quickly. Unfortunately, the ability to jump right into modeling, analysis, and conclusions often means skipping over early methodological steps, particularly in the area of problem definition.

5.2.3 The Cross-Industry Standard Process for Data Mining (CRISP-DM) Methodology

“The cross-industry standard process for data mining methodology,” [17,18] known as CRISP or CRISP-DM, is credited to Colin Shearer, who is considered to be a pioneer in data mining and business analytics [19]. This methodology heavily influences the current practical use of SPSS (Statistical Package for the Social Sciences), a software package with its roots in the late 1960s that was acquired by IBM in 2009 and that is currently sold as IBM's main analytics “solution” [18].

As an aside, note that SAS and SPSS are commercial packages that were born in about the same era and that were designed to do roughly the same sort of thing–the computation of statistics. SAS evolved as the choice vehicle of the science and technical world, while SPSS got its start among social scientists. Both have evolved into the data-mining and analytics commercial packages that they are today, heavily influencing the field. As mentioned earlier, the “descriptive–predictive–prescriptive” paradigm appears to have its roots in SAS. As noted above, CRISP is heavily peddled as the methodology of choice for SPSS. However, we note that this methodology is a viable one for data-mining methods that use any package, including R and SAS.

The steps of the CRISP-DM macro-methodology, from Ref. [17], are the following:

  1. C.1. Business Understanding. This step is, essentially, the domain understanding plus problem definition step. In the business analytics context, CRISP calls out specific activities in this step, such as stating background, defining the business objectives, defining data-mining goals, and defining the success criteria. Within this step, traditional project planning (cost/benefits, risk assessment, and project plan) are included. This step also involves assessment of tools and techniques. Note that this step aligns with B.1 of the OR project methodology.
  2. C.2. Data Understanding. This is a step used to judge what data is available, by specifically identifying and describing it (for example, with a data dictionary) and assessing its quality or utility for the project goals. In most cases, actual data is collected and explored/tested.
  3. C.3. Data Preparation. This is the step where analysts decide which data to use and why. This step also includes “data cleansing” (roughly, the act of finding and fixing or removing strange or inaccurate data, and in some cases, adding, enhancing, or modifying data to fix incomplete forms), reformatting data, and creating derivative data (i.e., extracting implied or derived attributes from existing data, merging data, etc.). An example of reformatting data would be converting GIS latitude and longitude (i.e., latitude/longitude) data from degree/minute/second format to decimal degrees, that is, decimal degrees = degrees + minutes/60 + seconds/3600, with a negative sign for the southern and western hemispheres. An example of enhancing in data cleansing is finding and adding a postal code field to a street, city, state address or geocoding the address (i.e., finding the corresponding latitude/longitude). Data merging is a common activity in this step, and it generally is used to create extended views of data by adding attributes, via match up by some key. Note that a common “mistake” among inexperienced data scientists is to try to merge extremely large unsorted data sets. Packages such as SPSS, SAS, and R, and even scripting languages such as Python, allow for these common types of data movement, but without presorting lists, execution to accomplish merge operations can end up taking days instead of a few minutes when the list sizes are in the millions, which is not an unrealistic volume of data to be working with these days. (A short sketch of this coordinate conversion and of a keyed merge appears after this list.)
  4. C.4. Modeling. This is the step where models are built and applied. In data mining and knowledge discovery, the models are generally built from the data (e.g., a regression model with a single independent variable is basically a model of a linear relationship where the data is used to derive the slope and y-intercept). Other modeling-related steps include articulating the assumptions, assessing the model, and fitting parameters. Note that this step, together with the previous two steps, aligns with B.2 and B.3 of the OR project methodology.
  5. C.5. Evaluation. This step is equivalent to the OR project verification and validation step. See B.4. Note that Dwork et al. [20] give a well-recognized example of a validation method for data dependent methods.
  6. C.6. Deployment. This step is equivalent to the OR project deployment step. See B.5.
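As a concrete illustration of the data preparation step (C.3), the following is a minimal sketch of the degree/minute/second conversion and a keyed merge; the field names, values, and use of the pandas library are assumptions made for the example.

```python
import pandas as pd

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Decimal degrees = degrees + minutes/60 + seconds/3600; negative for S/W."""
    sign = -1.0 if hemisphere in ("S", "W") else 1.0
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)

# Hypothetical entity attribute data with reformatted coordinates.
stores = pd.DataFrame({"store_id": [1, 2],
                       "lat": [dms_to_decimal(40, 45, 0, "N"),
                               dms_to_decimal(41, 2, 30, "N")]})

# Hypothetical transactional (event) data.
sales = pd.DataFrame({"store_id": [1, 1, 2], "units": [10, 12, 7]})

# Keyed merge to create an extended view of the transactions; for very large
# extracts, sorting or indexing on the merge key beforehand can matter, as
# cautioned in the text.
merged = sales.merge(stores, on="store_id", how="left")
print(merged)
```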

The CRISP-DM macro-methodology is thought of as an iterative process. In fact, the scientific method and the OR project method can also be embedded in an iterative process. More details of the CRISP-DM macro-methodology can be found in Chapter 7.

5.2.4 Software Engineering-Related Solution Methodologies

Software engineering is relevant to analytics macro-solution methodology because of the frequent expectation of an outcome implemented in a software tool or system. The steps of the most standard software engineering methodology, the waterfall method, are the following:

  1. D.1. Requirements. This step is a combination of understanding the business or technical environment in which a system will be used and identifying the behavior (function) and various other attributes (performance, security, usability, etc.) that are needed for a solution. Advisable prerequisites for identifying high quality requirement specifications are problem, business, and data understanding. Thus, this step aligns with B.1, C.1, and C.2.
  2. D.2. Design. The design step in software engineering translates the requirements (usually documented in a “specification”) into a technical plan that covers, at a higher level, the software components and how they fit together, and at a lower level, how the components are structured. This generally includes plans for databases, queries, data movement, algorithms, modules or objects to be coded, and so on.
  3. D.3. Implementation. Implementation refers to the translation of the design into code that can be executed on a computer.
  4. D.4. Verification. Similar to previous macro-methodologies, verification means testing. In software, this can be unit testing, system testing, performance testing, reliability testing, and so on. The step is similar to other macro-method verification steps in that it is intended to make sure that the code works as intended.
  5. D.5. Maintenance. This is the phase, in software development, that assumes the programs have been deployed and when sometimes either bug fixes will need to be done or else new functions may be added.

A number of other software engineering methodologies exist. See, for example, Ref. [21] for descriptions of rapid application development (comprised of data modeling, process modeling, application generation, testing, and turnover), the incremental model (analysis, design, code, test, etc.; analysis, design, code, test, etc.; analysis, design, code, test, etc.), and the spiral model (customer communication, planning, risk analysis, engineering, construction and release, evaluation). When looking more deeply at these steps, one can see that they can also be mapped to the other macro-methodologies–note that Agile, a popular newer form of software development, is very much like the Incremental model in that it focuses on fast progress with iterative steps.

5.2.5 Summary of Macro-Methodologies

Figure 5.2 shows how the four macro-solution methodologies are comparatively related. It is not difficult to imagine any of these macro-methodologies embedded in an iterative process. One can also see, through their relationships, how it can be argued that each one, in some way, is derivative of the scientific method.


Figure 5.2 Relationship among the macro-methodologies.

Every analytics project is unique and can benefit from following a macro-methodology. In fact, a macro-methodology can literally save a troubled project, can help to ensure credibility and repeatability, can provide a structure to an eventual experience paper or documentation, and so on. In fact, veteran practitioners may use a combination of steps from different macro-methodologies without being fully conscious of doing so. (All fine and good, but, in fact, you veterans could contribute to our field significantly if you documented your projects in the form of papers submitted for INFORMS publication consideration and if, in those papers, you described the methodology that you used.)

The take-home message about macro-methodologies is that it is not necessarily important exactly which one of them you use–it's just important that you use one (or a hybrid) of them. It is recommended that, for all analytics projects, the steps of problem definition and verification and validation be inserted and strictly followed, whether the specific macro-methodology used calls them out directly or not.

5.3 Micro-Solution Methodologies for the Analytics Practitioner

In this section, we turn our attention to micro-methodology options available to the analytics practitioner.

5.3.1 Micro-Solution Methodology Preliminaries

In general, for any micro-methodology, two factors are most significant in how one proceeds to “solutioning”:

  1. The specific modeling approach
  2. The manner in which the data (analytics) are leveraged with respect to model building as well as analysis prior to modeling and using the model

Modeling approaches vary widely, even within the discipline of operations research. For example, data, numerical, mathematical, and logical models are distinguished by their form; stochastic and deterministic models are distinguished by whether they consider random variables or not; linear and nonlinear models are differentiated by assumptions related to the relationship between variables and the mathematical equations that use them, and so on. We note that micro-solution methodology depends on the chosen modeling approach, which in turn depends on domain understanding and problem definition–that is, some of those macro-methodology steps covered in the previous section. Skipping over those foundational steps becomes easier to justify when the methods that are most closely affiliated with them (e.g., descriptive statistics and statistical inference) are side-lined in a rush to use “advanced (prescriptive) analytics.”

Thus, we begin this micro-solution methodology section by re-stating the importance of following a macro-solution methodology, and by emphasizing that the selection of appropriate micro-solution methodologies–which could even constitute a collection of techniques–is best accomplished when practitioners integrate their selection considerations into a systematic framework that enforces some degree of precision in problem definition and domain understanding, that is, macro-method steps in the spirit of A.1, A.2, A.3, B.1, B.2, C.1, C.2, C.3, and D.1 (see Figure 5.2).

All of this is not to diminish the importance of the form and purpose of the project analytics, that is, the data, in selection of micro-solution methodologies to be used. In fact,

  • how data are created, collected, or acquired,
  • how data are mined, transformed, and analyzed,
  • how data are used to build and parameterize models, and
  • whether general “solutions” to models are dependent or independent of the data

are all consequential in micro-solution methodology. However, it is the model that is our representation of the real world for purposes of analysis or decision-making, and as such it gives the context for the underlying problem and the understanding of the domain in which “solving the problem” is relevant. This is why consideration of (i) the specific modeling approach should always take precedence over (ii) the manner of leveraging the data. Thus, this section is organized around modeling approaches first, while taking their relationship to analytics into account as a close second.

5.3.2 Micro-Solution Methodology Description Framework

This section presents the micro-solution methodologies in these three general groups:

  1. Group I. Micro-solution methodologies for exploration and discovery
  2. Group II. Micro-solution methodologies using models where techniques to find solutions are independent of data
  3. Group III. Micro-solution methodologies using models where techniques to find solutions are dependent on data

Note that these groups are not directly aligned with the “descriptive–predictive–prescriptive” paradigm but are intended to complement the paradigm. In fact, depending on the nature of the underlying problem being “solved,” and as this section shall illustrate, a micro-methodology very often draws from two or three of the three (i.e., “descriptive,” “predictive,” and “prescriptive”) characterizations at a time–sometimes implicitly, and at other times explicitly.

Since it is impractical to cover every conceivable technique, this section covers an array of historically common techniques relevant to the INFORMS and analytics practice with the goals of illustrating how and when to select techniques. (Note that we will use the word technique or method to describe a specific micro-solution methodology.) While pointers to references are provided for the reader to find details of specific techniques, we use certain model and solution technique details to expose why choosing an approach is appropriate, how the technique relates to micro (and in some cases, macro)-methodology, and to compare and contrast choices in an effort to help the reader differentiate between concepts. And while there are many, many flavors of models and modeling perspectives (e.g., an iconic model is usually a physical representation of the real world, such as a map or a model airplane), we'll generally stay within the types of models most familiar to the operations research discipline. Further reading on the theory of modeling can be found in the foundational work of Zeigler [22], in introductory material of Law and Kelton [23], and of course in our discipline standards such as Hillier and Lieberman [14,16] and Winston [13]. Others, such as Kutner et al. [24], Shearer [25], Hastie et al. [26], Provost and Fawcett [27], and Wilder and Ozgur [28], expose and contrast the practice and theory of modeling led from the perspective of data first. General model building is also the topic of the next chapter of this book.

We turn next to the presentation of each of the above micro-solution methodology groups. Each micro-methodology group is presented using the following framework:

  1. What are the general characteristics of problems we try to “solve” by micro-solution methodologies of this group? What are some examples?
  2. Which models are used by the micro-solution methodologies of this group? What are the typical underlying assumptions of the models, and what are their advantages and disadvantages?
  3. How are data considered by this group? That is, how are data created, collected, acquired or mined, transformed, analyzed, used to build and parameterize models, and so on?
  4. What are some of the known techniques related to finding solutions to the underlying problem based on use of each model type?
  5. What is the relationship to macro-methodology steps?
  6. What are the main takeaways regarding the micro-methodology group?

5.3.3 Group I: Micro-Solution Methodologies for Exploration and Discovery

This group of micro-solution methodologies includes everything we do to explore operations, processes, and systems to increase our understanding of them, to discover new information, and/or to test a theory. Sometimes, the real-world system, which is the main object of our study, exists and is operational so that we can observe it, either directly or through a data history (i.e., indirectly). Sometimes, the operation we are interested in does not exist yet, but there are related data that help us understand the environment in which a new system might operate. The important thread for this group involves discovery.

Group I: Problems of Interest

Problems that are addressed by methods in this exploratory group are in this group because they can be generally characterized by, for example, the following questions: How does this work? What is the predominant factor? Are these two things equal? What is the average value? What is the underlying distribution? What proportion of these tests are successful? In fact, it is in this group that the (macro) scientific method has most relevance, because it helps us to formulate research queries and structure the processes of collecting data, estimating, and inferring. Exploration and discovery is often where analytics projects start, both in research and the real world of analytics practice. It is also not uncommon to repeat or return to exploration and discovery steps as a project progresses and new insights are found, even from other forms of micro-solution methodologies. As an example, consider a linear programming model (that will be covered in Group II) that needs cost coefficients for instantiating the parameters of an objective function. In some cases, simple unit costs may exist. In many real-world scenarios, however, costs change over time and have complex dependencies. Thus, estimating the cost coefficients may be considered an exploration and discovery subproblem within a project. In this example, the problems addressed may be finding the valid range for a fixed cost coefficient's value or finding 95% confidence intervals for the cost coefficients. Questioning the assumption that the cost function is indeed linear with respect to its variable for a specified range is another example of a problem here.

Group I: Relevant Models

When considering exploration and discovery, the relevant models are statistical models. Here, we mean statistical models in their most general sense: the underlying distributions, the interplay between the random variables, and so on. In fact, part of the exploration may be to determine the relevant underlying statistical model–for example, determining if an underlying population is normally distributed in some key performance metric, or if a normal-inducing transformation of observations will justify a normality assumption. The importance of recognizing the underlying models formally when doing exploration and discovery is related to the assumptions formed for using subsequent techniques.

Group I: Data Considerations

Data for this micro-methodology group of exploration and discovery may be obtained in a number of ways. In the most classic deployment of the scientific method, data are created specifically to answer the exploration questions, by running experiments, observing, and recording the data. In today's world of digital operations and systems, historical data are often available to enable the exploration and discovery process. Data “collection” in these digital cases may take more of the form of identifying digital data sources, exploring the data elements and characterizing their meaning as well as their quality, and so on, and even “mining” large data sets to zero in on the most pertinent forms of the data. In these cases of already-existing data, it is equally important to consider the research questions, the underlying problem being solved, and the relevant models. For example, one may have a fairly large volume of data to work with (i.e., “Big Data”), but despite the generous amount of data, the data may cover a time period or geography that is not directly relevant to the problem being studied. For example, if a database contains millions of sales transactions for frozen snacks purchased in Scandinavian countries during the months of January and February, the data may not be relevant to finding the distribution of daily demand for the same population during summer months, or for a population of a different geography at any time, or for the distribution of daily demand for frozen meals (i.e., nonsnacks) for a population of any geography in any time period. In some situations, we may have so much data (i.e., “Big Data”) that we decide to take a representative random sample.

In general, for this group of methods, the problem one wishes to solve and the assumptions related to the statistical models considered are the most important data considerations. In certain cases, practitioners may like to think that their exploration process is so preliminary that a true problem statement (that is sometimes stated as a research question plus hypotheses) and any call out of modeling assumptions are considered unnecessary. However preliminary, exploration can usually benefit by introducing some methodological steps, even if the problem statement and modeling assumptions are themselves preliminary.

Group I: Solution Techniques

Keeping in mind that “solving” a problem related to an exploration and discovery process involves trying to answer an investigational question, it should be no surprise that techniques related to descriptive statistical models are at the core of the micro-solution methodologies for this group. Applied statistical analysis and inference have a traditional place in the general research scientific methods related to exploration, and they also carry the discovery needed for the data handling and wrangling required by other “advanced” models and solution techniques. In fact, one of the great ironies of our field is that the statistical models and techniques that constitute “descriptive models and techniques” are the oldest and most well formed in theory and practice of all solution methodologies related to analytics and operations research. Hence, passing them over for “advanced” (e.g., prescriptive or predictive) techniques should elicit at least some derision.

This collection of techniques might be, arguably, the most important subset of the micro-solution methodology techniques. Why? Because even prescriptive and predictive techniques rely on them.

Techniques here range from deriving descriptive statistics (mean, variance, percentiles, confidence intervals, histograms, distributions, etc.) from data to advanced model fitting, forecasting, and linear regression. Supporting techniques include experimental design, hypothesis testing, analysis of variance, and more–many of which are disciplines and complete fields of expertise in and of themselves.

The methods of descriptive statistics are fairly straightforward, and most analytics professionals likely have their favorite textbooks to use for reference. For example, coming from an engineering background, one may have used Ref. [29]. Reference [30] is the standard for mathematics-anchored folks. Reference [31] is the usual choice for the serious experimenters. For the most part, all of these methods help us to use and peruse data to gain insights about a process or system under study. Usually, that system is observable, either directly or indirectly (e.g., in the form of a digital transaction history, which is often the case today). While not as old as the scientific method, the field of statistics is old enough to have developed a great amount of rigor–but it also has lived through a transformational period over the past 30+ years, as we've moved from methods that rely on observations that needed to be carefully planned (i.e., experimental design) and took great effort to collect (i.e., sampling theory and observations) to a world in which data are ubiquitous. In fact, many Big Data exploratory methods are based on using statistical sampling techniques–even though we may have available to us, in glorious digital format, an exhaustive data set, that is, the entire population!

Histograms, boxplots, scatter plots, and heatmaps (showing the correlation coefficient statistics between pairs of variables) are examples of visualizations that, paired with descriptive statistics and inference, help practitioners to understand data and to check assumptions. See Figures 5.3–5.6, respectively. Histograms and boxplots are powerful means of identifying outliers and anomalies that may lead to avoiding data in certain ranges, identifying missing values, or even spotting evidence of data-transmission errors.
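As a hedged illustration of these visual checks, the sketch below generates synthetic daily demand data (not the food pantry data shown in Figures 5.3–5.6) and draws a quick histogram and boxplot with matplotlib; the Poisson assumption and the injected outlier are purely for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
daily_demand = rng.poisson(lam=30, size=365)   # hypothetical daily unit demand
daily_demand[100] = 400                        # an injected outlier to spot

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(daily_demand, bins=20)
ax1.set_xlabel("Daily units")
ax1.set_ylabel("Frequency")
ax2.boxplot(daily_demand)
ax2.set_ylabel("Daily units")
plt.tight_layout()
plt.show()   # the boxplot flags the injected value well outside the whiskers
```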

A bar chart with frequency on the y-axis (scale 0–100) and daily disbursements (units) on the x-axis, binned from fewer than 6 to 101 or more.

Figure 5.3 An example of a histogram showing the frequency (distribution) for unit disbursements of a single food item at a New York City digital food pantry from January 2, 2013 to April 24, 2017.

img

Figure 5.4 An example of a boxplot showing the distribution of unit weekday demand for three food categories at a New York City digital food pantry from January 2, 2013 to April 24, 2017.

img

Figure 5.5 Example of a scatter plot visually showing the relationship between the daily (mean) demand and nonfill percentage for a set of stock keeping units. There appears to be no significant correlation for this product set.

img

Figure 5.6 Example of a heat plot visually showing the pairwise correlation coefficients for a set of stock keeping units (SKUs). Off the diagonal, several negatively correlated pairs of SKUs are indicated in dark red, and several positively correlated pairs are indicated in blue.

Descriptive statistics are equally powerful for exploring nonquantitative data. Finding the number of unique values of a text field, and finding how frequently these unique values occur in the data, is standard for understanding data. Again, together with scatter plots and heatmaps for data visualization, correlation analysis is usually done during data exploration to help practitioners understand the relationships between different types of data.

Overall, the micro-methodologies formed by the wealth and rigor of statistical analysis provide the analytics professional with tools that are specifically aimed at drawing conclusions in a systematic and fact-based way and at getting the most out of the data available, while also taking into consideration some of the inherent uncertainty of conclusions. For example, computing a confidence interval for an estimated mean not only gives us information about the magnitude of the mean but also provides a direct methodology for deciding whether the true mean is actually equal to some value. We can test whether the mean is really zero by checking whether the confidence interval includes the value zero. By taking variance and sample size into its calculation, the confidence interval, along with the underlying distributional assumption, gives us a hint about how much we can rely on this type of test.
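
To make this concrete, the following minimal Python sketch (not from the chapter; all data values are synthetic and hypothetical) computes a 95% confidence interval for a sample mean using SciPy and checks whether zero is a plausible value for the true mean:

import numpy as np
from scipy import stats

# Hypothetical daily demand deviations from a planning target (units);
# the values are synthetic and for illustration only.
rng = np.random.default_rng(42)
deviations = rng.normal(loc=0.8, scale=3.0, size=40)

n = deviations.size
mean = deviations.mean()
sem = stats.sem(deviations)          # standard error of the mean

# 95% confidence interval for the true mean, using the t distribution
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

# If the interval excludes zero, we reject H0: mu = 0 at the 5% level.
print("zero plausible?", lo <= 0.0 <= hi)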

Hypothesis tests in general are among the most powerful and rigorous ways to draw solid, fact-based conclusions. The methods of hypothesis testing depend on what type of statistic is being used (mean, variance, proportion, etc.), the nature of the test (comparison to a fixed value, comparison of two or more statistically estimated values, etc.), how the data were derived (sampling assumptions and overall experimental design), and other assumptions, such as the underlying population's distribution. In going from the sparse, hard-to-get data of the past to the abundant, sometimes full-population data of the present, many practitioners appear to be sidestepping the rigor and power of statistical inference and losing, perhaps, some of the credibility and value of their conclusions. In fact, one way to bring this practice back on track is to tie the micro-methods of statistics back into the macro-methodologies, either the scientific method, which has natural hypothesis-setting and testing steps, or macro-methods whose steps are derivatives of it.
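
As a small illustration of one such test, the sketch below (hypothetical, synthetic samples; SciPy's ttest_ind is one of many possible choices) compares the means of two independent samples with Welch's two-sample t-test:

import numpy as np
from scipy import stats

# Hypothetical weekly demand samples for the same item at two pantry sites.
rng = np.random.default_rng(7)
site_a = rng.normal(loc=52.0, scale=8.0, size=30)
site_b = rng.normal(loc=47.0, scale=9.0, size=28)

# Welch's two-sample t-test: H0 says the two population means are equal.
t_stat, p_value = stats.ttest_ind(site_a, site_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Reject H0 at the 5% significance level if p < 0.05.
print("difference significant at 5%?", p_value < 0.05)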

Within the myriad applied statistical techniques for understanding processes and systems through data, an incredibly powerful methodology that should be in every analytics professional's toolbox is ANOVA, the analysis of variance. In its tabular form, ANOVA is the quintessential approach for understanding data by virtue of how it helps analysts organize and explain sources of variance (and error). The method gets its name from the fact that the table is an accounting of variance by attributable source; one way to think of it is as a bookkeeping practice for explaining what causes variance. ANOVA tables are a natural mechanism for performing statistical tests, such as comparisons of variance to see which source in a system is more significant. A basic extension of ANOVA is the multivariate analysis of variance (MANOVA), which considers the presence of multiple dependent variables at once.
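
A minimal one-way ANOVA sketch, again with purely hypothetical data, might look like the following; scipy.stats.f_oneway is used here as a stand-in for the full tabular ANOVA treatment found in the referenced texts:

import numpy as np
from scipy import stats

# Hypothetical daily disbursements for three food categories (units/day).
rng = np.random.default_rng(1)
produce = rng.normal(60, 10, size=25)
dairy = rng.normal(55, 10, size=25)
grains = rng.normal(62, 10, size=25)

# One-way ANOVA: does the category (the attributable source) explain a
# significant share of the variance in daily disbursements?
f_stat, p_value = stats.f_oneway(produce, dairy, grains)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")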

Any statistics textbook of worth should have at least one chapter devoted to ANOVA computations and applications, including tests. Reference [32] is a favorite text for analysts who frequently use regression analysis, which is closely tied to the methodology of ANOVA–they basically go hand in hand. Regression is the stepping stone for analytics, and in particular for modeling that is derived from data–it is the essential method when one wishes to find a relationship, generally a linear equation, between one or more independent variables and a response variable. The mechanics of the method involve estimating the values of a y-intercept and a slope (for a single independent variable). This is the method of least squares, and it is basically the solution to an embedded optimization problem. The solution methodology for the least squares problem, for example, Ref. [33], also illustrates that the techniques of micro-methodologies often depend on one another–in this case, a statistical modeling technique depends on an underlying optimization method. Figure 5.7 exhibits a range of observations before applying a transformation to linearize the data and fit a linear regression (see Figure 5.8), illustrating another common pairing of complementary techniques (i.e., a mathematical data transformation prior to applying a micro-methodology).
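
A minimal sketch of that transform-then-fit step, assuming a hypothetical exponential-looking relationship similar in spirit to Figures 5.7 and 5.8, could look like this:

import numpy as np

# Hypothetical observations with an exponential-looking relationship.
rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(mean=0.0, sigma=0.1, size=x.size)

# Transform to linearize: ln(y) = ln(a) + b*x, then fit by least squares.
slope, intercept = np.polyfit(x, np.log(y), deg=1)
a_hat, b_hat = np.exp(intercept), slope
print(f"fitted model: y ~ {a_hat:.2f} * exp({b_hat:.2f} * x)")

The least squares fit happens inside np.polyfit; the practitioner's modeling contribution is recognizing that a logarithmic transformation makes the linear machinery applicable.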

img

Figure 5.7 An independent and a response variable before transformation-induced linearity.

img

Figure 5.8 An independent and a response variable after transformation-induced linearity, with linear regression line.

In summary, micro-methodologies for exploration and discovery rely on the following core techniques:

  • Basic statistics
  • Experimental design
  • Sampling and estimation
  • Hypothesis testing
  • Linear regression
  • ANOVA and MANOVA

Group I: Relationship to Macro-Methodologies

This area of analytics and OR is most closely and traditionally related to the scientific method and to discovery and research processes in general. It is not surprising that there are hundreds, maybe thousands, of textbooks devoted to this statistical topic: virtually every field of study and research in science, the social sciences, education, engineering, and technology relies on these methods as the underlying basis for testing research questions and drawing conclusions from data.

Group I: Takeaways

An important function of applied statistics in the analytics world today is preparing data for other methods, for example, creating the parameters for the math programming techniques described in the previous section. In this case, and in the case of the methods covered in the subsequent sections, statistical inference provides the systematic process and rigor behind the data-preparation steps of just about any other analytics or OR method that relies on data. Thus, in virtually every analytics project involving data, statistical analysis, and particularly inference, will have a role.

5.3.4 Group II: Micro-Solution Methodologies Using Models Where Techniques to Find Solutions Are Independent of Data

Next, we consider micro-methodologies built on models for which the techniques used to “solve” problems are independent of data. Note that this does not mean that the models and techniques do not use data. On the contrary! Here, “independence of data” means that we can find a general solution path whether or not we know the data; in other words, we can derive a solution approach and then plug the data in later in order to say something about a particular instance of the problem and its solution.

Group II: Problems of Interest

This group is distinguished by the fact that data, that is, our analytics, create an instance of the problem through parameters such as coefficients, right-hand-side values, interarrival time distributions, and so on. Problems of interest in this group are those in which we seek a modeling context that allows for either experimentation (as an alternative to experimenting on the real-world system) or decision support (i.e., optimization). The problem statements that characterize this group are of one of two forms: experimental (i.e., what-if analysis) or prescriptive (e.g., what should I do to optimize?).

As discussed in the Introduction of this chapter, problem statements are often elusive, particularly in the early phases of a real-world project. In that spirit, it is not uncommon to have a problem statement formulated somewhat generally for this group: How can I make improvements to the system (or operation) of interest? Or, how can I build the best new system given some set of operating assumptions?

Group II: Relevant Models

Some of the modeling options relevant to this group include the following:

  • Probability models
  • Queueing models
  • Simulation and stochastic models
  • Mathematical and optimization models
  • Network models

Indeed, these modeling options include many viable modeling paths. The most significant factor in determining the modeling path relates back to questions that are fundamental to the problem statement, which may also characterize the analytics project objective: Do I want to model an existing or a new system? Am I trying to build a new system or improve an existing one? How complex are the dynamics of the system? Are there clear decisions to be made that can be captured with decision variables and mathematical equations (or inequalities) that constrain the variables and may also be used to drive an objective function that minimizes or maximizes something?

Group II: Data Considerations

In this group, data serve the purpose of creating parameters for the models. For simulation, probability, and queueing models, this may mean data that help to fit distributions for describing interarrival or service times or any other random variables in a system. For optimization models, we generally seek data for parameterizing right-hand-side values, technical coefficients within constraint equations, objective function cost coefficients, and so on.

Traditionally, operations researchers developed models with scant or hoped-for data. In some cases, practitioners compensated for unavailable data by making inferences from logic and/or by using sensitivity analysis to test the robustness of solutions with respect to specific parameter input values. Indeed, it is not entirely surprising that models with data-independent solution techniques became the original core of operations research modeling, given the preanalytics-era challenge of data availability.

In today's world of analytics, a new challenge is that the data available for parameterizing models in this class may be too much (versus the old problem of too little). In this case, the micro-methods of Group I come in handy and should be used for everything from computing point estimates, to finding confidence intervals that specify interesting ranges for sensitivity analyses, to distribution fitting and hypothesis testing.
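
For instance, one might fit an interarrival-time distribution from a transaction log before plugging its parameter into a queueing or simulation model; the sketch below uses synthetic, hypothetical data and SciPy's fitting and goodness-of-fit routines as one possible approach:

import numpy as np
from scipy import stats

# Hypothetical interarrival times (minutes) observed from a transaction log.
rng = np.random.default_rng(11)
interarrivals = rng.exponential(scale=4.0, size=500)

# Fit an exponential distribution (location fixed at 0) to estimate the
# arrival-rate parameter needed by a queueing or simulation model.
loc, scale = stats.expon.fit(interarrivals, floc=0)
arrival_rate = 1.0 / scale
print(f"estimated mean interarrival time = {scale:.2f} min, "
      f"arrival rate = {arrival_rate:.3f} per min")

# A rough goodness-of-fit check (Kolmogorov-Smirnov) before trusting the fit.
ks_stat, p_value = stats.kstest(interarrivals, "expon", args=(loc, scale))
print(f"KS statistic = {ks_stat:.3f}, p = {p_value:.3f}")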

Group II: Solution Techniques

  • Basic Probability. Practitioners should use these techniques following the choice of models that yield descriptions about the inherent uncertainty of events in a system. These are techniques used to estimate discrete choice probabilities or to fit probability distribution parameters. The quintessential example is estimating the probabilities of simple events such as those in a decision tree (see Figure 5.9). Comprehensive treatment of probability models and their solution techniques can be found in Ref. [34].
  • Stochastic Processes. In general, one moves to stochastic processes (from basic probability models and techniques) when there is a dynamic aspect of systems being studied. Processes are often described by states and transitions, either discrete or continuous in nature. Comprehensive treatment of solution techniques can be found in Ref. [35].
  • Queueing Theory. A queueing system is basically any system where waiting in line may occur when there is contention for one or more limited resources. These systems occur almost everywhere! For example, they occur when people wait in line for a cashier at a grocery store, a bank teller, or an ATM; they occur when manufacturing subassemblies (i.e., partially finished products) wait for the attention of machines and operators; and they occur virtually in call centers and communications systems (e.g., see Ref. [36]). An example of a queueing system configuration in a manufacturing system is given in Figure 5.10. Techniques in this area are derivatives of probability, stochastic processes, systems theory, differential equations, and calculus. In simpler systems, a closed-form solution (i.e., a well-formed equation) may exist; in more complicated systems, an approximation or bounding method is used because the equations to “solve” (e.g., find the number of servers that ensures the expected waiting time is no more than x) cannot be derived. One of the most important results in this area, for our field, is Little's law (L = λW, see Refs [37,38]), which relates the long-run average number in the system, L, to the average time in the system, W, for a system with arrival rate λ. For a comprehensive treatment of this area, see the foundational work in Refs [39,40].
  • Monte Carlo Simulation. This technique has its roots in numerical methods–the canonical application is computing an estimate of a definite integral, that is, the area under a function within a range. The technique works by converting random numbers (between 0 and 1) into points that land proportionally under or over the function. The area approximation is found by counting the number of points generated under the curve and comparing that count with the total number of points generated over the region containing the function. Today, this method forms the basis for the acceptance–rejection method of random variate generation (see Ref. [23]) and for estimating performance metrics of a system when a sophisticated treatment of time advance is not needed. A minimal sketch of Monte Carlo integration appears after Figure 5.10.
  • Discrete Event Simulation. This technique extends the techniques of Monte Carlo by considering the advance of time in a more sophisticated fashion, that is, the time flow mechanism and an event calendar that keeps track of discrete events to be processed. Discrete events provide the logic for updating system state variables, which dynamically represent the system and are used to capture performance variables of interest such as (for a queueing system): resource utilization, waiting time, number in line, and others.

    When random variate generation is used to create, for example, interarrival and service times, these models are considered stochastic. In general, discrete event simulation models rely heavily on statistical and probability models and techniques for preparing inputs. Once implemented in computer code (either in a high-level language or in a package designed explicitly for simulation), stochastic simulation models basically form experimental systems: they attempt to mimic the real-world system (or some scoped portion of it) for the purpose of performing what-if analyses. For example, when simulating an inventory-control system, how are stock-outs affected if daily demand doubles but the inventory replenishment and ordering policies stay the same? In simulating the traffic flowing through the intersection of two major roads, what is the impact on the average time spent waiting for a red light to turn green if the light's cycle time is changed from 45 to 60 seconds? In simulating cashier lanes in a popular grocery store, will five cashier lanes be sufficient to ensure that all check-out lanes have fewer than three customers at least 95% of the time?

    Simulation modeling is one of the most malleable techniques in our analytics toolbox. It is also one of the easiest to abuse (e.g., when results from unverified or unvalidated simulation models are proclaimed as “right”). From an analytics solution methodology perspective, it is important to note that simulation output data should themselves be analyzed statistically–that is, the appropriate macro- and micro-solution methodologies should be applied to the output of simulations. A comprehensive treatment of system simulation is provided in Ref. [23]. In general, this subfield of OR has led the way in methodological innovations, as exemplified by the aforementioned work in model verification and validation by Sargent [15].

  • Mathematical Programming and Optimization. Mathematical programming and discrete optimization models and techniques are at the core of the operations research discipline. These techniques form what has become known as the prescriptive category. At this point, it is worth noting that prescriptive approaches provide the general context for the decisions they are designed to support–that is, they define how to prescribe in general–while data, in the form of model inputs and outputs, provide the instantiation–that is, they let us use the model to prescribe for a specific problem instance. For a more in-depth discussion of the INFORMS definition of analytics, aligned with the notion of making better decisions, see Ref. [41].
img

Figure 5.9 Illustration of a decision tree. Square (blue) nodes represent decision points with choice arcs emanating from them. Ovals (orange) represent external events, with uncertainties captured in adjacent probabilistic arcs. Diamonds (gray) illustrate the space of all possible outcomes, each with an associated value.

img

Figure 5.10 Example of a complex queueing system involving four servers in sequence. Work items arrive for service from the top server, then move sequentially downward, queue to queue. When completed by the fourth server, they leave the system.
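
Returning to the Monte Carlo bullet above, the following minimal sketch (with a hypothetical choice of integrand and bounding box) estimates a definite integral with the hit-or-miss approach described there:

import numpy as np

# Monte Carlo estimate of the definite integral of f(x) = x**2 on [0, 2]
# (exact value 8/3), using the hit-or-miss approach.
rng = np.random.default_rng(5)
n = 100_000
x = rng.uniform(0.0, 2.0, size=n)        # random x over the range
y = rng.uniform(0.0, 4.0, size=n)        # random y within the bounding box
hits = np.count_nonzero(y <= x**2)       # points landing under the curve

box_area = 2.0 * 4.0
estimate = box_area * hits / n
print(f"Monte Carlo estimate = {estimate:.4f} (exact = {8/3:.4f})")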

This collection of techniques includes linear programming, nonlinear programming, integer programming, mixed-integer programming, and discrete and combinatorial optimization. A set of specialty algorithms and methods related to network flows and network optimization is often included with these models and techniques.

These methods all begin similarly: There is a decision to be made, where the decision can be described through values of a number of variable settings (called decision variables). Feasibility (i.e., that at least one solution represented as values of the decision variable settings can be found) is generally determined by a set of mathematical equations or inequalities (thus, the name mathematical programming). The selection of a best solution to the decision variables, if one exists, is guided by one or more equations, usually prefaced by the word maximize or minimize.

Which solution method to choose among these techniques is generally determined by the form of the variables, constraints, and objective function. Thus, some “modeling” (stating what the variables are, describing the decisions, and describing the system and decision problem in terms of the variables, that is, the objective and constraint functions) must usually take place before practitioners can determine the appropriate micro-solution methodology. For example, if all constraint and objective functions are linear with respect to the decision variables, then linear programming micro-methodologies are appropriate. Linear programming is usually the starting point for most undergraduate textbooks and courses in introductory operations research; see, for example, Ref. [14]. The standard micro-solution methodology for linear programming is the simplex method, which dates back to the early origins of operations research (see Ref. [42]).

The simplex method, invented by George Dantzig (considered to be one of the pioneers of operations research [43]), is a methodology that systematically advances and inspects solutions at corner points of a feasible region, effectively moving along the exterior frame of the region. In April 1985, operations research history was made again when Karmarkar presented the interior point method to a standing-room-only crowd at the ORSA/TIMS conference in Boston, Massachusetts [44,45]. The new method proposed moving through the interior of the feasible region instead of striding along from extreme point to extreme point [46]. It held implications not only for solving linear programming models, but also for solving nonlinear programming models, which are distinguished by the fact that one or more of the constraints or the objective function(s) is nonlinear with respect to decision variables.
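
As a concrete, if tiny, illustration of the linear programming form, the sketch below solves a hypothetical two-variable product-mix model with SciPy's off-the-shelf linprog solver (the particular coefficients are invented for illustration):

from scipy.optimize import linprog

# Hypothetical product-mix LP:
#   maximize 4*x1 + 3*x2
#   subject to  2*x1 +   x2 <= 10
#                 x1 + 3*x2 <= 15
#                 x1, x2 >= 0
# linprog minimizes, so the objective coefficients are negated.
c = [-4.0, -3.0]
A_ub = [[2.0, 1.0],
        [1.0, 3.0]]
b_ub = [10.0, 15.0]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal x:", result.x, " optimal objective:", -result.fun)

Here result.x holds the optimal variable values and -result.fun the maximized objective (roughly x = (3, 4) with objective 24); the solver's choice of simplex- or interior-point-style algorithm is hidden behind the package interface.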

As the number of decision variables and constraints become large, large-scale optimization techniques become important to all forms of math programs–these micro-methodologies involve solution strategies such as relaxation (i.e., removing one or more constraints to attempt to make the problem “easier” to solve), decomposition (i.e., breaking the problem up into smaller, easier-to-solve versions), and so on. Finding more efficient techniques for larger problem sizes (i.e., problems that have more variables and constraints, perhaps in the thousands or millions) has become the topic of many research theses and dissertations by graduate students in operations research and management science.

Among the most challenging problems in this space are the models where variables are required to be integers (i.e., integer programming or mixed-integer programming) or discrete (leading to various combinatorial optimization methods). While many specialty techniques exist for integer and mixed-integer (combinatorial/discrete) models, the branch-and-bound technique remains the de facto general standard for attempting to solve the most difficult, that is, NP (nondeterministic polynomial time) decision problems (see Refs [47,48]). Branch and bound is an example of implicit enumeration and, while not as old as the simplex method, is one of the oldest (and perhaps most general) solution techniques in operations research.
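
The following minimal sketch illustrates implicit enumeration on a small, hypothetical 0/1 knapsack instance, using the fractional (LP-relaxation) bound to prune branches; it is a teaching sketch, not a production solver:

# Branch and bound for a hypothetical 0/1 knapsack instance.
values = [60, 100, 120, 40]
weights = [10, 20, 30, 15]
capacity = 50

# Sort items by value density so the fractional relaxation bound is greedy.
order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
v = [values[i] for i in order]
w = [weights[i] for i in order]
n = len(v)

def relaxation_bound(k, cap, val):
    """Upper bound: take remaining items greedily, allowing one fraction."""
    bound = val
    for i in range(k, n):
        if w[i] <= cap:
            cap -= w[i]
            bound += v[i]
        else:
            bound += v[i] * cap / w[i]
            break
    return bound

best = 0

def branch(k, cap, val):
    """Explore item k; prune when the bound cannot beat the incumbent."""
    global best
    if val > best:
        best = val
    if k == n or relaxation_bound(k, cap, val) <= best:
        return
    if w[k] <= cap:                      # branch: include item k
        branch(k + 1, cap - w[k], val + v[k])
    branch(k + 1, cap, val)              # branch: exclude item k

branch(0, capacity, 0)
print("best knapsack value found:", best)

Commercial and open-source MIP solvers implement far more sophisticated bounding, branching, and cutting strategies, but the pruning idea (implicit enumeration) is the same.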

To summarize, mathematical programming techniques span the following:

  • Linear Programming (LP). These models are characterized by constraints and an objective function, which are linear with respect to decision variables. The canonical reference is Ref. [49]. Introductory operations research textbooks by Hillier and Lieberman [16] and by Winston [13] provide anchoring chapters on linear programming. While most textbook coverage of linear programming focuses on the simplex method, Ref. [33] provides an entry-level version of the interior point method that students and practitioners may find helpful before turning to more complex descriptions, such as those found in Refs [44,46].
  • Nonlinear Programming (NLP). These models are characterized by constraints or objective functions that are nonlinear with respect to decision variables. Comprehensive treatment can be found in Refs [50,51]. References [13,16] provide introductory material covering the most widely used methods and optimality conditions (i.e., Karush–Kuhn–Tucker, or KKT).

Examining the structure of a nonlinear programming model reveals that there are times when an NLP may be transformed into an LP formulation, which is preferable because of the general availability of off-the-shelf LP packages. However, it should be noted that one of the most common mistakes practitioners make is to use an LP solution package outright on an NLP formulation.

Figure 5.11 shows a classic visualization of a feasible region for math programming, in this case a linearized feasible region (with two decision variables) and either a linear or a nonlinear objective function. In this example, it was possible to obtain a valid linear feasible region by converting a nonlinear inequality (system reliability as a function of the decision variables, the component failure intensities λ1 and λ2) into a linear one using a natural logarithm transformation.

img

Figure 5.11 Example from Ref. [52] showing a linearized (system reliability) feasible region as a function of two decision variables, LAMBDA1 (λ1) and LAMBDA2 (λ2). The contours of the cost-to-attain function, linear (top) or nonlinear (bottom), show the optimized solution either at a corner point (top) or at a constraint midpoint (bottom), respectively.

In contrast to linear programming, the methods deployed for nonlinear programming generally follow a chain of if-then-else deductions, where one chooses a micro-solution methodology based on the convexity or concavity (or pseudo- or quasi-convexity) of the feasible region and objective function. The best way to determine which micro-methodology to use for a nonlinear program is to write down the model variables, constraints, and objective function, mathematically characterize their forms, and then consult one of the classic textbooks, such as Refs [50,51], as a guide to choosing the most appropriate solution technique.

  • Integer and Mixed Programming. These models are characterized by some or all of the decision variables required to be integer in value. Introductory operations research textbooks by Hillier and Lieberman [16] and Winston [13] both provide excellent chapters on this topic. More in-depth treatment of techniques for handling these types of decision models can be found in Ref. [53].
  • Discrete, Combinatorial, and Network Optimization. These models are characterized by some or all of the decision variables being required to be discrete in nature. Techniques for handling these types of combinatorial decisions can be found in classics by Bertsimas and Tsitsiklis [54] and Papadimitriou and Steiglitz [47]. In some cases of discrete or combinatorial decision problems (i.e., where the feasible region is generally countable, consisting of discrete solution options as opposed to lying in a continuous space), we may choose a method tailored to the specific problem instead of working with the mathematical programming form directly. Discrete and combinatorial problems usually involve some kind of search through a space, and often that space is best represented by a complex data structure (such as a tree or a network; see, for example, Figure 5.12). Examples include the shortest-path problem, the minimum-spanning-tree problem, the traveling salesman problem, the knapsack, bin-packing, set-covering, and clique problems, scheduling and sequencing problems, and so on. For details on the techniques behind these micro-solution methodologies, see Refs [47,48,53], which are the classic texts by the pioneers of integer and discrete/combinatorial methods. For network-specific algorithms and methods, see Ref. [55].
img

Figure 5.12 Example of an undirected network. The edge set is a configuration of the ARPANET from about 1977 [56].

Some other specialty forms that we will not cover here exist, including dynamic programming, multiobjective or multicriteria programming, and stochastic and constraint programming.

Group II: Relationship to Macro-Methodologies

While the specific micro-methodology chosen will depend on the type of problem faced, the assumptions made by the practitioner, and the model selected, the success of the models and techniques in this group hinges on certain macro-methodology steps, particularly business understanding and problem definition (including assumptions). As mentioned earlier, the scientific method and the exploratory micro-methodologies are appropriate for fitting model parameters and testing various assumptions (e.g., linearity, pseudo-convexity). The OR project methodology steps were designed specifically for projects using the micro-methods in this group. However, a few of the CRISP-DM steps can also be applicable, for example, when data are sought for parameter fitting–specifically the data understanding and data preparation steps. In some cases, more advanced transformations of data are needed to prepare them for use in these modeling techniques. In fact, in some cases the analytics we would like to introduce as parameters are derived from forecasting–that is, a special class of predictive modeling, which we turn to next.

Group II: Takeaways

Historically, the operations research discipline has been a collection of quantitative modeling methodologies that have their roots in logistics and resource planning. Over the past two decades, with the surge in data available for problem-solving, “research on operations” (i.e., operations and systems understanding), and model building, an emphasis of operations research (and management science) has shifted to embrace insights that can be derived directly from data. In this section, many of the traditional OR modeling approaches and their techniques were presented with the main message that these are largely models that have solution techniques that are independent of, but not isolated from, data.

5.3.5 Group III: Micro-Solution Methodologies Using Models Where Techniques to Find Solutions Are Dependent on Data

This section considers the final group of micro-methodologies, that is, those where the models involve solution techniques that are not possible to execute unless there are data present. In other words, they are data-dependent. Examples of solutions, in these cases, are the explanation or creation of additional system entity attributes or a prediction about a future event based on a trend that is observable in the data.

Group III: Problems of Interest

This group of micro-methods is most often used in conjunction with data mining. While these problems share the theme of exploration and discovery with Group I, the outcomes tend to be broader in nature and with fewer assumptions (e.g., normality of data). Problems relevant here include the desire to create categories of things according to common or similar features; finding patterns to explain circumstances or phenomena, that is, seeking understanding through common factors; understanding trends in processes and systems over time (and/or space); and understanding the relationships between cause and effect for the purpose of predicting some future outcome given similar circumstances.

Typical examples of problems of interest include understanding which retail items tend to be purchased together; sorting research articles into categories based on similarities in content identified through common keywords, concepts, methodology, or conclusions; determining whether a fall in sales revenue is due to a trend in consumer preferences; determining whether a pattern of behavior exists (e.g., are referees more likely to give red cards to soccer players of darker skin tone?, which was studied in Ref. [2]); and others.

Group III: Relevant Models

Some of the main models used in this micro-methodology group include the following:

  • Generalized linear models are a collection of models including traditional linear and logistic regression models. Logistic models have discrete (category) response variables.
  • Common factor and principal component models are used to find the common denominators in groups.
  • Clustering models are used to find groupings of things.
  • Classification models are used to determine which set something belongs to. The main difference from clustering methods is that these are generally considered supervised learning (i.e., a training set is known and is used to guide membership), whereas clustering techniques are generally unsupervised. Note that supervised and unsupervised model building are described in detail in Chapter 6, Model Building.
  • Graph-based models are general purpose data structures that support various models in this group.
  • Time series models, for example, ARMA (auto-regressive-moving-average model), are used to model trends over time.
  • Neural networks are generally used to direct inference in pattern recognition.

It should be noted that there is intersection with Groups I and II. Specifically, these methods borrow heavily from statistical analysis and even optimization (e.g., by solving an underlying total distance minimization problem).

Group III: Data Considerations

By design, this group is most distinguished in consideration of the data dependency on model building and solution techniques. Furthermore, data for this group of micro-methods are generally assumed to be abundant–for example, digital history of sales transactions, Internet sites visited, searched keywords, and so on. Data are often collected by observing digital interactions by a large number of people with systems such as Internet services and applications via browser connections or a mobile device that has passive data collection (e.g., location services) allowed, either intentionally or unintentionally.

A key distinction of these data is that their collection is not planned in the way that exploratory methods tied to the scientific method of inquiry involve experimental design, observation, and deliberate data collection. In fact, for data to be considered “usable” in this group, they often must be interpreted, or derived by mining, analyzing, inferring, or applying models and techniques to create meaningful new features.

Group III: Solution Techniques

The following are some of the most common micro-solution techniques for this group:

  • Generalized Linear Model Techniques–See Refs [26,57] for a review of techniques. Techniques include the following:
    • – Imputation of missing data, for systematically replacing missing values with constants or other derived values.
    • – The method of least squares, which finds the model parameters that minimize the sum of squared residual (i.e., distance to the fitted model) terms.
    • – Statistical analysis and inference (e.g., estimation and hypothesis testing) for evaluating models.
  • Factor/principal component analysis techniques include the following:
    • – NIPA (noniterated principal axis method)
    • – IPA (iterated principal axis method)
    • – ML (maximum likelihood factor analysis method)

      All of these find common factors while differing in their underlying computational approach. See Ref. [58] for details.

  • Clustering analysis uses a variety of techniques, depending on the nature of the data. Some specific techniques include the following:
    • – Univariate and bivariate plots such as histograms, scatter plots, boxplots, and others may be used to visually aid the clustering process (see Figures 5.3–5.5).
    • – Graph-based techniques may be used to generate additional features, such as distance and neighborhoods.
    • – Hierarchical and nearest neighbor clustering, see Ref. [59].
    • – Specialty methods such as collaborative filtering, market basket association, or affinity analysis may be used for specific problems, such as finding the items in a retail shopping basket that are generally purchased together (see Ref. [60]).
    • – Linear models for classification, see Refs [24,26,61].
  • Classification Methods: Some specific techniques include the following:
    • – Linear classifiers, such as logistic regression
    • – Support vector machines (SVMs)
    • – Partitioning
    • – Neural networks
    • – Decision trees

    See Refs [26,61,62] for overviews and comments on these and related methods.

  • Graph-based modeling techniques are often used to derive features of components for other types of models. For example,
    • – a shortest-path algorithm helps to identify nearest neighbors for clustering.
    • – a minimum spanning tree helps to determine connected subcomponents in a general graph.

      See Ref. [55] for a comprehensive treatment of graph models, network-based problems, and an exhaustive accounting of known algorithms. Hastie et al. [26] extend these basic graphical models for statistical machine learning techniques, including neural networks.

  • Time series models, for example, ARMA (autoregressive–moving-average models); see Ref. [63] for an exhaustive treatment of theory and techniques. See Figure 5.13 for an example of raw time series data, and the minimal fitting sketch that follows it.
  • Neural networks methods are described in detail in Ref. [61].
img

Figure 5.13 Example of raw time series data before ARMA methods are applied. Unit dispense history for a single food item at a New York City digital food pantry from January 2, 2013 to April 24, 2017.
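
As a minimal illustration of fitting such a model, the sketch below generates a hypothetical series standing in for the data of Figure 5.13 and fits an ARMA(1,1) model using the statsmodels package (one of several packages that implement these techniques):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily dispense counts: an AR(1)-style series around a mean of 50.
rng = np.random.default_rng(9)
n = 300
eps = rng.normal(0, 5, size=n)
series = np.empty(n)
series[0] = 50.0
for t in range(1, n):
    series[t] = 50.0 + 0.6 * (series[t - 1] - 50.0) + eps[t]

# Fit an ARMA(1, 1) model (an ARIMA model with no differencing, d = 0).
fitted = ARIMA(series, order=(1, 0, 1)).fit()
print(fitted.params)                 # estimated constant, AR, MA, and variance terms

# Forecast the next 7 days from the fitted model.
print(fitted.forecast(steps=7))

The fitted coefficients and short-horizon forecasts are the kind of outputs one might then feed, as parameters, into a downstream planning or optimization model.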

Group III: Relationship to Macro-Methodologies

While CRISP-DM is likely the most commonly used macro-methodology for this group, analytics projects leveraging data-dependent methods are likely to benefit from any and all macro-methodologies. In fact, because this set of methods is most closely related to the evaluation and discovery of complex cause-and-effect relationships, as well as to differentiation (through classification and categorization, which are sometimes prone to discrimination that can lead to inequity and unfair treatment of groups of people), practitioners should take the utmost care in verifying, validating, and creating project documentation that promotes study replication.

Group III: Takeaways

While it may seem that this group of methods is all about the data–because they are data-dependent–that is not really true. Like all analytics solution methods, it is really still all about the problem, because, to have meaning, solutions must solve a problem. Also note that these analytics methods are sometimes referred to as the advanced analytics methods. The author would like to point out that they are, in fact, the newest, least established, and least proven in practice of all the methods in our discipline. This implies that they are the least advanced analytics methods and suggests that we should all be working harder to deepen their theory and rigor–which is actually what we are good at as an INFORMS community.

5.3.6 Micro-Methodology Summary

To summarize the micro-methodologies, we emphasize that analytics problems encountered in practice seldom require techniques that fall into only one micro-methodology category. Techniques in one category may build on techniques from another–for example, as noted earlier, linear regression modeling within the data-dependent methodologies relies on solving an underlying optimization problem. Regression modelers who use software packages to fit their data may not be aware that a least squares optimization problem is being solved in the background. However, to truly understand our methods and results, it is important to be aware of the background mechanics and connections. This specific type of dependency is, in fact, common–particularly in the realm of contemporary statistical machine learning.

Projects in practice often leverage methodologies in progression as well–for example, using descriptive statistics to explore and understand a system in the early stages of a project may lead to the building of an optimization model to support a specific business or operations decision. If the decision needs to be made for a scenario that will take place in the future, then forecasts may be used to specify the optimization model's input parameters. At the same time, it is important to keep in mind that there may be trade-offs to consider when combining different techniques. For instance, in this same example project requiring forecasted parameters of an optimization model, the practitioner has a choice between using a sophisticated predictive technique that yields more accurate forecasts but leads to a complex, difficult-to-solve nonlinear optimization model, or using a simpler predictive approach that sacrifices some forecast accuracy but leads to a simpler, linear optimization model.

The micro-solution methods available to analytics practitioners are many. Making a selection among them is analogous to being an artist who must decide among watercolor, oil, or acrylic paint; decide what kind of surface to paint on (canvas, wood, paper, and so on); decide how big to make the piece; and so on. The analogy holds in another way as well: you are most likely to pick the method you are most familiar with, just as the watercolor specialist is unlikely to choose charcoal for a new painting of the sunset.

5.4 General Methodology-Related Considerations

5.4.1 Planning an Analytics Project

A critical success factor in technical projects, particularly where there is any element of exploration and discovery, is project planning. This is no different for analytics projects. In fact, when one adds the expectation for a usable outcome (i.e., a tested and implemented process coded in software, running on real data, complete with a user interface and full documentation, all while providing smashing insights and impactful results), the project risks and failure odds go up fast. As mentioned in the macro-methodology section, the macro-methods align nicely with project planning because they provide a roadmap that equates to the high-level set of sequential activities in an analytics project. When macro- and micro-method planning are considered together, the skills and details of activities can be revealed so that task estimation and dependency mapping are possible. In fact, one of the traditional applications of network models taught to students of operations research is PERT (program evaluation and review technique)/CPM (critical path method)–a micro-method that practitioners can apply to the macro-methodology to help smoothly plan and schedule a complex set of related activities (see Ref. [14]).
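
A minimal CPM-style sketch follows; the task names, durations, and precedence relations are hypothetical, loosely mirroring macro-methodology steps, and the forward pass simply computes the longest path to each task:

# Forward pass of the critical path method (CPM) on a hypothetical project plan.
durations = {                       # task durations in days (illustrative only)
    "business_understanding": 5,
    "data_understanding": 7,
    "software_requirements": 4,
    "data_preparation": 10,
    "modeling": 8,
    "evaluation": 4,
    "deployment": 6,
}
predecessors = {                    # precedence relations (illustrative only)
    "business_understanding": [],
    "data_understanding": ["business_understanding"],
    "software_requirements": ["business_understanding"],
    "data_preparation": ["data_understanding"],
    "modeling": ["data_preparation", "software_requirements"],
    "evaluation": ["modeling"],
    "deployment": ["evaluation", "software_requirements"],
}

earliest_finish = {}

def finish(task):
    """Earliest finish time: longest path from project start through this task."""
    if task not in earliest_finish:
        start = max((finish(p) for p in predecessors[task]), default=0)
        earliest_finish[task] = start + durations[task]
    return earliest_finish[task]

project_length = max(finish(t) for t in durations)
print("minimum project duration (days):", project_length)

A full PERT/CPM treatment would also compute latest start times and slack to identify the critical path explicitly; the forward pass above gives only the minimum project duration.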

When there are expectations for a usable software implementation outcome, practitioners can augment their macro-methodology steps with appropriate software engineering steps. The software engineering requirements step is recommended for planning the desired outcome's function, as well as usability needs and assumptions. In fact, complex technical requirements, such as integration into an existing operations environment or data traceability for regulatory compliance, are best considered early, in requirements steps that complement the domain and data understanding steps.

Overall, while prototyping and rapid development often coincide with projects of a more exploratory nature, which analytics projects often are, some project planning and ongoing project management are the best way to minimize the risks of failure, budget overruns, and outcome disappointments.

5.4.2 Software and Tool Selection

Most if not all of our analytics projects need some computational support in the form of software and tools. Aside from DIY software, which is sometimes necessary when new methods or new extensions are developed for a project, most micro-solution methods are available in the form of commercial and/or open-source software.

Without intending to endorse any specific software package or brand, a few packages are named here to provide illustrations of appropriate packages, while leaving to the reader to decide which packages are most appropriate for their specific project needs.

For (Group I) exploration, discovery, and understanding methods, popular packages include R, Python, SAS, SPSS, MATLAB, MINITAB, and Microsoft EXCEL. Swain [64] provides a very recent (2017) and comprehensive survey of statistical analysis software, intended for the INFORMS audience. Most of these packages also include GLM, factoring, and clustering methods needed to cover (Group III) data-dependent methods, as well.

For Group II, a fairly recent survey of simulation software, again by Swain [65], and a very recent linear programming software survey by Fourer [66] are resources for selecting tools to support these methods. An older but still useful nonlinear programming software survey by Nash [67] is a resource to practitioners. MATLAB, Mathematica, and Maple continue to provide extensive toolboxes for nonlinear optimization needs. For branch and bound, the IBM ILOG CPLEX toolbox is freely available to academic researchers and educators. COIN-OR, Gurobi, GAMS, LINDO, AMPL, SAS, MATLAB, and XPRESS all provide various toolboxes across the optimization space. More and more, open-source libraries related to specific languages, such as Python, now offer tools that are ready to use–for example, StochPY is a Python library addressing stochastic modeling methods.

As a final note, practitioners using commercial or open-source software packages for analytics are encouraged to use them carefully within a macro-solution methodology. In particular, verification, that is, testing to make sure the package provides correct results, is always recommended.

5.4.3 Visualization

Visualization has always been important to problem-solving. Imagine in high school having to study analytical geometry without 3D sketches of cylinders. Similarly, operations research has a strong history of illustrating concepts through visualization. Some examples include feasible regions in optimization problems, state space diagrams in stochastic processes, linear regression models, various forms of data plots, and network shortest paths. In today's world of voluminous data, sometimes the best way to understand data is to visualize it, and sometimes the only way to explain results to an executive is to show a picture of the data and something illustrating the “solution.”

Other chapters in this book cover the topic of analytics and visualization, for example, see Chapters 3 and 6. The following points regarding visualization from a solution methodology perspective are provided in order to establish a tie with the methods of this chapter:

  • Analytics and OR researchers and practitioners should consider visualizations that support understanding of raw data, understanding of transformed data, enlightenment of process and method steps, and solution outcomes.
  • Visualization in analytics projects has three forms, which are not always equivalent:
    1. Exploratory–that is, the analyst needs to create quick visualizations to support their exploration and discovery process. The visualizations may help to build intuition and give new ideas, but are not necessarily of “publish” or “presentation” quality.
    2. Presentation–that is, the analyst needs to create visualizations as part of a presentation of ideas, method steps, and results to sponsors, stakeholders, and users.
    3. Publishing–that is, the analyst wants to create figures or animations that will be published or posted and must be of suitable quality for archival purposes.

5.4.4 Fields with Related Methodologies

Many disciplines are using analytics in research and practice. As shown in the macro-methodology section summary, all macro-methodologies are derivatives of the scientific method. In fact, many of our micro-solution methodologies are shared and used across disciplines. As a community, we benefit from and have influenced shared methods with the fields of science, engineering, software development and computer science (including AI and machine learning), education, and the newly evolving discipline of data science. This cross-pollination helps macro- and micro-solution methodologies to stay relevant.

5.5 Summary and Conclusions

This chapter has presented analytics solution methodologies at both the macro- and microlevels. Although it makes no claim to cover all possible solution methodologies comprehensively, hopefully the reader has found the chapter to be a valuable resource and a thought-provoking reference to support the practice of an analytics and OR project. The chapter's goals–to clarify the distinction between macro- and micro-solution methodologies, to provide enough detail for a practitioner to incorporate them into the design of a high-level analytics project plan according to some macro-level solution methodology, and to provide guidance for assessing and selecting micro-solution methodologies appropriate for a new analytics project–should have come through in the earlier sections. In addition to a few pearls scattered throughout the chapter, we conclude by stating that solution methodologies can help the analytics practitioner, who in turn can help our discipline at large, which can then help more practitioners. That is a scalable and iterative growth process, accomplished by reporting our experiences at conferences and through peer-reviewed publication, which often forces us to organize our thoughts in terms of methodology anyway–so we might as well start with it, too! The main barriers to solution methodology seem to be myths; dispelling some of the myths of analytics solution methodology is the subject of these final few paragraphs.

5.5.1 “Ding Dong, the Scientific Method Is Dead!” [68]

The scientific method may be old, but it is not dead yet. By illustrating its relationship to several macro-solution methodologies in this chapter, we've shown that the scientific method is indeed alive and well. Arguments to use it literally may be futile, however, since the world of technology and analytics practice often places time and resource constraints on projects that demand quick results. Admittedly, it is quite possible that rigor and systematic methodology could lead to results that are contrary to the “desired” outcome of an analytics study. Thus, without intending to, our field of practice may be inadvertently missing the discovery of truth and its consequences.

5.5.2 “Methodology Cramps My Analytics Style”

Imagine for a moment that analytics practitioners used systematic solution methodologies to a greater extent, particularly at the macrolevel, and then published their applied case studies following an outline that detailed the steps they had followed. Our published applied literature could then be a living source of experience and practice to emulate, not only for learning best practices and new techniques but also for learning how to apply and perfect the old standards. More analytics projects might be completed faster because they wouldn't have to “start from scratch” and reinvent a process of doing things. Suppose that analytics practitioners, in addition to putting rigor into defining their problem statements, also enumerated their research questions and hypotheses in the early phases of their projects. Would we publish experiences that report rejecting a hypothesis? Does anyone know of at least one published science research paper that reports rejecting a hypothesis, let alone one in the analytics and OR/MS literature?

Research articles on failed projects rarely (probably never) get published, and these could quite probably be the valuable missing links for helping practitioners and researchers in the analytics/OR field be more productive, do higher-quality work, and thrive by learning from studies that show what doesn't work. When authentically applied, the scientific method should result in a failed hypothesis every once in a while, reflecting the true nature of exploration and the risks we take as researchers of operations and systems. The modern deluge of data allows us to inquire and test our hunches systematically, without the limitations and scarcity of observations we faced in the past. Macro-solution methodologies, either the scientific method or any derivative of it (which is just about all of them), could relieve analytics project cramps not only by giving us efficient and repeatable approaches but also by recognizing that projects sometimes “fail” or reject a null hypothesis–doing so within the structure of a methodology allows the result to be reported in an objective, thoughtful manner that others can learn from and that can help practitioners and researchers avoid reinvention.

5.5.3 “There Is Only One Way to Solve This”

We've all heard the saying, if all you have is a hammer, then every problem looks like a nail. This general concept, phrased in a number of different ways since first put forward in the mid-1960s, is credited to Maslow [69], who authored the book Psychology of Science. In our complex world, there are usually many alternate ways to solve a problem. These choices, in analytics projects, may be listed among the micro-methodology techniques described in this chapter or elsewhere. Sometimes, there are well-established techniques that work just fine, and sometimes a new technique needs to be created. The point is that there are many ways to solve a problem, even though many of us tend to first resort to our favorite ways because those tend to align with our personal experiences and expertise. That's not a bad approach to project work, because experience usually means that we are using other knowledge and lessons learned. However, behind this is the danger of possibly using the wrong micro-solution methodology. In fact, the problem of an ill-defined problem can lead to overreliance on certain tools–often the most familiar ones. What does this mean? That in our macro-solution methodology, steps such as understanding the business and data, defining the problem, and stating hypotheses are useful in guiding us to which micro-methodologies to choose from and thus avoiding the potential pitfalls of picking the wrong micro-method or overusing a solution method.

5.5.4 “Perceived Success Is More Important Than the Right Answer”

In math class, school teachers might make a grading key that lists the right answer to each exam or homework problem. In practice, however, there is no solutions manual or key for checking whether an analytics project outcome is right or wrong. We have steps within the various macro-solution methodologies, for example, verification, that help us make the best case for the outcome being considered “right,” but for the most part the correctness of an analytics project outcome is elusive, and projects are usually judged by the perceived results of the implementation of a solution. In analytics and OR practice, there are cases where the implementation results were judged as wildly successful by one measure and unsuccessful by another: for example, an analytics/OR project recognized as an INFORMS Edelman award finalist for its contribution to a company's saving of over $1 billion might still be judged as not successful because the company creating the OR solution was not able to commercialize the assets and find practitioners in its ranks to learn and deploy them, and thus could not reproduce the solution as a profitable product (see, for example, Ref. [70]).

Documentation of reasons for analytics project failures probably exists, but it is rarely reported as such. Plausible reasons for failure (or, perhaps more accurately, “lack of perceived success”) include the following ones: the solution was implemented, but there was no impact, or it was not used; a solution was developed but never implemented; a viable solution was not found; and so on. Because of the relationship between analytics projects and information technology and software, some insights can be drawn from those more general domains. Reference [71] provides an insightful essay on why IT projects fail that is loaded with examples and experiences, many with analogues and wisdom transferable back to analytics. Software project failures have been studied in the software engineering community for over two decades, with various insights; see, for example, Ref. [72]. The related area of systems engineering offers good general practices and a guide to systematic approaches: One of the most recognized for the field of industrial engineering is by Blanchard and Fabrycky in its fifth edition [73].

It is important to remember that in practice ultimate perceived success or failure of an analytics project may not mean “finding the right answer,” that is, finding the right solution. By perceived success, we mean that an analytics solution was implemented to solve a real-world problem with some meaningful impact acknowledged by stakeholders. Conversely, perceived failure means that for one of a number of reasons, the project was deemed not successful by some or all of the stakeholders. Not unlike some micro-solution methodologies of classic operations research, we have necessary and sufficient conditions for achieving success in an analytics project, and they seem to be related to perception and quality. Analytics practitioners need to judge these criteria for their own projects, while perhaps keeping in mind that there have been well-meaning and not-so-well-meaning uses of data and information to create perceptions and influence. See, for example, How to Lie with Statistics by Darrell Huff [74] and the more contemporary writing, which is similar in concept, How to Lie with Maps by Mark Monmonier [75].

The book How to Lie with Analytics has not been written yet, but its subject is unfortunately likely already practiced. By practicing some form of systematic solution methodology, macro and micro, in our analytics projects, we may help our field build an anchoring credibility that remains resilient when that book does come out.

5.6 Acknowledgments

Sincere thanks to two anonymous reviewers for critically reading the chapter and suggesting substantial improvements and clarifications; Dr. Lisa M. Dresner, Associate Professor of Writing Studies and Rhetoric at Hofstra University, for proofreading, editing, and rhetoric coaching; and Dr. Joana Maria, Research Staff Member and Data Scientist at IBM Research, for inspiring technical discussions and pointers to a number of relevant articles.

References

  1. Eves H (2000) Analytic geometry, in Beyer WH, ed., Standard Mathematical Tables and Formulae, 29th ed. (CRC Press, Inc., Boca Raton, FL), pp. 174–207.
  2. Silberzahn R, Uhlmann EL, Martin D, Anselmi P, Aust F, et al. (2015) Many analysts, one dataset: making transparent how variations in analytical choices affect results. Open Science Framework. Available at https://osf.io/gvm2z/ (accessed July 2, 2017).
  3. INFORMS (2015) Constitution of the Institute for Operations Research and the Management Sciences. Technical report, Catonsville, MD.
  4. SAS (2017) What is analytics? Available at https://www.sas.com/en_us/insights/analytics/what-is-analytics.html (accessed June 7, 2017).
  5. Wikipedia. Scientific method. Available at https://en.wikipedia.org/wiki/Scientific_method (accessed July 2, 2017).
  6. Harris W (2008) How the scientific method works. Available at http://science.howstuffworks.com/innovation/scientific-experiments/scientific-method.htm (accessed January 14, 2008).
  7. Hoefling H, Rossini A (2014) Reproducible research for large-scale data, in Stodden V, Leisch F, Peng RD, eds., Implementing Reproducible Research, 1st ed., The R Series (Chapman and Hall/CRC), pp. 220–240.
  8. Stodden V, Leisch F, Peng RD, eds. (2014) Implementing Reproducible Research, 1st ed., The R Series (Chapman and Hall/CRC).
  9. Foreman H (2015) What distinguishes a good manuscript from a bad one? Available at https://www.elsevier.com/connect/get-published-what-distinguishes-a-good-manuscript-from-a-bad-one (accessed July 2, 2017).
  10. Foster KR, Skufca J (2016) The problem of false discovery: many scientific results can't be replicated, leading to serious questions about what's true and false in the world of research. IEEE Pulse 7 (2): 37–40.
  11. Hardt M, Ullman J (2014) Preventing false discovery in interactive data analysis is hard. Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS'14), Washington, DC, IEEE Computer Society, pp. 454–463.
  12. Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2 (8): 696–701.
  13. Winston WL (2003) Operations Research: Applications and Algorithms, 4th ed. (Duxbury Press).
  14. Hillier FS, Lieberman GJ (2002) Introduction to Operations Research, 7th ed. (McGraw-Hill).
  15. Sargent RG (2007) Verification and validation of simulation models, in Henderson SG, Biller B, Hsieh MH, Shortle J, Tew JD, Barton RR, eds., Proceedings of the 2007 Winter Simulation Conference, pp. 124–137.
  16. Hillier FS, Lieberman GJ, Nag B, Basu P (2009) Introduction to Operations Research, 9th ed. (McGraw-Hill).
  17. Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R (2000) CRISP-DM 1.0 Step-by-Step Data Mining Guide. Technical report, The CRISP-DM Consortium.
  18. IBM (2011) IBM SPSS Modeler CRISP-DM Guide. Technical report.
  19. Wirth R (2000) CRISP-DM: towards a standard process model for data mining. Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29–39.
  20. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A (2015) The reusable holdout: preserving validity in adaptive data analysis. Science 349 (6248): 636–638.
  21. Pressman RS (1997) Software Engineering: A Practitioner's Approach, 4th ed. (McGraw-Hill).
  22. Zeigler BP (1976) Theory of Modelling and Simulation (John Wiley & Sons, Inc., New York).
  23. Law AM, Kelton WD (1999) Simulation Modeling and Analysis, 3rd ed. (McGraw-Hill).
  24. Kutner M, Nachtsheim C, Neter J, Li W (2004) Applied Linear Statistical Models, 5th ed. (McGraw-Hill/Irwin).
  25. Shearer C (2000) The CRISP-DM model: the new blueprint for data mining. Journal of Data Warehousing 5: 13–22.
  26. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics (Springer Science & Business Media).
  27. Provost F, Fawcett T (2013) Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, 1st ed. (O'Reilly Media, Sebastopol, CA).
  28. Wilder CR, Ozgur CO (2015) Business analytics curriculum for undergraduate majors. INFORMS Trans. Educ. 15 (2): 180–187.
  29. Walpole RE, Myers RH (1978) Probability and Statistics for Engineers and Scientists, 2nd ed. (Macmillan Publishing Co., Inc.).
  30. Hogg RV, Craig AT (1978) Introduction to Mathematical Statistics, 4th ed. (Macmillan Publishing).
  31. Box GEP, Hunter JS, Hunter WG (2005) Statistics for Experimenters: Design, Innovation, and Discovery, 2nd ed., Wiley Series in Probability and Mathematical Statistics (Wiley-Interscience).
  32. Draper NR, Smith H (1981) Applied Regression Analysis, 2nd ed. (John Wiley & Sons, Inc., New York).
  33. Nocedal J, Wright S (2006) Numerical Optimization, 2nd ed., Springer Series in Operations Research and Financial Engineering (Springer).
  34. Ross SM (2012) A First Course in Probability, 9th ed. (Pearson).
  35. Ross SM (1995) Stochastic Processes, 2nd ed. (John Wiley & Sons, Inc., New York).
  36. Suri R, Diehl GWW, de Treville S, Tomsicek MJ (1995) From CAN-Q to MPX: evolution of queuing software for manufacturing. INFORMS Interfaces 25 (5): 128–150.
  37. Little JDC (2011) OR FORUM: Little's law as viewed on its 50th anniversary. Oper. Res. 59 (3): 536–549.
  38. Little JDC, Graves SC (2008) Little's law, in Chhajed D, Lowe TJ, eds., Building Intuition: Insights from Basic Operations Management Models and Principles (Springer Science+Business Media, LLC, New York, NY), pp. 81–100.
  39. Kleinrock L (1975) Queueing Systems. Volume 1: Theory (John Wiley & Sons, Inc., New York).
  40. Kleinrock L (1976) Queueing Systems. Volume 2: Computer Applications (John Wiley & Sons, Inc., New York).
  41. Keenan PT, Owen JH, Schumacher K (2017) ABOK Chapter 1: introduction to analytics, in Cochran JJ, ed., Analytics Body of Knowledge (ABOK) (INFORMS).
  42. Nash JC (2000) The (Dantzig) simplex method for linear programming. Comput. Sci. Eng. 2 (1): 29–31.
  43. Gass SI, Assad AA (2011) Transforming research into action: history of operations research. INFORMS Tutorials in Operations Research, pp. 1–14.
  44. Karmarkar NK (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4 (4): 373–395.
  45. Tsuchiya T (1996) Affine scaling algorithm, in Terlaky T, ed., Interior Point Methods of Mathematical Programming (Springer), pp. 35–82.
  46. Adler I, Resende MGC, Veiga G, Karmarkar N (1989) An implementation of Karmarkar's algorithm for linear programming. Math. Program. 44 (1): 297–335.
  47. Papadimitriou CH, Steiglitz K (1982) Combinatorial Optimization: Algorithms and Complexity (Prentice Hall, Inc.).
  48. Parker RG, Rardin RL (1988) Discrete Optimization, Computer Science and Scientific Computing (Academic Press).
  49. Bazaraa MS, Jarvis JJ, Sherali HD (2009) Linear Programming and Network Flows, 4th ed. (John Wiley & Sons, Inc., New York).
  50. Bazaraa MS, Sherali HD, Shetty CM (2006) Nonlinear Programming: Theory and Algorithms, 3rd ed. (Wiley-Interscience).
  51. Bertsekas D (2016) Nonlinear Programming, 3rd ed., Athena Scientific Optimization and Computation Series (Athena Scientific).
  52. Helander ME, Zhao M, Ohlsson N (1998) Planning models for software reliability and cost. IEEE Trans. Softw. Eng. 24 (6): 420–434.
  53. Nemhauser G, Wolsey LA (1988) Integer and Combinatorial Optimization (John Wiley & Sons, Inc., New York).
  54. Bertsimas D, Tsitsiklis JN (1997) Introduction to Linear Optimization, Athena Scientific Series in Optimization and Neural Computation, 6 (Athena Scientific).
  55. Ahuja RK, Magnanti TL, Orlin JB (1993) Network Flows: Theory, Algorithms, and Applications (Prentice Hall).
  56. Lukasik SJ (2011) Why the ARPANET was built. IEEE Ann. Hist. Comput. 33 (3): 4–21.
  57. West BT, Welch KB, Galecki AT (2014) Linear Mixed Models: A Practical Guide Using Statistical Software, 2nd ed. (Chapman and Hall/CRC).
  58. Fabrigar LR, Wegener DT (2011) Exploratory Factor Analysis, Understanding Statistics (Oxford University Press).
  59. Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster Analysis, 5th ed., Wiley Series in Probability and Statistics (John Wiley & Sons, Inc., New York).
  60. Larose DT, Larose CD (2014) Discovering Knowledge in Data: An Introduction to Data Mining, 2nd ed., Wiley Series on Methods and Applications in Data Mining (John Wiley & Sons, Inc., New York).
  61. Bishop CM (2007) Pattern Recognition and Machine Learning, Information Science and Statistics (Springer).
  62. Domingos P (2012) A few useful things to know about machine learning. Commun. ACM 55 (10): 78–87.
  63. Brockwell PJ, Davis RA (1991) Time Series: Theory and Methods, 2nd ed., Springer Series in Statistics (Springer Science+Business Media).
  64. Swain JJ (2017) Statistical analysis software survey: the joys and perils of statistics. OR/MS Today 44 (1).
  65. Swain JJ (2015) Simulation software survey. OR/MS Today 43 (5).
  66. Fourer R (2017) Linear programming: software survey. OR/MS Today 44 (3).
  67. Nash SG (1998) Nonlinear programming software survey. OR/MS Today 26 (3).
  68. Harburg EY (1939) Ding-Dong! The Witch Is Dead, The Wizard of Oz (Metro-Goldwyn-Mayer Inc., Beverly Hills, CA).
  69. Maslow AH (1966) The Psychology of Science, 1st ed. (Joanna Cotler Books).
  70. Katircioglu K, Gooby R, Helander M, Drissi Y, Chowdhary P, Johnson M, Yonezawa T (2014) Supply chain scenario modeler: a holistic executive decision support solution. INFORMS Interfaces 44 (1): 85–104.
  71. Liebowitz J (2015) Project failures: what management can learn. IT Prof. 17 (6): 8–9.
  72. Charette RN (2005) Why software fails. IEEE Spectr. 42 (9): 42–49.
  73. Blanchard BS, Fabrycky WJ (2010) Systems Engineering and Analysis, 5th ed., Prentice Hall International Series in Industrial & Systems Engineering (Pearson).
  74. Huff D (1954) How to Lie with Statistics (W. W. Norton & Company) (reissue edition, October 17, 1993).
  75. Monmonier M (1996) How to Lie with Maps, 2nd ed. (University of Chicago Press).