The need for data analysis patterns (in software engineering)

B. Russo    Software Engineering Research Group, Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Abstract

When you call a doctor, you expect her to be prepared with a set of remedies for your disease. You would not be pleased to see her digging through a huge amount of clinical data while she makes a diagnosis and searches for a solution to your problem; nor would you expect her to propose a cure based on your case alone. The remedies she proposes are solutions to recurring problems that medical researchers identify by analyzing data from patients with similar symptoms and medical histories. Remedies are coded in a language that a doctor understands (eg, they tell when and how to treat a patient) and lead to meaningful conclusions for patients with the same disease (eg, they tell the probability that the disease will be defeated and with which consequences). Once found, such solutions can be applied over and over again. With the repeated use of a solution, medical researchers gain knowledge of the successes and failures of a remedy and can thereafter provide meaningful conclusions to future patients.

Keywords

Data analysis; Data types; Data analysis patterns; Scikit-learn initiative; Software engineering data; Anti-pattern

The Remedy Metaphor


The remedy metaphor helps describe how data analysis patterns are used in empirical sciences. First, a pattern is a coded solution to a recurring problem. When a problem occurs several times, we accumulate knowledge about the problem and its solutions. With this knowledge, we are able to code a solution in some sort of modeling language, which increases its expressiveness and reusability. Second, a pattern is equipped with a measure of success of the solution it represents. The solution and the measure result from the analysis of historical data and provide actionable insight for future cases.

Does it make sense to speak about patterns in modern software engineering? The answer can only be yes. Patterns are a form of re-use, and re-use is one of the key principles of modern software engineering. Why is this? Re-use is an instrument to make development more economical and to prevent human errors in software development processes. In their milestone book, Gamma et al. [1] introduced (design) patterns as a way “to reuse the experience instead of rediscovering it.” Thus, patterns as a form of re-use help build software engineering knowledge from experience.

Does it make sense to speak about patterns of data analysis in modern software engineering? Definitely yes. Data analysis patterns are “remedies” for recurring data analysis problems that arise during the conception, development, and use of software technology. They are codified solutions that lead to meaningful conclusions for software engineering stakeholders and can be reused on comparable data. In other words, a data analysis pattern is a sort of model, expressed in a language that logically describes a solution to a recurring data analysis problem in software engineering, and, where possible, it can be automated. As such, data analysis patterns help us “rise from the drudgery of random action into the sphere of intentional design” [4].

Why aren’t they already widely used? The majority of us hold the ingrained belief that the methods and results of individual software analyses pertain only to the empirical context from which the data was collected. Thus, in almost every new study, we re-invent the data analysis wheel. It is as if we devised a new medical protocol for every new patient. Why is this? One of the reasons relates to software engineering data and the role it has taken on over the years.

Software Engineering Data

A large part of modern software engineering research builds new knowledge by analyzing data of different types. To study distributed development processes, we analyze textual interactions among developers of open source communities and use social networks, complex systems, or graph theories. If we instead want to predict whether a new technology will take off in the IT market, we collect economic data and use Rogers’ theory of the diffusion of innovations. To understand the quality of modern software products, we mine code data and its evolution from previous versions. Sometimes, we also need to combine data of a different nature, collected from different sources and analyzed with various statistical methods.

Thus, data types can be very different. For example, data can be structured or unstructured (eg, lines of code vs free text in review comments and segments of videos), discrete or continuous (eg, number of bugs in software vs response time of web services), qualitative or quantitative (eg, complexity of a software task vs Cyclomatic code complexity), and subjective or objective (eg, ease of use of a technology vs number of backlinks to web sites). In addition, with the Open Source Software (OSS) phenomenon, cloud computing, and the Big Data era, data has become more distributed, bigger, and more accessible; but also noisy, redundant, and incomplete. As such, researchers must have a good command of analysis instruments and a feel for the kinds of problems and data they apply to.

Needs of Data Analysis Patterns

The need for instruments like data analysis patterns becomes more apparent when we want to introduce novices to the research field. In these circumstances, we encounter the following issues.

Studies do not report their data analysis protocols entirely or in sufficient detail. This implies that analyses are biased or cannot be verified. Consequently, secondary studies such as mapping studies and systematic literature reviews, which are mandated to synthesize published research, lose their power. Data analysis patterns provide software engineers with a verifiable protocol to compare, unify, and extract knowledge from existing studies.

Methods and data are not commonly shared. It is customary to develop ad-hoc scripts and keep them private, or to use tools as black-box statistical machines. In either case, we cannot access the statistical algorithm, verify it, or re-use it. Data analysis patterns are packaged to be easily inspected, automated, and shared.

Tool-driven research has some known risks. Anyone can easily download statistical tools from the Internet and perform sophisticated statistical analyses. Turing Award winner Butler Lampson [2] warns against the abuse of statistical tools: “For one unfamiliar with the niceties of statistical analysis it is difficult to view with any feeling other than awe the elaborate edifice which the authors have erected to protect their data from the cutting winds of statistical insignificance.” A catalog of data analysis patterns helps guide researchers in the selection of appropriate analysis instruments.

Analysis can easily be biased by the human factor. Reviewing papers on machine learning for defect prediction, Shepperd et al. [3] analyzed more than 600 samples from the highest-quality studies on defect prediction to determine which factors influence predictive performance, and found that “it matters more who does the work than what is done.” This striking result urges the use of data analysis patterns to make a solution independent of the researchers who conceived it.

Building Remedies for Data Analysis in Software Engineering Research

As in any research field, needs trigger opportunities and challenge researchers. Today, we are called to synthesize our methods of analysis [4], and examples of design patterns are already available [5]. We need more, though. The scikit-learn initiative [http://scikit-learn.org/stable/index.html] can help software engineers who need to solve problems with data mining, ie, the computational process of discovering patterns in data sets. The project provides online access to a wide range of state-of-the-art tools for data analysis as codified solutions. Each solution comes with a short rationale for its use, a handful of algorithms implementing it, and a set of application examples. Fig. 1 illustrates how we can find the right estimator for a machine learning problem.

Fig. 1 Flowchart displaying different estimators and analysis path for a machine learning problem. Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
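As a minimal sketch of how the flowchart is meant to be followed (our illustration, not part of the chapter), consider a small, labeled data set: the chart routes us to classification and suggests trying a linear support vector classifier first. The choice of the iris data set here is ours, purely for illustration.

```python
# Illustrative sketch: following the scikit-learn estimator flowchart for a
# small, labeled sample (<100k samples, predicting a category -> LinearSVC).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # labeled data -> classification
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=10000)                   # the flowchart's first suggestion
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
```

Note that the flowchart is itself a codified solution: if the first estimator underperforms, it tells us which branch to try next, rather than leaving the choice to ad-hoc experimentation.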

How can we import these, or similar tools, into the software engineering context? We first need to identify the requirements for a data analysis pattern in software engineering.

In our opinion, a data analysis pattern shall be:

 A solution to a recurrent software engineering problem

 Re-usable in different software engineering contexts

 Automatable (eg, by coding algorithms of data analysis in some programming language)

 Actionable (eg, the scikit-learn tools)

 Successful to a certain degree (eg, by representing state-of-the-art data analysis in software engineering)

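To make the “re-usable” and “automatable” requirements concrete, here is a hypothetical sketch of ours (the function name, the chosen statistical test, and the example data are our assumptions, not the chapter's): a recurring two-sample comparison, such as defect counts under two development processes, packaged as a self-contained function built on an exact permutation test.

```python
# Hypothetical sketch of a packaged data analysis pattern: a reusable,
# automatable answer to the recurring question "do two small groups of
# software measurements differ?", using an exact permutation test on means.
import itertools
import statistics

def mean_diff_pattern(a, b, alpha=0.05):
    """Exact two-sided permutation test on the difference of sample means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n, count, total = len(a), 0, 0
    # Enumerate every way of splitting the pooled data into groups of the
    # original sizes and count splits at least as extreme as the observed one.
    for idx in itertools.combinations(range(len(pooled)), n):
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        count += abs(statistics.mean(ga) - statistics.mean(gb)) >= observed
        total += 1
    p = count / total
    return {"p_value": p, "differ": p < alpha}

# Example: defect counts under two (hypothetical) processes.
result = mean_diff_pattern([3, 5, 4, 6, 5], [9, 8, 10, 7, 9])
```

Packaged this way, the solution meets several of the requirements above: it is reusable on any pair of numeric samples, automatable, and its verdict (the p-value) is a measure that can be inspected and compared across studies.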
Then the key steps to construct such a pattern will include, but not be restricted to:

 Mining literature to extract candidate solutions

 Identifying a common language to express a solution in a form that software engineers can easily understand and re-use. For instance, we can think of an annotated Unified Modeling Language (UML) or algorithm notation expressing the logic of the analysis

 Defining a measure of success for a solution

 Validating the candidate solutions by replications and community surveys to achieve consensus in the research community.

Reflecting on the current situation, we also see the need to codify anti-patterns, ie, what not to do in data analysis. Given the amount of evidence in our field, this must be a much easier task!

References

[1] Gamma E., Helm R., Johnson R., Vlissides J. Design patterns: elements of reusable object-oriented software. Boston, MA: Addison-Wesley Longman Publishing Company; 1995.

[2] Lampson B.W. A critique of an exploratory investigation of programmer performance under on-line and off-line conditions. IEEE Trans Hum Factors Electron. 1967;HFE-8(1):48–51.

[3] Shepperd M., Bowes D., Hall T. Researcher bias: the use of machine learning in software defect prediction. IEEE Trans Softw Eng. 2014;40(6):603–616.

[4] Johnson P., Ekstedt M., Jacobson I. Where's the theory for software engineering? IEEE Softw. 2012;29(5):94–95.

[5] Russo B. Parametric classification over multiple samples. In: Proceedings of the 2013 1st international workshop on data analysis patterns in software engineering (DAPSE), May 21, 2013, San Francisco, CA, USA. IEEE; 2013:23–25.

[6] Bird C., Menzies T., Zimmermann T. First international workshop on data analysis patterns in software engineering (DAPSE 2013). In: Proceedings of the 2013 international conference on software engineering (ICSE 2013). Piscataway, NJ, USA: IEEE Press; 2013:1517–1518.
