Chapter 1

Introduction

Before we begin: for the very impatient (or very busy) reader, we offer an executive summary in Section 1.3 and a statement on next directions in Chapter 25.

1.1 Why read this book?

NASA used to run a Metrics Data Program (MDP) to analyze data from software projects. In 2003, the research lead, Kenneth McGill, asked: “What can you learn from all that data?” McGill's challenge (and funding support) resulted in much work. The MDP is no more but its data was the seed for the PROMISE repository (Figure 1.1). At the time of this writing (2014), that repository is the focal point for many researchers exploring data science and software engineering. The authors of this book are long-time members of the PROMISE community.

Figure 1.1 The PROMISE repository of SE data: http://openscience.us/repo.

When a team has been working at something for a decade, it is fitting to ask, “What do you know now that you did not know before?” In short, we think that sharing needs to be studied much more, so this book is about sharing ideas and how data mining can help that sharing. As we shall see:

 Sharing can be very useful and insightful.

 But sharing ideas is not a simple matter.

The bad news is that, usually, ideas are shared very badly. The good news is that, based on much recent research, it is now possible to offer much guidance on how to use data miners to share.

This book offers that guidance. Because it is drawn from our experiences (and we are all software engineers), its case studies all come from that field (e.g., data mining for software defect prediction or software effort estimation). That said, the methods of this book are very general and should be applicable to many other domains.

1.2 What do we mean by “sharing”?

To understand “sharing,” we start with a story. Suppose two managers of different projects meet for lunch. They discuss books, movies, the weather, and the latest political/sporting results. After all that, their conversation turns to a shared problem: how to better manage their projects.

Why are our managers talking? They might be friends and this is just a casual meeting. On the other hand, they might be meeting in order to gain the benefit of the other's experience. If so, then their discussions will try to share their experience. But what might they share?

1.2.1 Sharing insights

Perhaps they wish to share their insights about management. For example, our diners might have just read Fred Brooks's book on The Mythical Man Month [59]. This book documents many aspects of software project management including the famous Brooks' law which says “adding staff to a late software project makes it later.”

To share such insights about management, our managers might share war stories on (e.g.) how upper management tried to save late projects by throwing more staff at them. Shaking their heads ruefully, they remind each other that often the real problems are the early lifecycle decisions that crippled the original concept.

1.2.2 Sharing models

Perhaps they are reading the software engineering literature and want to share models about software development. Now “models” can mean different things to different people. For example, to some object-oriented design people, a “model” is some elaborate class diagram. But models can be smaller, much more focused statements. For example, our lunch buddies might have read Barry Boehm's Software Engineering Economics book. That book documents a power law of software which states that larger software projects take disproportionately longer to complete than smaller projects [34].

Accordingly, they might discuss if development effort for larger projects can be tamed with some well-designed information hiding.1

(Just as an aside, by model we mean any succinct description of a domain that someone wants to pass to someone else. For this book, our models are mostly quantitative equations or decision trees. Other models may be more qualitative, such as the rules of thumb that one manager might want to offer to another—but in the terminology of this chapter, we would call that insight rather than a model.)

1.2.3 Sharing data

Perhaps our managers know that general models often need tuning with local data. Hence, they might offer to share specific project data with each other. This data sharing is particularly useful if one team is using a technology that is new to them, but has long been used by the other. Also, such data sharing has become fashionable among data-driven decision makers such as Nate Silver [399] and the evidence-based software engineering community [217].

1.2.4 Sharing analysis methods

Finally, if our managers are very experienced, they know that it is not enough just to share data in order to share ideas. This data has to be summarized into actionable statements, which is the task of the data scientist. When two such scientists meet for lunch, they might spend some time discussing the tricks they use for different kinds of data mining problems. That is, they might share analysis methods for turning data into models.

1.2.5 Types of sharing

In summary, when two smart people talk, there are four things they can share. They might want to:

 share models;

 share data;

 share insight;

 share analysis methods for turning data into models.

This book is about sharing data and sharing models. We do not discuss sharing insight because, to date, it is not clear what can be said on that point. As to sharing analysis methods, that is a very active area of current research; so much so that it would be premature to write a book on that topic. However, for some state-of-the-art results in sharing analysis methods, the reader is referred to two recent articles by Tom Zimmermann and his colleagues at Microsoft Research. They discuss the very wide range of questions that are asked of data scientists [27, 64] (and many of those queries are about exploring data before any conclusions are made).

1.2.6 Challenges with sharing

It turns out that sharing data and models is not a simple matter. To illustrate that point, we review the limitations of the models learned from the first generation of analytics in software engineering.

As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [443] speaking of his programming experiences from the early 1950s:

It was on one of my journeys between the EDSAC room and the punching equipment that hesitating at the angles of stairs the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.

It took several decades to gather the experience required to build a size/defect relationship. In 1971, Fumio Akiyama described the first known “size” law, saying that the number of defects D is a function of the number of lines of code; specifically,

D = 4.86 + 0.018 × LOC
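For example, for a (purely hypothetical) program of 1000 lines of code, this equation predicts roughly 4.86 + 0.018 × 1000 ≈ 23 defects.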

Alas, nothing is as simple as that. Lessons come from experience and, as our experience grows, those lessons get refined/replaced. In 1976, McCabe [285] argued that the number of lines of code was less important than the complexity of that code. He proposed “cyclomatic complexity,” or v(g), as a measure of that complexity and offered the now (in)famous rule that a program is more likely to be defective if

v(g) > 10
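To illustrate: for structured code, v(g) is (roughly speaking) the number of decision points plus one, so a routine whose only branching is a chain of nine if statements has v(g) = 10 and sits right at McCabe's threshold.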

At around the same time, other researchers were arguing that not only is programming an inherently buggy process, it is also inherently time-consuming. Based on data from 63 projects, Boehm [34] proposed in 1981 that linear increases in code size lead to disproportionately large increases in development effort:

effort = a × KLOC^b × ∏i EMi(Fi)  (1.1)

Here, a and b are parameters that need tuning for particular projects and the EMi are “effort multipliers” that control the impact of some project factor Fi on the effort. For example, if Fi is “analyst capability” and it moves from “very low” to “very high,” then according to Boehm's 1981 model, EMi moves from 1.46 to 0.71 (i.e., better analysts let you deliver more systems, sooner).
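To make the shape of Equation (1.1) concrete, here is a minimal sketch in Python. The default a and b values and the helper function are illustrative assumptions, not Boehm's published calibration; only the two analyst-capability multipliers come from the text above.

```python
# Toy sketch of a COCOMO-style effort model (Equation 1.1).
# The default a, b values are placeholders; real use needs local calibration.

def effort(kloc, a=2.94, b=1.10, effort_multipliers=None):
    """Estimated effort = a * KLOC^b * product of effort multipliers."""
    em = 1.0
    for value in (effort_multipliers or {}).values():
        em *= value
    return a * (kloc ** b) * em

# Example: a 100 KLOC project with very high analyst capability (EM = 0.71,
# the 1981 value quoted above) versus very low capability (EM = 1.46).
print(effort(100, effort_multipliers={"acap": 0.71}))
print(effort(100, effort_multipliers={"acap": 1.46}))
```

Even this toy version makes the key point of the next paragraphs: every number in it has to come from somewhere, and those numbers rarely travel unchanged between organizations.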

Forty years later, it is very clear that the above models are true only in certain narrow contexts. To see this, consider the variety of software built at the Microsoft campus, Redmond, USA. A bird flying over that campus would see dozens of five-story buildings. Each of those buildings has (say) five teams working on each floor. These 12 * 5 * 5 = 300 teams build a wide variety of software including gaming systems, operating systems, databases, word processors, etc. If we were to collect data from this diverse set of projects, then it would be

 A sparsely populated set of observations within a much larger space of possible software projects;

 About a very diverse set of activities;

 That are undertaken for an ever-changing set of tasks;

 Using an ever-evolving set of programs and people.

Worse still, all that data would be about software practices at one very large commercial company, which may not apply to other kinds of development (e.g., agile software teams or government development labs or within open source projects).

With that preamble, we now ask the reader the following question:

Is it likely that any single model holds across all the 12 * 5 * 5 = 300 software projects at Microsoft (or at other organizations)?

The premise of this book is that the answer to this question is “NO!!”; that is, if we collect data from different software projects, we will build different models and none of them may be relevant to any other (but stay calm, the next section offers three automatic methods for managing this issue).

There is much empirical evidence that different software projects produce different models [190, 218, 280]. For example, suppose we were learning a regression model for software project effort:

effort = β0 + β1x1 + β2x2 + …

To test the generality of this model across multiple projects, we can learn this equation many times using different subsets of the data. For example, in one experiment [291], we learned this equation using 20 different (2/3)rds random samples of some NASA projects. As shown in Figure 1.2, the β parameters on the learned effort models vary tremendously across different samples. For example, “vexp” is “virtual machine experience” and as shown in Figure 1.2 its βi value ranges from −8 to −3.5. In fact, the signs of five βi coefficients even changed from positive to negative (see “stor,” “aexp,” “modp,” “cplx,” “sced”).

Figure 1.2 Instability in effort models: sorted βi values from local calibration on 20*(66%) samples of NASA93 data. From [302]. Coefficients learned using Boehm's recommended methods [34]. A greedy backward selection removed attributes with no impact on estimates (so some attributes have less than 20 results). From [291].
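The instability in Figure 1.2 is easy to reproduce in outline. The following sketch is our own illustration, using scikit-learn and synthetic data rather than the NASA93 records: it fits the same linear form to 20 random two-thirds subsamples and prints the range of each learned coefficient. Even with this well-behaved synthetic data, the smaller coefficients flip sign from sample to sample, and real project data is far messier.

```python
# Sketch: how much do regression coefficients move across 2/3 subsamples?
# Synthetic data stands in for the NASA93 records used in the text.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, d = 93, 5                                  # ~93 projects, 5 attributes
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.1])
y = X @ beta_true + rng.normal(scale=3.0, size=n)   # noisy "effort"

coefs = []
for _ in range(20):                           # 20 random 2/3 samples
    idx = rng.choice(n, size=2 * n // 3, replace=False)
    model = LinearRegression().fit(X[idx], y[idx])
    coefs.append(model.coef_)

coefs = np.array(coefs)
for j in range(d):
    print(f"beta_{j}: min={coefs[:, j].min():+.2f} max={coefs[:, j].max():+.2f}")
```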

Defect models are just as unstable as effort models. For example, Zimmermann et al. [463] learned defect predictors from 622 pairs of projects 〈project1, project2〉. In only 4% of pairs, the defect predictors learned in project1 worked in project2. Similar findings (of contradictory conclusions in defect prediction) concern the use of object-oriented metrics. Turhan [291] reviewed the conclusions of 28 studies that discuss the effectiveness of different object-oriented metrics for predicting defects. In Figure 1.3:

Figure 1.3 Instability in defect models: studies reporting significant (“+”) or irrelevant (“−”) metrics verified by univariate prediction models. Blank entries indicate that the corresponding metric is not evaluated in that particular study. Colors comment on the most frequent conclusion of each column. CBO, coupling between objects; RFC, response for class (# methods executed by arriving messages); LCOM, lack of cohesion (pairs of methods referencing one instance variable, different definitions of LCOM are aggregated); NOC, number of children (immediate subclasses); WMC, # methods per class. From [291].

 A “+” indicates a metric is significantly correlated to defects;

 A “−” means it was found to be irrelevant to predicting defects;

 And white space means that this effect was not explored.

Note that for nearly all metrics except for “response for class,” the effects differ wildly in different projects.

For the manager of a software project, these instabilities are particularly troubling. For example, we know of project managers who have made acquisition decisions worth tens of millions of dollars based on these βi coefficients; i.e., they decided to acquire the technologies that had most impact on the variables with largest βi coefficients. Note that if the βi values are as unstable as shown in Figure 1.2, then the justification for those purchases is not strong.

Similarly, using Figure 1.3, it is difficult (to say the least) for a manager to make a clear decision about, for example, the merits of a proposed coding standard where maximum depth of inheritance is required to be less than some expert-specified threshold.
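The Zimmermann et al. style of check is simple to outline: train a defect predictor on one project, test it on another, and count how often the cross-project predictor is acceptable. The sketch below is our own schematic version, not a reproduction of their experiment; the `load_project` helper is hypothetical, and the Naive Bayes learner and recall-based acceptance rule were chosen only for illustration.

```python
# Sketch of a pairwise cross-project defect-prediction check.
# load_project(name) is a hypothetical helper returning (features, labels),
# with labels assumed to be 0/1 (defective = 1).
from itertools import permutations
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

def acceptable(y_true, y_pred, min_recall=0.7):
    # Illustrative acceptance rule; the original study used stricter criteria.
    return recall_score(y_true, y_pred) >= min_recall

def cross_project_success_rate(projects, load_project):
    wins, pairs = 0, 0
    for source, target in permutations(projects, 2):
        Xs, ys = load_project(source)          # train on one project...
        Xt, yt = load_project(target)          # ...test on another
        model = GaussianNB().fit(Xs, ys)
        wins += acceptable(yt, model.predict(Xt))
        pairs += 1
    return wins / pairs

# Intended usage (with a real loader):
# rate = cross_project_success_rate(["p1", "p2", "p3"], load_project)
```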

1.2.7 How to share

The above results suggest that there are few, if any, shareable general principles in human activities such as building software:

 We cannot share models without tuning them with data.

 The best models for different domains may be different.

 And even if they are not, if we dare to tune someone else's model with data, this may be unwise because not all data is relevant outside of the context where it was collected.

But this book is not a counsel of despair. Clearly, it is time to end the quixotic quest for one and only one model for diverse activities like software engineering. Also, if one model and one data source are not enough, then it is time to move to multiple data sources and multiple models.

1.3 What? (our executive summary)

1.3.1 An overview

The view of this book is that, when people meet to discuss data, that discussion needs automatic support tools to handle multiple models and multiple data sources. More specifically, for effective sharing, we need three kinds of automatic methods:

1. Although not all shared data from other sites is relevant, some of it is. The trick is to have the right relevancy filter that shares just enough of the correct data.

2. Although not all models move verbatim from domain to domain, it is possible to automatically build many models, then assess what models work best for a particular domain.

3. It is possible and useful to automatically form committees of models (called ensembles) in which different models can debate and combine their recommendations (a minimal sketch of such a committee appears just after this list).
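As a first taste of that third idea, here is a small voting-committee sketch using scikit-learn. The learners and the synthetic data are placeholder assumptions chosen only to illustrate the ensemble principle; the ensemble methods studied later in this book are considerably richer.

```python
# Minimal "committee of models" sketch: three different learners vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

committee = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("nb", GaussianNB()),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",          # average the members' predicted probabilities
)
committee.fit(X, y)
print(committee.predict(X[:5]))
```

Soft voting simply averages the members' predicted probabilities; it is the simplest possible committee, shown here only to fix ideas.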

This book discusses these kinds of methods. Also discussed will be

 Methods to increase the amount of shared data we can access. For example, privacy algorithms will be presented that let organizations share data without divulging important secrets. We also discuss data repair operators that can compensate for missing data values.

 Methods to make best use of that data via active learning (which means learning the fewest number of most interesting questions to ask, thus avoiding needless data collection).

1.3.2 More details

Like a good cocktail, this book is a mix of parts—two parts introductory tutorials and two parts technical details:

 Part I and Part II are short introductory notes for managers and technical developers. These first two parts would be most suitable for data scientist rookies.

 Part III (Sharing Data) and Part IV (Sharing Models) describe leading edge methods taken from recent research papers. The last two parts would be more suitable for seasoned data scientists.

As discussed in Part I, it cannot be stressed strongly enough that understanding organizational issues is just as important as understanding the data mining technology. For example, it is impossible to scale up data sharing without first addressing issues of confidentiality (the results of Chapter 16 show that such confidentiality is indeed possible, but more work is needed in this area). Many existing studies on software prediction systems concentrate on achieving the “best” model fit for a given task. The importance of another task—providing insights—is frequently overlooked.

When seeking insight, it is very useful to “shrink” the data by reducing it to just its essential content (this simplifies the inspection and discussion of that data). There are sound theoretical reasons for believing that many data sets can be extensively “shrunk” (see Chapter 15). For methods to accomplish that task, see the data mining pruning operators of Chapter 10 as well as the CHUNK, PEEKING, and QUICK tools of Chapter 12, Chapter 15, and Chapter 17.
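To give a flavor of that shrinking, here is a small sketch that replaces a data table with a handful of cluster centroids. It is our own illustration of the general idea (the random data and the choice of k-means are arbitrary assumptions), not the CHUNK, PEEKING, or QUICK algorithms themselves.

```python
# Sketch: "shrink" a data set to a few representative rows via clustering.
# Illustration of the idea only, not the book's CHUNK/PEEKING/QUICK tools.
import numpy as np
from sklearn.cluster import KMeans

def shrink(X, n_exemplars=8, seed=0):
    """Return a small set of centroid rows standing in for all of X."""
    km = KMeans(n_clusters=n_exemplars, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_

X = np.random.default_rng(0).normal(size=(500, 6))   # 500 rows, 6 features
print(shrink(X).shape)                               # -> (8, 6)
```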

One aspect of our work that is different from that of many other researchers is our willingness to “go inside” the data miners. It is common practice to use data miners as “black boxes.” Although much good work can be done that way, we have found that the more we use the data miners, the more we want to adjust how they function. Much of the material in the chapters ahead can be summarized as follows:

 Here's the usual way people use the data miners …

 … and here's a new way that leads to better predictions.

Based on our experience, we would encourage more experimentation on the internals of these data miners. The technology used in those miners is hardly static. What used to be just regression and classifiers is now so much more (including support vector machines, neural networks, genetic algorithms, etc.). Also, the way we use these learners is changing. What used to be “apply the learners to the data” is now transfer learning (Chapter 13 and Chapter 14), active learning (Chapter 18), ensemble learning (Chapter 20, Chapter 21, and Chapter 22), and so on. In particular, we have seen that ensembles can be very powerful and versatile tools. For instance, we show in Chapter 20, Chapter 21, Chapter 22, and Chapter 24 that their power can be extended from static to dynamic environments, from single to multiple goals/objectives, from within-company to transfer learning. We believe that ensembles will continue to show their value in future research.

Another change is the temporal nature of data mining. Due to the dynamism and uncertainty of the environments where companies operate, software engineering is moving toward dynamic adaptive automation. We show in this book (Chapter 21) the effects of environment changes in the context of software effort estimation, and how to benefit from updating models to reflect the current context of software companies. Changes are part of software companies' lives and are an important issue to be considered in the next research frontier of software prediction systems. The effect of changes on other software prediction tasks should be investigated, and we envision the proposal of new approaches to adapt to changes in the future.

More generally, we offer the following caution to industrial data scientists:

An elementary knowledge of machine learning and/or data mining may be of limited value to a practitioner wishing to make a career in the data science field.

The practical problems facing software engineering, as well as the practicalities of real-world data sets, often require a deep understanding of the data and a tailoring of the right learners and algorithms to it. That tailoring can be done by augmenting a particular learner (like the augmented nearest-neighbor algorithm in TEAK, discussed in Chapter 14) or by integrating the power of multiple algorithms into one (like QUICK, which ensembles together a selected group of learners, as discussed in Chapter 18). But in every tailoring scenario, a practitioner will be required to justify his or her decisions and choices of algorithms. Hence, rather than an elementary knowledge, a deeper understanding of the algorithms (as well as field-experience notes gathered in books like this one) is a must for a successful practitioner in data science.
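As one tiny example of such tailoring, the sketch below wraps a plain k-nearest-neighbor (“analogy-based”) effort estimator so that the number of analogies k is chosen by cross-validation on the local projects. This is our own simplified illustration with placeholder data; it is not the TEAK or QUICK algorithm described in the later chapters.

```python
# Sketch: tailor a k-NN ("analogy-based") effort estimator to local data
# by picking k via cross-validation. Illustrative only; not TEAK or QUICK.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def tuned_effort_estimator(X, y):
    """Return a k-NN effort model whose k was chosen on the local projects."""
    pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
    search = GridSearchCV(
        pipe,
        {"kneighborsregressor__n_neighbors": [1, 2, 3, 5, 8]},
        scoring="neg_mean_absolute_error",
        cv=3,
    )
    return search.fit(X, y).best_estimator_

# Toy usage with random stand-in data for local projects.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))           # 40 projects, 4 attributes
y = np.abs(rng.normal(size=40)) * 100  # "effort" in person-months
model = tuned_effort_estimator(X, y)
print(model.predict(X[:3]))
```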

The last part of the book addresses multiobjective optimization. For example, when a software engineer is planning the development of a software system, he/she may be interested in minimizing the number of defects, the effort required to develop the software, and the cost of the software, all at the same time. The existence of multiple goals, and of multiobjective optimizers to handle them, thus profoundly affects data science for software engineering. Chapter 23 explains the importance of goals in model-based reasoning, and Chapter 24 presents an approach that can be used to consider different goals.

1.4 How to read this book

This book covers the following material:

Part I: Data Mining for Managers: The success of an industrial data mining project depends not only on technical matters but also on some very important organizational matters. This first part describes those organizational issues.

Part II: Data Mining: A Technical Tutorial: Discusses data mining for software engineering (SE) applications, along with several data mining methods that form the building blocks for advanced data science approaches to SE. In this book, we apply those methods to numerous SE applications, including software effort estimation and defect prediction.

Part III: Sharing Data: In this part, we discuss methods for moving data across organizational boundaries. The topics covered here include how to find learning contexts and then how to learn across those contexts (for cross-company learning); how to handle missing data; privacy; and active learning.

Part IV: Sharing Models: In this part, we discuss how to take models learned from one project and adapt and apply them to others. Topics covered here include ensemble learning; temporal learning; and multiobjective optimization.

The chapters of Parts I and II document a flow of ideas while the chapters of Parts III and IV were written to be mostly self-contained. Hence, for the reader who likes skimming, we would suggest reading all of Parts I and II (which are quite short) then dipping into any of the chapters in Parts III and IV, according to your own interests.

To assist in finding parts of the book that most interest you, this book contains several roadmaps:

 See Chapter 2 for a roadmap to Part I: Data Mining for Managers.

 See the start of Chapter 7 for a roadmap to Part II: Data Mining: A Technical Tutorial.

 See Chapter 11, Section 11.2, for a roadmap to Part III: Sharing Data.

 See Chapter 19 for a roadmap to Part IV: Sharing Models.

1.4.1 Data analysis patterns

As another guide to readers, from Chapter 12 onwards each chapter starts with a short summary table that we call a data analysis pattern:

Name: The main technical method discussed in this chapter.
Also known as: Synonyms, related terms.
Intent: The goal.
Motivation: Background.
Solution: Proposed approach.
Constraints: Issues that complicate the proposed approach.
Implementation: Technical details.
Applicability: Case studies, results.
Related to: Pointers to other chapters with related work.

1.5 But what about …? (what is not in this book)

1.5.1 What about “big data”?

The reader may already be curious about one aspect of this book—there is very little discussion of “big data.” That is intentional. While the existence of large CPU farms and vast data repositories enables some novel analyses, much of the “big data” literature is concerned with the systems issues of handling terabytes of data or thousands of cooperating CPUs. Once those systems issues are addressed, business users are still faced with the same core problems of how to share data, models, and insight from one project to another. This book addresses those core problems.

1.5.2 What about related work?

This book showcases the last decade of research by the authors, as they explored the PROMISE data http://openscience.us/repo. Hundreds of other researchers, from the PROMISE community and elsewhere, have also explored that data (see the long list of application areas shown in Section 7.2). For their conclusions, see the excellent papers at

 The Art and Science of Analyzing Software Data, Morgan Kaufmann Publishing, 2015, in press.

 The PROMISE conference, 2005: http://goo.gl/KuocfC. For a list of top-cited papers from PROMISE, see http://goo.gl/ofpG12.

 The Mining Software Repositories conference, 2004: http://goo.gl/FboMVw.

 As well as many other SE conferences.


1.5.3 Why all the defect prediction and effort estimation?

For historical reasons, the case studies of this book mostly relate to predicting software defects from static code and estimating development effort. From 2000 to 2004, one of us (Menzies) worked to apply data mining to NASA data. At that time, most of NASA's data related to reports of project effort or defects. In 2005, with Jelber Sayyad, we founded the PROMISE project on reusable experiments in SE. The PROMISE repository was seeded with the NASA data and that kind of data came to dominate that repository.

That said, there are three important reasons to study defect prediction and effort estimation. First, they are important tasks for software engineering:

 Every software project needs a budget and very bad things happen when that budget is inadequate for the task at hand.

 Every software project has bugs, which we hope to reduce.

Second, these two tasks are open science problems; that is, all the materials needed for the reader to repeat, improve, or even refute any part of this book are online:

 For data sets, see the PROMISE repository http://openscience.us/repo.

 For freely available data mining toolkits, download tools such as “R” or WEKA from http://www.r-project.org or http://www.cs.waikato.ac.nz/ml/weka (respectively).

 For tutorials on those tools, see e.g., [444] or some of the excellent online help forums such as http://stat.ethz.ch/mailman/listinfo/r-help or http://list.waikato.ac.nz/mailman/listinfo/wekalist, just to name a few.

Third, effort estimation and defect prediction are excellent laboratory problems; i.e., nontrivial tasks that require mastery of intricate data mining methods. In our experience, we have found that the data mining methods used for these kinds of data apply very well to other kinds of problems.

Hence, if the reader has ambitions to become an industrial or academic data scientist, we suggest that he or she try to learn models that outperform the results shown in this book (or, indeed, the hundreds of other published papers that explore the PROMISE effort or defect data).

1.6 Who? (about the authors)

The authors of this text have worked on real-world data mining with clients for many years. Ekrem Kocaguenli worked with companies in Turkey to build effort estimation models. Dr Kocaguenli now works at Microsoft, Redmond, on deriving operational intelligence with the Bing Ads team. Burak Turhan has extensive data mining consultancy experience with Turkish and Finnish companies. Tim Menzies and Fayola Peters have been mining data from NASA projects since 2000. Dr Menzies has also been hired by Microsoft to conduct data mining studies on gaming data. Leandro Minku worked on data mining at Google during a six-month internship in 2009/2010 and has been collaborating with Honda on optimization.

Further, this team of authors has extensive experience in data mining, particularly in the area of software engineering. Tim Menzies is a Professor in Computer Science (WVU) and a former Software Research Chair at NASA where he worked extensively on their data sets. In other industrial work, he developed code in the 1980s and 1990s in the Australian software industry. After that, he returned to academia and has published 200+ refereed articles, many in the area of data mining and SE. According to academic.research.microsoft.com, he is one of the top 100 most cited researchers in software engineering in the last decade (out of 80,000+ authors). His research includes artificial intelligence, data mining, and search-based software engineering. He is best known for his work on the PROMISE open source repository of data for reusable software engineering experiments. He received his PhD degree from the University of New South Wales, Australia. For more information visit http://menzies.us.

Ekrem Kocagüneli received his PhD from the Lane Department of Computer Science and Electrical Engineering, West Virginia University. His research focuses on empirical software engineering, the data/model problems associated with software estimation, and tackling them with smarter machine learning algorithms. His research provided solutions to industry partners such as Turkcell and IBTech (a subsidiary of the National Bank of Greece); he also completed an internship at Microsoft Research Redmond in 2012. His work has been published in important software engineering venues such as the IEEE TSE, ESE, and ASE journals. He now works at Microsoft, Redmond, exploring data mining and operational intelligence metrics on advertisement data.

Leandro L. Minku is a Research Fellow II at the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, the University of Birmingham (UK). He received his PhD degree in Computer Science from the University of Birmingham (UK) in 2010, and was an intern at Google Zurich for six months in 2009/2010. He was the recipient of the Overseas Research Students Award (ORSAS) from the British government and several scholarships from the Brazilian Council for Scientific and Technological Development (CNPq). Dr Minku's research focuses on software prediction models, search-based software engineering, machine learning in changing environments, and ensembles of learning machines. His work has been published at internationally renowned venues such as ICSE, IEEE TSE, ACM TOSEM, and IEEE TKDE. He was invited to give a keynote talk and to join conference steering committees.

Fayola Peters is a Research Fellow at LERO, the Irish Software Engineering Research Center (Ireland). Along with Mark Grechanik, she is the author of one of the two known algorithms (presented at ICSE'12) that can privatize data while still preserving the data mining properties of that data.

Burak Turhan is a Full Professor of Software Engineering at the Department of Information Processing Science at the University of Oulu, Finland. Before taking his current position, Dr Turhan was a Research Associate in the Software Engineering Group, Institute for Information Technology, National Research Council Canada. Prof. Turhan's research and teaching interests in software engineering are focused on empirical studies of software quality and programmer productivity, software analytics through the application of machine learning and data mining methods for defect and cost modeling, and mining software repositories for grounded decision making, as well as agile/lean software development with a special focus on test-driven development. He has published 70+ articles in international journals and conferences, invited for and organized panels and talks, and offered academic and industrial courses at all levels on these topics. He served in various positions for the academic community, e.g., steering committee member, chair, TPC member for 30+ academic conferences; reviewer and editorial board member for 15+ scientific journals; external reviewer and expert for national research councils and IT-related legal cases. He has been involved in 10+ national and international research projects and programs and conducted research in collaboration with leading (multi-)national companies. For more information and details please visit http://turhanb.net.

1.7 Who else? (acknowledgments)

The authors gratefully acknowledge the contribution of the international PROMISE community who have motivated our work with their interest, insights, energy, and synergy.

In particular, the authors would like to thank the founding members of the PROMISE conference's steering committee who have all contributed significantly to the inception and growth of PROMISE: Ayse Bener, Gary Boetticher, Tom Ostrand, Guenther Ruhe, Jelber Sayyad, and Stefan Wagner. Special mention needs to be made of the contribution of Jelber Sayyad who, in 2004, was bold enough to ask, “Why not make a repository of SE data?”

We also thank Tom Zimmermann, Christian Bird, and Nachi Nagappan from Microsoft Research, who let us waste weeks of their life to debug these ideas.

As to special mentions, Tim Menzies wants to especially thank the dozens of graduate students at West Virginia University who helped him develop and debug some of the ideas of this book. Dr Menzies' research was funded in part by NSF, CISE, project #0810879 and #1302169.

Ekrem Kocaguneli would like to thank Ayse Basar Bener, Tim Menzies, and Bojan Cukic for their support and guidance throughout his academic life. Dr Kocaguneli's research was funded in part by NSF, CISE, project #0810879.

Leandro Minku would like to thank all the current and former members of the projects Dynamic Adaptive Automated Software Engineering (DAASE) and Software Engineering By Automated SEarch (SEBASE), especially Prof. Xin Yao and Dr Rami Bahsoon, for the fruitful discussions and support. Dr Minku's research was funded by EPSRC Grant No. EP/J017515/1.

Fayola Peters would like to thank Tim Menzies for his academic guidance from Masters to PhD. Thanks are also deserved for members of the Modeling Intelligence Lab at West Virginia University whose conversations have sparked ideas for work contributed in this book. Dr Peters' research was funded in part by NSF, CISE, project #0810879 and #1302169.

Burak Turhan would like to give special thanks to Junior, Kamel, and the Silver-Viking for their role in the creation of this book. He would also like to acknowledge the Need for Speed (N4S) program funded by Tekes, Finland, for providing partial support to conduct the research activities leading to the results that made their way into this book.

Bibliography

[27] Begel A, Zimmermann T. Analyze this! 145 questions for data scientists in software engineering. In: Proceedings of the 36th international conference on software engineering (ICSE 2014); ACM; June 2014.

[34] Boehm B. Software engineering economics. Englewood Cliffs, NJ: Prentice-Hall; 1981.

[59] Brooks FP. The mythical man-month, Anniversary edition. Reading, MA: Addison-Wesley; 1995.

[64] Buse RPL, Zimmermann T. Information needs for software development analytics. In: Proceedings of the 2012 international conference on software engineering; IEEE Press; 2012:987–996.

[190] Jorgensen M. A review of studies on expert estimation of software development effort. J Syst Softw. 2004;70(1-2):37–60.

[217] Kitchenham BA, Dyba T, Jørgensen M. Evidence-based software engineering. In: ICSE '04: proceedings of the 26th international conference on software engineering; Washington, DC: IEEE Computer Society; 2004:273–281.

[218] Kitchenham BA, Mendes E, Travassos GH. Cross versus within-company cost estimation studies: a systematic review. IEEE Trans Softw Eng. 2007;33(5):316–329.

[280] Mair C, Shepperd M. The consistency of empirical comparisons of regression and analogy-based software project cost prediction. In: International symposium on empirical software engineering; November 2005:10.

[285] McCabe TJ. A complexity measure. IEEE Trans Softw Eng. 1976;2(4):308–320.

[291] Menzies T, Butcher A, Cok D, Marcus A, Layman L, Shull F, et al. Local vs. global lessons for defect prediction and effort estimation. IEEE Trans Softw Eng. 2012;1. Available from: http://menzies.us/pdf/12localb.pdf.

[302] Menzies T, Chen Z, Port D, Hihn J. Simple software cost estimation: safe or unsafe? In: Proceedings, PROMISE workshop, ICSE 2005; 2005. Available from: http://menzies.us/pdf/05safewhen.pdf.

[399] Silver N. The signal and the noise: why most predictions fail—but some don't. New York, NY: Penguin; 2012.

[443] Wilkes M. Memoirs of a computer pioneer. Cambridge, MA: MIT Press; 1985.

[444] Witten IH, Frank E. Data mining. 2nd ed. Los Altos, CA: Morgan Kaufmann; 2005.

[463] Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: van Vliet H, Issarny V, eds. ESEC/SIGSOFT FSE. New York, NY: ACM; 2009:91–100.



1 N components have N! possible interconnections but, with information hiding, only M1 < N components connect to some other M2 < N components in the rest of the system, thus dramatically reducing the number of connections that need to be built, debugged, and maintained.
