Foreword

Data cleaning is often the most time-consuming part of data analysis. Although it has been recognized as a separate topic for a long time in Official Statistics (where it is called ‘data editing’) and also has been studied in relation to databases, literature aimed at the larger statistical community is limited. This is why, when the publisher invited us to expand our tutorial “An introduction to data cleaning with R”, which we developed for the useR!2013 conference, into a book, we grabbed the opportunity with both hands. On the one hand, we felt that some of the methods that have been developed in the Official Statistics community over the last five decades deserved a wider audience. Perhaps, this book can help with that. On the other hand, we hope that this book will help in bringing some (often pre-existing) techniques into the Official Statistics community, as we move from survey-based data sources to administrative and “big” data sources.

For us, it would also be a nice way to systematize our knowledge and the software we have written on this topic. Looking back, we ended up not only writing this book, but also redeveloping and generalizing much of the data cleaning R packages we had written before. One of the reasons for this is that we discovered nice ways to generalize and expand our software and methods, and another is that we wished to connect to the recently emerged “tidyverse” style of interfaces to the R functionality.

What You Will Find in this Book

This book contains a selection of topics that we found to be useful while developing data cleaning (data editing) systems. The range is very broad, ranging from topics related to computer science, numerical methods, technical standards, statistics and data modeling, and programming.

This book covers topics in “technical data cleaning”, including conversion and interpretation of numerical, text, and date types. The technical standards related to these data types are also covered in some detail. On the data content side of things, topics include data validation (data checking), error localization, various methods for error correction, and missing value imputation.

Wherever possible, the theory discussed in this book is illustrated with an executable R code. We have also included exercises throughout the book which we hope will guide the reader in further understanding both the software and the methods.

The mix of topics reflects both the breadth of the subject and of course the interests and expertise of the authors. The list of missing topics is of course much larger than that what is treated, but perhaps the most important ones are cleaning of time series objects and outlier detection.

For Who Is this Book?

Readers of this book are expected to have basic knowledge of mathematics and statistics and also some programming experience. We assume concepts such as expectation values, variance, and basic calculus and linear algebra as previous knowledge. It is beneficial to have at least some knowledge of R, since this is the language used in this book, but for convenience and reference, a short chapter explaining the basics is included.

Acknowledgments

This book would have not been possible without the work of many others. We would like to thank our colleagues at Statistics Netherlands for fruitful discussions on data validation, imputation, and error localization. Some of the chapters in this book are based on papers and reports written with co-authors. We thank Jeroen Pannekoek, Sander Scholtus, and Jacco Daalmans for their pleasant and fruitful collaboration. We are greatly indebted by the R core team, package developers, and the very supportive R community for their relentless efforts.

Finally, we would like to thank our families for their love and support.

June 2017

Mark and Edwin

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset