Statistical Data Cleaning with Applications in R
by Edwin de Jonge, Mark van der Loo
Index
a
abstraction leakage
adist
alphabet
b
branch and bound
byte order mark
c
CART model imputation
case conversion
character repertoire
chartr
code point
commutation
cosine distance
d
daff protocol
Damerau–Levenshtein distance
data changes
  cell changes
  diff
  logging
  patch
  tracking
data modifying function
data modifying rule
data point
data set
data transformation
data validation
database
  CRUD
daylight saving time
deductive correction
  correcting typos
deductive imputation
distance function
domain knowledge
e
elasticnet regression
EM algorithm
  multiple imputation
  multivariate normal
EM-algorithm
EMB algorithm
error localization
escape character
extended regular expression
f
Fellegi and Holt principle
floating point number
fuzzy matching
g
graphs
grepl
gsub
h
Hamming distance
hot deck imputation
i
iconv
IDE
idempotent
imputation
imputation variance
Inf
integer
  .Machine
  one's complement
  signed
ISO
ISOdate
j
Jaccard distance
Jaro distance
Jaro-Winkler distance
k
Kleene star
l
lasso regression
leap seconds
Levenshtein distance
ligature
linear model
linear regression imputation
log data changes
longest common subsequence distance
m
M-estimation
MAR
matrix data
MCAR
mean imputation
measurement
missing data
  visualisation
mixed integer program
  error localization
  rule set issues
model residual
model-based imputation
modifying function
multiple imputation
n
NA
NaN
nearest neighbor imputation
NMAR
normal numbers
NULL
numeric stability
numerical tolerance
o
Olson tables
optimal string alignment distance
p
paste
perl
POSIX time
POSIXct
predictive mean matching
predictive model
proxy imputation
pseudo-inverse
q
q-gram distance
q-gram profile
r
R
  array
  character
  data frame
  formula
  function
  matrix
  vector
R package
  Amelia
  censusapi
  daff
  data.table
  DBI
  dcmodify
  deductive
  docopt
  dplyr
  editrules
  errorlocate
  eurostat
  ff
  ffbase
  glmnet
  gsubfn
  haven
  kernlab
  LaF
  lattice
  lintools
  Lubridate
  lubridate
  lumberjack
  magrittr
  MASS
  memisc
  mice
  microbenchmark
  missForest
  qdapRegex
  readr
  rex
  rhadoop
  RODBC
  rpart
  rspa
  rspa::match_restrictions
  RSQLite
  rvest
  simputation
  sparklyr
  sparkR
  stringdist
  stringi
  stringr
  textcat
  tibble
  tidycensus
  tidyr
  tm
  twitteR
  utils
  validate
  VIM
  VIM::aggr
  wbstats
  XML
  xml2
  yaml
  Zelig
random forest imputation
random hot deck imputation
ratio imputation
regex
regular expression
  *,?,+
  back referencing
  character range
  greedy
  groups
  lazy
relational algebra
reliability weights
ridge regression
rule set
  quality
  simplification
s
sequential hot deck imputation
soundex
statistical value chain
string
string kernel
string similarity
stringdist
strptime
strsplit
sub
substr
successive projection algorithm
t
tabular data
TAI
tidy data
time series
transliteration
u
unicode equivalence
UTC
v
validation function
validation rule
w
wildcard
workflow
  data cleaning example
x
XML
y
YAML