Strategies for handling missing values
by Ashish Kumar
Mastering pandas - Second Edition
Strategies for handling missing values
The major strategies for handling missing values, each covered in its own subsection, are deletion, imputation, interpolation, and KNN.
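As a rough illustration of what three of these strategies (deletion, imputation, and interpolation) can look like in pandas, here is a minimal sketch. The DataFrame, its column names, and its values are hypothetical and are not taken from the book. Mean imputation is only one of several possible imputation choices. KNN-based filling is left out because it usually relies on a separate library such as scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not from the book)
df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
    "city": ["A", "B", None, "D", "E"],
})

# Deletion: drop every row that contains at least one missing value
deleted = df.dropna()

# Imputation: replace missing numeric values with the column mean
# (the mean is computed over the non-missing values only)
imputed = df["temp"].fillna(df["temp"].mean())

# Interpolation: fill each gap linearly from its neighbouring values
interpolated = df["temp"].interpolate()

print(deleted)
print(imputed)
print(interpolated)
```

Deletion is the simplest option but discards whole rows, so on data with many gaps it can shrink the dataset considerably. Imputation and interpolation keep every row at the cost of introducing estimated values.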