Data aggregation

As mentioned earlier, the data presented is customer-level data. Analysis is easier to perform on aggregated data, which in this case means data aggregated by region. To start with, we need to understand how the customers are spread across the regions. Hence, we use the groupby function to find the number of customers in each zip code, as shown in the following snippet:

data.groupby('zip')['zip'].count().nlargest(10)

The following is the output:

Aggregating the data based on zip codes

This gives the 10 zip codes that have the largest number of customers.
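To make the counting step concrete, here is a minimal sketch on a made-up DataFrame. The `toy` frame, its `customer_id` column, and the zip code values are assumptions for illustration only and are not part of the chapter's dataset.

```python
import pandas as pd

# Hypothetical customer-level data; only the 'zip' column name comes from the text.
toy = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'zip': ['10001', '10001', '10001', '94105', '94105', '60601'],
})

# Count the customers in each zip code and keep the two largest groups.
counts = toy.groupby('zip')['zip'].count().nlargest(2)
print(counts)
# 10001 has 3 customers and 94105 has 2; 60601 (1 customer) is cut off.
```

For a single column, `toy['zip'].value_counts().head(2)` should give the same counts with less typing.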

Therefore, we can work with our client-level data at the zip level by grouping it on zip code. After grouping the values, we also have to make sure that we remove the NAs. The following code processes the entire DataFrame group by group: it forward-fills missing values within each zip code and then drops any rows that still contain NAs:

import pandas as pd

# Forward-fill missing values within each zip code group
data_mod = data.groupby('zip')
filled_groups = [data_group.ffill() for name, data_group in data_mod]
data_clean = pd.concat(filled_groups, axis=0)

# dropna returns a new DataFrame, so assign the result back
data_clean = data_clean.dropna(axis=0, how='any')

The following screenshot shows the aggregated DataFrame after removing the NAs:

Aggregated DataFrame after removing the NAs

data_clean will become the cleaned version of our sample DataFrame, which will be passed to a model for further analysis.
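The cleaning step can also be sketched without an explicit loop. The example below is an illustration under stated assumptions, not the chapter's own method: the `toy` frame and its `income` column are hypothetical, and the per-group forward fill uses the grouped `ffill` method instead of concatenating groups one by one.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'income' is an invented column used only for illustration.
toy = pd.DataFrame({
    'zip': ['10001', '10001', '94105', '94105', '10001'],
    'income': [50.0, np.nan, np.nan, 70.0, np.nan],
})

value_cols = ['income']  # columns to fill; the grouping key is left untouched

# Forward-fill within each zip code, so values never leak across zip codes.
filled = toy.copy()
filled[value_cols] = toy.groupby('zip')[value_cols].ffill()

# A missing value at the start of a group has nothing to fill from,
# so it survives the fill and is removed here.
cleaned = filled.dropna(axis=0, how='any')
print(cleaned)
```

In this sketch the first row for 94105 has no earlier value in its own group, which is why the `dropna` call is still needed after the fill. Unlike the loop-based version, this variant keeps the rows in their original order instead of regrouping them by zip code.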
