Step 2 – EDA

The first step is to plot the data:

import matplotlib.pyplot as plt
import seaborn as sns

# showing dataset
df.plot(x='Time', kind='line', subplots=True)
plt.show()

The result will be similar to the following:

Time-series of flight

It is easy to see that whenever Flaps is nonzero, the airplane is taking off or landing, so we should not consider these time frames. The following code drops the rows recorded while Flaps or Landing_Gear is greater than zero:

# drop rows recorded during takeoff and landing
df = df.drop(df[df['Flaps']>0].index)
df = df.drop(df[df['Landing_Gear']>0].index)
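
The same filter can also be written as a single boolean mask. The following is only a minimal sketch on a small made-up DataFrame: the values are invented for illustration, and the helper name keep_cruise_rows is an assumption, not part of the original code. It assumes that Flaps and Landing_Gear are never negative, so testing for zero is equivalent to the > 0 test used above:

```python
import pandas as pd

def keep_cruise_rows(frame):
    """Return only the rows where both Flaps and Landing_Gear are zero."""
    mask = (frame['Flaps'] == 0) & (frame['Landing_Gear'] == 0)
    return frame[mask]

# toy data, invented for illustration only
toy = pd.DataFrame({
    'Time': [0, 1, 2, 3, 4],
    'Flaps': [2, 0, 0, 0, 1],
    'Landing_Gear': [1, 1, 0, 0, 1],
    'Altitude': [100, 900, 3000, 3100, 400],
})
cruise = keep_cruise_rows(toy)
print(cruise)
```

Building one mask keeps the condition readable and returns a new DataFrame, leaving the original untouched.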

The dataset is already clean, so no further cleaning is needed. We can now look at the standard deviation of each variable. The following code performs this analysis:

# standard deviation of each variable
df_std = df.std()
print(df_std)

We can use the df.describe() function instead of df.std() to see other summary statistics as well.

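We can see that Landing_Gear, Thrust_Rev, and Flaps have a standard deviation of 0, so they carry no information. For Flaps and Landing_Gear this is a direct consequence of the filter applied earlier, which kept only the rows where they are zero. We can, therefore, remove these variables: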

# remove variables that carry no information
df = df.drop(['Landing_Gear', 'Thrust_Rev', 'Flaps'], axis=1)
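
Rather than reading the constant columns off the printed output, they can be detected programmatically. The following is a minimal sketch on invented data; the helper name constant_columns is hypothetical and not part of the original code:

```python
import pandas as pd

def constant_columns(frame):
    """List the columns whose standard deviation is exactly zero."""
    std = frame.std()
    return list(std[std == 0].index)

# toy data, invented for illustration only
toy = pd.DataFrame({
    'Flaps': [0, 0, 0, 0],
    'Thrust_Rev': [0, 0, 0, 0],
    'Altitude': [3000.0, 3050.0, 3100.0, 3075.0],
})
to_drop = constant_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

This avoids hard-coding column names, which helps if a different flight recording has a different set of constant sensors.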

We can now analyze the correlation between the sensors. The following code calculates the Pearson correlation coefficient and the corresponding p-value for each pair of variables. The Pearson correlation coefficient measures the strength of the linear relationship between two variables:

# correlation
from scipy.stats import pearsonr

def calculate_pvalues(df):
    corr = {}
    for r in df.columns:
        for c in df.columns:
            # skip a pair that was already computed in the opposite order
            if not corr.get(r + ' - ' + c):
                p = pearsonr(df[r], df[c])
                corr[c + ' - ' + r] = p
    return corr

print('correlation')
d = calculate_pvalues(df).items()
for k, v in d:
    print('%s : v: %s p: %s' % (k, v[0], v[1]))
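
If only the coefficients are needed, without p-values, pandas can produce the whole correlation matrix in one call. The following is a minimal sketch on invented data that compares DataFrame.corr with scipy's pearsonr for a single pair; the column names A and B are placeholders, not sensors from the dataset:

```python
import pandas as pd
from scipy.stats import pearsonr

# toy data, invented for illustration only
toy = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 1, 4, 3, 5],
})

# full matrix of Pearson coefficients (no p-values)
corr_matrix = toy.corr(method='pearson')

# coefficient and p-value for one pair
r, p = pearsonr(toy['A'], toy['B'])

print(corr_matrix)
print('r = %.3f, p = %.3f' % (r, p))
```

Both approaches yield the same coefficient for a given pair; pearsonr additionally reports the p-value.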

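The correlations that we have found between the variables suggest that we need to carry out further investigation. We can use recursive feature elimination (RFE) with a RandomForestRegressor estimator to select the four most informative features, treating the last column of the dataset as the output variable: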

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# separate into input and output variables
array = df.values
x = array[:, 0:-1]
y = array[:, -1]
# perform feature selection (keep the 4 highest-ranked features)
rfe = RFE(RandomForestRegressor(n_estimators=500, random_state=1), n_features_to_select=4)
fit = rfe.fit(x, y)
# report selected features
print('Selected Features:')
names = df.columns.values[0:-1]
for i in range(len(fit.support_)):
    if fit.support_[i]:
        print(names[i])

The output should be similar to the following:

Time
Altitude
Param1_4
Param3_1
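
To see what RFE reports, it can help to run it on data where the relevant inputs are known in advance. The following is a minimal sketch on synthetic data: the target formula, the sample size, and the smaller forest (50 trees, chosen only to keep the run short) are assumptions made for illustration and are unrelated to the flight dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# assumed toy target: depends only on columns 0 and 3, plus small noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

selector = RFE(RandomForestRegressor(n_estimators=50, random_state=1),
               n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept columns
print(selector.ranking_)   # rank 1 marks a selected feature
```

RFE repeatedly fits the estimator and discards the feature with the lowest importance until the requested number remains, so support_ always contains exactly n_features_to_select True entries.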