Loading the ED dataset

Next, we import the contents of the fixed-width data file into Python as a pandas DataFrame composed of string columns, using the width list (the field widths) created in the previous cell. We then name the columns using the col_names list:

df_ed = pd.read_fwf(
    HOME_PATH + 'ED2013',
    widths=width,
    header=None,
    dtype='str'
)

df_ed.columns = col_names
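
To see what the widths argument does without needing the full data file, here is a minimal sketch that parses two made-up fixed-width records from an in-memory string. The three-field layout, the sample_widths and sample_names lists, and the record values are assumptions for illustration only; they are not the real ED2013 layout.

```python
import io

import pandas as pd

# Hypothetical layout: month (2 chars), day of week (1 char), arrival time (4 chars).
sample_widths = [2, 1, 4]
sample_names = ['VMONTH', 'VDAYR', 'ARRTIME']

# Two made-up records with no delimiters; each field is located by position only.
raw = "0130647\n0131841\n"

df_sample = pd.read_fwf(
    io.StringIO(raw),
    widths=sample_widths,
    header=None,
    dtype='str'  # keep values as strings so leading zeros survive
)
df_sample.columns = sample_names

print(df_sample)
```

Reading every column as a string keeps coded values such as 01 intact; they can be converted to numeric types later, once their meaning has been checked against the documentation.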

Let's print a preview of our dataset to confirm it was imported correctly:

print(df_ed.head(n=5))

The output should look similar to the following:

  VMONTH VDAYR ARRTIME WAITTIME   LOV  AGE AGER AGEDAYS RESIDNCE SEX ...   
0     01     3    0647     0033  0058  046    4     -07       01   2 ...    
1     01     3    1841     0109  0150  056    4     -07       01   2 ...    
2     01     3    1333     0084  0198  037    3     -07       01   2 ...    
3     01     3    1401     0159  0276  007    1     -07       01   1 ...    
4     01     4    1947     0114  0248  053    4     -07       01   1 ...    

  RX12V3C1 RX12V3C2 RX12V3C3 RX12V3C4 SETTYPE YEAR  CSTRATM  CPSUM  PATWT
0      nan      nan      nan      nan       3 2013 20113201 100020 002945
1      nan      nan      nan      nan       3 2013 20113201 100020 002945
2      nan      nan      nan      nan       3 2013 20113201 100020 002945
3      nan      nan      nan      nan       3 2013 20113201 100020 002945
4      nan      nan      nan      nan       3 2013 20113201 100020 002945

  EDWT  
0  nan  
1  nan  
2  nan  
3  nan  
4  nan  

[5 rows x 579 columns]

Comparing the column values with their meanings in the documentation confirms that the data has been imported correctly. The nan values correspond to blank fields in the data file.
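
The link between blank fields and nan values can be illustrated with a small sketch. It assumes a hypothetical three-field layout and made-up records (not the real ED2013 data), and leaves one field blank to show how pandas reports it as missing.

```python
import io

import pandas as pd

# Hypothetical layout: month (2 chars), day of week (1 char), arrival time (4 chars).
blank_widths = [2, 1, 4]
blank_names = ['VMONTH', 'VDAYR', 'ARRTIME']

# In the first record the day-of-week field is a single blank space.
raw_with_blank = "01 0647\n0131841\n"

df_blank = pd.read_fwf(
    io.StringIO(raw_with_blank),
    widths=blank_widths,
    header=None,
    dtype='str'
)
df_blank.columns = blank_names

# Count missing values per column; blank fields are counted as missing.
missing_counts = df_blank.isna().sum()
print(missing_counts)
```

Running the same isna().sum() call on df_ed would give a per-column count of blank fields, which can be a quick way to spot columns that are mostly empty.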

Finally, as another check, let's look at the dimensions of the DataFrame and confirm that there are 24,777 rows and 579 columns:

print(df_ed.shape)

The output should look similar to the following:

(24777, 579)
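
If you would rather have the notebook stop on a mismatch than compare the numbers by eye, one option is a small helper like the sketch below. The check_shape function is a hypothetical name introduced here for illustration, and the demo DataFrame is a made-up stand-in for df_ed.

```python
import pandas as pd


def check_shape(df, expected_rows, expected_cols):
    """Return df.shape, raising ValueError if it differs from the expected dimensions."""
    expected = (expected_rows, expected_cols)
    if df.shape != expected:
        raise ValueError(f"Expected shape {expected}, got {df.shape}")
    return df.shape


# Made-up stand-in for df_ed; with the real data the call would be
# check_shape(df_ed, 24777, 579).
df_demo = pd.DataFrame({'VMONTH': ['01', '01'], 'VDAYR': ['3', '4']})

print(check_shape(df_demo, 2, 2))
```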

Now that the data has been imported correctly, let's set up our response variable.
