Data

In this chapter, we are going to use the Physical Activity Monitoring Data Set (PAMAP2), published in the Machine Learning Repository of the University of California, Irvine: https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring

The full dataset contains 52 input features and 3,850,505 events describing 18 different physical activities (for example, walking, cycling, running, watching TV). The data was recorded by a heart rate monitor and three inertial measurement units located on the wrist, the chest, and the dominant side's ankle. Each event carries a timestamp and an activity label describing the ground truth. Missing values are indicated by the value NaN. Furthermore, some sensor columns are marked as invalid ("orientation"; see the dataset description):

Figure 1: Properties of the dataset as published in the Machine Learning Repository of the University of California, Irvine.

The dataset is a good example for activity recognition: we would like to train a robust model that can predict the performed activity from incoming data produced by physical sensors.

Furthermore, the dataset is spread over multiple files, each holding the measurements of a single subject. This mirrors a real-life situation in which data is produced by multiple sources, so we will need to utilize Spark's ability to read from a directory and merge the files to make training/test datasets.
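The idea of merging per-subject files can be sketched in plain Python, independent of Spark. This is only an illustration: the helper name `read_measurement_files` and the `*.dat` file pattern are assumptions, and Spark performs the equivalent merge in a distributed way when it is given a directory path.

```python
import glob
import os


def read_measurement_files(directory, pattern="*.dat"):
    """Read every matching file in `directory` and merge the rows into one list.

    Each returned item is a (file_name, line) pair, so the originating
    subject file stays identifiable after the merge.
    """
    rows = []
    # Sort the paths so the merged order is deterministic.
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append((os.path.basename(path), line))
    return rows
```

Keeping the file name next to each row is a design choice: it makes it possible to split training and test data by subject later on.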

The following lines show a sample of the data. There are a couple of important observations that are worth noting:

  • Individual values are separated by a single space character
  • The first value in each row represents a timestamp, while the second value holds the activityId

199.38 0 NaN 34.1875 1.54285 7.86975 5.88674 1.57679 7.65264 5.84959 -0.0855996 ... 1 0 0 0
199.39 11 NaN 34.1875 1.46513 7.94554 5.80834 1.5336 7.81914 5.92477 -0.0907069 ...  1 0 0 0 
199.4 11 NaN 34.1875 1.41585 7.82933 5.5001 1.56628 8.03042 6.01488 -0.0399161 ...  1 0 0 0 
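A minimal Python sketch of how such a row could be parsed follows. The function name `parse_row` is an assumption for illustration, and the example row is a shortened version of a sample line (real rows have 54 columns).

```python
import math


def parse_row(line):
    """Split a space-separated row into (timestamp, activity_id, features)."""
    values = line.split()  # splits on runs of whitespace
    timestamp = float(values[0])
    activity_id = int(values[1])
    # float() accepts the literal "NaN", so missing values become float("nan").
    features = [float(v) for v in values[2:]]
    return timestamp, activity_id, features


# Shortened sample row: timestamp, activityId, then three feature values.
timestamp, activity_id, features = parse_row("199.39 11 NaN 34.1875 1.46513")

# Positions of missing values within the feature list.
missing = [i for i, v in enumerate(features) if math.isnan(v)]
```

Because NaN never compares equal to itself, missing values have to be detected with `math.isnan` rather than with `==`.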

The activityId is a numeric value; hence, we need a translation table to map each ID to its corresponding activity label. The dataset provides this table, which we show as follows:

 1  lying
 2  sitting
 3  standing
 4  walking
 5  running
 6  cycling
 7  Nordic walking
 9  watching TV
10  computer work
11  car driving
12  ascending stairs
13  descending stairs
16  vacuum cleaning
17  ironing
18  folding laundry
19  house cleaning
20  playing soccer
24  rope jumping
 0  other (transient activities)

The example lines represent one "other activity" and then two measurements representing "car driving".
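The translation table above can be written as a simple lookup. The following Python sketch is illustrative only; the names `ACTIVITY_LABELS` and `activity_label` are assumptions, while the IDs and labels are copied from the table.

```python
# Activity IDs and labels as listed in the translation table.
ACTIVITY_LABELS = {
    0: "other (transient activities)",
    1: "lying",
    2: "sitting",
    3: "standing",
    4: "walking",
    5: "running",
    6: "cycling",
    7: "Nordic walking",
    9: "watching TV",
    10: "computer work",
    11: "car driving",
    12: "ascending stairs",
    13: "descending stairs",
    16: "vacuum cleaning",
    17: "ironing",
    18: "folding laundry",
    19: "house cleaning",
    20: "playing soccer",
    24: "rope jumping",
}


def activity_label(activity_id):
    """Translate a numeric activityId into its label ("unknown" if unlisted)."""
    return ACTIVITY_LABELS.get(activity_id, "unknown")


# The activityId values of the three sample rows.
labels = [activity_label(i) for i in [0, 11, 11]]
```

Note that the IDs are not contiguous (for example, 8, 14, and 15 are unused), which is why a dictionary fits better than a list indexed by ID.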

The third column contains heart rate measurements, while the remaining columns hold data from three inertial measurement units: columns 4-20 come from the hand sensor, columns 21-37 from the chest sensor, and columns 38-54 from the ankle sensor. Each sensor measures 17 different values, including temperature, 3-D acceleration, gyroscope and magnetometer data, and orientation. However, the orientation columns are marked as invalid in this dataset.
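The column layout can be captured in a small Python helper that converts the 1-based column numbers used in the text into 0-based list indices. This is a sketch under stated assumptions: the names `sensor_columns` and `select` are hypothetical, and the position of the orientation values (the last four values of each 17-value block) is taken from the dataset description rather than from this chapter.

```python
SENSOR_WIDTH = 17  # values per inertial measurement unit

# 1-based first column of each sensor block, as described in the text.
SENSOR_START = {"hand": 4, "chest": 21, "ankle": 38}

# Assumption (dataset description): orientation occupies the last
# 4 values of each sensor block.
ORIENTATION_WIDTH = 4


def sensor_columns(sensor, drop_orientation=False):
    """Return the 0-based column indices of one sensor block."""
    start = SENSOR_START[sensor] - 1  # 1-based column -> 0-based index
    width = SENSOR_WIDTH
    if drop_orientation:
        width -= ORIENTATION_WIDTH
    return list(range(start, start + width))


def select(row, indices):
    """Pick the values at the given indices from a parsed row."""
    return [row[i] for i in indices]


# Dummy 54-column row in which each value equals its 1-based column number.
row = list(range(1, 55))
chest_values = select(row, sensor_columns("chest"))
```

With the dummy row, `chest_values` holds the numbers 21 through 37, which matches the chest column range given in the text.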

The input data pack contains two folders: protocol, and optional measurements, which contains data from a few subjects who performed some additional activities. In this chapter, we are going to use only the data from the optional folder.
