Understanding the structure of the data

The following diagram depicts the design of the system, to help you understand the problem and the structure of the data that's collected:

Fig 4.1: Network traffic and bandwidth usage data for Wi-Fi traffic and storage in Elasticsearch

The system collects the data as follows:

  • On the left half of the diagram, there are multiple squares, each representing one customer's premises, along with the Wi-Fi routers deployed on that site and all of the devices connected to those routers. The connected devices include laptops, mobile devices, desktop computers, and so on. Each device has a unique MAC address and a user associated with it.
  • The right half of the diagram represents the centralized system, which collects and stores data from multiple customers into a centralized Elasticsearch cluster. Our focus will be on how to design this centralized Elasticsearch cluster and the index to gain meaningful insight.
  • The routers at each customer site collect additional metrics for each connected device, such as data downloaded, data uploaded, and URLs or domain names accessed by the client in a specific time interval. The Wi-Fi routers collect such metrics and send them to the centralized API server periodically, for long-term storage and analysis.
  • When the data is sent by the Wi-Fi routers, it contains only a few fields: mainly the metrics captured by the routers and the MAC address of the end device for which those metrics were collected. Before storing the records in Elasticsearch, the API server enriches them with more information that is useful for analytics. It looks up the MAC address to find the username of the user that the device is assigned to, and it also looks up additional dimensions, such as the department of the user.

What are metrics and dimensions? Metric is a common term used in the analytics world to represent a numerical measure. A common example of a metric is the amount of data downloaded or uploaded in a given time period. The term dimension usually refers to extra or auxiliary information, typically of the string datatype. In this example, we are using a MAC address to look up auxiliary information related to that MAC address, namely the username of the user that the device is assigned to in the system. The name of the department the user belongs to is another example of a dimension.
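
The enrichment step described above can be sketched in a few lines of Python. This is only an illustration, not the actual API server: the `DEVICE_DIRECTORY` table, the `enrich` function, and the `"unknown"` fallback are hypothetical names and choices made for this sketch. It shows a raw router record, which carries metrics and a MAC address, gaining the `username` and `department` dimensions through a lookup on the MAC address.

```python
from typing import Dict

# Hypothetical lookup table mapping a device MAC address to its dimensions.
# A real API server would presumably query a user directory or database.
DEVICE_DIRECTORY: Dict[str, Dict[str, str]] = {
    "c6:ec:7d:c6:3d:8d": {"username": "Pedro Harrison", "department": "Operations"},
}


def enrich(raw: dict) -> dict:
    """Return a copy of a router record with dimensions added by MAC lookup."""
    enriched = dict(raw)  # leave the incoming record untouched
    dimensions = DEVICE_DIRECTORY.get(raw["mac"].lower(), {})
    # Assumption: devices missing from the directory are labeled "unknown".
    enriched["username"] = dimensions.get("username", "unknown")
    enriched["department"] = dimensions.get("department", "unknown")
    return enriched


# A raw record as a router might send it: metrics plus the device MAC address.
raw_record = {"mac": "c6:ec:7d:c6:3d:8d", "uploadTotal": 1340, "downloadTotal": 2129}
enriched_record = enrich(raw_record)
print(enriched_record)
```

The metrics (`uploadTotal`, `downloadTotal`) pass through unchanged, while the dimensions are added as plain string fields, which keeps the resulting record flat.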

Finally, the enriched records are stored in Elasticsearch in a flat data structure. One record looks as follows:

"_source": {
"customer": "Google", // Customer to which the WiFi router and device belong
"accessPointId": "AP-59484", // Identifier of the WiFi router or Access Point
"time": 1506148631061, // Time of the record in milliseconds since Epoch Jan 1, 1970
"mac": "c6:ec:7d:c6:3d:8d", // MAC address of the client device

"username": "Pedro Harrison", // Name of the user to whom the device is assigned
"department": "Operations", // Department of the user to whom the device is assigned

"application": "CNBC", // Application name or domain name for which traffic is reported
"category": "News", // Category of the application

"networkId": "Internal", // SSID of the network
"band": "5 GHz", // Band 5 GHz or 2.4 GHz

"location": "23.102789,72.595381", // latitude & longitude separated by comma

"uploadTotal": 1340, // Bytes uploaded since the last report
"downloadTotal": 2129, // Bytes downloaded since the last report
"usage": 3469, // Total bytes downloaded and uploaded in current period

"uploadCurrent": 22.33, // Upload speed in bytes/sec in current period
"downloadCurrent": 35.48, // Download speed in bytes/sec in current period
"bandwidth": 57.82, // Total speed in bytes/sec (Upload speed + download speed)

"signalStrength": -25, // Signal strength between WiFi router and device
...
}

One record contains various metrics for the given end client device at the given time.
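
To make the field semantics concrete, here is a small Python sketch that interprets a few fields of the sample record. It is not part of the system described here; the variable names and the rounding tolerance are assumptions made for illustration. It converts `time` from epoch milliseconds to a UTC timestamp, splits `location` into latitude and longitude, and checks that `usage` and `bandwidth` are the sums their comments describe.

```python
from datetime import datetime, timedelta, timezone

# A subset of the sample record shown above.
record = {
    "time": 1506148631061,
    "location": "23.102789,72.595381",
    "uploadTotal": 1340,
    "downloadTotal": 2129,
    "usage": 3469,
    "uploadCurrent": 22.33,
    "downloadCurrent": 35.48,
    "bandwidth": 57.82,
}

# "time" is milliseconds since the Unix epoch; integer timedelta math avoids float error.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
timestamp = EPOCH + timedelta(milliseconds=record["time"])

# "location" is a "latitude,longitude" string.
latitude, longitude = (float(part) for part in record["location"].split(","))

# "usage" is the sum of the uploaded and downloaded byte counts.
usage_ok = record["usage"] == record["uploadTotal"] + record["downloadTotal"]

# "bandwidth" is upload speed plus download speed. Assumption: the stored values
# are rounded to two decimals, so allow a small tolerance in the comparison.
speed_sum = record["uploadCurrent"] + record["downloadCurrent"]
bandwidth_ok = abs(record["bandwidth"] - speed_sum) < 0.02

print(timestamp.isoformat(), latitude, longitude, usage_ok, bandwidth_ok)
```

For the sample record, the timestamp falls on 23 September 2017 (UTC), and both consistency checks pass under the stated tolerance.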

Please note that all the data included in this example is synthetic. Although the names of customers, users, and MAC addresses look realistic, the data was generated using a simulator. The data doesn't belong to any real customers.

Now that we know what our data and each record represent, let's load the data into our local instance.
