The first step is to select relevant data that captures most of the system dynamics we want the neural network to reproduce. In our case, we need data that is relevant for weather forecasting.
In this chapter, we are going to use data from the Brazilian Institute of Meteorology (INMET - http://www.inmet.gov.br/, in Portuguese), which is freely available on the Internet and which we have the right to use in this book. However, the reader may use any free weather database from the Internet while developing applications. Some English-language sources are listed as follows:
Most weather databases provide nearly the same variables:
This data is usually collected from meteorological stations, satellites, or radars, on an hourly or daily basis.
A neural network works as a nonlinear block with a predefined number of inputs and outputs, so we have to select the role that each weather variable will play in this application. In other words, we have to choose which variable(s) the neural network is going to predict and which variables it will use as inputs.
While defining a problem to apply neural networks to, we need to consider one or more predefined target variables: predicting temperature, forecasting precipitation, measuring insolation, and so on. However, in some cases, one may want to model all the variables and find the causal relationships between them. To identify a causal relationship, there are a number of tools that can be applied:
For the sake of simplicity, we are not going to explore these tools in this chapter; the reader is referred to the references [Dowdy & Wearden, 1983; Pearl, 2000; Fortuna et al., 2007] for more details about them. Instead, since we want to demonstrate the power of neural networks in predicting weather, we will choose to predict the average temperature of a given day from the other four variables, a choice supported by the technical literature cited in the preceding references.
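As a sketch, this variable selection amounts to slicing the dataset into input columns and a target column. The column order and numeric values below are assumptions for illustration only, not the actual INMET layout:

```python
import numpy as np

# Toy daily records (values made up): each row is one day, with the
# assumed column order precipitation, max temperature, min temperature,
# insolation, and mean temperature (the variable we want to predict).
data = np.array([
    [0.0, 30.1, 21.3, 8.2, 25.7],
    [5.2, 27.4, 20.0, 4.1, 23.5],
    [1.1, 29.0, 20.8, 6.7, 24.9],
])

inputs = data[:, :4]   # the four explanatory variables
target = data[:, 4]    # mean temperature, the network's output
```

The same slicing applies regardless of which variable is chosen as the target; only the column indices change.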
Issues often arise while getting data from a source. The most common problems are as follows:
To handle each of these issues, one needs to filter the selected data. A neural network will reproduce exactly the dynamics of the data it is trained with, so we have to be careful not to feed it bad data. Usually, records containing bad data are removed from the dataset, ensuring that only "good" data are fed to the network.
To better understand filtering, let's consider the dataset as a big matrix containing n measurements and m variables:

$$X = \begin{bmatrix} a_1(1) & a_2(1) & \cdots & a_m(1) \\ a_1(2) & a_2(2) & \cdots & a_m(2) \\ \vdots & \vdots & \ddots & \vdots \\ a_1(n) & a_2(n) & \cdots & a_m(n) \end{bmatrix}$$

Where $a_j(i)$ denotes the measurement of variable j at moment i.
So, our task is to find the bad records and delete them. Mathematically, there are a number of ways of identifying a bad record. For error measurement and outlier detection, the following three-sigma rule works very well:

$$d_i = \frac{\left| x_i - E[X] \right|}{\sigma_X}$$

Where $x_i$ denotes the value of the ith measurement, $E[X]$ represents the average value, $\sigma_X$ indicates the standard deviation, and $d_i$ refers to the weighted distance from the average. If $d_i$ is greater than three, that is, if the ith measurement lies more than three standard deviations from the average, it is labeled as a bad measurement; and even though the other variables from the same instance (row of the matrix) may be good, the entire row should be discarded from the dataset.
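A minimal sketch of this filter, using made-up numbers in which one temperature reading is an obvious sensor error, could look like this:

```python
import numpy as np

# Made-up records: column 0 is precipitation (mm), column 1 is mean
# temperature (C). The 95.0 reading is a deliberate bad measurement.
rain = [0.0, 2.1, 0.5, 1.2, 3.4, 0.0, 5.0, 2.2, 1.8, 0.3,
        4.1, 0.9, 2.7, 1.5, 0.0, 3.8, 2.0, 1.1, 0.6, 2.4]
temp = [24.1, 25.3, 23.8, 24.9, 25.7, 24.4, 23.9, 25.1, 24.6, 25.0,
        24.2, 25.5, 23.7, 24.8, 25.2, 24.0, 25.4, 24.7, 24.3, 95.0]
X = np.column_stack([rain, temp])

# d_i = |x_i - E[X]| / sigma_X, computed per variable (column)
d = np.abs(X - X.mean(axis=0)) / X.std(axis=0)

# keep only the rows in which every variable is within three sigmas;
# the row holding the 95.0 reading is dropped
good = (d <= 3.0).all(axis=1)
filtered = X[good]
```

Note that a gross outlier also inflates the estimated standard deviation of its column, so the three-sigma rule needs a reasonably long series of measurements to flag it reliably.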