No data, no neural net – selecting data

The first step is to select appropriate, relevant data that carries most of the dynamics of the system we want the neural network to reproduce. In our case, we need to select data that is relevant for weather forecasting.

Tip

While selecting data, getting an expert opinion about the process and its variables can be really helpful. An expert helps in understanding the relationships between the variables, and therefore in selecting them appropriately.

In this chapter, we are going to use data from the Brazilian Institute of Meteorology (INMET - http://www.inmet.gov.br/, in Portuguese), which is freely available on the Internet and which we have the rights to use in this book. However, the reader may use any free weather database from the Internet while developing applications. Some examples of English-language sources are listed as follows:

Knowing the problem – weather variables

Most weather databases record nearly the same set of variables:

  • Temperature (°C)
  • Humidity (%)
  • Pressure (mbar)
  • Wind speed (m/s)
  • Wind direction (°)
  • Precipitation (mm)
  • Sunny hours (h)
  • Sun energy (W/m²)

This data is usually collected from meteorological stations, satellites, or radars, on an hourly or daily basis.

Tip

Depending on the collection frequency, some variables may be summarized with average, minimum, or maximum values.
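As a sketch of such summarization, the following snippet aggregates hypothetical hourly temperature readings into daily average, minimum, and maximum values using pandas. The column name and values are invented for illustration, not taken from INMET:

```python
# Sketch: summarizing hourly readings into daily statistics.
# The readings below are hypothetical.
import pandas as pd

hourly = pd.DataFrame(
    {"temperature": [18.0, 17.5, 21.0, 24.5, 23.0, 19.5]},
    index=pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 12:00",
        "2024-01-01 18:00", "2024-01-02 00:00", "2024-01-02 12:00",
    ]),
)

# Resample to one record per day, keeping average, minimum, and maximum.
daily = hourly["temperature"].resample("D").agg(["mean", "min", "max"])
print(daily)
```

The same `resample` call works for any collection frequency (for example, `"h"` for hourly or `"ME"` for monthly summaries).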

The data units may also vary from source to source; that's why the units should always be observed.

Choosing input and output variables

Neural networks work as a nonlinear block that may have a predefined number of inputs and outputs, so we have to select the role that each weather variable will play in this application. In other words, we have to choose which variable(s) the neural network is going to predict and by using which input variables.

Tip

Regarding time series variables, one can derive new variables by applying historical data. This means that given a certain date, one may consider this date's values and the data collected (and/or summarized) from past dates, therefore extending the number of variables.
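The idea of deriving new variables from historical data can be sketched as follows, using lagged copies of a hypothetical daily temperature series (the values are made up for illustration):

```python
# Sketch of deriving historical (lagged) variables from a time series.
# The temperature values are hypothetical.
import pandas as pd

temps = pd.Series([20.0, 21.5, 19.0, 22.0, 23.5],
                  index=pd.date_range("2024-01-01", periods=5, freq="D"),
                  name="temp")

features = pd.DataFrame({
    "temp": temps,             # current day's value
    "temp_lag1": temps.shift(1),  # yesterday's value
    "temp_lag2": temps.shift(2),  # the value two days ago
})

# The first rows have no history available, so they are dropped.
features = features.dropna()
print(features)
```

Each lag becomes an additional input variable, which is how the number of variables is extended as described above.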

When defining a problem to be solved with neural networks, we need to consider one or more predefined target variables: predicting temperature, forecasting precipitation, measuring insolation, and so on. However, in some cases, one may want to model all the variables and find the causal relationships between them. To identify a causal relationship, there are a number of tools that can be applied:

  • Cross-correlation
  • Pearson's coefficient
  • Statistical analysis
  • Bayesian networks

For the sake of simplicity, we are not going to explore these tools in this chapter; the reader is referred to the references [Dowdy & Wearden, 1983; Pearl, 2000; Fortuna et al., 2007] for more details about them. Instead, since we want to demonstrate the power of neural networks in predicting weather, we will predict the average temperature of a given day from the other four variables, on the basis of the current technical literature cited in the preceding references.
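As a small illustration of one of the tools listed above, Pearson's coefficient measures the strength of the linear relationship between two variables. The sample values below are invented; in practice, one would use the real dataset columns:

```python
# Illustrative check of linear association between two variables
# using Pearson's coefficient. The sample values are hypothetical.
import numpy as np

humidity = np.array([85.0, 78.0, 92.0, 60.0, 55.0])
temperature = np.array([19.0, 22.0, 18.0, 27.0, 29.0])

# corrcoef returns the correlation matrix; the off-diagonal entry
# is Pearson's r between the two variables.
r = np.corrcoef(humidity, temperature)[0, 1]
print(round(r, 3))
```

Values of r close to +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear association.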

Removing insignificant behaviors – Data filtering

Data obtained from a source often presents issues. The most common problems are as follows:

  • Absence of data in a certain record and variable
  • Error in measurement (for example, when a value is badly labeled)
  • Outliers (for example, when the value is very far from the usual range)

To handle each of these issues, the selected data needs to be filtered. The neural network will reproduce exactly the dynamics present in the data it is trained with, so we have to be careful about feeding it bad data. Usually, records containing bad data are removed from the dataset, ensuring that only "good" data are fed to the network.
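The simplest of these filters, removing records with missing values, can be sketched as follows (the column names and values are hypothetical):

```python
# A minimal sketch of removing records that have missing values.
# Column names and values are hypothetical.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, 22.0],
    "humidity": [80.0, 75.0, np.nan, 78.0],
})

# Discard any row (record) that has at least one missing variable.
clean = data.dropna()
print(len(clean))
```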

To better understand filtering, let's consider the dataset as a big matrix containing n measurements and m variables.

$$
A = \begin{bmatrix}
a_1(1) & a_2(1) & \cdots & a_m(1) \\
a_1(2) & a_2(2) & \cdots & a_m(2) \\
\vdots & \vdots & \ddots & \vdots \\
a_1(n) & a_2(n) & \cdots & a_m(n)
\end{bmatrix}
$$

where $a_j(i)$ denotes the measurement of variable $j$ at moment $i$.

So, our task is to find the bad records and delete them. Mathematically, there are a number of ways of identifying a bad record. For detecting measurement errors and outliers, the following three-sigma rule works very well:

$$
d_i = \frac{\left| x_i - E[X] \right|}{\sigma_X}
$$

where $x_i$ denotes the value of the ith measurement, $E[X]$ represents the average value, $\sigma_X$ indicates the standard deviation, and $d_i$ refers to the weighted distance from the average. If $d_i$ is greater than 3, that is, if the measurement lies more than three standard deviations from the average, the ith measurement is labeled as a bad measurement; even if the other variables of the same instance (row of the matrix) are good, the entire row should be discarded from the dataset.
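The three-sigma rule described above can be sketched as follows. The sample values are invented, with one obvious outlier at the end; note that with very few measurements a single outlier may not exceed three standard deviations, so a reasonable sample size is assumed:

```python
# Sketch of the three-sigma rule: flag a measurement as bad when its
# distance from the average exceeds three standard deviations.
# The sample values are hypothetical; the last one is an outlier.
import numpy as np

x = np.array([19.8, 20.1, 20.3, 19.9, 20.0, 20.2, 19.7, 20.4,
              20.1, 19.9, 20.0, 20.2, 19.8, 20.3, 20.1, 100.0])

d = np.abs(x - x.mean()) / x.std()  # weighted distance d_i
good = x[d <= 3.0]                  # keep measurements with d_i <= 3
print(len(good))
```

In a full dataset, the row (record) containing any flagged measurement would be discarded, as described above.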
