Chapter 9

Online Nonlinear Modeling via Self-Organizing Trees

Nuri Denizcan Vanli^⁎; Suleyman Serdar Kozat^†
^⁎Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, Cambridge, MA, United States
^†Bilkent University, Ankara, Turkey

Abstract

We study online supervised learning and introduce regression and classification algorithms based on self-organizing trees (SOTs), which adaptively partition the feature space into small regions and combine simple local learners defined in these regions. The proposed algorithms sequentially minimize the cumulative loss by learning both the partitioning of the feature space and the parameters of the local learners defined in each region. The output of the algorithm at each time instance is constructed by combining the outputs of a doubly exponential number (in the depth of the SOT) of different predictors defined on this tree with reduced computational and storage complexity. The introduced methods are generic, such that they can incorporate different tree construction methods from the ones presented in this chapter. We present a comprehensive experimental study under stationary and nonstationary environments using benchmark datasets and illustrate remarkable performance improvements with respect to the state-of-the-art methods in the literature.

Keywords

Self-organizing trees; Nonlinear learning; Online learning; Classification and regression trees; Adaptive nonlinear filtering; Nonlinear modeling; Supervised learning

Chapter Points

• We present a nonlinear modeling method for online supervised learning problems.
• Nonlinear modeling is introduced via SOTs, which adaptively partition the feature space to minimize the loss of the algorithm.
• Experimental validation shows huge empirical performance improvements with respect to the state-of-the-art methods.

Acknowledgements

The authors would like to thank Huseyin Ozkan for his contributions to this work.

9.1 Introduction

Nonlinear adaptive learning has been extensively investigated in the signal processing [1–4] and machine learning [5–7] literature, especially for applications where linear modeling is inadequate because of its structural constraint. Although nonlinear approaches can be more powerful than linear methods in modeling, they usually suffer from overfitting, stability and convergence issues [8], which considerably limit their application to signal processing and machine learning problems. These issues are exacerbated in adaptive filtering due to the presence of feedback, which is hard to control even for linear models [9]. Furthermore, for applications involving big data, which require processing input vectors of very large dimension, nonlinear models are usually avoided because their computational complexity grows unmanageably [10]. To overcome these difficulties, tree-based nonlinear adaptive filters or regressors have been introduced as elegant alternatives to linear models, since these efficient methods retain the breadth of nonlinear models while mitigating the overfitting and convergence issues [11–13].

In its most basic form, a tree defines a hierarchical or nested partitioning of the feature space [12]. As an example, consider the binary tree in Fig. 9.1, which partitions a two-dimensional feature space. On this tree, each node is constructed by a bisection of the feature space (where we use hyperplanes for separation), which results in a complete nested and disjoint partitioning of the feature space. After the partitions are defined, the local learners in each region can be chosen as desired. As an example, to solve a regression problem, one can train a linear regressor in each region, which yields an overall piecewise linear regressor. In this sense, tree-based modeling is a natural nonlinear extension of linear models via a tractable nested structure.

Figure 9.1 Feature space partitioning using a binary tree. The partitioning of a two-dimensional feature space using a complete tree of depth 2 with hyperplanes for separation. The feature space is first bisected by $s_{t, λ}$, which is defined by the hyperplane $ϕ_{t, λ}$, where the region in the direction of the $ϕ_{t, λ}$ vector corresponds to the child with label “1”. We then continue to bisect the children regions using $s_{t, 0}$ and $s_{t, 1}$, defined by $ϕ_{t, 0}$ and $ϕ_{t, 1}$, respectively.

Although nonlinear modeling using trees is a powerful and efficient method, there exist several algorithmic parameters and design choices that affect their performance in many applications [11]. Tuning these parameters is a difficult task for applications involving nonstationary data exhibiting saturation effects, threshold phenomena or chaotic behavior [14]. In particular, the performance of tree-based models heavily depends on a careful partitioning of the feature space. Selection of a good partition is essential to balance the bias and variance of the regressor [12]. As an example, even for a uniform binary tree, while increasing the depth of the tree improves the modeling power, such an increase usually results in overfitting [15]. To address this issue, there exist nonlinear modeling algorithms that avoid such a direct commitment to a particular partition but instead construct a weighted average of all possible partitions (or equivalently, piece-wise models) defined on a tree [6,7,16,17]. Note that a full binary tree of depth d defines a doubly exponential number of different partitions of the feature space [18]; for an example, see Fig. 9.2. Each of these partitions can be represented by a certain collection of the nodes of the tree, where each node represents a particular region of the feature space. Any of these partitions can be used to construct a nonlinear model, e.g., by training a linear model in each region, we can obtain a piece-wise linear model. Instead of selecting one of these partitions and fixing it as the nonlinear model, one can run all partitions in parallel and combine their outputs using a mixture-of-experts approach. Such methods are shown to mitigate the bias–variance tradeoff in a deterministic framework [6,7,16,19]. However, these methods are naturally constrained to work on a fixed partitioning structure, i.e., the partitions are fixed and cannot be adapted to data.

Figure 9.2 Example partitioning for a binary classification problem. The left figure shows an example partitioning of a two-dimensional feature space using a depth-2 tree. The active region corresponding to a node is shown colored, where the dashed line represents the separating hyperplane at that node, and the two different colored subregions in a node represent the local classifier trained in that region. The right figure shows all different partitions (and consequently classifiers) defined by the tree on the left.

Although there exist numerous methods to partition the feature space, the split criteria are typically chosen a priori and kept fixed, as in dyadic partitioning [20], and a specific loss (e.g., the Gini index [21]) is minimized separately for each node. For instance, multivariate trees are extended to allow the simultaneous use of functional inner and leaf nodes to draw a decision in [13]. Similarly, the node-specific individual decisions are combined in [22] via the context tree weighting method [23] and a piece-wise linear model for sequential classification is obtained. Since the partitions in these methods are fixed and chosen even before the processing starts, the nonlinear modeling capability of such methods is very limited and significantly deteriorates in cases of high dimensionality [24].

To resolve this issue, we introduce self-organizing trees (SOTs) that jointly learn the optimal feature space partitioning to minimize the loss of the algorithm. In particular, we consider a binary tree where a separator (e.g., a hyperplane) is used to bisect the feature space in a nested manner, and an online linear predictor is assigned to each node. The sequential losses of these node predictors are combined (with their corresponding weights that are sequentially learned) into a global loss that is parameterized via the separator functions and the parameters of the node predictors. We minimize this global loss using online gradient descent, updating the complete set of SOT parameters (the separators, the node predictors and the combination weights) at each time instance. The resulting predictor is a highly dynamic SOT structure that jointly, in an online and adaptive manner, learns the region classifiers and the optimal feature space partitioning, and hence provides efficient nonlinear modeling with multiple learning machines. In this respect, the proposed method is remarkably robust to drifting source statistics, i.e., nonstationarity. Since our approach is essentially based on a finite combination of linear models, it generalizes well and overfits only to a limited extent, if at all (as also shown by an extensive set of experiments).

9.2 Self-Organizing Trees for Regression Problems

In this section, we consider the sequential nonlinear regression problem, where we observe a desired signal $\{d_{t}\}_{t ⩾ 1}$, $d_{t} \in \mathbb{R}$, and regression vectors $\{x_{t}\}_{t ⩾ 1}$, $x_{t} \in \mathbb{R}^{p}$, such that we sequentially estimate $d_{t}$ by

${\hat{d}}_{t} = f_{t} (x_{t}),$

where $f_{t} (\cdot)$ is the adaptive nonlinear regression function defined by the SOT. At each time t, the regression error of the algorithm is given by

$e_{t} = d_{t} - {\hat{d}}_{t}$

and the objective of the algorithm is to minimize the square error loss $\sum_{t = 1}^{T} e_{t}^{2}$ , where T is the number of observed samples.

9.2.1 Notation

We first introduce a labeling for the tree nodes following [23]. The root node is labeled with the empty binary string λ and, assuming that a node has a label n, where n is a binary string, we label its upper and lower children as n1 and n0, respectively. Here we emphasize that a string can only take its letters from the binary alphabet $\{0, 1\}$, where 0 refers to the lower child and 1 refers to the upper child of a node. We also introduce the notion of a prefix of a string. We say that a string $n^{'} = q_{1}^{'} \dots q_{l^{'}}^{'}$ is a prefix of the string $n = q_{1} \dots q_{l}$ if $l^{'} ⩽ l$ and $q_{i}^{'} = q_{i}$ for all $i = 1, \dots, l^{'}$, and the empty string λ is a prefix of all strings. Let $P (n)$ represent all prefixes of the string n, i.e., $P (n) ≜ \{n_{0}, \dots, n_{l}\}$, where $l ≜ l (n)$ is the length of the string n, $n_{i}$ is the string with $l (n_{i}) = i$ and $n_{0} = λ$ is the empty string, such that the first i letters of the string n form the string $n_{i}$ for $i = 0, \dots, l$.

For a given SOT of depth D, we let $N_{D}$ denote all nodes defined on this SOT and $L_{D}$ denote all leaf nodes defined on this SOT. We also let $β_{D}$ denote the number of partitions defined on this SOT. A tree of depth $j + 1$ is either kept as a single region or split into two subtrees of depth j that are partitioned independently, which yields the recursion $β_{j + 1} = β_{j}^{2} + 1$ for all $j ⩾ 0$, with the base case $β_{0} = 1$. For a given partition k, we let $M_{k}$ denote the set of all nodes in this partition.
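As a quick numerical check of this recursion, the following Python sketch tabulates $β_{D}$ next to the node count $2^{D + 1} - 1$, starting from $β_{0} = 1$. The function name `num_partitions` is ours and does not appear in the chapter.

```python
def num_partitions(depth):
    """Return beta_depth via beta_0 = 1 and beta_{j+1} = beta_j**2 + 1."""
    beta = 1
    for _ in range(depth):
        beta = beta * beta + 1
    return beta


for depth in range(5):
    nodes = 2 ** (depth + 1) - 1  # number of nodes in a full binary tree
    print(f"D={depth}: {nodes} nodes, {num_partitions(depth)} partitions")
```

The partition count grows doubly exponentially in the depth while the node count grows only exponentially, which is the gap that the node-level bookkeeping later in the chapter exploits.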

For a node $n \in N_{D}$ (defined on the SOT of depth D), we define $S_{D} (n) ≜ \{\overset{´}{n} \in N_{D} \mid n \in P (\overset{´}{n})\}$ as the set of all nodes of the SOT of depth D whose set of prefixes includes node n.

For a node $n \in N_{D}$ (defined on the SOT of depth D) with length $l (n) ⩾ 1$, the total number of partitions that contain n is given by the following product, where the subscript d denotes the depth of the tree (here $d = D$):

$γ_{d} (l (n)) ≜ \prod_{j = 1}^{l (n)} β_{d - j} .$

For the case where $l (n) = 0$ (i.e., for $n = λ$ ), one can clearly observe that there exists only one partition containing λ, therefore $γ_{d} (0) = 1$ .
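The product formula can be checked against brute-force enumeration on a small tree. The sketch below is illustrative only: the helper names `partitions`, `beta` and `gamma` are ours, nodes are represented by their binary label strings, and the root is the empty string.

```python
def partitions(node, depth_left):
    """List every partition of the subtree rooted at `node` as a tuple of labels."""
    result = [(node,)]  # keep the subtree as a single region
    if depth_left > 0:
        for lower in partitions(node + "0", depth_left - 1):
            for upper in partitions(node + "1", depth_left - 1):
                result.append(lower + upper)  # split, then partition both children
    return result


def beta(depth):
    """Number of partitions of a tree of the given depth."""
    value = 1
    for _ in range(depth):
        value = value * value + 1
    return value


def gamma(depth, length):
    """gamma_depth(length) = product of beta_{depth - j} for j = 1, ..., length."""
    value = 1
    for j in range(1, length + 1):
        value *= beta(depth - j)
    return value


D = 3
all_partitions = partitions("", D)
for node in ["", "0", "10", "011"]:
    brute_force = sum(node in part for part in all_partitions)
    print(repr(node), brute_force, gamma(D, len(node)))
```

Each factor $β_{d - j}$ counts the free choices in the subtree rooted at the sibling of the length-j prefix of n, which is why the count depends only on the length of n.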

For two nodes $n, \overset{´}{n} \in N_{D}$ (defined on the SOT of depth D), we let $ρ (n, \overset{´}{n})$ denote the number of partitions that contain both n and $\overset{´}{n}$ . Trivially, if $\overset{´}{n} = n$ , then $ρ (n, \overset{´}{n}) = γ_{d} (l (n))$ . If $n \neq \overset{´}{n}$ , then letting $\bar{n}$ denote the longest prefix to both n and $\overset{´}{n}$ , i.e., the longest string in $P (n) \cap P (\overset{´}{n})$ , we obtain

$ρ (n, \overset{´}{n}) ≜ \begin{cases} γ_{d} (l (n)), & \text{if } n = \overset{´}{n}, \\ \dfrac{γ_{d} (l (n)) \, γ_{d - l (\bar{n}) - 1} (l (\overset{´}{n}) - l (\bar{n}) - 1)}{β_{d - l (\bar{n}) - 1}}, & \text{if } n \notin P (\overset{´}{n}) \cup S_{D} (\overset{´}{n}), \\ 0, & \text{otherwise.} \end{cases}$

(9.1)

Since $l (\bar{n}) + 1 ⩽ l (n)$ and $l (\bar{n}) + 1 ⩽ l (\overset{´}{n})$ from the definition of the SOT, we naturally have $ρ (n, \overset{´}{n}) = ρ (\overset{´}{n}, n)$ .
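A brute-force comparison makes (9.1) concrete. The following sketch is an illustration under our own naming (`partitions`, `beta`, `gamma`, `rho`), with nodes given as label strings. It evaluates the closed form for every pair of nodes of a depth-3 tree and compares it with a direct count.

```python
from itertools import product


def partitions(node, depth_left):
    """List every partition of the subtree rooted at `node` as a tuple of labels."""
    result = [(node,)]
    if depth_left > 0:
        for lower in partitions(node + "0", depth_left - 1):
            for upper in partitions(node + "1", depth_left - 1):
                result.append(lower + upper)
    return result


def beta(depth):
    value = 1
    for _ in range(depth):
        value = value * value + 1
    return value


def gamma(depth, length):
    value = 1
    for j in range(1, length + 1):
        value *= beta(depth - j)
    return value


def rho(depth, n, m):
    """Number of partitions containing both n and m, following (9.1)."""
    if n == m:
        return gamma(depth, len(n))
    if n.startswith(m) or m.startswith(n):
        return 0  # an ancestor and its descendant never share a partition
    common = 0  # length of the longest common prefix
    while n[common] == m[common]:
        common += 1
    sub = depth - common - 1  # depth of the subtree that holds m below the split
    return gamma(depth, len(n)) * gamma(sub, len(m) - common - 1) // beta(sub)


D = 3
nodes = ["".join(bits) for length in range(D + 1) for bits in product("01", repeat=length)]
all_partitions = partitions("", D)
mismatches = [
    (n, m)
    for n in nodes
    for m in nodes
    if rho(D, n, m) != sum(n in part and m in part for part in all_partitions)
]
print("pairs checked:", len(nodes) ** 2, "mismatches:", len(mismatches))
```

The integer division is exact because $γ_{d} (l (n))$ always contains the factor $β_{d - l (\bar{n}) - 1}$.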

9.2.2 Construction of the Algorithm

For each node n on the SOT, we define a node predictor

${\hat{d}}_{t, n} = v_{t, n}^{T} x_{t},$

(9.2)

whose parameter $v_{t, n}$ is updated using the online gradient descent algorithm. We also define a separator function for each node n on the SOT except the leaf nodes (note that leaf nodes do not have any children) using the sigmoid function

$s_{t, n} = \frac{1}{1 + \exp (ϕ_{t, n}^{T} x_{t})},$

(9.3)

where $ϕ_{t, n}$ is the normal to the separation plane. We then define the prediction of any partition according to the hierarchical structure of the SOT as the weighted sum of the prediction of the nodes in that partition, where the weighting is determined by the separator functions of the nodes between the leaf node and the root node. In particular, the prediction of the kth partition at time t is defined as follows:

${\hat{d}}_{t}^{(k)} = \sum_{n \in M_{k}} ({\hat{d}}_{t, n} \prod_{i = 0}^{l (n) - 1} s_{t, n_{i}}^{q_{i}}),$

(9.4)

where $n_{i} \in P (n)$ is the prefix of the string n with length i, $q_{i}$ is the letter appended to $n_{i}$ along the path to n, i.e., $n_{i + 1} = n_{i} q_{i}$, and finally $s_{t, n_{i}}^{q_{i}}$ denotes the value of the separator function at node $n_{i}$ such that

$s_{t, n_{i}}^{q_{i}} ≜ \begin{cases} s_{t, n_{i}}, & \text{if } q_{i} = 0, \\ 1 - s_{t, n_{i}}, & \text{otherwise,} \end{cases}$

(9.5)

with $s_{t, n_{i}}$ defined as in (9.3). We emphasize that we dropped the n dependency of $q_{i}$ and $n_{i}$ to simplify notation. Using these definitions, we can construct the final estimate of our algorithm as

${\hat{d}}_{t} = \sum_{k = 1}^{β_{D}} w_{t}^{(k)} {\hat{d}}_{t}^{(k)},$

(9.6)

where $w_{t}^{(k)}$ represents the weight of partition k at time t.

Having found a method to combine the predictions of all partitions to generate the final prediction of the algorithm, we next aim to obtain a low-complexity representation since there are $O ({1.5}^{2^{D}})$ different partitions defined on the SOT and (9.6) requires a storage and computational complexity of $O ({1.5}^{2^{D}})$ . To this end, we denote the product terms in (9.4) as follows:

${\hat{δ}}_{t, n} ≜ {\hat{d}}_{t, n} \prod_{i = 0}^{l (n) - 1} s_{t, n_{i}}^{q_{i}},$

(9.7)

where ${\hat{δ}}_{t, n}$ can be viewed as the estimate of the node n at time t. Then (9.4) can be rewritten as follows:

${\hat{d}}_{t}^{(k)} = \sum_{n \in M_{k}} {\hat{δ}}_{t, n} .$
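To make the soft partitioning concrete, the sketch below evaluates the node estimates of (9.7) and the partition outputs of (9.4) on a depth-2 tree. All numerical values (`x`, `phi`, `v`) are made up for illustration, and the helper names `separator` and `path_weight` are ours. It also prints the sum of the path products over the nodes of each partition, which equals one because every split contributes $s_{t, n}$ and $1 - s_{t, n}$ to its two children.

```python
import math


def separator(phi, x):
    """Sigmoid separator of (9.3): 1 / (1 + exp(phi^T x))."""
    return 1.0 / (1.0 + math.exp(sum(p * xi for p, xi in zip(phi, x))))


def path_weight(node, seps):
    """Product of separator values from the root down to `node`, as in (9.5)."""
    weight = 1.0
    for i, letter in enumerate(node):
        s = seps[node[:i]]  # separator of the length-i prefix
        weight *= s if letter == "0" else 1.0 - s
    return weight


# Hypothetical depth-2 tree on two-dimensional inputs (all numbers are made up).
x = [0.5, -1.0]
phi = {"": [1.0, 0.0], "0": [0.0, 1.0], "1": [-1.0, 1.0]}
labels = ["", "0", "1", "00", "01", "10", "11"]
v = {n: [0.1 * (i + 1), -0.2 * i] for i, n in enumerate(labels)}

seps = {n: separator(p, x) for n, p in phi.items()}
# Node estimate: linear node prediction times the path product, as in (9.7).
delta = {n: sum(a * b for a, b in zip(v[n], x)) * path_weight(n, seps) for n in labels}

partitions = [("",), ("0", "1"), ("0", "10", "11"),
              ("00", "01", "1"), ("00", "01", "10", "11")]
for part in partitions:
    prediction = sum(delta[n] for n in part)         # partition output
    total = sum(path_weight(n, seps) for n in part)  # soft weights of its nodes
    print(part, round(prediction, 4), round(total, 12))
```

Each partition output is therefore a convex combination of the linear predictions of its nodes, with weights that vary smoothly with $x_{t}$.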

Since we now have a compact form to represent the tree and the outputs of each partition, we next introduce a method to calculate the combination weights of $O ({1.5}^{2^{D}})$ partitions in a simplified manner. For this, we assign a particular linear weight to each node. We denote the weight of node n at time t as $w_{t, n}$ and then define the weight of the kth partition as the sum of the weights of its nodes, i.e.,

$w_{t}^{(k)} = \sum_{n \in M_{k}} w_{t, n},$

for all $k \in \{1, \dots, β_{D}\}$. Since we use online gradient descent to update the weight of each partition, the weight of partition k is recursively updated as

$w_{t + 1}^{(k)} = w_{t}^{(k)} + μ_{t} e_{t} {\hat{d}}_{t}^{(k)} .$

This yields the following recursive update on the node weights:

$w_{t + 1, n} = w_{t, n} + μ_{t} e_{t} {\hat{δ}}_{t, n},$

(9.8)

where ${\hat{δ}}_{t, n}$ is defined as in (9.7). This result implies that instead of managing $O ({1.5}^{2^{D}})$ memory locations and making $O ({1.5}^{2^{D}})$ calculations, only keeping track of the weights of every node is sufficient and the number of nodes in a depth-D model is $| N_{D} | = 2^{D + 1} - 1$ . Therefore, we can reduce the storage and computational complexity from $O ({1.5}^{2^{D}})$ to $O (2^{D})$ by performing the update in (9.8) for all $n \in N_{D}$ .

Using these node predictors and weights, we construct the final estimate of our algorithm as follows:

${\hat{d}}_{t} = \sum_{k = 1}^{β_{D}} \left\{ \left( \sum_{n \in M_{k}} w_{t, n} \right) \left( \sum_{n \in M_{k}} {\hat{δ}}_{t, n} \right) \right\} .$

Here, we observe that for any two nodes $n, \overset{´}{n} \in N_{D}$, the product $w_{t, n} {\hat{δ}}_{t, \overset{´}{n}}$ appears $ρ (n, \overset{´}{n})$ times in ${\hat{d}}_{t}$ (cf. (9.1)). Hence, the combination weight of the estimate of the node n at time t can be calculated as follows:

$κ_{t, n} = \sum_{\overset{´}{n} \in N_{D}} ρ (n, \overset{´}{n}) w_{t, \overset{´}{n}} .$

(9.9)

Using the combination weight (9.9), we obtain the final estimate of our algorithm as follows:

${\hat{d}}_{t} = \sum_{n \in N_{D}} κ_{t, n} {\hat{δ}}_{t, n} .$

(9.10)

Note that (9.10) is equal to (9.6) with a storage and computational complexity of $O (4^{D})$ instead of $O ({1.5}^{2^{D}})$ .
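The equivalence between the partition-level sum and the node-level sum can be verified numerically. The sketch below is a check under our own naming, with randomly drawn placeholder values for the node weights and node estimates. It computes the output once by looping over all partitions of a depth-3 tree and once through the combination weights of (9.9).

```python
import random
from itertools import product


def partitions(node, depth_left):
    """List every partition of the subtree rooted at `node` as a tuple of labels."""
    result = [(node,)]
    if depth_left > 0:
        for lower in partitions(node + "0", depth_left - 1):
            for upper in partitions(node + "1", depth_left - 1):
                result.append(lower + upper)
    return result


def beta(depth):
    value = 1
    for _ in range(depth):
        value = value * value + 1
    return value


def gamma(depth, length):
    value = 1
    for j in range(1, length + 1):
        value *= beta(depth - j)
    return value


def rho(depth, n, m):
    """Number of partitions containing both n and m, following (9.1)."""
    if n == m:
        return gamma(depth, len(n))
    if n.startswith(m) or m.startswith(n):
        return 0
    common = 0
    while n[common] == m[common]:
        common += 1
    sub = depth - common - 1
    return gamma(depth, len(n)) * gamma(sub, len(m) - common - 1) // beta(sub)


D = 3
nodes = ["".join(bits) for length in range(D + 1) for bits in product("01", repeat=length)]
rng = random.Random(0)
w = {n: rng.uniform(-1, 1) for n in nodes}      # placeholder node weights
delta = {n: rng.uniform(-1, 1) for n in nodes}  # placeholder node estimates

# Direct evaluation: one term per partition (beta_D terms).
direct = sum(sum(w[n] for n in part) * sum(delta[n] for n in part)
             for part in partitions("", D))

# Node-level evaluation: one combination weight per node.
kappa = {n: sum(rho(D, n, m) * w[m] for m in nodes) for n in nodes}
efficient = sum(kappa[n] * delta[n] for n in nodes)
print(direct, efficient)
```

The node-level evaluation touches every pair of nodes once, which is where the $O (4^{D})$ cost comes from.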

As we derived all the update rules for the node weights and the parameters of the individual node predictors, what remains is to provide an update scheme for the separator functions. To this end, we use the online gradient descent update

$ϕ_{t + 1, n} = ϕ_{t, n} - \frac{1}{2} η_{t} \nabla e_{t}^{2} (ϕ_{t, n}),$

(9.11)

for all nodes $n \in N_{D} ∖ L_{D}$ , where $η_{t}$ is the learning rate of the algorithm and $\nabla e_{t}^{2} (ϕ_{t, n})$ is the derivative of $e_{t}^{2} (ϕ_{t, n})$ with respect to $ϕ_{t, n}$ . After some algebra, we obtain

$\begin{aligned} ϕ_{t + 1, n} &= ϕ_{t, n} + η_{t} e_{t} \frac{\partial {\hat{d}}_{t}}{\partial s_{t, n}} \frac{\partial s_{t, n}}{\partial ϕ_{t, n}} \\ &= ϕ_{t, n} + η_{t} e_{t} \left\{ \sum_{\overset{´}{n} \in N_{D}} κ_{t, \overset{´}{n}} \frac{\partial {\hat{δ}}_{t, \overset{´}{n}}}{\partial s_{t, n}} \right\} \frac{\partial s_{t, n}}{\partial ϕ_{t, n}} \\ &= ϕ_{t, n} + η_{t} e_{t} \left\{ \sum_{q = 0}^{1} \sum_{\overset{´}{n} \in S_{D} (n q)} {(- 1)}^{q} κ_{t, \overset{´}{n}} \frac{{\hat{δ}}_{t, \overset{´}{n}}}{s_{t, n}^{q}} \right\} \frac{\partial s_{t, n}}{\partial ϕ_{t, n}}, \end{aligned}$

(9.12)

where we use the logistic regression classifier as our separator function, i.e., $s_{t, n} = {(1 + \exp (x_{t}^{T} ϕ_{t, n}))}^{- 1}$ . Therefore, we have

$\begin{aligned} \frac{\partial s_{t, n}}{\partial ϕ_{t, n}} &= - {(1 + \exp (x_{t}^{T} ϕ_{t, n}))}^{- 2} \exp (x_{t}^{T} ϕ_{t, n}) x_{t} \\ &= - s_{t, n} (1 - s_{t, n}) x_{t} . \end{aligned}$

(9.13)
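The closed-form gradient in (9.13) can be sanity-checked against central finite differences. The sketch below uses arbitrary test values for `phi` and `x`; the helper names are ours.

```python
import math


def separator(phi, x):
    """s = 1 / (1 + exp(x^T phi))."""
    return 1.0 / (1.0 + math.exp(sum(p * xi for p, xi in zip(phi, x))))


def separator_grad(phi, x):
    """Analytic gradient of s with respect to phi: -s (1 - s) x."""
    s = separator(phi, x)
    return [-s * (1.0 - s) * xi for xi in x]


phi = [0.3, -0.7, 0.2]  # arbitrary test point
x = [1.0, 2.0, -1.5]
eps = 1e-6

numeric = []
for i in range(len(phi)):
    up = phi[:i] + [phi[i] + eps] + phi[i + 1:]
    down = phi[:i] + [phi[i] - eps] + phi[i + 1:]
    numeric.append((separator(up, x) - separator(down, x)) / (2 * eps))

analytic = separator_grad(phi, x)
print(max(abs(a - b) for a, b in zip(analytic, numeric)))
```

The same check can be reused for any alternative separator function by swapping both helpers.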

We emphasize that other separator functions can also be used in a similar way by simply calculating the gradient with respect to the extended direction vector and plugging in (9.12) and (9.13). From (9.13), we observe that $\nabla e_{t}^{2} (ϕ_{t, n})$ includes the product of $s_{t, n}$ and $1 - s_{t, n}$ terms; hence, in order not to slow down the learning rate of our algorithm, we restrict $s^{+} ⩽ s_{t} ⩽ 1 - s^{+}$ for some $0 < s^{+} < 0.5$ . In accordance with this restriction, we define the separator functions as follows:

$s_{t} = s^{+} + \frac{1 - 2 s^{+}}{1 + e^{x_{t}^{T} ϕ_{t}}} .$

(9.14)
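A short sketch of the restricted separator in (9.14) follows. The value `s_plus = 0.05` is a hypothetical choice made only for this illustration. The printed product $s_{t} (1 - s_{t})$ never falls below $s^{+} (1 - s^{+})$, so the gradient factor in (9.13) cannot vanish.

```python
import math


def clipped_separator(phi, x, s_plus=0.05):
    """Separator of (9.14), confined to [s_plus, 1 - s_plus].

    Note: math.exp overflows for very large arguments (roughly above 709),
    so a production version would need a numerically stable sigmoid.
    """
    z = sum(p * xi for p, xi in zip(phi, x))
    return s_plus + (1.0 - 2.0 * s_plus) / (1.0 + math.exp(z))


x = [1.0]
for scale in [-30.0, -2.0, 0.0, 2.0, 30.0]:
    s = clipped_separator([scale], x)
    print(scale, round(s, 4), round(s * (1.0 - s), 4))
```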

According to the update rule in (9.12), the computational complexity of the introduced algorithm is $O (p 4^{D})$. This concludes the construction of the algorithm and a pseudocode is given in Algorithm 1.

Algorithm 1 Self-Organizing Tree Regressor (SOTR).

9.2.3 Convergence of the Algorithm

For Algorithm 1, we have the following convergence guarantee, which implies that our regressor (given in Algorithm 1) asymptotically achieves the performance of the best linear combination of the $O ({1.5}^{2^{D}})$ different adaptive models that can be represented using a depth-D tree with a computational complexity $O (p 4^{D})$. While constructing the algorithm, we refrain from any statistical assumptions on the underlying data, and our algorithm works for any sequence $\{d_{t}\}_{t ⩾ 1}$ of arbitrary length T. Furthermore, one can use this algorithm to learn the region boundaries and then feed this information to the first algorithm to reduce computational complexity.

Theorem 1

Let $\{d_{t}\}_{t ⩾ 1}$ and $\{x_{t}\}_{t ⩾ 1}$ be arbitrary, bounded and real-valued sequences. The predictor ${\hat{d}}_{t}$ given in Algorithm 1 when applied to these sequences yields

$\sum_{t = 1}^{T} {(d_{t} - {\hat{d}}_{t})}^{2} - \min_{w \in \mathbb{R}^{β_{D}}} \sum_{t = 1}^{T} {(d_{t} - w^{T} {\hat{\mathbf{d}}}_{t})}^{2} ⩽ O (\log (T)),$

(9.15)

for all T, when $e_{t}^{2} (w)$ is strongly convex ∀t, where ${\hat{\mathbf{d}}}_{t} = {[{\hat{d}}_{t}^{(1)}, \dots, {\hat{d}}_{t}^{(β_{D})}]}^{T}$ and ${\hat{d}}_{t}^{(k)}$ represents the estimate of $d_{t}$ at time t for the adaptive model $k = 1, \dots, β_{D}$.

Proof of this theorem can be found in Appendix 9.A.1.

9.3 Self-Organizing Trees for Binary Classification Problems

In this section, we study online binary classification, where we observe feature vectors $\{x_{t}\}_{t ⩾ 1}$ and determine their labels $\{y_{t}\}_{t ⩾ 1}$ in an online manner. In particular, the aim is to learn a classification function $f_{t} (x_{t})$ with $x_{t} \in \mathbb{R}^{p}$ and $y_{t} \in \{- 1, 1\}$ such that, when applied in an online manner to any streaming data, the empirical loss of the classifier $f_{t} (\cdot)$, i.e.,

$L_{T} (f_{t}) ≜ \sum_{t = 1}^{T} 1_{\{f_{t} (x_{t}) \neq y_{t}\}},$

(9.16)

is asymptotically as small (after averaging over T) as the empirical loss of the best partition classifier defined over the SOT of depth D. To be more precise, we measure the relative performance of $f_{t}$ with respect to the performance of a partition classifier $f_{t}^{(k)}$, where $k \in \{1, \dots, β_{D}\}$, using the following regret:

$R_{T} (f_{t}; f_{t}^{(k)}) ≜ \frac{L_{T} (f_{t}) - L_{T} (f_{t}^{(k)})}{T},$

(9.17)

for any arbitrary length T. Our aim is then to construct an online algorithm with guaranteed upper bounds on this regret for any partition classifier defined over the SOT.

9.3.1 Construction of the Algorithm

Using the notation described in Section 9.2.1, the output of a partition classifier $k \in \{1, \dots, β_{D}\}$ is constructed as follows. Without loss of generality, suppose that the feature $x_{t}$ has fallen into the region represented by the leaf node $n \in L_{D}$. Then $x_{t}$ is contained in the nodes $n_{0}, \dots, n_{D}$, where $n_{d}$ is the d-letter prefix of n, i.e., $n_{D} = n$ and $n_{0} = λ$. For example, if node $n_{d}$ is contained in partition k, then one can simply set $f_{t}^{(k)} (x_{t}) = f_{t, n_{d}} (x_{t})$. Instead of making a hard selection, we allow an error margin for the classification output $f_{t, n_{d}} (x_{t})$ in order to be able to update the region boundaries later in the construction. To achieve this, for each node contained in partition k, we define a parameter called the path probability to measure the contribution of each leaf node to the classification task at time t. This parameter is equal to the product of the separator functions of the nodes on the path from the root node to the respective node, and it represents the probability that $x_{t}$ should be classified using the region classifier of node $n_{d}$. This path probability (similar to the node predictor definition in (9.7)) is defined as

$P_{t, n_{d}} (x_{t}) ≜ \prod_{i = 0}^{d - 1} s_{t, n_{i}}^{q_{i + 1}} (x_{t}),$

(9.18)

where $s_{t, n_{i}}^{q_{i + 1}} (\cdot)$ represents the value of the partitioning function corresponding to node $n_{i}$ towards the $q_{i + 1}$ direction as in (9.5). We consider that the classification output of node $n_{d}$ can be trusted with a probability of $P_{t, n_{d}} (x_{t})$. This and the other probabilities in our development are independently defined for ease of exposition and to build intuition, i.e., these probabilities are not related to the unknown data statistics in any way and they cannot be regarded as assumptions on the data. Indeed, we do not make any assumptions about the data source.

Intuitively, the path probability is low when the feature vector is close to the region boundaries; hence we may consider classifying that feature vector by another node classifier (e.g., the classifier of the sibling node). Using these path probabilities, we aim to update the region boundaries by learning whether an efficient node classifier is used to classify $x_{t}$, instead of directly assigning $x_{t}$ to node $n_{d}$ and losing a significant degree of freedom. To this end, we define the final output of each node classifier according to a Bernoulli random variable with outcomes $\{- f_{t, n_{d}} (x_{t}), f_{t, n_{d}} (x_{t})\}$, where the probability of the latter outcome is $P_{t, n_{d}} (x_{t})$. Although the final classification output of node $n_{d}$ is generated according to this Bernoulli random variable, we continue to call $f_{t, n_{d}} (x_{t})$ the final classification output of node $n_{d}$, with an abuse of notation. Then the classification output of the partition classifier is set to $f_{t}^{(k)} (x_{t}) = f_{t, n_{d}} (x_{t})$.

Before constructing the SOT classifier, we first introduce certain definitions. Let the instantaneous empirical loss of the proposed classifier $f_{t}$ at time t be denoted by $ℓ_{t} (f_{t}) ≜ 1_{\{f_{t} (x_{t}) \neq y_{t}\}}$. Then the expected empirical loss of this classifier over a sequence of length T can be found by

$L_{T} (f_{t}) = E [\sum_{t = 1}^{T} ℓ_{t} (f_{t})],$

(9.19)

with the expectation taken with respect to the randomization parameters of the classifier $f_{t}$. We also define the effective region of each node $n_{d}$ at time t as follows: $R_{t, n_{d}} ≜ \{x : P_{t, n_{d}} (x) ⩾ {(0.5)}^{d}\}$. According to the aforementioned structure of partition classifiers, the node $n_{d}$ classifies an instance $x_{t}$ only if $x_{t} \in R_{t, n_{d}}$. Therefore, the time accumulated empirical loss of any node n during the data stream is given by

$L_{T, n} ≜ \sum_{t ⩽ T : x_{t} \in R_{t, n}} ℓ_{t} (f_{t, n}) .$

(9.20)

Similarly, the time accumulated empirical loss of a partition classifier k is $L_{T}^{(k)} ≜ \sum_{n \in M_{k}} L_{T, n}$ .

We then use a mixture-of-experts approach to achieve the performance of the best partition classifier that minimizes the accumulated classification error. To this end, we set the final classification output of our algorithm as $f_{t} (x_{t}) = f_{t}^{(k)} (x_{t})$ with probability $w_{t}^{(k)}$, where

$w_{t}^{(k)} = \frac{1}{Z_{t - 1}} 2^{- J (k)} \exp (- b L_{t - 1}^{(k)}),$

$b ⩾ 0$ is a constant controlling the learning rate of the algorithm, $J (k) ⩽ 2 | L (k) | - 1$ represents the number of bits required to code the partition k (which satisfies $\sum_{k = 1}^{β_{D}} 2^{- J (k)} = 1$) and $Z_{t} = \sum_{k = 1}^{β_{D}} 2^{- J (k)} \exp (- b L_{t}^{(k)})$ is the normalization factor.
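The partition weights can be illustrated on a depth-2 tree. The sketch below assumes the usual context-tree code in which every node above the maximum depth spends one bit to signal whether it is split; the chapter only states the bound on $J (k)$, so this coding rule is our assumption. The depth, the constant `b` and the node losses are hypothetical. The code checks that the prior weights $2^{- J (k)}$ sum to one and then forms the normalized weights $w_{t}^{(k)}$.

```python
import math
from fractions import Fraction


def coded_partitions(node, depth_left):
    """List (nodes, bits) for every partition under `node`.

    Assumed code: each node above the maximum depth spends one bit to say
    whether it is split; nodes at the maximum depth need no bit.
    """
    if depth_left == 0:
        return [((node,), 0)]
    result = [((node,), 1)]
    for lower_nodes, lower_bits in coded_partitions(node + "0", depth_left - 1):
        for upper_nodes, upper_bits in coded_partitions(node + "1", depth_left - 1):
            result.append((lower_nodes + upper_nodes, 1 + lower_bits + upper_bits))
    return result


D, b = 2, 0.5  # hypothetical depth and learning-rate constant
coded = coded_partitions("", D)
kraft_sum = sum(Fraction(1, 2 ** bits) for _, bits in coded)

# Hypothetical accumulated node losses.
loss = {"": 7, "0": 2, "1": 4, "00": 1, "01": 1, "10": 3, "11": 0}
scores = [2.0 ** -bits * math.exp(-b * sum(loss[n] for n in part))
          for part, bits in coded]
z = sum(scores)  # normalization factor
weights = [s / z for s in scores]

for (part, bits), weight in zip(coded, weights):
    print(part, bits, round(weight, 4))
print("Kraft sum:", kraft_sum)
```

Finer partitions pay a larger code length up front, so they receive substantial weight only once their accumulated loss is clearly smaller.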

Although this randomized method can be used as the SOT classifier, in its current form, it requires a computational complexity $O ({1.5}^{2^{D}} p)$ since the randomization $w_{t}^{(k)}$ is performed over the set $\{1, \dots, β_{D}\}$ and $β_{D} \approx {1.5}^{2^{D}}$. However, the set of all possible classification outputs of these partitions has a cardinality as small as $D + 1$ since $x_{t} \in R_{t, n_{D}}$ for the corresponding leaf node $n_{D}$ (in which $x_{t}$ is included) and $f_{t}^{(k)} = f_{t, n_{d}}$ for some $d = 0, \dots, D$, $\forall k \in \{1, \dots, β_{D}\}$. Hence, evaluating all the partition classifiers at the instance $x_{t}$ to produce $f_{t} (x_{t})$ is unnecessary. In fact, the computational complexity for producing $f_{t} (x_{t})$ can be reduced from $O ({1.5}^{2^{D}} p)$ to $O (D p)$ by performing the exact same randomization over the $f_{t, n_{d}}$'s using the new set of weights $w_{t, n_{d}}$, which can be straightforwardly derived as follows:

$w_{t, n_{d}} = \sum_{k = 1}^{β_{D}} w_{t}^{(k)} 1_{\{f_{t}^{(k)} (x_{t}) = f_{t, n_{d}} (x_{t})\}} .$

(9.21)

To efficiently calculate (9.21) with complexity $O (D p)$ , we consider the universal coding scheme and let

$M_{t, n} ≜ \begin{cases} \exp (- b L_{t, n}), & \text{if } n \text{ has depth } D, \\ \frac{1}{2} [M_{t, n 0} M_{t, n 1} + \exp (- b L_{t, n})], & \text{otherwise} \end{cases}$

(9.22)

for any node n and observe that we have $M_{t, λ} = Z_{t}$ [23]. Therefore, we can use the recursion (9.22) to obtain the denominator of the randomization probabilities $w_{t}^{(k)}$ . To efficiently calculate the numerator of (9.21), we introduce another intermediate parameter as follows. Letting $n_{d}^{'}$ denote the sibling of node $n_{d}$ , we recursively define

$κ_{t, n_{d}} ≜ \begin{cases} \frac{1}{2}, & \text{if } d = 0, \\ \frac{1}{2} M_{t - 1, n_{d}^{'}} κ_{t, n_{d - 1}}, & \text{if } 0 < d < D, \\ M_{t - 1, n_{d}^{'}} κ_{t, n_{d - 1}}, & \text{if } d = D, \end{cases}$

(9.23)

$\forall d \in \{0, \dots, D\}$, where $x_{t} \in R_{t, n_{D}}$. Using the intermediate parameters in (9.22) and (9.23), it can be shown that we have

$w_{t, n_{d}} = \frac{κ_{t, n_{d}} \exp (- b L_{t, n_{d}})}{M_{t, λ}} .$

(9.24)
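The recursions (9.22) and (9.23) can be checked against direct enumeration. The sketch below works with a single snapshot of accumulated node losses, so time indices are dropped, and it assumes the one-bit-per-node partition code described in the docstring. All names, the constant `b` and the loss values are ours. It computes the weights of the $D + 1$ nodes on the path to one leaf in two ways: through the recursions and by summing the weights of all partitions that contain each node.

```python
import math


def coded_partitions(node, depth_left):
    """List (nodes, bits) for every partition under `node`; assumed code of
    one bit per node above the maximum depth."""
    if depth_left == 0:
        return [((node,), 0)]
    result = [((node,), 1)]
    for lower_nodes, lower_bits in coded_partitions(node + "0", depth_left - 1):
        for upper_nodes, upper_bits in coded_partitions(node + "1", depth_left - 1):
            result.append((lower_nodes + upper_nodes, 1 + lower_bits + upper_bits))
    return result


def mixture(node, depth_left, loss, b):
    """M_n of (9.22), computed recursively."""
    own = math.exp(-b * loss[node])
    if depth_left == 0:
        return own
    lower = mixture(node + "0", depth_left - 1, loss, b)
    upper = mixture(node + "1", depth_left - 1, loss, b)
    return 0.5 * (lower * upper + own)


def path_weights(leaf, loss, b):
    """Weights of the nodes n_0, ..., n_D on the path to `leaf`, via (9.23)-(9.24).

    Assumes the leaf sits at full depth D = len(leaf) >= 1.
    """
    D = len(leaf)
    z = mixture("", D, loss, b)
    kappa, weights = 0.5, []
    for d in range(D + 1):
        node = leaf[:d]
        if d > 0:
            sibling = node[:-1] + ("1" if node[-1] == "0" else "0")
            kappa *= mixture(sibling, D - d, loss, b) * (0.5 if d < D else 1.0)
        weights.append(kappa * math.exp(-b * loss[node]) / z)
    return weights


D, b, leaf = 2, 0.5, "01"  # hypothetical setting
loss = {"": 7, "0": 2, "1": 4, "00": 1, "01": 1, "10": 3, "11": 0}
fast = path_weights(leaf, loss, b)

coded = coded_partitions("", D)
score = {part: 2.0 ** -bits * math.exp(-b * sum(loss[n] for n in part))
         for part, bits in coded}
z = sum(score.values())
slow = [sum(s for part, s in score.items() if leaf[:d] in part) / z
        for d in range(D + 1)]
print(fast)
print(slow)
```

For clarity, `mixture` is recomputed from scratch here; an online implementation would store the $M_{t, n}$ values and refresh only those on the visited path, which is what gives the $O (D)$ cost per sample. Drawing the final output then amounts to sampling an index d with these probabilities.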

Hence, we obtain the final output of the algorithm as $f_{t} (x_{t}) = f_{t, n_{d}} (x_{t})$ with probability $w_{t, n_{d}}$, where $d \in \{0, \dots, D\}$ (i.e., with a computational complexity $O (D)$).

We then use the final output of the introduced algorithm and update the region boundaries of the tree (i.e., organize the tree) to minimize the final classification error. To this end, we minimize the loss $E [ℓ_{t} (f_{t})] = E [1_{\{f_{t} (x_{t}) \neq y_{t}\}}] = \frac{1}{4} E [{(y_{t} - f_{t} (x_{t}))}^{2}]$, where the last equality holds because both $y_{t}$ and $f_{t} (x_{t})$ take values in $\{- 1, 1\}$, with respect to the region boundary parameters, i.e., we use the stochastic gradient descent method as follows:

$\begin{aligned} ϕ_{t + 1, n_{d}} &= ϕ_{t, n_{d}} - η \nabla E [ℓ_{t} (f_{t})] \\ &= ϕ_{t, n_{d}} - {(- 1)}^{q_{d + 1}} η (y_{t} - f_{t} (x_{t})) s_{t, n_{d}}^{q_{d + 1}^{'}} (x_{t}) \left[ \sum_{i = d + 1}^{D} f_{t, n_{i}} (x_{t}) \right] x_{t}, \end{aligned}$

(9.25)

$\forall d \in \{0, \dots, D - 1\}$, where η denotes the learning rate of the algorithm and $q_{d + 1}^{'}$ represents the complementary letter to $q_{d + 1}$ from the binary alphabet $\{0, 1\}$. Defining a new intermediate variable

$π_{t, n_{d}} ≜ \begin{cases} f_{t, n_{d + 1}} (x_{t}), & \text{if } d = D - 1, \\ π_{t, n_{d + 1}} + f_{t, n_{d + 1}} (x_{t}), & \text{if } d < D - 1, \end{cases}$

(9.26)

one can perform the update in (9.25) with a computational complexity $O (p)$ for each node $n_{d}$, where $d \in \{0, \dots, D - 1\}$, resulting in an overall computational complexity of $O (D p)$ as follows:

$ϕ_{t + 1, n_{d}} = ϕ_{t, n_{d}} - {(- 1)}^{q_{d + 1}} η (y_{t} - f_{t} (x_{t})) π_{t, n_{d}} s_{t, n_{d}}^{q_{d + 1}^{'}} (x_{t}) x_{t} .$

(9.27)
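The saving comes from accumulating the node outputs once, from the leaf towards the root, so that $π_{t, n_{d}}$ equals the bracketed sum in (9.25) without recomputing it for every d. A minimal sketch of this accumulation follows; the function name `suffix_sums` and the example outputs are hypothetical.

```python
def suffix_sums(f_path):
    """Given node outputs [f_{n_0}, ..., f_{n_D}], return pi with
    pi[d] = f_{n_{d+1}} + ... + f_{n_D} for d = 0, ..., D - 1."""
    D = len(f_path) - 1
    pi = [0] * D
    running = 0
    for d in range(D - 1, -1, -1):  # walk from the leaf towards the root
        running += f_path[d + 1]
        pi[d] = running
    return pi


f_path = [1, -1, 1, 1]  # hypothetical outputs of the nodes on one path (D = 3)
pi = suffix_sums(f_path)
direct = [sum(f_path[d + 1:]) for d in range(len(f_path) - 1)]
print(pi, direct)
```

The backward pass costs $O (D)$ in total, after which each separator update needs only one scalar multiple of $x_{t}$.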

This concludes the construction of the algorithm and the pseudocode of the SOT classifier can be found in Algorithm 2.

Algorithm 2 Self-Organizing Tree Classifier (SOTC).

9.3.2 Convergence of the Algorithm

In this section, we illustrate that the performance of Algorithm 2 is asymptotically as good as the best partition classifier such that, as $T \to \infty$ , we have $R_{T} (f_{t}; f_{t}^{(k)}) \to 0$ . Hence, Algorithm 2 asymptotically achieves the performance of the best partition classifier among $O ({1.5}^{2^{D}})$ different classifiers that can be represented using the SOT of depth D with a significantly reduced computational complexity of $O (D p)$ without any statistical assumptions on data.

Theorem 2

Let $\{x_{t}\}_{t ⩾ 1}$ and $\{y_{t}\}_{t ⩾ 1}$ be arbitrary and real-valued sequences of feature vectors and their labels, respectively. Then Algorithm 2, when applied to these data sequences, sequentially yields

$\max_{k \in \{1, \dots, β_{D}\}} E [R_{T} (f_{t}; f_{t}^{(k)})] ⩽ O (\sqrt{\frac{2^{D}}{T}}),$

(9.28)

for all T with a computational complexity $O (D p)$ , where p represents the dimensionality of the feature vectors and the expectation is with respect to the randomization parameters.

Proof of this theorem can be found in Appendix 9.A.2.

9.4 Numerical Results

In this section, we illustrate the performance of SOTs under different scenarios with respect to state-of-the-art methods. The proposed method has a wide variety of application areas, such as channel equalization [26], underwater communications [27], nonlinear modeling in big data [28], speech and texture analysis [29, Chapter 7] and health monitoring [30]. Yet, in this section, we consider nonlinear modeling for fundamental regression and classification problems.

9.4.1 Numerical Results for Regression Problems

Throughout this section, “SOTR” represents the self-organizing tree regressor defined in Algorithm 1, “CTW” represents the context tree weighting algorithm of [16], “OBR” represents the optimal batch regressor, “VF” represents the truncated Volterra filter [1], “LF” represents the simple linear filter, “B-SAF” and “CR-SAF” represent the Bézier and the Catmull–Rom spline adaptive filters of [2], respectively, and “FNF” and “EMFNF” represent the Fourier and even mirror Fourier nonlinear filters of [3], respectively. Finally, “GKR” represents the Gaussian-kernel regressor, which is constructed using n node regressors, say ${\hat{d}}_{t, 1}, \dots, {\hat{d}}_{t, n}$ , and a fixed Gaussian mixture weighting (that is selected according to the underlying sequence in hindsight), giving

${\hat{d}}_{t} = \sum_{i = 1}^{n} f (x_{t}; μ_{i}, Σ_{i}) {\hat{d}}_{t, i},$

where ${\hat{d}}_{t, i} = v_{t, i}^{T} x_{t}$ and

$f (x_{t}; μ_{i}, Σ_{i}) ≜ \frac{1}{2 π \sqrt{| Σ_{i} |}} e^{- \frac{1}{2} {(x_{t} - μ_{i})}^{T} Σ_{i}^{- 1} (x_{t} - μ_{i})},$

for all $i = 1, \dots, n$ .
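The GKR prediction above can be sketched for the two-dimensional case, where the $\frac{1}{2 π \sqrt{| Σ_{i} |}}$ normalization applies. This is an illustrative sketch only: the function names are hypothetical, and the node weight vectors $v_{t, i}$ are passed in as given rather than learned.

```python
import math


def gaussian_weight_2d(x, mu, cov):
    """f(x; mu, Sigma) for a 2D point x and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the closed-form
    # inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def gkr_predict(x, mus, covs, vs):
    """Weighted sum of the n linear node regressors d_hat_{t,i} = v_i^T x."""
    total = 0.0
    for mu, cov, v in zip(mus, covs, vs):
        node_prediction = sum(v_j * x_j for v_j, x_j in zip(v, x))
        total += gaussian_weight_2d(x, mu, cov) * node_prediction
    return total
```

Note that the weights depend only on $x_{t}$ and the fixed mixture parameters, so they do not adapt over time.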

For a fair performance comparison, in the corresponding experiments in Subsection 9.4.1.2, the desired data and the regressor vectors are normalized to $[- 1, 1]$ , since the satisfactory performance of several algorithms requires knowledge of the upper bounds (such as the B-SAF and the CR-SAF) and some require these upper bounds to lie in $[- 1, 1]$ (such as the FNF and the EMFNF). Moreover, in the corresponding experiments in Subsection 9.4.1.1, the desired data and the regressor vectors are normalized to $[- 1, 1]$ for the VF, the FNF and the EMFNF algorithms for the same reason. The regression errors of these algorithms are then scaled back to their original values for a fair comparison.

Considering the illustrated examples in the respective papers [2,3,16], the orders of the FNF and the EMFNF are set to 3 for the experiments in Subsection 9.4.1.1 and 2 for the experiments in Subsection 9.4.1.2. The order of the VF is set to 2 for all experiments. Similarly, the depth of the trees for the SOTR and CTW algorithms is set to 2 for all experiments. For these tree-based algorithms, the feature space is initially partitioned by the direction vectors $ϕ_{t, n} = {[ϕ_{t, n}^{(1)}, \dots, ϕ_{t, n}^{(p)}]}^{T}$ for all nodes $n \in N_{D} ∖ L_{D}$ , where $ϕ_{t, n}^{(i)} = - 1$ if $i \equiv l (n) \pmod{D}$ ; for example, when $D = p = 2$ , we have the four quadrants as the four leaf nodes of the tree. Finally, we use cubic B-SAF and CR-SAF algorithms, whose number of knots is set to 21 for all experiments. We emphasize that both these parameters and the learning rates of these algorithms are selected to give equal rates of performance and convergence.

9.4.1.1 Mismatched Partitions

In this subsection, we consider the case where the desired data is generated by a piece-wise linear model that mismatches with the initial partitioning of the tree-based algorithms. Specifically, the desired signal is generated by the following piece-wise linear model:

$d_{t} = \begin{cases} w^{T} x_{t} + π_{t}, & \text{if } ϕ_{0}^{T} x_{t} ⩾ 0.5 \text{ and } ϕ_{1}^{T} x_{t} ⩾ 1, \\ - w^{T} x_{t} + π_{t}, & \text{if } ϕ_{0}^{T} x_{t} ⩾ 0.5 \text{ and } ϕ_{1}^{T} x_{t} < 1, \\ - w^{T} x_{t} + π_{t}, & \text{if } ϕ_{0}^{T} x_{t} < 0.5 \text{ and } ϕ_{2}^{T} x_{t} ⩾ - 1, \\ w^{T} x_{t} + π_{t}, & \text{if } ϕ_{0}^{T} x_{t} < 0.5 \text{ and } ϕ_{2}^{T} x_{t} < - 1, \end{cases}$

(9.29)

where $w = {[1, 1]}^{T}$ , $ϕ_{0} = {[4, - 1]}^{T}$ , $ϕ_{1} = {[1, 1]}^{T}$ , $ϕ_{2} = {[1, 2]}^{T}$ , $x_{t} = {[x_{1, t}, x_{2, t}]}^{T}$ , $π_{t}$ is a sample function from a zero-mean white Gaussian process with variance 0.1 and $x_{1, t}$ and $x_{2, t}$ are sample functions of a jointly Gaussian process with mean ${[0, 0]}^{T}$ and covariance $I_{2}$ . The learning rates are set to 0.005 for SOTR and CTW, 0.1 for FNF, 0.025 for B-SAF and CR-SAF, and 0.05 for EMFNF and VF. Moreover, in order to match the underlying partition, the mass points of GKR are set to $μ_{1} = {[1.4565, 1.0203]}^{T}$ , $μ_{2} = {[0.6203, - 0.4565]}^{T}$ , $μ_{3} = {[- 0.5013, 0.5903]}^{T}$ and $μ_{4} = {[- 1.0903, - 1.0013]}^{T}$ with the same covariance matrix as in the previous example.
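The data model in (9.29) can be reproduced with a few lines. The following is a minimal sketch with hypothetical function names; the parameter values are those listed above, and the random generator and seed are arbitrary choices.

```python
import math
import random

W = [1.0, 1.0]
PHI0, PHI1, PHI2 = [4.0, -1.0], [1.0, 1.0], [1.0, 2.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def noiseless_output(x):
    """Piece-wise linear part of (9.29), without the noise term."""
    linear = dot(W, x)
    if dot(PHI0, x) >= 0.5:
        return linear if dot(PHI1, x) >= 1.0 else -linear
    return -linear if dot(PHI2, x) >= -1.0 else linear


def sample(rng):
    """Draw one (x_t, d_t) pair: x_t ~ N(0, I_2), noise variance 0.1."""
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    d = noiseless_output(x) + rng.gauss(0.0, math.sqrt(0.1))
    return x, d


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for repeatability only
    print([sample(rng) for _ in range(3)])
```

The region boundaries are neither axis-aligned nor through the origin, which is what makes the initial quadrant partition a mismatch.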

Fig. 9.3 shows the normalized time accumulated regression error of the proposed algorithms. We emphasize that the SOTR algorithm achieves a better error performance compared to its competitors. Comparing the performances of the SOTR and CTW algorithms, we observe that the CTW algorithm fails to accurately predict the desired data, whereas the SOTR algorithm learns the underlying partitioning of the data, which significantly improves the performance of the SOTR. This illustrates the importance of the initial partitioning of the regressor space for tree-based algorithms to yield a satisfactory performance.

Figure 9.3 Regression error performances for the second-order piece-wise linear model in (9.29).

In particular, the CTW algorithm converges to the best batch regressor having the predetermined leaf nodes (i.e., the best regressor having the four quadrants of two-dimensional space as its leaf nodes). However, that regressor is suboptimal since the underlying data is generated using another constellation; hence its time accumulated regression error is always lower bounded by $O (1)$ compared to the global optimal regressor. The SOTR algorithm, on the other hand, adapts its region boundaries and captures the underlying unevenly rotated and shifted regressor space partitioning. Fig. 9.4 shows how our algorithm updates its separator functions and illustrates the nonlinear modeling power of SOTs.

Figure 9.4 Changes in the boundaries of the leaf nodes of the SOT of depth 2 generated by the SOTR algorithm at time instances t = 0, 1000, 2000, 5000, 20000, 50000. The separator functions adaptively learn the boundaries of the piece-wise linear model in (9.29).

9.4.1.2 Chaotic Signals

In this subsection, we illustrate the performance of the SOTR algorithm when estimating chaotic data generated by the Henon map and the Lorenz attractor [31].

First, we consider a zero mean sequence generated by the Henon map, a chaotic process given by

$d_{t} = 1 - ζ d_{t - 1}^{2} + η d_{t - 2},$

(9.30)

known to exhibit chaotic behavior for the values of $ζ = 1.4$ and $η = 0.3$ . The desired data at time t is denoted as $d_{t}$ whereas the extended regressor vector is $x_{t} = {[d_{t - 1}, d_{t - 2}, 1]}^{T}$ , i.e., we consider a prediction framework. The learning rate is set to 0.025 for B-SAF and CR-SAF, whereas it is set to 0.05 for the rest.
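The prediction setup for (9.30) can be sketched as below. The function names are hypothetical, the zero initial conditions are an assumption (the text does not specify them), and the mean removal mentioned above is left out for brevity.

```python
def henon_sequence(length, zeta=1.4, eta=0.3, d_prev1=0.0, d_prev2=0.0):
    """Iterate d_t = 1 - zeta * d_{t-1}^2 + eta * d_{t-2}.

    d_prev1 and d_prev2 are the assumed initial values of d_{t-1}
    and d_{t-2}; eta here is the map parameter, not a learning rate.
    """
    sequence = []
    for _ in range(length):
        d = 1.0 - zeta * d_prev1 ** 2 + eta * d_prev2
        sequence.append(d)
        d_prev1, d_prev2 = d, d_prev1
    return sequence


def prediction_pairs(sequence):
    """Build (x_t, d_t) pairs with x_t = [d_{t-1}, d_{t-2}, 1]."""
    return [([sequence[t - 1], sequence[t - 2], 1.0], sequence[t])
            for t in range(2, len(sequence))]
```

Because $d_{t}$ is quadratic in $d_{t - 1}$, a model that is linear in $x_{t}$ over a single region cannot represent the map exactly, which is where the piece-wise linear partitioning helps.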

Fig. 9.5 (left plot) shows the normalized regression error performance of the proposed algorithms. One can observe that the algorithms whose basis functions do not include the necessary quadratic terms and the algorithms that rely on a fixed regressor space partitioning yield unsatisfactory performance. On the other hand, VF can capture the salient characteristics of this chaotic process since its order is set to 2. Similarly, FNF can also learn the desired data since its basis functions can well approximate the chaotic process. The SOTR algorithm, however, uses a piece-wise linear modeling and still achieves asymptotically the same performance as the VF algorithm, while outperforming the FNF algorithm.

Figure 9.5 Regression error performances of the proposed algorithms for the signal generated by the Henon map in (9.30) (left figure) and for the Lorenz attractor in (9.31) with parameters dt = 0.01, ρ = 28, σ = 10 and β = 8/3 (right figure).

Second, we consider the chaotic signal set generated using the Lorenz attractor [31] that is defined by the following three discrete-time equations:

$\begin{matrix} x_{t} & = x_{t - 1} + (σ (y_{t - 1} - x_{t - 1})) d t, \\ y_{t} & = y_{t - 1} + (x_{t - 1} (ρ - z_{t - 1}) - y_{t - 1}) d t, \\ z_{t} & = z_{t - 1} + (x_{t - 1} y_{t - 1} - β z_{t - 1}) d t, \end{matrix}$

(9.31)

where we set $d t = 0.01$ , $ρ = 28$ , $σ = 10$ and $β = 8 / 3$ to generate the well-known chaotic solution of the Lorenz attractor. In the experiment, $x_{t}$ is selected as the desired data and the two-dimensional region represented by $y_{t}, z_{t}$ is set as the regressor space, that is, we try to estimate $x_{t}$ with respect to $y_{t}$ and $z_{t}$ . The learning rates are set to 0.01 for all algorithms.
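The discrete-time recursion in (9.31) can be simulated directly. The sketch below uses hypothetical function names and an assumed initial state of $(1, 1, 1)$, which the text does not specify.

```python
def lorenz_step(state, dt=0.01, rho=28.0, sigma=10.0, beta=8.0 / 3.0):
    """One step of (9.31); all right-hand sides use the previous state."""
    x, y, z = state
    return (x + sigma * (y - x) * dt,
            y + (x * (rho - z) - y) * dt,
            z + (x * y - beta * z) * dt)


def lorenz_dataset(num_samples, state=(1.0, 1.0, 1.0)):
    """Return (regressor, desired) pairs: regress x_t on [y_t, z_t]."""
    data = []
    for _ in range(num_samples):
        state = lorenz_step(state)
        x, y, z = state
        data.append(([y, z], x))
    return data
```

Each pair uses the same time index for the regressor and the desired value, matching the estimation (rather than prediction) setup described above.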

Fig. 9.5 (right plot) illustrates the nonlinear modeling power of the SOTR algorithm even when estimating a highly nonlinear chaotic signal set. As can be observed from Fig. 9.5, the SOTR algorithm significantly outperforms its competitors and achieves a superior error performance since it tunes its region boundaries to the optimal partitioning of the regressor space, whereas the performances of the other algorithms directly rely on the initial selection of the basis functions and/or tree structures and partitioning.

9.4.2 Numerical Results for Classification Problems

9.4.2.1 Stationary Data

In this section, we consider stationary classification problems and compare the SOTC algorithm with the following methods: Perceptron, “PER” [25]; Online AdaBoost, “OZAB” [32]; Online GradientBoost, “OGB” [33]; Online SmoothBoost, “OSB” [34]; and Online Tree–Based Nonadaptive Competitive Classification, “TNC” [22]. The parameters for all of these compared methods are set as in their original proposals. For “OGB” [33], which uses K weak learners per M selectors, essentially resulting in MK weak learners in total, we use $K = 1$ , as in [34], for a fair comparison along with the logit loss that has been shown to consistently outperform other choices in [33]. The TNC algorithm is nonadaptive, i.e., not self-organizing, in terms of the space partitioning, which we use in our comparisons to illustrate the gain due to the proposed self-organizing structure. We use the Perceptron algorithm as the weak learners and node classifiers in all algorithms. We set the learning rate of the SOTC algorithm to $η = 0.05$ in all of our stationary as well as nonstationary data experiments. We use $N = 100$ weak learners for the boosting methods, whereas we use a depth-4 tree in SOTC and TNC algorithms, which corresponds to $31 = 2^{5} - 1$ local node classifiers. The SOTC algorithm has linear complexity in the depth of the tree, whereas the compared methods have linear complexity in the number of weak learners.

As can be observed in Table 9.1, the SOTC algorithm consistently outperforms the compared methods. In particular, the compared methods essentially fail to classify Banana and BMC datasets, which indicates that these methods are not able to extend to complex nonlinear classification problems. On the contrary, the SOTC algorithm successfully models these complex nonlinear relations with piece-wise linear curves and provides a superior performance. In general, the SOTC algorithm has significantly better transient characteristics and the TNC algorithm occasionally performs poorly (such as on BMC and Banana data sets) depending on the mismatch between the initial partitions defined on the tree and the underlying optimal separation of the data. This illustrates the importance of learning the region boundaries in piece-wise linear models.

Table 9.1

Average classification errors (in percentage) of algorithms on benchmark datasets

Data Set	PER	OZAB	OGB	OSB	TNC	SOTC
Heart	24.66	23.96	23.28	23.63	21.75	20.09
Breast cancer	5.77	5.44	5.71	5.23	4.84	4.65
Australian	20.82	20.26	19.70	20.01	15.92	14.86
Diabetes	32.25	32.43	33.49	31.33	26.89	25.75
German	32.45	31.86	32.72	31.86	28.13	26.74
BMC	47.09	45.72	46.92	46.37	25.37	17.03
Splice	33.42	32.59	32.79	32.81	18.88	18.56
Banana	48.91	47.96	48.00	48.84	27.98	17.60

9.4.2.2 Nonstationary Data: Concept Change/Drift

In this section, we apply the SOTC algorithm to nonstationary data, where there might be continuous or sudden/abrupt changes in the source statistics, i.e., concept change. Since the SOTC algorithm processes data in a sequential manner, we choose the Dynamic Weighted Majority (DWM) algorithm [35] with Perceptron (DWM-P) or naive Bayes (DWM-N) experts for the comparison, since the DWM algorithm is also an online algorithm. Although the batch algorithms do not truly fit into our framework, we still devise an online version of the tree-based local space partitioning algorithm [24] (which also learns the space partitioning and the classifier using the coordinate ascent approach) using a sliding-window approach and abbreviate it as the WLSP algorithm. For the DWM method, which allows the addition and removal of experts during the stream, we set the initial number of experts to 1, where the maximum number of experts is bounded by 100. For the WLSP method, we use a window size of 100. The parameters for these compared methods are set as in their original proposals.

We run these methods on the BMC dataset (1200 instances, Fig. 9.6), where a sudden/abrupt concept change is obtained by rotating the feature vectors (clock-wise around the origin) $180^{\circ}$ after the 600th instance. This is effectively equivalent to flipping the label of the feature vectors; hence the resulting dataset is denoted as BMC-F. For a continuous concept drift, we rotate each feature vector $180^{\circ} / 1200 = {0.15}^{\circ}$ starting from the beginning; the resulting dataset is denoted as BMC-C. In Fig. 9.6, we present the classification errors for the compared methods averaged over 1000 trials. At each 10th instance, we test the algorithms with 1200 instances drawn from the active set of statistics (active concept).
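The two drift constructions can be sketched as follows. The function names are hypothetical, and the reading that the drift rotation accumulates (the t-th instance is rotated by t times ${0.15}^{\circ}$, reaching $180^{\circ}$ at the last instance) is an assumption about the setup.

```python
import math


def rotate_clockwise(x, degrees):
    """Rotate a 2D point clockwise around the origin."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] + s * x[1], -s * x[0] + c * x[1]]


def abrupt_stream(features, change_point=600):
    """BMC-F style: rotate by 180 degrees after the change point."""
    return [x if t < change_point else rotate_clockwise(x, 180.0)
            for t, x in enumerate(features)]


def drifting_stream(features):
    """BMC-C style: rotation grows by 180/n degrees per instance."""
    step = 180.0 / len(features)
    return [rotate_clockwise(x, step * (t + 1))
            for t, x in enumerate(features)]
```

A $180^{\circ}$ rotation maps every point $x$ to $- x$, so for a label pattern that is antisymmetric about the origin it acts as a label flip, consistent with the BMC-F name.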

Figure 9.6 Performances of the algorithms in case of the abrupt and continuous concept changes in the BMC dataset. On the left figure, at the 600th instance, there is a 180° clock-wise rotation around the origin that is effectively a label flip. On the right figure, at each instance, there is a 180°/1200 clock-wise rotation around the origin.

Since the BMC dataset is non-Gaussian with strongly nonlinear class separations, the DWM method does not perform well on the BMC-F data. For instance, DWM-P operates with an error rate fluctuating around 0.48–0.49 (random guess). This is because the performance of the DWM method depends directly on the success of its experts, and both base learners (Perceptron and naive Bayes) fail due to the high separation complexity in the BMC-F data. The WLSP method, on the other hand, quickly converges to steady state; however, it is asymptotically outperformed by the SOTC algorithm in both experiments. Increasing the window size is expected to boost the performance of WLSP, but at the expense of increased computational complexity; WLSP is already significantly slower than the SOTC method even with a window size of 100 (for a more detailed comparison, see Table 9.2). The performance of the WLSP method is significantly worse on the BMC-C dataset than on the BMC-F dataset, since in the former scenario WLSP is trained on sliding windows that contain a continuous mixture of concepts. Under this continuous concept drift, the SOTC method always (not only asymptotically, as in the case of the BMC-F dataset) performs better than the WLSP method. Hence, the sliding-window approach is sensitive to continuous drift. Our discussion of the DWM method on the concept change data (BMC-F) remains valid in the case of concept drift (BMC-C) as well. In these experiments, the benefit of self-organizing trees is evident, as the SOTC algorithm almost always outperforms the TNC algorithm. We also observe from Table 9.2 that the SOTC algorithm is computationally efficient and that the cost of the region updates (relative to the TNC algorithm) does not increase the computational complexity significantly.

Table 9.2

Running times (in seconds) of the compared methods when processing the BMC data set on a daily-use machine (Intel(R) Core(TM) i5-3317U CPU @ 1.70 GHz with 4 GB memory)

PER	OZAB	OGB	OSB	TNC	DWM-P	DWM-N	WLSP	SOTC
0.06	12.90	3.57	3.91	0.43	2.06	6.91	68.40	0.62

Appendix 9.A

9.A.1 Proof of Theorem 1

For the SOT of depth D, suppose ${\hat{d}}_{t}^{(k)}$ , $k = 1, \dots, β_{d}$ , are obtained as described in Section 9.2.2. To achieve the upper bound in (9.15), we use the online gradient descent method and update the combination weights as

$w_{t + 1} = w_{t} - \frac{1}{2} η_{t} \nabla e_{t}^{2} (w_{t}) = w_{t} + η_{t} e_{t} {\hat{d}}_{t},$

(9.32)

where $η_{t}$ is the learning rate of the online gradient descent algorithm. We derive an upper bound on the sequential learning regret $R_{T}$ , which is defined as

$R_{T} ≜ \sum_{t = 1}^{T} e_{t}^{2} (w_{t}) - \sum_{t = 1}^{T} e_{t}^{2} (w_{T}^{⁎}),$

where $w_{T}^{⁎}$ is the optimal weight vector over T, i.e.,

$w_{T}^{⁎} ≜ \underset{w \in R^{β_{d}}}{\arg \min} \sum_{t = 1}^{T} e_{t}^{2} (w) .$

Following [36], using Taylor series approximation, for some point $z_{t}$ on the line segment connecting $w_{t}$ to $w_{T}^{⁎}$ , we have

$e_{t}^{2} (w_{T}^{⁎}) = e_{t}^{2} (w_{t}) + {(\nabla e_{t}^{2} (w_{t}))}^{T} (w_{T}^{⁎} - w_{t}) + \frac{1}{2} {(w_{T}^{⁎} - w_{t})}^{T} \nabla^{2} e_{t}^{2} (z_{t}) (w_{T}^{⁎} - w_{t}) .$

(9.33)

According to the update rule in (9.32), at each iteration the update on the weights is performed as $w_{t + 1} = w_{t} - \frac{η_{t}}{2} \nabla e_{t}^{2} (w_{t})$ . Hence, we have

$\begin{matrix} {| | w_{t + 1} - w_{T}^{⁎} | |}^{2} & = {| | w_{t} - \frac{η_{t}}{2} \nabla e_{t}^{2} (w_{t}) - w_{T}^{⁎} | |}^{2} \\ & = {| | w_{t} - w_{T}^{⁎} | |}^{2} - η_{t} {(\nabla e_{t}^{2} (w_{t}))}^{T} (w_{t} - w_{T}^{⁎}) + \frac{η_{t}^{2}}{4} {| | \nabla e_{t}^{2} (w_{t}) | |}^{2} . \end{matrix}$

This yields

${(\nabla e_{t}^{2} (w_{t}))}^{T} (w_{t} - w_{T}^{⁎}) = \frac{{| | w_{t} - w_{T}^{⁎} | |}^{2} - {| | w_{t + 1} - w_{T}^{⁎} | |}^{2}}{η_{t}} + η_{t} \frac{{| | \nabla e_{t}^{2} (w_{t}) | |}^{2}}{4} .$

Under the mild assumptions that ${| | \nabla e_{t}^{2} (w_{t}) | |}^{2} ⩽ A^{2}$ for some $A > 0$ and that $e_{t}^{2} (\cdot)$ is λ-strongly convex for some $λ > 0$ [36], we combine this identity with (9.33) to obtain the following upper bound:

$e_{t}^{2} (w_{t}) - e_{t}^{2} (w_{T}^{⁎}) ⩽ \frac{{| | w_{t} - w_{T}^{⁎} | |}^{2} - {| | w_{t + 1} - w_{T}^{⁎} | |}^{2}}{η_{t}} - \frac{λ}{2} {| | w_{t} - w_{T}^{⁎} | |}^{2} + η_{t} \frac{A^{2}}{4} .$

(9.34)

By selecting $η_{t} = \frac{2}{λ t}$ and summing up the regret terms in (9.34), we get

$\begin{matrix} R_{T} & = \sum_{t = 1}^{T} \{e_{t}^{2} (w_{t}) - e_{t}^{2} (w_{T}^{⁎})\} \\ & ⩽ \sum_{t = 1}^{T} {| | w_{t} - w_{T}^{⁎} | |}^{2} (\frac{1}{η_{t}} - \frac{1}{η_{t - 1}} - \frac{λ}{2}) + \frac{A^{2}}{4} \sum_{t = 1}^{T} η_{t} \\ & = \frac{A^{2}}{4} \sum_{t = 1}^{T} \frac{2}{λ t} ⩽ \frac{A^{2}}{2 λ} (1 + \log (T)) . \end{matrix}$
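The last equality above can be checked term by term. The following worked step is a sketch that assumes the convention $1 / η_{0} = 0$, which the text does not state explicitly. With $η_{t} = \frac{2}{λ t}$, each coefficient of the squared distance vanishes:

$\frac{1}{η_{t}} - \frac{1}{η_{t - 1}} - \frac{λ}{2} = \frac{λ t}{2} - \frac{λ (t - 1)}{2} - \frac{λ}{2} = 0,$

so only the step-size sum remains, and the harmonic sum is bounded by comparison with an integral:

$\sum_{t = 1}^{T} \frac{1}{t} ⩽ 1 + \int_{1}^{T} \frac{d τ}{τ} = 1 + \log (T) .$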

9.A.2 Proof of Theorem 2

Since $Z_{t}$ is a summation of terms that are all positive, we have $Z_{t} ⩾ 2^{- J (k)} \exp (- b L_{t}^{(k)})$ and after taking the logarithm of both sides and rearranging the terms, we get

$- \frac{1}{b} \log Z_{T} ⩽ L_{T}^{(k)} + \frac{J (k) \log 2}{b}$

(9.35)

for all $k \in \{1, \dots, β_{D}\}$ at the (last) iteration at time T. We then make the following observation:

$\begin{matrix} Z_{T} = \prod_{t = 1}^{T} \frac{Z_{t}}{Z_{t - 1}} & = \prod_{t = 1}^{T} \sum_{k = 1}^{β_{D}} \frac{2^{- J (k)} \exp (- b L_{t - 1}^{(k)})}{Z_{t - 1}} \exp (- b ℓ_{t} (f_{t}^{(k)})) \\ & ⩽ \exp (- b L_{T} (f_{t}) + \frac{T b^{2}}{8}), \end{matrix}$

(9.36)

where the second line follows from the definition of $Z_{t}$ and the last line follows from the Hoeffding inequality by treating the $w_{t}^{(k)} = 2^{- J (k)} \exp (- b L_{t - 1}^{(k)}) / Z_{t - 1}$ terms as the randomization probabilities. Note that $L_{T} (f_{t})$ represents the expected loss of the final algorithm; cf. (9.19). Combining (9.35) and (9.36), we obtain

$\frac{L_{T} (f_{t})}{T} ⩽ \frac{L_{T}^{(k)}}{T} + \frac{J (k) \log 2}{T b} + \frac{b}{8}$

and choosing $b = \sqrt{2^{D} / T}$ , we find the desired upper bound in (9.28) since $J (k) ⩽ 2^{D + 1} - 1$ , for all $k \in \{1, \dots, β_{D}\}$ .
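For completeness, the substitution can be written out. This is a worked sketch of the final step that uses only the quantities above. With $b = \sqrt{2^{D} / T}$ and $J (k) ⩽ 2^{D + 1} - 1 < 2^{D + 1}$, the two excess terms satisfy

$\frac{J (k) \log 2}{T b} + \frac{b}{8} ⩽ \frac{2^{D + 1} \log 2}{\sqrt{2^{D} T}} + \frac{1}{8} \sqrt{\frac{2^{D}}{T}} = (2 \log 2 + \frac{1}{8}) \sqrt{\frac{2^{D}}{T}},$

which is $O (\sqrt{2^{D} / T})$ uniformly in k.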
