5

Transmission Rate Adaptation in Multimedia WLAN: A Dynamic Games Approach*

Jane Wei Huang, Hassan Mansour, and Vikram Krishnamurthy

CONTENTS

5.1    Introduction

5.1.1    Motivation for Stochastic Game Formulation

5.2    System Description and the Video Rate-Distortion Model

5.2.1    System Description

5.2.2    Scalable Rate-Distortion Modeling

5.2.2.1    Rate Model

5.2.2.2    Distortion Model

5.3    Uplink Rate Adaptation Problem Formulation

5.3.1    Actions, Transmission Reward, and Holding Cost

5.3.1.1    Actions

5.3.1.2    Transmission Reward

5.3.1.3    Holding Cost

5.3.2    Markovian Game Formulation

5.3.3    System Access Rule

5.3.4    Transition Probabilities and Switching Control Game Formulation

5.4    Nash Equilibrium Solutions to the Markovian Dynamic Game

5.4.1    Value Iteration Algorithm

5.4.2    Structural Result on Randomized Threshold Policy

5.4.3    Learning Nash Equilibrium Policy via Policy-Gradient Algorithm

5.5    Numerical Examples

5.5.1    General-Sum Constrained Game: Randomized Monotone Transmission Policy

5.5.2    Result by Policy-Gradient Algorithm

5.5.3    Transmission Buffer Management

5.6    Conclusions

Glossary

Symbols

Appendix 5.A.1 Proof of Convergence of Algorithm 5.1

Appendix 5.A.2 Proof of Theorem 5.1

Appendix 5.A.3 Proof of Lemma 5.1

References

This chapter considers scheduling, rate adaptation, and buffer management in a multiuser wireless local area network (WLAN) where each user transmits a scalable video payload. Based on opportunistic scheduling, users access the available medium (channel) in a decentralized manner. By modeling the video states and block-fading channel* qualities of each user as a finite-state Markov chain, the rate adaptation problem of the WLAN multimedia network is formulated as a general-sum switching control dynamic Markovian game. A value iteration algorithm is proposed to compute the Nash equilibrium policy of such a game, and the convergence of the algorithm is proved. We also give assumptions on the system under which the Nash equilibrium transmission policy of each user is a randomization of two pure policies, each nondecreasing in the buffer state occupancy. Based on this structural result, we use a policy-gradient algorithm to compute the Nash equilibrium policy.

5.1    Introduction

Advances in video coding technology and standardization [1,2], along with rapid developments and improvements in network infrastructure, storage capacity, and computing power, are enabling an increasing number of video applications. Application areas today range from multimedia messaging, video telephony, and video conferencing over mobile TV, wireless and wired Internet video streaming, and standard- and high-definition (HD) TV broadcasting to DVD, Blu-ray disc, and HD DVD optical storage media. For these applications, a variety of video transmission and storage systems may be employed.

Modern video transmission and storage systems using the Internet and mobile networks are typically based on the real-time transport protocol (RTP)/Internet protocol (IP) for real-time services (conversational and streaming) and on computer file formats like MP4 or 3GP. Most RTP/IP access networks are characterized by a wide range of connection qualities and receiving devices. The connection quality varies because such networks must share a time-varying data throughput among their users. The variety of devices, with capabilities ranging from cell phones with small screens and restricted processing power to high-end personal computers with HD displays, results from the continuous evolution of these endpoints.

Scalable video coding (SVC) is a highly attractive solution to the problems posed by the characteristics of modern video transmission systems. The time-varying nature of wireless channels and video content motivates the need for scalable media, which can adapt the video bit rate without significantly sacrificing the decoded video quality. SVC provides spatial, temporal, and signal-to-noise ratio (SNR) scalability, which ensures a graceful degradation in video quality when faced with channel fluctuations [3].

Efficient SVC provides a number of benefits in terms of applications [4,5]. In the scenario of a video transmission service with heterogeneous clients, multiple bitstreams of the same source content differing in coded picture size, frame rate, and bit rate should be provided simultaneously. With a properly configured SVC scheme, the source content has to be encoded only once, for the highest required resolution and bit rate, resulting in a scalable bitstream from which representations of lower resolution and/or quality can be obtained by discarding selected data.

Real-time video streaming applications belong to a group of fast-spreading services that are quickly reaching the mobile arena. Such applications have increased the need to supply large volumes of coded video data to mobile and heterogeneous users [6]. However, capacity restrictions, heterogeneous device and network capabilities, and error-prone transmission are just some of the problems resulting from the characteristics of modern video communication systems, to which SVC offers an attractive solution [7]. In this chapter, we present a rate adaptation scheme for real-time encoding and streaming of multiple scalable video streams in a WLAN system. The aim is to adapt the video layers of each user to the fluctuations in channel quality and the transmission control policies, such that the transmission buffer of each user does not saturate and the transmission delay constraint is not violated.

Rate allocation for multimedia transmission in wireless networks has been studied extensively. Previous works that are of interest include [8,9,10]. In [8], joint radio link buffer management and scheduling strategies are compared for wireless video streaming over high-speed downlink packet access (HSDPA) networks. We use the results of [8] to choose an appropriate buffer management strategy for our proposed switching control stochastic dynamic game solution. In [9] and [10], video rate and distortion models are used to improve the transmitted video quality for low-latency multimedia traffic. Rate and distortion modeling has become a keystone in model-based video rate and transmission control algorithms. The models are used to predict the coded video packet size and the distortion of the corresponding decoded picture prior to the actual encoding process in order to be used in a constrained optimization framework. However, these models cannot be used for characterization of scalable video data.

5.1.1    Motivation for Stochastic Game Formulation

The concept of a stochastic game, first introduced by Lloyd Shapley in the early 1950s, is a dynamic game played by one or more players. The elements of a stochastic game include the system state set, action sets, transition probabilities, and utility functions. It is an extension of the single-player Markov decision process (MDP) to multiple players whose actions all impact the resulting payoffs and the next state. A switching control game [11,12,13] is a special type of stochastic dynamic game where the transition probability in any given state depends on only one player. It is known that the Nash equilibrium of such a game can be computed by solving a sequence of MDPs.

In this chapter, we consider an Institute of Electrical and Electronics Engineers (IEEE) 802.11 [14] WLAN, which deploys the carrier sense multiple access with collision avoidance (CSMA/CA). We propose a modified CSMA/CA channel access mechanism (Section 5.3.3), which takes into account the dynamic behaviors of the video variation, as well as the channel quality and transmission delay of each user. Due to the fact that there is no central controller for resource allocation in a WLAN system, there is strong motivation to study the case where individual users seek to selfishly maximize their transmission rate while taking into account the access rule. This is akin to individuals minimizing their tax (throughput subject to latency constraint in our case) without violating the tax law (the modified CSMA/CA channel access rule in our case). The interactions among system users can be naturally formulated using game theory. Furthermore, by exploiting the correlated time-varying sample paths of the channels and buffers, the transmission rate adaptation problem can be formulated as a Markovian dynamic game. Section 5.2 describes the details of the WLAN system under study as well as the dynamic behavior of video sources and the scalable rate-distortion models used to represent this behavior.

The main results of this chapter are summarized as follows.

First, we consider a multiuser WLAN system where each user is equipped with a scalable video quality encoder delivering a video bitstream that conforms with SVC [7], the scalable extension of H.264/AVC. We address the scheduling, rate control, and buffer management of such a system, and formulate the problem as a constrained switching control stochastic dynamic game combined with rate-distortion modeling of the source. The video states and the block-fading channel qualities of each user are modeled as a finite-state Markov chain, and each user aims to optimize its own utility under a transmission latency constraint when it is scheduled for transmission. We then formulate the rate adaptation in the WLAN multimedia system as a constrained switching control dynamic Markovian game [12,15].

Second, we propose a value iteration algorithm that obtains the Nash equilibrium transmission policy of the constrained general-sum switching control game in Section 5.4. The Nash equilibrium policy of such a constrained game is a randomization of two deterministic policies. The value iteration algorithm involves a sequence of dynamic programming problems with Lagrange multipliers. In each iteration, a Lagrange multiplier is used to combine the latency cost with the transmission reward for each user; the Lagrange multiplier is then updated based on the delay constraint. The algorithm converges to the Nash equilibrium policy of the constrained general-sum switching control game.

We then proceed to the main structural result of this chapter: under reasonable assumptions, the Nash equilibrium policy is monotone-nondecreasing in the buffer state occupancy. This structural result on the transmission policy enables us to search for the Nash equilibrium policy in the policy space via a policy-gradient algorithm, and it is especially useful in multiuser systems where each user has a large buffer size. Finally, numerical results of the proposed switching control Markovian dynamic game policy for transmitting scalable video are presented.

5.2    System Description and the Video Rate-Distortion Model

This section introduces the time-slotted WLAN system model (Figure 5.1), in which multiple users attempt to transmit their data over the available channel (medium) [14]. Time synchronization among users can be achieved via downlink pilots carrying the relevant timing information. Each user is equipped with a buffer, a decentralized scheduler, and a rate adaptor to transmit a scalable video payload, which consists of a base layer and multiple enhancement layers. We use game theory to formulate the decentralized behaviors of all the users accessing the common spectrum resource. It will be shown in Section 5.4 that the state information of the other users is needed in order to compute the optimal transmission policy of the current user. This can be realized by allowing all the users to broadcast their state information (buffer size, channel) during the guard interval at the beginning of each time slot.

5.2.1    System Description

We consider a K-user WLAN system where only one user can access the channel in each time slot according to a modified CSMA/CA mechanism (Section 5.3). The rate-control problem of each user can be formulated as a constrained dynamic Markovian game by modeling the correlated block-fading channel as a Markov chain. The "users" referred to in this chapter are video sources that upload video streams to the network; the mobile end users on the other side of the WLAN system download the video streams. Under the predefined decentralized access rule (Section 5.3.3), the problem presented is a special type of game, namely, a switching control Markovian dynamic game.


FIGURE 5.1
A WLAN system where each user is equipped with a size B buffer, a decentralized scheduler, and a rate adaptor for transmission control. The users transmit a scalable video payload in which enhancement layers provide quality refinements over the base layer bitstream.

We assume that K users (indexed by k = 1, …, K) are equipped with scalable video encoders delivering video bitstreams that conform with SVC [7]. In SVC, quality scalability is achieved using medium-grained or coarse-grained scalability (MGS/CGS), where scalable enhancement packets deliver quality refinements to a preceding layer representation by requantizing the residual signal with a smaller quantization step size and encoding only the quantization refinements in the enhancement layer packets [7]. Moreover, MGS/CGS enhancement layer packets are coded using the key picture concept. For each picture, a flag is transmitted that signals whether the base quality reconstruction or the enhancement layer reconstruction of the reference pictures is employed for motion-compensated prediction. In order to limit the memory requirements, a second syntax element signals whether the base quality representation of a picture is additionally reconstructed and stored in the decoded picture buffer. In order to limit the decoding overhead for such key pictures, SVC specifies that motion parameters must not change between the base and enhancement layer representations of a key picture; thus, also for key pictures, the decoding can be done with a single motion-compensation loop. This approach achieves a trade-off between the coding efficiency of enhancement layers and the drift at the decoder [7].

Assume that each user k encodes its video stream at fixed base and enhancement layer quantization parameter (QP) values \(q_{l,k}\), where l = 0 corresponds to the base layer and l ≥ 1 corresponds to every subsequent enhancement layer. This results in a fluctuation of the video bit rate and distortion as the scene complexity and level of motion vary. The video variation can be captured using the mean absolute difference (MAD) of the prediction residual, since it quantifies the mismatch between the original uncoded picture and the intra/inter prediction. Therefore, we resort to video rate and distortion models to predict the video distortion of the corresponding decoded picture prior to the actual encoding process. In [16], rate and distortion models are proposed that capture the variation of MGS/CGS scalable coded video content as a function of two encoding parameters, namely, the QP and the MAD of the residual signal. We summarize these results in the following subsection.

5.2.2    Scalable Rate-Distortion Modeling

In this section, we present rate and distortion estimation models for individual frame representations of MGS/CGS scalable coded video content. Two real-time encoding parameters are commonly used in rate-distortion estimation models, namely, the QP and the MAD of the residual signal.

Dropping the user subscript k, let \(q_l, q_{l'} \in \{0, 1, \ldots, 51\}\) [1] be the QP values of two distinct video layers l and l′, respectively. We have found a simple relationship that estimates the residual MAD \(\tilde{m}_l\) of a specific frame given the residual MAD \(\tilde{m}_{l'}\) of the same frame at an initial QP \(q_{l'}\), as shown in the following:

\[\tilde{m}_l = \tilde{m}_{l'} \, 2^{\varsigma (q_l - q_{l'})},\]

(5.1)

where ς is a model parameter typically valued around 0.07 for most sequences [1,16]. In MGS/CGS scalability, enhancement layer packets contain refinements on the quantization of residual texture information [7]. Therefore, we use the expression in (5.1) to estimate the perceived MAD of each of the MGS/CGS enhancement layers.

5.2.2.1    Rate Model

Here we present the rate model for base and CGS enhancement layer SVC coded video frames. The MGS/CGS [10] base and enhancement layer bit rate in SVC can be expressed as follows:

\[r(l, \tilde{m}_l) = c_l \, (\tilde{m}_l)^u \, 2^{-q_l/6} = c_l \left(\tilde{m}_{l'} \, 2^{\varsigma (q_l - q_{l'})}\right)^u 2^{-q_l/6},\]

(5.2)

where

\(c_l\) is a model parameter

u is a power factor that depends on the frame type, such that u = 1 for inter-coded frames and u ≈ 5/6 for intra-coded frames [1,16]

5.2.2.2    Distortion Model

The distortion model estimates the decoded picture luminance peak signal-to-noise ratio (Y-PSNR). Let \(q_l\) be the QP value of layer l, and let \(\tilde{m}_l\) be the prediction MAD estimate. The video PSNR achieved by decoding all layers up to layer l is expressed as follows:

\[\delta(l, \tilde{m}_l) = -\nu_1 \log_{10}\!\left((\tilde{m}_l)^u + 1\right) q_l + \nu_2 = -\nu_1 \log_{10}\!\left(\left(\tilde{m}_{l'} \, 2^{\varsigma (q_l - q_{l'})}\right)^u + 1\right) q_l + \nu_2,\]

(5.3)

where

u is the same power factor described in (5.2)

ν1 and ν2 are sequence-dependent model parameters typically valued at 0.52 and 47, respectively [16]

The parameters ν1 and ν2 can be refined for each sequence during encoding. Figure 5.2 illustrates the performance of the distortion model compared to actual encoding of two reference video sequences: Foreman and Football. Figures 5.2a and b show the Y-PSNR estimation results, while Figures 5.2c and d show the rate estimation results. The video streams used in the simulation have one base layer and two CGS enhancement layers. The accuracy of the rate-distortion model (5.3) can easily be seen in Figure 5.2.

Note that the parameters ς, ν1, ν2, and cl are calculated only once and remain constant for the entire video sequence.
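To make the models concrete, the following minimal Python sketch evaluates (5.1) through (5.3). It is an illustration under stated assumptions, not the reference implementation: the constants use the typical values quoted earlier, the base-layer MAD in the usage example is a made-up figure, and in practice ς, ν1, ν2, and \(c_l\) are fitted once per sequence.

```python
import math

VARSIGMA = 0.07        # MAD scaling exponent; typical value [1,16]
NU1, NU2 = 0.52, 47.0  # distortion model parameters; typical values [16]

def mad_estimate(mad_ref: float, q_ref: int, q_l: int) -> float:
    """Estimate the residual MAD at QP q_l from the MAD at QP q_ref, Eq. (5.1)."""
    return mad_ref * 2.0 ** (VARSIGMA * (q_l - q_ref))

def rate_estimate(c_l: float, mad_l: float, q_l: int, intra: bool = False) -> float:
    """Bit-rate model of Eq. (5.2); the power factor u depends on the frame type."""
    u = 5.0 / 6.0 if intra else 1.0
    return c_l * (mad_l ** u) * 2.0 ** (-q_l / 6.0)

def psnr_estimate(mad_l: float, q_l: int, intra: bool = False) -> float:
    """Y-PSNR model of Eq. (5.3)."""
    u = 5.0 / 6.0 if intra else 1.0
    return -NU1 * math.log10(mad_l ** u + 1.0) * q_l + NU2

# Example: base layer at QP 38, first enhancement layer at QP 32,
# with an assumed base-layer residual MAD of 4.0.
mad_base = 4.0
mad_enh = mad_estimate(mad_base, q_ref=38, q_l=32)
print(psnr_estimate(mad_base, 38))  # ~33 dB for the base layer
print(psnr_estimate(mad_enh, 32))   # ~37 dB when decoding the enhancement
```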

5.3    Uplink Rate Adaptation Problem Formulation

We assume every user has an uplink buffer that holds the incoming coded video frames. When a user is scheduled for transmission, the buffer output rate is chosen depending on the channel quality and buffer occupancy. Therefore, the rate adaptation problem is reformulated as a buffer control problem for every user. In this section, we give a Markovian dynamic game description of this problem.

We assume that the time slot coincides with the video frame duration (30–40 ms [7]). Each user has a fixed packet arrival rate, and the incoming packets are SVC bitstreams with both base and enhancement layers. We denote the incoming number of video frames of user k as \(f_{\mathrm{in},k}\) \((k = 1, 2, \ldots, K)\).

Let \(b^n_k\) represent the buffer occupancy state of user k at time n, with \(b^n_k \in \{0, 1, \ldots, B\}\), where B is the maximum number of coded video frames that can be stored in the buffer. The composition of the buffer states of all K users is \(\mathbf{b}^n = \{b^n_1, \ldots, b^n_K\}\), with \(\mathbf{b}^n \in \mathcal{B}\), where \(\mathcal{B}\) denotes the space of system buffer states. Furthermore, we use \(\mathbf{b}^n_{-k}\) to denote the buffer state composition of all the users excluding user k.


FIGURE 5.2
Illustration of the modeled PSNR estimates (a and b), and the modeled rate estimates (c and d) for the base layer and two CGS enhancement layers of the sequences Foreman and Football.

The channel quality of user k at time n is denoted \(h^n_k\). The channel is characterized using circularly symmetric complex Gaussian random variables that depend only on the previous time slot. By quantizing the channel quality metric, we denote the resulting discrete channel state space by \(h^n_k \in \{1, 2, \ldots\}\). Let \(\mathbf{h}^n = \{h^n_1, \ldots, h^n_K\}\) be the composition of the channel states of all K users. Assuming that \(\mathbf{h}^n \in \mathcal{H}\), \(n = 1, 2, \ldots, N\), is block-fading, with block length equal to each time period, the channel states constitute a Markov process with transition probability from time n to (n + 1) given by \(\mathbb{P}(\mathbf{h}^{n+1} \mid \mathbf{h}^n)\).

Let \(m^n_k\) be the MAD of user k at time n. In [9], the MAD between two consecutive video frames is modeled as a stationary first-order Gauss-Markov process. We quantize the range of \(m^n_k\) to obtain a video variability state space denoted by \(m^n_k \in \mathcal{M} = \{0, 1, 2, \ldots, M\}\). The video states constitute a Markov process with transition probabilities given by \(\mathbb{P}(m^{n+1}_k \mid m^n_k)\). The system video state at time n is the composition of the video states of all the users, \(\mathbf{m}^n = \{m^n_1, m^n_2, \ldots, m^n_K\}\).

The system state at time n is composed of the channel state, video state, and buffer state, denoted \(s^n = \{\mathbf{h}^n, \mathbf{m}^n, \mathbf{b}^n\}\). Let S denote the finite system state space, which comprises the channel state space \(\mathcal{H}\), the video state space \(\mathcal{M}\), and the user buffer state space \(\mathcal{B}\); that is, \(S = \mathcal{H} \times \mathcal{M} \times \mathcal{B}\), where × denotes a Cartesian product. \(S_k\) is used to indicate the states where user k is scheduled for transmission. \(S_1, S_2, \ldots, S_K\) are disjoint subsets of S with the property \(S = S_1 \cup S_2 \cup \cdots \cup S_K\).

5.3.1    Actions, Transmission Reward, and Holding Cost

The action \(a^n_k\) taken by user k at time slot n is defined as the number of coded video frames taken from the buffer and transmitted. The rate-control algorithm considered in this chapter adjusts the output frame rate \(a^n_k\) according to the channel quality and buffer occupancy. Figure 5.3 illustrates the buffer control mechanism adopted in this chapter.

5.3.1.1    Actions

Let \(a^n_k\) denote the action of the kth user at the nth time slot. We express \(a^n_k\) as the number of coded video frames included in the outgoing traffic payload. If \(s^n \in S_k\), then \(a^n_k \in A = \{a_{\min}, \ldots, a_{\max}\}\); otherwise, \(a^n_k = 0\). Here \(a_{\min}\) and \(a_{\max}\) denote the minimum and maximum number of video frames that user k can output at time n, respectively.

We assume that the number of video layers \(l_{a^n_k}\) at the output window at time slot n is dictated by a channel-dependent rule. Therefore, we express the output number of layers of the video frames at time slot n as follows:


FIGURE 5.3
Example of the buffer control mechanism assumed in this chapter, where \(f_{\mathrm{in},k} = 1\) \((k = 1, 2, \ldots, K)\). At every time slot, a new coded video frame enters the buffer. The buffer output depends on the scheduling algorithm involved, the buffer occupancy, and the channel quality. If a user is scheduled for transmission, then the action taken will extract a specific number l of video frame layers from up to N frames stored in the buffer.

\[l_{a^n_k} = g(h^n_k),\]

(5.4)

where g(·) is the channel-dependent rule, an increasing function of the channel state. The remaining video layers \(l > l_{a^n_k}\) are dropped. This assumption reduces the action space, and it allows us to derive the structural results illustrated later in Section 5.4.2. Consequently, the action space constitutes the number of video frames to be transmitted and is given by \(A = \{1, 2, \ldots, W\}\).

Let \(L_k\) be the number of video layers admitted into the buffer of user k; each input packet of user k has \(L_k\) layers. The video layer index is denoted \(l \in L = \{0, 1, \ldots, L_k - 1\}\), where l = 0 corresponds to the base layer and \(l \geq 1\) to every subsequent enhancement layer.

The buffer output \(a^n_k\) of user k at time n is determined by the modified CSMA/CA decentralized channel access algorithm. If user k is scheduled for transmission at time slot n, then \(a^n_k > 0\); otherwise, \(a^n_k = 0\). However, the system selectively transmits some packet layers according to the channel quality: for example, it transmits all \(L_k\) layers of each packet if the channel quality \(h^n_k\) is good, while it transmits only the base layer of each packet if the channel quality \(h^n_k\) is poor. A minimal sketch of such a channel-dependent layer rule is given below.
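For illustration, a toy instance of the channel-dependent rule g(·) in (5.4) might look as follows; the particular mapping (base layer only on a poor channel, all three layers on a good one) is an assumption, not a rule prescribed by the chapter.

```python
def layers_allowed(h_k: int, table=(1, 3)) -> int:
    """Toy channel-dependent layer rule g(.) of Eq. (5.4): an increasing map
    from the quantized channel state (1 = poor, 2 = good) to the number of
    video layers transmitted. The table values are illustrative assumptions."""
    return table[h_k - 1]
```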

5.3.1.2    Transmission Reward

Let \(c_k(s^n, a^n_1, \ldots, a^n_K)\) denote the transmission reward of user k at time n. Specifically, \(c_k(s^n, a^n_1, \ldots, a^n_K)\) is chosen to be the expected video PSNR resulting from transmitting \(l_{a^n_k}\) video layers from the \(a^n_k\) video frames taken from the buffer of user k. Let \(\delta(l, m^n_k)\) be the video PSNR achieved by decoding all packets up to layer l of user k at time n with video state \(m^n_k\), where \(m^n_k\) is the base layer MAD. Let \(p_l(h^n_k)\) be the packet error rate (PER) encountered by user k at the lth layer during the transmission. The transmission reward can be expressed as [16]

\[c_k(s^n, a^n_1, \ldots, a^n_K) = a^n_k \left( \left(1 - p_0(h^n_k)\right) \delta(0, m^n_k) + \sum_{l=1}^{l_{a^n_k}-1} \prod_{j=0}^{l} \left(1 - p_j(h^n_k)\right) \left[\delta(l, m^n_k) - \delta(l-1, m^n_k)\right] \right).\]

(5.5)

The bit errors are assumed to be independent and identically distributed (i.i.d.), and \(\delta(l, m^n_k)\) is given in (5.3). Note that the video distortion measure \(\delta(l, m^n_k)\) can be taken either as the picture mean-squared error (MSE) or as the PSNR. In the case of PSNR, the objective is to maximize \(c_k(s^n, a^n_k)\); otherwise, it is minimized. In this chapter, we consider the PSNR as the video quality metric.

It can be seen from (5.5) that if the kth user is scheduled for transmission at time slot n, the reward of user k depends solely on its own channel state \(h^n_k\) and action, and not on the actions of the other users. Thus, for \(s^n \in S_k\), the rewards of all the users in the system are

\[c_k(s^n, a^n_1, \ldots, a^n_K) = c_k(s^n, a^n_k) \geq 0,\]

(5.6)

\[c_i(s^n, a^n_1, \ldots, a^n_K) = 0, \quad i \neq k.\]

(5.7)

For notational convenience, in the following sections, we will drop the subscript k by defining

\[c(s^n, a^n_k) := c_k(s^n, a^n_k).\]

5.3.1.3    Holding Cost

Each user has an instantaneous quality of service (QoS) constraint denoted by \(d_k(s^n, a^n_1, \ldots, a^n_K)\), where \(k = 1, \ldots, K\). If the QoS metric is chosen to be the delay (latency), then \(d_k(s^n, a^n_1, \ldots, a^n_K)\) is a function of the buffer state \(b^n_k\) of the current user k. The instantaneous holding costs will subsequently be included in an infinite horizon latency constraint.

We express the holding cost of user k with \(s^n \in S_k\) as follows:

\[d_k(s^n, a^n_k) = \frac{1}{\kappa}\left(\left[b^n_k - a^n_k + f_{\mathrm{in},k}\right]^+\right)^\tau, \quad \tau \geq 1,\]

(5.8)

where κ is the average output frame rate, which is determined by the system, and τ is a constant factor specified to satisfy \(\tau \geq 1\); this reflects the fact that the data overflow probability becomes higher with more data in the buffer. The latency cost at time slot n is evaluated by the buffer state after taking action \(a^n_k\).

Since the transmission latency is independent of the actions of all the remaining users, it can be simplified as

\[d_k(s^n, a^n_1, \ldots, a^n_K) = d_k(s^n, a^n_k) \geq 0, \quad s^n \in S_k.\]

(5.9)

For the remaining K – 1 users who are not scheduled for transmission, their holding costs can be expressed as

\[d_i(s^n, a^n_k) = \frac{1}{\kappa}\left(\min\left(b^n_i + f_{\mathrm{in},i}, B\right)\right)^\tau, \quad i \neq k.\]

(5.10)
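The following sketch puts the per-slot reward (5.5) and holding cost (5.8)/(5.10) into code. The PER and PSNR inputs are assumed to be supplied by the channel and distortion models; the single holding-cost routine covers both cases by setting \(a_k = 0\) for non-scheduled users, a simplification that also applies the buffer cap of (5.10) to the scheduled user.

```python
def transmission_reward(a_k: int, layers: int, per, psnr) -> float:
    """Expected PSNR reward of Eq. (5.5) for the scheduled user.

    a_k    : number of frames transmitted
    layers : number of layers l_{a_k} allowed by the channel rule (5.4)
    per    : per[l] = packet error rate p_l(h_k) at layer l
    psnr   : psnr[l] = delta(l, m_k), PSNR when decoding up to layer l
    """
    reward = (1.0 - per[0]) * psnr[0]   # base layer contribution
    survive = 1.0 - per[0]
    for l in range(1, layers):
        survive *= 1.0 - per[l]         # layers 0..l all received
        reward += survive * (psnr[l] - psnr[l - 1])
    return a_k * reward

def holding_cost(b_k: int, a_k: int, f_in: int,
                 kappa: float = 1.0, tau: float = 1.0, B: int = 10) -> float:
    """Holding cost (5.8); with a_k = 0 this reduces to (5.10)."""
    occupancy = min(max(b_k - a_k + f_in, 0), B)
    return (1.0 / kappa) * occupancy ** tau
```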

5.3.2    Markovian Game Formulation

The intuition behind the Markovian game formulation of the rate adaptation problem in a WLAN system is as follows. A WLAN system has no central authority for resource allocation, and the system access rule is typically a decentralized opportunistic scheme. Due to the selfish nature of the system users, each user aims to maximize its own payoff. The interaction among users is characterized by competition for the common system resources, which is naturally formulated using game theory. By modeling the transmission channel as a correlated Markov process, the rate adaptation problem can further be formulated as a Markovian game.

At time instant n, assume user k is scheduled for transmission according to the system access rule specified in Section 5.3.3. We define \(c(s^n, a^n_k) \geq 0\) to be the instantaneous reward of user k when the system is in state \(s^n\). We use \(\Phi_1, \Phi_2, \ldots, \Phi_K\) to represent the sets of all deterministic policies of each user, respectively. The infinite horizon expected total discounted reward of any user i, given its transmission policy \(\pi_i\) \((\pi_i \in \Phi_i)\), can be written as

\[C_i(\pi_i) = \mathbb{E}_{s^n}\left[\sum_{n=1}^{\infty} \beta^{n-1} c_i(s^n, a^n_k) \,\middle|\, \pi_i\right],\]

(5.11)

where \(0 \leq \beta < 1\) is the discount factor, the state \(s^n \in S_k\), and the expectation is taken over the action \(a^n_k\) as well as the system state \(s^n\) evolution for \(n = 1, 2, \ldots\). The holding cost of user i at that time is \(d_i(s^n, a^n_k)\), and the infinite horizon expected total discounted latency constraint can be written as

\[D_i(\pi_i) = \mathbb{E}_{s^n}\left[\sum_{n=1}^{\infty} \beta^{n-1} d_i(s^n, a^n_k) \,\middle|\, \pi_i\right] \leq \tilde{D}_i,\]

(5.12)

with \(\tilde{D}_i\) being a system parameter depending on the system requirement on user i.

Optimization problem: Given the policies of the other users, the optimal transmission policy \(\pi^*_i\) of user i is chosen so as to maximize the overall expected discounted reward subject to its constraint; this can be written mathematically as

\[\pi^*_i = \left\{\pi_i : \max_{\pi_i \in \Phi_i} C_i(\pi_i) \ \ \text{s.t.} \ \ D_i(\pi_i) \leq \tilde{D}_i \right\}, \quad i = 1, \ldots, K.\]

(5.13)

Note that the choice of the optimal rate transmission policy of the ith user is a function of the policies of the other users. Each user aims to optimize its own discounted reward (5.11) under the latency constraint (5.12).

5.3.3    System Access Rule

This chapter adopts a WLAN system model (IEEE 802.11 [14]) with a modified CSMA/CA mechanism, implemented through a decentralized channel access rule. The decentralized channel access algorithm is constructed as follows. At the beginning of a time slot, each user k attempts to access the channel after a certain time delay \(t^n_k\). The time delay of user k can be specified via an opportunistic scheduling algorithm [17], such as

\[t^n_k = \frac{\gamma_k}{b^n_k \, h^n_k}.\]

(5.14)

Here, \(\gamma_k\) is a user-specified QoS parameter, with \(\gamma_k \in \{\gamma_p, \gamma_s\}\). As soon as a user successfully accesses the channel, the remaining users detect the channel occupancy and stop their access attempts. Let \(k^{*n}\) denote the index of the first user that successfully accesses the channel. If there are multiple users with the same minimum waiting time, \(k^{*n}\) is chosen from these users uniformly at random.
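A minimal sketch of this decentralized access rule is given below, using the waiting-time expression (5.14) as reconstructed above; the γ values in the usage example are assumptions for illustration.

```python
import random

def access_winner(gamma, b, h) -> int:
    """Pick the scheduled user: each user k waits t_k = gamma_k / (b_k * h_k)
    (Eq. 5.14), the smallest waiting time wins, and ties are broken uniformly
    at random."""
    waits = [g / (bk * hk) if bk * hk > 0 else float("inf")
             for g, bk, hk in zip(gamma, b, h)]
    t_min = min(waits)
    contenders = [k for k, t in enumerate(waits) if t == t_min]
    return random.choice(contenders)

# Example: the user with the fuller buffer and better channel waits less.
print(access_winner(gamma=[1.0, 1.0], b=[8, 2], h=[2, 1]))  # -> 0
```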

5.3.4    Transition Probabilities and Switching Control Game Formulation

With the aforementioned setup, the system satisfies the assumptions of a special type of dynamic game called a switching control game [11,12,13], where the transition probabilities in any given state depend on only one player. It is known that the Nash equilibrium of such a game can be computed by solving a sequence of MDPs [12]. The decentralized transmission control problem in a Markovian block-fading channel WLAN system can now be formulated as a switching control game. In such a game [12], the transition probabilities depend only on the action of the kth user when the state \(s \in S_k\); this enables us to solve this type of game by a finite sequence of MDPs.

According to the property of the switching control game, when the kth user is scheduled for transmission, the transition probability between the current composite state \(s = \{h, m, b\}\) and the next state \(s' = \{h', m', b'\}\) depends only on the action \(a_k\) of the kth user. The transition probability function of our problem can now be expressed mathematically by the following equation:

\[\mathbb{P}(s' \mid s, a_1, a_2, \ldots, a_K) = \mathbb{P}(s' \mid s, a_k) = \prod_{i=1}^{K} \mathbb{P}(h'_i \mid h_i) \prod_{i=1}^{K} \mathbb{P}(m'_i \mid m_i) \prod_{\substack{i=1 \\ i \neq k}}^{K} \mathbb{P}(b'_i \mid b_i)\, \mathbb{P}(b'_k \mid b_k, a_k).\]

(5.15)

As each user is equipped with a size-B buffer, the buffer occupancy of the scheduled user k evolves according to Lindley's equation [18,19]:

\[b'_k = \min\left(\left[b_k - a_k + f_{\mathrm{in},k}\right]^+, B\right).\]

(5.16)

The buffer state of each user \(i = 1, 2, \ldots, K\), \(i \neq k\), evolves according to the following rule:

\[b'_i = \min\left(b_i + f_{\mathrm{in},i}, B\right).\]

For user k, the buffer state transition probability depends on the input and output data rates. This can be expressed as follows:

\[\mathbb{P}(b'_k \mid b_k, a_k) = I\left\{b'_k = \min\left(\left[b_k - a_k + f_{\mathrm{in},k}\right]^+, B\right)\right\},\]

(5.17)

where \(I\{\cdot\}\) is the indicator function.

For those users that are not scheduled for transmission, the buffer state transition probability depends only on the distribution of incoming traffic, which can be written as

\[\mathbb{P}(b'_i \mid b_i) = I\left\{b'_i = \min\left(b_i + f_{\mathrm{in},i}, B\right)\right\}.\]

(5.18)
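Since the transition kernels (5.17) and (5.18) are indicator functions, the buffer dynamics are deterministic given the action; a short sketch makes this explicit.

```python
def next_buffer_states(b, a_k: int, k: int, f_in, B: int = 10):
    """Buffer evolution of (5.16) and (5.18): Lindley recursion for the
    scheduled user k, pure arrivals capped at B for everyone else."""
    b_next = []
    for i, b_i in enumerate(b):
        if i == k:
            b_next.append(min(max(b_i - a_k + f_in[i], 0), B))  # Eq. (5.16)
        else:
            b_next.append(min(b_i + f_in[i], B))                # Eq. (5.18)
    return b_next

# Example: two users, user 0 scheduled and transmitting three frames.
print(next_buffer_states(b=[5, 4], a_k=3, k=0, f_in=[1, 1]))  # -> [3, 5]
```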

Equations 5.11, 5.12, and 5.15 define the constrained switching control Markovian game that we consider. Our goal is to solve such games, that is, to compute a Nash equilibrium* policy \(\pi^*_i\), \(i = 1, \ldots, K\) (which is not necessarily unique). However, if both the reward (5.11) and the constraint (5.12) are zero-sum among all the users, then all the Nash equilibria have a unique value vector and are globally optimal [12].

The Markovian switching control game can be solved by constructing a sequence of MDPs as described in Algorithm 5.1; we refer to [12, Chapter 3.2] and [11] for the proof. Thus, the constrained switching control Markovian game ((5.11) and (5.12)) can be solved by iteratively updating the transmission policy \(\pi^{*n}_i\) of each user \(i = 1, \ldots, K\) with the policies of the remaining users fixed. Here, n denotes the iteration index as mentioned in Algorithm 5.1. At each step, the ith user aims to maximize its overall expected discounted reward subject to its constraint, as specified in (5.13).

5.4    Nash Equilibrium Solutions to the Markovian Dynamic Game

This section first presents a value iteration algorithm to compute the Nash equilibrium solution of the formulated general-sum dynamic Markovian switching control game and proves the convergence of the algorithm. Then, we introduce four assumptions on the transmission reward, holding cost, and state transition probability functions, which lead to the structural result on the Nash equilibrium policy. This structural result enables us to search for the Nash equilibrium policy in the policy space using a policy-gradient algorithm, at reduced computational cost.

5.4.1    Value Iteration Algorithm

In [12], a value iteration algorithm was designed to calculate the Nash equilibrium of an unconstrained general-sum dynamic Markovian switching control game. Therefore, we convert the problem in (5.13) to an unconstrained one using Lagrangian relaxation and use the value iteration algorithm specified in Algorithm 5.1 to find a Nash equilibrium solution.

Algorithm 5.1: Value iteration algorithm

Step 1:

Set m = 0; initialize the step size l

Initialize \(\{V^0_1, V^0_2, \ldots, V^0_K\}\) and \(\{\lambda^0_1, \lambda^0_2, \ldots, \lambda^0_K\}\)

Step 2: Inner loop: Set n = 0

Step 3: Inner loop: Update transmission policies

for k = 1: K do

for each \(s \in S_k\),

\(\pi^n_k(s) = \arg\min_{a_k \in A} \left\{ -c(s, a_k) + \lambda^m_k d_k(s, a_k) + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s') \right\};\)

\(v^{n+1}_k(s) = -c(s, \pi^n_k(s)) + \lambda^m_k d_k(s, \pi^n_k(s)) + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, \pi^n_k(s))\, v^n_k(s');\)

\(v^{n+1}_i(s) = \lambda^m_i d_i(s, \pi^n_k(s)) + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, \pi^n_k(s))\, v^n_i(s'), \quad i = 1, \ldots, K,\ i \neq k;\)

end for

Step 4: If \(\|V^{n+1}_k - V^n_k\| \geq \varepsilon\) for any \(k = 1, \ldots, K\), set n = n + 1 and return to Step 3; otherwise, go to Step 5

Step 5: Update Lagrange multipliers

for k = 1: K do

\(\lambda^{m+1}_k = \lambda^m_k + \frac{1}{l}\left[D_k(\pi^n_1, \pi^n_2, \ldots, \pi^n_K) - \tilde{D}_k\right]\)

end for

Step 6: The algorithm stops when \(\lambda^m_k\), \(k = 1, 2, \ldots, K\), converge; otherwise, set m = m + 1 and return to Step 2

The algorithm can be summarized as follows. We use \(V^n_k\) and \(\lambda^m_k\) to represent the value vector and the Lagrange multiplier of user k \((k = 1, \ldots, K)\) at inner iteration n and outer iteration m, respectively. The algorithm consists of two parts: the outer loop and the inner loop. The outer loop updates the Lagrange multipliers of each user, and the inner loop optimizes the transmission policy of each user under fixed Lagrange multipliers. The outer and inner loop indices are m and n, respectively. Note from Algorithm 5.1 that the interaction between users is through the update of the value vectors, since \(v^{n+1}_i(s)\), \(i \neq k\), is a function of \(\pi^n_k(s)\).

In Step 1, we set the outer loop index m to 0 and initialize the step size l, the value vectors \(V^0_k\), and the Lagrange multipliers \(\lambda^0_k\), \(k = 1, \ldots, K\). Step 3 is the inner loop, where at each step we solve the kth user-controlled game and obtain the new optimal strategy for that user with the strategies of the remaining players fixed. Note that the objective of the system is to maximize the transmission reward and minimize the holding cost, as shown in (5.13); since the negative reward \(-c(s, a_k)\) is used at each step, the optimal transmission policy \(\pi^n_k(s)\) is obtained by minimization. The variable ε in Step 4 is a small number chosen to ensure the convergence of \(V^n_k\). Step 5 updates the Lagrange multipliers based on the discounted delay value of each user given the transmission policies \(\{\pi^n_1, \pi^n_2, \ldots, \pi^n_K\}\); here \(\frac{1}{l}\) is the step size, which satisfies the conditions for convergence of the policy-gradient algorithm. The sequence of Lagrange multipliers \(\{\lambda^m_1, \ldots, \lambda^m_K\}\), \(m = 0, 1, 2, \ldots\), converges in probability to \(\{\lambda^*_1, \ldots, \lambda^*_K\}\), which satisfies the constrained problem defined in (5.13) [20,21]. The algorithm terminates when \(\lambda^m_k\), \(k = 1, 2, \ldots, K\), converge; otherwise, it returns to Step 2.
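The following sketch shows the inner loop of Algorithm 5.1 for one user with a fixed Lagrange multiplier, once the game has been reduced to that user's MDP. It assumes the state space has been enumerated and the transition matrices, reward table, and holding-cost table have been built elsewhere; it is a sketch of Steps 3 and 4, not a full multiuser implementation.

```python
import numpy as np

def inner_loop(P, c, d, lam: float, beta: float = 0.9, eps: float = 1e-6):
    """Value iteration for the MDP controlled by user k (Algorithm 5.1,
    Steps 3-4) under a fixed Lagrange multiplier lam.

    P[a] : |S| x |S| transition matrix under action a (Eq. 5.15)
    c, d : |S| x |A| reward and holding-cost tables for user k
    """
    n_states, n_actions = c.shape
    v = np.zeros(n_states)
    while True:
        # Bellman operator on the Lagrangian cost -c + lam * d (Step 3).
        q = np.stack([-c[:, a] + lam * d[:, a] + beta * P[a] @ v
                      for a in range(n_actions)], axis=1)
        v_new = q.min(axis=1)
        if np.max(np.abs(v_new - v)) < eps:    # stopping rule of Step 4
            return q.argmin(axis=1), v_new     # policy and value vector
        v = v_new
```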

Since this is a constrained optimization problem, the optimal transmission policy is a randomization of two deterministic policies [18,19]. Let \(\lambda^*_k\), \(k = 1, 2, \ldots, K\), represent the Lagrange multipliers obtained with Algorithm 5.1. The randomized policy of each user can be written as

\[\pi^*_k(s) = q_k\, \pi^*_k(s, \lambda_{k,1}) + (1 - q_k)\, \pi^*_k(s, \lambda_{k,2}),\]

(5.19)

where \(0 \leq q_k \leq 1\) is the randomization factor, and \(\pi^*_k(s, \lambda_{k,1})\) and \(\pi^*_k(s, \lambda_{k,2})\) are the unconstrained optimal policies with Lagrange multipliers \(\lambda_{k,1}\) and \(\lambda_{k,2}\). Specifically, \(\lambda_{k,1} = \lambda^*_k - \Delta\) and \(\lambda_{k,2} = \lambda^*_k + \Delta\) for a perturbation parameter Δ. The randomization factor \(q_k\) of the kth user is calculated by

\[q_k = \frac{\tilde{D}_k - D_k(\lambda_{1,2}, \ldots, \lambda_{K,2})}{D_k(\lambda_{1,1}, \ldots, \lambda_{K,1}) - D_k(\lambda_{1,2}, \ldots, \lambda_{K,2})}.\]

(5.20)
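As a small worked step, the factor \(q_k\) can be computed directly from the two delays; the helper below assumes the two unconstrained policies have already been evaluated, with \(D_1\) and \(D_2\) denoting \(D_k(\lambda_{\cdot,1})\) and \(D_k(\lambda_{\cdot,2})\), respectively.

```python
def randomization_factor(D_tilde: float, D_1: float, D_2: float) -> float:
    """Randomization factor q_k of Eq. (5.20). Typically D_2 <= D_tilde <= D_1
    (the less-penalized policy incurs the larger delay), so q_k lies in [0, 1]."""
    return (D_tilde - D_2) / (D_1 - D_2)
```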

The convergence proof of the inner loop of Algorithm 5.1 is given in Appendix 5.A.1; the value iteration algorithm obtains a Nash equilibrium solution to the constrained switching control Markovian game with general-sum reward and general-sum constraint [12].

Remark: The primary purpose of the value iteration algorithm is to prove the structural results on the Nash equilibrium policy. The value iteration algorithm (Algorithm 5.1) is not easy to implement in a practical system, because at each iteration of the algorithm, a user k is required to know the channel states of all the other users and the system state transition probability matrix.

5.4.2    Structural Result on Randomized Threshold Policy

In this section, we characterize the structure of the Nash equilibrium achieved by Algorithm 5.1. First, we list four assumptions. Based on these four assumptions, Theorem 5.1 is introduced:

•  A1: The set of policies that satisfy constraint (5.12) is nonempty; this ensures the delay constraint of the system is feasible.

A1 is a feasibility assumption on the system constraint, and it is assumed to be satisfied.

•  A2: The transmission reward \(c(s, a_k)\) is a supermodular* function of \((b_k, a_k)\) for any channel quality \(h_k\) and MAD \(m_k\) of the current user. \(c(s, a_k)\) is also an integer-concave function of \(b_k - a_k\) for any \(h_k\) and \(m_k\).

It can be seen from Section 5.3.1 that the transmission reward \(c(s, a_k)\) is linear in \(a_k\) (5.5) and independent of the buffer state \(b_k\); thus, assumption A2 holds.

•  A3: The holding cost \(d_k(s, a_k)\) is a submodular function of \((b_k, a_k)\) for any channel quality \(h_k\) and MAD \(m_k\) of the current user. \(d_k(s, a_k)\) is also integer-convex in \(b_k - a_k\) for any \(h_k\) and \(m_k\).

The holding cost of user k with \(s \in S_k\) is as follows:

\[d_k(s, a_k) = \frac{1}{\kappa}\left(b_k - a_k + f_{\mathrm{in},k}\right)^\tau.\]

(5.21)

The submodularity property of the holding cost function \(d_k(s, a_k)\) can easily be verified. It is also straightforward to show that \(d_k(s, a_k)\) is integer-convex in \(b_k - a_k\).

•  A4: \(\mathbb{P}(b'_k \mid b_k, a_k)\) is second-order stochastically increasing in \(b_k - a_k\) for any \(h_k\) and \(m_k\).

The expression for \(\mathbb{P}(b'_k \mid b_k, a_k)\) given by (5.17) shows that it is first-order stochastically increasing in \(b_k - a_k\). First-order stochastic dominance implies second-order stochastic dominance [18,19]; thus, assumption A4 holds.

The following theorem is one of our main results. It shows that the Nash equilibrium of a constrained dynamic switching control game is a randomization of two pure monotone policies.

Theorem 5.1 For each user k and a given channel state, the Nash equilibrium policy \(\pi^*_k(s)\) is a randomization of two pure policies \(\pi^1_k(s)\) and \(\pi^2_k(s)\). Each of these two pure policies is a nondecreasing function of the buffer occupancy state \(b_k\).

The detailed proof of Theorem 5.1 is given in Appendix 5.A.2.

5.4.3    Learning Nash Equilibrium Policy via Policy-Gradient Algorithm

The value iteration in Algorithm 5.1 is applicable to a general type of constrained switching control game. When the system parameters satisfy assumptions A1 through A4 stated earlier, we can exploit the structural result (Theorem 5.1) to search for the Nash equilibrium policy directly in the policy space. This results in a significant reduction in the complexity of computing the Nash equilibrium policies.

The main idea of searching for the Nash equilibrium policy in the policy space is as follows. If there are three available actions, \(a_k \in \{1, 2, 3\}\), the Nash equilibrium policy \(\pi^*_k(s)\) is a randomized mixture:

\[\pi^*_k(s) = \begin{cases} 1, & 0 \leq b_k < b_1(s) \\ p_1, & b_1(s) \leq b_k < b_2(s) \\ 2, & b_2(s) \leq b_k < b_3(s) \\ p_2, & b_3(s) \leq b_k < b_4(s) \\ 3, & b_4(s) \leq b_k. \end{cases}\]

(5.22)

Here, \(p_1, p_2 \in [0, 1]\) are the randomization factors, and \(b_1(s), b_2(s), b_3(s)\), and \(b_4(s)\) are the buffer thresholds. The search for each Nash equilibrium policy is thus converted into the estimation of these six parameters. Note that the policy search applies to a system with any number of actions; here we consider a three-action system as an example.
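A sketch of how such a parameterized policy is evaluated is given below; interpreting the bands with values \(p_1\) and \(p_2\) as randomizing between the two neighboring actions is our reading of (5.22).

```python
import random

def threshold_policy(b_k: int, thresholds, probs) -> int:
    """Randomized monotone threshold policy of Eq. (5.22) for three actions.

    thresholds : (b1, b2, b3, b4), nondecreasing buffer thresholds for state s
    probs      : (p1, p2), mixing probabilities in the randomized bands
    """
    b1, b2, b3, b4 = thresholds
    p1, p2 = probs
    if b_k < b1:
        return 1
    if b_k < b2:
        return 1 if random.random() < p1 else 2  # randomize between 1 and 2
    if b_k < b3:
        return 2
    if b_k < b4:
        return 2 if random.random() < p2 else 3  # randomize between 2 and 3
    return 3
```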

The simultaneous perturbation stochastic approximation (SPSA) method [20] is adopted to estimate the parameters. This method is especially efficient in high-dimensional problems in terms of providing a good solution for a relatively small number of measurements for the objective function. The essential feature of SPSA is the underlying gradient approximation, which requires only two objective function measurements per iteration regardless of the dimension of the optimization problem. These two measurements are made by simultaneously varying, in a properly random fashion, all the variables in the problem. The detailed algorithm is described in Algorithm 5.2.

Algorithm 5.2: Policy-gradient algorithm

1:  Initialization: \(\theta^0\), \(\lambda^0\); n = 0; ρ = 4

2:  Initialize constant perturbation step size β and gradient step size α

3:  Main Iteration

4:  for k = 1: K do

5:  \(\dim_k = 6 \times |h_k| \times |h_i| \times |b_i|\)

6:  Generate \(\Delta^n = [\Delta^n_1, \Delta^n_2, \ldots, \Delta^n_{\dim_k}]^T\), whose entries are independent Bernoulli random variables taking values ±1 with probability \(p = \frac{1}{2}\)

7:  \(\theta^{n+}_k = \theta^n_k + \beta \Delta^n\)

8:  \(\theta^{n-}_k = \theta^n_k - \beta \Delta^n\)

9:  \(\Delta C^n_k = \frac{c_k(s^n, \theta^{n+}_k) - c_k(s^n, \theta^{n-}_k)}{2\beta} \left[(\Delta^n_1)^{-1}, (\Delta^n_2)^{-1}, \ldots, (\Delta^n_{\dim_k})^{-1}\right]^T\)

10:  \(\Delta D^n_k = \frac{d_k(s^n, \theta^{n+}_k) - d_k(s^n, \theta^{n-}_k)}{2\beta} \left[(\Delta^n_1)^{-1}, (\Delta^n_2)^{-1}, \ldots, (\Delta^n_{\dim_k})^{-1}\right]^T\)

11:  \(\theta^{n+1}_k = \theta^n_k - \alpha\left(-\Delta C^n_k + \Delta D^n_k \max\left[0,\ \lambda^n_k + \rho\left(D(s^n, \theta^n_k) - \tilde{D}_k\right)\right]\right)\)

12:  \(\lambda^{n+1}_k = \max\left[(1 - \alpha\rho)\lambda^n_k,\ \lambda^n_k + \alpha\left(D(s^n, \theta^n_k) - \tilde{D}_k\right)\right]\)

13:  end for

14:  The parameters of other users remain unchanged

15:  n = n + 1

16:  The iteration terminates when the parameter values \(\theta^n\) converge; otherwise, return to Step 3.

The first part of the algorithm initializes the system variables. \(\theta^n\) represents the union of all the parameters we search for at the nth time slot, \(\theta^n = \{\theta^n_1, \theta^n_2, \ldots, \theta^n_K\}\), where \(\theta^n_k\) is the parameter vector of the kth user. The Lagrange multipliers of all K users are \(\lambda^n = \{\lambda^n_1, \lambda^n_2, \ldots, \lambda^n_K\}\). β and α denote the constant perturbation step size and the constant gradient step size, respectively. When \(\alpha \to 0\), \(\alpha/\beta^2 \to 0\), and \(n \to \infty\), the policy-gradient algorithm converges weakly [22, Theorem 8.3.1]. In the main part of the algorithm, SPSA is applied to iteratively update the system parameters. When the kth user is scheduled to transmit at time slot n, the parameters \(\theta^n_k\) and the Lagrange multiplier \(\lambda^n_k\) are updated after introducing a random perturbation vector \(\Delta^n\); meanwhile, the parameters of the other users remain unchanged. The algorithm terminates when \(\theta^n\) converges.
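The following sketch condenses one SPSA iteration of Algorithm 5.2 for a single user, as reconstructed above. cost_fn and delay_fn stand for assumed simulators that return the sampled reward \(c_k\) and discounted delay for a given parameter vector; they are placeholders, not functions defined by the chapter.

```python
import numpy as np

def spsa_step(theta, lam: float, cost_fn, delay_fn, D_tilde: float,
              alpha: float = 0.01, beta: float = 0.1, rho: float = 4.0):
    """One SPSA update (Algorithm 5.2, Steps 6-12) for one user."""
    delta = np.where(np.random.rand(theta.size) < 0.5, -1.0, 1.0)  # Bernoulli +/-1
    theta_p, theta_m = theta + beta * delta, theta - beta * delta
    # Two-sided simultaneous-perturbation gradient estimates (Steps 9-10).
    grad_c = (cost_fn(theta_p) - cost_fn(theta_m)) / (2.0 * beta) / delta
    grad_d = (delay_fn(theta_p) - delay_fn(theta_m)) / (2.0 * beta) / delta
    # Augmented-Lagrangian parameter and multiplier updates (Steps 11-12).
    penalty = max(0.0, lam + rho * (delay_fn(theta) - D_tilde))
    theta_new = theta - alpha * (-grad_c + grad_d * penalty)
    lam_new = max((1.0 - alpha * rho) * lam,
                  lam + alpha * (delay_fn(theta) - D_tilde))
    return theta_new, lam_new
```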

Remark: At each iteration of the policy-gradient algorithm, user k is only required to know its own transmission reward and holding cost. The transmission reward and holding cost of user k are functions of its own channel state and buffer state (Section 5.4.2). Thus, the policy-gradient algorithm is distributed and implementable in a practical system.

5.5    Numerical Examples

In this section, we illustrate the performance improvement of the proposed dynamic switching control game policy over myopic policies. We also present numerical examples of the Nash equilibrium transmission policy. For our comparisons, we used the JSVM-9-8 reference software [23] to encode two reference video sequences in Common Intermediate Format (CIF) resolution: Foreman and Football. Each sequence is composed of 100 frames encoded in SVC, with an H.264/AVC-compatible base layer and two CGS enhancement layers. The coded video streams are composed of I and P frames only, with only the first frame encoded as an I frame, a frame rate of 30 fps, and a minimum QoS guarantee corresponding to an average base-layer QP value of \(q_0 = 38\). The first and second CGS enhancement layers have average QP values of 32 and 26, respectively. The video MAD \(m_k\) is quantized into three states, \(m_k \in \mathcal{M} = \{0, 1, 2\}\), such that

\[m_k = \begin{cases} 0, & \text{if } \mathrm{MAD}_k < \mu_k - \sigma_k/2, \\ 1, & \text{if } \mu_k - \sigma_k/2 \leq \mathrm{MAD}_k \leq \mu_k + \sigma_k/2, \\ 2, & \text{if } \mathrm{MAD}_k > \mu_k + \sigma_k/2, \end{cases}\]

where μk and σk are the mean and standard deviation of the MAD of user k.
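This quantization is straightforward to apply to a measured MAD trace, as in the following sketch.

```python
import numpy as np

def quantize_mad(mad_trace):
    """Quantize a per-frame MAD trace into the three video states of
    Section 5.5 using the per-user mean and standard deviation."""
    mu, sigma = np.mean(mad_trace), np.std(mad_trace)
    lo, hi = mu - sigma / 2.0, mu + sigma / 2.0
    return np.where(mad_trace < lo, 0, np.where(mad_trace > hi, 2, 1))
```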

Tables 5.1 and 5.2 list the bit rate in kilobits per second (kbps) and the average distortion in terms of Y-PSNR of each video layer of the two incoming video sequences.

In all the cases considered here, the channel quality measurements \(h_k\) of user k are quantized into two states, \(h_k \in \{1, 2\}\). In the models used, each user has a size-10 buffer with an incoming rate equivalent to increasing the buffer size by 1 during a transmission time slot \((f_{\mathrm{in},1} = f_{\mathrm{in},2} = \cdots = f_{\mathrm{in},K} = 1)\). One buffer unit is equivalent to one video frame. The transmission time slot duration is set to 10 ms, thus allowing a maximum transmission delay of 100 ms. This system configuration ensures that the transmission rewards, holding costs, and buffer transition probability matrices satisfy assumptions A2 and A3 specified in Section 5.4.2. The channel is assumed to be Markovian, and the transition probability matrices are generated randomly.

TABLE 5.1

Average Incoming Rate and Distortion (Y-PSNR) Characteristics of Video User Foreman

Video Layer           Bit Rate \(\bar{r}_{l,1}\) (kbps)    Y-PSNR (dB)
Base (l = 0)          40.9                                 30.75
First CGS (l = 1)     103.36                               34.59
Second CGS (l = 2)    242.64                               38.87

TABLE 5.2

Average Incoming Rate and Distortion (Y-PSNR) Characteristics of Video User Football

Video Layer           Bit Rate \(\bar{r}_{l,2}\) (kbps)    Y-PSNR (dB)
Base (l = 0)          55.82                                28.17
First CGS (l = 1)     139.03                               32.47
Second CGS (l = 2)    252.97                               37.29

In the system models used, each user has three different action choices when it is scheduled for transmission. The three actions are \(a_k = 1, 2, 3\), corresponding to a user buffer output of one, two, and three frames, respectively. The action \(a_k = 0\) corresponds to the No-Transmit state: when the Nash equilibrium policy is 0, the current user is not scheduled for transmission and \(a_k = 0\). However, the incoming traffic is still one frame, \(f_{\mathrm{in},k} = 1\).

5.5.1    General-Sum Constrained Game: Randomized Monotone Transmission Policy

Figure 5.4 considers the two-user system described earlier with a 25 ms transmission delay constraint. As this is a constrained switching control Markovian game, the Nash equilibrium policy is a randomization of two pure policies. The figure shows that the optimal transmission policies are no longer deterministic but are a randomization of two pure policies, and that each Nash equilibrium policy is monotone-nondecreasing in the buffer state. The results of Figure 5.4 are obtained by applying the value iteration algorithm. The first subfigure of Figure 5.4 shows the optimal policy of user 1 when \(h_2 = 1\) and \(b_2 = 1\), while the second subfigure shows the optimal policy of user 2 when \(h_1 = 1\) and \(b_1 = 1\).


FIGURE 5.4
The Nash equilibrium transmission policy obtained via the value iteration algorithm for users 1 and 2. The first subfigure is obtained when h2 = 1 and b2 = 1, and the second when h1 = 1 and b1 = 1. The transmission delay constraint is specified to be 25 ms. Each Nash equilibrium transmission policy is a randomization of two pure monotone policies.

5.5.2    Result by Policy-Gradient Algorithm

The policy-gradient algorithm we propose (Algorithm 5.2) uses the structural result on the optimal transmission policy: assuming there are two actions, each policy is determined by three parameters, namely, a lower threshold \(b_l(s)\), an upper threshold \(b_h(s)\), and a randomization factor p, as described in (5.22). The simulation results for the policy of user 1 with \(h_2 = 1\) and \(b_2 = 1\) are shown in Figure 5.5. In the system setup, each user has a buffer size of 10 and three actions, and the system transmission delay constraint is 25 ms. Comparing Figure 5.5 with the first subfigure of Figure 5.4, we see that the results obtained by the value iteration algorithm and the policy-gradient algorithm are very close.


FIGURE 5.5
The Nash equilibrium transmission control policy obtained via the policy-gradient algorithm for users 1 and 2. The result is obtained when (a) h2 = 1 and b2 = 1 and (b) h1 = 1 and b1 = 1. The transmission delay constraint is specified to be 25 ms. The transmission policy is monotone-nondecreasing in the user's own buffer state.

5.5.3    Transmission Buffer Management

We adopt a priority-based packet drop similar to the drop priority based (DPB) buffer management policy proposed in [8]. With this policy, the low-priority packets that have resided longest in the transmission buffer are dropped when the buffering delay approaches the delay constraint. In [8], the DPB policy was found to be second best to the drop dependency based (DDB) policy, which requires video dependency information that cannot be collected in real time. The DPB policy is attractive since it requires minimal priority information about the video payload, which is readily available in the video network abstraction layer unit (NALU) header of an MGS/CGS-coded video stream. We implement this policy by shifting the user action \(a_k\) to a lower state that satisfies the buffer delay constraint.

In order to demonstrate the effectiveness of the proposed dynamic switching control game algorithm, we compare its performance to that of a myopic policy that selects at each time slot the user action that maximizes the video PSNR while satisfying the transmission buffer delay constraint. The time slot duration equals the video frame capturing duration, which is around 30 ms. The myopic policy does not consider the effect of current actions on future states. We assume that the delay constraint corresponds to the packet play-out deadline; therefore, all buffered packets that exceed the delay constraint are assumed to be lost and are discarded from the user buffer. Figures 5.6 and 5.7 show the results of simulating the transmission of the two users (Foreman and Football) for both the proposed switching control game policy and the myopic policy. The figures show that under the same channel conditions, the proposed policy achieves better Y-PSNR performance and transmission buffer utilization for both users.


FIGURE 5.6
Result of the transmission of the Football sequence comparing the performance in terms of video PSNR and buffer utilization between the proposed switching control game policy (blue lines) and the myopic policy (red lines) with 80 ms delay constraint. The result shows that the proposed switching control game policy performs better than the myopic policy.


FIGURE 5.7
Result of the transmission of the Foreman sequence comparing the performance in terms of video PSNR and buffer utilization between the proposed switching control game policy (blue lines) and the myopic policy (red lines) with 80 ms delay constraint. The result shows that the proposed switching control game policy performs better than the myopic policy.

5.6    Conclusions

This chapter considers a time-slotted multiuser WLAN system where each user delivers an SVC video bitstream. The modified CSMA/CA mechanism of such a system is implemented through an opportunistic scheduling system access rule. By modeling the video states and block-fading channels as a finite-state Markov chain, the rate adaptation problem among users can be formulated as a constrained general-sum switching control Markovian game. The Nash equilibrium transmission policy of such a game can be computed through a value iteration algorithm. Under four assumptions on the transmission reward, holding cost, and transition probability functions (Section 5.4.2), the Nash equilibrium policy is a randomized mixture of two pure policies, each monotone-nondecreasing in the buffer state occupancy. This structural result enables us to search for the Nash equilibrium policy in the policy space via a policy-gradient algorithm.

Glossary

CGS

Coarse-grained scalability

CIF

Common Intermediate Format

CSMA/CA

Carrier sense multiple access with collision avoidance

DPB

Drop priority based

HD

High definition

HSDPA

High speed downlink packet access

IEEE

Institute of Electrical and Electronics Engineers

IP

Internet protocol

MAD

Mean absolute difference

MDP

Markov decision process

MGS

Medium-grained scalability

MSE

Mean-square-error

NALU

Network Abstraction Layer Unit

PER

Packet error rate

PSNR

Peak signal-to-noise ratio

QoS

Quality of service

QP

Quantization parameter

RTP

Real-time transport protocol

SNR

Signal-to-noise ratio

SPSA

Simultaneous perturbation stochastic approximation

SVC

Scalable video coding

WLAN

Wireless local area network

Y-PSNR

Luminance peak signal-to-noise ratio

Symbols

k: User index
K: Total number of users
l: Layer index
n: Time index
r: Bit rate
u: Power factor
B: Maximum number of coded video frames stored in the buffer
W: Number of actions
\(a^n_k\): Action of user k at time n
\(b^n_k\): Buffer occupancy state of user k at time n
\(c_k\): Transmission reward of user k
\(d_k\): Holding cost of user k
\(h^n_k\): Channel state of user k at time n
\(m^n_k\): MAD state of user k at time n
\(t^n_k\): Channel access time delay of user k at time n
\(p_l\): PER at the lth layer
\(q_{l,k}\): Quantization parameter at the lth layer of the kth user
\(m_l\): MAD at the lth layer
\(f_{\mathrm{in},k}\): Incoming number of video frames of user k
\(v^n_k(s)\): Value of user k at time n in state s
\(L_k\): Number of video layers admitted into the buffer of user k
\(C_k\): Discounted transmission reward of user k
\(D_k\): Discounted holding cost of user k
\(\tilde{D}_k\): Discounted holding cost constraint
\(\mathbf{b}^n\): Composition of the buffer states of all K users at time n
\(\mathbf{h}^n\): Composition of the channel states of all K users at time n
\(\mathbf{m}^n\): Composition of the MAD states of all K users at time n
\(s^n\): System state at time n
\(V^n_k\): Value vector of user k at time n
δ: Video PSNR
β: Discount factor
\(\gamma_k\): QoS parameter of user k
\(\theta^n\): Union of all the parameters at the nth time slot
\(\pi_k\): Transmission policy of user k
\(\pi^*_k\): Optimal transmission policy of user k
\(\lambda^m_k\): Lagrange multiplier of user k at outer iteration m
\(\Phi_k\): Set of all the deterministic policies of user k
\(\Delta^n\): Vector of Bernoulli random perturbations
P: Transition matrix

Appendix 5.A.1 Proof of Convergence of Algorithm 5.1

1.  By following the definition of \(\pi^n_k(s)\), \(s \in S_k\), from Algorithm 5.1, we can deduce that

\[V^{n+1}_k \leq V^n_k.\]

(5.23)

2.  When \(\pi^n_k(s) = \pi^{n+\Delta}_k(s)\) for all \(s \in S_k\), we have \(V^n_k = V^{n+\Delta}_k\). In view of (5.23), if \(V^n_k \neq V^{n+\Delta}_k\), then \(\pi^n_k(s) \neq \pi^{n+\Delta}_k(s)\) for some \(s \in S_k\), for any \(\Delta = 1, 2, 3, \ldots\)

3.  The payoff function of the matrix game with \(s \in S_k\) is of the form \([-c(s, a_k) + \lambda^m_k d_k(s, a_k)] + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s')\). Note that the first term, \([-c(s, a_k) + \lambda^m_k d_k(s, a_k)]\), is independent of \(V^n_k\). By the result of [12, Lemma 6.3.2], \(\pi^n_k(s)\), \(s \in S_k\), equals some extreme optimal action in a submatrix game with payoff function \([-c(s, a_k) + \lambda^m_k d_k(s, a_k)]\). As there is only a finite number of extreme optimal action candidates for \(\pi^n_k(s)\), \(s \in S_k\), there exist n and \(\Delta \geq 1\) such that \(\pi^n_k(s) = \pi^{n+\Delta}_k(s)\), which in turn implies \(V^n_k = V^{n+\Delta}_k\).

4.  If \(V^n_k = V^{n-1}_k\), then \(\pi^n_k(s)\), \(s \in S_k\), is the optimal action policy for each user \(k \in \{1, \ldots, K\}\) of the game with a fixed Lagrange multiplier \(\lambda^m_k\).

Based on observations 1 through 4, we conclude that Algorithm 5.1 converges in a finite number of iterations under fixed Lagrange multipliers \(\lambda^m_k\), \(k = 1, 2, \ldots, K\).

Appendix 5.A.2 Proof of Theorem 5.1

We choose the delay constraint parameters \(\tilde{D}_k\), \(k = 1, 2, \ldots, K\), in the optimization problem (5.13) carefully to ensure the existence of an optimal policy. The optimal policy is obtained when the objective function (5.11) is maximized and the delay constraint (5.12) holds with equality.

According to Algorithm 5.1, for each \(s \in S\), the optimal policy and value vector of the kth user are updated by the following steps:

\[\pi^n_k(s) = \arg\min_{a_k \in A}\left\{-c(s, a_k) + \lambda^m_k d_k(s, a_k) + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s')\right\};\]

(5.24)

\[v^{n+1}_k(s) = -c(s, \pi^n_k(s)) + \lambda^m_k d_k(s, \pi^n_k(s)) + \beta \sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, \pi^n_k(s))\, v^n_k(s').\]

(5.25)

A sufficient condition for the optimal transmission policy \(\pi^n_k(s)\) to be monotone-nondecreasing in the buffer state \(b_k\) is that the right-hand side of (5.24) is a submodular function of \((b_k, a_k)\). According to assumptions A2 and A3, \([-c(s, a_k) + \lambda^m_k d_k(s, a_k)]\) is a submodular function of \((b_k, a_k)\); thus, we only need to demonstrate that \(\sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s')\) is also submodular in the pair \((b_k, a_k)\).

Recall that the overall state space S is the union of the K subspaces, \(S = S_1 \cup S_2 \cup \cdots \cup S_K\). Thus, we can write \(\sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s')\) as a summation of K terms according to the partition of the state space, where the ith term \((i = 1, 2, \ldots, K)\) is denoted by \(Q_i(s, a_k)\). By using the property of the state transition probability (5.15) and assumption A4, we have the following result:

\[Q_i(s, a_k) = \sum_{s' \in S_i} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s') = \sum_{s' \in S_i} \prod_{l=1}^{K} \mathbb{P}(h'_l \mid h_l) \prod_{l=1}^{K} \mathbb{P}(m'_l \mid m_l) \prod_{\substack{l=1 \\ l \neq k}}^{K} \mathbb{P}(b'_l \mid b_l)\, \mathbb{P}(b'_k \mid b_k, a_k)\, v^n_k(s').\]

(5.26)

In order to show the submodularity of \(\sum_{s'=1}^{|S|} \mathbb{P}(s' \mid s, a_k)\, v^n_k(s')\), we only need to prove the submodularity of \(Q_i(s, a_k)\) for \(i = 1, 2, \ldots, K\).

The proof of Theorem 5.1 needs the result of Lemma 5.1, whose proof is given after the proof of the theorem, in Appendix 5.A.3.

Lemma 5.1 Under the conditions of Theorem 5.1, \(Q_i(s, a_k)\) is of the form \(Q_i(s, a_k) = \bar{Q}_i(b_k - a_k, \{h, m, b_{-k}\})\). The function \(\bar{Q}_i(x, y)\) is an integer-convex function of x for any given \(y = \{h, m, b_{-k}\}\).

According to Lemma 5.1, \(\bar{Q}_i(x, y)\) is integer-convex in x, which can be written as

\[\bar{Q}_i(x' + \Delta, y) - \bar{Q}_i(x', y) \geq \bar{Q}_i(x + \Delta, y) - \bar{Q}_i(x, y)\]

(5.27)

for \(x' \geq x\) and \(\Delta \geq 0\). Substituting in Equation 5.27 with \(x = b_k - a'_k\), \(x' = b_k - a_k\), and \(\Delta = b'_k - b_k\), for \(b'_k \geq b_k\) and \(a'_k \geq a_k\), we obtain

$$Q_i\big(\{b'^k, a'^k\}, y\big) - Q_i\big(\{b'^k, a^k\}, y\big) \leq Q_i\big(\{b^k, a'^k\}, y\big) - Q_i\big(\{b^k, a^k\}, y\big),$$

(5.28)

which is the definition of submodularity of $Q_i(s,a^k)$ in $(b^k, a^k)$. Furthermore, a nonnegative linear combination of submodular functions is still submodular. Thus, $\sum_{s'=1}^{|S|} P(s'\mid s,a^k)\, v^{nk}(s')$ is submodular in the pair $(b^k, a^k)$.
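The substitution step leading to (5.28) is easy to probe numerically. The following short Python check is a sketch in which the integer-convex $g$ is a hypothetical stand-in for $\bar{Q}_i(\cdot, y)$ at a fixed $y$; it verifies the submodularity inequality for a function of the form $f(b,a) = g(b-a)$ over a small grid:

import numpy as np

g = lambda x: (x - 3.0) ** 2                # hypothetical integer-convex g, standing in for Q_bar_i
f = lambda b, a: g(b - a)                   # Q_i-style dependence on b^k - a^k only

ok = True
for b in range(8):
    for bp in range(b, 8):                  # b' >= b
        for a in range(5):
            for ap in range(a, 5):          # a' >= a
                # the submodularity inequality (5.28)
                ok &= f(bp, ap) - f(bp, a) <= f(b, ap) - f(b, a) + 1e-12
print("submodular on the whole grid:", ok)

Replacing $g(b-a)$ with $g(b+a)$ makes the same check fail, since that form is supermodular rather than submodular; the direction of the buffer-action dependence is what yields the monotone structure.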

Based on these results, we can establish the submodularity of the right-hand side of Equation 5.24. This proves that the optimal transmission action policy $\pi^{nk}(s)$ is monotone nondecreasing in the buffer state $b^k$ under a fixed positive Lagrange multiplier.

According to [24], the optimal randomized policy under a general constraint is a mixed policy comprising two pure policies computed under two different Lagrange multipliers. As discussed earlier, each of these two pure policies is nondecreasing in the buffer state occupancy. Thus, the mixed policy inherits the nondecreasing property in the buffer state. This concludes the proof.
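As a small illustration of this final step, the sketch below mixes two monotone pure policies with a mixing probability $q$; the policy values and $q$ are invented for illustration, with the two pure policies playing the role of the policies obtained under two different Lagrange multipliers in [24]. The expected action of the mixture is again nondecreasing in the buffer state:

import numpy as np

# two pure policies, each nondecreasing in the buffer state b (hypothetical values)
pi_1 = np.array([0, 0, 1, 1, 2, 2, 3, 3])
pi_2 = np.array([0, 1, 1, 2, 2, 3, 3, 3])
q = 0.4                                     # mixing probability chosen to meet the constraint

expected_action = q * pi_1 + (1 - q) * pi_2
assert np.all(np.diff(expected_action) >= 0)    # the mixture is nondecreasing in b
print("mixed policy expected action:", expected_action)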

Appendix 5.A.3 Proof of Lemma 5.1

Let us first assume that $v^{nk}(s)$ is an integer-convex function of $b^k$. Then, according to Proposition 2 of [25] and assumption A4 in Section 5.4.2, along with (5.26), it can be concluded that $Q_i(s,a^k)$ is an integer-convex function of $b^k - a^k$.

Thus, we only need to prove that $v^{nk}(s)$ is an integer-convex function of $b^k$, which can be done via backward induction. The value vector is updated by the value iteration algorithm as follows:

$$v^{nk}(s) = c(s,a^k) + \lambda_m^k d^k(s,a^k) + \beta \sum_{s'=1}^{|S|} P(s'\mid s,a^k)\, v^{(n-1)k}(s').$$

(5.29)

If $v^{(n-1)k}(s)$ is integer-convex in $b^k$, then under assumptions A2 and A3 from Section 5.4.2, $v^{nk}(s)$ is integer-convex in $b^k$ by the key lemma (Lemma 6.1) of [26]. The convergence of the value iteration algorithm is not affected by the initial values. Choosing $v^{0k}(s)$ to be an integer-convex function of $b^k$, it follows by induction that $v^{nk}(s)$ is integer-convex in $b^k$ for all $n$. This concludes the proof. □
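The preservation of integer convexity under expectations over independent noise, which is the role Proposition 2 of [25] plays above, can also be probed with a toy Python check; the convex sample function and the arrival distribution below are invented for illustration:

import numpy as np

def second_diff(f):
    """Second differences of a sampled function; all nonnegative iff integer-convex."""
    return f[2:] - 2 * f[1:-1] + f[:-2]

v = lambda x: np.abs(x) ** 1.5              # a hypothetical integer-convex function
w_vals = np.array([0, 1, 2])                # hypothetical arrival (noise) support
w_prob = np.array([0.5, 0.3, 0.2])          # hypothetical arrival distribution

x = np.arange(-10, 11)                      # grid of x = b^k - a^k values
Q = sum(pw * v(x + w) for w, pw in zip(w_vals, w_prob))     # Q(x) = E[v(x + w)]

assert np.all(second_diff(v(x)) >= -1e-12)  # v is integer-convex on the grid
assert np.all(second_diff(Q) >= -1e-12)     # ...and its expectation Q remains integer-convex
print("integer convexity is preserved under the expectation")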

References

1.  ITU-T Rec. H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), ITU-T and ISO/IEC JTC 1. Advanced Video Coding for Generic Audiovisual Services, June 2005.

2.  ISO/IEC 14496-2 (MPEG-4 Visual), ISO/IEC JTC 1. Coding of Audio-Visual Objects—Part 2: Visual, May 2004.

3.  J. Reichel, H. Schwarz, and M. Wien. Joint scalable video model JSVM-9. Technical Report N 8751, ISO/IEC JTC 1/SC 29/WG 11, Marrakech, Morocco, January 2007.

4.  T. Schierl and T. Wiegand. Mobile video transmission using scalable video coding. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1204–1217, September 2007.

5.   M. Wien, R. Cazoulat, A. Graffunder, A. Hutter, and P. Amon. Real-time system for adaptive video streaming based on SVC. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1227–1237, September 2007.

6.  M. Luby, T. Gasiba, T. Stockhammer, and M. Watson. Reliable multimedia download delivery in cellular broadcast networks. IEEE Transactions on Broadcasting, 53(1):235–246, March 2007.

7.  H. Schwarz, D. Marpe, and T. Wiegand. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103–1120, 2007.

8.  G. Liebl, H. Jenkac, T. Stockhammer, and C. Buchner. Radio link buffer management and scheduling for wireless video streaming. Springer Telecommunication Systems, 30(1–3):255–277, 2005.

9.  X. Zhu and B. Girod. Analysis of multi-user congestion control for video streaming over wireless networks. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Toronto, Ontario, Canada, July 2006.

10.  D. K. Kwon, M. Y. Shen, and C. C. J. Kuo. Rate control for H.264 video with enhanced rate and distortion models. IEEE Transactions on Circuits and Systems for Video Technology, 17(5):517–529, 2007.

11.  S. R. Mohan and T. E. S. Raghavan. An algorithm for discounted switching control stochastic games. OR Spektrum, 9(1):41–45, 1987.

12.  J. A. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, New York, 1997.

13.  O. J. Vrieze, S. H. Tijs, T. E. S. Raghavan, and J. A. Filar. A finite algorithm for the switching control stochastic game. OR Spektrum, 5(1):15–24, March 1983.

14.  IEEE Std 802.11-2007. IEEE standard for local and metropolitan area networks, Part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications, 2007.

15.  E. Altman. Constrained Markov Decision Processes. Chapman & Hall, London, U.K., 1999.

16.  H. Mansour, P. Nasiopoulos, and V. Krishnamurthy. Real-time joint rate and protection allocation for multi-user scalable video streaming. In Proceedings of IEEE Personal, Indoor, and Mobile Radio Communications (PIMRC), Cannes, France, pp. 1–5, September 2008.

17.  A. Farrokh and V. Krishnamurthy. Opportunistic scheduling for streaming users in HSDPA multimedia systems. IEEE Transactions on Multimedia, 8(4):844–855, August 2006.

18.  D. V. Djonin and V. Krishnamurthy. MIMO transmission control in fading channels–A constrained Markov decision process formulation with monotone randomized policies. IEEE Transactions on Signal Processing, 55(10):5069–5083, October 2007.

19.  D. V. Djonin and V. Krishnamurthy. Q-Learning algorithms for constrained Markov decision processes with randomized monotone policies: Application to MIMO transmission control. IEEE Transactions on Signal Processing, 55(5):2170–2181, May 2007.

20.  J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley-Interscience, Hoboken, NJ, 2003.

21.  V. Krishnamurthy and G. G. Yin. Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime. IEEE Transactions on Information Theory, 48(2):458–476, February 2002.

22.   H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, 2003.

23.  ISO/IEC JTC 1/SC 29/WG 11 N8964. JSVM-10 software, 2007.

24.  F. J. Beutler and K. W. Ross. Optimal policies for controlled Markov chains with a constraint. Journal of Mathematical Analysis and Applications, 112:236–252, November 1985.

25.  J. E. Smith and K. F. McCardle. Structural properties of stochastic dynamic programs. Operations Research, 50(5):796–809, September–October 2002.

26.  E. Altman, B. Gaujal, and A. Hordijk. Discrete-Event Control of Stochastic Networks: Multimodularity and Regularity. Springer, Berlin, Germany, 2003.

*  This chapter is based on the following publication: J. W. Huang, H. Mansour, and V. Krishnamurthy, A dynamical games approach to transmission rate adaptation in multimedia WLAN, IEEE Transactions on Signal Processing, 58(7):3635–3646, July 2010.

*  A block-fading channel model assumes that the channel state remains constant within a time slot and varies from one time slot to the next.

*  A Nash equilibrium [12] is a set of policies, one for each player, such that no player has an incentive to change its policy unilaterally. Players are in equilibrium if a unilateral change in policy by any one of them would lead that player to earn less than if it had kept its current policy.

*  If a function $f: A \times B \times C \to \mathbb{R}$ is supermodular in $(a, b)$ for any fixed $c \in C$, then for all $a' \geq a$ and $b' \geq b$, $f(a', b'; c) + f(a, b; c) \geq f(a', b; c) + f(a, b'; c)$ holds. $f$ is submodular if $-f$ is supermodular.
