8. Implementing a Fast DDOF Solver (3/4)


8.2 Modifying the Basic CR Solver

    // Elimination factors: a* removes the left neighbor, g* the right neighbor.
    float a_2 = -abc_3.x / abc_4.y;
    float g_2 = -abc_3.z / abc_2.y;
    float a_1 = -abc_2.x / abc_3.y;
    float g_1 = -abc_2.z / abc_1.y;
    float a0 = -abc0.x / abc_1.y;
    float g0 = -abc0.z / abc1.y;
    float a1 = -abc2.x / abc1.y;
    float g1 = -abc2.z / abc3.y;

    // Reduced coefficients (l1_abc_*) and right-hand sides (l1_x_*)
    // for the next level.
    float3 l1_abc_pp = float3(a_2 * abc_4.x,
        abc_3.y + a_2 * abc_4.z + g_2 * abc_2.x, g_2 * abc_2.z);
    float3 l1_x_pp = float3(x_3 + a_2 * x_4 + g_2 * x_2);

    float3 l1_abc_p = float3(a_1 * abc_3.x,
        abc_2.y + a_1 * abc_3.z + g_1 * abc_1.x, g_1 * abc_1.z);
    float3 l1_x_p = float3(x_2 + a_1 * x_3 + g_1 * x_1);

    float3 l1_abc_c = float3(a0 * abc_1.x,
        abc0.y + a0 * abc_1.z + g0 * abc1.x, g0 * abc1.z);
    float3 l1_x_c = float3(x0 + a0 * x_1 + g0 * x1);

    float3 l1_abc_n = float3(a1 * abc1.x,
        abc2.y + a1 * abc1.z + g1 * abc3.x, g1 * abc3.z);
    float3 l1_x_n = float3(x2 + a1 * x1 + g1 * x3);

    // Phase 2: Now solve for the intermediate-level data
    // needed to go up to full resolution.
    int3 i3l2_LoadPosC = int3(input.Pos.x * 0.25, input.Pos.y, 0);

    float3 l2_y0 = txYn.Load(i3l2_LoadPosC).xyz;
    float3 l2_y1 = txYn.Load(i3l2_LoadPosC, int2(1, 0)).xyz;
    float3 l2_y_1 = txYn.Load(i3l2_LoadPosC, int2(-1, 0)).xyz;
    float3 l2_y_2 = txYn.Load(i3l2_LoadPosC, int2(-2, 0)).xyz;

    float3 l1_y_c = l2_y0;
    float3 l1_y_p = (l1_x_p - l1_abc_p.x * l2_y_1
        - l1_abc_p.z * l2_y0) / l1_abc_p.y;
    float3 l1_y_pp = l2_y_1;
    float3 l1_y_n = (l1_x_n - l1_abc_n.x * l2_y0
        - l1_abc_n.z * l2_y1) / l1_abc_n.y;


    // Phase 3: Now use the intermediate solutions to solve
    // for the full result.
    float3 fRes3 = l2_y0;
    float3 fRes2 = (x_1 - abc_1.x * l1_y_p
        - abc_1.z * l1_y_c) / abc_1.y; // y_1
    float3 fRes1 = l1_y_p; // y_2
    float3 fRes0 = (x_3 - abc_3.x * l1_y_pp
        - abc_3.z * l1_y_p) / abc_3.y; // y_3

    float3 f3Res[4] = {fRes0, fRes1, fRes2, fRes3};

    return (float4(f3Res[uint(input.Pos.x) & 3], 0.0));
}

Listing 8.2. Final stage of the solver.
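Listing 8.2 folds two reduction levels into a single pass, which makes the underlying pattern hard to see. The CPU-side Python sketch below shows plain, one-level-at-a-time cyclic reduction using the same elimination factors (`alpha` and `gamma` here, `a` and `g` in the listing) and the same back-substitution formula. It is an illustration only: the function names, the `(a, b, c, x)` row layout, the restriction to 2^k - 1 unknowns, and the reduction all the way down to a single unknown are assumptions made for brevity, not the gem's shader code.

```python
def reduce_row(prev, cur, nxt):
    """Fold the two neighbouring equations into `cur`.

    Each row is a tuple (a, b, c, x) for a*y[i-1] + b*y[i] + c*y[i+1] = x.
    The returned row couples y[i] only to y[i-2] and y[i+2].
    """
    alpha = -cur[0] / prev[1]   # cancels the y[i-1] term
    gamma = -cur[2] / nxt[1]    # cancels the y[i+1] term
    return (
        alpha * prev[0],
        cur[1] + alpha * prev[2] + gamma * nxt[0],
        gamma * nxt[2],
        cur[3] + alpha * prev[3] + gamma * nxt[3],
    )


def cr_solve(rows):
    """Solve a tridiagonal system with 2**k - 1 unknowns by cyclic reduction.

    Assumes rows[0][0] == 0 and rows[-1][2] == 0 (no neighbours past the ends).
    """
    n = len(rows)
    if n == 1:
        a, b, c, x = rows[0]
        return [x / b]
    # Reduction: keep every odd-indexed equation, eliminating its neighbours.
    coarse = [reduce_row(rows[i - 1], rows[i], rows[i + 1])
              for i in range(1, n, 2)]
    y = [0.0] * n
    y[1::2] = cr_solve(coarse)
    # Back-substitution: even-indexed unknowns follow from solved neighbours.
    for i in range(0, n, 2):
        a, b, c, x = rows[i]
        left = y[i - 1] if i > 0 else 0.0
        right = y[i + 1] if i + 1 < n else 0.0
        y[i] = (x - a * left - c * right) / b
    return y


# Usage: a diagonally dominant 7x7 system built from a known solution.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
rows = []
for i, yi in enumerate(y_true):
    a = -1.0 if i > 0 else 0.0
    c = -1.0 if i < len(y_true) - 1 else 0.0
    left = y_true[i - 1] if i > 0 else 0.0
    right = y_true[i + 1] if i < len(y_true) - 1 else 0.0
    rows.append((a, 4.0, c, a * left + 4.0 * yi + c * right))
y = cr_solve(rows)
```

The back-substitution loop is the same expression as `fRes2` and `fRes0` in Phase 3: the right-hand side minus the two neighbor terms, divided by the diagonal coefficient.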

4. Stop at two or three unknowns instead of reducing all the way down to one unknown. Because a modern GPU runs thousands of hardware threads, this keeps far more of them busy than reducing to a single unknown would. Cramer’s rule is used to solve the resulting 2 × 2 or 3 × 3 equation systems.

5. Optionally pack the evolving x values and the a, b, and c coefficients (named x and abc in Listing 8.3) into a single four-channel uint32 texture to save more memory and to gain speed, since the number of texture operations is cut by a factor of two. This packing uses Shader Model 5 instructions (see Listing 8.3) and relies on the assumption that the x values can be represented as 16-bit floating-point values. It further assumes that the full mantissa of the 32-bit floating-point values is not needed for storing a, b, and c, and it steals the six lowest mantissa bits of each one to store a 16-bit x channel.
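For item 4, once only three unknowns remain, the system is small enough to solve in closed form. A minimal Python sketch of Cramer's rule for a 3 × 3 system follows. The helper names `det3` and `cramer3` and the dense row-major matrix layout are illustrative assumptions; a shader would evaluate the same determinants per pixel on its own data layout.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cramer3(m, rhs):
    """Solve m * y = rhs with Cramer's rule.

    y[k] = det(m with column k replaced by rhs) / det(m)
    """
    d = det3(m)
    if d == 0.0:
        raise ValueError("singular system")
    solution = []
    for k in range(3):
        mk = [[rhs[r] if col == k else m[r][col] for col in range(3)]
              for r in range(3)]
        solution.append(det3(mk) / d)
    return solution


# Usage: a 3x3 tridiagonal system whose exact solution is (1, 2, 3).
m = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
y = cramer3(m, [2.0, 4.0, 10.0])
```

With only two unknowns left, the same rule reduces to two 2 × 2 determinants divided by a third.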

// Pack six floats into a uint4 variable. This steals six mantissa bits
// from each of the three 32-bit FP values that hold abc to store x.z
// as a 16-bit float; x.x and x.y share the w channel.
uint4 pack(float3 abc, float3 x)
{
    uint z = f32tof16(x.z);
    return (uint4(((asuint(abc.x) & 0xFFFFFFC0) | (z & 0x3F)),
                  ((asuint(abc.y) & 0xFFFFFFC0) | ((z >> 6) & 0x3F)),
                  ((asuint(abc.z) & 0xFFFFFFC0) | ((z >> 12) & 0x3F)),
                  (f32tof16(x.x) + (f32tof16(x.y) << 16))));
}

8.3Results 131

struct ABC_X
{
    float3 abc;
    float3 x;
};

ABC_X unpack(uint4 d)
{
    ABC_X res;
    res.abc = asfloat(d.xyz & 0xFFFFFFC0);
    res.x.xy = float2(f16tof32(d.w), f16tof32(d.w >> 16));
    res.x.z = f16tof32(((d.x & 0x3F) + ((d.y & 0x3F) << 6) +
                        ((d.z & 0x3F) << 12)));
    return (res);
}

Listing 8.3. Packing/unpacking all solver variables into/from one rgba32_uint value.
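To see how much precision the packing in Listing 8.3 gives up, the bit layout can be emulated on the CPU. The Python sketch below mirrors `pack` and `unpack` using the standard `struct` module. The helper names are borrowed from the HLSL intrinsics, but these are stand-ins written for this example: in particular, Python's half-float conversion rounds to nearest, which is an assumption and may not match a given GPU's `f32tof16` bit for bit.

```python
import struct

MANTISSA_MASK = 0xFFFFFFC0  # clears the six lowest mantissa bits


def asuint(f):
    """Bit pattern of a 32-bit float as an unsigned integer."""
    return struct.unpack('<I', struct.pack('<f', f))[0]


def asfloat(u):
    """32-bit float with the given bit pattern."""
    return struct.unpack('<f', struct.pack('<I', u))[0]


def f32tof16(f):
    """Bit pattern of the value converted to a 16-bit float."""
    return struct.unpack('<H', struct.pack('<e', f))[0]


def f16tof32(h):
    """Value of the 16-bit float stored in the low 16 bits of h."""
    return struct.unpack('<e', struct.pack('<H', h & 0xFFFF))[0]


def pack(abc, x):
    z = f32tof16(x[2])
    return (
        (asuint(abc[0]) & MANTISSA_MASK) | (z & 0x3F),
        (asuint(abc[1]) & MANTISSA_MASK) | ((z >> 6) & 0x3F),
        (asuint(abc[2]) & MANTISSA_MASK) | ((z >> 12) & 0x3F),
        f32tof16(x[0]) + (f32tof16(x[1]) << 16),
    )


def unpack(d):
    abc = tuple(asfloat(v & MANTISSA_MASK) for v in d[:3])
    z = (d[0] & 0x3F) + ((d[1] & 0x3F) << 6) + ((d[2] & 0x3F) << 12)
    x = (f16tof32(d[3]), f16tof32(d[3] >> 16), f16tof32(z))
    return abc, x


# Values with short mantissas survive the round trip unchanged.
abc, x = unpack(pack((-0.25, 1.5, -0.25), (0.5, 0.75, 0.125)))

# A value that uses the full mantissa loses at most its six lowest bits.
lossy_abc, _ = unpack(pack((1.0 / 3.0, 1.0, 1.0), (0.0, 0.0, 0.0)))
relative_error = abs(lossy_abc[0] - 1.0 / 3.0) / (1.0 / 3.0)
```

Clearing six of the 23 mantissa bits leaves 17, so the relative error in each of a, b, and c stays below 2^-17 (about 7.6e-6), while x is limited to half-float precision.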

8.3 Results

Table 8.1 shows how various implementations of the DDOF solver perform at various resolutions and how much memory each solver consumes. These performance numbers, measured on a system with an AMD HD 5870 GPU with 1 GB of video memory, show that the improved solver presented in this gem outperforms the traditional solvers in both running time and memory requirements.

In the settings used in these tests, the packing shown in Listing 8.3 does not produce any obvious visual differences (see Figure 8.3). Small differences are revealed in Figure 8.4, which shows the amplified absolute difference between the images in Figure 8.3. If these differences stay small enough, then packing should be used in DirectX 11 rendering paths in games that implement this gem.
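The Figure 8.4 caption describes the difference image as the absolute difference multiplied by 255 and inverted. A small Python sketch of that computation is shown here. The function name, the nested-list image layout with channel values in [0, 1], and the clamp to 1 before inverting are assumptions of this example rather than details given in the text.

```python
def amplified_difference(img_a, img_b, gain=255.0):
    """Return 1 - min(1, gain * |a - b|) per value.

    Identical pixels come out white (1.0); larger differences get darker.
    """
    return [
        [1.0 - min(1.0, gain * abs(a - b)) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


# Usage: one row with an identical pixel and one that differs by 1/510.
unpacked = [[0.5, 0.25]]
packed = [[0.5, 0.25 + 1.0 / 510.0]]
diff = amplified_difference(unpacked, packed)
```

With a gain of 255, a difference of one 8-bit quantization step (1/255) already saturates to black, which is why even tiny packing errors become visible in such an image.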

Figure 8.3. A comparison between images for which (a) packing was not used and (b) the packing shown in Listing 8.3 was used.

Resolution     Solver                          Running Time (…)   Memory (~M…)
1280 × 1024    Standard solver                 2.46               …
1280 × 1024    Standard solver + packing       1.97               …
1280 × 1024    …-to-one reduction              1.92               …
1280 × 1024    …-to-one reduction + packing    1.87               …
1600 × 1200    Standard solver                 3.66               …
1600 × 1200    Standard solver + packing       2.93               …
1600 × 1200    …-to-one reduction              2.87               …
1600 × 1200    …-to-one reduction + packing    2.75               …
1920 × 1200    Standard solver                 4.31               …
1920 × 1200    Standard solver + packing       3.43               …
1920 × 1200    …-to-one reduction              3.36               …
1920 × 1200    …-to-one reduction + packing    3.23               …
2560 × 1600    Standard solver                 7.48               …
2560 × 1600    Standard solver + packing       5.97               …
2560 × 1600    …-to-one reduction              5.80               …
2560 × 1600    …-to-one reduction + packing    5.59               …

Table 8.1. Comparison of solver efficiency.

Figure 8.4. Absolute difference between the images in Figure 8.3, multiplied by 255 and inverted.
