8.2 Modifying the Basic CR Solver
float a_2 = -abc_3.x / abc_4.y;
float g_2 = -abc_3.z / abc_2.y;
float a_1 = -abc_2.x / abc_3.y;
float g_1 = -abc_2.z / abc_1.y;
float a0 = -abc0.x / abc_1.y;
float g0 = -abc0.z / abc1.y;
float a1 = -abc2.x / abc1.y;
float g1 = -abc2.z / abc3.y;
float3 l1_abc_pp = float3(a_2 * abc_4.x,
abc_3.y + a_2 * abc_4.z + g_2 * abc_2.x, g_2 * abc_2.z);
float3 l1_x_pp = float3(x_3 + a_2 * x_4 + g_2 * x_2 );
float3 l1_abc_p = float3(a_1 * abc_3.x,
abc_2.y + a_1 * abc_3.z + g_1 * abc_1.x, g_1 * abc_1.z);
float3 l1_x_p = float3(x_2 + a_1 * x_3 + g_1 * x_1);
float3 l1_abc_c = float3(a0 * abc_1.x,
abc0.y + a0 * abc_1.z + g0 * abc1.x, g0 * abc1.z);
float3 l1_x_c = float3(x0 + a0 * x_1 + g0 * x1);
float3 l1_abc_n = float3(a1 * abc1.x,
abc2.y + a1 * abc1.z + g1 * abc3.x, g1 * abc3.z);
float3 l1_x_n = float3(x2 + a1 * x1 + g1 * x3);
// Phase 2: Now solve for the intermediate-level
// data we need to compute to go up to full resolution.
int3 i3l2_LoadPosC = int3(input.Pos.x * 0.25, input.Pos.y, 0);
float3 l2_y0 = txYn.Load(i3l2_LoadPosC).xyz;
float3 l2_y1 = txYn.Load(i3l2_LoadPosC, int2(1, 0)).xyz;
float3 l2_y_1 = txYn.Load(i3l2_LoadPosC, int2(-1, 0)).xyz;
float3 l2_y_2 = txYn.Load(i3l2_LoadPosC, int2(-2, 0)).xyz;
float3 l1_y_c = l2_y0;
float3 l1_y_p = (l1_x_p - l1_abc_p.x * l2_y_1
- l1_abc_p.z * l2_y0) / l1_abc_p.y;
float3 l1_y_pp = l2_y_1;
float3 l1_y_n = (l1_x_n - l1_abc_n.x * l2_y0
- l1_abc_n.z * l2_y1) / l1_abc_n.y;
// Phase 3: Now use the intermediate solutions to solve
// for the full result.
float3 fRes3 = l2_y0;
float3 fRes2 = (x_1 - abc_1.x * l1_y_p
- abc_1.z * l1_y_c ) / abc_1.y; // y_1
float3 fRes1 = l1_y_p; // y_2
float3 fRes0 = (x_3 - abc_3.x * l1_y_pp
- abc_3.z * l1_y_p ) / abc_3.y; // y_3
float3 f3Res[4] = {fRes0, fRes1, fRes2, fRes3};
return (float4(f3Res[uint(input.Pos.x) & 3], 0.0));
}
Listing 8.2. Final stage of the solver.
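The elimination pattern in Listing 8.2 can be checked on the CPU. The following C++ sketch is a scalar, single-channel illustration and not part of the gem's shader code; the names Row, reduce, and back_substitute are hypothetical. It assumes each row encodes a*y[i-1] + b*y[i] + c*y[i+1] = x, matching how the listing uses the .x, .y, and .z components of abc.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One row of a tridiagonal system: a*y[i-1] + b*y[i] + c*y[i+1] = x.
// Hypothetical CPU-side stand-in for the shader's abc/x values.
struct Row { double a, b, c, x; };

// Eliminate both neighbours of `cur`, following the a0/g0 pattern of
// Listing 8.2. The returned row couples y[i] to y[i-2] and y[i+2] only.
Row reduce(const Row& prev, const Row& cur, const Row& next) {
    double alpha = -cur.a / prev.b;  // cancels the y[i-1] term
    double gamma = -cur.c / next.b;  // cancels the y[i+1] term
    return { alpha * prev.a,
             cur.b + alpha * prev.c + gamma * next.a,
             gamma * next.c,
             cur.x + alpha * prev.x + gamma * next.x };
}

// Recover y[i] once both neighbours coupled by row `r` are known,
// following the pattern used in Phase 2 and Phase 3 of the listing.
double back_substitute(const Row& r, double y_prev, double y_next) {
    return (r.x - r.a * y_prev - r.c * y_next) / r.b;
}
```

Applying reduce a second time to three rows that are themselves reduced couples y[i] to y[i-4] and y[i+4].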
4. Stop at two or three unknowns instead of reducing it all down to just one
unknown. Given that the number of hardware threads in a modern GPU is in the
thousands, this actually makes sense because it keeps a lot more threads of a
modern GPU busy compared to going down to just one unknown. Cramer’s
rule is used to solve the resulting $2 \times 2$ or $3 \times 3$ equation systems.
5. Optionally pack the evolving $y_i$ and the $a_i$, $b_i$, and $c_i$ into just one
four-channel uint32 texture to further save memory and to gain speed since the
number of texture operations is cut down by a factor of two. This packing uses
Shader Model 5 instructions (see Listing 8.3) and relies on the assumption
that the $x_i$ values can be represented as 16-bit floating-point values. It further
assumes that one doesn’t need the full mantissa of the 32-bit floating-point
values for storing $a_i$, $b_i$, and $c_i$, and it steals the six lowest mantissa bits of
each one to store a 16-bit $x_i$ channel.
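For the small systems mentioned in item 4, Cramer's rule can be written out directly. The C++ sketch below is an illustration under the assumption that the remaining rows are still tridiagonal, with the first row's a and the last row's c unused; Row, solve2, and solve3 are hypothetical names, and no guard against a zero determinant is included.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One row of a tridiagonal system: a*y[i-1] + b*y[i] + c*y[i+1] = x.
struct Row { double a, b, c, x; };

// Two unknowns: | b0 c0 | |y0|   |x0|
//               | a1 b1 | |y1| = |x1|
std::array<double, 2> solve2(const Row& r0, const Row& r1) {
    double det = r0.b * r1.b - r0.c * r1.a;
    return { (r0.x * r1.b - r0.c * r1.x) / det,
             (r0.b * r1.x - r1.a * r0.x) / det };
}

// Three unknowns: | b0 c0 0  |
//                 | a1 b1 c1 |
//                 | 0  a2 b2 |
// Each y is the determinant with one column replaced by x, over det.
std::array<double, 3> solve3(const Row& r0, const Row& r1, const Row& r2) {
    double m   = r1.b * r2.b - r1.c * r2.a;  // lower-right 2x2 minor
    double n   = r1.x * r2.b - r1.c * r2.x;  // same minor, x in column 1
    double det = r0.b * m - r0.c * r1.a * r2.b;
    double y0  = (r0.x * m - r0.c * n) / det;
    double y1  = (r0.b * n - r0.x * r1.a * r2.b) / det;
    double y2  = (r0.b * (r1.b * r2.x - r1.x * r2.a)
                  - r0.c * r1.a * r2.x
                  + r0.x * r1.a * r2.a) / det;
    return { y0, y1, y2 };
}
```

Stopping at this size trades a few extra multiplications per thread for fewer reduction passes, which fits the thread-occupancy argument of item 4.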
// Pack six floats into a uint4 variable. This steals six mantissa bits
// from each of the three 32-bit FP values that hold abc to store x.z;
// x.x and x.y go into the w channel as two 16-bit floats.
uint4 pack(float3 abc, float3 x)
{
uint z = f32tof16(x.z);
return (uint4(((asuint(abc.x) & 0xFFFFFFC0) | (z & 0x3F)),
((asuint(abc.y) & 0xFFFFFFC0) | ((z >> 6) & 0x3F)),
((asuint(abc.z) & 0xFFFFFFC0) | ((z >> 12) & 0x3F)),
(f32tof16(x.x) + (f32tof16(x.y) << 16))));
}
struct ABC_X
{
float3 abc;
float3 x;
};
ABC_X unpack(uint4 d)
{
ABC_X res;
res.abc = asfloat(d.xyz & 0xFFFFFFC0);
res.x.xy = float2(f16tof32(d.w), f16tof32(d.w >> 16));
res.x.z = f16tof32(((d.x & 0x3F) + ((d.y & 0x3F) << 6) +
((d.z & 0x3F) << 12)));
return (res);
}
Listing 8.3. Packing/unpacking all solver variables into/from one rgba32_uint value.
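The mantissa-stealing part of Listing 8.3 can be tried on the CPU. The C++ sketch below is an illustration, not the gem's code: as_uint and as_float stand in for HLSL's asuint and asfloat, the 16-bit payload is assumed to be an already converted half-float bit pattern (the f32tof16 step is left out), and Packed, pack3, unpack_payload, and unpack_float are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit reinterpretation helpers (stand-ins for HLSL asuint/asfloat).
static uint32_t as_uint(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
static float as_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

struct Packed { uint32_t x, y, z; };

// Hide a 16-bit payload in the six lowest mantissa bits of three floats,
// six bits per float (18 bits available, 16 used).
Packed pack3(float a, float b, float c, uint16_t payload) {
    uint32_t p = payload;
    return { (as_uint(a) & 0xFFFFFFC0u) | (p & 0x3Fu),
             (as_uint(b) & 0xFFFFFFC0u) | ((p >> 6) & 0x3Fu),
             (as_uint(c) & 0xFFFFFFC0u) | ((p >> 12) & 0x3Fu) };
}

// Reassemble the payload from the stolen bits.
uint16_t unpack_payload(const Packed& d) {
    return static_cast<uint16_t>((d.x & 0x3Fu)
                               | ((d.y & 0x3Fu) << 6)
                               | ((d.z & 0x3Fu) << 12));
}

// Recover a coefficient with its six lowest mantissa bits cleared.
float unpack_float(uint32_t channel) { return as_float(channel & 0xFFFFFFC0u); }
```

Clearing six of the 23 stored mantissa bits changes a normalized float by fewer than 64 units in the last place, so the relative error of each unpacked coefficient stays below 2^-17. Whether that is acceptable depends on the input data, which is why the text treats packing as optional.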
8.3 Results
Table 8.1 shows how various implementations of the DDOF solver perform at
various resolutions and how much memory each solver consumes. These performance
numbers (measured on a system with an AMD HD 5870 GPU with 1 GB of video
memory) show that the improved solver presented in this gem outperforms the
traditional solvers in both running time and memory requirements. At
2560 × 1600, for example, the four-to-one reduction with packing takes 5.59 ms
and about 125 MB, compared with 7.48 ms and about 281 MB for the standard solver.
In the settings used in these tests, the packing shown in Listing 8.3 does not
produce any obvious visual differences (see Figure 8.3). Small differences are
revealed in Figure 8.4, which shows the amplified absolute difference between
the images in Figure 8.3. If these differences stay small enough, then packing
should be used in DirectX 11 rendering paths in games that implement this gem.
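The kind of amplification used for Figure 8.4 (absolute difference, multiplied by 255, then inverted) can be sketched per channel as follows. This C++ illustration assumes channel values normalized to [0, 1] and clamps the amplified difference to that range; both are assumptions, and amplified_inverted_diff is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// a, b: the same channel of the same pixel from the two renders,
// assumed to lie in [0, 1]. Identical inputs map to white (1.0);
// any difference of 1/255 or more maps to black (0.0).
float amplified_inverted_diff(float a, float b) {
    float amplified = std::min(255.0f * std::fabs(a - b), 1.0f);
    return 1.0f - amplified;
}
```

With this scaling, a difference of half an 8-bit quantization step already shows up as mid-gray, so the resulting image exaggerates errors that stay below what an 8-bit render target can display.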
Figure 8.3. A comparison between images for which (a) packing was not used and (b) the packing shown in Listing 8.3 was used.

Resolution     Solver                             Running Time (ms)   Memory (~MB)
1280 × 1024    Standard solver                    2.46                 90
1280 × 1024    Standard solver + packing          1.97                 70
1280 × 1024    Four-to-one reduction              1.92                 50
1280 × 1024    Four-to-one reduction + packing    1.87                 40
1600 × 1200    Standard solver                    3.66                132
1600 × 1200    Standard solver + packing          2.93                102
1600 × 1200    Four-to-one reduction              2.87                 73
1600 × 1200    Four-to-one reduction + packing    2.75                 58
1920 × 1200    Standard solver                    4.31                158
1920 × 1200    Standard solver + packing          3.43                122
1920 × 1200    Four-to-one reduction              3.36                 88
1920 × 1200    Four-to-one reduction + packing    3.23                 70
2560 × 1600    Standard solver                    7.48                281
2560 × 1600    Standard solver + packing          5.97                219
2560 × 1600    Four-to-one reduction              5.80                156
2560 × 1600    Four-to-one reduction + packing    5.59                125
Table 8.1. Comparison of solver efficiency.
Figure 8.4. Absolute difference between the images in Figure 8.3, multiplied by 255 and inverted.