Wednesday, April 25, 2018

Comparing ispc_texcomp alpha performance vs. my encoder

Most papers and encoders focus on opaque performance with BC7, but alpha textures are very important too. BC7's alpha support is somewhat weaker than its opaque support, especially with alpha signals that are uncorrelated with RGB. It has only a single 2-subset mode that can handle alpha (mode 7), with limited color precision (5555.1 endpoints with 2-bit indices).

This test exercises each codec in a different way than my previous opaque-only tests. Modes 0-3 have no alpha support, so they're useless with transparent blocks.

I've finished the alpha path in my new non-RDO BC7 encoder. Results on a 4k test texture containing random 4x4 blocks (both in RGB and A) picked from thousands of textures in my corpus:

My encoder - higher quality settings:
Time: 42.4 secs
RGB: 46.058 Avg. PSNR
A: 44.521 PSNR

My encoder - lower quality settings:
Time: 21.9 secs
RGB: 45.986
A: 44.539

My encoder - faster settings:
Time: 12.8 secs
RGB: 45.850
A: 43.949

ispc_texcomp slow:
Time: 155.4 secs
RGB: 45.826
A: 44.216

ispc_texcomp basic: 
Time: 45.6 secs
RGB: 45.820
A: 44.188

ispc_texcomp fast:
Time: 23.3 secs
RGB: 45.647
A: 44.307

This was with linear colorspace metrics, and benchmarking was on a single thread. The RGB/A stats are PSNR (higher is better). (In case you're wondering, SSIM and PSNR are highly correlated with block compression, so I usually use PSNR until I start doing whole-texture stuff with RDO.)

I'm now ~2x faster at higher quality vs. the fastest CPU BC7 encoder I'm aware of, and there are several easy optimizations left. And this is before enabling perceptual metrics.

BC7 mode utilization comparison of three encoders

I've been doing some benchmarking today to see where I stand with raw (non-RDO) BC7 encoding. Depending on the profile, I'm up to 2.26x faster at higher average quality using linear colorspace metrics vs. ispc_texcomp.

BC7 mode utilization, ispc_texcomp's basic profile, with my encoder set to a roughly similar profile:


Here's ispc_texcomp's slow profile, my encoder at a higher quality profile, and DirectXTex with BC_FLAGS_USE_3SUBSETS (with the pbit bug fixed so this is a fair comparison).


After modifying DirectXTex to try mode 6 first, it gets slightly faster and uses mode 6 much more often:


This was a multithreaded test. The timings are the overall amount of CPU time utilized only for encoding across all threads.

My encoder favors mode 6 on grayscale inputs (one large texture in the corpus is grayscale), and mode 6 is always the first mode checked for opaque blocks, so it wins on simple blocks. Mode 6 has very good endpoint precision (7777.1) and large 4-bit indices, so it's fairly good even on complex blocks.

GPU texture compression error metrics

While working on an encoder I conduct a lot of experiments (probably thousands over time) to improve it. To check for regressions or silly bugs, you must use some sort of error metric; otherwise it'll be impossible to make forward progress just by examining the output with your eyes.

Here's what I commonly use:

Total time: 0.124466, encode-only CPU time: 3.288052
Compressed size: 360041, 7.325053 bpp

BC7 Mode histogram:
6 22 2 0 3809 2754 17936 0

RGB Total   Error: Max:  53, Mean: 5.220, MSE: 21.670, RMSE: 4.655, PSNR: 34.772, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

RGB Average Error: Max:  53, Mean: 1.740, MSE: 7.223, RMSE: 2.688, PSNR: 39.543, SSIM: 0.985854, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Luma        Error: Max:  35, Mean: 1.123, MSE: 3.048, RMSE: 1.746, PSNR: 43.290, SSIM: 0.993128, CRCA: 0x351EFAEF CRCB: 0xA29288A7

Red         Error: Max:  52, Mean: 1.766, MSE: 7.544, RMSE: 2.747, PSNR: 39.355, SSIM: 0.987541, CRCA: 0xC673D5EA CRCB: 0x8623AA59

Green       Error: Max:  40, Mean: 1.553, MSE: 5.462, RMSE: 2.337, PSNR: 40.758, SSIM: 0.989642, CRCA: 0x2BF4294F CRCB: 0x418E5EE1

Blue        Error: Max:  53, Mean: 1.900, MSE: 8.664, RMSE: 2.944, PSNR: 38.753, SSIM: 0.980377, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Alpha       Error: Max:  31, Mean: 1.039, MSE: 3.308, RMSE: 1.819, PSNR: 42.935, SSIM: 0.976243, CRCA: 0x524049DA CRCB: 0x7492A642

I report some timings (two timings because it may be a threaded encode), the compressed size of the data (using LZMA) to get a feel for the encoder's output entropy, the mode histogram (for formats that have modes), and then a bunch of metrics.

I report a bunch of variations of PSNR (RGB Total and RGB Average) because there are no rock-solid standards for this, and every author or tool seems to do it a different way. I sometimes also use RGBA PSNR to compare my results against other tools.
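
For reference, PSNR here is computed from MSE the usual way for 8-bit data, and the two RGB variants above differ only in the divisor (you can verify from the stats themselves: the RGB Total MSE is exactly 3x the RGB Average MSE). A minimal C++ sketch of my reading of these conventions (not any particular tool's code):

#include <cmath>

// PSNR for 8-bit data: 10*log10(peak^2 / MSE).
double psnr_from_mse(double mse)
{
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}

// "RGB Total" style: divide the summed per-channel squared error by the
// number of pixels. "RGB Average" style: also divide by the 3 channels.
double rgb_total_mse(double sum_sq_err, double num_pixels)
{
    return sum_sq_err / num_pixels;
}

double rgb_avg_mse(double sum_sq_err, double num_pixels)
{
    return sum_sq_err / (num_pixels * 3.0);
}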

Sometimes, an encoder bug will subtly break just a few blocks. That's where the Max errors are really handy.

Luma error is what I pay attention to during perceptual encoding tests, and RGB Average for non-perceptual tests. SSIM and PSNR are highly correlated in my experience for block compression work, which I'll show in a future blog post.

I also use PSNR per second or SSIM per second metrics (or some variation).

The CRC's are there to detect when the output (or input) data is exactly the same across runs.

I have a tool which can compare two encoders across a large corpus of test textures, which is really handy for finding regressions.

I also test with large textures containing random blocks gathered from thousands of test textures. For example (blogger.com has resampled it, here's the original 4K PNG):


These corpus textures can be generated using crunch in the corpus_gen mode.

I've found that testing like this is a reliable way of gauging the overall strength of one encoder vs. another. Graphics devs will instantly say "but wait, why aren't you minimizing block artifacts by looking at adjacent blocks!" That turns block compression into a global optimization problem (like PVRTC), and it's already hard enough to do in a practical amount of CPU time. Also, with BC7 the error is already pretty low (45-60 dB), and BC1-style block artifacts in a properly written BC7 encoder at high quality settings are uncommon.

Monday, April 23, 2018

LZHAM decompressor vectorization

This is just a sketch of an idea. It might not pan out; there are too many details to know until I try it. LZHAM is an old codec now, but maybe I could stretch it a bit further before moving on to a new design.

It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using ispc), shouldn't be that hard. The compressed stream will need some hints so the decoder knows where each lane needs to start decoding from.

So for 8 parallel decodes, at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on say 1-2K bytes per lane, then sync the statistics after 8 1-2K blocks are processed. Then repeat the process.
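
Here's a rough C++ sketch of that outer loop (all names here are hypothetical; the SPMD decode itself would be the ispc kernel):

#include <cstring>

struct ModelStats { /* arithmetic + Huffman model state */ };

const int kNumLanes = 8;
const int kSegmentSize = 2048; // 1-2K bytes decoded per lane per round

void decode_block_rounds(ModelStats &shared_stats)
{
    ModelStats lane_stats[kNumLanes];

    // for each round of kNumLanes segments in the block:
    {
        // 1. Clone the shared adaptive statistics into each lane.
        for (int i = 0; i < kNumLanes; i++)
            memcpy(&lane_stats[i], &shared_stats, sizeof(ModelStats));

        // 2. Run the SPMD decoder: all lanes decode kSegmentSize output
        //    bytes in lockstep (the ispc kernel would go here).

        // 3. Re-sync the statistics before the next round, e.g. by merging
        //    the per-lane model updates back into shared_stats.
    }
}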

Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, lane 2 can't access lane 1/0's, etc. That'll definitely lower the compression ratio. However, it should be possible for the compressor to simulate lane 0's decompressor and predict which bytes are available in lane 1 at each point in the compressed data stream.

The various if() statements in the decoder's inner loop won't be particularly coherent, which would impact lane efficiency. But even a 2-3x improvement in decoding speed would be pretty nice.

DirectXTex BC7 3 subset weirdness

I brought Microsoft's DirectXTex project (latest code) into my test project, to see how it fares vs. ispc_texcomp and my encoder. Unfortunately, it appears broken. Across 31 test images (kodim and others):

DirectXTex BC7:

No flags (0): 9972.6 secs 44.41 dB

BC_FLAGS_USE_3SUBSETS: 13534.6 secs 44.25 dB

This is wrong. Quality should go up with 3 subset modes enabled, not down. I'm tempted to go figure out what's wrong in there myself, but it's a lot of code.

By comparison, my ispc encoder gets 477.1 secs 46.72 dB (using high quality settings). ispc_texcomp is in the same ballpark. With 3 subset modes disabled, I get 45.96 dB (as expected - the 3 subset modes are useful!).

I verified that the flag is doing something. Here's the BC7 mode histogram for kodim01 with the flags parameter to DirectX::D3DXEncodeBC7() set to 0:

0 17968 0 1752 0 0 4856 0

With 3 subsets enabled:

1435 16647 26 1675 0 0 4793 0

The source looks nice and readable, and as a library it was dead-simple to get it building and linked against my stuff. But it doesn't appear to be a production-ready encoder; it's more like a sample.

I'm calling it from multiple threads using OpenMP (it's too slow to benchmark otherwise). It makes my 20 core Xeon workstation crawl for a while - it's that slow.

Also, there is no need to disable 3 subset modes in a properly written encoder. Some timings with my encoder: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled). This was on a 20 core Xeon workstation/40 threads/31 images (kodim+others). 

The extra cost of a 3 subset mode isn't a big deal (3 endpoint optimizations) once you've estimated which partition(s) to examine more deeply. Partition estimation is fast and simple with SIMD, and a nice property of 3 subset modes is that the # of pixels fed to the endpoint optimizer per subset is rather low (enabling 3-subset specific optimizations). If your pbit handling is correct these modes are quite valuable.
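
For illustration, here's one cheap way to estimate a partition's quality before committing to full endpoint optimization (a hypothetical sketch, not necessarily what my encoder does): score each candidate partition by the total squared distance of each pixel to its subset's mean color, then fully encode only the best-scoring partition(s).

#include <cstdint>

struct PixelF { float r, g, b; };

// Score a candidate partition (up to 3 subsets) by summing each pixel's
// squared distance to its subset's mean color. Lower is likely better.
float estimate_partition_error(const PixelF px[16], const uint8_t subset_of[16])
{
    PixelF mean[3] = {};
    int count[3] = {};

    for (int i = 0; i < 16; i++)
    {
        int s = subset_of[i];
        mean[s].r += px[i].r; mean[s].g += px[i].g; mean[s].b += px[i].b;
        count[s]++;
    }

    for (int s = 0; s < 3; s++)
        if (count[s])
        {
            mean[s].r /= count[s]; mean[s].g /= count[s]; mean[s].b /= count[s];
        }

    float total_err = 0.0f;
    for (int i = 0; i < 16; i++)
    {
        int s = subset_of[i];
        float dr = px[i].r - mean[s].r;
        float dg = px[i].g - mean[s].g;
        float db = px[i].b - mean[s].b;
        total_err += dr * dr + dg * dg + db * db;
    }
    return total_err;
}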


Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of the RGB(A) endpoint components are forced to the same value, so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits that can be used for other things that decrease error more. Some modes support a unique pbit per endpoint, and some have only 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly in most (if not all) modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tools encoders don't, as far as I can tell (they use pbit voting without proper rounding - the worst option). The difference from doing this correctly in some modes is pretty significant - at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways p-bits can be computed. The RGB PSNR's were for a single encoded image (kodim18), purposely limited to mode 0 (with 4 bit components+unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878 dB: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and just reported the best one. At least on this image biasing the high endpoint towards 1 helps a little:

Shared  Unique  PSNR (dB)
>       >       39.053
>       >=      39.151
>=      >       38.996
>=      >=      38.432

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0:
0
16
33
49
66
82
99
115
132
148
165
181
198
214
231
247

pbit 1:
8
24
41
57
74
90
107
123
140
156
173
189
206
222
239
255
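
As a sanity check, this tiny C++ program reproduces both ramps using BC7's shift+or expansion (the 4-bit index and pbit form a 5-bit value, which is shifted left by 3 with its top 3 bits or'd into the low 3 bits - see below):

#include <cstdio>

int main()
{
    for (int pbit = 0; pbit <= 1; pbit++)
    {
        printf("pbit %d:\n", pbit);
        for (int index = 0; index < 16; index++)
        {
            int v = ((index << 1) | pbit) << 3; // 5 bits -> 8 bits
            v |= (v >> 5);                      // replicate the top 3 MSB's
            printf("%d\n", v);
        }
    }
    return 0;
}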

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components+pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15] because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if more than half of the other component LSB's are also 1 we're ok. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.

There's an additional bit of complexity to all this: the hardware scales the mode 0 index+pbit up to 8 bits by shifting the combined index+pbit left by 3 bits, then taking the upper 3 MSB's of this and logically or'ing them into the lower 3 bits to fill them in. This isn't quite the same as scaling by 255/31. So proper pbit determination code needs to factor this in, too. Here are the ramps computed using arithmetic scaling+rounding (notice they slightly differ from the previous ramps computed using shifting+or'ing):

0
16
33
49
66
82
99
115
132
148
165
181
197
214
230
247

8
25
41
58
74
90
107
123
140
156
173
189
206
222
239
255
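
And the same ramps via arithmetic scaling, for comparison (expand bin*2+pbit from [0,31] to [0,255] by multiplying by 255/31 and rounding):

#include <cstdio>
#include <cmath>

int main()
{
    for (int pbit = 0; pbit <= 1; pbit++)
    {
        printf("pbit %d:\n", pbit);
        for (int bin = 0; bin < 16; bin++)
        {
            int v = bin * 2 + pbit; // 5-bit value in [0,31]
            printf("%d\n", (int)std::floor(v * 255.0 / 31.0 + 0.5));
        }
    }
    return 0;
}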

I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)
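
Those formulas translate directly into code. A minimal C++ sketch (mine, not from any shipping encoder) of compensated mode 0 quantization:

#include <cmath>

// [0,1] endpoint value -> 4-bit mode 0 bin, compensating for the chosen pbit.
int quantize_mode0_bin(float value, int pbit)
{
    float bin;
    if (pbit == 0)
        bin = std::floor(value * 31.0f / 2.0f + 0.5f);
    else
        bin = std::floor((value * 31.0f - 1.0f) / 2.0f + 0.5f);

    // Clamp to the representable range of the 4-bit index.
    if (bin < 0.0f) bin = 0.0f;
    if (bin > 15.0f) bin = 15.0f;
    return (int)bin;
}

// bin+pbit -> [0,1] value (arithmetic scaling, NOT hardware scaling).
float dequantize_mode0_bin(int bin, int pbit)
{
    return (bin * 2 + pbit) / 31.0f;
}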

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique p-bits (modes 0, 3, 6, and 7). The bin calculations (the lines computing v, marked with a comment below) are slightly optimized forms of the previous set of equations.

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. It has a special case in there to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it actually handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but is pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

///////////////////////////
// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
{
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);
}

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
{
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
    {
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
            {
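                // bin calculation: compensated rounding for pbit b (see the equations above)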
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);
            }

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
        {
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);
        }

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];
    }
}

Saturday, April 21, 2018

RDO BC7 encoder planning

I'm building my first RDO BC7/BC6H encoders for our product (Basis), and I'm trying to decide which BC7 modes to implement first. I've decided on modes 1+6 for opaque textures. For alpha, I've boiled down the options to modes 1+5+6, 1+4+6, or maybe 1+4+6+7.

For alpha I know I'll need mode 4 and/or 5, because they are the only modes with uncorrelated alpha support. Mode 1 isn't optional because it supports 2 subsets, greatly reducing ugly block artifacts. The 3 subset modes aren't used much in practice and aren't necessary. Mode 6 is optional but boosts quality on simple blocks due to its awesome 7777.1-bit endpoint precision and beefy 4-bit selectors. The bare minimum modes for a decent opaque/alpha encoder seem to be 1+4 or 1+5, but adding 6 and 7 will improve quality, especially with alpha textures.

Others have suggested that I do an RDO BC7 encoder using only a single subset mode (6 seems perfect), but the result would have BC1-style block artifacts. I've already built a 2 subset RDO encoder for ETC1 in Basis, and it isn't that much harder to support 2 subsets vs. 1.

RDO GPU texture encoders try to optimize the encoded output data so that, after the output is LZ compressed (and the output pretty much always is!), you get as much quality per compressed bit as you can. The output bit patterns matter a lot. If you can reuse a bit pattern that's occurred in a nearby block, without impacting quality too much, an LZ codec can exploit this to issue a match vs. literals (saving bits and improving quality per compressed output bit). There's more to it than that, because you also want the ability to control the output quality vs. compressed size tradeoff in an artist-friendly way.

My RDO BC1 encoder in Basis uses Lagrangian optimization, and this is simplified because in BC1 the bit patterns nicely line up to byte boundaries which modern LZ codecs like. The endpoints and selectors aren't munged together, and a large chunk of the output entropy is in the selector data.
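
For concreteness, Lagrangian optimization here just means scoring each candidate encoding by J = D + lambda*R (distortion plus lambda times the estimated compressed size) and keeping the candidate with the lowest J. A generic C++ sketch (illustrative names only, not the actual Basis code):

struct RDCandidate
{
    float distortion; // error introduced by this encoding (e.g. block MSE)
    float bits;       // estimated cost in bits after LZ compression
};

// Pick the candidate minimizing J = D + lambda * R. A larger lambda trades
// away quality for smaller (more compressible) output.
int pick_rd_best(const RDCandidate *cands, int n, float lambda)
{
    int best = -1;
    float best_j = 1e30f;
    for (int i = 0; i < n; i++)
    {
        float j = cands[i].distortion + lambda * cands[i].bits;
        if (j < best_j) { best_j = j; best = i; }
    }
    return best;
}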

Anyhow, here's some of the data I used to make these decisions about BC7. I have a lot more data than this which I may post later.

The image corpus was kodim + 7 others.
For comparison, ispc_basic (using 7 modes) gets 46.54 dB.
My encoder was set to extremely high quality (better than ispc_slow).

Single mode:

mode enctime(s) dB
1    83        45.8 
3    91.2      43.4 
6    12.1      42.9 
2    12.9      42.7 
0    19.6      42.34
4    38.56     41.48
5    14.4      38.45 

Key mode combinations:
                 
1+6  86.37     46.27
1+5  11.161    45.79
1+6  21.9      45.7 
1+4  137.6     45.38

This data definitely isn't ideal, because I used my ispc encoder which was heavily tuned for performance. So modes 1 and 3 tried all 64 partitions in this test, while modes 0 and 2 only tried the best predicted partition. But it matches up with earlier data I generated from a more brute force encoder.

Note that the mode 4 and 5 encoders used all rotation/index selector options, which skews the output a bit. I doubt I'll be supporting these mode 4/5 options. The two 1+6 entries are for the encoder set to very high vs. much lower quality levels.

Some earlier encoding error data for just kodim18, using an earlier C++ version of my encoder:

Best of any mode error: 12988.968550

Best of modes 0 and 1: 13186.750320
Best of modes 1 and 6: 13440.390471
Best of modes 1 and 2: 13521.855790
Best of modes 1 and 4: 13565.064541
Best of modes 1 and 3: 13589.143902
Best of modes 1 and 5: 13592.709517
Best of modes 1 and 7: 13604.182592
Best of modes 1 and 1: 13605.033774
Best of modes 0 and 6: 14099.723969
Best of modes 0 and 3: 15046.616630
Best of modes 2 and 6: 15383.155463
Best of modes 0 and 4: 15563.136445
Best of modes 0 and 5: 15665.245609
Best of modes 0 and 2: 15892.424359
Best of modes 3 and 6: 15977.876955

Best of modes 0 and 1 and 6: 13037.149688
Best of modes 0 and 1 and 4: 13160.175075
Best of modes 0 and 1 and 2: 13168.570462
Best of modes 0 and 1 and 3: 13171.807469
Best of modes 0 and 1 and 5: 13176.588633
Best of modes 0 and 1 and 7: 13186.504616
Best of modes 0 and 0 and 1: 13186.750320
Best of modes 0 and 1 and 1: 13186.750320
Best of modes 1 and 2 and 6: 13360.493404
Best of modes 1 and 4 and 6: 13400.774007
Best of modes 1 and 5 and 6: 13429.100640
Best of modes 1 and 3 and 6: 13433.822985
Best of modes 1 and 6 and 7: 13439.954762
Best of modes 1 and 1 and 6: 13440.390471
Best of modes 1 and 6 and 6: 13440.390471
Best of modes 1 and 2 and 4: 13489.904966
Best of modes 1 and 2 and 3: 13508.004738
Best of modes 1 and 2 and 5: 13511.406144
Best of modes 1 and 2 and 2: 13521.855790
Best of modes 1 and 1 and 2: 13521.855790

The 1+6 combination is pretty powerful. Only a quarter dB separates it from ispc_basic on this corpus. Also, mode 1's bit packing is nicely aligned to bytes (more or less) - perfect for RDO encoding!

byte bits contents
0    2 6  mode+partition
1    6 2  endpoints
2    4 4  endpoints
3    2 6  endpoints
4    6 2  endpoints
5    4 4  endpoints
6    2 6  endpoints
7    6 2  endpoints
8    4 4  endpoints
9    2 6  endpoints
10   2 6  pbits+selectors
11   8    selectors
12   8    selectors
13   8    selectors 
14   8    selectors
15   8    selectors

Note that most of mode 1's output is endpoint data, not selector data, so partition/pbit biasing, endpoint quantization and bit pattern optimization will be critical.