Saturday, December 31, 2016

End of year realization

I'm basically an artist who "paints" with code, and good debuggers are among my brushes. Each "painting", though, consumes a variable amount of mental and physical resources. The Steam Linux project was the hardest one I've done so far. Getting games onto Linux and pushing the GL driver teams in the right directions was extremely difficult.

Thankfully, each project is a learning experience. Each time I get better at understanding myself.

Friday, December 16, 2016

Visual Studio 2015's Busted Find Dialog

In Visual Studio 2015, Microsoft decided to wreck the Find dialog: it's now perma-docked into the upper right-hand corner of the document. The dialog is too small, and the key icons (for enabling case-sensitive or whole-word searching) are tiny and hard to hit:


By comparison, here's 2010's:


The new Find dialog in 2015 is an example of bad UI, and I'm not the only Windows C++ developer who seriously dislikes it.

Saturday, December 10, 2016

"The Ballad of the Green Beret"

I heard this playing at the local Pagliacci's recently, and realized it's one of the tunes my father used to play all the time. He was in Vietnam in '68 or '69, I think, lost and alone in the jungle, and was saved by the Special Forces, better known as the Green Berets.


Monday, November 28, 2016

Why Age3 used low poly skinned meshes

Age3 used CPU skinning of relatively low poly models (even in "high" model mode). To help mitigate this technical design misstep (made by the Age3 team before I joined, near the end of production), I rewrote the skinning code to be multithreaded. Unfortunately, by the time I came on board the artists had already created a ton of low poly skinned meshes.

I also built the skinning DLL with Intel's compiler, so I was able to easily rewrite all the skinning code with SSE1/2 compiler intrinsics. Back in those days MSVC's support for vector intrinsics was weaker than Intel's compiler's. (I'm also the developer to blame for Age3's SSE requirement, which bit some owners of very early AMD processors who otherwise could have played the title at low frame rates.)
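
For the curious, here's a minimal sketch of the flavor of intrinsics code involved (my reconstruction for illustration, not the actual Age3 source): blend two bone matrices by weight, then transform a position, entirely in SSE registers.

    #include <xmmintrin.h> // SSE1 intrinsics

    // Sketch only: matrices are stored as four 16-byte-aligned float4 columns,
    // so M*v = x*col0 + y*col1 + z*col2 + col3 (with w == 1).
    static __m128 skin_position(const float* bone0, const float* bone1,
                                float weight0, __m128 pos) // pos = (x, y, z, 1)
    {
        const __m128 w0 = _mm_set1_ps(weight0);
        const __m128 w1 = _mm_set1_ps(1.0f - weight0);

        __m128 cols[4]; // weight-blended bone matrix, one column at a time
        for (int c = 0; c < 4; c++)
            cols[c] = _mm_add_ps(_mm_mul_ps(_mm_load_ps(bone0 + c * 4), w0),
                                 _mm_mul_ps(_mm_load_ps(bone1 + c * 4), w1));

        __m128 res = cols[3]; // translation column (w == 1)
        res = _mm_add_ps(res, _mm_mul_ps(cols[0], _mm_shuffle_ps(pos, pos, 0x00))); // x
        res = _mm_add_ps(res, _mm_mul_ps(cols[1], _mm_shuffle_ps(pos, pos, 0x55))); // y
        res = _mm_add_ps(res, _mm_mul_ps(cols[2], _mm_shuffle_ps(pos, pos, 0xAA))); // z
        return res;
    }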

Anyhow, I mention this because if you play Age3 today, say on a 4K monitor, the game's terrain and other effects hold up pretty well, except for the skinned character models, which look terribly low poly by comparison. On Halo Wars I used GPU skinning and instanced rendering, and I heavily jobified the animation system.

Another little note about the Halo Wars engine

There's still a lot of misunderstanding out there about where the Halo Wars engine technology came from. Starting in very early 2005 the HW team wrote a new engine pretty much from scratch. The Age3 code was only single threaded, didn't use SIMD, and consumed huge amounts of RAM. (Age3 used over 32MB just for UTF16 strings - not good for a console game!) The "Bang!" engine ran at ~7Hz and took around three to five minutes to load on the early Xbox 360 devkits.

Colt McAnlis (now at Google), Billy Khan (now at id Software) and I wrote the entire Xbox 360-only renderer almost from scratch. We started out with Age3's particle renderer and my "wrench" demo deferred shading engine for SM 2.0 hardware. Ensemble Studios basically gave us a blank check to do whatever we wanted on Xbox 360. (What good times!)

Age3's particle engine (written partially or mostly by Graeme Devine, now at Magic Leap) was so good that the artists refused to let us rewrite it. Billy and I threaded it by converting it into jobs, and we SIMD'ified all the key loops using AltiVec ops. We also offloaded as many computations as we could into vertex/pixel shaders to cut down on the very high CPU cost of the original code.

The Halo Wars particle engine would have run circles around Age3's (once ported back to x86).

Please don't get me wrong: Age3 was a beautiful and fun game, and I loved working on it. The team was super easy and pleasant to work with. Just remember that Halo Wars was created by a very different team with different goals. We had some pretty awesome goals for the next Halo Wars, but the studio was shut down.

Sunday, October 23, 2016

RDO ETC1 compression examples

I've compressed the Kodak test images at various settings using the prototype RDO ETC1 compressor I've been working on recently. You can download a .7z archive containing the RDO-compressed .KTX files and unpacked PNGs here. The .KTX files can be loaded using the Mali Texture Compression Tool (v4.3.0).

Here are the unpacked images for 512 endpoints and 1024 selectors (1.65 average bits/texel vs. 2.85 average bits/texel for non-RDO ETC1):

Non-RDO:
best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000

RDO (#endpoints_#selectors):

512_256:
rdo_luma_psnr: Avg: 31.638530, Std Dev: 2.891301, Min: 25.210732, Max: 35.657692, Mean: 33.023266
rdo_luma_ssim: Avg: 0.903939, Std Dev: 0.022998, Min: 0.839709, Max: 0.941335, Mean: 0.902615
rdo_bits_per_texel: Avg: 1.478541, Std Dev: 0.211604, Min: 1.075765, Max: 1.888489, Mean: 1.453206

512_512:
rdo_luma_psnr: Avg: 32.549770, Std Dev: 2.950959, Min: 25.927277, Max: 36.671211, Mean: 34.135223
rdo_luma_ssim: Avg: 0.916562, Std Dev: 0.020127, Min: 0.860491, Max: 0.950293, Mean: 0.915359
rdo_bits_per_texel: Avg: 1.555512, Std Dev: 0.211616, Min: 1.142314, Max: 1.969767, Mean: 1.533732

512_1024:
rdo_luma_psnr: Avg: 33.600601, Std Dev: 2.981399, Min: 26.842752, Max: 37.809361, Mean: 35.187038
rdo_luma_ssim: Avg: 0.928182, Std Dev: 0.017318, Min: 0.879742, Max: 0.957868, Mean: 0.926356
rdo_bits_per_texel: Avg: 1.648000, Std Dev: 0.208101, Min: 1.249207, Max: 2.055928, Mean: 1.623047

512_2048:
rdo_luma_psnr: Avg: 34.828563, Std Dev: 2.959008, Min: 27.984495, Max: 38.820568, Mean: 36.302998
rdo_luma_ssim: Avg: 0.939762, Std Dev: 0.014454, Min: 0.898490, Max: 0.964750, Mean: 0.938300
rdo_bits_per_texel: Avg: 1.765885, Std Dev: 0.208030, Min: 1.368184, Max: 2.174438, Mean: 1.735372

512_4096:
rdo_luma_psnr: Avg: 36.244860, Std Dev: 2.824295, Min: 29.513725, Max: 39.823002, Mean: 37.670746
rdo_luma_ssim: Avg: 0.951658, Std Dev: 0.011454, Min: 0.918562, Max: 0.971457, Mean: 0.950959
rdo_bits_per_texel: Avg: 1.924732, Std Dev: 0.210003, Min: 1.535360, Max: 2.343709, Mean: 1.893290

1024_4096:
rdo_luma_psnr: Avg: 36.375379, Std Dev: 2.881440, Min: 29.531380, Max: 40.141788, Mean: 37.697235
rdo_luma_ssim: Avg: 0.952464, Std Dev: 0.011512, Min: 0.918884, Max: 0.972384, Mean: 0.951676
rdo_bits_per_texel: Avg: 1.992114, Std Dev: 0.220525, Min: 1.569580, Max: 2.418762, Mean: 1.949666

Effect of ETC1 selector quantization on Luma SSIM/PSNR

This is like the previous post, except this time only the selectors are quantized while the endpoints are left alone. Kodak test images, perceptual colorspace metrics:

Stats for non-RDO ETC1 compression:

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876

RDO selectors 8192:

rdo_luma_psnr: Avg: 38.225255, Std Dev: 2.628415, Min: 31.853958, Max: 41.955276, Mean: 39.500504
rdo_luma_ssim: Avg: 0.966271, Std Dev: 0.007768, Min: 0.944449, Max: 0.981821, Mean: 0.966354
rdo_bits_per_texel: Avg: 2.366380, Std Dev: 0.231610, Min: 1.902201, Max: 2.793721, Mean: 2.337708

RDO selectors 4096:

rdo_luma_psnr: Avg: 36.581700, Std Dev: 2.874786, Min: 29.814810, Max: 40.718441, Mean: 37.796730
rdo_luma_ssim: Avg: 0.953993, Std Dev: 0.010954, Min: 0.922887, Max: 0.973516, Mean: 0.953305
rdo_bits_per_texel: Avg: 2.132147, Std Dev: 0.220503, Min: 1.668640, Max: 2.535848, Mean: 2.094666

RDO selectors 2048:

rdo_luma_psnr: Avg: 35.129581, Std Dev: 2.967410, Min: 28.291447, Max: 39.650620, Mean: 36.413860
rdo_luma_ssim: Avg: 0.942579, Std Dev: 0.013760, Min: 0.903846, Max: 0.967203, Mean: 0.941114
rdo_bits_per_texel: Avg: 1.969779, Std Dev: 0.216071, Min: 1.506246, Max: 2.368530, Mean: 1.930033

RDO selectors 1024:

rdo_luma_psnr: Avg: 33.915408, Std Dev: 2.963184, Min: 27.143675, Max: 38.416290, Mean: 35.294361
rdo_luma_ssim: Avg: 0.931751, Std Dev: 0.016440, Min: 0.886028, Max: 0.960691, Mean: 0.929749
rdo_bits_per_texel: Avg: 1.848387, Std Dev: 0.216314, Min: 1.378805, Max: 2.245748, Mean: 1.809530

RDO selectors 512:

rdo_luma_psnr: Avg: 32.898390, Std Dev: 2.920482, Min: 26.292456, Max: 37.282799, Mean: 34.293579
rdo_luma_ssim: Avg: 0.920788, Std Dev: 0.019035, Min: 0.868281, Max: 0.953666, Mean: 0.918912
rdo_bits_per_texel: Avg: 1.753840, Std Dev: 0.215968, Min: 1.278585, Max: 2.150350, Mean: 1.717773

RDO selectors 256:

rdo_luma_psnr: Avg: 32.036631, Std Dev: 2.866251, Min: 25.595591, Max: 36.275482, Mean: 33.285240
rdo_luma_ssim: Avg: 0.909641, Std Dev: 0.021761, Min: 0.849128, Max: 0.946493, Mean: 0.907937
rdo_bits_per_texel: Avg: 1.673566, Std Dev: 0.215763, Min: 1.187663, Max: 2.065999, Mean: 1.631165

RDO selectors 128:

rdo_luma_psnr: Avg: 31.255766, Std Dev: 2.800476, Min: 24.977221, Max: 35.173336, Mean: 32.437733
rdo_luma_ssim: Avg: 0.896458, Std Dev: 0.024306, Min: 0.827130, Max: 0.934879, Mean: 0.895064
rdo_bits_per_texel: Avg: 1.600956, Std Dev: 0.215559, Min: 1.127218, Max: 1.991862, Mean: 1.550741

Saturday, October 22, 2016

Effect of ETC1 endpoint quantization on Luma SSIM/PSNR

In this test on the 24 Kodak images I quantized the ETC1 block colors/intensity tables (or what I've been calling "endpoints", from DXT1/BC1 terminology) to 128 clusters, but the selectors were not quantized at all. 128 clusters for endpoints is at the edge of usability for many photos.

This test also adaptively limits blocks to only a single endpoint (versus a unique endpoint for each subblock) if doing so doesn't lower the block's PSNR by more than 1.25 dB.
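
A minimal sketch of that adaptive test (the helpers and names are mine for illustration, not the actual basislib code): decode the block both ways, then keep the cheaper single-endpoint encoding if it costs less than 1.25 dB.

    #include <cmath>
    #include <cstdint>

    // PSNR between the original and decoded texels of one 4x4 block (48 RGB bytes).
    static float block_psnr(const uint8_t* orig, const uint8_t* decoded, int n) {
        double sse = 0;
        for (int i = 0; i < n; i++) {
            double d = (double)orig[i] - decoded[i];
            sse += d * d;
        }
        if (sse == 0) return 999.0f; // lossless
        return (float)(10.0 * log10((255.0 * 255.0) / (sse / n)));
    }

    // True if forcing both subblocks to share one endpoint loses < 1.25 dB.
    static bool prefer_single_endpoint(const uint8_t orig[48],
                                       const uint8_t decoded_two_endpoints[48],
                                       const uint8_t decoded_one_endpoint[48]) {
        const float kMaxPsnrLossDb = 1.25f;
        return block_psnr(orig, decoded_two_endpoints, 48) -
               block_psnr(orig, decoded_one_endpoint, 48) <= kMaxPsnrLossDb;
    }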

Anyhow, these two graphs show that this process is quite effective. Even at only 128 clusters, the overall SSIM is only reduced by around .01, while the bitrate is reduced by around .4-.5 bits/texel.

The results look surprisingly good. I've made great progress on quality per bit over the previous few weeks, and I'll be posting images and .KTX files in a day or so.

Two more graphs, with 3 different endpoint quantization settings:


Overall stats:

ETC1 (no quantization):
best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876

128 endpoints:
rdo_luma_psnr: Avg: 38.042171, Std Dev: 1.874003, Min: 34.209053, Max: 41.065495, Mean: 38.749592
rdo_luma_ssim: Avg: 0.974083, Std Dev: 0.004284, Min: 0.960817, Max: 0.983318, Mean: 0.974376
rdo_bits_per_texel: Avg: 2.351300, Std Dev: 0.318168, Min: 1.788859, Max: 2.967855, Mean: 2.344340

512 endpoints:
rdo_luma_psnr: Avg: 39.239567, Std Dev: 2.001313, Min: 34.834538, Max: 41.839687, Mean: 40.379951
rdo_luma_ssim: Avg: 0.979648, Std Dev: 0.002847, Min: 0.973445, Max: 0.987098, Mean: 0.979329
rdo_bits_per_texel: Avg: 2.617640, Std Dev: 0.345818, Min: 2.031942, Max: 3.296285, Mean: 2.604553

1024 endpoints:
rdo_luma_psnr: Avg: 39.490915, Std Dev: 2.033055, Min: 34.942341, Max: 42.026814, Mean: 40.666183
rdo_luma_ssim: Avg: 0.980563, Std Dev: 0.002673, Min: 0.976034, Max: 0.987617, Mean: 0.980514
rdo_bits_per_texel: Avg: 2.693218, Std Dev: 0.356560, Min: 2.069397, Max: 3.390055, Mean: 2.668416

The next two graphs show RDO ETC1 compression on the Kodak test images with endpoint quantization effectively disabled. Note that adaptive subblock utilization is still enabled here, so a block's subblocks can be forced to use the same block colors/intensity tables (endpoints) if the quality loss is < 1.25 dB.

Tests like this are important because they show that the RDO compressor is able to utilize all the features available in ETC1: flipped/non-flipped blocks, differential/absolute block color encoding, subblocks, etc.

Overall stats:

rdo_luma_psnr: Avg: 39.766113, Std Dev: 2.066657, Min: 35.116722, Max: 42.367085, Mean: 40.845627
rdo_luma_ssim: Avg: 0.981710, Std Dev: 0.002428, Min: 0.978301, Max: 0.988114, Mean: 0.981266
rdo_bits_per_texel: Avg: 2.754947, Std Dev: 0.365874, Min: 2.098104, Max: 3.464823, Mean: 2.714681
rdo_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
rdo_compressed_size: Avg: 135411.166667, Std Dev: 17983.452669, Min: 103126.000000, Max: 170303.000000, Mean: 133432.000000

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000

The next graphs are just like the previous ones, except the adaptive subblock feature is disabled. They show that RDO ETC1 with no quantization is virtually identical to basic (highest quality, block-by-block) ETC1 compression.

Overall stats:

rdo_luma_psnr: Avg: 39.991337, Std Dev: 2.109917, Min: 35.276287, Max: 42.721352, Mean: 41.098907
rdo_luma_ssim: Avg: 0.982858, Std Dev: 0.002269, Min: 0.979608, Max: 0.988770, Mean: 0.982394
rdo_bits_per_texel: Avg: 2.853771, Std Dev: 0.348101, Min: 2.188131, Max: 3.518412, Mean: 2.828857
rdo_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
rdo_compressed_size: Avg: 140268.541667, Std Dev: 17109.836167, Min: 107551.000000, Max: 172937.000000, Mean: 139044.000000

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000

best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000

Thursday, October 20, 2016

Rate distortion performance of Basis ETC1 RDO+LZMA on the Kodak test set

At 3 quality levels, using REC709 perceptual colorspace metrics. This compares plain ETC1 (with no lossless compression), basislib highest quality ETC1+LZMA, and basislib RDO+LZMA.

"S" = selectors, "E" = endpoints.

crunch-style adaptive endpoint quantization at the block/subblock level is supported, but not at the macroblock (2x2 block) level yet. Also, the KTX writer backend is greedy, meaning it doesn't try to choose the combination of selectors+endpoints that results in the fewest compressed bits output by LZMA (or LZHAM). The lack of both features hurts compression. I have several other improvements to both quality and bitrate coming, but this is a good milestone.
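
For reference, a non-greedy backend would weigh each candidate against a Lagrangian cost J = D + lambda * R, where R is the estimated compressed size. A minimal sketch of the idea (the names are hypothetical, not basislib's actual API):

    #include <cfloat>
    #include <vector>

    // One possible (endpoint, selector) assignment for a block, with its
    // distortion and an estimate of its compressed size in bits.
    struct Candidate {
        int endpoint_idx, selector_idx;
        float distortion; // e.g. weighted SSE vs. the original block
        float est_bits;   // estimated bits after LZMA/LZHAM coding
    };

    // Pick the candidate minimizing J = D + lambda * R. A larger lambda
    // trades image quality for a smaller compressed output.
    static size_t pick_rd_best(const std::vector<Candidate>& cands, float lambda) {
        size_t best = 0;
        float best_j = FLT_MAX;
        for (size_t i = 0; i < cands.size(); i++) {
            float j = cands[i].distortion + lambda * cands[i].est_bits;
            if (j < best_j) { best_j = j; best = i; }
        }
        return best;
    }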

With a few more quality levels:

Wednesday, October 19, 2016

status of basis ETC1 support

I've now transitioned my 2D-only prototype into a full-blown class, instead of leaving it in my experimental framework as one huge function.

Next up are things like macroblock support, more endpoint/selector codebook refinements, an investigation into alternative selector compression schemes, and an experiment to exploit endpoint/selector codebook entry correlation. After this, I'm rewriting the code so it works on texture arrays, cubemaps, etc.

This rewritten code will be the "front end" of the full ETC1 compressor. The back end (which does the coding) comes after the front end is in good shape. Unlike crunch, basis will use the same basic front end for both .RDO mode and .CRN (or .basis) mode.

This compressor is also compatible with the ETC1 "subset" format I mentioned here, which means its output could be trivially transcoded to DXT1 with the aid of a precomputed lookup table.
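
Roughly how such a table-driven transcode could look (a sketch under my own assumptions about the table layout and selector ordering; none of this is actual basis code):

    #include <cstdint>

    struct dxt1_endpoints { uint16_t lo, hi; }; // packed RGB565 colors

    // Precomputed offline: for each (5:5:5 base color, intensity table) pair,
    // the best DXT1 endpoint pair. Hypothetical table, 32768 * 8 entries,
    // assumed to always satisfy color0 > color1 (DXT1 4-color mode).
    extern const dxt1_endpoints g_etc1s_to_dxt1[32768 * 8];

    static void transcode_block(uint16_t base_color555, uint8_t intensity_table,
                                const uint8_t etc1_selectors[16], uint8_t dxt1_block[8]) {
        const dxt1_endpoints& ep = g_etc1s_to_dxt1[base_color555 * 8 + intensity_table];

        // Write the two RGB565 endpoints, little-endian.
        dxt1_block[0] = (uint8_t)(ep.lo & 0xFF); dxt1_block[1] = (uint8_t)(ep.lo >> 8);
        dxt1_block[2] = (uint8_t)(ep.hi & 0xFF); dxt1_block[3] = (uint8_t)(ep.hi >> 8);

        // Remap ETC1 selectors (darkest-to-brightest order 3,2,0,1) to DXT1
        // codes (brightest-to-darkest order 0,2,3,1) - one plausible ordering,
        // assuming the table stores the brighter endpoint as color0.
        static const uint8_t s_remap[4] = { 2, 0, 3, 1 };
        for (int y = 0; y < 4; y++) {
            uint8_t row = 0;
            for (int x = 0; x < 4; x++)
                row |= (uint8_t)(s_remap[etc1_selectors[y * 4 + x] & 3] << (x * 2));
            dxt1_block[4 + y] = row;
        }
    }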

Monday, October 17, 2016

In-app charting/graphing using pplot

pplot is a nice little graphing library:

http://pplot.sourceforge.net/

pplot is device and platform independent, which I really liked. I hooked it up to my generic image class, which supports things like vector text rendering, antialiased line drawing, etc.

Sunday, October 16, 2016

GST: GPU-decodable Supercompressed Textures

This is amazingly well done:

http://gamma.cs.unc.edu/GST/

Code:
https://github.com/GammaUNC/GST

Paper:
http://gamma.cs.unc.edu/GST/gst.pdf

Check this awesome timeline out:


I no longer feel so alone out here. I've been working on "Supercompressed Texture" technology for about a decade now, since before I knew that's what it would be named. The first title I was involved in that used GPU transcoding of compressed textures was the PS2 version of World Series Baseball 2k3 (2003). It was designed by Blue Shift's then-CTO, John Brooks. This technology was then licensed to Electronic Arts for use in their PS3 titles.

Also, my first Xbox 360 title (Halo Wars) relied on a real-time supercompressed texture decompression system I wrote in '06-'07, without which the title wouldn't have fit into memory at all. (crunch was actually my second attempt at this approach, not my first.) So this tech has been around for years, used behind closed doors in a low-key way. It's as if the academic world is just now catching on. In the professional game development world, this is advanced but still "old school" technology by now.

My main feedback on the paper so far: the description of how the selector compression actually works is kinda muddled. (What's the "prefix sum" all about?) Also, it looks like crunch was used at the maximum quality level (255), not a tuned level or a range of levels. Crunch quality level 255 just isn't used in practice, to my knowledge: the codebooks at that level are huge, and the image quality is unnecessarily high. Also, can I speed up crunch's CPU transcoder by 2-3x? Oh yes!

Another thing I noticed: because GST doesn't support lossy endpoint quantization (like crunch does), I think its rate distortion performance is more limited than crunch's. My guess is that crunch should be able to target lower bitrates than GST. GST's main way of controlling the quality vs. rate tradeoff is its lossy dictionary-based selector compression method, while crunch can smoothly vary the quality of both the endpoints and the selectors.

Next up: Universal Supercompressed Textures with either CPU or GPU decoding. (Isn't it obvious? We need to abstract away all of these crazy formats behind good technologies and shared tools.)


Saturday, October 15, 2016

2D Haar Wavelet Transform on GPU texture selector indices

I've been very busy refining my new ETC1 compressor, so I haven't been posting much recently. Today I decided to do something different, so I've been playing around with the 2D 4x4 and 8x8 Haar transforms (or here) on ETC1 selector bits. I first tried this years ago on DXT1/BC1 while writing crunch, but unfortunately I never published or used the results.

To use the Haar transform on selector indices, I prepare the input samples by adding .5 to each selector index (which ranges over [0,3] in ETC1), do the transform, uniformly quantize, then do the inverse transform and truncate the resulting values back to the [0,3] selector range. (You must shift the input samples by .5 or it won't work.)

The quantization stage scales each floating point coefficient by 4 (to get 2 bits to the right of the decimal point, which in my experiments is just enough for 4x4) and converts it to an integer. This integer is then divided by a quantization value; to dequantize, it's multiplied back by that value, converted to float, and divided by 4.
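
Here's a minimal end-to-end sketch of that round trip (my illustration, using a plain unnormalized average/difference Haar; the exact transform normalization I'm using may differ):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // One averaging/differencing pass over the first n entries of v: the first
    // n/2 outputs are averages (low pass), the last n/2 are differences (detail).
    static void haar1d(float* v, int n) {
        float tmp[8];
        for (int i = 0; i < n / 2; i++) {
            tmp[i]         = (v[2 * i] + v[2 * i + 1]) * 0.5f;
            tmp[n / 2 + i] = (v[2 * i] - v[2 * i + 1]) * 0.5f;
        }
        std::copy(tmp, tmp + n, v);
    }

    static void ihaar1d(float* v, int n) {
        float tmp[8];
        for (int i = 0; i < n / 2; i++) {
            tmp[2 * i]     = v[i] + v[n / 2 + i];
            tmp[2 * i + 1] = v[i] - v[n / 2 + i];
        }
        std::copy(tmp, tmp + n, v);
    }

    // Nonstandard 2D decomposition: transform rows then columns, then recurse
    // on the low pass (top-left) quadrant.
    static void haar2d(float m[8][8]) {
        for (int n = 8; n >= 2; n /= 2) {
            for (int r = 0; r < n; r++) haar1d(m[r], n);
            for (int c = 0; c < n; c++) {
                float col[8];
                for (int r = 0; r < n; r++) col[r] = m[r][c];
                haar1d(col, n);
                for (int r = 0; r < n; r++) m[r][c] = col[r];
            }
        }
    }

    static void ihaar2d(float m[8][8]) {
        for (int n = 2; n <= 8; n *= 2) {
            for (int c = 0; c < n; c++) {
                float col[8];
                for (int r = 0; r < n; r++) col[r] = m[r][c];
                ihaar1d(col, n);
                for (int r = 0; r < n; r++) m[r][c] = col[r];
            }
            for (int r = 0; r < n; r++) ihaar1d(m[r], n);
        }
    }

    // Round-trip an 8x8 tile of ETC1 selectors (each in [0,3]) through the
    // transform, quantizing each coefficient by the matrix entry Q[r][c].
    void haar_filter_selectors(uint8_t sel[8][8], const int Q[8][8]) {
        float m[8][8];
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                m[r][c] = sel[r][c] + 0.5f; // the required .5 shift

        haar2d(m);

        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++) {
                int i = (int)std::lround(m[r][c] * 4.0f); // 2 fractional bits
                i /= Q[r][c];                             // quantize (truncating)
                m[r][c] = (float)(i * Q[r][c]) * 0.25f;   // dequantize
            }

        ihaar2d(m);

        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                sel[r][c] = (uint8_t)std::clamp((int)m[r][c], 0, 3); // truncate
    }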

For this uniform quantization matrix:
  1   1   1   2   2   3   3   4
  1   1   2   2   3   3   4   4
  1   2   2   3   3   4   4   5
  2   2   3   3   4   4   5   5
  2   3   3   4   4   5   5   6
  3   3   4   4   5   5   6   6
  3   4   4   5   5   6   6   7
  4   4   5   5   6   6   7   7

I get this ETC1 image after 8x8 Haar transform+quantization+inverse transform:


The original ETC1 compressed texture (before Haar filtering):


Selector visualization:


1x difference image (the delta between the original and filtered ETC1 images):


There is error in the high frequencies, which is exactly what you'd expect given the above quantization matrix.

Here's a more aggressive quantization matrix:

  2   4   6   8  10  12  14  16
  4   6   8  10  12  14  16  18
  6   8  10  12  14  16  18  20
  8  10  12  14  16  18  20  22
 10  12  14  16  18  20  22  24
 12  14  16  18  20  22  24  26
 14  16  18  20  22  24  26  28
 16  18  20  22  24  26  28  30

ETC1 image:


Selector visualization:


An even more aggressive quantization matrix:

  3   6   9  12  15  18  21  24
  6   9  12  15  18  21  24  27
  9  12  15  18  21  24  27  30
 12  15  18  21  24  27  30  33
 15  18  21  24  27  30  33  36
 18  21  24  27  30  33  36  39
 21  24  27  30  33  36  39  42
 24  27  30  33  36  39  42  45


Selector visualization:


I have some ideas on how the 4x4 Haar transform could be very useful in Basis, but they are just ideas right now. I find it amazing that the selectors can be transformed and manipulated in the frequency domain like this.