Sunday, January 3, 2021

CVTT Texture Compressor Technical Breakdown

Since a technical breakdown of how Betsy does texture compression was posted, I wanted to lay out how the compressors in Convection Texture Tools (CVTT) work, as well as provide some context of what CVTT's objectives are in the first place to explain some of the technical decisions.

First off, while I am very happy with how CVTT has turned out, and while it's definitely a production-quality texture compressor, providing the best compressor possible for a production environment has not been its primary goal. Its primary goal is to experiment with compression techniques to improve the state of the art, particularly finding inexpensive ways to hit high quality targets.

A common theme that wound up manifesting in most of CVTT's design is that encoding decisions are either guided by informed decisions, i.e. models that relate to the problem being solved, or are exhaustive.  Very little of it is done by random or random-like searching. Much of what CVTT exists to experiment with is figuring out techniques which amount to making those informed decisions.

CVTT's ParallelMath module, and choice of C++

While there's some concidence with CVTT having a similar philosophy to Intel's ISPC compressor, the reason for CVTT's SPMD-style design was actually motivated by it being built a port of the skeleton of DirectXTex's HLSL BC7 compressor.

I chose to use C++ instead of ISPC for three main reasons:
  • It was easier to develop it in Visual Studio.
  • It was easier to do operations that didn't parallelize well.  This turned out to matter with the ETC compressor in particular.
  • I don't trust in ISPC's longevity, in particular I think it will be obsolete as soon as someone makes something that can target both CPU and GPU, like either a new language that can cross-compile, or SPIR-V-on-CPU.

Anyway, CVTT's ParallelMath module is kind of the foundation that everything else is built on.  Much of its design is motivated by SIMD instruction set quirks, and a desire to maintain compatibility with older instruction sets like SSE2 without sacrificing too much.

Part of that compatibility effort is that most of CVTT's ops use a UInt15 type.  The reason for UInt15 is to handle architectures (like SSE2!) that don't support unsigned compares, min, or max, which means performing those operations on a 16-bit number requires flipping the high bit on both operands.  For any number where we know the high bit is zero for both operands, that flip is unnecessary - and a huge number of operations in CVTT fit in 15 bits.

The compare flag types are basically vector booleans, where either all bits are 1 or all bits are 0 for a given lane - There's one type for 16-bit ints, and one for 32-bit floats, and they have to be converted since they're different widths.  Those are combined with several utility functions, some of which, like SelectOrZero and NotConditionalSet, can elide a few operations.

The RoundForScope type is a nifty dual-use piece of code.  SSE rounding modes are determined by the CSR register, not per-op, so RoundForScope when targeting SSE will set the CSR, and then reset it in its destructor.  For other architectures, including the scalar target, the TYPE of the RoundForScope passed in is what determines the operation, so the same code works whether the rounding is per-op or per-scope.

While the ParallelMath architecture has been very resistant to bugs for the most part, where it has run into bugs, they've mostly been due to improper use of AnySet or AllSet - Cases where parallel code can behave improperly because lanes where the condition should exclude it are still executing, and need to be manually filtered out using conditionals.

BC1-7 common themes

All of the desktop formats that CVTT supports are based on interpolation.  S3TC RGB (a.k.a. DXT1) for instance defines two colors (called endpoints), then defines all pixels as being either one of those two colors, or a color that is part-way between those two colors, for each 4x4 block.  Most of the encoding effort is spent on determining what the two colors should be.
 
You can read about a lot of this on Simon Brown's post outlining the compression techniques used by Squish, one of the pioneering S3TC compressors, which in turn is the basis for the algorithm used by CVTT's BC1 compressor.

Principal component analysis

Principal component analysis determines, based on a set of points, what the main axis is that the colors are aligned along.  This gives us a very good guess of what the initial colors should be, simply using the colors that are the furthest along that axis, but it isn't necessarily ideal.

Endpoint refinement

In BC1 for instance, each color is assigned to one of four possible values along the color line.  CVTT solves for that by just finding the color with the shortest distance to each pixel's color.  If the color assignments are known, then it's possible to determine what the color values are that will minimize the sum of the square distance of that mapping.  One round of refinement usually yields slightly better results and is pretty cheap to check.  Two rounds will sometimes yield a slightly better result.

Extrapolation

One problem with using the farthest extents of the principal axis as the color is that the color precision is reduced (quantized) by the format.  In BC1-5, the color is reduced to a 16-bit color with 5 bits of red, 6 bits of green, and 5 bits of alpha.  It's frequently possible to achieve a more accurate match by using colors outside of the range so that the interpolated colors are closer to the actual image colors - This sacrifices some of the color range.

CVTT internally refers to these as "tweak factors" or similar, since what they functionally do is make adjustments to the color mapping to try finding a better result.

The number of extrapolation possibilities increases quadratically with the number of indexes.  CVTT will only ever try four possibilities: No insets, one inset on one end (which is two possibilities, one for each end), and one inset on both ends.

BC1 (DXT1)

CVTT's BC1 encoder uses the cluster fit technique developed by Simon Brown for Squish.  It uses the principal axis to determine an ordering of each of the 16 pixels along the color line, and then rather than computing the endpoints from the start and end points, it computes them by trying each possible count of pixels assigned to each endpoint that maintains the original order and still totals 16.  That's a fairly large set of possibilities with a lot of useless entries, but BC1 is fairly tight on bits, so it does take a lot of searching to maximize quality out of it.

BC2 (DXT3)

BC2 uses BC1 for RGB and 4bpp alpha.  There's not much to say here, since it just involves reducing the alpha precision.

BC3 (DXT5)

This one is actually a bit interesting.  DXT5 uses indexed alpha, where it defines two 8-bit alpha endpoints and a 3-bit interpolator per pixel, but it also has a mode where 2 of the interpolators are reserved 0 and 255 and only 6 are endpoint-to-endpoint values.  Most encoders will just use the min/max alpha.  CVTT will also try extrapolated endpoints, and will try for the second mode by assuming that any pixels within 1/10th of the endpoint range of 0 or 255 would be assigned to the reserved endpoints.  The reason for the 1/10th range is that the rounding range of the 6-value endpoints is 1/10th of the range, and it assumes that for any case where the endpoints would include values in that range, it would just use the 8-index mode and there'd be 6 indexes between them anyway.

BC4 and BC5

These two modes are functionally the same as BC3's alpha encoding, with the exception that the signed modes are offset by 128.  CVTT handles signed modes by pre-offsetting them and undoing the offset.

BC7

BC7 has 8 modes of operation and is the most complicated format to encode, but it's actually not terribly more complicated than BC1.  All of the modes do one of two things: They encode 1 to 3 pairs of endpoints that are assigned to specific groupings of pixels for all color channels, referred to as partitions, or or they encode one set of endpoints for the entire block, except for one endpoint, which is encoded separately.
 
Here are the possible partitions:

Credit: Jon Rocatis from this post.

Another feature of BC7 are parity bits, where the low bit of each endpoint is specified by a single bit.  Parity bits (P-bit) exist as a way of getting a bit more endpoint precision when there aren't as many available bits as there are endpoint channels without causing the channels to have a different number of bits, something that caused problems with gray discoloration in BC1-3.
 
CVTT will by default just try every partition, and every P-bit combination.

Based on some follow-up work that I'm still experimenting with, a good quality trade-off would be to only check certain subsets.  Among the BC7 subsets, the vast majority of selected subsets fall into a only about 16 of the possible ones, and omitting those causes very little quality loss.  I'll publish more about that when my next experiment is further along.

Weight-by-alpha issues

One weakness that CVTT's encoder has vs. Monte Carlo-style encoders is that principal component analysis does not work well for modes in BC7 where the alpha and some of the color channels are interpolated using the same indexes.  This is never a problem with BC2 or BC3, which can avoid that problem by calculating alpha first and then pre-weighting the RGB channels.

I haven't committed a solution to that yet, and while CVTT gets pretty good quality anyway, it's one area where it underperforms other compressors on BC7 by a noticeable amount.

Shape re-use

The groupings of pixels in BC7 are called "shapes."

One optimization that CVTT does is partially reuse calculations for identical shapes.  That is, if you look at the 3 subset grouping above, you can notice that many of the pixel groups are the same as some pixel groups in the 2 subset grouping.

To take advantage of that fact, CVTT performs principal component analysis on all unique shapes before performing further steps.  This is a bit of a tradeoff though: It's only an optimization if those shapes are actually used, so it's not ideal for if CVTT were to reduce the number of subsets that it checks.

Weight reconstruction

One important aspect of BC7 is that, unlike BC1-3, it specifies the precision that interpolation is to be done at, as well as the weight values for each index.  However, doing a table lookup for each value in a parallelized index values is a bit slow.  CVTT avoids this by reconstructing the weights arithmetically:

MUInt15 weight = ParallelMath::LosslessCast<MUInt15>::Cast(ParallelMath::RightShift(ParallelMath::CompactMultiply(g_weightReciprocals[m_range], index) + 256, 9));

Coincidentally, doing this just barely fits into 16 bits of precision accurately.

BC6H

BC6H is very similar to BC7, except it's 16-bit floating point.   The floating point part is achieved by encoding the endpoints as a high-precision base and low-precision difference from the base.  Some of the modes that it supports are partitioned similar to BC7, and it also has an extremely complicated storage format where the endpoint bits are located somewhat arbitrarily.
 
There's a reason that BC6H is the one mode that's flagged as "experimental."  Unlike all other modes, BC6H is floating point, but has a very unique quirk: When BC6H interpolates between endpoints, it's done as if the endpoint values are integers, even though they will be bit-cast into floating point values.

Doing that severely complicates making a BC6H encoder, because part of the floating point values are the exponent, meaning that the values are roughly logarithmic.  Unless they're the same, they don't even correlate proportionally with each other, so color values may shift erratically, and principal component analysis doesn't really work.

CVTT tries to do its usual tricks in spite of this, and it sort of works, but it's an area where CVTT's general approach is ill-suited.

ETC1

ETC1 is based on cluster fit, via what's basically a mathematical reformulation of it.

Basically, ETC1 is based on the idea that the human visual system sees color detail less than intensity detail, so it encodes each 4x4 block as a pair of either 4x2 or 2x4 blocks which each encode a color, an offset table ID, and a per-pixel index into the offset table.  The offsets are added to ALL color channels, making them grayscale offsets, essentially.
 

Unique cumulative offsets

What's distinct about ETC compares to the desktop formats, as far as using cluster fit is concerned, is two things: First, the primary axis is always known.  Second, the offset tables are symmetrical, where 2 of the entries are the negation of the other two.
 
The optimal color for a block, not accounting for clamping, will be the average color of the block, offset by 1/16th of the offset assigned to each pixel.  Since half of the offsets negate each other, every pair of pixels assigned to opposing offsets cancel out, causing no change.  This drastically reduces the search space, since many of the combinations will produce identical colors.  Another thing that reduces the search space is that many of the colors will be duplicates after the precision reduction from quantization.  Yet another thing is that in the first mode, the offsets are +2 and +4, which have a common factor, causing many of the possible offsets to overlap, cancelling out even more combinations.

So, CVTT's ETC1 compressor simply evaluates each possible offset from the average color that results in a unique color post-quantization, and picks the best one.  Differential mode works by selecting the best VALID combination of colors, first by checking if the best pair of colors is valid, and failing that, checking all evaluated color combinations.
 

ETC2

ETC2 has 3 additional selectable modes on top of the ETC1 modes.  One, called T mode, contains 4 colors: Color0, Color1, Color1+offset, and Color2+offset.  Another, called H mode, contains Color0+offset, Color0-offset, Color1+offset, and Color1-offset.  The final mode, called planar mode, contains what is essentially a base color and a per-axis offset gradient.

T and H mode

T and H mode both exist to better handle blocks where, within the 2x4 or 4x2 blocks, the colors do not align well along the grayscale axis.  CVTT's T/H mode encoding basically works with that assumption by trying to find where it thinks the poorly-aligned color axes might be.  First, it generates some chrominance coordinates, which are basically 2D coordinates corresponding to the pixel colors projected on to the grayscale plane.  Then, it performs principal component analysis to find the primary chrominance axis.  Then, it splits the block based on which side of the half-way point each pixel is to form two groupings that are referred to internally as "sectors."

From the sectors, it performs a similar process of inspecting each possible offset count from the average to determine the best fit - But it will also record if any colors NOT assigned to the sector can still use one of the results that it computed, which are used later to determine the actual optimal pairing of the results that it computed.

One case that this may not handle optimally is when the pixels in a block ARE fairly well-aligned along the grayscale axis, but the ability of T/H colors to be relatively arbitrary would be an advantage.
 

ETC2 with punch-through, "virtual T mode"

ETC2 supports punchthrough transparency by mapping one of the T or H indexes to transparent.  Both of these are resolved in the same way as T mode.  When encoding punch-through the color values for T mode are Color0, Color1+offset, transparent, Color1-offset, and in H mode, they are Color0+offset, Color0-offset, transparent, and Color1.

Essentially, both have a single color, and another color +/- an offset, there are only 2 differences: First, the isolated color H mode is still offset, so the offset has to be undone.  If that quantizes to a more accurate value, then H mode is better.  Second, the H mode color may not be valid - H mode encodes the table index low bit based on the order of the colors, but unlike when encoding opaque, reordering the colors will affect which color has the isolated value and which one has the pair of values. 

H mode as T mode encoding

One special case to handle with testing H mode is the possibility that the optimal color is the same.  This should be avoidable by evaluating T mode first, but the code handles it properly just to be safe.  Because H mode encodes the table low bit based on a comparison of the endpoints, it may not be possible to select the correct table if the endpoints are the same.  In that case, CVTT uses a fallback where it encodes the block as T mode instead, mapping everything to the color with the pair of offsets.

Planar mode

Planar mode involves finding an optimal combination of 3 values that determine the color of each channel value as O+(H*x)+(V*Y)

How planar mode actually works is by just finding the least-squares fit for each of those three values at once.
 
Where error=(reconstructedValue-actualValue)², we want to solve for d(error)/dO=0, d(error)/dH=0, and d(error)/dV=0

All three of these cases resolve to quadratic formulas, so the entire thing is just converted to a system of linear equations and solved.  The proof and steps are in the code.

ETC2 alpha and EAC

Both of these "grayscale" modes are both more complicated because they have 3-bit indexes, multiple lookup tables, and an amplitude multiplier.

CVTT tries a limited set of possibilities based on alpha insets.  It tries 10 alpha ranges, which correspond to all ranges where the index inset of each endpoint is +/- 1 the number of the other endpoint.  So, for example, given 8 alpha offsets numbered 0-7, it will try these pairs:
  • 0,7
  • 0,6
  • 1,7
  • 1,6
  • 1,5
  • 2,6
  • 2,5
  • 2,4
  • 3,5
  • 3,4
Once the range is selected, 2 multipliers are checked: The highest value that can be multiplied without exceeding the actual alpha range, and the smallest number that can be multiplied while exceeding it.

The best result of these possibilities is selected.

Possible improvements and areas of interest

BC6H is by far the most improvable aspect.  Traditional PCA doesn't work well because of the logarithmic interpolation.  Sum-of-square-difference in floating point pseudo-logarithmic space performs much worse than in gamma space and is prone to sparkly artifacts.

ETC1 cumulative offset deduplication assumes that each pixel is equally important, which doesn't hold when using weight-by-alpha.

ETC2 T/H mode encoding could try all 15 possible sector assignments (based on the 16-pixel ordering along the chroma axis) instead of one.  I did try finding the grouping that minimized the total square distance to the group averages instead of using the centroid as the split point, but that actually had no effect... they might be mathematically equivalent?  Not sure.

A lot of these concepts don't translate well to ASTC.  CVTT's approaches largely assume that it's practical to traverse the entire search space, but ASTC is highly configurable, so its search space has many axes, essentially.  The fact that partitioning is done AFTER grid interpolation in particular is also a big headache that would require its own novel solutions.

Reduction of the search space is one of CVTT's biggest sore spots.  It performs excellently at high quality targets, but is relatively slow at lower quality targets.  I justified this because typically developers want to maximize quality when import is a one-time operation done offline, and CVTT is fast enough for the most part, but it probably wouldn't be suitable for real-time operation.

No comments:

Post a Comment