Another feature of BC7 is parity bits (P-bits), where the low bit of every channel of an endpoint is specified by a single shared bit. P-bits exist as a way of getting a bit more endpoint precision when there aren't enough available bits to give each endpoint channel one more bit: rather than letting the channels have different numbers of bits (something that caused gray discoloration problems in BC1-3), the extra low bit is shared across the channels.
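To make that concrete, here's a minimal sketch (not CVTT's actual code) of how a P-bit factors into endpoint reconstruction: the P-bit becomes the shared low bit of every channel, and the widened value is then expanded to 8 bits by replicating its top bits.

#include <cstdint>
#include <cstdio>

// 'bits' is the per-channel endpoint precision of the mode (e.g. 6 for
// BC7 mode 1) and 'pBit' is the parity bit shared by the channels.
static uint8_t ExpandEndpointChannel(uint8_t channel, int bits, int pBit)
{
    unsigned value = (static_cast<unsigned>(channel) << 1) | static_cast<unsigned>(pBit);
    int totalBits = bits + 1; // channel bits plus the P-bit

    // Replicate the high bits into the low bits to fill out 8 bits.
    return static_cast<uint8_t>((value << (8 - totalBits)) | (value >> (2 * totalBits - 8)));
}

int main()
{
    // Example: a 6-bit channel value with P-bit 1 expands to an 8-bit value.
    printf("%u\n", ExpandEndpointChannel(0x2A, 6, 1));
    return 0;
}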
CVTT will by default just try every partition and every P-bit combination.
Based on some follow-up work that I'm still experimenting with, a good quality trade-off would be to only check certain subsets: the vast majority of selected subsets fall into only about 16 of the possible ones, and restricting the search to those causes very little quality loss. I'll publish more about that when my next experiment is further along.
Weight-by-alpha issues
One weakness that CVTT's encoder has vs. Monte Carlo-style encoders is that principal component analysis does not work well for modes in BC7 where the alpha and some of the color channels are interpolated using the same indexes. This is never a problem for BC2 or BC3, which avoid it by calculating alpha first and then pre-weighting the RGB channels.
I haven't committed a solution to that yet, and while CVTT gets pretty good quality anyway, it's one area where it underperforms other compressors on BC7 by a noticeable amount.
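For what it's worth, the BC2/BC3 pre-weighting trick mentioned above is easy to sketch (hypothetical code, not CVTT's implementation): scale each pixel's RGB by its already-known alpha before doing the color fit, so pixels that will end up mostly transparent contribute less error.

#include <cstdint>

// Hypothetical pre-weighting sketch: 'alpha' is assumed to already be
// known (BC2/BC3 encode alpha independently of color), so the RGB values
// can be scaled by it before PCA/endpoint selection.
struct WeightedPixel
{
    float r, g, b;
    float weight; // kept around so centroids/errors can be normalized later
};

static WeightedPixel PreWeight(uint8_t r, uint8_t g, uint8_t b, uint8_t alpha)
{
    WeightedPixel p;
    p.weight = alpha / 255.0f;
    p.r = r * p.weight;
    p.g = g * p.weight;
    p.b = b * p.weight;
    return p;
}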
Shape re-use
The groupings of pixels in BC7 are called "shapes."
One optimization that CVTT does is partially reuse calculations for identical shapes. That is, if you look at the 3 subset grouping above, you can notice that many of the pixel groups are the same as some pixel groups in the 2 subset grouping.
To take advantage of that fact, CVTT performs principal component analysis on all unique shapes before performing further steps. This is a bit of a tradeoff, though: it's only an optimization if those shapes are actually used, so it would be less of a win if CVTT were to reduce the number of subsets that it checks.
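Roughly, the caching looks like this (illustrative names, not CVTT's): per-shape analysis results are keyed by a mask of which pixels belong to the shape, so a pixel group that appears in both a 2-subset and a 3-subset partition is only analyzed once.

#include <cstdint>
#include <unordered_map>

struct ShapeAnalysis
{
    float centroid[4];
    float axis[4]; // principal axis from PCA
};

// 'pixelMask' has one bit per pixel in the 4x4 block. 'analyze' is
// whatever per-shape work needs caching (PCA in CVTT's case).
template<typename AnalyzeFunc>
const ShapeAnalysis &GetOrComputeShape(std::unordered_map<uint16_t, ShapeAnalysis> &cache,
                                       uint16_t pixelMask, AnalyzeFunc analyze)
{
    auto it = cache.find(pixelMask);
    if (it == cache.end())
        it = cache.emplace(pixelMask, analyze(pixelMask)).first;
    return it->second;
}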
Weight reconstruction
One important aspect of BC7 is that, unlike BC1-3, it specifies the precision that interpolation is to be done at, as well as the weight values for each index. However, doing a table lookup for each value in a parallelized vector of index values is a bit slow. CVTT avoids this by reconstructing the weights arithmetically:
MUInt15 weight = ParallelMath::LosslessCast<MUInt15>::Cast(ParallelMath::RightShift(ParallelMath::CompactMultiply(g_weightReciprocals[m_range], index) + 256, 9));
Coincidentally, doing this just barely fits into 16 bits of precision while remaining exact.
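Here's a scalar sketch of the same reconstruction. The reciprocal values below are my own derivation (round(64 * 512 / range)), not copied from CVTT, but the reconstructed weights match the BC7 weight tables exactly.

#include <cstdint>
#include <cstdio>

// 'range' is the maximum index value for the mode's index precision.
static uint16_t WeightReciprocal(int range)
{
    switch (range)
    {
    case 3:  return 10923; // 2-bit indexes
    case 7:  return 4681;  // 3-bit indexes
    case 15: return 2185;  // 4-bit indexes
    default: return 0;
    }
}

static uint16_t ReconstructWeight(int range, int index)
{
    // Same arithmetic as the parallel version: multiply, bias, shift.
    return static_cast<uint16_t>((WeightReciprocal(range) * index + 256) >> 9);
}

int main()
{
    // Prints 0 21 43 64, matching BC7's 2-bit weight table.
    for (int index = 0; index <= 3; index++)
        printf("%u ", ReconstructWeight(3, index));
    printf("\n");
    return 0;
}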
BC6H
BC6H is very similar to BC7, except it stores 16-bit floating point values. The floating point part is achieved by encoding the endpoints as a high-precision base and a low-precision difference from the base. Some of the modes that it supports are partitioned similarly to BC7, and it also has an extremely complicated storage format where the endpoint bits are located somewhat arbitrarily.
There's a reason that BC6H is the one format that's flagged as "experimental." Unlike all of the other formats, BC6H is floating point, and it has a unique quirk: when BC6H interpolates between endpoints, it does so as if the endpoint values were integers, even though they will be bit-cast into floating point values.
Doing that severely complicates making a BC6H encoder: part of each floating point value is its exponent, so the values are roughly logarithmic. Unless they share the same exponent, two values don't even correlate proportionally with each other, so color values may shift erratically and principal component analysis doesn't really work.
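A toy illustration of the quirk (not the actual BC6H unquantization pipeline): interpolating the bit patterns of two half-floats as integers lands near their geometric mean rather than their arithmetic mean.

#include <cstdint>
#include <cstdio>

// Minimal half-float decode for normal, positive values only.
static float HalfBitsToFloat(uint16_t bits)
{
    int exponent = (bits >> 10) & 0x1F;
    float value = 1.0f + (bits & 0x3FF) / 1024.0f;
    for (int i = 15; i < exponent; i++)
        value *= 2.0f;
    for (int i = exponent; i < 15; i++)
        value *= 0.5f;
    return value;
}

int main()
{
    uint16_t a = 0x3C00; // 1.0 in half-float
    uint16_t b = 0x4C00; // 16.0 in half-float

    uint16_t midBits = static_cast<uint16_t>((a + b) / 2);
    printf("bit-space midpoint:   %f\n", HalfBitsToFloat(midBits));              // ~4.0
    printf("value-space midpoint: %f\n", (HalfBitsToFloat(a) + HalfBitsToFloat(b)) * 0.5f); // 8.5
    return 0;
}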
CVTT tries to do its usual tricks in spite of this, and it sort of works, but it's an area where CVTT's general approach is ill-suited.
ETC1
CVTT's ETC1 encoding is based on cluster fit, via what's basically a mathematical reformulation of it.
Basically, ETC1 is based on the idea that the human visual system is more sensitive to intensity detail than to color detail, so it encodes each 4x4 block as a pair of either 4x2 or 2x4 sub-blocks, each of which encodes a color, an offset table ID, and a per-pixel index into the offset table. The offsets are added to ALL of a pixel's color channels, essentially making them grayscale offsets.
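Per pixel, the reconstruction looks roughly like this (an illustrative sketch; the table row passed in would come from ETC1's modifier tables):

#include <algorithm>
#include <cstdint>

struct RGB { uint8_t r, g, b; };

static uint8_t AddClamped(uint8_t base, int offset)
{
    return static_cast<uint8_t>(std::min(255, std::max(0, base + offset)));
}

static RGB DecodeETC1Pixel(RGB baseColor, const int table[4], int pixelIndex)
{
    int offset = table[pixelIndex]; // same grayscale offset for every channel
    RGB result;
    result.r = AddClamped(baseColor.r, offset);
    result.g = AddClamped(baseColor.g, offset);
    result.b = AddClamped(baseColor.b, offset);
    return result;
}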
Unique cumulative offsets
What's distinct about ETC compared to the desktop formats, as far as using cluster fit is concerned, is two things: First, the primary axis is always known. Second, the offset tables are symmetrical, where 2 of the entries are the negation of the other two.
The optimal color for a block, not accounting for clamping, will be the average color of the block, offset by 1/16th of the sum of the offsets assigned to its pixels. Since half of the offsets negate each other, every pair of pixels assigned to opposing offsets cancels out, causing no change. This drastically reduces the search space, since many of the combinations will produce identical colors. Another thing that reduces the search space is that many of the colors will be duplicates after the precision reduction from quantization. Yet another thing is that in the first mode, the offsets are +2 and +4, which have a common factor, causing many of the possible cumulative offsets to overlap, cancelling out even more combinations.
So, CVTT's ETC1 compressor simply evaluates each possible offset from the average color that results in a unique color post-quantization, and picks the best one. Differential mode works by selecting the best VALID combination of colors, first by checking if the best pair of colors is valid, and failing that, checking all evaluated color combinations.
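Here's a rough sketch of what the unique-cumulative-offset idea boils down to for one 8-pixel sub-block (my own construction with placeholder table values, not CVTT's code): since the table is symmetric, only the net counts of the two positive entries matter, and duplicate sums can be discarded up front.

#include <cstdio>
#include <cstdlib>
#include <set>

int main()
{
    // Positive entries of one hypothetical offset table; the negative
    // entries are their negations, so only net counts matter.
    const int a = 2, b = 8;

    std::set<int> uniqueSums;
    // netA/netB = (pixels assigned +a) - (pixels assigned -a), etc.
    // With 8 pixels, the absolute net counts can't total more than 8,
    // and their total must have the same parity as the pixel count.
    for (int netA = -8; netA <= 8; netA++)
    {
        for (int netB = -8; netB <= 8; netB++)
        {
            if (abs(netA) + abs(netB) <= 8 && (abs(netA) + abs(netB)) % 2 == 0)
                uniqueSums.insert(netA * a + netB * b);
        }
    }

    // Each unique sum corresponds to one candidate shift of the average
    // color; the encoder would then quantize and score each candidate.
    printf("%zu unique cumulative offsets\n", uniqueSums.size());
    return 0;
}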
ETC2
ETC2 has 3 additional selectable modes on top of the ETC1 modes. One, called T mode, contains 4 colors: Color0, Color1, Color1+offset, and Color1-offset. Another, called H mode, contains Color0+offset, Color0-offset, Color1+offset, and Color1-offset. The final mode, called planar mode, contains what is essentially a base color and a per-axis offset gradient.
T and H mode
T and H mode both exist to better handle blocks where, within the 2x4 or 4x2 sub-blocks, the colors do not align well along the grayscale axis. CVTT's T/H mode encoding basically works with that assumption by trying to find where it thinks the poorly-aligned color axes might be. First, it generates some chrominance coordinates, which are basically 2D coordinates corresponding to the pixel colors projected onto the grayscale plane. Then, it performs principal component analysis to find the primary chrominance axis. Then, it splits the block based on which side of the halfway point each pixel falls on, forming two groupings that are referred to internally as "sectors."
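The chrominance projection and axis-finding steps look roughly like this (my own construction, not CVTT's code): express each color in an orthonormal basis of the plane perpendicular to the grayscale axis, then take the principal axis of the resulting 2D points.

#include <cmath>

struct Chroma { float u, v; };

// Project a color onto the plane perpendicular to the grayscale axis (1,1,1),
// using two orthonormal basis vectors of that plane.
static Chroma ProjectToChroma(float r, float g, float b)
{
    Chroma c;
    c.u = (r - g) / std::sqrt(2.0f);
    c.v = (r + g - 2.0f * b) / std::sqrt(6.0f);
    return c;
}

// Angle of the principal axis of a set of 2D points, via the closed-form
// eigenvector direction of the 2x2 covariance matrix.
static float PrincipalChromaAngle(const Chroma *points, int count)
{
    float meanU = 0.0f, meanV = 0.0f;
    for (int i = 0; i < count; i++) { meanU += points[i].u; meanV += points[i].v; }
    meanU /= count; meanV /= count;

    float covUU = 0.0f, covUV = 0.0f, covVV = 0.0f;
    for (int i = 0; i < count; i++)
    {
        float du = points[i].u - meanU;
        float dv = points[i].v - meanV;
        covUU += du * du; covUV += du * dv; covVV += dv * dv;
    }
    return 0.5f * std::atan2(2.0f * covUV, covUU - covVV);
}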
From the sectors, it performs a similar process of inspecting each possible offset count from the average to determine the best fit - but it will also record whether any colors NOT assigned to the sector can still use one of the results that it computed, which is used later to determine the actual optimal pairing of the computed results.
One case that this may not handle optimally is when the pixels in a block ARE fairly well-aligned along the grayscale axis, but the ability of T/H colors to be relatively arbitrary would be an advantage.
ETC2 with punch-through, "virtual T mode"
ETC2 supports punch-through transparency by mapping one of the T or H indexes to transparent. Both of these are resolved in the same way as T mode. When encoding punch-through, the color values for T mode are Color0, Color1+offset, transparent, and Color1-offset; in H mode, they are Color0+offset, Color0-offset, transparent, and Color1-offset.
Essentially, both have a single isolated color, and another color with a pair of +/- offsets. There are only 2 differences: First, the isolated color in H mode is still offset, so the offset has to be undone; if that quantizes to a more accurate value, then H mode is better. Second, the H mode encoding may not be valid - H mode encodes the table index low bit based on the order of the colors, but unlike when encoding opaque blocks, reordering the colors will affect which color has the isolated value and which one has the pair of values.
H mode as T mode encoding
One special case to handle when testing H mode is the possibility that both optimal colors are the same. This should be avoidable by evaluating T mode first, but the code handles it properly just to be safe. Because H mode encodes the table low bit based on a comparison of the endpoints, it may not be possible to select the correct table if the endpoints are the same. In that case, CVTT uses a fallback where it encodes the block as T mode instead, mapping everything to the color with the pair of offsets.
Planar mode
Planar mode involves finding an optimal combination of 3 values that determine the value of each channel at pixel (x, y) as O + (H*x) + (V*y).
How planar mode encoding actually works is by just finding the least-squares fit for all three of those values at once.
Where error = (reconstructedValue - actualValue)², we want to solve for d(error)/dO = 0, d(error)/dH = 0, and d(error)/dV = 0.
The error is quadratic, so all three of those derivatives are linear in O, H, and V, and the whole thing converts to a system of linear equations that can be solved directly. The proof and steps are in the code.
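Here's a self-contained sketch of that least-squares fit for one channel, using the O + (H*x) + (V*y) model as written above (details like ETC2's actual gradient normalization and quantization are omitted, and this isn't CVTT's code):

#include <cstdio>

struct PlanarFit { double o, h, v; };

// Fit value(x, y) = O + H*x + V*y over a 4x4 block (x, y in 0..3) by
// solving the normal equations (A^T A) * [O H V]^T = A^T b.
static PlanarFit FitPlanarChannel(const double values[16])
{
    double s1 = 0, sx = 0, sy = 0, sxx = 0, sxy = 0, syy = 0;
    double sv = 0, sxv = 0, syv = 0;
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
        {
            double v = values[y * 4 + x];
            s1 += 1; sx += x; sy += y;
            sxx += x * x; sxy += x * y; syy += y * y;
            sv += v; sxv += x * v; syv += y * v;
        }
    }

    // Solve the 3x3 system with Cramer's rule.
    double m[3][3] = { { s1, sx, sy }, { sx, sxx, sxy }, { sy, sxy, syy } };
    double b[3] = { sv, sxv, syv };

    double det =
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1]) -
        m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0]) +
        m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);

    PlanarFit fit = { 0, 0, 0 };
    double *out[3] = { &fit.o, &fit.h, &fit.v };
    for (int col = 0; col < 3; col++)
    {
        double n[3][3];
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                n[r][c] = (c == col) ? b[r] : m[r][c];
        double detN =
            n[0][0] * (n[1][1] * n[2][2] - n[1][2] * n[2][1]) -
            n[0][1] * (n[1][0] * n[2][2] - n[1][2] * n[2][0]) +
            n[0][2] * (n[1][0] * n[2][1] - n[1][1] * n[2][0]);
        *out[col] = detN / det;
    }
    return fit;
}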
ETC2 alpha and EAC
Both of these "grayscale" modes are more complicated because they have 3-bit indexes, multiple lookup tables, and an amplitude multiplier.
CVTT tries a limited set of possibilities based on alpha insets. It tries 10 alpha ranges, which correspond to all ranges where the inset of each endpoint is within 1 of the inset of the other endpoint. So, for example, given 8 alpha offsets numbered 0-7, it will try these pairs (a quick sketch of this enumeration follows the list):
- 0,7
- 0,6
- 1,7
- 1,6
- 1,5
- 2,6
- 2,5
- 2,4
- 3,5
- 3,4
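That list is just every pair of insets that differ by at most 1 (hypothetical code, not CVTT's):

#include <cstdio>
#include <cstdlib>

int main()
{
    // Enumerate (low index, high index) pairs where each endpoint's inset
    // is within 1 of the other endpoint's inset; prints the 10 pairs above.
    for (int lowInset = 0; lowInset <= 3; lowInset++)
        for (int highInset = 0; highInset <= 3; highInset++)
            if (abs(lowInset - highInset) <= 1)
                printf("%d,%d\n", lowInset, 7 - highInset);
    return 0;
}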
Once the range is selected, 2 multipliers are checked: the largest multiplier that can be applied without exceeding the actual alpha range, and the smallest multiplier that exceeds it.
The best result of these possibilities is selected.
Possible improvements and areas of interest
BC6H is by far the most improvable aspect. Traditional PCA doesn't work well because of the logarithmic interpolation. Sum-of-square-difference in floating point pseudo-logarithmic space performs much worse than in gamma space and is prone to sparkly artifacts.
ETC1 cumulative offset deduplication assumes that each pixel is equally important, which doesn't hold when using weight-by-alpha.
ETC2 T/H mode encoding could try all 15 possible sector assignments (based on the 16-pixel ordering along the chroma axis) instead of one. I did try finding the grouping that minimized the total square distance to the group averages instead of using the centroid as the split point, but that actually had no effect... they might be mathematically equivalent? Not sure.
A lot of these concepts don't translate well to ASTC. CVTT's approaches largely assume that it's practical to traverse the entire search space, but ASTC is highly configurable, so its search space has many axes, essentially. The fact that partitioning is done AFTER grid interpolation in particular is also a big headache that would require its own novel solutions.
Reduction of the search space is one of CVTT's biggest sore spots. It performs excellently at high quality targets, but is relatively slow at lower quality targets. I justified this because typically developers want to maximize quality when import is a one-time operation done offline, and CVTT is fast enough for the most part, but it probably wouldn't be suitable for real-time operation.