Since a technical breakdown of how Betsy does texture compression was posted, I wanted to lay out how the compressors in Convection Texture Tools (CVTT) work, as well as provide some context of what CVTT's objectives are in the first place to explain some of the technical decisions.
First off, while I am very happy with how CVTT has turned out, and while it's definitely a production-quality texture compressor, providing the best compressor possible for a production environment has not been its primary goal. Its primary goal is to experiment with compression techniques to improve the state of the art, particularly finding inexpensive ways to hit high quality targets.
A common theme that wound up manifesting in most of CVTT's design is that encoding decisions are either guided by informed decisions, i.e. models that relate to the problem being solved, or are exhaustive. Very little of it is done by random or random-like searching. Much of what CVTT exists to experiment with is figuring out techniques which amount to making those informed decisions.
CVTT's ParallelMath module, and choice of C++
- It was easier to develop it in Visual Studio.
- It was easier to do operations that didn't parallelize well. This turned out to matter with the ETC compressor in particular.
- I don't trust in ISPC's longevity, in particular I think it will be obsolete as soon as someone makes something that can target both CPU and GPU, like either a new language that can cross-compile, or SPIR-V-on-CPU.
Anyway, CVTT's ParallelMath module is kind of the foundation that everything else is built on. Much of its design is motivated by SIMD instruction set quirks, and a desire to maintain compatibility with older instruction sets like SSE2 without sacrificing too much.
Part of that compatibility effort is that most of CVTT's ops use a UInt15 type. The reason for UInt15 is to handle architectures (like SSE2!) that don't support unsigned compares, min, or max, which means performing those operations on a 16-bit number requires flipping the high bit on both operands. For any number where we know the high bit is zero for both operands, that flip is unnecessary - and a huge number of operations in CVTT fit in 15 bits.
The compare flag types are basically vector booleans, where either all bits are 1 or all bits are 0 for a given lane - There's one type for 16-bit ints, and one for 32-bit floats, and they have to be converted since they're different widths. Those are combined with several utility functions, some of which, like SelectOrZero and NotConditionalSet, can elide a few operations.
The RoundForScope type is a nifty dual-use piece of code. SSE rounding modes are determined by the CSR register, not per-op, so RoundForScope when targeting SSE will set the CSR, and then reset it in its destructor. For other architectures, including the scalar target, the TYPE of the RoundForScope passed in is what determines the operation, so the same code works whether the rounding is per-op or per-scope.
While the ParallelMath architecture has been very resistant to bugs for the most part, where it has run into bugs, they've mostly been due to improper use of AnySet or AllSet - Cases where parallel code can behave improperly because lanes where the condition should exclude it are still executing, and need to be manually filtered out using conditionals.
BC1-7 common themes
Principal component analysis
BC4 and BC5
Unique cumulative offsets
T and H mode
ETC2 with punch-through, "virtual T mode"
H mode as T mode encoding
ETC2 alpha and EAC