When adding ETC support to Convection Texture Tools, I decided to try adapting the cluster fit algorithm used for desktop formats to ETC.
Cluster fit works by sorting the pixels along a color axis and then evaluating every possible combination of how many pixels are assigned to each index. For each combination, it applies a least-squares fit to the pixels to produce the endpoint line.
For ETC, this is simplified in a few ways: The axis is always 1,1,1, so the step of picking a good axis is unnecessary. There is only one base color and the offsets are determined by the table index, so the clustering step only has to solve for the base color.
Assuming that you know what the offsets for each pixel are, the least squares fit amounts to simply subtracting the offset from each of the input pixels and averaging the result.
For a 4x2 block (8 pixels, each assigned one of 4 offsets), there are 165 possible cluster configurations, but it turns out that some of those are redundant, given certain assumptions. The base color is derived from the formula ((color1-offset1)+(color2-offset2)+...)/8, but since the adds are commutative, that's identical to ((color1+color2+...)-(offset1+offset2+...))/8.
The first half of that is the total of the colors, which is constant. The second is the total of the offsets.
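The rearrangement above can be checked with a minimal Python sketch. The function names and the (r, g, b) tuple layout are assumptions made for illustration and are not taken from CVTT.

```python
def base_color(pixels, offsets):
    """Base color from the two totals: (color total - offset total) / N.

    pixels:  list of (r, g, b) tuples
    offsets: one scalar offset per pixel, applied along the 1,1,1 axis
    """
    n = len(pixels)
    offset_total = sum(offsets)  # the only part that depends on the index choice
    return tuple((sum(p[c] for p in pixels) - offset_total) / n for c in range(3))


def base_color_naive(pixels, offsets):
    """Same fit, written as 'subtract each offset, then average'."""
    n = len(pixels)
    return tuple(sum(p[c] - o for p, o in zip(pixels, offsets)) / n for c in range(3))
```

Since the color total is fixed for a block, only the offset total needs to be recomputed for each candidate configuration.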
Fortunately, not all of the possible combinations produce unique offset totals. Some of them cancel out, since adding 1 to (or subtracting 1 from) the counts of both offsets in a pair that are negatives of each other produces no change in the total. In an example case, with S and L being the small and large offset magnitudes of the table, the count tuples (5,0,1,2) and (3,2,3,0) are the same, since 5*-L + 0*-S + 1*S + 2*L = 3*-L + 2*-S + 3*S + 0*L.
For most of the tables, this results in only 81 possible offset combinations. For the first table, the large value is divisible by the small value, causing even more cancellations and leaving only 57 possible offset combinations.
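These counts can be reproduced by brute force. The sketch below is illustrative rather than CVTT code; the (small, large) pairs in the usage note, (2, 8) and (5, 17), are what I understand the first two ETC1 modifier tables to be, so treat them as an assumption.

```python
from itertools import combinations_with_replacement


def count_tuples(num_pixels=8, num_indices=4):
    """Yield every way to split num_pixels among num_indices offset selectors."""
    for combo in combinations_with_replacement(range(num_indices), num_pixels):
        yield tuple(combo.count(i) for i in range(num_indices))


def offset_total(counts, small, large):
    """Sum of all per-pixel offsets for one count tuple (-L, -S, +S, +L)."""
    offsets = (-large, -small, small, large)
    return sum(n * o for n, o in zip(counts, offsets))


def unique_offset_totals(small, large, num_pixels=8):
    """Distinct offset totals, in ascending order."""
    return sorted({offset_total(c, small, large) for c in count_tuples(num_pixels)})
```

With these definitions, `len(list(count_tuples()))` gives 165, `len(unique_offset_totals(5, 17))` gives 81, and `len(unique_offset_totals(2, 8))` gives 57.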
Finally, most of the base colors produced by the offset combinations are not unique after quantization: Differential mode only has 5-bit color resolution, and individual mode only has 4-bit resolution, so after quantization, many of the results get mapped to the same color. Deduplicating them is also inexpensive: If the offsets are checked in ascending order, then once the candidate color progresses past the threshold where the result could map to a specific quantized color, it will never cross back over that threshold, so deduplication only needs to inspect the last appended quantized color.
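A minimal sketch of that deduplication step follows. The round-to-nearest quantizer and all of the names here are assumptions for illustration; CVTT may quantize differently.

```python
def quantize_channel(value, bits):
    """Clamp to 0..255, then round to the nearest level of a `bits`-bit channel."""
    levels = (1 << bits) - 1
    value = max(0.0, min(255.0, value))
    return int(value * levels / 255.0 + 0.5)


def candidate_base_colors(color_total, offset_totals, bits, num_pixels=8):
    """Quantized base colors with duplicates removed.

    color_total:   (r, g, b) sums over the block's pixels
    offset_totals: offset sums for each configuration
    Because the totals are visited in ascending order, every channel moves
    monotonically, so a duplicate can only match the last appended entry.
    """
    candidates = []
    for total in sorted(offset_totals):
        q = tuple(quantize_channel((c - total) / num_pixels, bits) for c in color_total)
        if not candidates or candidates[-1] != q:
            candidates.append(q)
    return candidates
```

For example, with a color total of (1024, 1024, 1024), offset totals of -16, -8, 0, 8 and 16, and 4-bit quantization, the five unquantized base colors collapse to two candidates.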
Together, these reduce the candidate set of base colors to a fairly small number, creating a compact search space that still covers the useful candidates, at low cost.
There are a few circumstances where these assumptions don't hold:
One is when the clamping behavior comes into effect, particularly when a pixel channel's value is near 0 or 255. In that case, this algorithm can't account for the fact that changing the value of the base color would have no effect on some of the offset colors.
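The clamping problem can be seen in a single-channel toy example. This is only an illustration with made-up pixel values and hypothetical helper names, and it ignores base color quantization.

```python
def decode(base, offset):
    """Decoded channel value: base plus offset, clamped to 0..255."""
    return max(0, min(255, base + offset))


def block_error(base, pixels, offsets):
    """Sum of squared errors for one channel."""
    return sum((decode(base, o) - p) ** 2 for p, o in zip(pixels, offsets))


pixels = [255, 255, 255, 255, 250, 250, 250, 250]
offsets = [8, 8, 8, 8, -8, -8, -8, -8]

# Unclamped least-squares fit: average of (pixel - offset).
fitted_base = sum(p - o for p, o in zip(pixels, offsets)) / len(pixels)
```

Here `fitted_base` is 252.5, but the four bright pixels clamp to 255 either way, so raising the base to 255 only helps the darker pixels: `block_error(255, pixels, offsets)` is lower than `block_error(fitted_base, pixels, offsets)`.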
One is when the pixels are not of equal importance, such as when using weight-by-alpha. The offset total then depends on which pixel receives which offset, not just on the counts, but that only invalidates the cancellation part of the algorithm. The color total can be pre-weighted, and the rest of the algorithm would have to rely on working more like cluster fit: Sort the colors along the 1,1,1 line and determine the weights for the pixels in that order, generate all 165 cluster combinations, and compute the weight totals for each one. Sort them into ascending order, and then the rest of the algorithm should work.
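A rough sketch of that weighted variant is below. It is an assumption about how the idea could be written down, not something taken from CVTT, and the names are hypothetical.

```python
from itertools import combinations_with_replacement


def weighted_offset_totals(pixels, weights, small, large):
    """Weighted offset totals for all cluster combinations, ascending.

    pixels:  list of (r, g, b) tuples
    weights: one importance weight per pixel (e.g. derived from alpha)
    """
    offsets = (-large, -small, small, large)
    # Sort along the 1,1,1 axis: projecting onto it orders pixels by r + g + b.
    order = sorted(range(len(pixels)), key=lambda i: sum(pixels[i]))
    sorted_weights = [weights[i] for i in order]
    totals = []
    # Each combo is a non-decreasing index sequence, so darker pixels
    # receive the more negative offsets.
    for combo in combinations_with_replacement(range(4), len(pixels)):
        totals.append(sum(w * offsets[idx] for w, idx in zip(sorted_weights, combo)))
    totals.sort()
    return totals


def weighted_base_color(pixels, weights, weighted_offset_total):
    """(pre-weighted color total - weighted offset total) / weight total."""
    weight_total = sum(weights)
    return tuple(
        (sum(w * p[c] for p, w in zip(pixels, weights)) - weighted_offset_total) / weight_total
        for c in range(3)
    )
```

With uniform weights this reduces to the unweighted case: all 165 combinations are generated, and their totals collapse to the same distinct values as before.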
One is when dealing with differential mode constraints, since not all base color pairs are legal. There are some cases where a base color pair that is just barely illegal could be made legal by nudging the colors closer together, but in practice, this is rare: Usually, there is already a very similar individual mode color pair, or another differential mode pair that is only slightly worse.
In CVTT, I deal with differential mode by evaluating all of the possibilities and picking the best legal pair. There's a shortcut case when the best base color for both blocks produces a legal differential mode pair, but this is admittedly a bit less than optimal. It picks the first evaluation in the case of a tie when searching for the best, and since blocks are evaluated starting with the largest combined negative offset, it's a bit more likely to pick colors far away from the base than colors close to the base. Colors closer to the average tend to produce smaller offsets and are more likely to be legal, so this could be improved by making the tie-breaking function prefer smaller offsets.
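The suggested tie-break could look something like the following sketch. It is not CVTT's implementation: the candidate layout (error, offset total, 5-bit color) is hypothetical, and the legality test assumes ETC1's differential rule as I understand it, where each channel of the second base color must be within -4..+3 of the first in 5-bit units.

```python
def is_legal_differential(color1, color2):
    """True if color2 - color1 fits a 3-bit signed delta on every channel."""
    return all(-4 <= b - a <= 3 for a, b in zip(color1, color2))


def best_legal_pair(candidates1, candidates2):
    """Pick the legal pair with the lowest total error.

    Each candidate is (error, offset_total, color5). Ties on error are
    broken in favor of the smaller combined absolute offset total.
    Returns (color1, color2), or None if no pair is legal.
    """
    best = None
    best_key = None
    for err1, off1, col1 in candidates1:
        for err2, off2, col2 in candidates2:
            if not is_legal_differential(col1, col2):
                continue
            key = (err1 + err2, abs(off1) + abs(off2))
            if best_key is None or key < best_key:
                best_key, best = key, (col1, col2)
    return best
```

Comparing `(error, offset magnitude)` tuples keeps the tie-break out of the way unless the errors are exactly equal.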
In practice though, the differential mode search is not where most of the computation time is spent: Evaluating the actual base colors is.
As with the rest of CVTT's codecs, brute force is still key: The codec is designed to use 8-wide SSE2 16-bit math ops wherever possible to process 8 blocks at once, but this creates a number of challenges since sorting and list creation are not amenable to vectorization. I solve this by careful insertion of scalar ops, and the entire differential mode part is scalar as well. Fortunately, as stated, the parts that have to be scalar are not major contributors to the encoding time.
You can grab the stand-alone CVTT encoding kernels here: https://github.com/elasota/ConvectionKernels