Thursday, October 10, 2019

Porting Glider - Part 1

Recently found out that Classic Mac game Glider PRO's source code was released, so I'm starting a project called GlidePort to bring it to Windows, ideally in as faithful of a reproduction as possible and using the original data files.  Some additions like gamepad support may come at a later time if this stays on track.

While this is a chance to restore of the few iconic Mac-specific games of the era to, it's also a chance to explore a lot of the era technology, so I'll be doing some dev diaries about the process.

Porting Glider has a number of technical challenges: It's very much coded for the Mac platform, which has a lot of peculiarities compared to POSIX and Windows.  The preferred language for Mac OS was originally Pascal, so the C standard library is often mostly or entirely unused, and the Macintosh Toolbox (the operating system API)  has differences like preferring length-prefixed strings instead of C-style null terminated strings.

Data is in big endian format, as it was originally made for Motorola 68k and PowerPC CPUs.  Data files are split into two "forks," one as a flat data stream and the other as a resource database that the toolbox provides parsing facilities for.  In Mac development, parsing individual data elements was generally the preferred style vs. reading in whole structures, which leads to data formats often having variable-length strings and no padding for character buffer space or alignment.

Rendering is done using QuickDraw, the system-provided multimedia infrastructure.  Most images use the system-native PICT format, a vector format that is basically a list of QuickDraw commands.

At minimum, this'll require parsing a lot of Mac native resource formats, some Mac interchange formats (i.e. BinHex 4), reimplementation of a subset of QuickDraw and QuickTime, substitution of copyrighted fonts, and switch-out of numerous Mac-specific compiler extensions like dword literals and Pascal string escapes.

The plan for now is to implement the original UI in Qt, but I might rebuild the UI instead if that turns out to be impractical.

Wednesday, September 4, 2019

Efficient ETC compression using unique cumulative offsets

When adding ETC support to Convection Texture Tools, I decided to try adapting the cluster fit algorithm used for desktop formats to ETC.

Cluster fit works by sorting the pixels into an order based on a color axis, and then repeatedly evaluating each possible combination of counts of the number of pixels assigned to each index.  It does so by taking the pixels and applying a least-squares fit to produce the endpoint line.

For ETC, this is is simplified in a few ways: The axis is always 1,1,1, so the step of picking a good axis is unnecessary.  There is only one base color and the offsets are determined by the table index, so the clustering step would only solve the base color.

Assuming that you know what the offsets for each pixel are, the least squares fit amounts to simply subtracting the offset from each of the input pixels and averaging the result.

For a 4x2 block, there are 165 possible cluster configurations, but it turns out that some of those are redundant, given certain assumptions.  The base color is derived from the formula ((color1-offset1)+(color2-offset2)+...)/8, but since the adds are commutative, that's identical to ((color1+color2+...)-(offset1+offset2+...))/8

The first half of that is the total of the colors, which is constant.  The second is the total of the offsets.

Fortunately, not all of the possible combinations produce unique offsets.  Some of them cancel out, since adding 1 to or subtracting 1 from the count of the offsets that are negatives of each other produces no change.  In an example case, the count tuples (5,0,1,2) and (3,2,3,0) are the same, since 5*-L + 0*-S + 1*S + 2*L = 3*-L + 2*-S + 3*S + 0*L.

For most of the tables, this results in only 81 possible offset combinations.  For the first table, the large value is divisible by the small value, causing even more cancellations, and only 57 possible offset combinations.

Finally, most of the base colors produced by the offset combinations are not unique after quantization: Differential mode only has 5-bit color resolution, and differential mode only has 4-bit resolution, so after quantization, many of the results get mapped to the same color.  Deduplicating them is also inexpensive: If the offsets are checked in ascending order, then once the candidate color progresses past the threshold where the result could map to a specific quantized color, it will never cross back below that threshold, so deduplication only needs to inspect the last appended quantized color.

Together, these reduce the candidate set of base colors to a fairly small number, creating a very optimal search space at low cost.

There are a few circumstances where these assumptions don't hold:

One is when the clamping behavior comes into effect, particularly when a pixel channel's value is near 0 or 255.  In that case, this algorithm can't account for the fact that changing the value of the base color would have no effect on some of the offset colors.

One is when the pixels are not of equal importance, such as when using weight-by-alpha, which makes the offset additions non-commutative, but that only invalidates the cancellation part of the algorithm.  The color total can be pre-weighted, and the rest of the algorithm would have to rely on working more like cluster fit: Sort the colors along the 1,1,1 line and determine the weights for the pixels in that order, generate all 165 cluster combinations, and compute the weight totals for each one.  Sort them into ascending order, and then the rest of the algorithm should work.

One is when dealing with differential mode constraints, since not all base color pairs are legal.  There are some cases where a base color pair that is just barely illegal could be made legal by nudging the colors closer together, but in practice, this is rare: Usually, there is already a very similar individual mode color pair, or another differential mode pair that is only slightly worse.

In CVTT, I deal with differential mode by evaluating all of the possibilities and picking the best legal pair.  There's a shortcut case when the best base color for both blocks produces a legal differential mode pair, but this is admittedly a bit less than optimal: It picks the first evaluation in the case of a tie when searching for the best, but since blocks are evaluated starting with the largest combined negative offset, it's a bit more likely to pick colors far away from the base than colors close to the base, even though colors closer to the average tend to produce smaller offsets and are more likely to be legal, so this could be improved by making the tie-breaking function prefer smaller offsets.

In practice though, the differential mode search is not where most of the computation time is spent: Evaluating the actual base colors is.

As with the rest of CVTT's codecs, brute force is still key: The codec is designed to use 8-wide SSE2 16-bit math ops wherever possible to processing 8 blocks at once, but this creates a number of challenges since sorting and list creation are not amenable to vectorization.  I solve this by careful insertion of scalar ops, and the entire differential mode part is scalar as well.  Fortunately, as stated, the parts that have to be scalar are not major contributors to the encoding time.


You can grab the stand-alone CVTT encoding kernels here: https://github.com/elasota/ConvectionKernels

Friday, March 30, 2018

BC7 endpoint search using endpoint extrapolation

Convection Texture Tools is now roughly equal quality-wise with NVTT at compressing BC7 textures despite being about 140 times faster, making it one of the fastest and highest-quality BC7 compressors.

How this was accomplished turned out to be simpler than expected.  Recall that Squish became the gold standard of S3TC compressors by implementing a "cluster fit" algorithm that ordered all of the input colors on a line and tried every possible grouping of them to least-squares fit them.

Unfortunately, using this technique isn't practical in BC7 because the number of orderings has rather extreme scaling characteristics.  While 2-bit indices have a few hundred possible orderings, 4-bit indices have millions, most BC7 mode indices are 3 bits, and some have 4.

With that option gone, most BC7 compressors until now have tried to solve endpoints using various types of endpoint perturbation, which tends to require a lot of iterations.

Convection just uses 2 rounds of K-means clustering and a much simpler technique based on a guess about why Squish's cluster fit algorithm is actually useful: It can create endpoint mappings that don't use some of the terminal ends of the endpoint line, causing the endpoint to be extrapolated out, possibly to a point that loses less accuracy to quantization.

Convection just tries cutting off 1 index at each end, then 1 index at both ends.  That turned out to be enough to place it near the top of the quality benchmarks.

Now I just need to add color weighting and alpha weighting and it'll be time to move on to other formats.

Friday, December 2, 2011

Spherical harmonics self-shadowing

Valve's self-shadowing radiosity normal maps concept can be used with spherical harmonics in approximately the same way: Integrate a sphere based on how much light will affect a sample if incoming from numerous sample direction, accounting for collision with other samples due to elevation.

You can store this as three DXT1 textures, though you can improve quality by packing channels with similar spatial coherence. Coefficients 0, 2, and 6 in particular tend to pack well, since they're all dominated primarily by directions aimed perpendicular to the texture.

I use the following packing:
Texture 1: Coefs 0, 2, 6
Texture 2: Coefs 1, 4, 5
Texture 3: Coefs 3, 7, 8

You can reference an early post on this blog for code on how to rotate a SH vector by a matrix, in turn allowing you to get it into texture space. Once you've done that, simply multiply each SH coefficient from the self-shadowing map by the SH coefficients created from your light source (also covered on the previous post) and add together.

Tuesday, October 18, 2011

Introducing RDX

Has it really been a year since the last update?

Well, things have been chugging along with less discovery and more actual work. However, development on TDP is largely on hold due to the likely impending release of the Doom 3 source code, which has numerous architectural improvements like rigid-body physics and much better customization of entity networking.


In the meantime, however, a component of TDP has been spun off into its own project: The RDX extension language. Initially planned as a resource manager, it has evolved into a full-fledged programmability API. The main goal was to have a runtime with very straightforward integration, to the point that you can easily use it for managing your C++ resources, but also to be much higher performance than dynamically-typed interpreted languages, especially when dealing with complex data types such as float vectors.

Features are still being implemented, but the compiler seems to be stable and load-time conversion to native x86 code is functional. Expect a real release in a month or two.

The project now has a home on Google Code.

Thursday, October 7, 2010

YCoCg DXT5 - Stripped down and simplified

You'll recall some improvements I proposed to the YCoCg DXT5 algorithm a while back.

There's another realization of it I made recently: As a YUV-style color space, the Co and Cg channels are constrained to a range that's directly proportional to the Y channel. The addition of the scalar blue channel was mainly introduced to deal with resolution issues that caused banding artifacts on colored objects changing value, but the entire issue there can be sidestepped by simply using the Y channel as a multiplier for the Co and Cg channels, causing them to only respect tone and saturation while the Y channel becomes fully responsible for intensity.

This is not a quality improvement, in fact it nearly doubles PSNR in testing. However, it does result in considerable simplification of the algorithm, both on the encode and decode sides, and the perceptual loss compared to the old algorithm is very minimal.

This also simplifies the algorithm considerably:


int iY = px[0] + 2*px[1] + px[2]; // 0..1020
int iCo, iCg;

if (iY == 0)
{
iCo = 0;
iCg = 0;
}
else
{
iCo = (px[0] + px[1]) * 255 / iY;
iCg = (px[1] * 2) * 255 / iY;
}

px[0] = (unsigned char)iCo;
px[1] = (unsigned char)iCg;
px[2] = 0;
px[3] = (unsigned char)((iY + 2) / 4);


... And to decode:


float3 DecodeYCoCgRel(float4 inColor)
{
return (float3(4.0, 0.0, -4.0) * inColor.r
+ float3(-2.0, 2.0, -2.0) * inColor.g
+ float3(0.0, 0.0, 4.0)) * inColor.a;
}



While this does the job with much less perceptual loss than DXT1, and eliminates banding artifacts almost entirely, it is not quite as precise as the old algorithm, so using that is recommended if you need the quality.

Friday, June 4, 2010

... and they're still compressable

As a corollary to the last entry, an orthogonal tangent basis is commonly compressed by storing the normal and one of the texture axis vectors, along with a "handedness" multiplier which is either -1 or 1. The second texture axis is regenerated by taking the cross product of the normal and the stored axis, and multiplying it by the handedness.

The method I proposed was faulted for breaking this scheme, but there's no break at all. Since the two texture axes are on the triangle plane, and the normal is perpendicular, you can use the same compression scheme by simply storing the two texture axis vectors, and regenerating the normal by taking the cross product of them, multiplying it by a handedness multiplier, and normalizing it.

This does not address mirroring concerns if you use my "snap-to-normal" recommendation, though you could detect those cases in a vertex shader by using special handedness values.