Thursday, October 7, 2010

YCoCg DXT5 - Stripped down and simplified

You'll recall some improvements I proposed to the YCoCg DXT5 algorithm a while back.

There's another realization I've had about it since then: as a YUV-style color space, the Co and Cg channels are constrained to a range that's directly proportional to the Y channel. The scalar blue channel was mainly introduced to deal with resolution issues that caused banding artifacts on colored objects changing value, but that entire issue can be sidestepped by simply using the Y channel as a multiplier for the Co and Cg channels. The Co and Cg channels then only encode tone and saturation, while the Y channel becomes fully responsible for intensity.

This is not a quality improvement; in fact, it nearly doubles the measured error in testing. However, it does result in considerable simplification of the algorithm, both on the encode and decode sides, and the perceptual loss compared to the old algorithm is very minimal.

The encode side reduces to this:


int iY = px[0] + 2*px[1] + px[2]; // 0..1020
int iCo, iCg;

if (iY == 0)
{
    iCo = 0;
    iCg = 0;
}
else
{
    // Tone and saturation only: Co and Cg are normalized by the intensity
    iCo = (px[0] + px[1]) * 255 / iY;
    iCg = (px[1] * 2) * 255 / iY;
}

px[0] = (unsigned char)iCo;
px[1] = (unsigned char)iCg;
px[2] = 0; // The scalar blue channel is no longer needed
px[3] = (unsigned char)((iY + 2) / 4); // Y, rounded into 0..255, goes in alpha


... And to decode:


float3 DecodeYCoCgRel(float4 inColor)
{
    return (float3(4.0, 0.0, -4.0) * inColor.r
        + float3(-2.0, 2.0, -2.0) * inColor.g
        + float3(0.0, 0.0, 4.0)) * inColor.a;
}
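As a quick sanity check: with iY = R + 2*G + B, the shader sees Co = (R + G) / iY, Cg = 2*G / iY, and alpha ≈ iY / 1020, so the decode above works out to:

r = (4*Co - 2*Cg) * alpha = (4*(R + G) - 4*G) / 1020 = R / 255
g = (2*Cg) * alpha = 4*G / 1020 = G / 255
b = (-4*Co - 2*Cg + 4) * alpha = (4*iY - 4*(R + G) - 4*G) / 1020 = B / 255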



While this does the job with much less perceptual loss than DXT1, and eliminates banding artifacts almost entirely, it is not quite as precise as the old algorithm, so using that is recommended if you need the quality.

Friday, June 4, 2010

... and they're still compressible

As a corollary to the last entry, an orthogonal tangent basis is commonly compressed by storing the normal and one of the texture axis vectors, along with a "handedness" multiplier which is either -1 or 1. The second texture axis is regenerated by taking the cross product of the normal and the stored axis, and multiplying it by the handedness.

The method I proposed was faulted for breaking this scheme, but there's no break at all. Since the two texture axes are on the triangle plane, and the normal is perpendicular, you can use the same compression scheme by simply storing the two texture axis vectors, and regenerating the normal by taking the cross product of them, multiplying it by a handedness multiplier, and normalizing it.
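A minimal sketch of both forms, assuming the terVec3 helpers (Cross, Normalize2) from the tangent-basis code further down; the function names are mine:

// Conventional scheme: store the normal, one texture axis, and a handedness sign
terVec3 ReconstructTVec(const terVec3 &normal, const terVec3 &svec, float handedness)
{
    return normal.Cross(svec) * handedness;
}

// The scheme described above: store both texture axes and rebuild the normal
terVec3 ReconstructNormal(const terVec3 &svec, const terVec3 &tvec, float handedness)
{
    return (svec.Cross(tvec) * handedness).Normalize2();
}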

This does not address mirroring concerns if you use my "snap-to-normal" recommendation, though you could detect those cases in a vertex shader by using special handedness values.

Thursday, April 22, 2010

Tangent-space basis vectors: Don't assume your texture projection is orthogonal

How do you generate the tangent vectors, which represent which way the texture axes on a textured triangle are facing?

Hitting up Google tends to produce articles like this one, or maybe even that exact one. I've seen others linked too; the basic formulae tend to be the same. Have you looked at what you're pasting into your code, though? Have you noticed that you're using the T coordinates to calculate the S vector, and vice versa? If you look at the underlying math, you'll find that's what happens when you assume the normal, S vector, and T vector form an orthonormal matrix and attempt to invert it: in a sense, you're not really using the S and T vectors, but rather vectors perpendicular to them.

But that's fine, right? I mean, this is an orthogonal matrix, and they are perpendicular to each other, right? Well, does your texture project onto the triangle with the texture axes at right angles to each other, like a grid?


... Not always? Well, you might have a problem then!

So, what's the real answer?

Well, what do we know? First, translating the vertex positions will not affect the axial directions. Second, scrolling the texture will not affect the axial directions.

So, for triangle (A,B,C), with positions (x,y,z) and texture coordinates (s,t), we can create a new triangle (LA,LB,LC) = (A-A, B-A, C-A), subtracting A's texture coordinates as well, and the directions will be the same.

We also know that both axis directions are on the same plane as the points, so we can convert this into a local 2D coordinate system on the triangle plane and force one axis to zero: use the normalized LB as the local X axis, the triangle normal as the local Z axis, and their cross product as the local Y axis.



Now we need triangle (Origin, PLB, PLC) in this local coordinate space: PLB = (LB · localX, 0) and PLC = (LC · localX, LC · localY). We know PLB[y] is zero since LB was used as the X axis.


Now we can solve this. Remember that PLB[y] is zero, so...
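Written out with the same names as the code below (LBT and LCT being the texture-coordinate deltas, lbt and lct in the code), the system for the S axis is:

LBT.s = tsvS.x * PLB.x
LCT.s = tsvS.x * PLC.x + tsvS.y * PLC.y

... which gives:

tsvS.x = LBT.s / PLB.x
tsvS.y = (LCT.s - tsvS.x * PLC.x) / PLC.y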


Do this for both axes and you have your correct texture axis vectors, regardless of the texture projection. You can then multiply the results by your tangent-space normalmap, normalize the result, and have a proper world-space surface normal.

As always, the source code spoilers:

terVec3 lb = ti->points[1] - ti->points[0];
terVec3 lc = ti->points[2] - ti->points[0];
terVec2 lbt = ti->texCoords[1] - ti->texCoords[0];
terVec2 lct = ti->texCoords[2] - ti->texCoords[0];

// Generate local space for the triangle plane
terVec3 localX = lb.Normalize2();
terVec3 localZ = lb.Cross(lc).Normalize2();
terVec3 localY = localX.Cross(localZ).Normalize2();

// Determine X/Y vectors in local space
float plbx = lb.DotProduct(localX);
terVec2 plc = terVec2(lc.DotProduct(localX), lc.DotProduct(localY));

terVec2 tsvS, tsvT;

tsvS[0] = lbt[0] / plbx;
tsvS[1] = (lct[0] - tsvS[0]*plc[0]) / plc[1];
tsvT[0] = lbt[1] / plbx;
tsvT[1] = (lct[1] - tsvT[0]*plc[0]) / plc[1];

ti->svec = (localX*tsvS[0] + localY*tsvS[1]).Normalize2();
ti->tvec = (localX*tsvT[0] + localY*tsvT[1]).Normalize2();


There's an additional special case to be aware of: Mirroring.

Mirroring across an edge can cause wild changes in a vector's direction, possibly even degenerating it. There isn't a clear-cut solution to these, but you can work around the problem by snapping the vector to the normal, effectively cancelling it out on the mirroring edge.

Personally, I check the angle between the two vectors, and if they're more than 90 degrees apart, I cancel them, otherwise I merge them.
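A rough sketch of that rule (assuming the usual arithmetic operators exist on terVec3; the helper name is mine):

// Combine two candidate tangents meeting at a vertex. If they're more than
// 90 degrees apart, the seam is mirrored: snap to the normal to cancel it out.
// Otherwise, accumulate and renormalize.
terVec3 MergeTangents(const terVec3 &a, const terVec3 &b, const terVec3 &normal)
{
    if (a.DotProduct(b) < 0.0f)
        return normal;
    return (a + b).Normalize2();
}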

Thursday, February 11, 2010

Volumetric fog spoilers

Okay, so you want to make volumetric fog. Volumetric fog has moved on from its days as largely a gimmick to being situationally useful, but there are still some difficulties: in particular, it's really difficult to model changes in the lighting inside the fog. There are techniques you can use for volumetric shadows within the fog, like rendering the depths of the front and back faces of non-overlapping volumes into a pair of accumulation textures and using the difference between the two to determine the distance penetrated.

Let's focus on a simpler implementation though: Planar, infinite, and with a linear transitional region. A transitional region is nice because it means the fog appears to gradually taper off instead of being conspicuously contained entirely below a flat plane.

In practice, there is one primary factor that needs to be determined: the amount of fog penetrated by the line from the viewpoint to the surface. In determining that, the transition layer and the full-density layer need to be calculated separately:

Transition layer

For the transition layer, what you want to do is multiply the distance traveled through the transition layer by the average density of the fog along that stretch. Fortunately, due to some quirks of the math involved, there's a very easy way to get this: the midpoint of the entry and exit points of the transition region is located where the fog density is equal to the average density passed through. The entry and exit points can be found by taking the viewpoint and target distances and clamping them to the entry and exit planes.
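Since the density varies linearly with depth inside the transition zone, its average over the traversed segment is just its value at the segment's midpoint:

averageDensity = (density(entry) + density(exit)) / 2 = density((entry + exit) / 2)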

Full-density layer


The full-density layer is a bit more involved, since it behaves differently depending on whether the camera is inside or outside of the fog. For a camera inside the fog, the fogged portion is the distance from the camera to the fog plane; for a camera outside of the fog, the fogged portion is the distance from the object to the fog plane. If you want to do it in one pass, both of these cases can be expressed by dividing one linearly interpolated plane distance by the difference between the camera's and the point's distances to the fog plane.


Since the camera being inside or outside the fog is completely determinable in advance, you can easily make permutations based on it and skip a branch in the shader. With a deferred renderer, you can use depth information and the fog plane to determine all of the distances. With a forward renderer, most of the distance factors interpolate linearly, allowing you to do some clamps and divides entirely in the shader.

Regardless of which you use, once you have the complete distance traveled, the most physically accurate determination of the amount of the original scene still visible is:

min(1, e^-(length(cameraToVert) * coverage * density))

You don't have to use e as the base, though: using 2 is a bit faster, and you can rescale the density coefficient to achieve any behavior you could have attained using e.
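For instance, switching bases just folds a constant into the density coefficient (log2(e) ≈ 1.442695):

e^-(d * x) == 2^-((d * 1.442695) * x)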


As usual, the shader code spoilers:


// EncodeFog : Encodes a 4-component vector containing fraction components used
// to calculate fog factor
float4 VEncodeFog(float3 cameraPos, float3 vertPos, float4 fogPlane, float fogTransitionDepth)
{
    float cameraDist, pointDist;

    cameraDist = dot(cameraPos, fogPlane.xyz);
    pointDist = dot(vertPos, fogPlane.xyz);

    return float4(cameraDist, fogPlane.w, fogPlane.w - fogTransitionDepth, pointDist);
}

// PDecodeFog : Returns the fraction of the original scene to display given
// an encoded fog fraction and the camera-to-vertex vector
// rcpFogTransitionDepth = 1/fogTransitionDepth
float PDecodeFog(float4 fogFactors, float3 cameraToVert, float fogDensityScalar, float rcpFogTransitionDepth)
{
    // x = cameraDist, y = shallowFogPlaneDist, z = deepFogPlaneDist (< shallow), w = pointDist
    float3 diffs = fogFactors.wzz - fogFactors.xxw;

    float cameraToPointDist = diffs.x;
    float cameraToFogDist = diffs.y;
    float nPointToFogDist = diffs.z;

    float rAbsCameraToPointDist = 1.0 / abs(cameraToPointDist);

    // Calculate the average density of the transition zone fog
    // Since density is linear, this will be the same as the density at the midpoint of the ray,
    // clamped to the boundaries of the transition zone
    float clampedCameraTransitionPoint = max(fogFactors.z, min(fogFactors.y, fogFactors.x));
    float clampedPointTransitionPoint = max(fogFactors.z, min(fogFactors.y, fogFactors.w));
    float transitionPointAverage = (clampedPointTransitionPoint + clampedCameraTransitionPoint) * 0.5;

    float transitionAverageDensity = (fogFactors.y - transitionPointAverage) * rcpFogTransitionDepth;

    // Determine a coverage factor based on the density and the fraction of the ray that passed through the transition zone
    float transitionCoverage = transitionAverageDensity *
        abs(clampedCameraTransitionPoint - clampedPointTransitionPoint) * rAbsCameraToPointDist;

    // Calculate coverage for the full-density portion of the volume as the fraction of the ray intersecting
    // the bottom part of the transition zone
#ifdef CAMERA_IN_FOG
    float fullCoverage = cameraToFogDist * rAbsCameraToPointDist;
    if(nPointToFogDist >= 0.0)
        fullCoverage = 1.0;
#else
    float fullCoverage = max(0.0, nPointToFogDist * rAbsCameraToPointDist);
#endif

    float totalCoverage = fullCoverage + transitionCoverage;

    // Use inverse exponential scaling with distance
    // fogDensityScalar is pre-negated
    return min(1.0, exp2(length(cameraToVert) * totalCoverage * fogDensityScalar));
}

Wednesday, February 10, 2010

Self-shadowing textures using radial horizon maps


I don't think I need to extoll the virtues of self-shadowing textures, so instead, I'll simply present a method for doing them:

Any given point on a heightmap can represent all of the shadowing from the surrounding heightmap as a waveform h = f(t), where t is a radial angle across the base plane, and h is the sine of the angle from the base plane to the highest-angled heightmap sample in that direction. h, in other words, is the horizon level for a light coming from that direction.

Anyway, the method for precomputing this should be pretty obvious: fire traces from each sample and find the shallowest angle that clears all heightmap samples in each direction. Once you have it, it's simply a matter of encoding it, which can easily be done using a Fourier series. Each Fourier band requires 2 coefficients, except the first, which requires one since its sine term is zero. I use 5 coefficients (stored as an RGBA8 and an A8), but 3 works acceptably. More coefficients require more storage, but produce sharper shadows.

Fourier series are really straightforward, but here are the usual Cartesian coordinate spoilers:

Assuming x = sin(t) and y = cos(t)

sin(0t) = 0
cos(0t) = 1
sin(t) = x
cos(t) = y
sin(2t) = 2*x*y
cos(2t) = y*y - x*x

The squared constant band averages to 1 over the circle, but all other squared bands average to 0.5. That means you need to double whatever you project onto those bands to reproduce the original waveform.
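For the precompute side, the projection itself is just an average over the circle. A minimal sketch (the function name and sampling setup are mine, not from the actual tool; needs <math.h>):

// Project numSamples evenly-spaced horizon values h(t) onto the five bands above:
// constant, sin(t), cos(t), sin(2t), cos(2t).
void ProjectHorizonToFourier(const float *horizon, int numSamples, float outCoefs[5])
{
    for(int b=0;b<5;b++)
        outCoefs[b] = 0.0f;

    for(int i=0;i<numSamples;i++)
    {
        float t = (float)i / (float)numSamples * 2.0f * 3.14159265f;
        float x = sinf(t);
        float y = cosf(t);

        outCoefs[0] += horizon[i];               // constant
        outCoefs[1] += horizon[i] * x;           // sin(t)
        outCoefs[2] += horizon[i] * y;           // cos(t)
        outCoefs[3] += horizon[i] * 2.0f*x*y;    // sin(2t)
        outCoefs[4] += horizon[i] * (y*y - x*x); // cos(2t)
    }

    for(int b=0;b<5;b++)
    {
        outCoefs[b] /= (float)numSamples;        // average over the circle
        if(b > 0)
            outCoefs[b] *= 2.0f;                 // non-constant bands average to 0.5, so double
    }
}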

There's one other problem, which is that the horizon will move rapidly at shallow angles, right where precision breaks down. You need more precision at low values, and my recommended way of doing that is storing the sign-preserving square root of the original value, and multiplying the stored value by the absolute value of itself in the shader.

For added effect, rather than simply doing a comparison of the light angle to the horizon level, you can take the difference of the two, multiply it by some constant, and saturate. This will produce a softer shadow edge.

Shader code to decode this, taking a texture coordinate and a tangent-space direction to the light source:

float SelfShadowing(float3 tspaceLightDir, float2 tc)
{
    float3 nLightDir = normalize(tspaceLightDir);
    float4 lqCoefs = tex2D(pHorizonLQMap, tc) * 4.0 - 2.0; // Premultiply by 2
    float cCoef = tex2D(pHorizonCCUMap, tc).x * 2.0 - 1.0;

    lqCoefs *= abs(lqCoefs); // Undo dynamic range compression
    cCoef *= abs(cCoef);

    float2 tspaceRadial = normalize(float3(tspaceLightDir.xy, 0.0)).xy;

    float3 qmultipliers = tspaceRadial.xyx * tspaceRadial.yyx * float3(2.0, 1.0, -1.0);

    float horizonLevel = cCoef + dot(lqCoefs.rg, tspaceRadial)
        + dot(lqCoefs.baa, qmultipliers);

    // Compare the normalized light direction's sine of elevation to the horizon level
    return saturate(0.5 + 20.0*(nLightDir.z - horizonLevel));
}


An additional optimization would be to exploit the fact that the range of the coefficients other than the constant band is not -1..1, but rather, -2/pi..2/pi. Expanding this to -1..1 gives you a good deal of additional precision.
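Concretely (a sketch of the remap; apply it before the square-root compression above and undo it in the shader):

encoded = coef * (3.14159265f * 0.5f);   // expand -2/pi..2/pi to -1..1
coef = decoded * (2.0f / 3.14159265f);   // shader side, after undoing the compression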

Monday, January 18, 2010

Spherical harmonics spoilers

Spherical harmonics seem to have acquired an air of impenetrable difficulty, especially in the indie scene, which has little to go on other than a few presentations and whitepapers, some of which even contain incorrect information (e.g. one of the formulas in the Sony paper on the topic is incorrect), and most of which are still using ZYZ rotations because it's so hard to find how to do a matrix rotation.

Hao Chen and Xinguo Liu did a presentation at SIGGRAPH '08, and the slides from it contain a good deal of useful stuff, not least one of the ONLY easy-to-find rotate-by-matrix functions. It treats the Z axis a bit awkwardly, though, so I patched the rotation code up a bit, and added a pre-integrated cosine convolution filter so you can easily get SH coefficients for a directional light.

There was also gratuitous use of sqrt(3) multipliers, which can be completely eliminated by simply premultiplying or predividing coef #6 by it, which incidentally causes all of the constants and multipliers to resolve to rational numbers.

As always, you can include multiple lights by simply adding the SH coefs for them together. If you want specular, you can approximate a directional light by using the linear component to determine the direction, and constant component to determine the color. You can do this per-channel, or use the average values to determine the direction and do it once.
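A minimal sketch of that directional-light approximation, using the coefficient layout from the code below (1 = Y, 2 = Z, 3 = X); the function name and the lack of any exact intensity scaling are my own simplifications:

void ApproximateDirectionalLight(const float shR[9], const float shG[9], const float shB[9],
    terVec3 &outDir, terVec3 &outColor)
{
    // Direction from the summed linear bands, color from the constant bands
    outDir = terVec3(shR[3] + shG[3] + shB[3],
        shR[1] + shG[1] + shB[1],
        shR[2] + shG[2] + shB[2]).Normalize2();

    outColor = terVec3(shR[0], shG[0], shB[0]);
}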

Here are the spoilers:

#define SH_AMBIENT_FACTOR   (0.25f)
#define SH_LINEAR_FACTOR (0.5f)
#define SH_QUADRATIC_FACTOR (0.3125f)

void LambertDiffuseToSHCoefs(const terVec3 &dir, float out[9])
{
    // Constant
    out[0] = 1.0f * SH_AMBIENT_FACTOR;

    // Linear
    out[1] = dir[1] * SH_LINEAR_FACTOR;
    out[2] = dir[2] * SH_LINEAR_FACTOR;
    out[3] = dir[0] * SH_LINEAR_FACTOR;

    // Quadratics
    out[4] = ( dir[0]*dir[1] ) * 3.0f*SH_QUADRATIC_FACTOR;
    out[5] = ( dir[1]*dir[2] ) * 3.0f*SH_QUADRATIC_FACTOR;
    out[6] = ( 1.5f*( dir[2]*dir[2] ) - 0.5f ) * SH_QUADRATIC_FACTOR;
    out[7] = ( dir[0]*dir[2] ) * 3.0f*SH_QUADRATIC_FACTOR;
    out[8] = 0.5f*( dir[0]*dir[0] - dir[1]*dir[1] ) * 3.0f*SH_QUADRATIC_FACTOR;
}


void RotateCoefsByMatrix(float outCoefs[9], const float pIn[9], const terMat3x3 &rMat)
{
    // DC
    outCoefs[0] = pIn[0];

    // Linear
    outCoefs[1] = rMat[1][0]*pIn[3] + rMat[1][1]*pIn[1] + rMat[1][2]*pIn[2];
    outCoefs[2] = rMat[2][0]*pIn[3] + rMat[2][1]*pIn[1] + rMat[2][2]*pIn[2];
    outCoefs[3] = rMat[0][0]*pIn[3] + rMat[0][1]*pIn[1] + rMat[0][2]*pIn[2];

    // Quadratics
    outCoefs[4] = (
          ( rMat[0][0]*rMat[1][1] + rMat[0][1]*rMat[1][0] ) * ( pIn[4] )
        + ( rMat[0][1]*rMat[1][2] + rMat[0][2]*rMat[1][1] ) * ( pIn[5] )
        + ( rMat[0][2]*rMat[1][0] + rMat[0][0]*rMat[1][2] ) * ( pIn[7] )
        + ( rMat[0][0]*rMat[1][0] ) * ( pIn[8] )
        + ( rMat[0][1]*rMat[1][1] ) * ( -pIn[8] )
        + ( rMat[0][2]*rMat[1][2] ) * ( 3.0f*pIn[6] )
        );

    outCoefs[5] = (
          ( rMat[1][0]*rMat[2][1] + rMat[1][1]*rMat[2][0] ) * ( pIn[4] )
        + ( rMat[1][1]*rMat[2][2] + rMat[1][2]*rMat[2][1] ) * ( pIn[5] )
        + ( rMat[1][2]*rMat[2][0] + rMat[1][0]*rMat[2][2] ) * ( pIn[7] )
        + ( rMat[1][0]*rMat[2][0] ) * ( pIn[8] )
        + ( rMat[1][1]*rMat[2][1] ) * ( -pIn[8] )
        + ( rMat[1][2]*rMat[2][2] ) * ( 3.0f*pIn[6] )
        );

    outCoefs[6] = (
          ( rMat[2][1]*rMat[2][0] ) * ( pIn[4] )
        + ( rMat[2][2]*rMat[2][1] ) * ( pIn[5] )
        + ( rMat[2][0]*rMat[2][2] ) * ( pIn[7] )
        + 0.5f*( rMat[2][0]*rMat[2][0] ) * ( pIn[8] )
        + 0.5f*( rMat[2][1]*rMat[2][1] ) * ( -pIn[8] )
        + 1.5f*( rMat[2][2]*rMat[2][2] ) * ( pIn[6] )
        - 0.5f * ( pIn[6] )
        );

    outCoefs[7] = (
          ( rMat[0][0]*rMat[2][1] + rMat[0][1]*rMat[2][0] ) * ( pIn[4] )
        + ( rMat[0][1]*rMat[2][2] + rMat[0][2]*rMat[2][1] ) * ( pIn[5] )
        + ( rMat[0][2]*rMat[2][0] + rMat[0][0]*rMat[2][2] ) * ( pIn[7] )
        + ( rMat[0][0]*rMat[2][0] ) * ( pIn[8] )
        + ( rMat[0][1]*rMat[2][1] ) * ( -pIn[8] )
        + ( rMat[0][2]*rMat[2][2] ) * ( 3.0f*pIn[6] )
        );

    outCoefs[8] = (
          ( rMat[0][1]*rMat[0][0] - rMat[1][1]*rMat[1][0] ) * ( pIn[4] )
        + ( rMat[0][2]*rMat[0][1] - rMat[1][2]*rMat[1][1] ) * ( pIn[5] )
        + ( rMat[0][0]*rMat[0][2] - rMat[1][0]*rMat[1][2] ) * ( pIn[7] )
        + 0.5f*( rMat[0][0]*rMat[0][0] - rMat[1][0]*rMat[1][0] ) * ( pIn[8] )
        + 0.5f*( rMat[0][1]*rMat[0][1] - rMat[1][1]*rMat[1][1] ) * ( -pIn[8] )
        + 0.5f*( rMat[0][2]*rMat[0][2] - rMat[1][2]*rMat[1][2] ) * ( 3.0f*pIn[6] )
        );
}


... and to sample it in the shader ...


float3 SampleSHQuadratic(float3 dir, float3 shVector[9])
{
    float3 ds1 = dir.xyz*dir.xyz;
    float3 ds2 = dir*dir.yzx; // xy, yz, zx

    float3 v = shVector[0];

    v += dir.y * shVector[1];
    v += dir.z * shVector[2];
    v += dir.x * shVector[3];

    v += ds2.x * shVector[4];
    v += ds2.y * shVector[5];
    v += (ds1.z * 1.5 - 0.5) * shVector[6];
    v += ds2.z * shVector[7];
    v += (ds1.x - ds1.y) * 0.5 * shVector[8];

    return v;
}


For Monte Carlo integration, take sampling points, feed direction "dir" to the following function to get multipliers for each coefficient, then multiply by the intensity in that direction. Divide the total by the number of sampling points:


void SHForDirection(const terVec3 &dir, float out[9])
{
    // Constant
    out[0] = 1.0f;

    // Linear
    out[1] = dir[1] * 3.0f;
    out[2] = dir[2] * 3.0f;
    out[3] = dir[0] * 3.0f;

    // Quadratics
    out[4] = ( dir[0]*dir[1] ) * 15.0f;
    out[5] = ( dir[1]*dir[2] ) * 15.0f;
    out[6] = ( 1.5f*( dir[2]*dir[2] ) - 0.5f ) * 5.0f;
    out[7] = ( dir[0]*dir[2] ) * 15.0f;
    out[8] = 0.5f*( dir[0]*dir[0] - dir[1]*dir[1] ) * 15.0f;
}


... and finally, for a uniformly-distributed random point on a sphere ...


terVec3 RandomDirection(int (*randomFunc)(), int randMax)
{
    float u = (((float)randomFunc()) / (float)(randMax - 1))*2.0f - 1.0f;
    float n = sqrtf(1.0f - u*u);

    float theta = 2.0f * M_PI * (((float)randomFunc()) / (float)(randMax));

    return terVec3(n * cos(theta), n * sin(theta), u);
}
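Putting the last two functions together, the Monte Carlo loop described above might look something like this (a sketch; sampleEnvironment is a placeholder for however you fetch intensity in a given direction, and rand/RAND_MAX stand in for whatever RNG you actually use):

void ProjectEnvironmentToSH(int numSamples, float outCoefs[9])
{
    float weights[9];

    for(int i=0;i<9;i++)
        outCoefs[i] = 0.0f;

    for(int s=0;s<numSamples;s++)
    {
        terVec3 dir = RandomDirection(rand, RAND_MAX);
        float intensity = sampleEnvironment(dir); // placeholder

        SHForDirection(dir, weights);
        for(int i=0;i<9;i++)
            outCoefs[i] += weights[i] * intensity;
    }

    // Divide the total by the number of sampling points
    for(int i=0;i<9;i++)
        outCoefs[i] /= (float)numSamples;
}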

2.0 gamma textures and full-range scalars in YCoCg DXT5

A few years back there was a publication on real-time YCoCg DXT5 texture compression. There are two improvements on the technique I feel I should present:

There's a pretty clear problem right off the bat: It's not particularly friendly to linear textures. If you simply attempt to convert sRGB values into linear space and store the result in YCoCg, you will experience severe banding owing largely to the loss of precision at lower values. Gamma space provides a lot of precision at lower intensity values where the human visual system is more sensitive.

sRGB texture modes exist as a method to cheaply convert from gamma space to linear, and are pretty fast since GPUs can just use a look-up table to get the linear values, but YCoCg can't be treated as an sRGB texture and doing sRGB decodes in the shader is fairly slow since it involves a divide, power raise, and conditional.

The first improvement addresses this by simply converting from a 2.2-ish sRGB gamma ramp to a 2.0 gamma ramp, which preserves most of the original range: 255 input values map to 240 output values, low-intensity values keep most of their precision, and the result can be linearized by simply squaring it in the shader.


Another concern, which isn't really an issue if you're aiming for speed and doing things in real time, but is if you're considering using such a technique for offline processing, is the limited scale factor. DXT5 gives the blue channel enough resolution for 32 possible scale factor values, so there isn't any reason to limit it to 1, 2, or 4 if you don't have to. Using the full range gives you more color resolution to work with.
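As a worked example of the quantization in the code below: a block with extents of 400 needs scaleFactor = 400 * 4 / 510 ≈ 3.14, which quantizes to ceil((3.14 - 1) * 31 / 3) = 23, unquantizes back to 1 + (23 / 31) * 3 ≈ 3.23, and is stored in the blue channel as (23 << 3) | (23 >> 2) = 189. Using ceil here means the reconstructed scale is always at least as large as the required one, so the scaled Co/Cg values never get pushed out of range.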


Here's some sample code:


unsigned char Linearize(unsigned char inByte)
{
    float srgbVal = ((float)inByte) / 255.0f;
    float linearVal;

    if(srgbVal < 0.04045)
        linearVal = srgbVal / 12.92f;
    else
        linearVal = pow( (srgbVal + 0.055f) / 1.055f, 2.4f);

    return (unsigned char)(floor(sqrt(linearVal) * 255.0 + 0.5));
}

void ConvertBlockToYCoCg(const unsigned char inPixels[16*3], unsigned char outPixels[16*4])
{
    unsigned char linearizedPixels[16*3]; // Convert to linear values

    for(int i=0;i<16*3;i++)
        linearizedPixels[i] = Linearize(inPixels[i]);

    // Calculate Co and Cg extents
    int extents = 0;
    int n = 0;
    int iY, iCo, iCg;
    int blockCo[16];
    int blockCg[16];
    const unsigned char *px = linearizedPixels;
    for(int i=0;i<16;i++)
    {
        iCo = (px[0]<<1) - (px[2]<<1);
        iCg = (px[1]<<1) - px[0] - px[2];
        if(-iCo > extents) extents = -iCo;
        if( iCo > extents) extents = iCo;
        if(-iCg > extents) extents = -iCg;
        if( iCg > extents) extents = iCg;

        blockCo[n] = iCo;
        blockCg[n++] = iCg;

        px += 3;
    }

    // Co = -510..510
    // Cg = -510..510
    float scaleFactor = 1.0f;
    if(extents > 127)
        scaleFactor = (float)extents * 4.0f / 510.0f;

    // Convert to quantized scalefactor
    unsigned char scaleFactorQuantized = (unsigned char)(ceil((scaleFactor - 1.0f) * 31.0f / 3.0f));

    // Unquantize
    scaleFactor = 1.0f + (float)(scaleFactorQuantized / 31.0f) * 3.0f;

    unsigned char bVal = (unsigned char)((scaleFactorQuantized << 3) | (scaleFactorQuantized >> 2));

    unsigned char *outPx = outPixels;

    n = 0;
    px = linearizedPixels;
    for(int i=0;i<16;i++)
    {
        // Calculate components
        iY = ( px[0] + (px[1]<<1) + px[2] + 2 ) / 4;
        iCo = ((blockCo[n] / scaleFactor) + 128);
        iCg = ((blockCg[n] / scaleFactor) + 128);
        n++;

        if(iCo < 0) iCo = 0; else if(iCo > 255) iCo = 255;
        if(iCg < 0) iCg = 0; else if(iCg > 255) iCg = 255;
        if(iY < 0) iY = 0; else if(iY > 255) iY = 255;

        px += 3;

        outPx[0] = (unsigned char)iCo;
        outPx[1] = (unsigned char)iCg;
        outPx[2] = bVal;
        outPx[3] = (unsigned char)iY;

        outPx += 4;
    }
}




.... And to decode it in the shader ...



float3 DecodeYCoCg(float4 inColor)
{
    float3 base = inColor.arg + float3(0, -0.5, -0.5);
    float scale = (inColor.b*0.75 + 0.25);
    float4 multipliers = float4(1.0, 0.0, scale, -scale);
    float3 result;

    result.r = dot(base, multipliers.xzw);
    result.g = dot(base, multipliers.xyz);
    result.b = dot(base, multipliers.xww);

    // Convert from 2.0 gamma to linear
    return result*result;
}