« More iPhone | Main | The longest 36-hours of my life. »
Moonlight colorspace conversion
I took on the task recently of implementing the YUV to RGB colorspace converter for Moonlight. There is a lot of code out there that does this already, in fact we were using libswscale to do it previously, but this was less than ideal for us from a performance and licensing pespective. Let me start with a bit of background on YUV to RGB conversions (4:2:0 specifically) and then I'll describe our C/MMX/SSE2 implementation.
Traditional digital media is tranmitted in a colorspace known as YUV. YUV splits luminence and chrominance to display an image rather than putting it in RGB. This is mainly because of how the human eye detects color. We can share chrominance values that are near by other luminance values. The 4:2:0 colorspace shares the same U/V value for 4 Y values in the following manner:
Lets assume a 16 pixel wide image, 4 pixels high. Our planes would look as follows:
Y00 Y01 Y02 Y03 Y04 Y05 Y06 Y07 Y08 Y09 Y10 Y11 Y12 Y13 Y14 Y15 Y16 Y17 Y18 Y19 Y20 Y21 Y22 Y23 Y24 Y25 Y26 Y27 Y28 Y29 Y30 Y31 Y32 Y33 Y34 Y35 Y36 Y37 Y38 Y39 Y40 Y41 Y42 Y43 Y44 Y45 Y46 Y47 Y48 Y49 Y50 Y51 Y52 Y53 Y54 Y55 Y56 Y57 Y58 Y59 Y60 Y61 Y62 Y63 U00 U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 V00 V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
The Y planes are split into 2x2 blocks to determine the corresponding U/V plane
Y00 Y01 Y02 Y03 ... Y16 Y17 Y18 Y19 ... U00 U01 ... V00 V01 ...
YUV to RGB can be mathematically represented as:
R = 1.164 * (Y - 16) + 1.596 * (V - 128) G = 1.164 * (Y - 16) - 0.813 * (V - 128) - 0.391 * (U - 128) B = 1.164 * (Y - 16) + 2.018 * (U - 128)
A corresponding integer math based approxmation can be represented as:
R = CLAMP((298 * (Y - 16) + 409 * (V - 128) + 128) >> 8, 0, 255); G = CLAMP((298 * (Y - 16) - 100 * (U - 128) - 208 * (V - 128) + 128) >> 8, 0, 255); B = CLAMP((298 * (Y - 16) + 516 * (U - 128) + 128) >> 8, 0, 255);
So a simple C based implementation of this algorithm for YUV would be:
static inline void YUV444ToBGRA(uint8_t Y, uint8_t U, uint8_t V, uint8_t *dst)
{
dst[2] = CLAMP((298 * (Y - 16) + 409 * (V - 128) + 128) >> 8, 0, 255);
dst[1] = CLAMP((298 * (Y - 16) - 100 * (U - 128) - 208 * (V - 128) + 128) >> 8, 0, 255);
dst[0] = CLAMP((298 * (Y - 16) + 516 * (U - 128) + 128) >> 8, 0, 255);
dst[3] = 0xFF;
}
void Convert (int width, int height, uint8_t *y_plane, uint8_t *u_plane, uint8_t *v_plane, uint8_t *dest) {
int i, j;
uint8_t *y_row1 = y_plane;
uint8_t *y_row2 = y_plane+width;
uint8_t *dest_row1 = dest;
uint8_t *dest_row2 = dest+(width*4);
for (int i = 0; i < height; i++, y_row1+=width, y_row2+=width, dest_row1+=(width*4), dest_row2+=(width*4)) {
// Unroll the loop
for (int j = 0; j < width >> 1; j++, dest_row1+=8, dest_row2+=8, y_row1+=2, y_row2+=2, u_plane+=1, v_plane+=1) {
// Process Y1 U0 V0
YUV444ToBGRA (*y_row1, *u_plane, *v_plane, dest_row1);
// Process Y2 U0 V0
YUV444ToBGRA (y_row1[1], *u_plane, *v_plane, (dest_row1+4));
// Process Y16 U0 V0
YUV444ToBGRA (*y_row2, *u_plane, *v_plane, dest_row2);
// Process Y17 U0 V0
YUV444ToBGRA (y_row2[1], *u_plane, *v_plane, (dest_row2+4));
}
}
}
This works perfectly well, however its slow. On modern architectures we have SIMD processors which allow us to do simple math like this to multiple vales at the same time. So lets dissect our MMX implementation of YUV2RGB.
Taking our above mathematical representation:
R = 1.164 * (Y - 16) + 1.596 * (V - 128) G = 1.164 * (Y - 16) - 0.813 * (V - 128) - 0.391 * (U - 128) B = 1.164 * (Y - 16) + 2.018 * (U - 128)
This math isn't ideal for a SIMD, so we're going to promote and demote precision (64 is a simple shift operation):
R = (1.164*64)*(Y-16)/64 + (1.596*64)*(V-128)/64 G = (1.164*64)*(Y-16)/64 - (0.813*64)*(V-128)/64 - (0.391*64)*(U-128)/64 B = (1.164*64)*(Y-16)/64 + (2.018*64)*(U-128)/64
Expanding out our contants:
R_V = 1.596*64 =~ 102 G_V = 0.813*64 =~ 52 G_U = 0.391*64 =~ 25 B_U = 2.018*64 =~ 129 Y_C = 1.164*64 =~ 74
We store all these constants in a constant aligned memory buffer declared at compile time so we can do aligned loads from it in SSE mode.
This leaves us with 3 unknowns for each pixel. We start our MMX/SSE2 loop by calculating R_V, G_V-G_U, and B_U for 8 (16 for sse2) U/V values. We actually only use 4 (8 for SSE2) of these in each iteration, but we can precalculate the next 4 in the SIMD stream and this prevents an unaligned load/store in SSE2. At this point we're left with simple math to calculate the RGB:
R = 74*(Y-16)/64 + R_V G = 74*(Y-16)/64 - (G_V-G_U) B = 74*(Y-16)/64 + B_U
After performing this simple math (please refer to SVN if you want to see the nitty gritty on how we process 8/16 pels at a time, but basically we unpack the low and hi portions and operate on them seperately. This gives us the benefit of always being able to do aligned load/stores for sse), we do a bit of unpacking to order things into cairo's expected BGRA32 and dump it into the output buffer.
For the curious we do a simple alignment check to determine if we have to calculate the U/V planes in the loop iteration or not, it looks roughly like (again for mmx):
mov u_plane, %eax and $7, %eax test %eax, %eax je do_calc movq backup_r_v, %mm1 movq backup_g_u_g_v, %mm2 movq backup_b_u, %mm3 jmp done_calc
If you're more curious our implementation is in svn in yuv-converter.cpp. Feel free to ask me any questions you might have.
PS: If someone wants to contribut a LUT based implementation for our C fallback under MIT/X11 that would be appreciated.

