C optimization

omahaid · 22-05-2006 8:22pm #1

How's it going all!

I'm working on a problem at work where I need to read and write all the contents of RAM in order to generate ECC bits. The code I'm using is simple enough but seems fairly slow (around 20 seconds for 512MB, which is a bit too slow). Whether that's the quickest I can get it, I don't know. Here's my code (it's C, btw).

void EccInit(unsigned long mem_size) {
    unsigned long i;

    for (i = 0; i < mem_size; i += 16) {
        memmove((unsigned long *)i, (unsigned long *)i, 16);
    }
}

It's copying 128 bytes at a time as that's the size of the cache line. Any thoughts?

robfitz · 22-05-2006 9:42pm

Hardware architecture, operating system and compiler information would be helpful.

I'm guessing it's some sort of embedded system because it's not normally possible to touch all the memory like that.

Do the contents of the memory matter after the code has run? Looping in pages might be better. Make sure the pointers are aligned. It might be better to read from one cache line and write to another instead of reading and writing to the same cache line. It might be faster to decrement and test for zero in your outer loop.

Something like this hand coded in assembler might be faster.

  move $mem_size to regA
  jump to start
loop:
  move deref(regA) to regB
  move regB to deref(regA)
start:
  sub 4 from regA
  jump if regA is not zero to loop

omahaid · 22-05-2006 9:53pm

PPC, U-Boot bootup code (they're keeping me away from the OS part of it :-) and GCC. Yeah the contents have to be the contents the system booted up with (whether they're right or not is a different matter). Tried it with a decrementer too, but didn't seem to make a huge difference. As to the suggestion about looping in pages, I'll admit to needing to look that up a bit.....

robfitz · 23-05-2006 12:05am

64 or 32 bit PPC? Do you have to read and write each byte? Just reading should be a lot faster.

From my quick reading, cachelines on the G4 are 32 bytes and on the G5 are 128 bytes.

How fast is this code?

void EccInit(unsigned long mem_size) {
    // Starts one past the end of RAM; must be aligned to a cacheline
    volatile unsigned long *p = (volatile unsigned long *)mem_size;
    while (mem_size) {
        mem_size -= sizeof(unsigned long) * 8;
        p -= 8;  // step down to the next block of 8 words
        p[0] = p[0];
        p[1] = p[1];
        p[2] = p[2];
        p[3] = p[3];
        p[4] = p[4];
        p[5] = p[5];
        p[6] = p[6];
        p[7] = p[7];
    }
}

omahaid wrote:

It's copying 128 bytes at a time as that's the size of the cache line. Any thoughts?

No it doesn't. It's only copying the data in 16 byte chunks.

omahaid · 23-05-2006 7:50am

It's 64 bit. I need to read each byte to detect an ECC error, and write it to generate the ECC bits. I realise I could write only those bytes whose ECC is invalid, but I'm not sure if the act of checking for an error each time is worth it (I'll test it). That code you put up seems pretty good though (cheers!), I'll give it a stab today. As for reading 128 bytes, yes, I spot my error there: it should be 128 instead of 16 in memmove (and the loop should be incrementing by 128).
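The correction described here (128-byte copies, stepping by 128) might look roughly like the sketch below. The function name `ecc_scrub_memmove` and the `base` parameter are hypothetical, added so the loop can be tried on an ordinary buffer; the original loop starts at address 0. The 128-byte line size is the G5 figure quoted earlier in the thread and is an assumption for any other part.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cache line size (the G5 figure quoted in the thread). */
#define CACHE_LINE_BYTES 128u

/* Rewrite `size` bytes starting at `base` in place, one cache line per
 * memmove call. `base` is a hypothetical parameter for illustration.
 * Returns the number of whole cache lines touched. */
size_t ecc_scrub_memmove(void *base, size_t size)
{
    unsigned char *p = base;
    size_t lines = 0;
    size_t off;

    for (off = 0; off + CACHE_LINE_BYTES <= size; off += CACHE_LINE_BYTES) {
        /* Source and destination are the same address. A memmove
         * implementation may legally skip such a copy, so check the
         * bootloader's version before relying on this to touch RAM. */
        memmove(p + off, p + off, CACHE_LINE_BYTES);
        lines++;
    }
    return lines;
}
```

Because a self-copy can be optimized away, a loop over `volatile` pointers is the safer way to force every read and write to actually happen.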

Talliesin · 23-05-2006 12:06pm

robfitz's idea is sound, though generally memmove and similar have the potential to be implemented with processor-specific code that can beat Duff's device.

omahaid · 23-05-2006 12:24pm

Yeah, tried it out there, still dragging its heels. I'll have to do some more research into it, a 500MHz processor with 667MHz RAM shouldn't take that long to do this. Hopefully some more research will enlighten me, cheers anyway!
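To put the 20-second figure in perspective, the effective rate can be worked out directly: 512 MB in 20 seconds is 25.6 MB/s scrubbed, or about 51.2 MB/s of memory traffic since every byte is both read and written. The helper names below are hypothetical and the sketch only restates that arithmetic.

```c
/* Scrub rate in MB/s for `megabytes` processed in `seconds`. */
double scrub_rate_mb_per_s(double megabytes, double seconds)
{
    return megabytes / seconds;
}

/* Every byte is read once and written once, so the traffic seen by
 * the memory controller is twice the scrub rate. */
double bus_traffic_mb_per_s(double megabytes, double seconds)
{
    return 2.0 * scrub_rate_mb_per_s(megabytes, seconds);
}
```

Calling `scrub_rate_mb_per_s(512.0, 20.0)` gives 25.6, which supports the suspicion that something other than the loop itself is the bottleneck.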

robfitz · 23-05-2006 1:58pm

Talliesin wrote:

though generally memmove and similar have the potential to be implemented with processor-specific code that can beat Duff's device.

That is very true but this is a bootloader, and the implementation in assembler doesn't seem to be especially optimized.

Here's another version which uses local variables to hold the values temporarily before writing them out, which should help prevent stalls.

void EccInit(unsigned long mem_size) {
    // Starts one past the end of RAM; must be aligned to a cacheline
    volatile unsigned long *p = (volatile unsigned long *)mem_size;
    unsigned long t0, t1, t2, t3;
    while (mem_size) {
        mem_size -= sizeof(unsigned long) * 8;
        p -= 8;  // step down to the next block of 8 words
        t0 = p[0];  // load four words before storing any of them
        t1 = p[1];
        t2 = p[2];
        t3 = p[3];
        p[0] = t0;
        p[1] = t1;
        p[2] = t2;
        p[3] = t3;
        t0 = p[4];  // second half of the block
        t1 = p[5];
        t2 = p[6];
        t3 = p[7];
        p[4] = t0;
        p[5] = t1;
        p[6] = t2;
        p[7] = t3;
    }
}

The code could try unrolling the loop more and using more temporary variables.

omahaid · 23-05-2006 3:53pm

Someone had the data cache set to write through. Setting this to write back instead has brought the loop time down to about five seconds. I'm working on reading a cache line of data into an array and then writing this back to memory. Hopefully that should knock a few more clock cycles off.
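The read-a-line-then-write-it-back idea could be sketched as below. This is an illustration, not the code actually used: the function name, the `base` parameter and the 128-byte line size are assumptions, and whether it beats the unrolled versions above would need measuring on the real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 128u                              /* assumed cache line size */
#define LINE_WORDS (LINE_BYTES / sizeof(uint64_t))   /* 16 words per line */

/* Scrub `size_bytes` of memory at `base`, one cache line at a time:
 * read the whole line into a local array, then write it all back.
 * `base` is assumed to be cache-line aligned. Returns lines processed. */
size_t ecc_scrub_line_buffered(volatile uint64_t *base, size_t size_bytes)
{
    uint64_t line[LINE_WORDS];
    size_t lines = size_bytes / LINE_BYTES;
    size_t n, w;

    for (n = 0; n < lines; n++) {
        volatile uint64_t *p = base + n * LINE_WORDS;

        for (w = 0; w < LINE_WORDS; w++)   /* all reads first */
            line[w] = p[w];
        for (w = 0; w < LINE_WORDS; w++)   /* then all writes */
            p[w] = line[w];
    }
    return lines;
}
```

Grouping the reads before the writes keeps each line's loads together, which is the same stall-avoidance idea as the temporaries suggested earlier, just applied to a whole line.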

rsynnott · 25-05-2006 7:00pm

Does it have AltiVec/VMX? If so, you could look there. I seem to remember hearing tell of it being used for quick memory things.

Are you sure it's not just a slow memory controller?
