
C optimization

  • 22-05-2006 8:22pm
    #1
    Closed Accounts Posts: 2,497 ✭✭✭


    How's it going all!

    I'm working on a problem at work where I need to read and write the entire contents of RAM in order to generate ECC bits. The code I'm using is simple enough but seems fairly slow (around 20 seconds for 512MB, which is a bit too slow). Whether that's the quickest I can get it, I don't know. Here's my code (it's C, btw).

    void EccInit(unsigned long mem_size) {
        unsigned long i;
        for(i = 0; i < mem_size; i+=16) {
            memmove((unsigned long *)i, (unsigned long *)i, 16);
        }
    }

    It's copying 128 bytes at a time as that's the size of the cache line. Any thoughts?


Comments

  • Registered Users Posts: 441 ✭✭robfitz


    Hardware architecture, operating system and compiler information would be helpful.

    I'm guessing it's some sort of embedded system because it's not normally possible to touch all the memory like that.

    Do the contents of the memory matter after the code has run? Looping in pages might be better. Make sure the pointers are aligned. It might be better to read from one cache line and write to another instead of reading and writing to the same cache line. It might be faster to decrement and test for zero in your outer loop.
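
    A rough sketch of two of the points above: checking that a pointer is cache-line aligned, and counting down so the loop test is a compare against zero. It runs over an ordinary buffer rather than raw physical memory so it can be tried on a host. `CACHE_LINE`, `is_line_aligned` and `touch_words_down` are made-up names, and 128 is simply the line size quoted in the first post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128u  /* assumption: line size quoted in the first post */

/* Returns 1 if p sits on a CACHE_LINE boundary, 0 otherwise. */
int is_line_aligned(const volatile void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Reads and rewrites every word in [base, base + bytes), walking from the
 * top down so the loop condition is a test against zero.
 * Returns the number of words touched. */
size_t touch_words_down(volatile unsigned long *base, size_t bytes)
{
    size_t n = bytes / sizeof(unsigned long);
    size_t touched = 0;

    assert(bytes % sizeof(unsigned long) == 0);
    while (n != 0) {
        --n;
        base[n] = base[n];   /* volatile forces a real load and a real store */
        ++touched;
    }
    return touched;
}
```

    The contents of the buffer are unchanged afterwards, which matches the requirement that memory keeps whatever it held before the pass.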

    Something like this hand coded in assembler might be faster.
      ; assumes mem_size is a non-zero multiple of 4
      move $mem_size to regA
    loop:
      sub 4 from regA
      move deref(regA) to regB
      move regB to deref(regA)
      jump if regA is not zero to loop
    


  • Closed Accounts Posts: 2,497 ✭✭✭omahaid


    PPC, U-Boot bootup code (they're keeping me away from the OS part of it :-) and GCC. Yeah the contents have to be the contents the system booted up with (whether they're right or not is a different matter). Tried it with a decrementer too, but didn't seem to make a huge difference. As to the suggestion about looping in pages, I'll admit to needing to look that up a bit.....


  • Registered Users Posts: 441 ✭✭robfitz


    64 or 32 bit PPC? Do you have to read and write each byte? Just reading should be a lot faster.

    From a quick read, cache lines on the G4 are 32 bytes and on the G5 are 128 bytes.

    How fast is this code?
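    void EccInit(unsigned long mem_size) {
        // Start one past the top of RAM and walk down to address 0.
        // mem_size must be a non-zero multiple of the 8-word block and aligned to a cacheline.
        volatile unsigned long *p = (volatile unsigned long *)mem_size;
        while (mem_size != 0) {
            p -= 8;                                  // step down to the next block
            mem_size -= sizeof(unsigned long) * 8;
            p[0] = p[0];
            p[1] = p[1];
            p[2] = p[2];
            p[3] = p[3];
            p[4] = p[4];
            p[5] = p[5];
            p[6] = p[6];
            p[7] = p[7];
        }
    }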
    

    omahaid wrote:
    It's copying 128 bytes at a time as that's the size of the cache line. Any thoughts?

    No, it isn't. It's only copying the data in 16-byte chunks.


  • Closed Accounts Posts: 2,497 ✭✭✭omahaid


    It's 64 bit. I need to read each byte to detect an ecc error, and write it to generate the ecc bits. I realise I could only write those bytes whose ecc is invalid, but I'm not sure if the act of checking for an error each time is worth it (I'll test it). That code you put up seems pretty good though (cheers!), I'll give it a stab today. As for reading 128 bytes, yes, I spot my error there, should be 128 instead of 16 in memmove (and the loop should be incrementing by 128).
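
    For reference, a sketch of the first post's loop with that correction applied (128 for both the step and the memmove length). It works on a caller-supplied buffer so it can be run on a host; on the target the base would be physical address 0. `LINE_BYTES` and `ecc_touch_lines` are names invented here. One caveat: a compiler or C library is allowed to treat a memmove whose source and destination are identical as a no-op, so this form may not actually touch the RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 128u  /* assumption: cache line size from the first post */

/* Copies each LINE_BYTES block of [base, base + size) onto itself.
 * Returns the number of blocks processed. */
size_t ecc_touch_lines(unsigned char *base, size_t size)
{
    size_t off;
    size_t blocks = 0;

    assert(size % LINE_BYTES == 0);
    for (off = 0; off < size; off += LINE_BYTES) {
        memmove(base + off, base + off, LINE_BYTES);
        ++blocks;
    }
    return blocks;
}
```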


  • Closed Accounts Posts: 9,314 ✭✭✭Talliesin


    robfitz's idea is sound, though generally memmove and similar has the potential to be implemented with processor-specific code that can beat Duff's device.


  • Closed Accounts Posts: 2,497 ✭✭✭omahaid


    Yeah, tried it out there, still dragging its heels. I'll have to do some more research into it; a 500MHz processor with 667MHz RAM shouldn't take that long to do this. Hopefully some more research will enlighten me, cheers anyway!
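
    To put a number on that, here is a back-of-the-envelope calculation using the figures from the first post (512MB in about 20 seconds). `throughput_mb_per_s` is just a helper made up for the sum, and MB is taken as 2^20 bytes.

```c
#include <assert.h>

/* Effective rate in MB/s for touching `megabytes` of RAM in `seconds`. */
double throughput_mb_per_s(double megabytes, double seconds)
{
    assert(seconds > 0.0);
    return megabytes / seconds;
}
```

    `throughput_mb_per_s(512.0, 20.0)` works out to 25.6 MB/s. Every byte is both read and written, so the memory traffic is roughly twice that, around 51 MB/s.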


  • Registered Users Posts: 441 ✭✭robfitz


    Talliesin wrote:
    though generally memmove and similar has the potential to be implemented with processor-specific code that can beat Duff's device.

    That is very true but this is a bootloader, and the implementation in assembler doesn't seem to be especially optimized.

    Here's another version which uses local variables to hold the values temporarily before writing them out, which should help prevent stalls.
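    void EccInit(unsigned long mem_size) {
        // Walks down from the top of RAM; mem_size must be a non-zero multiple of the 8-word block.
        volatile unsigned long *p = (volatile unsigned long *)mem_size;  // Must be aligned to a cacheline
        unsigned long t0, t1, t2, t3;
        while (mem_size != 0) {
            p -= 8;                                  // step down to the next block
            mem_size -= sizeof(unsigned long) * 8;
            // Load four words before storing them, so the temporaries are never written out uninitialised.
            t0 = p[0];
            t1 = p[1];
            t2 = p[2];
            t3 = p[3];
            p[0] = t0;
            p[1] = t1;
            p[2] = t2;
            p[3] = t3;
            t0 = p[4];
            t1 = p[5];
            t2 = p[6];
            t3 = p[7];
            p[4] = t0;
            p[5] = t1;
            p[6] = t2;
            p[7] = t3;
        }
    }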
    

    The code could try unrolling the loop more and using more temporary variables.


  • Closed Accounts Posts: 2,497 ✭✭✭omahaid


    Someone had the data cache set to write-through. Setting this to write-back instead has brought the loop time down to about five seconds. I'm working on reading a cache line of data into an array and then writing this back to memory. Hopefully that should knock a few more clock cycles off.
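
    A rough sketch of that read-a-line-then-write-it-back idea, run over an ordinary buffer so it can be tested on a host. `LINE_BYTES`, `LINE_WORDS` and `rewrite_by_line` are hypothetical names, and the 128-byte line size is the one mentioned earlier in the thread; this is not the code actually used on the board.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 128u  /* assumption: cache line size from earlier in the thread */
#define LINE_WORDS (LINE_BYTES / sizeof(unsigned long))

/* For each cache-line-sized block: load the whole block into a local
 * array, then store it back. Returns the number of blocks rewritten. */
size_t rewrite_by_line(volatile unsigned long *base, size_t bytes)
{
    unsigned long line[LINE_WORDS];
    size_t blocks = bytes / LINE_BYTES;
    size_t b, w;

    assert(bytes % LINE_BYTES == 0);
    for (b = 0; b < blocks; b++) {
        volatile unsigned long *p = base + b * LINE_WORDS;
        for (w = 0; w < LINE_WORDS; w++)   /* all loads first */
            line[w] = p[w];
        for (w = 0; w < LINE_WORDS; w++)   /* then all stores */
            p[w] = line[w];
    }
    return blocks;
}
```

    Grouping the loads ahead of the stores is the same idea as the temporaries version above, just widened to a full line.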


  • Registered Users Posts: 4,003 ✭✭✭rsynnott


    Does it have AltiVec/VMX? If so, you could look there. I seem to remember hearing of it being used for fast memory operations.

    Are you sure it's not just a slow memory controller?

