Multi-threaded process with intrinsics - good or bad?

Martyr · 19-08-2009 04:53PM #1

I'm finishing off a multi-threaded program for Windows that previously used event handles to signal when a thread had stopped or should stop, but I felt that was overcomplicating what I needed it to do.

I wanted to write the thread code using inline asm with the LOCK prefix, but again this seems a bit OTT.

Then there are intrinsics or inline macro asm which can be used, but my main question is: can anyone think of any reason why using intrinsics for synchronizing threads is a "bad idea"?

And why would using pthreads or the Boost libraries be any better, apart from portability?

I'm not synchronising threads over multiple processes or computers, so is using the intrinsics fine?

Anyway, the thread code is something like this (with the main operations stripped out):

[PHP]static long stop = FALSE;

// loop for number specified by lpParameter or terminate if stop is TRUE
//
DWORD ThreadProc(LPVOID lpParameter)
{
    DWORD dwResult;

    for (;;) {

        // perform main thread operations here

        dwResult = GetTickCount(); // does nothing useful, just here to generate asm

        if (dwResult == 0xDEADBEEF) {
            _InterlockedIncrement(&stop); // set to TRUE / signal other threads to finish
            break;                        // break out of loop and exit
        } else if (_InterlockedCompareExchange(&stop, TRUE, TRUE) == TRUE) { // else check if another thread signalled
            break;                        // and stop if TRUE
        }
    }
    return 0;
}[/PHP]

And the asm generated:

[PHP] ALIGN 2
PUBLIC ?ThreadProc@@YAKPAX@Z
?ThreadProc@@YAKPAX@Z PROC NEAR
; parameter 1: 8 + esp
$B2$1: ; Preds $B2$0
push esi ;34.1
mov DWORD PTR [esp], edi ;34.1
; LOE ebx ebp esi
$B2$2: ; Preds $B2$5 $B2$1
call DWORD PTR __imp__GetTickCount@0 ;41.20
; LOE eax ebx ebp esi
$B2$3: ; Preds $B2$2
cmp eax, -559038737 ;43.25
je $B2$8 ; Prob 1% ;43.25
; LOE ebx ebp esi
$B2$4: ; Preds $B2$3
mov edi, OFFSET FLAT: ?stop$0@@4JA ;46.20
mov ecx, 1 ;46.20
mov eax, 1 ;46.20
lock cmpxchg DWORD PTR [edi], ecx ;46.20
; LOE eax ebx ebp esi
$B2$5: ; Preds $B2$4
cmp eax, 1 ;46.68
jne $B2$2 ; Prob 99% ;46.68
; LOE ebx ebp esi
$B2$6: ; Preds $B2$5
mov edi, DWORD PTR [esp] ;
; LOE ebx ebp esi edi
$B2$7: ; Preds $B2$8 $B2$6
xor eax, eax ;49.11
pop ecx ;49.11
ret ;49.11
; LOE
$B2$8: ; Preds $B2$3 ; Infreq
mov edi, DWORD PTR [esp] ;
mov edx, OFFSET FLAT: ?stop$0@@4JA ;44.13
mov eax, 1 ;44.13
lock xadd DWORD PTR [edx], eax ;44.13
jmp $B2$7 ; Prob 100% ;44.13
ALIGN 2
; LOE ebx ebp esi edi[/PHP]

fasty · 19-08-2009 07:47PM

The Interlocked______ intrinsics are all atomic so I think it'll be okay to do that as opposed to Critical Sections and Locks. There's no need for third party thread libraries in this instance as you don't seem worried about porting problems.

I use them for thread pooling and have never really felt I should be using someone else's threading libraries.
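One caveat with the code in the first post: `_InterlockedIncrement` only leaves `stop` equal to TRUE if exactly one thread ever signals. If two threads hit the exit condition, `stop` ends up as 2 and the `== TRUE` comparison in the other threads never succeeds. Below is a minimal sketch of the same stop-flag pattern, using `std::atomic` from a later C++ standard as a portable stand-in for the Interlocked intrinsics (`exchange` is roughly `_InterlockedExchange`, `fetch_add` roughly `_InterlockedIncrement`). The names `stop_flag`, `worker`, `run_workers` and the two `signal_twice_*` helpers are made up for illustration and are not from the original program.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Portable stand-in for the `static long stop` flag from the first post.
static std::atomic<long> stop_flag{0};

// Hypothetical worker: the thread whose id equals `signaller` sets the flag
// after `limit` iterations; every other thread polls the flag and exits
// once it sees a non-zero value.
static void worker(int id, int signaller, long limit, std::atomic<int>& exited)
{
    for (long i = 0;; ++i) {
        // ... main thread operations would go here ...
        if (id == signaller && i == limit) {
            stop_flag.exchange(1);      // always leaves the flag at exactly 1
            break;
        }
        if (stop_flag.load() != 0)      // atomic read, tests "non-zero" rather than "== 1"
            break;
        std::this_thread::yield();
    }
    exited.fetch_add(1);
}

// Starts `n` workers (thread 0 is the signaller) and returns how many exited.
static int run_workers(int n)
{
    stop_flag.store(0);
    std::atomic<int> exited{0};
    std::vector<std::thread> threads;
    for (int id = 0; id < n; ++id)
        threads.emplace_back(worker, id, 0, 1000L, std::ref(exited));
    for (auto& t : threads)
        t.join();
    return exited.load();
}

// What the flag holds after two threads have signalled, for each approach.
static long signal_twice_by_increment()
{
    std::atomic<long> flag{0};
    flag.fetch_add(1);
    flag.fetch_add(1);
    return flag.load();                 // 2, so a comparison against TRUE (1) would fail
}

static long signal_twice_by_exchange()
{
    std::atomic<long> flag{0};
    flag.exchange(1);
    flag.exchange(1);
    return flag.load();                 // still 1
}
```

With the intrinsics themselves, the equivalent change would presumably be to signal with `_InterlockedExchange(&stop, TRUE)` and to test the flag for non-zero instead of comparing it against TRUE.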

satchmo · 20-08-2009 10:24PM

No, using intrinsics is a perfectly normal way to specify atomic operations like those. The only alternative I can see is to write the asm yourself, but this is needlessly low-level and non-portable, and besides the compiler can probably optimize the intrinsic call much better than you could optimize your asm.

Martyr · 20-08-2009 11:14PM

thanks guys.

@satchmo, how would you implement a similar atomic procedure on Cell SPU?

satchmo · 21-08-2009 11:43PM

There are a few different ways you could do it. The SPU's MFCs have a dedicated path for working with atomics in main memory (they don't get queued up with all the other DMAs), so you could get & set atomics just like you're doing above. Or you could use the SPU's mailboxes to signal each other, either via the PPU or by writing directly from one SPU to the other via the memory mapped mailbox registers (not sure if this is possible with IBM's Cell SDK, but it is with Sony's).

Hell, you could even lock a cacheline of memory and then wait for the interrupt handler to tell you when any writes are made to an address in that cacheline. The guys from Insomniac do this to trigger SPU jobs from the PPU (or even GPU) without saturating the bus with a tight busy loop that checks a variable in memory. I don't know how well this would scale though, since the system can only have a certain number of cacheline reservations, and it's a bit overkill for what you're doing!

One thing you should be careful with when using atomics for this sort of multithreaded work on a PPC architecture like the Cell is that memory writes can be issued in one order but written in another. So if you write to a memory location and then perform an atomic operation to kick off another thread to process the data you've just written, the order in which those writes are performed isn't guaranteed. That means you could end up kicking off the other thread before its data has actually been written to memory, which is a nightmare to debug. You need to insert a memory barrier between the two writes to guarantee the write order, as the compiler isn't smart enough to figure out that those two memory writes are related. This isn't an issue on x86 due to its stricter memory consistency model, but I figured it's worth mentioning.
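The "write the data, then kick the other thread" pattern described above can be sketched with `std::atomic` from a later C++ standard, used here as a portable stand-in for a hand-placed barrier. This is only an illustration under that assumption, not Cell SDK code: `payload`, `ready`, `produce`, `consume` and `publish_and_read` are made-up names. The release store plays the role of the barrier between the two writes, and the acquire load is its counterpart on the reading side. Spelling the ordering out is probably worthwhile even on x86, since the annotation also stops the compiler from reordering the two stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                  // ordinary, non-atomic data
static std::atomic<bool> ready{false};   // flag that kicks off the consumer

// Producer: write the data first, then publish it.
// The release store keeps the payload write from being ordered after the flag write.
static void produce(int value)
{
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Consumer: wait for the flag, then read the data.
// The acquire load pairs with the release store, so the payload write is visible here.
static int consume()
{
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
static int publish_and_read(int value)
{
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread consumer([&seen] { seen = consume(); });
    std::thread producer(produce, value);
    producer.join();
    consumer.join();
    return seen;
}
```

If the flag were written with `std::memory_order_relaxed` instead, nothing would order the two writes, which is the hard-to-debug case described above.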
