Multi-Threading with DLL files

Martyr · 10-04-2008 06:37PM #1

I need to run a thread for each available processor found on Windows.

But each thread needs to access "global data" without the use of spinlocks/mutexes.

Problem is, calling LoadLibrary() multiple times returns the same base address for each thread.

Current workaround: replicate the DLL file and load each copy separately, thus getting a different base address and an individual memory space for each.

Everything runs fine, but I was hoping there was an alternative method that could be used.

Anyone have any ideas?
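
For reference, a minimal sketch of the first step: finding how many logical processors there are, so that one thread can be started per processor. The function name logical_cpu_count is made up for illustration. The Windows branch uses GetSystemInfo(); the other branch is an assumption added only so the sketch also builds on non-Windows systems.

```c
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Returns the number of logical processors, or 1 if it cannot be determined.
   logical_cpu_count is a hypothetical helper name, not a system API. */
int logical_cpu_count(void)
{
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors > 0 ? (int)si.dwNumberOfProcessors : 1;
#else
    /* _SC_NPROCESSORS_ONLN is a common extension (Linux, BSD, macOS). */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
#endif
}
```

The returned count would then decide how many worker threads (or, with the current workaround, how many DLL copies) to create.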

Cantab. · 10-04-2008 08:09PM

I'm not a multi-processor expert, but how do you propose to have separate threads use the same memory without synchronising the shared memory space using standard RT techniques (i.e. mutexing, etc.)? Doesn't seem possible to me...

Martyr · 10-04-2008 08:23PM

I didn't phrase the question very well.

The main problem is that x86 processors don't have enough free registers.
The thread code uses all 8 general-purpose registers (yes, it's in assembly), including EBP and ESP, which are usually reserved for addressing local variables and function parameters.

As a result, I stored what needs to be saved in global variables.

But when multi-threading, every thread shares the same data, which obviously won't work.

To get around this, I put the thread code into a DLL file and duplicate the file for each processor core available.

The reason this works is better explained by MSDN's description of LoadLibrary:

To summarize, the system performs the following steps at load time:

Examines the image and determines its preferred base address and required size.
Finds the address space required and maps the image, copy-on-write, from the file.
Applies internal fixups if the image is not at its preferred base address.
Fixes up all dynamic link imports by placing the correct address for each imported function into the appropriate entry of the Import Address Table. This table stores 32-bit addresses contiguously; to store up to 1024 imported functions requires it to dirty only one page of memory.

On a dual-core, for example, I copy file.dll as

file_0.dll
file_1.dll

then load each one individually, so that each has its own memory space, get the procedure address, and create the thread.

This allows each thread to have its own private global variables (sort of).
Also, I don't need to use spinlocks or mutexes, which helps the speed of each thread.

But duplicating files to make this work isn't the greatest of solutions, and I was hoping there was a more elegant way to achieve this.

Cantab. · 10-04-2008 11:52PM

So you're programming an Intel multi-core at register level? Good for you!

I'd like to know what kind of an application this is for -- why is it so crucial to hand-optimise the performance? Couldn't you just program as normal and tack on an extra processor or two? How much performance gain do you think your hand-written code will achieve above compiled code?

Couldn't you use an Intel compiler and let it auto-optimise your high-level thread automatically? It's very smart you know.

Could it be that your app may be better suited to a more parallel architecture such as GPU/FPGA?

What ARE you implementing mate?

dazberry · 11-04-2008 09:54AM

Average Joe wrote: »

on a dual-core for example, i rename the file.dll as

file_0.dll
file_1.dll

load each individually, which then has its own memory space, get the procedure address, create the thread.

Have you looked at Thread Local Storage? Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

D.

pH · 11-04-2008 11:09AM

dazberry wrote: »

Have you looked at Thread Local Storage? Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

D.

Absolutely - TLS is the correct way to do this.
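
A minimal sketch of what thread-local storage looks like from C, under some loud assumptions: it uses C11 _Thread_local and POSIX threads so that it builds outside Windows (on Windows the same shape would use CreateThread() with __declspec(thread) or TlsAlloc()), and the names scratch_total, tls_worker and run_tls_workers are invented for the example. It is not the code discussed in this thread.

```c
#include <pthread.h>
#include <stdint.h>

enum { TLS_MAX_WORKERS = 64, TLS_ITERATIONS = 1000 };

/* Each thread gets its own private copy of this variable. */
static _Thread_local uint32_t scratch_total;

static void *tls_worker(void *arg)
{
    uint32_t step = (uint32_t)(uintptr_t)arg;

    for (int i = 0; i < TLS_ITERATIONS; i++)
        scratch_total += step;   /* no lock needed: the variable is per-thread */

    return (void *)(uintptr_t)scratch_total;
}

/* Starts n workers (worker i repeatedly adds i + 1) and stores each worker's
   final thread-local total in results[i]. Returns 0 on success, -1 on error. */
int run_tls_workers(int n, uint32_t *results)
{
    pthread_t tid[TLS_MAX_WORKERS];
    int started = 0, rc = 0;

    if (n < 1 || n > TLS_MAX_WORKERS)
        return -1;

    for (; started < n; started++) {
        if (pthread_create(&tid[started], NULL, tls_worker,
                           (void *)(uintptr_t)(started + 1)) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        void *ret = NULL;
        if (pthread_join(tid[i], &ret) != 0)
            rc = -1;
        results[i] = (uint32_t)(uintptr_t)ret;
    }
    return rc;
}
```

Every worker sees its own scratch_total starting at zero, so no locking is involved. The trade-off is that the compiler has to find the per-thread copy on each access, which typically costs some extra address computation compared with a plain global.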

Martyr · 11-04-2008 01:31PM

Cantab wrote:

What ARE you implementing mate?

Multi-threaded programs, comparing the speed of an x86 Core 2 with the PS3's Cell B.E., which is PowerPC-based.

dazberry wrote:

Have you looked at Thread Local Storage?

Yes, but unfortunately it would slow down the thread too much.
What I'd hoped for was some function in Windows that allowed loading one DLL file multiple times, each time at a different base address.

dazberry wrote:

Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

In some situations I've found it faster to use global variables rather than the stack, at least for this process.

Local variables at stack offsets outside the range -128 to +127 need a 32-bit displacement instead of an 8-bit one, so each access encodes to more than 3 bytes, which usually takes longer for the processor to decode.
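
To make the displacement point concrete, here is a small sketch that computes the encoded length of one instruction form, the 32-bit load `mov r32, [base + disp]` with EBP or ESP as the base, in 32-bit mode. The function name mov_load_length is invented for illustration, and it covers only this one form, not instruction encoding in general.

```c
#include <stdint.h>

/* Encoded length in bytes of `mov r32, [base + disp]` in 32-bit mode,
   where base is EBP (base_is_esp == 0) or ESP (base_is_esp != 0).
   Hypothetical helper for illustration; handles only this instruction form. */
int mov_load_length(int base_is_esp, int32_t disp)
{
    int len = 2;                     /* opcode 8B + ModRM byte */

    if (base_is_esp)
        len += 1;                    /* ESP as base always needs a SIB byte */

    if (disp == 0 && base_is_esp)
        return len;                  /* [esp] needs no displacement at all */

    if (disp >= -128 && disp <= 127)
        return len + 1;              /* 8-bit displacement ([ebp] needs one even for 0) */

    return len + 4;                  /* 32-bit displacement */
}
```

For example, `mov eax, [ebp+8]` comes out at 3 bytes, while `mov eax, [ebp+200]` comes out at 6, and the ESP-based equivalents are one byte longer because of the SIB byte.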

The thread is broken up into separate routines.
For this reason, using local variables requires other registers to load the effective address and/or PUSH/POP instructions, which are avoided because they don't pair.

An attempt is made to ensure there is only 1 write to a register every 2 instructions, breaking up dependencies. It would be better to do 1 write every 3 or 4 instructions, but again, there aren't enough registers.
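
The dependency idea can be sketched in C rather than assembly. The two functions below are hypothetical examples (they are not the routines from this thread): the first funnels every addition through one accumulator, the second spreads the work over four independent accumulators so that consecutive additions don't wait on each other. Whether the second is actually faster depends on the processor and on what the compiler does with the loop, so no timing is claimed here.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition depends on the result of the previous one. */
uint32_t sum_single(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: each one is written only once every four additions,
   so the four dependency chains can make progress independently. */
uint32_t sum_four_chains(const uint32_t *v, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)               /* leftover elements */
        s0 += v[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value (unsigned arithmetic wraps identically in either order). The catch is the one described above: four chains need four live registers.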

This is why ESP and EBP are used for data, whereas compilers wouldn't normally repurpose them at all.

carveone · 11-04-2008 04:23PM

Average Joe wrote: »

Yes, but unfortunately it would slow down the thread too much.
What I'd hoped for was some function in Windows that allowed loading one DLL file multiple times, each time at a different base address.

Then does it become like fork()/exec() rather than creating a new thread? It's not sharing the same code space, that's for sure...

I believe there isn't a function that does what you ask. You'd have to write your own, which would suck rather a lot more than what you're doing now!
One level of indirection solves your problem, but you'd start using LEA. I mean, if the stack is slow for you, perhaps you don't want to be adding computed offsets to allocated memory...
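
A rough sketch of that one level of indirection, with assumptions stated up front: it is written in C with POSIX threads so it builds anywhere (on Windows the context pointer would be the lpParameter argument of CreateThread()), and worker_ctx, ctx_worker and run_ctx_workers are names invented for the example.

```c
#include <pthread.h>
#include <stdint.h>

enum { CTX_MAX_WORKERS = 64, CTX_ITERATIONS = 1000 };

/* Hypothetical per-thread block holding what would otherwise be globals. */
typedef struct {
    uint32_t step;    /* input for this worker */
    uint32_t total;   /* scratch value / result, private to this worker */
} worker_ctx;

static void *ctx_worker(void *arg)
{
    worker_ctx *ctx = arg;           /* the one level of indirection */

    for (int i = 0; i < CTX_ITERATIONS; i++)
        ctx->total += ctx->step;     /* no lock: nobody else touches this block */

    return NULL;
}

/* Runs one worker per context block. Returns 0 on success, -1 on error. */
int run_ctx_workers(worker_ctx *ctxs, int n)
{
    pthread_t tid[CTX_MAX_WORKERS];
    int started = 0, rc = 0;

    if (n < 1 || n > CTX_MAX_WORKERS)
        return -1;

    for (; started < n; started++) {
        if (pthread_create(&tid[started], NULL, ctx_worker, &ctxs[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        if (pthread_join(tid[i], NULL) != 0)
            rc = -1;
    }
    return rc;
}
```

All threads share one copy of the code and each works on its own data block. In hand-written assembly the base pointer has to live in a register (or be reloaded), which is exactly the cost being weighed here.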

Amusingly enough, under DOS you could change DS

Only joking!

Yeah, none of this helps much, sorry...

Conor.

Martyr · 11-04-2008 04:54PM

carveone wrote:

Amusingly enough, under DOS you could change DS Only joking!

Good point tbh. DS is the default for data, but you can override this using ES, FS, GS or SS. 32-bit Windows still recognises segment prefixes.

satchmo · 14-04-2008 03:45PM

Average Joe wrote: »

Multi-threaded programs, comparing the speed of an x86 Core 2 with the PS3's Cell B.E., which is PowerPC-based.

I'd be careful how you compare the two, they're inherently completely different processors. Besides the difference in cache latencies etc, the Cell's PPU uses in-order execution so you can't just execute the same instructions in the same order on both platforms and expect the performance to be comparable.

Interesting thread (the programming board needs more like this), let us know how you get on.

ressem · 14-04-2008 06:32PM

Well, he's not using the extra registers available in x64 mode, so it must be a fairly specific benchmark that he is looking to create.

Actually, aren't there about 40 physical general-purpose registers available on Intel Core processors, which are swapped between using a register alias table?

As for the original question: while ReBaseImage() can be used to create an image in memory, perhaps you can find an altered version of LoadLibrary to make use of it, something like
http://www.joachim-bauch.de/tutorials/load_dll_memory.html

Martyr · 17-04-2008 02:36PM

satchmo wrote:

I'd be careful how you compare the two, they're inherently completely different processors. Besides the difference in cache latencies etc, the Cell's PPU uses in-order execution so you can't just execute the same instructions in the same order on both platforms and expect the performance to be comparable.

POWER/PowerPC is a completely new architecture to me. A little more difficult to learn than x86, would you say?

I'll probably spend time writing code in C, then analysing the assembly generated by GCC to begin with.

On the in-order execution point: the algorithms will be running in parallel.
Would you say fewer dependencies generate faster code?

The PPU and SPEs both have 32 128-bit vector registers and 32 general-purpose registers (not to mention 32 floating-point/other special-purpose registers), correct?

I read in different places that the SPE has "128 registers", and assumed the writer meant 32 x (4 x 32-bit) VMX registers. Just wanted some clarification.

For speed, where is the best place to store/read data?
Also, what is the maximum amount of memory I can address in one SPE?

This info is probably all buried in the manuals somewhere, but I know you've experience in this area already. Hope you don't mind answering.

satchmo wrote:

let us know how you get on.

That could be some time, but will do.

ressem wrote:

Well, he's not using the extra registers available in x64 mode, so it must be a fairly specific benchmark that he is looking to create

There is both x86 and x64 code. I've just not installed 64-bit Windows yet.
Fedora Core 8 Linux is running on the PS3.

ressem wrote:

As for the original question: while ReBaseImage() can be used to create an image in memory, perhaps you can find an altered version of LoadLibrary to make use of it, something like

Since the code is all assembly and there are only 1 or 2 API calls during the thread, it might be a good idea to allocate memory using VirtualAlloc() with PAGE_EXECUTE_READWRITE, then copy the code/data there before calling CreateThread() on the address of the code.

Though it would mean having to calculate all the data offsets manually, so I suppose in-memory execution (ReBaseImage() or something similar) would be the best solution so far.
