Supercomputer Question

SouperComputer · 28-11-2003 5:30pm #1

yes, I know my nick and this post is ironic

was looking at this: http://www.sw.nec.co.jp/hpc/sx-e/sx6i/index.html

8 TFLOPS. Just out of interest, roughly how many TFLOPS would your 3GHz P4 run at?

I realise they are different architectures, but I think it would be an interesting comparison

L1011 · 28-11-2003 5:55pm

A 3.0GHz P4 should be 3 Gigaflops, as it has one float unit. An Athlon 3000+ might push it to 5 Gigaflops, as it has more units. But it's intrinsically slower
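The rule of thumb in this post (clock rate times float results per cycle) can be sketched in a few lines. This is only a back-of-envelope model; the function names are made up for illustration, and the "one result per cycle" figure is the post's own assumption, not a measured value.

```python
def peak_gflops(clock_ghz: float, results_per_cycle: float) -> float:
    """Theoretical peak in GFLOPS: clock rate (GHz) times float results per cycle."""
    return clock_ghz * results_per_cycle


def gflops_to_tflops(gflops: float) -> float:
    """1 TFLOPS = 1000 GFLOPS."""
    return gflops / 1000.0


# Assumption taken from the post: one float unit, one result per clock cycle.
p4 = peak_gflops(3.0, 1)
print(f"3.0 GHz P4: {p4:.1f} GFLOPS = {gflops_to_tflops(p4):.3f} TFLOPS")
```

Real code rarely reaches the theoretical peak, so treat the number as an upper bound.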

SouperComputer · 28-11-2003 7:36pm

just to confirm:

NEC: 8 TERAFLOPS
x86: 6 Gigaflops

hmm, big difference alright

Capt'n Midnight · 28-11-2003 8:55pm

Originally posted by MYOB
A 3.0GHz P4 should be 3 Gigaflops, as it has one float unit. An Athlon 3000+ might push it to 5 Gigaflops, as it has more units. But it's intrinsically slower

For non-techies: pipelining

It takes more than one clock cycle to carry out many instructions (e.g. an addition takes four operations). Provided you use separate hardware for each part of each instruction, you can have four additions in the pipeline at a time. E.g. if it takes four clock cycles for the first addition, then the next addition would be finished on the very next clock cycle...
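A rough sketch of that arithmetic, assuming an idealised pipeline with no stalls (the function names and the four-stage figure are only for illustration):

```python
def cycles_unpipelined(n_ops: int, stages: int) -> int:
    """Each operation must finish all its stages before the next one starts."""
    return n_ops * stages


def cycles_pipelined(n_ops: int, stages: int) -> int:
    """The first result takes `stages` cycles; after that, one result per cycle."""
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)


for n in (1, 2, 100):
    print(n, "additions:",
          cycles_unpipelined(n, 4), "cycles unpipelined,",
          cycles_pipelined(n, 4), "cycles pipelined")
```

For long runs of independent additions the pipelined version approaches one result per clock cycle, which is where the "one FLOP per Hz" style estimates come from.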

Note: x86 processors can process more than one instruction at a time if they use different parts of the chip..

BTW: all it takes is one ClusterKnoppix CD and a LAN where all the PCs have been set to boot up off the network card

http://home.cwru.edu/beowulf/ - they are half way to 1.8GHz

po0k · 28-11-2003 10:40pm

http://www.top500.org/lists/2003/11/top5.php

po0k · 28-11-2003 10:46pm

Originally posted by SouperComputer:
yes, I know my nick and this post is ironic

was looking at this: http://www.sw.nec.co.jp/hpc/sx-e/sx6i/index.html

8 TFLOPS. Just out of interest, roughly how many TFLOPS would your 3GHz P4 run at?

I realise they are different architectures, but I think it would be an interesting comparison

"The world-fastest class one chip vector processor (8GFLOPS) is loaded inside a micro-supercomputer."
From your own link to a workstation....

Also:

Hardware

The ES (Earth Simulator) is based on:

5,120 (640 8-way nodes) 500 MHz NEC CPUs
8 GFLOPS per CPU (41 TFLOPS total)
2 GB (4 512 MB FPLRAM modules) per CPU (10 TB total)
shared memory inside the node
640 × 640 crossbar switch between the nodes
16 GB/s inter-node bandwidth
20 kVA power consumption per node

http://www.top500.org/lists/2003/11/1/
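The totals in that hardware list can be checked with simple arithmetic. A minimal sketch follows; the variable names are made up, the 3 Gigaflops P4 figure is the estimate given earlier in the thread, and 1 TB is taken as 1024 GB.

```python
CPUS = 5120            # from the list above
NODES = 640
GFLOPS_PER_CPU = 8
GB_PER_CPU = 2
P4_GFLOPS = 3          # the thread's own rough estimate for a 3 GHz P4

cpus_per_node = CPUS // NODES
peak_tflops = CPUS * GFLOPS_PER_CPU / 1000
total_memory_tb = CPUS * GB_PER_CPU / 1024
p4_equivalents = CPUS * GFLOPS_PER_CPU / P4_GFLOPS

print(f"{cpus_per_node} CPUs per node")
print(f"{peak_tflops:.2f} TFLOPS peak")
print(f"{total_memory_tb:.0f} TB of memory")
print(f"roughly {p4_equivalents:.0f} P4s at peak")
```

So the 8 GFLOPS figure is per CPU, and the 41 TFLOPS is that number multiplied across the whole machine.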

dazberry · 29-11-2003 12:56am

Originally posted by Capt'n Midnight
For non-techies pipelining
[snip]
Note: x86 processors can process more than one instruction at a time if they use different parts of the chip..

/me roots out the pocket protector...

As Capt'n Midnight said... but just to make it more complicated...

There's a ton(ne) of rules here that you need to optimize your code for, such as pairable instructions, instructions that affect prefetch and branch prediction, data misalignment across code boundaries etc.

A brief example is that of AGI (or Address Generation Interlock). Basically, if the current instruction changes a register that is used as the basis for an address calculation in the next instruction, this will stall the pipeline, or multiple pipelines, since the P1 was superscalar and had 2 pipelines.
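A toy model of the AGI rule, just to make the idea concrete. It only looks at adjacent instructions in a single stream and counts one stall per hazard; the `Instr` type, the one-stall penalty and the sample instruction sequences are all assumptions for illustration, not a description of how the real chip schedules code.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                    # assembly text, for display only
    writes: Optional[str]        # register this instruction modifies
    addr_regs: Tuple[str, ...]   # registers used to compute a memory address


def count_agi_stalls(program: List[Instr]) -> int:
    """Count adjacent pairs where an address register was written by the previous instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.writes is not None and prev.writes in cur.addr_regs:
            stalls += 1
    return stalls


# ebx is changed and then immediately used as an address: one stall.
naive = [
    Instr("add ebx, 4", "ebx", ()),
    Instr("mov eax, [ebx]", "eax", ("ebx",)),
]

# Same work, with an unrelated instruction moved in between: no stall.
reordered = [
    Instr("add ebx, 4", "ebx", ()),
    Instr("inc ecx", "ecx", ()),
    Instr("mov eax, [ebx]", "eax", ("ebx",)),
]

print(count_agi_stalls(naive), count_agi_stalls(reordered))
```

Reordering like this is the kind of thing hand-optimisers did then and compilers are expected to do now.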

It's all very vague now, but you could run say an FADD or FMUL in one clock cycle on the P1 FPU, on the understanding that you didn't reference the result until at least 3 clock cycles later.

So I guess in this day and age it's all up to the compiler writers. Doesn't excuse bad algorithms, but even at best, what might be optimised on one generation of CPU won't see much benefit on the next. So when you see N megaflops FPU speed, you can bet that's well optimized and is a best case scenario.

/puts pocket protector back in Worx Assembly Language Masterclass book and goes to bed dreaming of simpler days

D.

CyberGhost · 29-11-2003 1:38am

I always read about floating point and FPUs but I don't know what the hell they are. Where can I find info that explains this stuff?

Capt'n Midnight · 29-11-2003 1:59am

The FPU is like a scientific calculator - it does all the trigonometry, e.g. for games

Early Intel chips were like ordinary calculators: add, subtract, multiply and divide. E.g. the 8088/8086 (XT), 286 (AT), 386 and 486SX all needed a second chip to do the number crunching: 8087 / 287 / 387 / 486DX (487 being a cynical marketing ploy)

You can treat a 486 as being a 386 AND 387 in the same chip (with a few tweaks that make it about 1.5 times as fast)

Pentiums and later are a bit odd: they are a bit like two 386's and a 387 in the same package ...

Remember, very clever people have been figuring out tricks in computer hardware for the last 60 years, so it can be a bit difficult to comprehend why it looks the way it does without considering the evolution.

dazberry · 29-11-2003 3:15am

Originally posted by Capt'n Midnight
The FPU is like a scientific calculator - it does all the trigonometry, e.g. for games

TBH the games stuff was where the real cleverness came in. For example, SIN tables were precalculated in fixed-point math and used as lookups for matrix calculations. Didn't have the precision of FP but didn't need it.
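A small sketch of that lookup-table trick. The 16.16 fixed-point format, the 256-entry table and all the names are assumptions picked for the example; games of the era used whatever sizes suited them, and would have shipped the table precalculated rather than building it with floating point at start-up.

```python
import math

FRAC_BITS = 16                 # assumed 16.16 fixed-point format
ONE = 1 << FRAC_BITS           # fixed-point representation of 1.0
TABLE_SIZE = 256               # assumed: 256 angle steps per full circle

# Built once up front; everything after this uses integer maths only.
SIN_TABLE = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * ONE)
             for i in range(TABLE_SIZE)]


def fix_sin(angle: int) -> int:
    """Sine of an angle given in table steps, as a fixed-point integer."""
    return SIN_TABLE[angle % TABLE_SIZE]


def fix_cos(angle: int) -> int:
    """Cosine is the sine table read a quarter circle ahead."""
    return SIN_TABLE[(angle + TABLE_SIZE // 4) % TABLE_SIZE]


def fix_mul(a: int, b: int) -> int:
    """Multiply two fixed-point numbers and rescale the result."""
    return (a * b) >> FRAC_BITS


def rotate(x: int, y: int, angle: int):
    """Apply a 2D rotation matrix using only table lookups and integer ops."""
    s, c = fix_sin(angle), fix_cos(angle)
    return fix_mul(x, c) - fix_mul(y, s), fix_mul(x, s) + fix_mul(y, c)


# Rotate the point (100, 0) by a quarter turn (64 of 256 steps).
x, y = rotate(100 * ONE, 0, 64)
print(x / ONE, y / ONE)
```

The precision is limited by the table size and the number of fraction bits, which is the trade-off described above.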

The other issue about the pre-Pentium FPUs was that they weren't pipelined, so they were particularly slow. In addition, because any x86 prior to the Pentium may or may not have had an FPU, there was a penalty to call an FPU instruction (ESC), which could in effect take 14+ clock cycles just to prepare when accessing memory, and that was before performing the FPU function. I can only guess, cos I never did (or looked into doing) any x87 stuff, but methinks the CPU was idle when the FPU was working; it was sort of co-operative processing rather than multi-processing.

If the FPU didn't exist, an FPU (missing) exception was raised and hopefully the calculation was run in standard CPU code, albeit slower again. I do recall seeing programs supplied with 2 exes, and for that matter I do remember compiler switches with emulate FPU code options.

486DX (487 being a cynical marketing ploy)

LOL - I'd forgotten about that. If I remember correctly the 486 SX and DX both had a 487 built in, but the SX version had the FPU turned off and it couldn't be turned back on. I suppose that was the first time that the segmentation economics of it all came into play.

D.

CyberGhost · 29-11-2003 3:17am

yea, it's so hard to understand the deep stuff of the computers

Thanks for the explanations!

SouperComputer · 30-11-2003 3:12pm

cool, maybe I'll actually have a chance to read this thread later!

Capt'n Midnight · 30-11-2003 7:28pm

Originally posted by dazberry
The other issue about the pre-Pentium FPUs was that they weren't pipelined, so they were particularly slow.

Ah yeah, all that fuss about Pentium pipelining - ha - badly ordered instructions could delay the pipeline so much that a 486 was faster for some code! I.e. the Pentium was only faster when the instruction order was optimised, and then it was maybe 40% faster - but even 486's were about 25% faster than before with the optimised code

I had to put up with a lot of abuse from people who couldn't get around the fact that old code running on a pentium P60 was not going to be much faster than on a 486/66 (and one muppet who reckoned his 386 was faster - so I asked him what speed it was - 66 he said - He knew that the motherboard had been changed and still called it a 386 !!!!)

Bottom line: code optimised for the processor will also increase the speed, especially for floating point ops (FLOPS), since a single native instruction will (almost) always be faster than several instructions on an older processor.

And then there was the Pentium Pro - guaranteed to be SLOWER running older apps than older processors (something that seemed to be impossible to explain to sales droids on commission) - but it was faster if you used the new instructions - i.e. FLOPS were up, but looked down if you used an old benchmarking utility.

PS. there was one Silicon Graphics chip that had an instruction that said something like multiply A x B and add C, and it could process 4 of these instructions in parallel - so every clock cycle you got four results (ok, it was integer multiplication, not floating point)

Capt'n Midnight · 04-12-2003 1:17am

Architecting the Future http://www.aceshardware.com/read.jsp?id=55000245
http://www.aceshardware.com/read.jsp?id=60000273
designs that have fully embraced the multithreaded approach like the Niagara processor (4 threads per core, 8 cores per die, 32 threads per chip).

ie. this chip will be able to execute 32 simple instructions at the same time !
