Supercomputer Question

  • 28-11-2003 5:30pm
    #1
    Registered Users, Registered Users 2 Posts: 6,949 ✭✭✭


    yes, I know my nick and this post is ironic

    was looking at this: http://www.sw.nec.co.jp/hpc/sx-e/sx6i/index.html

    8 TFLOPS. Just out of interest, roughly how many TFLOPS would your 3GHz P4 run at?

    I realise they are different architectures, but I think it would be an interesting comparison


Comments

  • Registered Users, Registered Users 2 Posts: 69,310 ✭✭✭✭L1011


    A 3.0GHz P4 should be 3 gigaflops, as it has one float unit. An Athlon 3000+ might push it to 5 gigaflops, as it has more units, but it's intrinsically slower.
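
The rule of thumb behind these numbers (peak FLOPS is roughly clock speed times floating-point results per cycle) can be sketched as below. The function name is made up for illustration, and the one-result-per-cycle figure is just the assumption stated in the post, not a measured value; real code rarely reaches the theoretical peak.

```python
def peak_gflops(clock_ghz: float, flops_per_cycle: float) -> float:
    """Theoretical peak in GFLOPS: clock rate (GHz) times FP results per cycle."""
    return clock_ghz * flops_per_cycle


# Assumption from the post: one float unit producing one result per clock cycle.
p4_estimate = peak_gflops(3.0, 1)
print(f"3.0 GHz at 1 FLOP/cycle: {p4_estimate:.1f} GFLOPS")
```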


  • Registered Users, Registered Users 2 Posts: 6,949 ✭✭✭SouperComputer


    just to confirm:

    NEC: 8 TERAFLOPS
    x86: 6 Gigaflops

    hmm, big difference alright


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 92,264 Mod ✭✭✭✭Capt'n Midnight


    Originally posted by MYOB
    A 3.0GHz P4 should be 3 gigaflops, as it has one float unit. An Athlon 3000+ might push it to 5 gigaflops, as it has more units, but it's intrinsically slower.

    For non-techies: pipelining

    Many instructions take more than one clock cycle to carry out (e.g. an addition might take four steps). Provided you use separate hardware for each step of each instruction, you can have four additions in the pipeline at a time: if the first addition takes four clock cycles, the next addition finishes on the very next clock cycle, and so on...

    Note: x86 processors can process more than one instruction at a time if the instructions use different parts of the chip.
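
The pipelining arithmetic above can be sketched as a simple cycle count. This is an idealised model with no stalls; the function names and the default of four stages are illustrative assumptions based on the addition example in this post.

```python
def cycles_unpipelined(n_ops: int, stages: int = 4) -> int:
    """Each operation must finish all of its stages before the next one starts."""
    return n_ops * stages


def cycles_pipelined(n_ops: int, stages: int = 4) -> int:
    """The first result takes `stages` cycles; each later result takes one more cycle."""
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)


for n in (1, 2, 100):
    print(n, "additions:", cycles_unpipelined(n), "cycles unpipelined,",
          cycles_pipelined(n), "cycles pipelined")
```

For long runs of independent additions the pipelined count approaches one result per clock cycle, which is why peak FLOPS figures are usually quoted per cycle.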

    BTW: one cluster knoppix cd, and a Lan where all the pc have been set to boot up off the network card :)

    http://home.cwru.edu/beowulf/ - they are half way to 1.8GHz :)


  • Registered Users, Registered Users 2 Posts: 15,815 ✭✭✭✭po0k


    Originally posted by SouperComputer:
    yes, I know my nick and this post is ironic

    was looking at this: http://www.sw.nec.co.jp/hpc/sx-e/sx6i/index.html

    8 TFLOPS. Just out of interest, roughly how many TFLOPS would your 3GHz P4 run at?

    I realise they are different architectures, but I think it would be an interesting comparison

    "The world-fastest class one chip vector processor (8GFLOPS) is loaded inside a micro-supercomputer."
    From your own link to a workstation....

    Also:


    Hardware


    The ES is based on:

    5,120 (640 8-way nodes) 500 MHz NEC CPUs
    8 GFLOPS per CPU (41 TFLOPS total)
    2 GB (4 512 MB FPLRAM modules) per CPU (10 TB total)
    shared memory inside the node
    640 × 640 crossbar switch between the nodes
    16 GB/s inter-node bandwidth
    20 kVA power consumption per node

    http://www.top500.org/lists/2003/11/1/
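
As a quick sanity check, the totals in the list above follow from the per-CPU figures. The variable names are made up for illustration; the floating-point results per clock cycle is a derived number (peak rate divided by clock speed), not something stated in the list.

```python
nodes = 640
cpus_per_node = 8
gflops_per_cpu = 8
gb_per_cpu = 2
clock_ghz = 0.5  # 500 MHz

cpus = nodes * cpus_per_node                    # total CPU count
total_tflops = cpus * gflops_per_cpu / 1000     # GFLOPS -> TFLOPS
total_tb = cpus * gb_per_cpu / 1024             # GB -> TB
flops_per_cycle = gflops_per_cpu / clock_ghz    # derived: FP results per clock per CPU

print(cpus, "CPUs,", total_tflops, "TFLOPS peak,", total_tb, "TB RAM,",
      flops_per_cycle, "FLOPs per cycle per CPU")
```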


  • Registered Users, Registered Users 2 Posts: 2,150 ✭✭✭dazberry


    Originally posted by Capt'n Midnight
    For non-techies: pipelining
    [snip]
    Note: x86 processors can process more than one instruction at a time if they use different parts of the chip.

    /me roots out the pocket protector...

    As Capt'n Midnight said... but just to make it more complicated...

    There's a ton(ne) of rules here that you need to optimize your code for, such as pairable instructions, instructions that affect prefetch and branch prediction, data misalignment across boundaries, etc.

    A brief example is that of AGI (Address Generation Interlock). Basically, if the current instruction changes a register that is used as the basis for an address calculation in the next instruction, this will stall the pipeline, or multiple pipelines, since the P1 was superscalar and had 2 pipelines.

    It's all very vague now, but you could run, say, an FADD or FMUL in one clock cycle on the P1 FPU, on the understanding that you didn't reference the result until at least 3 clock cycles later.
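
The "don't reference the result for 3 cycles" rule can be illustrated with a toy in-order model. This is only a sketch: the function name, the register names and the fixed latency of 3 are assumptions based on the recollection above, and it is not a real Pentium timing model.

```python
def total_cycles(program, latency=3):
    """Cycle at which every result is available.

    program: list of (dest, src1, src2) register names.
    One instruction issues per cycle, in order; an instruction stalls until
    all of its source registers have been produced.
    """
    ready_at = {}   # register -> first cycle at which its value can be read
    cycle = 0       # earliest cycle at which the next instruction may issue
    for dest, *srcs in program:
        issue = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = issue + latency
        cycle = issue + 1
    return max([cycle] + list(ready_at.values()))


# Each add needs the previous result straight away, so every add stalls.
dependent = [("a", "x", "y"), ("b", "a", "y"), ("c", "b", "y"), ("d", "c", "y")]
# Four unrelated adds can issue back to back.
independent = [("a", "x", "y"), ("b", "x", "y"), ("c", "x", "y"), ("d", "x", "y")]

print("dependent chain:", total_cycles(dependent), "cycles")
print("independent adds:", total_cycles(independent), "cycles")
```

Same four additions, but the ordering decides whether the unit delivers close to one result per cycle or one result per three cycles.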

    So I guess in this day and age it's all up to the compiler writers. That doesn't excuse bad algorithms, but even at best, what is optimised for one generation of CPU might not see much benefit on the next. So when you see N megaflops of FPU speed quoted, you can bet that's well optimized and a best-case scenario.

    /puts pocket protector back in Worx Assembly Language Masterclass book and goes to bed dreaming of simpler days :D

    D.


  • Registered Users, Registered Users 2 Posts: 5,554 ✭✭✭CyberGhost


    i always read about floating point and FPUs but i don't know what the hell they are. where can i find info that explains this stuff?


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 92,264 Mod ✭✭✭✭Capt'n Midnight


    The FPU is like a scientific calculator - it does all the trigonometry, e.g. for games

    Early Intel chips were like ordinary calculators: add, subtract, multiply and divide.
    E.g. the 8088/8086 (XT), 286 (AT), 386 and 486SX all needed a second chip to do the number crunching: 8087 / 287 / 387 / 486DX (the 487 being a cynical marketing ploy)

    You can treat a 486 as being a 386 AND a 387 in the same chip (with a few tweaks that make it about 1.5 times as fast)

    Pentiums and later are a bit odd: they are a bit like two 386s and a 387 in the same package ...


    Remember, very clever people have been figuring out tricks in computer hardware for the last 60 years, so it can be a bit difficult to comprehend why it looks the way it does without considering the evolution.


  • Registered Users, Registered Users 2 Posts: 2,150 ✭✭✭dazberry


    Originally posted by Capt'n Midnight
    The FPU is like a scientific calculator - it does all the trigonometry, e.g. for games

    TBH the games stuff was where the real cleverness came in. For example, SIN tables were precalculated in fixed-point math and used as lookups for matrix calculations. They didn't have the precision of FP but didn't need it.
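
A minimal sketch of that lookup-table trick follows. The 16.16 fixed-point format, the 256-entry table and all names are illustrative assumptions, not what any particular game used. The table is built once with floating point; after that, rotating a point needs only integer multiplies and shifts.

```python
import math

FRAC_BITS = 16            # assumed 16.16 fixed-point format
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0
TABLE_SIZE = 256          # angles run 0..255 for a full circle

# Precalculated once; at run time only integer arithmetic is used.
SIN_TABLE = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * ONE)
             for i in range(TABLE_SIZE)]


def fsin(angle: int) -> int:
    """Fixed-point sine of an angle given in 1/256ths of a circle."""
    return SIN_TABLE[angle % TABLE_SIZE]


def fcos(angle: int) -> int:
    """Cosine is the sine table read a quarter circle ahead."""
    return SIN_TABLE[(angle + TABLE_SIZE // 4) % TABLE_SIZE]


def rotate(x: int, y: int, angle: int) -> tuple:
    """Rotate the integer point (x, y) about the origin."""
    s, c = fsin(angle), fcos(angle)
    return (x * c - y * s) >> FRAC_BITS, (x * s + y * c) >> FRAC_BITS


print(rotate(100, 0, 64))   # quarter turn
```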

    The other issue with the pre-Pentium FPUs was that they weren't pipelined, so they were particularly slow. In addition, because any x86 prior to the Pentium may or may not have had an FPU, there was a penalty for calling an FPU instruction (ESC), which could in effect take 14+ clock cycles just to prepare when accessing memory, and that was before performing the FPU function. I can only guess, because I never did (or looked into doing) any x87 stuff, but methinks the CPU was idle while the FPU was working; it was sort of co-operative processing rather than multi-processing.

    If the FPU didn't exist, an FPU (missing :D) exception was raised and hopefully the calculation was run in standard CPU code, albeit slower again. I do recall seeing programs supplied with 2 exes, and for that matter I do remember compiler switches with emulate-FPU code options.
    486DX (487 being a cynical marketing ploy)

    LOL - I'd forgotten about that. If I remember correctly the 486 SX and DX both had a 487 built in, but the SX version had the FPU turned off and it couldn't be turned back on. I suppose that was the first time that the segmentation economics of it all came into play.

    D.


  • Registered Users, Registered Users 2 Posts: 5,554 ✭✭✭CyberGhost


    yea, it's so hard to understand the deep stuff of the computers

    Thanks for the explanations!


  • Registered Users, Registered Users 2 Posts: 6,949 ✭✭✭SouperComputer


    cool, maybe later I'll actually have a chance to read this thread!


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 92,264 Mod ✭✭✭✭Capt'n Midnight


    Originally posted by dazberry
    The other issue with the pre-Pentium FPUs was that they weren't pipelined, so they were particularly slow.

    Ah yeah, all that fuss about Pentium pipelining - ha! Badly ordered instructions could delay the pipeline so much that a 486 was faster for some code! I.e. the Pentium was only faster when the instruction order was optimised, and then it was maybe 40% faster - but even 486s were about 25% faster than before with the optimised code :)
    I had to put up with a lot of abuse from people who couldn't get around the fact that old code running on a Pentium P60 was not going to be much faster than on a 486/66 (and one muppet who reckoned his 386 was faster - so I asked him what speed it was - 66, he said. He knew that the motherboard had been changed and still called it a 386!)

    Bottom line: code optimised for the processor will also increase the speed, especially for floating point ops (FLOPS), since a single native instruction will (almost) always be faster than several instructions on an older processor.

    And then there was the Pentium Pro - guaranteed to be SLOWER running older apps than older processors (something that seemed to be impossible to explain to sales droids on commission) - but it was faster if you used the new instructions, i.e. FLOPS went up, but it looked slower if you used an old benchmarking utility.

    PS. There was one Silicon Graphics chip that had an instruction that said something like multiply A x B and add C, and it could process 4 of these instructions in parallel - so every clock cycle you got four results (OK, it was integer multiplication, not floating point)


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 92,264 Mod ✭✭✭✭Capt'n Midnight


    Architecting the Future http://www.aceshardware.com/read.jsp?id=55000245
    http://www.aceshardware.com/read.jsp?id=60000273
    designs that have fully embraced the multithreaded approach like the Niagara processor (4 threads per core, 8 cores per die, 32 threads per chip).

    i.e. this chip will be able to execute 32 simple instructions at the same time!

