Newsgroups: comp.parallel
From: [email protected] (Allan Gottlieb)
Subject: Info on some new parallel machines
Nntp-Posting-Host: allan.ultra.nyu.edu
Organization: New York University, Ultracomputer project
Date: 18 Dec 92 12:31:15
A week or two ago, in response to a request for information on KSR,
I posted the KSR section of a paper I presented at PACTA '92 in
Barcelona in September.  I received a bunch of requests for a posting
of the entire paper, which I did.  Unfortunately, it seems to have
disappeared somewhere between here and Clemson, so I am trying again.
I doubt that anyone will get this twice, but if so, please let me know
and accept my apologies.

Allan Gottlieb

\"      Format via
\"              troff -me filename
\"      New Century Schoolbook fonts
\"      Delete next three lines if you don't have the font
fp 1 NR                \" normal
fp 2 NI                \" italic
fp 3 NB                \" bold
sz 11
nr pp 11
nr ps 1v        .\" They want double space before paragraph
nr sp 12
nr fp 10
pl 26c
m1 1c
m2 0
m3 0
m4 0
ll 14c
tp
(l C
sz +2
b "Architectures for Parallel Supercomputing
sz -2
sp .5c
Allan Gottlieb
sp 1.5c
Ultracomputer Research Laboratory
New York University
715 Broadway, Tenth Floor
New York NY 10003   USA
)l
sp 1c
sh 1 Introduction
lp
In this talk, I will describe the architectures of new commercial
offerings from Kendall Square Research, Thinking Machines
Corporation, Intel Corporation, and the MasPar Computer Corporation.
These products span much of the currently active design space for
parallel supercomputers, including shared memory and message passing,
MIMD and SIMD, and processor sizes from a square millimeter to
hundreds of square centimeters.  However, at least one commercially
important class is omitted: the parallel vector supercomputers, whose
death at the hands of the highly parallel invaders has been greatly
exaggerated (shades of Mark Twain).  Another premature death notice
may have been given to FORTRAN, since all these machines speak (or
rather understand) this language, but that is another talk.
sh 1 "New Commercial Offerings"
lp
I will describe the architectures of four new commercial offerings:
The shared-memory MIMD KSR1 from Kendall Square Research; two
message-passing MIMD computers, the Connection Machine CM-5 from
Thinking Machines Corporation and the Paragon XP/S from Intel
Corporation; and the SIMD MP-1 from the MasPar Computer Corporation.
Much of this section is adapted from material prepared for the
forthcoming second edition of
i "Highly Parallel Computing" ,
a book I co-author with George Almasi of IBM's T.J. Watson Research
Center.
sh 2 "The Kendall Square Research KSR1"
lp
The KSR1 is a shared-memory MIMD computer with private, consistent
caches, that is, each processor has its own cache and the system
hardware guarantees that the multiple caches are kept in agreement.
In this regard the design is similar to the MIT Alewife [ACDJ91] and the
Stanford DASH [LLSJ92].  There are, however, three significant differences
between the KSR1 and the two university designs.  First, the Kendall
Square machine is a large-scale, commercial effort: the current design
supports 1088 processors and can be extended to tens of thousands.
Second, the KSR1 features an ALLCACHE memory, which we explain below.
Finally, the KSR1, like the Illinois Cedar [GKLS84], is a hierarchical
design: a small machine is a ring or
q "Selection Engine"
of up to 32 processors (called an SE:0); to achieve
1088 processors, an SE:1 ring of 34 SE:0 rings is assembled.  Larger
machines would use yet higher level rings.  More information on the
KSR1 can be found in [Roth92].
sh 3 Hardware
lp
A 32-processor configuration (i.e., a full SE:0 ring) with 1 gigabyte
of memory and 10 gigabytes of disk requires 6 kilowatts of power and 2
square meters of floor space.  This configuration has a peak
computational performance of 1.28 GFLOPS and a peak I/O bandwidth of
420 megabytes/sec.  In a March 1992 posting to the comp.parallel
electronic newsgroup, Tom Dunigan reported that a 32-processor KSR1 at
the Oak Ridge National Laboratory attained 513 MFLOPS on the
1000\(mu1000 LINPACK benchmark.  A full SE:1 ring with 1088 processors
equipped with 34.8 gigabytes of memory and 1 terabyte of disk would
require 150 kilowatts and 74 square meters.  Such a system would have
a peak floating point performance of 43.5 GFLOPS and a peak I/O
bandwidth of 15.3 gigabytes/sec.
pp
Each KSR1 processor is a superscalar 64-bit unit able to issue up to
two instructions every 50 ns, giving a peak performance rating of 40
MIPS.  (KSR is more conservative and rates the processor at 20 MIPS,
since only one of the two instructions issued can be computational,
but I feel that both instructions should be counted.  If there is any
virtue in peak MIPS ratings, and I am not sure there is, it is that
the ratings are calculated the same way for all architectures.)  Since
a single floating point instruction can perform a multiply and an add,
the peak floating point performance is 40 MFLOPS.  At present, a KSR1
system contains from eight to 1088 processors (giving a system-wide
peak of 43,520 MIPS and 43,520 MFLOPS) all sharing a common virtual
address space of one million megabytes.
pp
The processor is implemented as a four-chip set consisting of a
control unit and three co-processors, with all chips fabricated in 1.2
micron CMOS.  Up to two instructions are issued on each clock cycle.
The floating point co-processor supports IEEE single and double
precision and includes linked triads similar to the multiply-and-add
instructions found in the Intel Paragon.  The integer/logical
co-processor contains its own set of thirty-two 64-bit registers and
performs the usual arithmetic and logical operations.  The final
co-processor provides a 32-MB/sec I/O channel at each processor.  Each
processor board also contains a 256KB data cache and a 256KB
instruction cache.  These caches are conventional in organization
though large in size, and should not be confused with the ALLCACHE
(main) memory discussed below.
sh 3 "ALLCACHE Memory and the Ring of Rings"
lp
Normally, caches are viewed as small temporary storage vehicles for
data, whose permanent copy resides in central memory.  The KSR1 is
more complicated in this respect.  It does have, at each processor,
standard instruction and data caches, as mentioned above.  However,
these are just the first-level caches.
i Instead
of having main memory to back up these first-level caches, the KSR1
has second-level caches, which are then backed up by
i disks .
That is,
there is no central memory; all machine resident data and instructions
are contained in one or more caches, which is why KSR uses the term
ALLCACHE memory.  The data (as opposed to control) portion of the
second-level caches is implemented using the same DRAM technology
normally found in central memory.  Thus, although they function as
caches, these structures have the capacity and performance of main memory.
pp
When a (local, second-level) cache miss occurs on processor A,
the address is sent around the SE:0 ring.  If the requested address
resides in B, another one of the processor/local-cache pairs on the same
SE:0 ring, B
forwards the cache line (a 128-byte unit, called a subpage by KSR) to A
again using the (unidirectional) SE:0 ring.  Depending on the access
performed, B may keep a copy of the subpage (thus sharing it with A) or
may cause all existing copies to be invalidated (thus giving A
exclusive access to the subpage).  When the response arrives at A, it
is stored in the local cache, possibly evicting previously stored
data.  (If this is the only copy of the old data, special actions are
taken not to evict it.)  Measurements at Oak Ridge indicate a 6.7 microsecond
latency for their (32-processor) SE:0 ring.
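pp
To make the protocol concrete, the toy C model below traces a
subpage request around a unidirectional ring.  All names and data
structures are mine, and the sketch captures only the
shared-versus-exclusive distinction; the real hardware protocol is
considerably more involved.
(b L
nf
\f(CW
/* Toy model of a subpage request on a unidirectional SE:0 ring.
   Hypothetical names; the real protocol is more involved.        */
#include <stdio.h>

#define NODES    8           /* processor/local-cache pairs        */
#define SUBPAGES 16          /* tiny space of 128-byte subpages    */

enum state { INVALID, SHARED, EXCLUSIVE };

static enum state ring[NODES][SUBPAGES];

/* Node a requests subpage sp; wr != 0 asks for exclusive access.
   The request visits the other nodes in ring order.              */
static int request(int a, int sp, int wr)
{
    int supplier = -1;                /* hops to the first holder  */
    for (int hop = 1; hop < NODES; hop++) {
        int b = (a + hop) % NODES;
        if (ring[b][sp] == INVALID)
            continue;                 /* b passes the request on   */
        if (supplier < 0)
            supplier = hop;           /* b forwards the subpage    */
        if (wr)
            ring[b][sp] = INVALID;    /* invalidate every copy     */
        else {
            ring[b][sp] = SHARED;     /* b keeps a shared copy     */
            break;
        }
    }
    if (supplier < 0)
        return -1;                    /* not on this ring: go up   */
    ring[a][sp] = wr ? EXCLUSIVE : SHARED;
    return supplier;
}

int main(void)
{
    ring[5][3] = EXCLUSIVE;           /* node 5 owns subpage 3     */
    printf("read hops: %d; ", request(0, 3, 0));
    printf("write hops: %d", request(2, 3, 1));
    return 0;
}
\fP
fi
)b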
pp
If the requested address resides in processor/local-cache C, which is
located on
i another
SE:0 ring, the situation is more interesting.  Each SE:0 includes an
ARD (ALLCACHE routing and directory cell), containing a large
directory with an entry for every subpage stored on the entire
SE:0.\**
(f
\**Actually, an entry for every page, giving the state of each subpage.
)f
If the ARD determines that the subpage is not contained in the current
ring, the request is sent
q up
the hierarchy to the (unidirectional) SE:1 ring,
which is composed solely of ARDs, each essentially a copy of the ARD
q below
it.  When the request reaches the SE:1 ARD above the SE:0 ring
containing C, the request is sent down and traverses the ring to C, where
it is satisfied.  The response from C continues on the SE:0 ring to
the ARD, goes back up, then around the SE:1 ring, down to the SE:0
ring containing A, and finally around this ring to A.
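pp
The same idea in miniature: the C fragment below (again with
hypothetical names) consults the home ring's ARD first and walks the
SE:1 ring of ARD copies only on a local miss.
(b L
nf
\f(CW
/* Sketch of the two-level lookup.  ard[r][sp] != 0 means some
   cache on SE:0 ring r holds subpage sp; each SE:1 node is
   essentially a copy of the ARD of the ring below it.            */
#include <stdio.h>

#define RINGS     4          /* SE:0 rings on one SE:1 ring        */
#define SUBPAGES 16

static int ard[RINGS][SUBPAGES];

static int locate(int home, int sp)
{
    if (ard[home][sp])
        return home;         /* satisfied on the local SE:0 ring   */
    for (int hop = 1; hop < RINGS; hop++) {
        int r = (home + hop) % RINGS;
        if (ard[r][sp])
            return r;        /* up, around the SE:1 ring, and down */
    }
    return -1;               /* nowhere resident: a page fault     */
}

int main(void)
{
    ard[2][7] = 1;           /* some cache on ring 2 holds it      */
    printf("subpage 7 is on ring %d", locate(0, 7));
    return 0;
}
\fP
fi
)b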
pp
Another difference between the KSR1 caches and the more conventional
variety is size.  These are BIG caches, 32MB per processor.  Recall
that they replace the conventional main memory and hence are
implemented using dense DRAM technology.
pp
The SE:0 bandwidth is 1 GB/sec and the SE:1 bandwidth can be
configured to be 1, 2, or 4 GB/sec, with larger values more
appropriate for systems with many SE:0s (cf. the fat trees used in the
CM-5).  Readers interested in a performance comparison between ALLCACHE
and more conventional memory organizations should read [SJG92].
Another architecture using the ALLCACHE design is the Data Diffusion
Machine from the Swedish Institute of Computer Science [HHW90].
sh 4 Software
lp
The KSR operating system is an extension of the OSF/1 version of Unix.
As is often the case with shared-memory systems, the KSR operating
system runs on the KSR1 itself and not on an additional
q host
system.  The latter approach is normally used on message-passing
systems like the CM-5, in which case only a subset of the OS functions
runs directly on the main system.  Using the terminology of [AG89],
the KSR operating system is symmetric, whereas the CM-5 uses a
master-slave approach.  Processor allocation is performed dynamically
by the KSR operating system, i.e., the number of processors assigned to
a specific job varies with time.
pp
A fairly rich software environment is supplied, including the X window
system with the Motif user interface; FORTRAN, C, and COBOL; the
ORACLE relational database management system; and AT&T's Tuxedo for
transaction processing.
pp
A FORTRAN programmer may request automatic parallelization of his/her
program or may specify the parallelism explicitly; a C programmer has
only the latter option.
sh 2 "The TMC Connection Machine CM-5"
lp
Thinking Machines Corporation has become well known for its SIMD
Connection Machines CM-1 and CM-2.  Somewhat
surprisingly, its next offering, the CM-5, has moved into the MIMD world
(although, as we shall see, there is still hardware support for a
synchronous style of programming).  Readers seeking additional
information should consult [TMC91].
sh 3 Architecture
lp
At the very coarsest level of detail, the CM-5 is simply a
message-passing MIMD machine, another descendant of the Caltech Cosmic
Cube [Seit85].  But such a description leaves out a great deal.  The
interconnection topology is a fat tree, there is support for SIMD, a
combining control network is provided, vector units are available, and
the machine is powerful.  We discuss each of these in turn.
pp
A fat tree is a binary tree in which links higher in the tree have
greater bandwidth (e.g., one can keep the clock constant and use wider
buses near the root).  Unlike hypercube machines such as the CM-1 and
CM-2, a node in the CM-5 has a constant number of nearest neighbors
independent of the size of the machine.  In addition, the bandwidth
available per processor for random communication patterns remains
constant as the machine size increases, whereas this bandwidth
decreases for meshes (or non-fat trees).  Local communication is
favored by the CM-5, but by only a factor of 4 over random
communication (20 MB/sec vs. 5 MB/sec), which is much less than in
other machines such as the CM-2.  Also attached to this fat tree are I/O
interfaces.  The device side of these interfaces can support 20MB/sec;
higher speed devices are accommodated by ganging together multiple
interfaces.  (If the destination node for the I/O is far from the
interface, the sustainable bandwidth is also limited by the fat
tree to 5MB/sec.)
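pp
A back-of-the-envelope calculation shows why the per-processor
bandwidth claim holds.  For random traffic roughly half of all
messages cross the machine's midline, so per-processor bandwidth is
about twice the bisection bandwidth divided by the number of
processors P.  The C fragment below assumes an idealized fat tree
whose bisection grows linearly with P (the CM-5's tree is
deliberately thinner near the root) and a 2-dimensional mesh whose
bisection grows only as the square root of P; the numbers are
illustrative, not CM-5 measurements.
(b L
nf
\f(CW
/* Per-processor bandwidth for random traffic: ~2*bisection/P.    */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double link = 20e6;      /* assume 20 MB/sec per leaf link     */
    for (int p = 64; p <= 16384; p *= 4) {
        double fat  = link * p / 2;           /* bisection ~ P     */
        double mesh = link * sqrt((double)p); /* bisection ~ sqrtP */
        printf("P=%5d: fat tree %2.0f MB/sec, mesh %4.1f MB/sec; ",
               p, 2 * fat / p / 1e6, 2 * mesh / p / 1e6);
    }
    return 0;
}
\fP
fi
)b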
pp
The fat tree just discussed is actually one of three networks on the
CM-5.  In addition to this
q "data network" ,
there is a diagnostic network used for fault detection and a control
network that we turn to next.
One function of the control network is to provide rapid
synchronization of the processors, which is accomplished by a
global OR operation that completes shortly after the last
participating processor sets its value.  This
q "cheap barrier"
permits the main advantage of SIMD (permanent synchrony implying no
race conditions) without requiring that the processors always execute
the same instruction.
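pp
The C sketch below uses shared memory and a generation counter to
model what the control network does in hardware: the barrier opens as
soon as the OR of the
q "not yet here"
bits drops to zero.  It is a stand-in for illustration and not the
CM-5 programming interface.
(b L
nf
\f(CW
/* Global-OR barrier modeled with pthreads.  Hypothetical names.  */
#include <pthread.h>
#include <stdio.h>

#define P 4                           /* participating processors */

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static unsigned not_here = (1u << P) - 1;  /* one bit per node    */
static unsigned phase = 0;

static void barrier(int me)
{
    pthread_mutex_lock(&lk);
    unsigned my_phase = phase;
    not_here &= ~(1u << me);          /* lower my wire            */
    if (not_here == 0) {              /* global OR is now false   */
        not_here = (1u << P) - 1;     /* re-arm for next phase    */
        phase++;
        pthread_cond_broadcast(&cv);
    } else {
        while (phase == my_phase)     /* wait for the OR to drop  */
            pthread_cond_wait(&cv, &lk);
    }
    pthread_mutex_unlock(&lk);
}

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    for (int step = 0; step < 3; step++) {
        /* asynchronous MIMD work goes here ...                   */
        barrier(me);                  /* ... then a cheap resync  */
    }
    printf("processor %d done; ", me);
    return NULL;
}

int main(void)
{
    pthread_t t[P];
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    return 0;
}
\fP
fi
)b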
pp
A second function of the control network is to provide a form of
hardware combining, specifically to support reduction and parallel
prefix calculations.  A parallel prefix computation for a given binary
operator \(*f (say addition) begins with each processor specifying a
value and ends with each processor obtaining the sum of the values
provided by itself and all lower-numbered processors.  These parallel
prefix computations may be viewed as the synchronous, and hence
deterministic, analogue of the fetch-and-phi operation found in the
NYU Ultracomputer [GGKM83].  The CM-5 supports addition, maximum,
logical OR, and XOR.  Two variants are also supplied: a parallel
suffix and a segmented parallel prefix (and suffix).  With a segmented
operation (think of worms, not virtual memory, and see [SCHW80]), each
processor can set a flag indicating that it begins a segment and the
prefix computation is done separately for each segment.  Reduction
operations are similar: each processor supplies a value and all
processors obtain the sum of all values (again, maximum, OR, and XOR
are supported as well).
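pp
The semantics are easy to state in serial C.  The reference model
below computes the inclusive segmented +-scan; the control network of
course computes it cooperatively rather than by walking an array.
(b L
nf
\f(CW
/* Serial reference model of the inclusive segmented +-scan.      */
#include <stdio.h>

#define P 8

static void seg_plus_scan(const int val[P], const int head[P],
                          int out[P])
{
    int sum = 0;
    for (int i = 0; i < P; i++) {
        if (head[i])
            sum = 0;         /* processor i begins a new segment  */
        sum += val[i];
        out[i] = sum;        /* sum from segment start through i  */
    }
}

int main(void)
{
    int val[P]  = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int head[P] = { 1, 0, 0, 1, 0, 1, 0, 0 };  /* three segments  */
    int out[P];
    seg_plus_scan(val, head, out);
    for (int i = 0; i < P; i++)
        printf("%d ", out[i]);       /* prints 1 3 6 4 9 6 13 21  */
    return 0;
}
\fP
fi
)b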
pp
Each node of a CM-5 contains a SPARC microprocessor for scalar
operations (users are advised against coding in assembler, a hint that
the engine may change), a 64KB cache, and up to 32 MB of local memory.
Memory is accessed 64 bits at a time (plus 8 bits for ECC).  An option
available with the CM-5 is the incorporation of four vector units
between each processor and its associated memory.  When the vector
units are installed, memory is organized as four 8 MB banks, one
connected to each unit.  Each vector unit can perform both
floating-point and integer operations, either one at a peak rate of 32
million 64-bit operations per second.
pp
As mentioned above, the CM-5 is quite a powerful computer.  With the
vector units present, each node has a peak performance of 128 64-bit
MFLOPS or 128 64-bit integer MOPS.  The machine is designed for a
maximum of 256K nodes but the current implementation is
q "limited"
to 16K due to restrictions on cable lengths.  Since the peak
computational rate for a 16K-node system exceeds 2 teraflops, one might
assert that the age of (peak)
q "teraflop computing"
has arrived.  However, as I write this in May 1992, the largest
announced delivery of a CM-5 is a 1K-node configuration without vector
units.  A full 16K system would cost about one-half billion U.S.
dollars.
sh 3 "Software and Environment"
lp
In addition to the computation nodes just described, of which there
may be thousands, a CM-5 contains a few control processors that act
as hosts into which users log in.  The reason for multiple control
processors is that the system administrator can divide the CM-5 into
partitions, each with an individual control processor as host.  The
host provides a conventional
sm UNIX -like
operating system; in particular, users can timeshare a single
partition.  Each computation node runs an operating system microkernel
supporting a subset of the full functionality available on the control
processor acting as its host (a master-slave approach; see [AG89]).
pp
Parallel versions of Fortran, C, and Lisp are provided.  CM Fortran is
a mild extension of Fortran 90.  Additional features include a
\f(CWforall\fP statement and vector-valued subscripts.  For an example
of the latter, assume that \f(CWA\fP and \f(CWP\fP are vectors of size
20 with all \f(CWP(I)\fP between 1 and 20; then \f(CWA=A(P)\fP performs
the 20 parallel assignments \f(CWA(I)=A(P(I))\fP.
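pp
The Fortran 90 semantics require every element of \f(CWA\fP to be
read before any is written.  A serial C rendering of the same gather
(my code, with the 1-based Fortran subscripts made explicit)
therefore needs a temporary:
(b L
nf
\f(CW
/* Serial semantics of A = A(P): all reads precede all writes.    */
#include <stdio.h>

#define N 20

static void gather_assign(double a[N], const int p[N])
{
    double t[N];
    for (int i = 0; i < N; i++)
        t[i] = a[p[i] - 1];          /* read phase                */
    for (int i = 0; i < N; i++)
        a[i] = t[i];                 /* write phase               */
}

int main(void)
{
    double a[N];
    int p[N];
    for (int i = 0; i < N; i++) {
        a[i] = i + 1.0;
        p[i] = N - i;                /* P reverses the vector     */
    }
    gather_assign(a, p);
    printf("%g %g", a[0], a[N - 1]); /* prints 20 1               */
    return 0;
}
\fP
fi
)b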
pp
An important contribution is the CM Scientific Software Library, a
growing set of numerical routines hand-tailored to exploit the CM-5
hardware.  Although primarily intended for the CM Fortran user, the
library is also usable from TMC's versions of C and Lisp, C* and
*Lisp.  To date the library developers have concentrated on linear
algebra, FFTs, random number generators, and statistical analyses.
pp
In addition to supporting the data parallel model of computing
typified by Fortran 90, the CM-5 also supports synchronous (i.e.,
blocking) message passing in which the sender does not proceed until
its message is received.  (This is the rendezvous model used in
Ada and CSP.)  Limited support for asynchronous message passing is
provided and further support is expected.
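pp
For readers unfamiliar with the rendezvous model, the C sketch below
implements a toy single-slot channel in which the sender blocks until
the receiver has taken the message.  The names are mine and bear no
relation to the CM-5 message-passing library.
(b L
nf
\f(CW
/* Toy rendezvous channel: send_msg returns only after a matching
   recv_msg has taken the message, as in Ada or CSP.              */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int slot, full = 0, taken = 0;

static void send_msg(int v)
{
    pthread_mutex_lock(&lk);
    slot = v; full = 1; taken = 0;
    pthread_cond_broadcast(&cv);
    while (!taken)                  /* block until message taken  */
        pthread_cond_wait(&cv, &lk);
    pthread_mutex_unlock(&lk);
}

static int recv_msg(void)
{
    pthread_mutex_lock(&lk);
    while (!full)
        pthread_cond_wait(&cv, &lk);
    int v = slot; full = 0; taken = 1;
    pthread_cond_broadcast(&cv);    /* release the blocked sender */
    pthread_mutex_unlock(&lk);
    return v;
}

static void *receiver(void *arg)
{
    (void)arg;
    printf("got %d; ", recv_msg());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, receiver, NULL);
    send_msg(42);                   /* returns only on delivery   */
    pthread_join(t, NULL);
    return 0;
}
\fP
fi
)b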
sh 2 "The Intel Paragon XP/S"
lp
The Intel Paragon XP/S Supercomputer [Inte91] is powered by a
collection of up to 4096 Intel i860 XP processors and can be
configured to provide peak performance ranging from 5 to 300 GFLOPS
(64-bit).  The processing nodes are connected in a rectangular mesh
pattern, unlike the hypercube connection pattern used in the earlier
Intel iPSC/860.
pp
The i860 XP node processor chip (2.5 million transistors)
has a peak performance of 75 MFLOPS (64-bit)
and 42 MIPS when operating at 50 MHz.
The chip contains 16KByte data and instruction caches,
and can issue a multiply and add instruction in one cycle
[DS90].
The maximum bandwidth from cache to floating point unit is
800 MBytes/s.
Communication bandwidth
between any two nodes is 200 MByte/sec
full duplex.  Each node also has 16-128 MBytes of memory and
a second i860 XP processor devoted to
communication.
pp
The prototype for the Paragon, the Touchstone Delta, was installed at
Caltech\** in 1991
(f
\**^The machine is owned by the Concurrent Supercomputing Consortium,
an alliance of universities, laboratories, federal agencies, and
industry.
)f
and immediately began to compete with the CM-2 Connection Machine for
the title of
q "world's fastest supercomputer" .
The lead changed
hands several times.\**
(f
\**\^One point of reference is the 16 GFLOPS reported at the
Supercomputing '91 conference for seismic modeling on the CM-2
[MS91].
)f
pp
The Delta system consists of 576 nodes arranged in a mesh that has 16
rows and 36 columns.  Thirty-three of the columns form a computational
array of 528 numeric nodes (computing nodes) that each contain an
Intel i860 microprocessor and 16 MBytes of memory.  This computational
array is flanked on each side by a column of I/O nodes that each
contain a 1.4 GByte disk (the number of disks is to be doubled later).
The last column contains two HIPPI interfaces (100 Mbyte/sec each) and
an assortment of tape, ethernet, and service nodes.  Routing chips
provide internode communication at 25 MByte/sec with a latency of 80
microseconds.  The peak performance of
the i860 processor is 60 MFLOPS (64-bit), which translates to a peak
performance for the Delta of over 30 GFLOPS (64-bit).
Achievable speeds in the range 1-15 GFLOPS have been claimed.
Total memory is 8.4 GBytes; on-line disk capacity is 45 GBytes, to be
increased to 90 GBytes.
pp
The operating system being developed for the Delta consists of OSF/1
with extensions for massively parallel systems.  The extensions
include a decomposition of OSF/1 into a pure Mach kernel (OSF/1 is
based on Mach), and a modular server framework that can be used to
provide distributed file, network, and process management services.
pp
The system software for interprocess communication is compatible with
that of the iPSC/860.  The Express environment is also available.
Language support includes Fortran and C.
The Consortium intends to allocate 80% of the Delta's time for
q "Grand Challenge"
problems (q.v.).
sh 2 "The MasPar MP-1"
lp
Given the success of the CM-1 and CM-2, it is not surprising to see another
manufacturer produce a machine in the same architectural class (SIMD, tiny
processor).  What perhaps
i "is"
surprising is that Thinking Machines, with the new CM-5, has moved to an
MIMD design.  The MasPar Computer
Corporation's MP-1 system, introduced in 1990, features an SIMD array of up
to 16K 4-bit processors organized as a 2-dimensional array with each
processor connected to its 8 nearest neighbors (i.e., the NEWS grid of the
CM-1 plus the four diagonals).  MasPar refers to this interconnection topology as the
X-Net.  The MP-1 also contains an array control unit that fetches and
decodes instructions, computes addresses and other scalars, and sends
control signals to the processor array.
pp
An MP-1 system of maximum size has a peak speed of 26 GIPS (32-bit
operations) or 550 MFLOPS (double precision) and dissipates about a
kilowatt (not including I/O).  The maximum memory size is 1GB and the
maximum bandwidth to memory is 12 GB/sec.  When the X-Net is used, the
maximum aggregate inter-PE communication bandwidth is 23 GB/sec.  In
addition, a three-stage global routing network is provided, utilizing
custom routing chips and achieving up to 1.3 GB/sec aggregate bandwidth.
This same network is also connected to a 256 MB I/O RAM buffer that is in
turn connected to a frame buffer and various I/O devices.
pp
Although the processor is internally a 4-bit device (e.g., the datapaths are
4 bits wide), it contains 40 programmer-visible, 32-bit registers and
supports integer operands of 1, 8, 16, 32, or 64 bits.  In addition, the
same hardware performs 32- and 64-bit floating point operations.  This last
characteristic is reminiscent of the CM-1 design, but not the CM-2 with its
separate Weiteks.  Indeed, a 16K MP-1 performs 16K floating point adds
as fast as it performs one, whereas a 64K CM-2 performs only 2K floating
point adds concurrently (one per Weitek).  The tradeoff is naturally in
single-processor floating point speed.  The larger, and hence less
numerous, Weiteks produce several MFLOPS each; the MP-1 processors achieve
only a few dozen KFLOPS (which still surpasses the older CM-1 processors).
pp
MasPar is able to package 32 of these 4-bit processors on a single chip,
illustrating the improved technology now available (two-level metal, 1.6
micron CMOS with 450,000 transistors) compared to the circa-1985 technology
used in the CM-1, which contained only 16 1-bit processors per chip.  Each
14"x19" processor board contains 1024 processors, clocked at 80 ns, and
16 MB of ECC memory, the latter organized as 16 KB per processor and
implemented using page-mode 1 Mb DRAMs.
pp
A DECstation 5000 is used as a host and manages program execution, user
interface, and network communications for an MP-1 system.  The languages
supported include data parallel versions of FORTRAN and C as well as the
MasPar Parallel Application Language (MPL) that permits direct program
control of the hardware.  Ultrix, DEC's version of UNIX, runs on the host
and provides a standard user interface.  DEC markets the MP-1 as the DECmpp
12000.
pp
Further information on the MP-1 can be found in [Chri90], [Nick90],
[Blan90], and [Masp91].  An unconventional assessment of virtual
processors, as used for example in the CM-2, appears in [Chri91].
uh References
(b I F
ll 14c
ti 0
[ACDJ91]
Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David
Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa,
Dan Nussbaum, Mike Parkin, and Donald Yeung,
q "The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor" ,
in
i "Proceedings of Workshop on Scalable Shared Memory Multiprocessors" ,
Kluwer Academic Publishers,
1991.
)b
(b I F
ll 14c
ti 0
[AG89]
George Almasi and Allan Gottlieb,
i "Highly Parallel Computing" ,
Benjamin/Cummings,
1989, 519 pages.
)b
(b I F
ll 14c
ti 0
[Blan90]
Tom Blank,
q "The MasPar MP-1 Architecture" ,
i "IEEE COMPCON Proceedings" ,
1990, pp. 20-24.
)b
(b I F
ll 14c
ti 0
[Chri90]
Peter Christy,
q "Software to Support Massively Parallel Computing on the MasPar MP-1" ,
i "IEEE COMPCON Proceedings" ,
1990,
pp. 29-33.
)b
(b I F
ll 14c
ti 0
[Chri91]
Peter Christy,
q "Virtual Processors Considered Harmful" ,
i "Sixth Distributed Memory Computing Conference Proceedings" ,
1991.
)b
(b I F
ll 14c
ti 0
[DS90]
Robert B.K. Dewar and Matthew Smosna,
i "Microprocessors: A Programmers View" ,
McGraw-Hill, New York, 1990.
)b
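(b I F
ll 14c
ti 0
[GGKM83]
Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe,
Larry Rudolph, and Marc Snir,
q "The NYU Ultracomputer \(em Designing an MIMD Shared Memory Parallel Computer" ,
i "IEEE Transactions on Computers" ,
b C-32
(2),
February 1983,
pp. 175-189.
)b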
(b I F
ll 14c
ti 0
[GKLS84]
Daniel Gajski, David Kuck, Duncan Lawrie, and Ahmed Sameh,
q "Cedar" ,
in
i "Supercomputers: Design and Applications" ,
Kai Hwang, ed., 1984.
)b
(b I F
ll 14c
ti 0
[HHW90]
E. Hagersten, S. Haridi, and D.H.D. Warren,
q "The Cache-Coherent Protocol of the Data Diffusion Machine" ,
i "Cache and Interconnect Architectures in Multiprocessors" ,
edited by Michel Dubois and Shreekant Thakkar, 1990.
)b
(b I F
ll 14c
ti 0
[Inte91]
Intel Corporation literature, November 1991.
)b
(b I F
ll 14c
ti 0
[LLSJ92]
Dan Lenoski, James Laudon, Luis Stevens, Truman Joe,
Dave Nakahira, Anoop Gupta, and John Hennessy,
q "The DASH Prototype: Implementation and Performance" ,
i "Proc. 19th Annual International Symposium on Computer Archtecture" ,
May, 1992,
Gold Coast, Australia,
pp. 92-103.
)b
(b I F
ll 14c
ti 0
[Masp91]
q "MP-1 Family Massively Parallel Computers" ,
MasPar Computer Corporation,
1991.
)b
(b I F
ll 14c
ti 0
[MS91]
Jacek Myczkowski and Guy Steele,
q "Seismic Modeling at 14 gigaflops on the Connection Machine" ,
i "Proc. Supercomputing '91" ,
Albuquerque, November, 1991.
)b
(b I F
ll 14c
ti 0
[Nick90]
John R. Nickolls,
q "The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer" ,
i "IEEE COMPCON Proceedings" , 1990, pp. 25-28.
)b
(b I F
ll 14c
ti 0
[Roth92]
James Rothnie,
q "Overview of the KSR1 Computer System" ,
Kendall Square Research Report TR 9202001,
March 1992.
)b
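(b I F
ll 14c
ti 0
[SCHW80]
Jacob T. Schwartz,
q "Ultracomputers" ,
i "ACM Transactions on Programming Languages and Systems" ,
b 2
(4),
October 1980,
pp. 484-521.
)b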
(b I F
ll 14c
ti 0
[Seit85]
Charles L. Seitz,
q "The Cosmic Cube" ,
i "Communications of the ACM" ,
b 28
(1),
January 1985,
pp. 22-33.
)b
(b I F
ll 14c
ti 0
[SJG92]
Per Stenstrom, Truman Joe, and Anoop Gupta,
q "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures" ,
i "Proceedings, 19th International Symposium on Computer Architecture" ,
1992.
)b
(b I F
ll 14c
ti 0
[TMC91]
q "The Connection Machine CM-5 Technical Summary" ,
Thinking Machines Corporation,
1991.
)b