To achieve high performance, contemporary computer systems rely on
two forms of parallelism: instruction-level parallelism (ILP) and
thread-level parallelism (TLP). Wide-issue superscalar processors
exploit ILP by executing multiple instructions from a single program
in a single cycle. Multiprocessors (MP) exploit TLP by executing different
threads in parallel on different processors. Unfortunately, both parallel-
processing styles statically partition processor resources, thus preventing
them from adapting to dynamically changing levels of TLP and ILP in a
program. With insufficient TLP, processors in an MP will be idle; with
insufficient ILP, multiple-issue hardware on a superscalar is wasted.
This paper explores parallel processing on an alternative architecture,
simultaneous multithreading (SMT), which allows multiple threads
to compete for and share all of the processor's resources every cycle.
The most compelling reason for running parallel applications on an SMT
processor is its ability to use thread-level parallelism and instruction-
level parallelism interchangeably. By permitting multiple threads to share
the processor's functional units simultaneously, the processor can use both
ILP and TLP to tolerate variations in parallelism. When a program has only
a single thread, all of the SMT processor's resources can be dedicated to
that thread; when more TLP exists, this parallelism can compensate for a
lack of per-thread ILP.
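The contrast between static partitioning and dynamic sharing can be illustrated with a toy per-cycle issue-slot count. This is only a sketch under stated assumptions, not the simulation methodology of this work: the function names, the four 2-issue processors, the 8-slot SMT, and the ready-instruction counts are all hypothetical values chosen for illustration.

```python
def mp_issued(ready, processors=4, width_per_processor=2):
    """Statically partitioned (hypothetical MP): thread i can use only the
    issue slots of processor i, so unused slots elsewhere are wasted."""
    return sum(min(r, width_per_processor) for r in ready[:processors])


def smt_issued(ready, total_width=8):
    """Dynamically shared (hypothetical SMT): all threads compete for one
    pool of issue slots in the same cycle."""
    return min(sum(ready), total_width)


# ready[i] = instructions thread i could issue this cycle (assumed values)
scenarios = {
    "one thread, high ILP": [6],
    "four threads, low ILP": [1, 1, 1, 1],
    "four threads, uneven ILP": [5, 1, 0, 3],
}

for name, ready in scenarios.items():
    print(f"{name}: MP issues {mp_issued(ready)}, SMT issues {smt_issued(ready)}")
```

With these assumed inputs, the single high-ILP thread issues 2 instructions on the partitioned design and 6 on the shared one, and the uneven four-thread case issues 5 versus 8. The two designs tie only when every thread's demand fits within its own partition.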
In this work, we examine two alternative on-chip parallel architectures
for the next generation of processors. We compare SMT and small-scale,
on-chip multiprocessors (MP) in their ability to exploit both ILP and
TLP. First, we identify the hardware bottlenecks that prevent
multiprocessors from effectively exploiting ILP. Then, we show that
because of its dynamic resource sharing, SMT avoids these inefficiencies
and benefits from being able to run more threads on a single processor.
The use of TLP is especially advantageous when per-thread ILP is limited.
The ease of adding thread contexts on an SMT (relative to
adding processors on an MP) allows simultaneous multithreading
to expose more parallelism, further increasing functional unit
utilization and attaining a 52% average speedup (versus a four-
processor, single-chip multiprocessor with comparable execution resources).
This study also addresses an often-cited concern regarding the use of
thread-level parallelism or multithreading: interference in the memory
system and branch prediction hardware. We find that multiple threads
cause inter-thread interference in the caches and also place greater
demands on the memory system, increasing average memory latencies. By
exploiting thread-level parallelism, however, SMT hides these
additional latencies, so that they have only a small impact on total
program performance. We also find that for parallel applications, the
additional threads have minimal effects on branch prediction.