# CS405 Computer System Architecture MODULE 1 Prepared by Sharika T R, Assistant Professor, SNGCE

# Syllabus

- Parallel computer models
  - Evolution of Computer Architecture,
  - System Attributes to performance,
- · Amdahl's law for a fixed workload.
- · Multiprocessors and Multicomputers,
- Multivector and SIMD computers,
- Architectural development tracks,
- · Conditions of parallelism.

# **Textbook**



Kai Hwang & Naresh
 Jotwani, 'Advanced
 Computer System
 Architecture, Parallelism,
 Scaleability,
 Programability'

3

# computer Architecture vs Computer Organization

- Computer Organization
  - how operational parts of computer system are linked together
  - implements the computer architecture
- Computer Architecture
  - Blueprint for design and implementation of a computer system
  - provides functional details and behaviour of a computer system and comes before computer organization
  - explain what a computer should do



# THE STATE OF COMPUTING Computer Development Milestones

#### How it all started...

- o 500 BC: Abacus (China) The earliest mechanical computer/calculating device.
  - · Operated to perform decimal arithmetic with carry propagation digit by digit
- 1642: Mechanical Adder/Subtractor (Blaise Pascal)
- o 1827: Difference Engine (Charles Babbage)
- 1941: First binary mechanical computer (Konrad Zuse; Germany)
- 1944: Harvard Mark I (IBM)
  - · The very first electromechanical decimal computer as proposed by Howard Aiken

#### Computer Generations

- $\hspace{.1in} \circ \hspace{.1in} 1^{st} \hspace{1.5in} 2^{nd} \hspace{1.5in} 3^{rd} \hspace{1.5in} 4^{th} \hspace{1.5in} 5^{th}$
- Division into generations marked primarily by changes in hardware and software technologies

# **THE STATE OF COMPUTING**

# **Computer Development Milestones**

- First Generation (1945 54)
  - Technology & Architecture:
    - Vacuum Tubes
    - · Relay Memories
    - · CPU driven by PC and accumulator
    - · Fixed Point Arithmetic
  - Software and Applications:
    - · Machine/Assembly Languages
    - · Single user
    - · No subroutine linkage
    - · Programmed I/O using CPU
  - o Representative Systems: ENIAC, Princeton IAS, IBM 701

# THE STATE OF COMPUTING Computer Development Milestones

- Second Generation (1955 64)
  - Technology & Architecture:
    - · Discrete Transistors
    - Core Memories
    - · Floating Point Arithmetic
    - I/O Processors
    - Multiplexed memory access
  - Software and Applications:
    - · High level languages used with compilers
    - · Subroutine libraries
    - · Processing Monitor
  - Representative Systems: IBM 7090, CDC 1604, Univac LARC

# THE STATE OF COMPUTING Computer Development Milestones

- Third Generation (1965 74)
  - o Technology & Architecture:
    - IC Chips (SSI/MSI)
    - Microprogramming
    - Pipelining
    - Cache
    - · Look-ahead processors
  - Software and Applications:
    - · Multiprogramming and Timesharing OS
    - · Multiuser applications
  - o Representative Systems: IBM 360/370, CDC 6600, T1-ASC, PDP-8

# THE STATE OF COMPUTING Computer Development Milestones

- Fourth Generation (1975 90)
  - o Technology & Architecture:
    - LSI/VLSI
    - · Semiconductor memories
    - Multiprocessors
    - Multi-computers
    - · Vector supercomputers
  - Software and Applications:
    - Multiprocessor OS
    - · Languages, Compilers and environment for parallel processing
  - Representative Systems: VAX 9000, Cray X-MP, IBM 3090

# THE STATE OF COMPUTING Computer Development Milestones

- Fifth Generation (1991 onwards)
  - Technology & Architecture:
    - · Advanced VLSI processors
    - · Scalable Architectures
    - · Superscalar processors
  - Software and Applications:
    - · Systems on a chip
    - · Massively parallel processing
    - · Grand challenge applications
    - · Heterogeneous processing
  - o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE 6480, MPP, VPP500

#### **THE STATE OF COMPUTING Elements of Modern Computers** Computing Problems Computing Operating System Algorithms and Data Structures Algorithms Hardware Resources Mapping Hardware and Data Architecture Structures **Operating System** Programming System Software Support Binding Applications Software (compile, load) Compiler Support High-level Languages Performance Evaluation Fig. 1.1 Elements of a modern computer system

# Computing problems

- Numerical computing For numerical problems in science and technology, the solutions demand complex mathematical formulations and intensive integer or floating-point computations.
- Transaction processing For alphanumerical problems in business and government, the solutions demand accurate transactions, large database management, and information retrieval operations.
- Logical reasoning -For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic manipulations.

13

# Algorithms and Data Structures

- Special algorithms and data structures are needed to specify the computations and communications involved in computing problems.
- Most numerical algorithms are deterministic, using regularly structured data.

#### Hardware Resources

- Coordinated efforts by hardware resources, an operating system, and application software.
- Hardware core of a computer system -
  - Processors, memory, and peripheral devices
- Special hardware interfaces are often built into I/O devices, such as terminals, workstations, optical page scanners, magnetic ink character recognizers, modems, file servers, voice data entry, printers, and plotters.
- These peripherals are connected to mainframe computers directly or through local or wide-area networks.

15

# **Operating System**

- An effective operating system manages the allocation and deallocation of resources during the execution of user programs.
- Beyond the OS, application software must be developed to benefit the users.

# Mapping

- Mapping of algorithmic and data structures onto the machine architecture is a bidirectional
- process matching algorithmic structure with hardware architecture, and vice versa.
- Efficient mapping will benefit the programmer and produce better source codes.
- It includes processor scheduling, memory maps, interprocessor communications, etc

17

# System Software Support

- The source code written in a HLL must be first translated into object code by an optimizing compiler.
- The compiler assigns variables to registers or to memory words and reserves functional units for operators.
- An assembler is used to translate the compiled object code into machine code which can be recognized by the machine hardware.
- A loader is used to initiate the program execution through the OS kernel.

17-8-2020

19

# **Evolution of Computer Architecture**

- The study of computer architecture involves both hardware organization and programing/software requirements
- Evolution of computer architecture started with the von Neumann architecture
  - Built as a sequential machine executing scalar data
  - slow due to sequential execution of instructions in programs.
- The sequential computer was improved from hit-serial to word-parallel operations, and from fixed-point to floating point operations

- Major advancements came due to following techniques
  - -Look ahead Parallesim and pipelining
  - Flynn's Classification
  - -Parallel/Vector computers



## Look Ahead Technique

- Introduced to prefetch instructions in order to overlap I/E
  - Instruction fetch and exection operations
- and to enable functional parallelism
  - use multiple fuctional units simultaneously
  - practice pipelining at various processing levels

Lookahead is a concept where processor looks at what resources might not be needed by a process(or by processes of higher priority ) in near future and allocates those to other processes(or low priority processes).

parallelism is simply multiple operations being done at the same time.

This can be achieved using multiprocessors, coding in a way that at a time different parts of program could be executed simultaneously..

for example threading etc.

# Flynn's Classification

- Michael Flynn introducted classification of computer architecture based on
  - instruction stream
  - data streams

Cont...

- Major classifications are
  - SISD (Single Instruction Stream over a Single Data stream)
    - conventional sequential machines
  - SIMD(Single Instruction Stream over a Multiple Data stream)
    - · vector computers equiped with scalar or vector hardware
  - MIMD(Multiple Instruction Stream over a Multiple Data stream)
    - Parallel Computers
  - MISD(Multiple Instruction Stream over a Single Data stream)
    - special purpose computer













# Advantages and Disadvantages

- · Advantages of Flynn
  - » Universally accepted
  - » Compact Notation
  - » Easy to classify a system
- Disadvantages of Flynn
  - » Very coarse-grain differentiation among machine systems
  - » Comparison of different systems is limited
  - » Interconnections, I/O, memory not considered in the scheme

# Parallel/ Vector Computer

- execute programs in MIMD mode(simultaneously)
- two major classes
  - shared memory multiprocessor
  - message passing multicomputer
- The major distinction between multiprocessors and multicomputers lies in
  - memory sharing and
  - the mechanisms used for interprocessor communication.

33

# **Shared Memory Multiprocessor**

- processors in a multiprocessor system communicate with each other through
  - -shared variables
  - -in a common memory

# Message Passing Multicomputer

- Processors in a multiprocessor system communicate with each other through
  - -message passing among the nodes
  - -with local memory, unshared with other nodes

35

### **Vector Processor**

- Processors whose instructions operate on vector data
- equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control
- · two families of pipelined vector processors
  - Memory-to-memory architecture
    - supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory.
  - Register-to-register architecture
    - uses vector registers to interface between the memory and functional pipelines.

19-8-2020

37

# System Attributes to Performance

- The ideal performance of a computer system demands a perfect match between machine capability and program behavior
- Machine capability can be enhanced with
  - better hardware technology,
  - innovative architectural features, and
  - efficient resources management

- · factors affecting program behavior
  - algorithm design,
  - data structures,
  - language efficiency,
  - programmer skill, and
  - compiler technology
- performance should be described as a range or a distribution

39

# Performance Factors

- cycle time,  $extcolor{}$
- clock rate f,  $f = 1/\tau$
- CPI(Cycle per Instructions)
- Instruction count(I<sub>C</sub>)
- Processor cycle p
- Memory cycle *m*
- Ratio between memory cycle and process cycle k

- Clock Rate and CPI
  - -cycle time  $\tau$ 
    - time taken to complete one clock cycles
  - -clock rate f,  $f = 1/\tau$ 
    - · inverse of clock time
  - -Instruction count( $I_c$ )
    - number of machine instructions to be executed in the program
  - Cycles Per Instruction(CPI)
    - · for measuring the time needed to execute each instruction

41

# Performance Factors

- CPU time (T in seconds/program)
  - time needed to execute the program is estimated by finding the product of three contributing factors:

 $T = I_C \times CPI \times \tau \tag{1.1}$ 

 execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand fetch, execution, and store results

Instruction fetch 

memory access

Decode → carried out by CPU

Operand fetch 

memory access

Execution → carried out by CPU

- memory cycle m
  - time needed to complete one memory reference

k times the processor cycle  $\mathcal{T}$ .

- value of k depends on
  - the speed of the cache and
  - · memory technology and
  - processor-memory interconnection scheme used

43

# Cycles per Instruction(CPI)

- C= total number of clock cycles needed to execute a program(n instructions)
- CPI=no of clock cycles needed to execute a single instruction

$$CPI = \frac{C}{I_C}$$
 (3)

• The eqn 1.1can be rewritten as

$$T = Ic * CPI* \tau$$

$$T = Ic * CPI* \tau$$

$$Ic$$

$$T = C * \tau \dots (4)$$

$$T = C$$

$$f$$

- CPI can be divided into two component terms
  - total processor cycles and
  - memory cycles needed to complete the execution of the instruction
- Depending on the instruction type, the complete instruction cycle may involve one to as many as four memory references
  - we can rewrite Eq. 1.1 as

$$T = I_C \times (p + m \times k) \times \tau \tag{1.2}$$

- p→ no: of processor cycles needed for inst decode & execution
- m→ no: of memory reference needed
- k→ ratio between memory and processor cycles
- τ → processor cycle time

45

# System attributes

- five performance factors  $(I_C,\,p,\,m,\,k,\tau)\,$  are influenced by four system attributes:
  - instruction-set architecture
    - affects the program length (I<sub>c</sub>) and processor cycles needed (p)
  - compiler technology
    - affects the values of Ic, p, and the memory reference count (m).
  - CPU implementation and control, and
    - determine the total processor time (  $p \cdot \tau$  ) needed
  - cache and memory hierarchy
    - affect the memory access latency (  $k \cdot \tau$  ).

| Table 1.2 | Performance | Factors versus | System Attributes |
|-----------|-------------|----------------|-------------------|
|-----------|-------------|----------------|-------------------|

| System<br>Attributes                       | Performance Factors      |                                           |                                            |                                 |                     |
|--------------------------------------------|--------------------------|-------------------------------------------|--------------------------------------------|---------------------------------|---------------------|
|                                            | Instr.                   | Average Cycles per Instruction, CPI       |                                            |                                 | Processor           |
|                                            | Count,<br>I <sub>c</sub> | Processor<br>Cycles per<br>Instruction, p | Memory<br>References per<br>Instruction, m | Memory-<br>Access<br>Latency, k | Cycle<br>Time,<br>T |
| Instruction-set<br>Architecture            | 1                        | <b>✓</b>                                  |                                            |                                 |                     |
| Compiler<br>Technology                     | 1                        | <b>✓</b>                                  | <b>✓</b>                                   |                                 |                     |
| Processor<br>Implementation<br>and Control |                          | ~                                         |                                            |                                 | 1                   |
| Cache and<br>Memory<br>Hierarchy           |                          |                                           |                                            | <b>✓</b>                        | <b>✓</b>            |

# MIPS(Million Instructions Per Second) Rate

- MIPS rateis basedon following factors
  - clock rate f
  - Instruction count I<sub>C</sub>
  - CPI of given machine

$$MIPS = \frac{I_C}{T \times 10^6}$$
 (6)

 Let C be the total number of clock cycles needed to execute a given program

$$T = I_{c} \times (P + mx k) \times T \qquad (1.2)$$

$$T = C \times T$$

$$CPI = \frac{C}{I_{c}} \qquad (3)$$

$$T = I_{c} \times CPI \times Z \qquad (1.1)$$

$$= I_{c} \times CPI \times Z \qquad (1.1)$$

$$T = T_{C} \times CPI \times Z \qquad (11)$$

$$= T_{C} \times CPI \times \frac{1}{f}$$

$$= M_{I}PS \quad \text{fale} = \frac{T_{C}}{T \times 10^{6}} \times f$$

$$= \frac{f}{CPI \times 10^{6}} \times T_{C} \qquad (13)$$

$$= \frac{f}{C \times 10^{6}} \times T_{C} \qquad (13)$$

Based on 1.3 the CPU time in eqn 1.2 can be worther as.

T= 
$$T_{c} \times 10^{-6}$$

MIPS rate of a given
computer is diseasely
propositional to the clock
propositional to the CPI.

propositional to the CPI.

propositional to the CPI.

propositional to  $T_{c} \times T_{c} \times T_{c}$ 

MIPS =  $T_{c} \times T_{c} \times T_{c}$ 

# **Throughput Rate**

- System throughput W<sub>S</sub>
  - how many programs a system can execute per unit time
- CPU throughput Wp
  - measure of how many programs can be executed per second
  - multiprogrammed system system throughput is often lower than the CPU throughput

$$Wp = \frac{1}{I_{C} \times CPI}$$

$$= \frac{1}{CPI} \times \frac{1}{I_{C}}$$

$$= MIPS \times 106 \times \frac{1}{I_{C}}$$

51

#### Performance

For some program running on machine A,

Performance of A, Perf(A) = 1 / ExecTime(A)

"A is n times faster than B" iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = n

❖ "A is X% faster than B" iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = 1 + X/100

#### Problem:

- Machine A runs a program in 20 seconds. Machine B runs the same program in 25 seconds. How many times faster is machine A?
  - n = ExecTime(B) / ExecTime(A)

$$=\frac{25 \text{ sec}}{20 \text{ sec}} = 1.25$$

Machine A is 1.25 times faster than Machine B

53

20/8

# **Problem**

A 40 MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:

| Instruction Type   | Instruction count | Clock cycle count |
|--------------------|-------------------|-------------------|
| Integer Arithmetic | 35000             | 1                 |
| Data Transfer      | 20000             | 2                 |
| Floating point     | 15000             | 2                 |
| Control Transfer   | 6000              | 2                 |
|                    |                   |                   |

Determine the effective CPI, MIPS rate and execution time for this program.

2018 regular question paper

55

# • Clock speed=40Mhz

| Instruction count | Clock cycle count       | Cycles                        |
|-------------------|-------------------------|-------------------------------|
| 35000             | 1                       | 35000x1=35000                 |
| 20000             | 2                       | 20000x2=40000                 |
| 15000             | 2                       | 15000x2=30000                 |
| 6000              | 2                       | 6000x2=12000                  |
|                   | 35000<br>20000<br>15000 | 35000 1<br>20000 2<br>15000 2 |

Instruction Count 76000

Total number of cycles required to execute complete program C= 35000+40000+30000+12000=117000cycles

$$CPI = \frac{C}{I_C} = \frac{117000}{76000} = 1.53$$

$$MIPS = \frac{f}{CPI \times 10^6} = \frac{40 \times 10^6}{1.53 \times 10^6} = 26.14$$

$$T = I_C \times CPI \times \tau = I_C \times CPI \times \frac{1}{f} = 76000 \times 1.53 \times \frac{1}{40} = 2907 = 2.907 ms$$

57

Consider the execution of a task with 100000 instructions on 500 MHz processor. The program consists of FOUR major types of instructions:

| Instruction Type          | CPI | Instruction% |  |
|---------------------------|-----|--------------|--|
| Integer arithmetic        | 1   | 60%          |  |
| Floating point arithmetic | 2   | 20%          |  |
| Load/Store                | 4   | 10%          |  |
| Memory Reference          | 6   | 10%          |  |

When the task is executed on a uniprocessor;

- Calculate the average CPI?
- Determine the corresponding MIPS rate?

#### Solution:

Average CPI= 1\* 0.6 + 2 \* 0.2 + 4 \* 0.1 + 6 \* 0.1= 2 cycles/instruction

$$MIPS = \frac{f}{CPI \times 10^6} = \frac{500MHz}{2cycles/instr} = 250$$

59

# Floating Point Operations per Second

- Performance masuredin FLOPs
- FLOPS: floating point operations per second

## AMDAHL'S LAW FOR FIXED WORKLOAD

61

#### BASICS OF PERFORMANCE EVALUATION

A <u>sequential algorithm</u> is evaluated in terms of its execution time which is expressed as a function of its input size.

For a <u>parallel algorithm</u>, the execution time depends not only on input size but also on factors such as parallel architecture, no. of processors, etc.

#### Performance Metrics

Parallel Run Time

Speedup

Efficiency

#### Standard Performance Measures

Peak Performance

Sustained Performance

Instruction Execution Rate (in MIPS)

Floating Point Capability (in MFLOPS)

#### PERFORMANCE METRICS

#### Parallel Runtime

The parallel run time **T(n)** of a program or application is the time required to run the program on an **n-processor** parallel computer.

When n = 1, T(1) denotes sequential runtime of the program on single processor.

#### Speedup

Speedup **S(n)** is defined as the ratio of time taken to run a program on a single processor to the time taken to run the program on a parallel computer with identical processors

$$S(n) = \frac{T(1)}{T(n)}$$

It measures how faster the program runs on a parallel computer rather than on a single processor.

#### PERFORMANCE METRICS

#### Efficiency

The Efficiency **E(n)** of a program on **n processors** is defined as the ratio of speedup achieved and the number of processor used to achieve it.

$$E(n) = \frac{S(n)}{n} = \frac{T(1)}{n.T(n)}$$

Speedup is expected to be linear i.e. it grows linearly with the number of processors, but in most cases it falls due to parallel overhead.

#### SPEEDUP PERFORMANCE LAWS

#### Speedup Performance Laws

#### Amdahl's Law

[based on fixed problem size or fixed work load]

#### Gustafson's Law

[for scaled problems, where problem size increases with machine size i.e. the number of processors]

#### Sun & Ni's Law

[applied to scaled problems bounded by memory capacity]

# Amdahl's Law

- A program (or algorithm) which can be parallelized can be split up into two parts:
  - A part which cannot be parallelized and
  - A part which can be parallelized
- Ea:
- Imagine a program that processes files from disk. A small part of that
  program may scan the directory and create a list of files internally in memory.
  After that, each file is passed to a separate thread for processing. The part
  that scans the directory and creates the file list cannot be parallelized, but
  processing the files can be done in parallel.
- Total time taken to execute the program only serially is called T.
- The time T includes the time of both the non-parallelizable and parallelizable parts.
- T = Total time of serial execution
- B = Total time of non- parallelizable part
- T B = Total time of parallelizable part (when executed serially, not in parallel)

- First of all, a program can be broken up into a non-parallelizable part B, and
- a parallelizable part 1-B, as illustrated by this diagram:
- The line with the delimiters on at the top is the total execution time T(1).



67

#### SPEEDUP PERFORMANCE LAWS

#### Amdahl's Law (1967)

For a given problem size, the speedup does not increase linearly as the number of processors increases. In fact, the speedup tends to become saturated.

This is a consequence of Amdahl's Law.

According to Amdahl's Law, a program contains two types of operations:

Completely sequential Completely parallel

Let, the time **Ts** taken to perform sequential operations be a fraction  $\alpha$  (0< $\alpha$ <1) of the total execution time **T(1)** of the program, then the time **Tp** to perform parallel operations shall be (1- $\alpha$ ) of **T(1)** 

#### Amdahl's Law

Thus, 
$$Ts = \alpha . T(1)$$
 and  $Tp = (1-\alpha) . T(1)$ 

Assuming that the parallel operations achieve linear speedup (i.e. these operations use 1/n of the time taken to perform on each processor), then

$$T(n) = Ts + Tp/n = \alpha . T(1) + \frac{(1-\alpha). T(1)}{n}$$

Thus, the speedup with **n** processors will be:

$$S(n) = \frac{T(1)}{T(n)} = \frac{T(1)}{\alpha \cdot T(1) + \frac{(1-\alpha) \cdot T(1)}{n}} = \frac{1}{\alpha + \frac{(1-\alpha)}{n}}$$

#### SPEEDUP PERFORMANCE LAWS

#### Amdahl's Law

Sequential operations will tend to dominate the speedup as n becomes very large.

As 
$$n \rightarrow \infty$$
,  $S(n) \rightarrow 1/\alpha$ 

This means, no matter how many processors are employed, the speedup in this problem is limited to  $1/\alpha$ .

This is known as sequential bottleneck of the problem.

Note: Sequential bottleneck cannot be removed just by increasing the no. of processors.

)

#### SPEEDUP PERFORMANCE LAWS

#### Amdahl's Law

A major shortcoming in applying the Amdahl's Law: (is its own characteristic)

The total work load or the problem size is fixed

Thus, execution time decreases with increasing no. of processors

Thus, a successful method of overcoming this shortcoming is to increase the problem size!

## Example: 1

Suppose that a calculation has a 4% serial portion,

- a) What is the limit of speedup on 16 processors?
- b) What is the maximum speedup?

Ans:

$$S_n = \frac{n}{1 + (n-1)\alpha}$$

a) Limit of speedup on 16 processors  
=
$$16/(1 + (16-1)*.04) = 10$$

b) The maximum speedup = 
$$1/\alpha$$
  
=  $1/0.04 = 25$ 

## Example: 2

If 90% of a calculation can be parallelized, then what is the maximum speed-up which can be achieved on 5 processors?

Ans: 
$$S(n) = n/(1 + (n-1)* \alpha)$$
 ( $\alpha = 1-0.9 = 0.1$ ) (sequential fraction)  
=  $5/(1 + (5-1)*.10) = 3.57$ 

(the program can theoretically run 3.57 times faster on five processors than on one)



# Amdahl's Law (A different perspective)

- The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's law.
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Law defines the term 'speedup'.

 $Speedup = \frac{Performance for entire task using the enhancement when possible}{Performance for entire task without using the enhancement}$ 

Alternatively,

 $Speedup = \frac{Execution time for entire task without using the enhancement}{Execution time for entire task using the enhancement when possible}$ 

74

- Amdahl's law gives us a quick way to find the speedup from some enhancement, which depends on two factors:
  - -Fraction enhancement
    - The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement.
  - -Speed-up-enhanced
    - how much faster the task would run if the enhanced mode were used for the entire program

#### Fraction enhancement

- Example:
- If 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction enhanced is 20/60.
- This value, Fraction<sub>enhancement</sub>, is always less than or equal to 1.

## Speedup enhancement

- The speedup or improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program – this value is the time of the original mode over the time of the enhanced mode.
- Example, if the enhanced mode takes 2 sec for a portion of the program, while it is 5 sec in the original mode, the speedup enhanced is 5/2.
- This is called Speed-up-enhanced which is always greater than 1.

77

Amdhal's law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance

Execution time<sub>new</sub> = Execution time<sub>old</sub> 
$$\times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

The overall speedup is the ratio of the execution times:

$$Speedup_{overall} = \frac{Execution time_{old}}{Execution time_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$$

79

• If three different enhancements use fractions of time (f1, f2 and f3) respectively, and have individual speedups as S1, S2 and S3 respectively, the overall speedup is

OverallSpeedup = 
$$\frac{1}{\left(1 - (f_1 + f_2 + f_3)\right) + \left(\frac{f_1}{s_1} + \frac{f_2}{s_2} + \frac{f_3}{s_3}\right)}$$

## Applications of Amdahl's Law

- Amdahl's law gives us a quick way to find the speedup from some enhancement.
- Amdahl's law is particularly useful for comparing the overall system performance of two different systems
- It can also be applied to compare two processor design alternatives, based on enhancement on the same system

81

## Exercise and Solution on Amdahl's Law

- 1. What is the overall speedup if you make 10% of a program 90 times faster?
- 2. What is the overall speedup if you make 90% of a program 10 times faster?

• Amdahl's law OverallSpeedup = 
$$\frac{1}{(1-f)+\frac{f}{s}}$$

• What is the overall speedup if you\_make 10% of a program 90 times faster?  $\frac{1}{(1-0.1) + \frac{0.1}{90}} \approx \frac{1}{0.9011} \approx 1.11$ 

• What is the overall speedup if you make 90% of a program 10 times faster 
$$\frac{1}{(1-0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26$$

3. We are considering an enhancement to the processor of a web server. The new CPU is 20 times faster on search queries than the old processor. The old processor is busy with search queries 70% of the time, what is the speedup gained by integrating the enhanced CPU?

$$Speedup = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$$

$$Fraction_{enhanced} = 70 \% = 0.70$$

$$Speedup_{enhanced} = 20$$

$$Speedup = \frac{1}{(1 - 0.70) + \frac{0.70}{20}} = \frac{1}{0.335} = 2.985$$

- 4. We are considering an enhancement to the processor of a server. The new CPU 10X faster. I/O bound server, so 60% time waiting for I/O.
  - New CPU 10X faster
  - I/O bound server, so 60% time waiting for I/O

Speedup<sub>overall</sub> = 
$$\frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
$$= \frac{1}{\left(1 - 0.4\right) + \frac{0.4}{10}} = \frac{1}{0.64} = 1.56$$

Three enhancements with the following speedup are proposed for a new architecture. S1=30, S2=20, S3 = 15. If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speed up of 10?

OverallSpeedup = 
$$\frac{1}{\left(1 - \left(f_1 + f_2 + f_3\right)\right) + \left(\frac{f_1}{s_1} + \frac{f_2}{s_2} + \frac{f_3}{s_3}\right)}$$

$$10 = 1/ [1 - (0.25 + 0.25 + f3) + [(0.25/30) + (0.25/20) + (f3/10)]]$$

$$10 = 1/ [0.5 - f3) + [(0.5 + 0.75 + 4 f3)/60]]$$

$$10 = 60 / [30 - 60 f3 + 1.25 + 4 f3]$$

$$-56 f3 = 6 - 31.25$$

$$f3 = -25.25/-56 = 0.45 = 45\%$$

85

## **Progmmming Environments**

- programmability of a computer depends on the programming environment provided to the users.
- uniprocessor computers are programmed in a sequential environment
  - Successive system calls must be serialized
- When using a parallel computer,
  - one desires a parallel enviornment where parallelism is automatically exploited.
  - Language extensions or new constructs must be developed to specify parallelism or
  - to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.

## Implicit Parallelism

- uses a conventional language, such as C, C++, Fortran, or Pascal, to write the source program.
- The sequentially coded source program is translated into parallel object code by a parallelizing compiler.
- this compiler must be able to detect parallelism and assign target machine resources.
- With parallelism being implicit, success relics heavily on the "intelligence" of a parallelizing compiler.
- This approach requires less effort on the part of the programmer

87

## **Explicit Parallelism**

- requires more effort by the programmer to develop a source program using parallel dialects of C, C++, Fortran, or Pascal.
- Parallelism is explicitly specified in the user programs.
- This reduces the burden on the compiler to detect parallelism.
- Instead, the compiler needs to preserve parallelism and, where possible, assigns target machine resources.



## **MULTIPROCESSORS AND MULTICOMPUTERS**

- Two categories of parallel computers are
  - Shared Memory Multiprocessors
  - Message Passing Multicomputers
- Their differrence is based on
  - One has shared common memory
  - Other has unshared distributed memory

## **Shared-Memory Multiprocessors**

- three shared- memory multiprocessor models
  - Uniform Memory Access(UMA model)
  - -Non Uniform Memory Access(NUMA model), and
  - Cache Only Memory Architecture(COMA) model.
- These models differ in how the memory and peripheral resources are shared or distributed.

91

#### The UMA Modal

- Physical memory is uniformly shared by all the processors
- All processors have equal access time to all memory words, which is why it is called uniform memory access
- Each processor may use a private cache.
- Peripherals are also shared in some fashion



Fig. 1.6 The UMA multiprocessor model

- Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
- · The system interconnect takes the form of
  - a common bus,
  - a crossbar switch, or
  - a multistage network
- UMA model is suitable for
  - general-purpose and times sharing applications by multiple users.
  - It can be used to speed up the execution of a single large program in timecritical applications.
- To coordinate parallel events, synchronization and communication among processors ar done through using shared variables in the common memory.

24-08

## Symmetric and Asymmetric Multiprocessor

- Symmetric Multiprocessor
  - When all processors have equal access to all peripheral devices,
  - all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.

- Asymmetric Multiprocessor
  - only one or a sub set of processors are executive-capable.
  - An executive or a master processor(MP) can execute the operating system and handle I/O.
  - The remaining processors have no I/O capability and thus are called attached processors(APs).
  - Attached processors execute user codes under the supervision of the master processor.
  - In both MP and AP configurations, memory sharing among master and attached processors is still in place.

## Cache coherent UMA

 If one processor update location in shared memory, all other processors know the update

97

# Advantages of UMA

- Suitable for general purpose and timesharing applications by multiple users
- Speed up execution of a single large program in time critical applications

## Non Uniform Memory Acces, NUMA Model

- It is shared-memory system in which the access time varies with the location of the memory word
- Two NUMA machine models are
  - Shared local memories model
  - A hierarchical cluster model



#### Shared local memories model

- shared memory is physically distributed to all processors called local memories.
- The collection of all local memories forms a global address space accessible by all processors
- Access
  - It is faster to access a local memory with a local processor.
  - The access of remote memory attached to other processors takes longer
    - due to the added delay through the interconnection network.



### A hierarchical cluster model

- Globally shared memory
- The processors are divided into several clusters.
- Each cluster is itself an UMA or a NUMA multiprocessor.
- The clusters are connected to global shared-memory modules.
- The entire system is considered a NUMA multiprocessor.
- All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.
- All clusters have equal access to the global memory.
- However, the access time to the cluster memory is shorter than that to the global memory.



# The COMA Model

- A multiprocessor using cache-only memory assumes the COMA model.
- Special case of NUMA
- distributed main memories are converted to caches.
- There is no memory hierarchy at each processor node.
- All the caches form a global address space.
- Remote cache access is assisted by the distributed cache directories
- Depending on the interconnection networks sometimes hierarchical directories may be used to help locate copies of each blocks.
- Initial dataplacement is not critical because data will eventually migrate to where it will be used



- Remote cache access is assisted by the distributed cache directories
- Depending on the interconnection network used.
- sometimes hierarchical directories may be used to help locate copies of eachblocks.
- Initial dataplacement is not critical because data will eventually migrate to where it will

## Distributed-Memory/Message Passing Multicomputers

- system consists of multiple computers, often called nodes, interconnected by a message-passing network.
- Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals
- Nodes interconnected by a message passing network
   can be Mesh, Ring, Torus, Hypercube etc (discussed later)
- All interconnection provide point to point static connection among nodes



- Local memories are private and accessible only by processor – Thus Multicomputer are also called Noremote-Memory –Access(NORMA)(difference with UMA and NUMA)
- Communication between nodes if required is carried out by passing messages through static connection network.

- Advantages over Shared Memory
  - Scalable and Flexible : we can add CPU's
  - Reliable and accessible : since with Shared memory a failure can bring the whole system down
- Disadvantage
  - Considered harder to program because we are used to programming on common memory systems.

#### MULTIVECTOR and SIMD COMPUTERS

- We can classify supercomputers into 2
  - Pipelined Vector machines using powerful processors equipped with vector hardware
  - SIMD computers emphasizing on massive data parallelism

111

# **VECTOR SUPERCOMPUTERS**

- A vector computer is usually built on top of scalar processor
  - -its attached to scalar processor as an optional feature
- host computer: load program and data into main memory
- scalar Control Unit: decod intructions
- if it's a scalar operation or program control operation,
  - it will be directly executed using scalar functional pipelines.

- If its vector operation it will be send to the vector control unit.
  - The control unit supervises the flow of vector data between main memory and vector functional pipelines.
  - Vector data flow is coordinated by the control unit.
  - A number of vector functional pipelines may be built into a vector processor



#### **VECTOR PROCESSOR MODELS**

- 2 types :
  - -Register to Register architecture
  - Memory to memory architecture

115

#### REGISTER to REGISTER architecture

- Vector registers are used to hold
  - vector operands,
  - intermediate and
  - final vector results.
- The vector functional pipelines retrieve operands from and put results into the vector registers.
- All vector registers are programmable in user instructions
- Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.

- Length of vector register is usually fixed
- Some machines use reconfigurable vector registers to dynamically match register length
- Generally there are fixed no: of vector registers and functional pipilines in vector processors
  - hence they must be reserved in advance to avoid conflicts

### **MEMORY-to-MEMORY architecture**

- differs from a register-to-register architecture in the use of a vector stream unit to replace the vector registers.
- Vector operands and results are directly retrieved from and stored into the main memory in superwords

## SIMD Supercomputers

- Computers with Multiple Processing elements
- They perform same operation on multiple data points simultaneously
- An operational model of an SIMD computer is specified by a 5-tuple:

$$M = (N, C, I, M, R)$$

- N is the number of Processing Element's(PE's) in m/c (ex: Illiac IV has 64 PE's)
- C is the set of instructions directly executed by Control Unit(CU)including scalar and program flow control instructions.
- I is the set of instruction broadcast by CU to all PE's for parallel execution(Ex: arithmetic, logic, data routing, masking etc)
- M is the set of masking schemes which sets each PE's into enabled and disabled mode
- R is the data-routing function, specifies the pattern to be set up in the interconnection n/w for inter-PE communications.





### ARCHITECTURAL DEVELOPMENT TRACKS

- The architectures ofmost existing computers tollow certain development tracks.
- Understanding features of various tracks provides insights for new architectural development
  - 1. Multiple-Processor Tracks
    - 1.Multiprocessor track
    - 2. Multicomputer track
  - 2. Multiple data track
    - 1.Vector track
    - 2.SIMD track

- 3. Multiple threads track
  - 1. Multithreaded track
  - 2. Dataflow track

123

# Assignment 1

Make a study on different architectural development tracks

#### 26-08-2020

125

#### **CONDITIONS of PARALLELISM**

- The ability to execute several program segments in parallel requires each segment to be independent of the other segments.
- We use a dependence graph to describe the relation between statements
  - The nodes of a dependence graph correspond to the program statement (instructions), and
  - directed edges with different labels are used to represent the ordered relations among the statements.
- The analysis of dependence graphs shows where opportunity exists for parallelization and vectorization.

- Program segments cannot be executed in parallel unless they are independent.
- · Independence comes in several forms
  - I. Data dependence
  - II. Control dependence
  - III. Resource Dependence



# I. Data Dependence

- The ordering relationship between statements is indicated by the data dependence.
- Five types of data dependence are defined below:
  - a) Flow dependence
  - b) Anti-dependence
  - c) Output dependence
  - d) I/O dependence
  - e) Unknown dependence

### a) Flow dependence

- A statement S2 is flow dependent on S1
  - if an execution path exists from S1 to S2 and
  - if at least one output of S1 feeds in as input to S2. Flow dependence is
  - denoted as  $S1 \rightarrow S2$
  - Example
    - S1:LOAD R1,A
    - S2: ADD R2,R1
    - S2 is flow dependent on S1
      - Output of S1 fed as input of S2
      - ie Ais passed to R1

129

## b) Anti-dependence

- Statement S2 is antidependent on statement S1
  - if S2 follows S1 in program order and
  - if the output of S2 overlaps the input to S1.
  - A direct arrow crossed with a bar dicates antidependence from S1 to S2
  - S1 → S2
  - Example
    - S2:ADD R2,R1
    - S3: MOV R1,R3
    - S3 is antidependent on S2
      - –S3 is overlapping the input to S2

## c) Output dependertce

- Two statements are output-dependent if they produce the same output variable.
- S1→S2
- Example
  - S1:LOAD R1,A
  - S3: MOV R1,R3
  - S3 is output-dependent on S1
    - Both modify same registerR1

131

# d) I/O dependence

- Read and write are IO statements.
- IO dependence occurs not because the same variable is involved but because the same file is referenced by both IO statements.
  - S1: READ(4),A(i)
  - S2: PROCESS
  - S3: WRITE(4),B(I)
  - S4: CLOSE(4)
- S1 snd S3 are I/O dependent
  - as they access same file

## e) Unknown dependnce

- The dependence relation between two statements cannot be determined in the following situations:
  - 1. The subscript of a variable is itself subscribed (eg: a(I(J)))
  - 2. The subscript does not contain the loop index variable. (eg: a[])
  - 3.A variable appears more than once with subscripts having different coefficients of the loop variable.
  - 4. The subscript is non linear in the loop index variable.
- When one or more of these conditions exist, a conservative assumption is to claim unknown dependence among the statements involved.



## Example 2

- Consider following code
  - S1: Read (4), A(I) /Read array A from file 4/
  - S2: Process /Process data/
  - S3: Write (4), B(I) /Write array B into file 4/
  - S4: Close (4) /Close file 4/
- S1 and S3 are I/O dependent on each other
  - because they both access the same file



(b) I/O dependence caused by accessing the same file by the read and write statements.

135

## II. Control Dependence

- Implies the order of execution of statements cannot be determined before run time
- For example,
  - conditional statements will not be resolved until run time.
- Different paths taken after a conditional branch may introduce or eliminate data dependence among instructions.
- Dependence may also exist between operations performed in successive iterations of a looping procedure.

## Example

- While the assignment b = 1 is executed only if a < 0, the assignment c = 2 is always executed regardless of the value of a.
- We say that b = 1 is control dependent on the condition a < 0 and that c = 2 is control independent.
- We refer to the branch on which an instruction is control dependent as its control dependence branch.

- Control dependence also avoids parallelism to being exploited.
- Compiler techniques or hardware branch prediction techniques are needed to get around the control dependence in order to exploit more parallelism.

### Resource dependence

- Data and control dependencies are based on the independence of the work to be done.
- Even if several segments are independent in other ways, they cannot be executed in parallel if there aren't sufficient processing resources.
- Resource dependence is concerned with conflicts in using shared resources among parallel events
- ALU conflicts are called ALU dependence
- Memory conflicts are called storage dependence.

139

## Bernstein's Conditions

- A set of conditions which must exist if two processes can execute in parallel
- Notation
  - Let P1 and P2 be two processes.
  - Input set Ii is the set of all input variables for a process Pi.
    - Ii is also called the read set or domain of Pi.
    - We define the input set Ii of a process Pi as the set of all input variables needed to execute the process.
  - Output set Oi is the set of all output variables generated after execution for a process Pi .
    - · Oi is also called write set.

- Input variables are essentially operands which can be fetched from the memory or registers and output variables are the results to be stored in working registers or memory locations.
- If P1 and P2 can execute in parallel (which is written as P1 || P2), then:

$$\mathbf{I}_1 \cap \mathbf{O}_2 = \emptyset$$

$$\mathbf{I}_2 \cap \mathbf{O}_1 = \emptyset$$

$$\mathbf{0}_{1} \cap \mathbf{0}_{2} = \emptyset$$

### Bernstein's condition -2

- In terms of data dependencies, Bernstein's conditions imply that two processes can execute in parallel
  - if they are flow-independent,
  - antindependent, and
  - output-independent.
- a set of processes P1, P2,...,Pk, can execute in parallel
  - if Bernstein's conditions are satisfied on a pairwise basis.
  - That is P1||P2||P3....||Pk if and only if Pi||Pj for all i≠j.

- The parallelism relation || is commutative ie
  - (Pi || Pj implies Pj || Pi ),
  - but not transitive (Pi || Pj and Pj || Pk does not guarantee Pi || Pk ).
  - Therefore, || is not an equivalence relation.
  - Pi || Pj || Pk implies associativity.ie (Pi || Pj )|| Pk =Pi || (Pj || Pk
     )
- Since the order in which parallel executable processes are executed should not make any difference in the output sets.

# Detection of parallelism in a program using Bernstein's conditions

- Consider the simple case in which each process is a single High Level Language(HLL) statement.
- We want to detect the parallelism embedded in the following five statements labeled P I, P2, P3, P4, and P5, in program order.

$$P_1: C = D \times E$$
  
 $P_2: M = G + C$   
 $P_3: A = B + C$   
 $P_4: C = L + M$   
 $P_5: F = G \div E$ 



 (a) A dependence graph showing both data dependence (solid arrows) and resource dependence (dashed arrows)

- Assume that each statement requires one step to execute. No pipelining is considered here.
- The dependence graph demonstrate flow dependence as well as resourse dependence
- In sequential execution five steps are needed



- If two adders are available simultaneously,
  - the parallel execution requires only three steps as shown in Fig. 2.2c.
  - Pairwise, there are 10 pairs of statements to check against Bernstein's conditions.
  - Only 5 pairs,
    - P1||P5, P2|| P3, P2| | P5, P5 ||P3, and P4||P5, can execute in parallel if there are no resource conflicts.
    - only P2||P3||P5, is possible because P2||P3, P3||P5, and P5||P2 are all possible.
- Violations of any one or more of the three conditions prohibits parallelism between two processes.
- data dependence, control dependence, and resource dependence all prevent parallelism from being exploitable.

#### Hardware Parallelism

#### e.g Number of processors

- Defined by machine architecture, hardware multiplicity (number of processors available) and connectivity.
- Displays Resource Utilization Patterns of simultaneously executable operations indicate Peak performance of processor resources.
- Often a function of cost/performance tradeoffs.
- Characterized in a single processor by the number of instructions k issued in a single cycle (k-issue processor).
- A multiprocessor system with n k-issue processor can handle a maximum limit of nk parallel instructions (at ILP level) or n parallel threads at thread-level parallelism (TLP) level.

#### Software Parallelism

#### e.g Degree of Parallelism (DOP)

- Defined by the control and data dependence of programs.
- Revealed in program profiling or program dependency (data flow) graph.
- A function of algorithm, parallel task grain size, programming style and compiler optimization.
- Program Flow Graphs Patterns of simultaneously executable operations
- Parallelism in a program varies during execution period
- Types
  - Control Parallelism and
  - Data Parallelism

149

### Hardware and Software Parallelism

- For implementation of parallelism special hardware and software support is needed
- Complilation support is also needed to close the gap between hardware and software

| Hardware Parallelism                                                                                                                             | Software Parallelism                                                                                            |
|--------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Its build into machines architecture and hardware multiplicity. Also known as machine parallelism                                                | Its exploited by the concurrent execution of machine language instructions in a program                         |
| It's a function of cost and performance trade off                                                                                                | It's a function of algorithm, programming style and compiler optimization.                                      |
| It displays resource utilization patterns of simultaneously executable operations. It also indicates the peak performance of processor resources | It displays patterns of simultaneously executable operations                                                    |
| Its characterized by no: of instruction issues per machine cycle                                                                                 | The program flow graph displays the patterns of simultaneously executable operations  2 types –                 |
|                                                                                                                                                  | Control parallelism – allows 2 or more operations to be performed simultaneously.                               |
|                                                                                                                                                  | Data parallelism – atmost same operation is performed over many data elements by many processors simultaneously |

# Mismatch between software parallelism and hardware parallelism

- There are eight instructions
  - four loads and four arithmetic operations
- to be executed in three consecutive machine cycles
- Four load operations are performed in the first cycle,
- followed by two multiply operations in the second cycle and
- two add/subtract operations in the third cycle.
- Therefore, the parallelism varies from 4 to 2 in three cycles.
- The average software parallelism is equal to 8/3= 2.67 instructions per cycle in this example program.



L<sub>i</sub>: Load operation X<sub>i</sub>: Multiply operation

(a) Software parallelism

- execution of the same program by a two-issue processor which can execute one memory access (load or write) and one arithmetic (add, subtract, multiply etc.) operation simultaneously.
- With this hardware restriction, the program must execute in seven machine cycles

- With this hardware restriction, the program must execute in seven machine cycles
- Therefore, the hardware parallelism displays an average value of 8/7= 1.14 instructions executed per cycle.
- This demonstrates a mismatch between the software parallelism and the hardware parallelism.



- Let us try to match the software parallelism in a hardware platform of a dual processor system,
- where single-issue processors are used.
- L/S stands for load/store operations.
- six processor cycles are needed to execute the I2 instructions by two processors. .
- S1 and S2 are two inserted store operations, and
- I5 and I6 are two inserted load operations.
- These added instructions are needed for interprocessor communication through the shared memory.



## Types of Software Parallelism

### 1.control parallelism

- which allows two or more operations to be performed simultaneously
- appearing in the form of pipelining or multiple functional units, is limited by the pipeline length and by the multiplicity of functional units.
- Both pipelining and functional parallelism are handled by the hardware; programmers need take no special actions to invoke them.

157

### 2.data parallelism

- in which almost the same operation is performed over many data elements by many processors simultaneously.
- offers the highest potential for concurrency.
- It is practiced in both SIMD and MIMD modes on MPP systems.
- Data parallel code is easier to write and to debug than control parallel code.
- Synchronization in SIMD data parallelism is handled by the hardware.
- Data parallelism exploits parallelism in prop'ortion to the quantity of data involved.

## **Solving Mismatch Problem**

- To solve the mismatch problem between software parallelism and hardware parallelism,
  - one approach is to develop compilation support, and the
  - other is through hardware redesign for more efficient exploitation of parallelism.
- These two approaches must cooperate with each other to produce the best result.

159

## **ROLE OF COMPILERS**

- Hardware processors can be better designed to exploit parallelism by an optimizing compiler
- That is compiler techniques are used to exploit hardware features to improve performance.
- Such processors use large register file and sustained instruction pipelining to execute nearly one instruction per cycle.
- The large register file supports fast access to temporary values generated by an optimizing compiler.
- The registers are exploited by code optimizer and global register allocator in such a compiler.

Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

Fraction<sub>enhanced</sub> = 0.4, Speedup<sub>enhanced</sub> = 10, Speedup<sub>overall</sub> = 
$$\frac{1}{0.6 + \frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$

161

A common transformation required in graphics processors is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.

We can compare these two alternatives by comparing the speedups:

Speedup<sub>FPSQR</sub> = 
$$\frac{1}{(1-0.2) + \frac{0.2}{10}} = \frac{1}{0.82} = 1.22$$

Speedup<sub>FP</sub> = 
$$\frac{1}{(1-0.5) + \frac{0.5}{1.6}} = \frac{1}{0.8125} = 1.23$$