Fundamentals of computer design¶

约 1193 个字 37 行代码 1 张图片预计阅读时间 5 分钟

Introduction¶

Classer of computers¶

Desktop computers (or personal computers .aka. PC)
- General-purpose,vatiety of software
- Emphasize good performance for a single user at relatively low cost
- Mostly execute third-party software(第三方软件)
Server Computers
- Emphasize great performance for a few complex applications
- Or emphasize reliable performance for many users at once
- Greater computing power, storage, or network capacity than personal computers
Embedded computers
- Largest class and most diverse
- Hidden as components of systems
- Stringent power, cost, and size constraints,performance.
Personal Mobile Devices
- Such as smartphones, tablets/ipad, and wearables
- generally have the same design requirements as PCs
Supercomputer
- Computer cluster(集群)
- High capacity,prformance, and reliability
- Range to building size(即大型机)

Class By Flynn¶

福林分类法

SISD: Single Instruction Stream, Single Data Stream
SIMD: Single Instruction Stream, Multiple Data Stream
MISD: Multiple Instruction Stream, Single Data Stream
MIMD: Multiple Instruction Stream, Multiple Data Stream

definition

IS: Instruction Stream
DS: Data Stream
CS: Control Stream
CU: Control Unit
PU: Processing Unit
MM&SM: Memory Module & Storage Module

See as follows:

Performance¶

There are a lot of factors that can affect the performance:

Architecture
Hardware implementation
Compiler
Operating system
Application

Two major performance metrics:

Latency: The time it takes to complete a task(Response time)
Throughput: The rate at which a task can be completed (Bandwidth)

We define the performance as:

\[ Performance = \frac{1}{Execution} \]

性能与执行时间成反比

Quantitative approaches¶

Bit
Nibble: 4 bits
Byte: 8 bits
Word: 32 bits on many embedded systems
Word: 64 bits on most desktop and server systems
Kibibyte: \(2^{10}\) bytes,KB
Mebibyte: \(2^{20}\) bytes,MB
Gibibyte: \(2^{30}\) bytes,GB
Tebibyte: \(2^{40}\) bytes,TB

CPU Performance¶

\[ CPU \ Execution \ Time = \frac{CPU \ Cycles}{Clock \ Rate} = CPU \ Cycles \times Clock \ Period \]

CPU time example

Computer A: 2GHz clock, 10s CPU time
Designing Computer B
Aim for 6s CPU time
Can do faster clock, but causes 1.2 × clock cycles

To determine how fast Computer B's clock must be:

\[ \text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6s} \]

Calculate Clock Cycles for Computer A:

\[ \text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10s \times 2GHz = 20 \times 10^9 \]

Calculate Clock Rate for Computer B:

\[ \text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6s} = \frac{24 \times 10^9}{6s} = 4GHz \]

CPI: Cycles Per Instruction is the average number of clock cycles per instruction.

\[ CPI = \frac{CPU \ Cycles}{Instruction \ Count} \]

some CPU time could also be written as:

\[ CPU \ Time = \frac{Instruction \ Count \times CPI}{Clock \ Rate} \]

The Big Picture:

\[ CPU \ Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle} \]

Performance depence on:

Algorithm：affects IC,CPI
Programming language: affects IC,CPI
Compiler: affects CPI,IC
Instruction set architecture: affects CPI,IC,T

Amdahl's Law¶

Amdahl's Law

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl's Law depends on two factors:

The fraction of the time the enhancement can be exploited.
The improvement gained by the enhancement while it is exploited.

\[ \text{Improved Execution Time} = \frac{\text{Affected Execution Time}}{\text{Amount of Improvement}} + \text{Unaffected Execution Time} \]

即部分改进的执行时间除以改进的倍数，加上未改进的执行时间，等于改进后的执行时间。

So based on Amdahl's Law, we can get the following formula:

\[ \text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right) \]

\[ \text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

from the formula above, we can conclude that:

The overall speedup is limited by the enhancement that is used the least.
If Fraction is close to 1, the overall speedup is close to the speedup of the enhancement.
If speedup of the enhancement is close to infinity, the overall speedup is close to 1/(1-Fraction).Which ,in fact ,is hard to achieve.

Amdahl's Law Example

Original Execution Time: 10 seconds
Enhancement: 2x speedup
Fraction of Execution Time Affected: 0.5 The execution time after enhancement is:

\[ \text{Execution time}_{\text{new}} = 10 \times \left( (1 - 0.5) + \frac{0.5}{2} \right) = 10 \times 0.75 = 7.5 \text{ seconds} \]

The overall speedup is:

\[ \text{Speedup}_{\text{overall}} = \frac{10}{7.5} = 1.33 \]

Great Architecture ideas¶

Moore's Law¶

Moore's Law is a principle that states that the number of transistors on a microchip doubles approximately every two years.

Use abstraction to simplify design¶

Abstraction is a technique that allows us to simplify a complex system by hiding unnecessary details.Low-level details are hidden from the higher-level components.

Make the common case fast¶

Identify the common case and try to improve it.
Most cost-efficient method to obtain improvements.

Improve performance via parallelism¶

Improve performance by performing operations in parallel.
There are many levels of parallelism—instruction-level, process-level, etc.

Improve performance via pipelining¶

Break tasks into stages so that multiple tasks can be simultaneously performed in different stages.
Commonly used to improve instruction throughput.

Improve performance via prediction¶

Sometimes faster to assume a particular result than waiting until the result is known.
Known as speculation and is used to guess results of branches.

Use a hierarchy of memories¶

Make the fastest, smallest, and most expensive per bit memory the first level accessed and the slowest, largest, and cheapest per bit memory the last level accessed.
Allows most accesses to be caught at the first level and be able to retain most of the information at the lowest level.

Improve dependability via redundancy¶

Include redundant components that can both detect and correct failures.
Used at many different levels.

Instruction Set Architecture¶

instruction set architecture(ISA) is the set of instructions that a processor can execute.

when designing an ISA, we need to consider the following basic principles:

Compatibility:
- The ISA should be compatible with the existing software.
Versatility:
- The ISA should be versatile enough to support a variety of applications.
High efficiency:
- The ISA should be efficient in terms of the number of instructions and the amount of data that can be processed in a single instruction.
Security:
- The ISA should be secure enough to prevent attacks.

Instruction Set design issues¶

Where are operands stored?: registers,memory,stack,accumulator,etc.
How many operands are needed?: 0,1,2,3,etc. Classification of instructions
How is the operand location specified?: immediate,direct,indirect,relative,etc. Addressing modes
What type and size of operands are needed?: integer,floating-point,vector,etc. Data representation
What operations are supported?: arithmetic,logic,control,etc. Types of instructions

ISA Classes

Stack architecture
Accumulator architecture
Register architecture
General-purpose registers architecture(GPR)

对于A*B-(A+C*B)

; Stack architecture
PUSH A
PUSH B
MUL         ; A * B
PUSH A
PUSH C
PUSH B
MUL         ; C * B
ADD         ; A + (C * B)
SUB         ; (A * B) - (A + C * B)

; Accumulator architecture
LOAD A      ; Acc = A
MUL B       ; Acc = A * B
STORE TEMP  ; Store result in TEMP
LOAD B      ; Acc = B
MUL C       ; Acc = C * B
ADD A       ; Acc = A + (C * B)
STORE TEMP2 ; Store result in TEMP2
LOAD TEMP    ; Acc = (A * B)
SUB TEMP2   ; Acc = (A * B) - (A + C * B)

; Register architecture
LOAD R1, A  ; R1 = A
LOAD R2, B  ; R2 = B
MUL R3, R1, R2  ; R3 = A * B
LOAD R4, C  ; R4 = C
MUL R4, R4, R2  ; R4 = C * B
ADD R4, R4, R1  ; R4 = A + C * B
SUB R3, R3, R4  ; R3 = (A * B) - (A + C * B)

; General-purpose registers architecture(GPR)
load R1, A
load R2, B
mul R3, R1, R2  ; R3 = A * B
load R4, C
load R5, B
mul R6, R4, R5  ; R6 = C * B
add R6, R6, R1  ; R6 = A + C * B
sub R3, R3, R6  ; R3 = (A * B) - (A + C * B)