[Reading Notes] Computer Organization and Design: The Hardware/Software Interface (1)



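Computer Organization and Design: The Hardware/Software Interface (Chinese title: 《计算机组成与设计:硬件/软件接口》) is one of the classic introductory textbooks on computer organization. Its pace is brisk without being rushed, its content is substantial without being long-winded, and its language is plain and accessible rather than needlessly profound, which makes it a very good first textbook for students new to computer organization.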



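Knowledge Points Summary: a summary of what I learned from each reading.
Difficult Points Summary: a summary of the pitfalls I ran into while reading and trying to understand the material.
Assignment Solutions: some articles (books) come with exercises; I try to work through all of them and post my answers and thoughts here. Corrections and additions are welcome.
Reflection: my own takeaways from reading; this module is not limited to the reading material, so any idea may show up here.
Appendix: everything that does not fit the four modules above, including but not limited to further reading and cited sources.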


Chapter 1: Computer Abstractions and Technology



1.1 Introduction:

【Knowledge Points Summary】:

1). Classes of Computing Applications:

Personal Computers (PCs): for individual use; about 35 years old.
Servers: larger machines that carry large workloads, with greater computing, storage, and I/O capability.
Cloud Computing: Warehouse Scale Computers (WSCs) [1], SaaS (Software as a Service) [2]
Embedded Systems: the widest range of applications; low tolerance for failure.
Personal Mobile Devices (PMDs)

2). The performance of a program is influenced by:

The algorithms used in the program:

Determines: the number of source-level statements and the number of I/O operations executed.
Not covered in this book. Please see my other blog post, "Six Fundamental Subjects of Computer Science : Recommendation Textbooks List."
The software systems used to create and translate the program into machine instructions:
Components: programming language, compiler, instruction set architecture.
Determines: the number of computer instructions for each source-level statement.
Discussed in: Chapters 2 and 3.
The effectiveness of the computer in executing those instructions:
Components: processor, memory system, I/O system (hardware and operating system).
Determines: how fast instructions and I/O operations can be executed.
Discussed in: Chapters 4, 5, and 6.

1.2 Eight Great Ideas in Computer Architecture:

Design for Moore's Law:
Integrated circuit (IC) resources double every 18-24 months.
Stated by Gordon Moore in 1965.
Use Abstraction to Simplify Design:
Hide lower-level design details so that higher levels see a simpler model.
Make the Common Case Fast:
Enhance performance by optimizing the common case rather than the rare case.
Performance via Parallelism:
A significant topic covered throughout this book.
Performance via Pipelining:
A particular pattern of parallelism.
Performance via Prediction:
Guess and start working rather than waiting until you know for sure.
Assumes that recovering from a misprediction is not too expensive and that the prediction is relatively accurate.
Hierarchy of Memories:
Different memory technologies differ in speed and price, so they need to be balanced.
The fastest, smallest, and most expensive memory per bit sits at the top of the hierarchy.
Through a carefully designed hierarchy and its implementation, the programmer gets the illusion that main memory is almost as fast as the caches at the top of the hierarchy.
Dependability via Redundancy:
Make systems dependable by including redundant components that:
take over when a failure occurs.
help detect failures.

1.3 Below Your Program:

1). Hardware/Software Hierarchical Layers:

Hierarchy = (Applications Software (Systems Software (Hardware)))
Systems Software:
sits between the hardware and the applications software.
provides commonly useful services.
Operating Systems:

Handle basic input/output operations.
Allocate storage and memory.
Provide for protected sharing of the computer among multiple applications using it simultaneously.
Compilers:
Translate a program written in a high-level language into instructions that the hardware can execute.
Loaders, assemblers, linkers [3]
Program execution pathway:
Electronic hardware is controlled by electrical signals.
On/off electrical signals correspond to 1/0.
Assembler: translates the symbolic version of an instruction (assembly language) into the binary version (machine language).
Every basic computer instruction (like add or subtract) requires one line of assembly language.
High-level language program ==(Compiler)==> Assembly language program ==(Assembler)==> Binary machine language program
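
To make the assembler step concrete, here is a minimal sketch of a toy assembler in Python. The opcode table, the register names, and the 10-bit instruction layout are all made up for illustration; they do not correspond to MIPS or any other real instruction set.

```python
# Toy assembler: symbolic instruction -> binary machine word.
# ASSUMPTION: the opcodes, register names, and bit layout below are
# hypothetical and exist only to illustrate the translation step.
OPCODES = {"add": 0b0001, "sub": 0b0010}          # 4-bit opcode
REGISTERS = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}  # 2-bit register number


def assemble(line: str) -> str:
    """Translate 'op rd, rs, rt' into a 10-bit binary string."""
    mnemonic, operands = line.split(maxsplit=1)
    rd, rs, rt = (REGISTERS[name.strip()] for name in operands.split(","))
    # Layout: [opcode:4][rd:2][rs:2][rt:2]
    word = (OPCODES[mnemonic] << 6) | (rd << 4) | (rs << 2) | rt
    return format(word, "010b")


print(assemble("add r1, r2, r3"))  # 0001011011
```

One line of assembly becomes exactly one machine word here, which is the one-to-one relationship the notes describe.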

1.4 Under the Covers:

1). Computer Hardware Organization:

Functions: inputting data, outputting data, processing data, storing data.
Input: feeds data in (Details: Chapters 5 and 6)
Output: sends data out (Details: Chapters 5 and 6)
Memory: stores data (Details: Chapter 5)
Datapath: (Details: Chapters 3, 4, 6 and Appendix C)
where data flows and is modified.
performs arithmetic operations.

"[A datapath] is a collection of functional units, such as arithmetic logic units or multipliers, that perform data processing operations, registers, and buses." R1

Control: (Details: Chapters 4, 6 and Appendix C)
According to the instructions of the program, sends the signals that determine the operations of the datapath, memory, input, and output.
Datapath + Control = Processor

2). I/O devices:

Output devices:

Liquid crystal displays (LCDs):

used on mobile devices to get a thin, low-power display.
Active matrix display: an LCD that uses a transistor to control the transmission of light at each individual pixel.
Bit map:
the matrix of pixels, represented as a matrix of bits.
Typically ranges in size from 1024 x 768 to 2048 x 1536.
Raster refresh buffer (frame buffer):
stores the bit map (which represents the image).
The bit pattern per pixel is read out to the graphics display at the refresh rate.
Input devices:

Touch screen:
Capacitive sensing.
People are electrical conductors.
If an insulator (such as glass) is covered with a transparent conductor, a finger touching it distorts the electrostatic field of the screen.
The capacitance changes because of the touch.
Allows multiple touches simultaneously.

3). Inside the box:

Chips: integrated circuits, devices combining dozens to millions of transistors.
Central Processing Unit (CPU): datapath + control.
Memory:
the storage area in which programs are kept while they are running, together with the data needed by the running programs.
DRAM (dynamic random access memory):
built as an integrated circuit.
provides random access to any location.
"Random access" means that memory accesses take basically the same amount of time no matter which portion of the memory is read.
access times are 50 - 70 ns and cost is 5 - 10 $/GiB (2012 figures).
requires the data to be refreshed periodically in order to retain it.
one transistor and one capacitor for every bit of data.
volatile memory.
Cache memory:
inside or alongside the CPU.
uses a different memory technology, SRAM (static random access memory):
faster, smaller, and more expensive than DRAM.
does not need to be refreshed, since the transistors hold the data as long as the power is not cut off.
6 transistors for every bit of data.
volatile memory.
Caching is used to bridge the gap between the high computing speed of the processor and the low access speed of main memory (DRAM).
Cache memory acts as a buffer for the DRAM memory.
Instruction set architecture (ISA): (Details: Chapter 2 and Appendix A)
the abstract interface between hardware and software.
Application binary interface (ABI):
the combination of the basic instruction set and the operating system interface (I/O operations + memory allocation + low-level system functions) provided for application programmers.

4). Storing data:

Distinguished by technology:

volatile memory:

storage, such as DRAM and SRAM, that retains data only while it is receiving power.
once power is cut off, the data is lost.
nonvolatile memory:
retains data even in the absence of a power source.
used to store data and programs between runs.
Distinguished by role:

main memory (primary memory):

memory used to hold programs and data while they are running.
volatile memory; consists of DRAM and SRAM [5].
secondary memory:
memory used to hold programs and data between runs.
nonvolatile memory:
Magnetic disk (hard disk):

composed of rotating platters coated with a magnetic recording material.
a rotating mechanical device.
Access time: 5 - 20 milliseconds
Cost: 0.05 - 0.10 $/GiB (in 2012)
Flash memory (used in PMDs):
Access time: 5 - 50 microseconds
Cost: 0.75 - 1.00 $/GiB (in 2012)
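
The access times listed for DRAM, flash, and magnetic disk span several orders of magnitude, which is easier to see as ratios. The following sketch only reuses the ranges quoted in these notes; the dictionary and function names are my own, chosen for illustration.

```python
# Access-time ranges (seconds) copied from the notes above (2012 figures).
# The names ACCESS_TIME_S and slowdown_range are illustrative only.
ACCESS_TIME_S = {
    "DRAM": (50e-9, 70e-9),
    "flash": (5e-6, 50e-6),
    "disk": (5e-3, 20e-3),
}


def slowdown_range(slow: str, fast: str) -> tuple[float, float]:
    """Return (min, max) of how many times slower `slow` is than `fast`."""
    slow_lo, slow_hi = ACCESS_TIME_S[slow]
    fast_lo, fast_hi = ACCESS_TIME_S[fast]
    # Smallest ratio: fastest "slow" access vs. slowest "fast" access.
    return slow_lo / fast_hi, slow_hi / fast_lo


for slow, fast in [("disk", "DRAM"), ("disk", "flash"), ("flash", "DRAM")]:
    lo, hi = slowdown_range(slow, fast)
    print(f"{slow} is roughly {lo:,.0f}x to {hi:,.0f}x slower than {fast}")
```

With these figures, a disk access can be hundreds of thousands of times slower than a DRAM access, which is why the memory hierarchy matters so much.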

5). Communicating with other computers:

Functions: communication, resource sharing, nonlocal access.
Ethernet: used in local area networks (LANs); a LAN is a network of computers.
Internet: built on wide area networks (WANs); a network of networks.
Cost: depends on the speed of communication and the distance covered.
Wireless technologies:
IEEE standard 802.11

1.5 Technologies for Building Processors and Memory:

1). Components:

silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuit ==> very large-scale integrated (VLSI) circuit ==> chip

2). Manufacturing process:

silicon crystal ingot ==(Sliced)==> wafers ==(Diced)==> dies (chips) ==(Bonding)==> CPU package
Defect: a microscopic flaw that can result in the failure of the die containing it.
Yield: the percentage of good dies out of the total number of dies on the wafer.
Cost per die = Cost per wafer / (Dies per wafer * Yield)
Dies per wafer ~= Wafer area / Die area
Yield = 1 / (1 + (Defects per area * Die area / 2))^2 (based on empirical observations of yields at IC factories)
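
As a worked example of the three cost formulas, here is a small Python sketch. It reads "cost per die" as the wafer cost divided by the number of good dies (dies per wafer times yield). The wafer cost, areas, and defect density in the example are invented numbers, not data from the book.

```python
def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximate die count; ignores wasted area at the wafer edge."""
    return wafer_area / die_area


def die_yield(defects_per_area: float, die_area: float) -> float:
    """Empirical yield model: 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(cost_per_wafer: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    """Wafer cost spread over the good dies only."""
    good_dies = (dies_per_wafer(wafer_area, die_area)
                 * die_yield(defects_per_area, die_area))
    return cost_per_wafer / good_dies


# ASSUMPTION: made-up example values (areas in mm^2, defects per mm^2).
cost = cost_per_die(cost_per_wafer=5000.0, wafer_area=70000.0,
                    die_area=100.0, defects_per_area=0.01)
print(f"yield = {die_yield(0.01, 100.0):.3f}, cost per die = ${cost:.2f}")
```

Because die area appears both in the die count and, squared, in the yield, the cost per die grows much faster than linearly as the die gets bigger.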

1.6 Performance:

1). Definition:

Response time (execution time): the total time required for the computer to complete a task.
Throughput (bandwidth): the number of tasks completed per unit time.
Define: "Computer X is n times faster than computer Y", with Performance = 1 / Execution time:
Performance_X / Performance_Y = n
Execution time_Y / Execution time_X = n
Time measurement:
Elapsed time (system performance): the total time to complete a task.
CPU execution time: the actual time the CPU spends computing for a specific task.
user CPU time (CPU performance): CPU time spent in the program itself.
system CPU time: CPU time spent in the operating system performing tasks on behalf of the program.
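
The "n times faster" definition can be checked with a few lines of Python. This is only a sketch: the function name and the 10 s / 15 s timings are illustrative values I picked.

```python
def times_faster(exec_time_x_s: float, exec_time_y_s: float) -> float:
    """Return n such that 'X is n times faster than Y'.

    Since performance = 1 / execution time, the performance ratio
    Performance_X / Performance_Y equals exec_time_y / exec_time_x.
    """
    return exec_time_y_s / exec_time_x_s


# Example: X finishes a task in 10 s, Y needs 15 s for the same task.
print(times_faster(10.0, 15.0))  # 1.5
```

Note that the faster machine is the one with the smaller execution time, so the times appear in the opposite order from the performances.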

2). Clock Cycle:

the time for one clock period, usually of the processor clock, which runs at a constant rate.
a measure that relates to how fast the hardware can perform basic functions.
Clock period: the length of one complete clock cycle, e.g., 250 picoseconds.
Clock rate = 1 / clock period, e.g., 1 / (250 * 10^-12 s) = 4 GHz

3). Performance measurement equations:

CPU execution time for a program:

= CPU clock cycles for a program * Clock period

= CPU clock cycles for a program / clock rate

CPU clock cycles = Instructions for a program * Average clock cycles per instruction (CPI)
CPU execution time:

= Instruction count * CPI * Clock cycle time

= Instruction count * CPI / Clock rate

Time = Seconds/Program = Instructions/Program * Clock cycles/Instruction * Seconds/Clock cycle
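
The CPU time equations translate directly into code. Below is a minimal sketch; the instruction count and CPI in the example are made-up values, and the 250 ps clock period is the same illustrative figure used for the clock cycle definition.

```python
def clock_rate_hz(clock_period_s: float) -> float:
    """Clock rate is the inverse of the clock period."""
    return 1.0 / clock_period_s


def cpu_time_s(instruction_count: float, cpi: float, rate_hz: float) -> float:
    """CPU time = instruction count * CPI / clock rate."""
    clock_cycles = instruction_count * cpi
    return clock_cycles / rate_hz


rate = clock_rate_hz(250e-12)  # a 250 ps period is about 4 GHz
# ASSUMPTION: 10 billion instructions at an average CPI of 2.0.
print(f"{rate / 1e9:.1f} GHz, CPU time = {cpu_time_s(10e9, 2.0, rate):.2f} s")
```

Any of the three factors (instruction count, CPI, clock rate) can be traded against the others, so none of them alone is a reliable measure of performance.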

4). Performance factors:

Instruction count:
the number of instructions executed for the program.
can be measured by:
using software tools that profile the execution,
using a simulator of the architecture,
or using hardware counters.
Clock cycles per instruction (CPI):
the average number of clock cycles per instruction.
can be measured with hardware counters. (Q: is there any other method to measure CPI?)
Q: What determines the number of clock cycles per instruction? What is the difference between implementation and architecture? (see Chapters 4 and 5)
Clock cycle time:
published as part of the documentation for a computer.
Q: What determines the clock cycle time?

5). Performance of a program:

depends on:

Algorithm:
affects: instruction count, possibly CPI
Programming language:
affects: instruction count, CPI
Compiler:
affects: instruction count, CPI
Instruction set architecture:
affects: instruction count, CPI, clock rate

6). Some processors fetch and execute multiple instructions per clock cycle, so CPI can be < 1.0.


To save energy or temporarily boost performance, today's processors can vary their clock rates.
E.g., the Intel Core i7 will temporarily increase its clock rate by about 10% until the chip gets too warm. Intel calls this Turbo mode.
So we need to use the average clock rate for a program.

1.7 The Power Wall

1). Clock rate and power consumption are correlated.

2). The practical power a computer can use is limited by the cooling available for commodity microprocessors.

3). Energy is another figure of merit for computers such as PMDs and warehouse scale computers.

4). The dominant IC technology is CMOS (complementary metal oxide semiconductor).

5). Dynamic energy:

the energy consumed when transistors switch states from 0 to 1 (or 1 to 0).
the primary source of energy consumption in CMOS.
proportional to Capacitive load * Voltage^2, which is the energy of a pulse during the logic transition 0 -> 1 -> 0 or 1 -> 0 -> 1.
Power required per transistor = 1/2 * C * V^2 * f
f: frequency switched
is a function of the clock rate.
C: capacitive load of each transistor
is a function of both the number of transistors connected to an output (called the fanout) and the technology (which determines the capacitance of both wires and transistors).
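
To see how strongly voltage affects dynamic power, here is a short sketch of the power formula (one half times capacitive load times voltage squared times switching frequency). The baseline capacitance, voltage, and frequency are invented values; only the ratio between the two results is meaningful.

```python
def dynamic_power_w(capacitive_load_f: float, voltage_v: float,
                    frequency_hz: float) -> float:
    """Dynamic power = 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load_f * voltage_v ** 2 * frequency_hz


# ASSUMPTION: hypothetical baseline values for an "old" design.
old = dynamic_power_w(capacitive_load_f=1e-9, voltage_v=1.0, frequency_hz=4e9)
# A "new" design with 85% of the capacitive load, voltage, and frequency.
new = dynamic_power_w(capacitive_load_f=0.85e-9, voltage_v=0.85,
                      frequency_hz=0.85 * 4e9)
print(f"new / old power = {new / old:.3f}")  # 0.85^4, about 0.522
```

Because voltage enters squared, lowering it was for a long time the most effective lever for keeping power in check as clock rates rose.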

1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

1). Definition: multiple processors (cores) per chip.

2). Parallelism: improves performance by increasing throughput.

3). Challenges:

Load balancing
Time for synchronization
Overhead for communication between the parties.

1.9 Benchmarking the Intel Core i7

Q: What is a benchmark? Who publishes it, and how do they calculate it? How is it used?

1.10 Fallacies and Pitfalls

1). Pitfall: expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.

Amdahl's Law:
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
a rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.
used to evaluate potential enhancements.
used to argue for practical limits to the number of parallel processors. (see Chapter 6 for details)
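
Amdahl's Law is easy to experiment with in code. The sketch below uses a hypothetical program that runs for 100 s, of which 80 s are spent in the part being improved; the function names are mine.

```python
def time_after_improvement(affected_s: float, unaffected_s: float,
                           improvement: float) -> float:
    """Amdahl's Law: only the affected part shrinks by the improvement factor."""
    return affected_s / improvement + unaffected_s


def overall_speedup(affected_s: float, unaffected_s: float,
                    improvement: float) -> float:
    """Total time before the improvement divided by total time after it."""
    before = affected_s + unaffected_s
    return before / time_after_improvement(affected_s, unaffected_s, improvement)


# ASSUMPTION: 80 s of a 100 s run are affected by the improvement.
for factor in (2, 4, 16, 1e12):
    print(f"improvement {factor:g}x -> overall speedup "
          f"{overall_speedup(80.0, 20.0, factor):.3f}x")
```

However large the improvement factor gets, the overall speedup stays below 100 / 20 = 5x, because the 20 s that are unaffected never go away.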



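1.1 Overview:


1). Classes of computers:

Personal Computer: for private use; 35 years of history.
Server: greater computing capacity, storage capacity, and input/output load.
Supercomputer: formidable computing power, commonly used for scientific data processing, weather forecasting, and so on.
Cloud Computing: Warehouse Scale Computers [1], Software as a Service (SaaS) [2]
Embedded Computer: the most widely used class; low tolerance for failure (which means dependability is the top priority).

2). Factors that affect program performance:


Includes: the programming language used, the compiler, and the instruction set architecture.
Includes: the processor, the memory system, and the I/O system (hardware and operating system).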

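1.2 Eight Great Ideas in Computer Architecture:

Moore's Law:

1.3 What lies behind your program:

1). Hardware/software layered architecture:

Hierarchy = (Applications software (Systems software (Hardware)))
Operating System:

Compiler:
Loader, assembler, linker [3]
The simplest electrical signal: on/off = binary signal 0/1
Assembler: translates assembly language (computer instructions made up of simple symbols) into binary machine language (a language made up of 0s and 1s that the machine can understand, i.e., that can be turned into electrical signals).
Program written in a high-level language (such as C or Java) ==(Compiler)==> Assembly language program ==(Assembler)==> Binary machine language program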

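1.4 What is inside the computer:

1). Computer hardware organization:

Datapath: responsible for all arithmetic operations.
Control: according to the program's instructions, sends signals to the datapath, input/output, and memory so that they perform the corresponding operations.
Control + Datapath = Processor (explained in detail in Chapter 4)

2). Input/Output:


Processor: datapath + control
Dynamic random access memory (DRAM):
Access time: 50 - 70 nanoseconds; cost: 5 - 10 $/GiB (2012 figures)
Cache memory:

Storing 1 bit of data takes 6 transistors.

4). Storing data:

Magnetic disk (hard disk)
Flash memory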

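5). Communication between computers:


1.5 Manufacturing technology for processors and memory:

1). Components:

Silicon ==> semiconductor ==> transistor (on/off switch) ==> integrated circuit (IC) ==> very large-scale integrated (VLSI) circuit ==> chip

2). Manufacturing process:

Silicon crystal ingot ==(Sliced)==> wafers ==(Diced)==> dies ==(Packaged)==> processor/memory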



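1. The total number of machine instructions in the assembly code of a compiled program is fixed. So if the same assembly program runs on different processors, is the CPI the same?


2. If CPUs use different instruction sets, their implementations (meaning the circuit implementations) are certainly different. Even when they follow the same instruction set, the way it is implemented (different circuit designs implementing the same instruction, e.g. 1001) can differ, that is, the implementation architecture can differ too. This difference affects the performance of different processors. Of course, different processor architectures require different instruction sets, which also affects their performance. In one sentence: the instruction set determines the processor architecture, and if the instruction sets are incompatible, the machine instructions are incompatible.

3. Processors with different instruction sets cannot run the same assembly program, because the statements of the assembly language do not match the instruction set.

4. Do CPUs with different architectures use different compilers?


5. Do CPUs with different architectures use different assembly languages?


5.2. What is the relationship between processor architecture, instruction set, and assembly language?

See this Zhihu answer: https://www.zhihu.com/question/23474438

6. How is cross-platform compilation handled? How is cross-platform execution handled?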





