[Reading Notes] Computer Organization and Design: The Hardware/Software Interface (1)

2023-06-12

Preface:

Computer Organization and Design: The Hardware/Software Interface (Chinese edition title: 《计算机组成与设计:硬件/软件接口》) is one of the classic introductory textbooks on computer organization. Brisk but never rushed, substantial but never bloated, and written in plain, approachable language with no pretension, it is an excellent first textbook for students new to computer organization.

This reading-notes series records what I learn and think while reading. Being reading notes, it will certainly not cover everything; that would turn it into a transcription of the book. Every post in the series aims to be concise, and the ultimate goal is that, with the book set aside, the notes alone are enough to recall the essential points of each chapter.

So that my future self can understand what I write today, each post will be narrated bilingually (English + Chinese) and organized into five modules: Knowledge Points Summary, Difficult Points Summary, Assignment Solutions, Reflection, and Appendix.

Knowledge Points Summary: summarizes the material covered by each reading.
Difficult Points Summary: summarizes the pitfalls I hit while reading and digesting the material.
Assignment Solutions: some texts come with exercises; I will try to work through all of them and record my answers and interpretations here. Fellow readers are welcome to point out gaps and mistakes.
Reflection: my own thoughts from reading go here; this module is not restricted to the reading material, and any idea may appear.
Appendix: everything that does not fit the four modules above, including but not limited to further reading and citation sources.

In the studies ahead, I hope to stay focused, stay excited, stay humble, and stay curious. Learning never ends, and the road is long.


Chapter 1: Computer Abstractions and Technology

======================================

English:

1.1 Introduction:

[Knowledge Points Summary]:

1). Classes of Computing Applications:

Personal Computers: for individual use; about 35 years old as a class of machine.
Servers: larger machines that carry large workloads, with greater computing, storage, and I/O capability.
Supercomputers
Cloud Computing: Warehouse-Scale Computers (WSCs), SaaS (Software as a Service)
Embedded Systems: the widest range of applications; low tolerance for failure.
Personal Mobile Devices (PMDs)

2). The performance of a program is influenced by:

The Algorithms used in the program:

Determines: the number of source-level statements and the number of I/O operations executed.
Not covered in this book; see my other post, "Six Fundamental Subjects of Computer Science: Recommended Textbook List."
The Software Systems used to create and translate the program into machine instructions:
Components: programming language, compiler, instruction set architecture.
Determines: the number of computer instructions for each source-level statement.
Discussed in: Chapters 2 and 3.
The effectiveness of the Computer in executing those instructions:
Components: processor, memory system, I/O system (hardware and operating system).
Determines: how fast instructions and I/O operations can be executed.
Discussed in: Chapters 4, 5, and 6.

1.2 Eight Great Ideas in Computer Architecture:

Moore's Law:
Integrated circuit (IC) resources double every 18-24 months.
Stated by Gordon Moore in 1965.
Use Abstraction to Simplify Design:
Hide low-level design details from higher-level design.
Make the Common Case Fast:
Enhance program performance by optimizing the common case of the problem.
Performance via Parallelism:
A significant topic covered throughout this book.
Performance via Pipelining:
A particular pattern of parallelism.
Performance via Prediction:
Guess and start work early rather than wait until knowing for sure.
Assumes that recovering from a misprediction is not expensive and that prediction is relatively accurate.
Hierarchy of Memories:
Different memory technologies have different speeds and prices, which need to be balanced.
The fastest, smallest, and most expensive memory per bit sits at the top of the hierarchy.
Through a suitable hierarchy design and implementation, main memory accesses can appear almost as fast as the caches.
Dependability via Redundancy:
Make systems dependable by including redundant components that:
take over when a failure occurs.
help detect failures.

1.3 Below Your Program:

1). Hardware/Software Hierarchical Layers:

Hierarchy = (Applications Software (Systems Software (Hardware)))
Systems Software:
sits between the hardware and applications software.
provides commonly useful services.
including:
Operating Systems:

Handle basic input/output operations.
Allocate storage and memory.
Provide protected sharing of the computer among the multiple applications using it simultaneously.
Compilers:
Translate the high-level language of a program into instructions that the hardware can execute.
Also: loaders, assemblers, linkers.
Program execution pathway:
Electronic hardware is controlled by electrical signals.
On/off electrical signals correspond to 1/0.
Assembler: translates the symbolic version of instructions (assembly language) into the binary version (machine language).
Every basic computer instruction (like add or subtract) requires one line of assembly language.
High-level language program ==(Compiler)==> Assembly language program ==(Assembler)==> Binary machine language program
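As a loosely analogous, runnable illustration of this translation pipeline (not the C-to-assembly flow the book describes, just a sketch using Python's own toolchain), the snippet below compiles a line of source text and disassembles the resulting low-level instructions:

# Python's compile()/dis pair mirrors the compiler/assembler idea:
# source text is translated into lower-level instructions we can inspect.
import dis

source = "c = a + b"
code = compile(source, "<example>", "exec")  # the "compiler" step
dis.dis(code)  # prints the low-level instruction stream for c = a + b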

1.4 Under the Covers:

1). Computer Hardware Organization:

Functions: inputting data, outputting data, processing data, storing data.
Components:
Input: feeds data (Details: Chapters 5, 6)
Output: outputs data (Details: Chapters 5, 6)
Memory: stores data (Details: Chapter 5)
Datapath: (Details: Chapters 3, 4, 6 and Appendix C)
where data goes and is modified.
performs arithmetic operations.

"A datapath is a collection of functional units, such as arithmetic logic units or multipliers, that perform data processing operations, plus registers and buses." [R1]

Control:
According to the instructions of the program, sends the signals that determine the operations of the datapath, memory, and I/O.
(Details: Chapters 4, 6 and Appendix C)
Datapath + Control = Processor

2). I/O devices:

Output devices:

Liquid crystal displays (LCDs):

used on mobile devices to get a thin, low-power display.
Active matrix display: an LCD that uses a transistor to control the transmission of light at each individual pixel.
Bit map:
the matrix of pixels, represented as a matrix of bits.
Normally ranging in size from 1024 x 768 to 2048 x 1536.
Raster refresh buffer (frame buffer):
stores the bit map that represents the image.
The bit pattern for each pixel is read out to the graphics display at the refresh rate.
Input devices:
Touchscreen:

Implementation:

Capacitive sensing.
People are electrical conductors.
If an insulator (the glass) is covered with a transparent conductor, a touching finger distorts the electrostatic field of the screen.
The capacitance changes because of the touch.
Allows multiple touches simultaneously.

3). Inside the box:

Chips: integrated circuits, devices combining dozens to millions of transistors.
Central Processing Unit (CPU): datapath + control.
Memory:
the storage area in which programs are kept while they are running, along with the data needed by the running programs.
DRAM (dynamic random access memory):
built as an integrated circuit.
provides random access to any location.
RAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.
access times are 50-70 ns; cost is 5-10 $/GiB (in 2012).
requires the data to be refreshed periodically in order to retain it.
One transistor and one capacitor for every bit of data.
Volatile memory.
Cache memory:
inside or alongside the CPU.
uses a different memory technology, SRAM (static random access memory):
faster, smaller, more expensive.
does not need to be refreshed: the transistors inside continue to hold the data as long as the power is not cut off.
Six transistors for every bit of data.
Volatile memory.
The cache technique addresses the imbalance between the high computing speed of the processor and the low access speed of main memory (DRAM).
Cache memory acts as a buffer for the DRAM memory.
Instruction set architecture: (Details: Chapter 2 and Appendix A)
the abstract interface between hardware and software.
Application binary interface (ABI):
the combination of the basic instruction set and the operating system interface (I/O operations + memory allocation + low-level system functions) for the application programmer.

4). Storing data:

Distinguished by the technology used:

Volatile memory:

storage, such as DRAM and SRAM, that retains data only while it is receiving power.
Once power is cut off, the data is lost.
Nonvolatile memory:
retains data even in the absence of a power source.
used to store data and programs between runs.
Distinguished by role:

Main memory (primary memory):

memory used to hold programs and data while running.
Volatile memory; built from DRAM (with SRAM used for the caches).
Secondary memory:
memory used to hold programs and data between runs.
Nonvolatile memory:
Magnetic disk (hard disk):

composed of rotating platters coated with a magnetic recording material.
rotating mechanical devices.
Access time: 5-20 milliseconds.
Cost: 0.05-0.10 $/GiB (in 2012).
Flash memory (used in PMDs):
Access time: 5-50 microseconds.
Cost: 0.75-1.00 $/GiB (in 2012).

5). Communicating with other computers:

Networks:
Functions: communication, resource sharing, nonlocal access.
Types:
Ethernet: used in a Local Area Network (LAN), a network of computers.
Internet: used in a Wide Area Network (WAN), a network of networks.
Cost: depends on communication speed and distance.
Wireless technologies:
IEEE 802.11 standard

1.5 Technologies for Building Processors and Memory:

1). Components:

silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuit ==> very large-scale integrated (VLSI) circuit ==> chip

2). Manufacturing process:

silicon crystal ingot ==(sliced)==> wafers ==(diced)==> dies (chips) ==(bonded)==> CPU package
Defects: microscopic flaws that result in the failure of a die.
Yield:
the percentage of good dies out of the total number of dies on the wafer.
Cost per die = Cost per wafer / (Dies per wafer * Yield)
Dies per wafer ~= Wafer area / Die area
Yield = 1 / (1 + (Defects per area * Die area / 2))^2  ----- based on empirical observations of yields at IC factories.
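To make this cost model concrete, here is a minimal Python sketch of the three formulas; the wafer cost, areas, and defect density in the example are invented for illustration, not figures from the book.

# Sketch of the Section 1.5 cost/yield model (illustrative numbers only).
def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    # First-order approximation: ignores dies lost at the wafer edge.
    return wafer_area / die_area

def die_yield(defects_per_area: float, die_area: float) -> float:
    # Empirical model: Yield = 1 / (1 + Defects per area * Die area / 2)^2
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return wafer_cost / good_dies

# Hypothetical 300 mm wafer (~70,686 mm^2), 100 mm^2 die, 0.02 defects/mm^2:
print(cost_per_die(5000.0, 70686.0, 100.0, 0.02))  # ~28.3 dollars per die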

1.6 Performance:

1). Definition:

Responce Time (Execution Time): the total time required for the computer to complete a task.
Throughput (bandwidth): the number of tasks completed per unit time.
Define: "Computer X is n times faster than computer Y" :
= Performancex / PerformanceY = n
= Execution TimeY / Execution Time= n
Time measurement:
Elapsed Time (System Performance): Total time to complete a task
CPU execution time: the actual time the cpu spends computing for a specific task
user execution time (CPU performance): cpu time spent in the program
system execution time: cpu time spent in the operating system performing tasks on behalf of the program

2). Clock Cycle:

the time for one clock period, usually of the processor clock, which runs at a constant rate.
a measure that relates to how fast the hardware can perform basic functions.
e.g., a clock period of 250 picoseconds is the length of each clock cycle.
clock rate = 1 / clock period, so a 250 ps period gives 1 / (250 * 10^-12 s) = 4 GHz.

3). Performance measurement equations:

CPU execution time for a program:

= CPU clock cycles for the program * Clock cycle time

= CPU clock cycles for the program / Clock rate

CPU clock cycles = Instructions for a program * Average clock cycles per instruction (CPI)

CPU execution time:

= Instruction count * CPI * Clock cycle time

= Instruction count * CPI / Clock rate

Time = Seconds/Program = Instructions/Program * Clock cycles/Instruction * Seconds/Clock cycle
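As a worked example of these equations, here is a minimal Python sketch; the instruction count, CPIs, and clock rates are invented for illustration.

def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    # CPU execution time = Instruction count * CPI / Clock rate
    return instruction_count * cpi / clock_rate_hz

# 10 billion instructions at an average CPI of 1.5 on a 4 GHz clock:
print(cpu_time(10e9, 1.5, 4e9))  # 3.75 seconds

# "X is n times faster than Y" is a ratio of execution times:
t_x = cpu_time(10e9, 1.0, 4e9)   # 2.5 s
t_y = cpu_time(10e9, 2.0, 4e9)   # 5.0 s
print(t_y / t_x)                 # 2.0, so X is 2 times faster than Y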

4). Performance factors:

Instruction count:
the number of instructions executed for the program.
can be measured by:
software tools that profile the execution,
a simulator of the architecture,
or hardware counters.
Clock cycles per instruction (CPI):
the average number of clock cycles per instruction.
can be measured with hardware counters. (Q: any other method to measure CPI?)
Q: What determines the number of clock cycles per instruction? What is the difference between implementation and architecture? (see Chapters 4, 5)
Clock cycle time:
published as part of the documentation for a computer.
Q: What determines the clock cycle time?

5). Performance of a program:

depends on:
Algorithm:

affects: instruction count, possibly CPI.
Programming Language:
affects: instruction count, CPI.
Compiler:
affects: instruction count, CPI.
Instruction Set Architecture:
affects: instruction count, CPI, clock rate.

6). Some processors fetch and execute multiple instructions per clock cycle, so CPI can be < 1.0.

7). Variable clock rates:

To save energy or temporarily boost performance, today's processors can vary their clock rates.
E.g., the Intel Core i7 will temporarily increase its clock rate by about 10% until the chip gets too warm. Intel calls this Turbo mode.
So we would need to use the average clock rate for a program.

1.7 The Power Wall

1). Clock rate and power consumption are correlated.

2). The practical power a commodity microprocessor can use is limited by cooling.

3). Energy is another important metric for computers such as PMDs and warehouse-scale computers.

4). The dominant IC technology is called CMOS (complementary metal oxide semiconductor).

5). Dynamic Energy:

the energy consumed when transistors switch states from 0 to 1 (or 1 to 0).
the primary source of energy consumption in CMOS.
Energy per 0 -> 1 -> 0 (or 1 -> 0 -> 1) pulse = Capacitive load * Voltage^2; a single transition consumes half of that.
Power required per transistor = 1/2 * C * V^2 * f
f:
the frequency of switching.
a function of the clock rate.
C:
the capacitive load of each transistor.
a function of both the number of transistors connected to an output (called fan-out) and the technology (which determines the capacitance of wires and transistors).
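A minimal Python sketch of the dynamic-power relation above; the capacitance, voltage, and frequency values are illustrative, not measurements of any real chip.

def dynamic_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    # P = 1/2 * C * V^2 * f
    return 0.5 * c_farads * v_volts ** 2 * f_hz

# Lowering the supply voltage by 15% at the same C and f:
p_old = dynamic_power(1e-9, 1.00, 4e9)
p_new = dynamic_power(1e-9, 0.85, 4e9)
print(p_new / p_old)  # ~0.72: a 15% voltage drop cuts power by ~28%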

1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

1). Definition: multiple processors per chip.

2). Parallelism: improves performance by increasing throughput.

3). Challenges:

Scheduling
Load balancing
Time for synchronization
Overhead of communication between the parties.

1.9 Benchmarking the Intel Core i7

Q: What is a benchmark? Who publishes benchmarks, how are they calculated, and how are they used?

1.10 Fallacies and Pitfalls

1). Pitfall: expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.

Amdahl's Law:
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
a rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.
used to evaluate potential enhancements.
used to argue for practical limits to the number of parallel processors. (see Chapter 6 for details)
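A minimal Python sketch of Amdahl's Law with invented numbers, showing why the unaffected fraction caps the achievable speedup.

def time_after_improvement(t_affected: float, t_unaffected: float,
                           speedup: float) -> float:
    # Amdahl's Law: T_new = T_affected / speedup + T_unaffected
    return t_affected / speedup + t_unaffected

# 80 s of a 100 s program benefits from the improvement; 20 s does not.
for speedup in (2, 10, 100):
    print(speedup, time_after_improvement(80.0, 20.0, speedup))
# 60.0 s, 28.0 s, 20.8 s: no matter how large the speedup,
# total time never drops below the unaffected 20 s.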

======================================

Chinese version (translated):

1.1 Introduction:

[Knowledge Points Summary]:

1). Classes of computers:

Personal Computer: for individual use; about 35 years of history.
Server: greater computing capacity, storage capacity, and input/output load.
Supercomputer: formidable computing power; typically used for scientific data processing, weather forecasting, and the like.
Cloud Computing: warehouse-scale computers (WSCs); Software as a Service (SaaS).
Embedded Computer: the widest range of uses; low failure tolerance (meaning dependability is the priority).
Personal Mobile Device (PMD): phones, iPads.

2). Factors that influence program performance:

Algorithm:

Affects: the solution to the problem, the number of source-level statements (a fuzzy measure: fewer lines of code does not mean higher efficiency), and the number of I/O operations.
Where covered: not in this book; see my other post, "Six Fundamental Subjects of Computer Science: Recommended Textbook List."
The software systems used to write and compile the program:
Includes: the programming language used, the compiler, and the instruction set architecture.
Affects: the number of instructions the computer must execute for the corresponding code.
Where covered: Chapters 2 and 3.
The efficiency with which the computer executes instructions:
Includes: the processor, the memory system, and the I/O system (hardware and operating system).
Affects: the speed at which the processor executes instructions, memory is accessed, and I/O operations complete.
Where covered: Chapters 4, 5, and 6.

1.2 The Eight Great Design Ideas of Computer Architecture:

Moore's Law:
At constant cost, the number of components that fit on an integrated circuit doubles every 18-24 months.
Stated by Gordon Moore in 1965.
Use abstraction to simplify design:
Hide low-level design details from the layers above. Abstracting complex details into simple representations makes an architecture easier to understand and to develop further.
Make the common case fast:
Optimize the common case to improve overall performance.
Performance via parallelism:
A key focus of this book.
Performance via pipelining:
A particular pattern of parallelism.
Performance via prediction:
Guess and start the next task early to improve performance. (I know guessing sounds unreliable, but it really is used a lot at the low level, as we will see later.)
Prediction rests on the premises that the cost of recovering from a misprediction is acceptable and that predictions are relatively accurate.
Hierarchy of memories:
Different memory technologies have different costs; to balance the cost and performance of memory access, a layered memory architecture is used.
Small, fast, high-performance storage such as caches sits at the top of the hierarchy.
Through the layered architecture, programmers can use main memory as if it were as fast as cache, improving program performance.
Dependability via redundancy:
Building failure-handling and failure-monitoring components into a system greatly improves its dependability.

1.3 The Stuff Hiding Behind Your Program:

1). Hardware/software layers:

Hierarchy = (application software (systems software (hardware)))
Systems software:
Operating System:

Handles basic input/output operations.
Allocates and schedules storage and memory use.
Provides a safe sharing scheme so multiple application processes can use the same computer's resources simultaneously.
Compiler:
Translates a high-level programming language into computer instructions.
Also: loaders, assemblers, linkers.
The steps from source code to something the hardware can execute:
Hardware can only be controlled and communicated with through electrical signals.
The simplest electrical signal: on/off = the binary signals 0/1.
Assembler: translates assembly language (computer instructions written with simple symbols) into binary machine language (a language of 0s and 1s that the machine can understand and turn into electrical signals).
Each primitive computer instruction (such as add or subtract) requires one assembly-language statement.
Program written in a high-level language (e.g., C, Java) ==(compiler)==> assembly-language program ==(assembler)==> binary machine-language program

1.4 The Stuff Inside the Computer:

1). Computer hardware organization:

Functions: data input, data output, data processing, data storage.
Components:
Input
Output
Datapath: performs all the numeric operations.
Control: according to the program's instructions, sends signals that tell the datapath, input/output, and memory what to do.
Memory
Control + datapath = processor (covered in detail in Chapter 4).

2). Input/output:

3). Hardware inside the box:

Chip: an integrated circuit, composed of up to millions of transistors.
Processor: datapath + control.
Memory:
Where programs and their data are kept while running.
Dynamic random access memory (DRAM):
Built from transistors.
Volatile: data is lost when power is cut.
Random access (RAM) means that accessing any location in memory takes the same amount of time.
Access time: 50-70 ns; cost: 5-10 $/GiB (2012 figures).
One transistor and one capacitor per bit of data stored.
Capacitor-based storage, so it must be refreshed periodically or the stored data is lost.
Cache memory:
Static random access memory (SRAM):

Faster, smaller, more expensive.
Six transistors per bit of data stored.
No refresh needed.
Volatile: data is lost when power is cut.
The CPU computes (processes data) so fast that data lookups in DRAM-based main memory simply cannot keep up. To solve this problem and avoid wasting scarce CPU resources, SRAM is used as a level of storage sitting between main memory and the CPU.
All modern CPUs ship with multiple levels of cache; there is much more detail here, perhaps worth a dedicated post on caches someday.
Instruction set:
The interface between hardware and software; the set of instructions that drive the hardware.
Application binary interface (ABI)

4). Storing data:

Volatile storage.
Nonvolatile storage:
Where programs and data are kept while waiting to run.
Examples:
Magnetic disk (hard disk)
Flash memory

5). Communication between computers:

Networks:
Functions: communication between computers, resource sharing, remote access.
Types:
Ethernet: the technology of local area networks (LANs); a network formed by connecting multiple computers.
Internet: the technology of wide area networks (WANs); a network formed by connecting multiple local networks.

1.5 How Processors and Memory Are Manufactured:

1). Ingredients:

silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuit (IC) ==> very large-scale integrated (VLSI) circuit ==> chips

2). Manufacturing process:

silicon crystal ingot ==(slicing)==> wafers ==(dicing)==> dies (chips) ==(packaging)==> processor/memory package
Defects
Wafer yield

======================================

[Difficult Points Summary]:

1. The total instruction count of a compiled program's assembly code is fixed, so if the same assembly program runs on different processors, is the CPI the same?

No.
Precondition: for the same assembly program to run on two different processors, the two processors must implement the same instruction set; only their circuit-level implementations of that instruction set (what is usually called the microarchitecture) differ.
Since the circuit implementations differ, the circuit paths exercised when the same instruction executes differ too; different paths take different amounts of time, so the number of clock cycles (i.e., the time) needed to execute the same instruction also differs. Therefore processors of different architectures (same ISA, different microarchitectures) generally have different CPIs on the same assembly program.

2. CPUs with different instruction sets necessarily have different implementations (circuit designs). Even when two CPUs follow the same instruction set, their implementations (different circuit designs realizing the same instruction, e.g., 1001) can still differ, i.e., their architectures can differ, and that difference affects performance. Naturally, different processor architectures built for different instruction sets also differ in performance. In one sentence: the instruction set shapes the processor architecture, and if instruction sets are incompatible, machine instructions are incompatible.

3. Processors with different instruction sets cannot run the same assembly program, because the program's assembly statements do not match the instruction set.

4. Do CPUs with different architectures use different compilers?

Yes. A compiler targets an instruction set and accepts a programming language; at compile time it must know which CPU the program will run on.

5. Do CPUs with different architectures use different assembly languages?

Of course. Assembly statements correspond one-to-one with the instructions of the instruction set. If the instruction sets implemented by two architectures differ, the assembly languages necessarily differ.

5.2. What is the relationship among processor architecture, instruction set, and assembly language?

See this Zhihu answer: https://www.zhihu.com/question/23474438

6. How is cross-platform compilation handled? How is cross-platform execution handled?

7. How does an instruction set work? That is, how is an instruction set implemented at the hardware level? A program is compiled to assembly, and assembly is translated to machine instructions; what does the CPU rely on to read those instructions and turn them into the corresponding on/off signals executed in transistors?

See this Zhihu answer: https://www.zhihu.com/question/62173438/answer/195436918
