What every programmer should know about memory (Part 2-1), translated by 白红宇


Published: 2019-05-25


What Every Programmer Should Know About Memory
Ulrich Drepper
Red Hat, Inc.
drepper@redhat.com
November 21, 2007

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the other. The older types are today really only interesting to the historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question of why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM {In other contexts SRAM might mean “synchronous RAM”.}) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a “logic level” and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

Figure 2.4 shows the structure of a 6 transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and $\overline{BL}$. If the cell state must be overwritten the BL and $\overline{BL}$ lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.
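The read and write behaviour described above can be sketched at the logic level. This is only an illustrative model, not a circuit simulation: the class name `SramCell` and its methods are hypothetical, and the "stronger outside drivers" are modelled simply by letting the bit lines overwrite the stored state while WL is raised.

```python
class SramCell:
    """Logic-level sketch of two cross-coupled inverters with a word line."""

    def __init__(self, state: int = 0) -> None:
        self.q = state          # output of the first inverter
        self.q_bar = 1 - state  # output of the second inverter

    def settle(self) -> None:
        # Each inverter drives the input of the other, so a stable state
        # always satisfies q == NOT q_bar.
        self.q = 1 - self.q_bar
        self.q_bar = 1 - self.q

    def read(self, wl: int):
        # With WL raised the state appears on BL and its complement line.
        if not wl:
            return None
        return self.q, self.q_bar

    def write(self, wl: int, bl: int) -> None:
        # The bit lines force the new state only while WL is raised.
        if wl:
            self.q, self.q_bar = bl, 1 - bl
            self.settle()


cell = SramCell(0)
cell.write(wl=1, bl=1)
after_write = cell.read(wl=1)

cell.write(wl=0, bl=0)  # WL is low, so the cell ignores the bit lines
after_ignored_write = cell.read(wl=1)

print(after_write, after_ignored_write)
```

The second write has no effect because WL is low, which mirrors the point that the cell keeps its state without any refresh as long as power is available.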

See [sramwiki] for a more detailed description of the way the cell works. For the following discussion it is important to note that

one cell requires six transistors. There are variants with four transistors but they have disadvantages.

maintaining the state of the cell requires constant power.

the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.

the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely; the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with $10^9$ or more cells are now common) the capacitance of each capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few tens of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called “leakage”.
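To get a feeling for these magnitudes, the following sketch computes the number of electrons on a cell capacitor and the RC time constant of the leakage path. The concrete values (5 fF, 1 V, 2 TΩ) are assumptions picked to match the ranges named in the text. They are not taken from a datasheet.

```python
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron (exact SI value)


def electrons_stored(capacitance_farads: float, voltage_volts: float) -> float:
    """Number of electrons for the charge Q = C * V."""
    return capacitance_farads * voltage_volts / ELEMENTARY_CHARGE


def time_constant(resistance_ohms: float, capacitance_farads: float) -> float:
    """RC time constant in seconds."""
    return resistance_ohms * capacitance_farads


# Assumed illustrative values within the ranges the text mentions.
ASSUMED_CAPACITANCE = 5e-15   # 5 femto-farad
ASSUMED_VOLTAGE = 1.0         # 1 volt across the capacitor
ASSUMED_RESISTANCE = 2e12     # 2 tera-ohm leakage resistance

n_electrons = electrons_stored(ASSUMED_CAPACITANCE, ASSUMED_VOLTAGE)
tau = time_constant(ASSUMED_RESISTANCE, ASSUMED_CAPACITANCE)

print(f"electrons on the capacitor: {n_electrons:.0f}")
print(f"leakage time constant: {tau * 1e3:.1f} ms")
```

Under these assumptions the capacitor holds a few tens of thousands of electrons and the time constant comes out in the millisecond range, which is consistent with the statement that the charge dissipates quickly despite the very high resistance.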

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible. For some workloads this overhead might stall up to 50% of the memory accesses (see [highperfdram]).
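A rough way to reason about the average cost of refresh is to spread the refresh work evenly over the 64ms window. The sketch below does that arithmetic. The number of refresh commands per window (8192) and the time the chip is busy per command (350 ns) are assumptions for illustration only. Real values depend on the DRAM type and density.

```python
REFRESH_WINDOW_S = 64e-3          # from the text: refresh every 64 ms
ASSUMED_REFRESH_COMMANDS = 8192   # assumption: commands needed per window
ASSUMED_REFRESH_BUSY_S = 350e-9   # assumption: busy time per refresh command


def refresh_interval(window_s: float, commands: int) -> float:
    """Average spacing between refresh commands if spread evenly."""
    return window_s / commands


def refresh_overhead(busy_s: float, interval_s: float) -> float:
    """Fraction of time the chip is unavailable because of refresh."""
    return busy_s / interval_s


interval = refresh_interval(REFRESH_WINDOW_S, ASSUMED_REFRESH_COMMANDS)
overhead = refresh_overhead(ASSUMED_REFRESH_BUSY_S, interval)

print(f"one refresh command every {interval * 1e6:.4f} us")
print(f"average time lost to refresh: {overhead * 100:.1f}%")
```

This is an average over time. The much larger stall figure cited in the text concerns specific workloads whose accesses collide with refresh, which this simple calculation does not model.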

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

$$Q_{\text{Charge}}(t) = Q_0 \left(1 - e^{-t/RC}\right)$$

$$Q_{\text{Discharge}}(t) = Q_0 \, e^{-t/RC}$$

This means it takes some time (determined by the capacitance C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC (resistance multiplied by capacitance) which is a unit of time.
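The shape of the charge and discharge curves can be tabulated with a few lines of code. This sketch assumes the standard RC law, with charge rising as 1 - e^(-t/RC) and falling as e^(-t/RC), and expresses time in multiples of RC as on the X-axis of Figure 2.6. The function names are illustrative.

```python
import math


def charge_fraction(t_over_rc: float) -> float:
    """Fraction of full charge reached after t/RC time constants."""
    return 1.0 - math.exp(-t_over_rc)


def discharge_fraction(t_over_rc: float) -> float:
    """Fraction of the initial charge remaining after t/RC time constants."""
    return math.exp(-t_over_rc)


for k in range(6):
    print(f"t = {k} RC: charged {charge_fraction(k):.3f}, "
          f"remaining {discharge_fraction(k):.3f}")
```

After one RC the capacitor is only about 63% charged, and it takes several time constants to get close to the final value, which is why a conservative waiting time is needed before the cell output can be trusted.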

Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware — network routers, for example — we have to live with main memory which is based on DRAM. This has huge implications on the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require $2^{32}$ address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have $2^N$ output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.
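A demultiplexer of this kind can be sketched as a function that turns an N-bit binary address into a one-hot list of 2^N select lines. The function name `demultiplex` is illustrative and the model ignores all timing.

```python
def demultiplex(address: int, n_lines: int) -> list[int]:
    """Return 2**n_lines select lines with exactly one line raised."""
    if not 0 <= address < 2 ** n_lines:
        raise ValueError("address does not fit in the given address lines")
    outputs = [0] * (2 ** n_lines)
    outputs[address] = 1
    return outputs


select = demultiplex(0b101, 3)
print(select)
```

Each additional address line doubles the number of outputs, which is the exponential growth that makes the direct approach unsuitable for large chips.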

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit {I hate those SI prefixes. For me a giga-bit will always be $2^{30}$ and not $10^9$ bits.} capacity would need 30 address lines and $2^{30}$ select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting “only” 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately. {Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.}

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size. {Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.} This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection ($\overline{RAS}$) demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thus made available to the column address selection ($\overline{CAS}$) {The line over the name indicates that the signal is negated.} multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.
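The row and column selection of Figure 2.7 can be sketched as a small two-dimensional array. In this illustrative model, which is an assumption about how to map the figure onto code, the low address bits (a0, a1) choose the row and the next bits (a2, a3) choose the column. The class name `DramArray` is hypothetical, and refresh, timing, and sense amplifiers are left out.

```python
class DramArray:
    """Logic-level sketch of a DRAM cell array with row and column selection."""

    def __init__(self, row_bits: int, col_bits: int) -> None:
        self.row_bits = row_bits
        self.col_bits = col_bits
        self.cells = [[0] * (2 ** col_bits) for _ in range(2 ** row_bits)]

    def _split(self, address: int):
        # Low bits select the row, the following bits select the column.
        row = address & ((1 << self.row_bits) - 1)
        col = (address >> self.row_bits) & ((1 << self.col_bits) - 1)
        return row, col

    def read(self, address: int) -> int:
        row, col = self._split(address)
        row_buffer = self.cells[row]  # row selection exposes the whole row
        return row_buffer[col]        # column selection picks one bit

    def write(self, address: int, bit: int) -> None:
        row, col = self._split(address)
        self.cells[row][col] = bit


chip = DramArray(row_bits=2, col_bits=2)
chip.write(0b1101, 1)  # a0=1, a1=0 selects row 1; a2=1, a3=1 selects column 3

flat_select_lines = 2 ** 4            # one demultiplexer over all 4 bits
array_select_lines = 2 ** 2 + 2 ** 2  # one row and one column selector

print(chip.read(0b1101), chip.read(0b0101))
print(flat_select_lines, array_select_lines)
```

Even in this tiny example the array needs half as many select lines as the flat layout. For 30 address bits the same arithmetic gives 2^30 lines against 2 * 2^15.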

For writing, the new cell value is put on the data bus and, when the cell is selected using $\overline{RAS}$ and $\overline{CAS}$, it is stored in the cell. A pretty straightforward design. In reality there are, obviously, many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after $\overline{RAS}$ and $\overline{CAS}$ are done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are a precious resource. It is “bad” enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits a0 and a1 (in the example in Figure 2.7), selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the $\overline{RAS}$ and $\overline{CAS}$ signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in Section 2.2.
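The two-part transfer and its effect on the pin count can be sketched as follows. The function names are illustrative, the strobe labels simply tag which half of the address is on the shared lines, and the pin count deliberately leaves out the few extra control lines the text mentions.

```python
def multiplexed_transfer(address: int, half_bits: int):
    """Split an address into the two parts sent over shared address lines."""
    mask = (1 << half_bits) - 1
    row_part = address & mask                 # sent first, selects the row
    col_part = (address >> half_bits) & mask  # sent second, selects the column
    return [("RAS", row_part), ("CAS", col_part)]


def address_pins(modules: int, address_bits: int, multiplexed: bool) -> int:
    """Address pins the controller needs, ignoring strobe/control lines."""
    per_module = (address_bits + 1) // 2 if multiplexed else address_bits
    return modules * per_module


steps = multiplexed_transfer(0b1101, half_bits=2)
direct_pins = address_pins(modules=8, address_bits=30, multiplexed=False)
shared_pins = address_pins(modules=8, address_bits=30, multiplexed=True)

print(steps)
print(direct_pins, shared_pins)
```

For the 8-module example from the text this reduces the address pins from 240 to 120, at the price of sending the address in two steps.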

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

there are reasons why not all memory is SRAM

memory cells need to be individually selected to be used

the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip

it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.

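2.1 RAM Types (translation)

Many types of RAM have appeared over the years, and they differ from one another, sometimes significantly. The older types are of little interest today, and we will not explore their details. Instead we will concentrate on modern RAM types. We will only scratch the surface and explore the details that are visible to kernel and application developers through their performance characteristics.

The first interesting detail concerns why different types of RAM exist in the same machine. More specifically, why are there both SRAM (which in other contexts can mean "synchronous RAM") and DRAM? The former is faster while providing the same functionality. Why is not all RAM in a machine SRAM? As one might expect, the answer is cost. SRAM is expensive both to produce and to use. Both cost factors matter, and the second is becoming more and more important. To understand the difference, we look at how one bit of storage is implemented in SRAM and in DRAM.

In the remainder of this section we discuss the low-level implementation of RAM. We keep the level of detail as low as possible. To that end we discuss the signals at a logic level rather than at the level a hardware engineer would need; that level of detail is unnecessary for our purpose.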

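2.1.1 Static RAM

Figure 2.4 shows the structure of a 6-transistor SRAM cell. The core of the cell consists of the transistors M1 to M4, which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as Vdd is powered.

To read the state of the cell, the word line WL is raised. This immediately makes the state of the cell readable on the bit lines BL and BLB. To overwrite the state of the cell, the desired values are first set on BL and BLB and then WL is raised. Because the outside drivers are stronger than the four transistors, the old state is overwritten.

See [sramwiki] for more on how the cell works. The following points are important for the discussion below.

One cell requires 6 transistors. There are variants with 4 transistors but they have disadvantages.

Maintaining the state of the cell requires constant power.

Once the word line WL is raised, the state of the cell can be read almost immediately. The signal is as rectangular as other transistor-controlled signals (it changes quickly between the two states).

The cell state is stable. No periodic refresh is needed.

There are other, slower and less power-hungry forms of SRAM, but since we are interested in fast RAM we do not need to look at them. These slower SRAM variants are easier to use than DRAM because of their simpler interface.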

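2.1.2 Dynamic RAM

DRAM is structurally simpler than SRAM. Figure 2.5 shows the structure of a usual DRAM cell. It consists of one transistor and one capacitor. This difference in complexity means that it functions very differently from SRAM.

A dynamic RAM cell keeps its state in the capacitor. The transistor guards access to the state. To read the state of the cell, the access line AL is raised. This may cause a current to flow on the data line DL, depending on whether the capacitor is charged. To write, the data line DL is set appropriately and the access line AL is raised long enough for the capacitor to finish charging or draining.

The design of DRAM has a number of complications. The use of a capacitor means that reading the cell discharges it, so the procedure cannot be repeated indefinitely; the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with $10^9$ cells are common) the capacitance of each capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few tens of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms), it only takes a short time for the charge to dissipate. This phenomenon is called leakage.

Leakage is the reason DRAM cells must be refreshed constantly. For most DRAM chips the refresh period is 64ms. The chip cannot be accessed during the refresh. For some workloads this overhead can stall up to 50% of the memory accesses.

A second problem is that, because of the tiny charge, the information read from the cell is not directly usable. The data line must be connected to a sense amplifier that can tell whether the output of the cell is 0 or 1.

A third problem is that charging and draining a capacitor is not instantaneous. The signal received by the sense amplifier is therefore not rectangular, so a conservative estimate must be used for when the output of the cell is usable. The formulas for charging and discharging a capacitor are:

$$Q_{\text{Charge}}(t) = Q_0 \left(1 - e^{-t/RC}\right)$$

$$Q_{\text{Discharge}}(t) = Q_0 \, e^{-t/RC}$$

This means it takes some time to charge and discharge the capacitor (determined by the capacitance C and the resistance R). It also means that the current detected by the sense amplifier is not immediately usable. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC, which is a unit of time.

The simple approach has its advantages. The main advantage is size. A DRAM cell is many times smaller than an SRAM cell. SRAM cells also need individual power for the transistors that maintain the state. The structure of the DRAM cell is simpler and more regular, which means packing many of them closely together is simpler.

Overall, DRAM wins on cost. Except in specialized hardware (network routers, for example), the main memory we live with is based on DRAM. This has huge implications for programmers, which we discuss later. But first we need to look at some details of how DRAM is used.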

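2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates the virtual address into a physical address and the memory controller selects the RAM chip corresponding to that physical address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of address lines.

Addressing memory locations individually from the memory controller would be completely impractical: 4GB of memory would require $2^{32}$ address lines. Instead the address is encoded as a binary number and passed over a smaller set of address lines. A demultiplexer with N address lines has $2^N$ output lines. These output lines can be used to select the memory cell. This direct approach is no problem for chips with small capacities.

But as the number of cells grows, this approach is no longer suitable. A 1Gbit chip would need 30 address lines and $2^{30}$ select lines (I hate the SI prefixes. For me a giga-bit is $2^{30}$ and not $10^9$ bits). When speed is not to be sacrificed, the size of a demultiplexer grows exponentially with the number of input lines. A demultiplexer for 30 address lines takes up a lot of chip area, in addition to the complexity of the demultiplexer itself. More importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting 15 impulses.

Figure 2.7 shows a DRAM chip at a very high level. The cells are organized in rows and columns. They could all be arranged in one row, but then the DRAM chip would need a huge demultiplexer. With the array approach the design gets by with one demultiplexer and one multiplexer of half the size. (Demultiplexers and multiplexers are equivalent, and the multiplexer has to work as a demultiplexer when writing, so we drop the distinction from now on.) This is a huge saving on all fronts. In the example, the address lines a0 and a1, through the row address selection (RAS) demultiplexer, select a whole row of cells. When reading, RAS makes all cells of this row available. The address lines a2 and a3 then make the content of one column available. This happens many times in parallel on a number of DRAM chips to produce as many bits as the data bus is wide.

(Translator's added note for reference: addressing within the memory array proceeds by row and then by column. When a request reaches the memory, the row is first activated via RAS using one half of the address; tRAS (Active to Precharge Delay) is the minimum time the row must stay active before it can be precharged. Once the row is selected, the tRCD delay must elapse, and only then does CAS locate the exact address. The time from issuing CAS until the data is available is the CAS latency discussed here. Because CAS is the last step of addressing, it is treated as the most important of the memory timing parameters.)

For writing, the new value is put on the data bus and, when the cell is selected using RAS and CAS, it is stored in the cell. This is a straightforward design. In reality it is considerably more complicated. It must be specified how much delay there is between the signal and the data being available for reading. As described earlier, capacitors do not discharge instantaneously. The signal coming from the cells is very weak. For writing it must be specified how long the data must stay on the bus after RAS and CAS for the new value to be stored successfully in the cell. These timing parameters are crucial for the performance of a DRAM chip.

A second problem concerns scalability. Connecting 30 address lines to every RAM chip is not feasible. Pins of a chip are a precious resource, and it is bad enough that the data must be transferred in parallel (for example in 64-bit batches). The memory controller must be able to address each RAM module. If multiple RAM modules have to be accessed in parallel for performance reasons and each RAM module needs its own set of 30 address lines, then with 8 RAM modules the memory controller needs 240+ pins just for addressing. (Translator's understanding: each module consists of 8 basic units with a storage width of 1 bit, so each module can output 8 bits in parallel and needs 30 pins; SIMM packaging could be used. If the memory controller is to access every module in parallel, it needs 30*8 pins.)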

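2.1.4 Conclusions

Do not worry if the details in this section are hard to follow. The key points are:

1> there are reasons why not all memory is SRAM

2> memory cells need to be individually selected to be used

3> the number of address lines directly determines the cost of the memory controller, motherboard, DRAM module, and DRAM chip

4> it takes a while before the results of a read or write operation are available. (Flow: address transfer, /RAS or /CAS pin activated, row and column address decoders, capacitor discharge, sense amplifier)

The following section goes into more detail about the actual process of accessing DRAM. We will not go into more detail on SRAM, which is usually directly addressed. SRAM is used in CPU caches and on-die, where it is fully under the control of the CPU designer. CPU caches are discussed later, but all we need to know for now is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed ranges from only slightly slower than the CPU core to one or two orders of magnitude slower.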

Reposted from: http://ybgji.baihongyu.com/
