What Every Programmer Should Know About Memory
Ulrich Drepper
Red Hat, Inc.
drepper@redhat.com
November 21, 2007


2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [highperfdram] and [arstechtwo]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) — the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today’s buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an “effective” 800MHz bus.

For SDRAM today each data transfer consists of 64 bits — 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now, the protocol for talking to the RAM modules has a lot of downtime during which no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.
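The burst-rate arithmetic can be checked with a few lines of C; this is a minimal sketch of the multiplication from the text, using the quad-pumped 200MHz bus as input:

    #include <stdio.h>

    int main(void)
    {
        double base_mhz  = 200.0;   /* base FSB clock                     */
        int    pump      = 4;       /* quad-pumped: 4 transfers per cycle */
        int    bus_bytes = 8;       /* 64-bit data bus                    */

        double bytes_per_s = base_mhz * 1e6 * pump * bus_bytes;
        printf("burst rate: %.1f GB/s\n", bytes_per_s / 1e9);  /* 6.4 GB/s */
        return 0;
    }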

2.2.1 Read Access Protocol

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row “open” is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [highperfdram]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).
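To keep the protocol steps straight, here is a minimal sketch of the cycle count for a read burst on a simple SDRAM, using the example values from Figure 2.8; the function and its parameters are illustrative, not a real API:

    /* Cycles from the RAS signal to the last word of a burst on SDR:
       tRCD until CAS can be sent, CL until the first word appears,
       then one word per cycle (DDR would move two words per cycle). */
    unsigned sdr_read_cycles(unsigned t_rcd, unsigned cl, unsigned burst_words)
    {
        return t_rcd + cl + burst_words;
    }
    /* sdr_read_cycles(2, 2, 4) == 8 cycles for a 4-word burst */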

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and is more energy efficient (see [ddrtwo] for more information).

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS lines simultaneously. This combination has no useful meaning by itself (see [micronddr] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus become 1.8GB/s. That is bad and must be avoided. The techniques described in Section 6 help to raise this number. But the programmer usually has to do her share.
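The drop from the theoretical rate follows directly from the duty cycle of the bus; a small sketch of the computation from the text:

    /* Burst rate scaled by bus utilization: 2 busy cycles out of
       every 7 turn 6.4GB/s into roughly 1.8GB/s. */
    double effective_rate(double burst_gbs, unsigned busy, unsigned total)
    {
        return burst_gbs * (double)busy / (double)total;
    }
    /* effective_rate(6.4, 2, 7) == ~1.83 GB/s */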

There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, on the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (since it is larger than the data transfer time) is only 7 cycles.

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

    w   2    CAS Latency (CL)
    x   3    RAS-to-CAS delay (tRCD)
    y   2    RAS Precharge (tRP)
    z   8    Active to Precharge delay (tRAS)
    T   T1   Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.
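As an illustration, the five constants of a module advertised as 2-3-2-8-T1 could be kept in a structure like the following; the type and function names are made up for this sketch:

    struct ddr_timing {
        unsigned cl;    /* w: CAS latency               */
        unsigned trcd;  /* x: RAS-to-CAS delay          */
        unsigned trp;   /* y: RAS precharge time        */
        unsigned tras;  /* z: active-to-precharge delay */
        unsigned cmd;   /* T: command rate              */
    };

    /* Worst-case cycles until data for a read which hits a closed row:
       precharge the old row, activate the new one, then wait for CL. */
    unsigned closed_row_latency(const struct ddr_timing *t)
    {
        return t->trp + t->trcd + t->cl;  /* 2+3+2 = 7 for 2-3-2-8-T1 */
    }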

It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer’s speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce the one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample documentation for doing this. Do it at your own risk, though, and do not say you have not been warned.

2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in Section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system. At times when a row {Rows are the granularity this happens with despite what [highperfdram] and other literature says (see [micronddr]).} is recharged no access is possible. The study in [highperfdram] found that “[s]urprisingly, DRAM refresh organization can affect performance dramatically”.

Each DRAM cell must be refreshed every 64ms according to the JEDEC specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller’s responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.
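The refresh arithmetic from the paragraph above, spelled out:

    #include <stdio.h>

    int main(void)
    {
        double   refresh_ms = 64.0;  /* JEDEC refresh requirement per cell */
        unsigned rows       = 8192;  /* rows in the example DRAM array     */
        /* one refresh command on average every 64ms / 8192 = 7.8125us */
        printf("avg refresh interval: %.4f us\n", refresh_ms * 1000.0 / rows);
        return 0;
    }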

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which currently is being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The frequency of the memory cell array and the data transfer rate were identical.

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus is thus 100Mb/s per data line. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive. {Power = Dynamic Capacity × Voltage² × Frequency.} In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.
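The footnote's formula makes the cost easy to quantify; a sketch with illustrative numbers (the 10% voltage bump is an assumption for the example, not a datasheet value):

    /* P = C_dyn * V^2 * f. Doubling f alone doubles P; if stability
       also demands 10% more voltage the total is 2 * 1.1^2 = 2.42x. */
    double dynamic_power(double c_dyn, double volt, double freq)
    {
        return c_dyn * volt * volt * freq;
    }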

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a “double-pumped” bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

    100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two. { I will take the factor of two but I do not have to like the inflated numbers.}
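The naming rule can be written down directly; the function name is made up for this sketch:

    /* DDR1 module name: bus MHz * 8 bytes * 2 transfers per cycle
       gives the MB/s figure after the "PC" prefix. */
    unsigned ddr1_pc_rating(unsigned bus_mhz)
    {
        return bus_mhz * 8 * 2;   /* 100 -> 1600, i.e., PC1600 */
    }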

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and will not require measurably more energy; it is just one tiny component and not the whole module. The names the marketers came up with for DDR2 are similar to the DDR1 names; only, in the computation of the value, the factor of two is replaced by four (we now have a quad-pumped bus). Table 2.1 shows the names of the modules in use today.

There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both flanks of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB “frequency” of 533MHz.

The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or with double the capacity the same heat emission can be achieved.
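The 30% figure follows from squaring the voltage ratio; a quick check:

    #include <stdio.h>

    int main(void)
    {
        double v_ddr2 = 1.8, v_ddr3 = 1.5;
        double ratio  = (v_ddr3 / v_ddr2) * (v_ddr3 / v_ddr2);
        /* (1.5/1.8)^2 = 0.694, i.e., roughly a 30% reduction */
        printf("DDR3 power vs DDR2 at equal frequency: %.0f%%\n", ratio * 100.0);
        return 0;
    }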

The cell array of DDR3 modules will run at a quarter of the speed of the external bus which requires an 8 bit I/O buffer, up from 4 bits for DDR2. See Figure 2.13 for the schematics.

Initially DDR3 modules will likely have slightly higher CAS latencies just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Table 2.2 shows the names of the expected DDR3 modules. JEDEC agreed so far on the first four types. Given that Intel’s 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s module is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted for each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2) but this is expensive.

What this means is that commodity motherboards are restricted to hold at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM and memory demand even for home use is growing, so something has to be done.

One answer is to add memory controllers into each processor as explained in Section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel’s answer to this problem for big server machines, at least for the next years, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same components as today’s DDR2 modules which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP). The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth. The main effects of using a serial bus are

  1. more modules per channel can be used.
  2. more channels per Northbridge/memory controller can be used.
  3. the serial bus is designed to be fully-duplex (two lines).

An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

Compared with the connectivity requirements of a dual-channel Northbridge it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2×240 pins versus 6×69 pins. The routing for each channel is much simpler which could also help reduce the cost of the motherboards.
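The pin and capacity arithmetic behind the comparison; the 4GB-per-DIMM figure used to reach the capacity numbers in the summary table below is an assumption of this sketch:

    #include <stdio.h>

    int main(void)
    {
        printf("pins:  DDR2 %d vs FB-DRAM %d\n", 2 * 240, 6 * 69);  /* 480 vs 414 */
        printf("DIMMs: DDR2 %d vs FB-DRAM %d\n", 2 * 2, 6 * 8);     /* 4 vs 48    */
        /* assuming 4GB DIMMs: 16GB vs 192GB, matching the summary table */
        printf("max:   DDR2 %dGB vs FB-DRAM %dGB\n", 2 * 2 * 4, 6 * 8 * 4);
        return 0;
    }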

Fully duplex parallel busses are prohibitively expensive for the traditional DRAM modules; duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means that, in some situations, the bandwidth is theoretically doubled by this alone. But it is not the only place where parallelism is used for bandwidth increase. Since an FB-DRAM controller can run up to six channels at the same time the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

We can summarize the advantages like this:

                   DDR2     FB-DRAM
    Pins           240      69
    Channels       2        6
    DIMMs/Channel  2        8
    Max Memory     16GB     192GB
    Throughput     ~10GB/s  ~40GB/s

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed—albeit minimally—at each DIMM in the chain, which means the latency increases. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since only one DIMM per channel is needed; for large memory systems DDR simply has no answer using commodity components.

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thus increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.
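Both ratios quoted in this section can be recomputed in a few lines:

    #include <stdio.h>

    int main(void)
    {
        /* 2.933GHz CPU on a quad-pumped 1.066GHz (effective) FSB */
        double cpu_ghz = 2.933, bus_ghz = 1.066 / 4.0;
        printf("CPU cycles per bus cycle: %.0f\n", cpu_ghz / bus_ghz);  /* 11 */

        /* DDR2-800: 800M transfers/s * 8 bytes, times two channels */
        double per_channel = 800e6 * 8;
        printf("two channels: %.1f GB/s\n", 2.0 * per_channel / 1e9);   /* 12.8 */
        return 0;
    }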

But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help. The sooner the precharging can happen and the RAS signal sent the smaller the penalty when the row is actually used.

Hardware and software prefetching (see Section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.

