- Data-path control unit design
- Pipeline stalls on cache misses

Actions Needed on an I-Cache Miss
1. Compute the value of PC - 4.
2. Instruct the main memory to perform a read, and wait for the memory to complete its access.
3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on.
4. Restart the instruction execution at the first step, which will re-fetch the instruction, this time finding it in the cache.

A Case Study: DEC Station 3100
- Separate I-Cache and D-Cache
- Write-through policy
- One-word lines simplify write-miss handling

How to Handle Read/Write
Write (write-through), for both write hit and write miss:
1. Index the cache using bits 15-2 of the address.
2. Write both the tag portion (using bits 31-16 of the address) and the data portion with the word.
3. Also write the word to main memory using the entire address.

Performance Penalty due to Write-Through on DEC Station 3100
- CPI without cache misses: 1.2 (gcc)
- If each write takes 10 cycles, CPI becomes 2.3 (gcc)
- Note: in gcc, 11% of instructions are stores, and each takes 10 cycles (1.2 + 0.11 x 10 = 2.3)
- Solution: write buffers (in the DEC Station 3100: size = 4)

Combined I-Cache/D-Cache?
- Hit ratio: combined may be better. In the DECStation 3100, the miss rate is 4.8% (combined) vs. 5.4% (split)
- Bandwidth considerations

The primary method of achieving higher memory bandwidth is to increase the physical or logical width of the memory system. The figure (High-Bandwidth Design Alternatives) shows two ways of improving on the simplest design, (a), which uses a memory where all components are one word wide: (b) shows a wider memory, bus, and cache, while (c) shows a narrow bus and cache with an interleaved memory.

Review: Major Components of a Computer

Processor-Memory Performance Gap

The "Memory Wall"
- Logic vs. DRAM speed gap continues to grow

Memory Performance Impact on Performance
- Suppose a processor executes at
  - ideal CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- and that 10% of data memory operations miss with a 50 cycle miss penalty

The Memory Hierarchy Goal
- Fact: large memories are slow and fast memories are small
- How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
  - With hierarchy
  - With parallelism

A Typical Memory Hierarchy

Characteristics of the Memory Hierarchy

Memory Hierarchy Technologies
- Caches use SRAM for speed and technology compatibility
  - Low density (6 transistor cells), high power, expensive, fast
  - Static: content will last "forever" (until power is turned off)

Memory Performance Metrics
- Latency: time to access one word
  - Access time: time between the request and when the data is available (or written)
  - Cycle time: time between requests
  - Usually cycle time > access time
  - Typical read access times for SRAMs in 2004 range from 2 to 4 ns for the fastest parts to 8 to 20 ns for the typical largest parts
- Bandwidth: how much data from the memory can be supplied to the processor per unit time
  - width of the data channel * the rate at which it can be used
- Size: DRAM to SRAM ratio is 4 to 8
- Cost/cycle time: SRAM to DRAM ratio is 8 to 16

Classical RAM Organization (~Square)

Classical DRAM Organization (~Square Planes)
- The column address selects the requested bit from the row in each plane

Classical DRAM Operation
- DRAM organization: N rows x N columns x M bits
  - Read or write M bits at a time
  - Each M-bit access requires a RAS/CAS cycle

Page Mode DRAM Operation
- Page mode DRAM: an N x M SRAM to save a row

Synchronous DRAM (SDRAM) Operation

Other DRAM Architectures
- Double Data Rate SDRAMs: DDR-SDRAMs (and DDR-SRAMs)
  - Double data rate because they transfer data on both the rising and falling edge of the clock
  - Are the most widely used form of SDRAMs
- DDR2-SDRAMs
  http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf

DRAM Memory Latency & Bandwidth Milestones
- In the time that the memory-to-processor bandwidth doubles, the memory latency improves by a factor of only 1.2 to 1.4
- To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks

Memory Systems that Support Caches
- The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

One Word Wide Memory Organization
- If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory. The miss penalty is the total of:
  - cycle(s) to send the address
  - cycles to read DRAM
  - cycle(s) to return the data
- The bandwidth for a single miss is the number of bytes transferred per clock cycle

One Word Wide Memory Organization, con't
- What if the block size is four words?
- What if the block size is four words and a fast page mode DRAM is used?
- In each case the miss penalty is the total of:
  - cycle(s) to send the 1st address
  - cycles to read DRAM
  - cycle(s) to return the last data word
- and the bandwidth for a single miss is again the number of bytes transferred per clock cycle

Interleaved Memory Organization

DRAM Memory System Summary
- It's important to match the cache characteristics
  - caches access one block at a time (usually more than one word)
- with the DRAM characteristics
  - use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- with the memory-bus characteristics
  - make sure the memory bus can support the DRAM access rates and patterns
- with the goal of increasing the memory-bus to cache bandwidth

CPEG323

Large and Fast: Exploiting Memory Hierarchy
Instruction and data miss rates for the DECStation 3100 when executing the different programs. The combined miss rate is the effective miss rate seen.
It is obtained by weighting the individual instruction and data miss rates by the frequency of instruction and data references.

High-Bandwidth Design Alternatives
[Figure: three memory organizations, each connecting CPU, cache, bus, and memory: (a) one-word-wide memory organization; (b) wide memory organization, which adds a multiplexer; (c) interleaved memory organization, with memory banks 0 to 3.]

[Figure: major components of a computer. Processor (Control, Datapath), Memory, Devices (Input, Output).]

[Figure: Moore's Law plot. Processor: 55%/year (2X/1.5 years). DRAM: 7%/year (2X/10 years). Processor-memory performance gap grows 50%/year.]

[Figure: the memory wall. Plotted: clocks per instruction and clocks per DRAM access.]

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles) + (0.30 (data mem ops/instr) x 0.10 (misses/data mem op) x 50 (cycles/miss))
    = 1.1 cycles + 1.5 cycles
    = 2.6
- So 58% of the time the processor is stalled waiting for memory!
- A 1% instruction miss rate would add an additional 0.5 to the CPI!
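The stall arithmetic above can be reproduced with a short calculation. This is only a minimal sketch: the function name `effective_cpi` and its parameter names are illustrative and do not come from the slides; the numbers are the ones given in the example.

```python
def effective_cpi(ideal_cpi, data_ops_per_instr, miss_rate, miss_penalty):
    """Return (CPI including memory stalls, stall cycles per instruction)."""
    # Average stall cycles contributed by data-memory misses per instruction.
    stalls = data_ops_per_instr * miss_rate * miss_penalty
    return ideal_cpi + stalls, stalls


# Values from the slide: ideal CPI 1.1, 30% loads/stores,
# 10% data miss rate, 50-cycle miss penalty.
cpi, stalls = effective_cpi(1.1, 0.30, 0.10, 50)
stall_fraction = stalls / cpi

print(f"CPI = {cpi:.1f}")
print(f"stalled {stall_fraction:.0%} of the time")
```

Calling the same function with one reference per instruction and a 1% miss rate, `effective_cpi(0.0, 1.0, 0.01, 50)`, gives the extra 0.5 CPI quoted for instruction misses.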
[Figure: a typical memory hierarchy. On-chip components: Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache. Then Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk). eDRAM is also shown.]
- Speed (% cycles): ½'s, 1's, 10's, 100's, 1,000's
- Size (bytes): 100's, K's, 10K's, M's, G's to T's
- Cost: highest to lowest
- By taking advantage of the principle of locality
  - Can present the user with as much memory as is available in the cheapest technology
  - at the speed offered by the fastest technology

[Figure: characteristics of the memory hierarchy. Processor, L1$, L2$, Main Memory, Secondary Memory, shown by increasing distance from the processor in access time and by (relative) size of the memory at each level.]
- Transfer units: 4-8 bytes (word); 8-32 bytes (block); 1 to 4 blocks; 1,024+ bytes (disk sector = page)
- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

- Main Memory uses DRAM for size (density)
  - High density (1 transistor cells), low power, cheap, slow
  - Dynamic: needs to be "refreshed" regularly (~ every 8 ms)
    - 1% to 2% of the active cycles of the DRAM
- Addresses divided into 2 halves (row and column)
  - RAS or Row Access Strobe triggering row decoder
  - CAS or Column Access Strobe triggering column selector

[Figure: SRAM 2M x 16 chip. Inputs: Address (21 bits), Din[15-0] (16 bits), Chip select, Output enable, Write enable. Output: Dout[15-0] (16 bits).]
[Figure: classical RAM organization. A row decoder, driven by the row address, selects a word (row) line in the RAM cell array; bit (data) lines feed the column selector & I/O circuits, which use the column address to deliver a data bit or word.]
- Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell
- One memory row holds a block of data, so the column address selects the requested bit or word from that block

[Figure: classical DRAM organization (~square planes). Each plane has a RAM cell array with a row decoder (row address), word (row) lines, bit (data) lines, and column selector & I/O circuits (column address); each plane delivers one data bit.]
- Each intersection represents a 1-T DRAM cell
[Figure: classical DRAM operation. A DRAM of N rows x N cols x M bit planes takes a row address and a column address and produces an M-bit output. Timing diagram: RAS with the row address, then CAS with the column address, for the 1st M-bit access and again for the 2nd M-bit access; the cycle time is marked.]
[Figure: page mode DRAM operation. A DRAM of N rows x N cols x M bit planes with an N x M SRAM; inputs are the row address and column address, output is M bits. Timing diagram: RAS with one row address, then CAS with four successive column addresses, for the 1st M-bit access and the 2nd, 3rd, and 4th M-bit words; the cycle time is marked.]
- After a row is read into the SRAM "register"
  - Only CAS is needed to access other M-bit words on that row
  - RAS remains asserted while CAS is toggled
[Figure: synchronous DRAM (SDRAM) operation. A DRAM of N rows x N cols x M bit planes with an N x M SRAM and a "+1" block on the column address; inputs are the row address and column address, output is M bits. Timing diagram: RAS with the row address, CAS with one column address, then the 1st M-bit access followed by the 2nd, 3rd, and 4th M-bit words; the cycle time is marked.]
- After a row is read into the SRAM register
  - Inputs CAS as the starting "burst" address along with a burst length
  - Transfers a burst of data from a series of sequential addresses within that row
  - A clock controls transfer of successive words in the burst (300 MHz in 2004)
DRAM Memory Latency & Bandwidth Milestones

                  DRAM   Page DRAM   FastPage DRAM   FastPage DRAM   Synch DRAM   DDR SDRAM
Module Width      16b    16b         32b             64b             64b          64b
Year              1980   1983        1986            1993            1997         2000
Mb/chip           0.06   0.25        1               16              64           256
Die size (mm2)    35     45          70              130             170          204
Pins/chip         16     16          18              20              54           66
BWidth (MB/s)     13     40          160             267             640          1600
Latency (nsec)    225    170         125             75              62           52

Source: Patterson, CACM Vol 47, #10, 2004

One word wide organization (one word wide bus and one word wide memory)
[Figure: CPU and cache on-chip, connected by a bus to memory; 32-bit data and 32-bit address per cycle.]
- Assume
  - 1 clock cycle to send the address
  - 25 clock cycles for DRAM cycle time, 8 clock cycles access time
  - 1 clock cycle to return a word of data
- Memory-bus to cache bandwidth
  - number of bytes accessed from memory and transferred to cache/CPU per clock cycle

One-word block:
- 1 cycle to send address
- 25 cycles to read DRAM
- 1 cycle to return data
- 27 total clock cycles miss penalty
- Bandwidth for a single miss: 4/27 = 0.148 bytes per clock
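The one-word miss penalty and bandwidth figures above follow from adding the three phases and dividing the bytes moved by the total. A minimal sketch, assuming 4-byte (32-bit) words as in the figure; the names `miss_penalty`, `bandwidth`, and `BYTES_PER_WORD` are illustrative and not taken from the slides.

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words, matching the 32-bit data bus


def miss_penalty(send_cycles, dram_cycles, return_cycles):
    """Total stall cycles for one miss: send address + read DRAM + return data."""
    return send_cycles + dram_cycles + return_cycles


def bandwidth(words_per_block, penalty_cycles):
    """Bytes transferred per clock cycle for a single miss."""
    return words_per_block * BYTES_PER_WORD / penalty_cycles


# One-word block with the stated timing assumptions:
# 1 cycle address, 25 cycles DRAM, 1 cycle data return.
penalty = miss_penalty(1, 25, 1)
bytes_per_clock = bandwidth(1, penalty)

print(penalty)                    # 27
print(round(bytes_per_clock, 3))  # 0.148
```

Keeping the three phases as separate arguments makes it easy to plug in a different DRAM read time for other block sizes or DRAM types.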
Four-word block, one-word-wide memory (four DRAM accesses of 25 cycles each):
- 1 cycle to send 1st address
- 4 x 25 = 100 cycles to read DRAM
- 1 cycle to return last data word
- 102 total clock cycles miss penalty
- Bandwidth for a single miss: (4 x 4)/102 = 0.157 bytes per clock

For a block size of four words, with the first DRAM access taking 25 cycles and each of the remaining three taking 8 cycles
  1 cycle to send the 1st address
  25 + 3*8 = 49 cycles to read DRAM
  1 cycle to return the last data word
  51 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/51 = 0.314

[Figure: on-chip CPU and cache connected by a bus to Memory bank 0, Memory bank 1, Memory bank 2, and Memory bank 3]

For a block size of four words
  __ cycle to send 1st address
  __ cycles to read DRAM
  __ cycles to return last data word
  __ total clock cycles miss penalty

Workstation Design Target: 25% of cost on Processor, 25% of cost on Memory (minimum memory size), rest on I/O devices, power supplies, box

Instead, the memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides variation in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality. The principle of locality states that programs access a relatively small portion of the address space at any instant of time.
The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk) while, by taking advantage of the principle of locality, providing an average access speed that is very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lectures on caches.)

Putting multiple words in one memory row splits the decoder into two decoders (row and column) and makes the memory core square, reducing the length of the bit lines (but increasing the length of the word lines). The least significant part of the address goes into the column decoder (e.g., 6 bits, so that 64 words are assigned to one row; with 32 bits per word this gives 2**11 bit line pairs), leaving 14 bits for the row decoder (giving 2**14 word lines) for a not quite square array. This scheme is good only for up to 64 Kb to 256 Kb. For bigger memories it is too slow because the word and bit lines are too long.

SRAM allows you to read an entire row out at a time as a word. Each row control line is referred to as the word line and each vertical data line is referred to as the bit line.

Similar to SRAM, DRAM is organized into rows and columns. But unlike SRAM, which allows you to read an entire row out at a time as a word, classical DRAM only allows you to read out one bit at a time, so we need several planes of them to store one word. The reason for this is to save power as well as area. Remember that the DRAM cell is very small, so we have a lot of them across horizontally. It would be very difficult to build a sense amplifier for each column because of the area constraint, not to mention that a sense amplifier per column would consume a lot of power. You select the bit you want to read or write by supplying a row address and then a column address.
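
Supplying a row address and then a column address amounts to splitting one cell address into two bit fields. The following is a minimal sketch, assuming the 20-bit split from the SRAM example above (6 low-order column bits, 14 row bits); `split_address` is a hypothetical helper used only for illustration.

```python
COLUMN_BITS = 6   # low-order bits go to the column decoder (64 words per row)
ROW_BITS = 14     # remaining bits go to the row decoder (2**14 word lines)


def split_address(addr):
    """Return (row, column) for a 20-bit word address."""
    if not 0 <= addr < (1 << (ROW_BITS + COLUMN_BITS)):
        raise ValueError("address out of range")
    column = addr & ((1 << COLUMN_BITS) - 1)  # keep the low 6 bits
    row = addr >> COLUMN_BITS                 # drop the low 6 bits
    return row, column


# Word 67 sits on word line 1, column 3 (67 = 1 * 64 + 3).
print(split_address(67))  # (1, 3)
```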
Similar to SRAM, each row control line is referred to as the word line and each vertical data line is referred to as the bit line.

Another performance booster for DRAM is fast page mode operation. In normal DRAM, we can only read and write M bits at a time because only one row and one column are selected at any time by the row and column address:
  1) RAS
  2) Latch
  3) CAS
  4) Latch
  5) Data
In other words, for each M-bit memory access, we have to provide a row address followed by a column address. Very time consuming. So the engineers got smart and said: "Wait a minute, this is silly, why don't we put an N x M register here so we can save an entire row internally whenever we access a row?"

So with this register in place, all we need to do is assert RAS to latch in the row address; the entire row is then read out and saved into this register. After that, you only need to provide the column address and assert CAS to access another M bits within the same row. This type of operation, where RAS remains asserted while CAS is toggled to bring in a new column address, is called page mode operation.

Memory baseline is a 64KB DRAM in 1980, with three years to the next generation until 1996 and then two years thereafter, with a 7% per year performance improvement in latency. Processor assumes a 35% improvement per year until 1986, then 55% until 2003, then 5%.
Need to supply an instruction and a data item every clock cycle.
In 1980 there were no caches (and no need for them); by 1995 most systems had 2-level caches (e.g., 60% of the transistors on the Alpha 21164 were in the cache).
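
The growth rates in the note above compound quickly. The following is a rough sketch, assuming the stated annual rates apply uniformly within each period (processor: 35% per year from 1980 to 1986, then 55% per year to 2003; memory latency: 7% per year throughout) and stopping at 2003; `relative_performance` is an illustrative helper, not part of the lecture.

```python
def relative_performance(year, periods, base_year=1980):
    """Performance relative to base_year.

    periods: chronological list of (end_year, annual_rate) pairs.
    """
    perf = 1.0
    start = base_year
    for end, rate in periods:
        years = max(0, min(year, end) - start)  # years spent in this period
        perf *= (1.0 + rate) ** years
        start = end
    return perf


PROCESSOR = [(1986, 0.35), (2003, 0.55)]
MEMORY = [(2003, 0.07)]

for y in (1980, 1986, 1995, 2003):
    cpu = relative_performance(y, PROCESSOR)
    mem = relative_performance(y, MEMORY)
    print(y, f"cpu x{cpu:.0f}", f"mem x{mem:.2f}", f"gap x{cpu / mem:.0f}")
```

The last column is the ratio of the two curves, which is the processor-memory performance gap that motivates the cache hierarchy.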
CPEG323 Topic 7
David L. Mills

Integration of cache and MIPS Pipeline
  Data-path control unit design
  Pipeline stalls on cache misses

Slide Titles
  Integration of cache and MIPS Pipeline
  Actions Needed on an I-Cache Miss
  A Case Study: DEC Station 3100
  How to Handle Read/Write
  Performance Penalty due to Write-Through on DEC Station 3100
  Slide 6
  Combined I-Cache/D-Cache?
  Slide 8
  Review: Major Components of a Computer
  Processor-Memory Performance Gap
  The "Memory Wall"
  Memory Performance Impact on Performance
  The Memory Hierarchy Goal
  A Typical Memory Hierarchy
  Characteristics of the Memory Hierarchy
  Memory Hierarchy Technologies
  Memory Performance Metrics
  Classical RAM Organization (~Square)
  Classical DRAM Organization (~Square Planes)
  Classical DRAM Operation
  Page Mode DRAM Operation
  Synchronous DRAM (SDRAM) Operation
  Other DRAM Architectures
  DRAM Memory Latency & Bandwidth Milestones
  Memory Systems that Support Caches
  One Word Wide Memory Organization
  One Word Wide Memory Organization
  One Word Wide Memory Organization, con't
  One Word Wide Memory Organization, con't
  One Word Wide Memory Organization, con't
  One Word Wide Memory Organization, con't
  Interleaved Memory Organization
  Interleaved Memory Organization
  DRAM Memory System Summary

Link: http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf