Number systems, instruction cycles, pipelining, memory hierarchy and processor architecture essentials.
| Representation | Range (8-bit) | Description |
|---|---|---|
| Unsigned | 0 to 255 | No sign bit; all bits represent magnitude |
| Sign-Magnitude | -127 to +127 | MSB = sign (0=positive, 1=negative); two representations of zero (+0, -0) |
| 1's Complement | -127 to +127 | Invert all bits for negative; two representations of zero |
| 2's Complement | -128 to +127 | Invert all bits + 1 for negative; single zero; most widely used |
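The asymmetric 2's-complement range and the invert-and-add-one rule above can be checked with a short sketch (Python, 8-bit width assumed; function names are illustrative):

```python
def to_twos(value, bits=8):
    """Encode a signed integer into its two's-complement bit pattern."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise OverflowError(f"{value} not representable in {bits} bits")
    return value & ((1 << bits) - 1)   # e.g. -25 -> 256 - 25 = 231

def from_twos(pattern, bits=8):
    """Decode a two's-complement bit pattern back to a signed integer."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern

to_twos(-25)            # 231 = 0b11100111 (the 2^n - 25 shortcut)
from_twos(0b11100111)   # -25
```

Note the masking in `to_twos` is exactly the "subtract from 2^n" shortcut: `-25 & 0xFF` equals `256 - 25`.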
┌─────────────────────────────────────────────────────────────────┐
│ 2'S COMPLEMENT CONVERSION STEPS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Positive to Negative (e.g., +25 → 8-bit 2's complement): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Step 1: Write binary of +25 = 00011001 │ │
│ │ Step 2: Invert all bits = 11100110 (1's comp) │ │
│ │ Step 3: Add 1 = 11100111 (2's comp) │ │
│ │ -25 in 2's complement = 11100111 │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Negative to Positive (e.g., 11100111 → ?): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Step 1: Identify as negative (MSB = 1) │ │
│ │ Step 2: Invert all bits = 00011000 │ │
│ │ Step 3: Add 1 = 00011001 = +25 │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Shortcut: Subtract from 2^n │
│ -25 = 256 - 25 = 231 = 11100111 ✓ │
└─────────────────────────────────────────────────────────────────┘

| Field | Bits | Description |
|---|---|---|
| Sign (S) | 1 bit | 0 = positive, 1 = negative |
| Exponent (E) | 8 bits | Biased exponent; bias = 127 |
| Mantissa (M) | 23 bits | Fractional part; implicit leading 1 |
| Field | Bits | Description |
|---|---|---|
| Sign (S) | 1 bit | 0 = positive, 1 = negative |
| Exponent (E) | 11 bits | Biased exponent; bias = 1023 |
| Mantissa (M) | 52 bits | Fractional part; implicit leading 1 |
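The single-precision field layout above can be cross-checked with Python's `struct` module, which packs native IEEE 754 values (a small sketch; `float_fields` is an illustrative helper):

```python
import struct

def float_fields(x):
    """Split a value's single-precision encoding into (sign, exponent, mantissa)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8 biased exponent bits
    mantissa = bits & 0x7FFFFF            # 23 fraction bits
    return sign, exponent, mantissa

float_fields(-13.375)
# (1, 130, 0b10101100000000000000000): sign 1, exponent 3 + bias 127
struct.pack('>f', -13.375).hex()          # 'c1560000'
```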
┌─────────────────────────────────────────────────────────────────┐
│ IEEE 754 SINGLE PRECISION — WORKED EXAMPLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Convert -13.375 to IEEE 754 Single Precision: │
│ │
│ Step 1: Sign bit │
│ -13.375 is negative → S = 1 │
│ │
│ Step 2: Convert integer part (13) to binary │
│ 13 = 8 + 4 + 1 = 1101₂ │
│ │
│ Step 3: Convert fractional part (0.375) to binary │
│ 0.375 × 2 = 0.75 → 0 │
│ 0.75 × 2 = 1.5 → 1 │
│ 0.5 × 2 = 1.0 → 1 │
│ 0.375 = 0.011₂ │
│ │
│ Step 4: Combine = 1101.011 │
│ │
│ Step 5: Normalize (scientific notation in binary) │
│ 1101.011 = 1.101011 × 2^3 │
│ Mantissa (M) = 10101100000000000000000 (23 bits, pad with 0) │
│ Exponent = 3 + 127 = 130 = 10000010₂ │
│ │
│ Result: 1 10000010 10101100000000000000000 │
│ S E(8) M(23) │
│ │
│ Hex: C1560000 │
│ Verify: (-1)^1 × 1.101011 × 2^(130-127) │
│        = -1 × (1 + 0.5 + 0.125 + 0.03125 + 0.015625) × 8        │
│ = -1 × 1.671875 × 8 = -13.375 ✓ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ FLOATING POINT ADDITION / SUBTRACTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Example: Add 1.5 × 2^2 + 1.125 × 2^0 │
│ │
│ Step 1: Equalize exponents │
│ Smaller exponent adjusts to larger: │
│ 1.125 × 2^0 = 0.28125 × 2^2 (shift mantissa right by 2) │
│ │
│ Step 2: Add mantissas │
│ 1.5 + 0.28125 = 1.78125 │
│ Result: 1.78125 × 2^2 = 7.125 │
│ │
│ Step 3: Normalize result │
│ 1.78125 × 2^2 (already normalized, MSB = 1) │
│ │
│ Step 4: Round (if mantissa exceeds 23 bits) │
│ Apply rounding mode (round to nearest, toward zero, etc.) │
│ │
│ ───────────────────────────────────────────────────── │
│ │
│ Example: Subtract 1.5 × 2^2 - 1.375 × 2^2 │
│ │
│ Step 1: Exponents already equal (both 2) │
│ Step 2: Subtract mantissas │
│ 1.5 - 1.375 = 0.125 │
│  Step 3: Normalize → 1.0 × 2^(-1)  (0.125 × 2^2 = 0.5)          │
│  (shifted mantissa left 3 positions, exponent 2 - 3 = -1)       │
│ │
│ ───────────────────────────────────────────────────── │
│ FLOATING POINT MULTIPLICATION: │
│ 1. Add exponents: E = E1 + E2 - bias │
│ 2. Multiply mantissas: M = M1 × M2 │
│ 3. Normalize and round result │
│ 4. Set sign: S = S1 XOR S2 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ OVERFLOW / UNDERFLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ OVERFLOW: Result exceeds maximum representable value │
│ ───────────────────────────────────────────── │
│ 2's complement overflow detection (addition): │
│ • Positive + Positive = Negative → OVERFLOW │
│ • Negative + Negative = Positive → OVERFLOW │
│ • Rule: If carry-in to MSB ≠ carry-out from MSB → overflow │
│ │
│ Examples (8-bit 2's complement): │
│ 01111111 (+127) + 00000001 (+1) = 10000000 (-128) OVERFLOW! │
│ 10000000 (-128) + 10000000 (-128) = 00000000 (0) OVERFLOW! │
│ 01111111 (+127) + 11111111 (-1) = 01111110 (+126) OK │
│ 10000001 (-127) + 11111111 (-1) = 10000000 (-128) OK │
│ │
│ UNDERFLOW: Result is smaller than minimum representable value │
│ ───────────────────────────────────────────── │
│ In floating point: result is too close to zero → flushed to 0 │
│ Example: 1.0 × 2^(-126) × 0.5 × 2^(-126) = 0.5 × 2^(-252) │
│ This exceeds the minimum exponent → gradual/denormalized │
│ or sudden underflow to zero depending on implementation │
└─────────────────────────────────────────────────────────────────┘

The exponent bias is 127 for single precision and 1023 for double precision.

| Format | Address Fields | Example | # Operands in Memory |
|---|---|---|---|
| Zero-Address | 0 (Stack-based) | PUSH A; PUSH B; ADD | 2 (operands on stack) |
| One-Address | 1 (Accumulator) | ADD B (ACC ← ACC + B) | 1 (ACC implicit) |
| Two-Address | 2 (Register) | ADD R1, R2 (R1 ← R1 + R2) | 0 (all registers) |
| Three-Address | 3 (Register) | ADD R1, R2, R3 (R1←R2+R3) | 0 (all registers) |
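The zero-address row above can be made concrete with a tiny stack-machine sketch; `PUSH`/`POP` touch memory while the arithmetic ops work purely on the stack (variable names and the interpreter are illustrative):

```python
def run(program, mem):
    """Interpret a tiny zero-address (stack) instruction sequence."""
    stack = []
    for op, *arg in program:
        if op == 'PUSH':
            stack.append(mem[arg[0]])
        elif op == 'POP':
            mem[arg[0]] = stack.pop()
        else:                        # ADD / SUB / MUL pop two, push one
            b, a = stack.pop(), stack.pop()
            stack.append({'ADD': a + b, 'SUB': a - b, 'MUL': a * b}[op])

mem = {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6}
run([('PUSH', 'A'), ('PUSH', 'B'), ('MUL',),
     ('PUSH', 'C'), ('PUSH', 'D'), ('MUL',), ('ADD',),
     ('PUSH', 'E'), ('SUB',), ('POP', 'Z')], mem)
# mem['Z'] == 2*3 + 4*5 - 6 == 20
```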
┌─────────────────────────────────────────────────────────────────┐
│ INSTRUCTION FORMAT — SIDE BY SIDE COMPARISON │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Evaluate: Z = (A × B) + (C × D) - E │
│ ───────────────────────────────────────────── │
│ │
│  THREE-ADDRESS (4 instructions):                                │
│ MUL T1, A, B ; T1 = A * B │
│ MUL T2, C, D ; T2 = C * D │
│ ADD T3, T1, T2 ; T3 = T1 + T2 │
│ SUB Z, T3, E ; Z = T3 - E │
│  Instructions: 4 | Memory accesses: 5 reads (A-E) + 1 write     │
│ │
│  TWO-ADDRESS (6 instructions):                                  │
│ MOV R1, A ; R1 = A │
│ MUL R1, B ; R1 = A * B │
│ MOV R2, C ; R2 = C │
│ MUL R2, D ; R2 = C * D │
│ ADD R1, R2 ; R1 = A*B + C*D │
│ SUB R1, E ; R1 = A*B + C*D - E │
│  Instructions: 6 | Memory accesses: 5 reads (result left in R1) │
│ │
│ ONE-ADDRESS / ACCUMULATOR (8 instructions): │
│ LOAD A ; ACC = A │
│ MUL B ; ACC = A * B │
│ STORE T ; T = A * B │
│ LOAD C ; ACC = C │
│ MUL D ; ACC = C * D │
│ ADD T ; ACC = C*D + A*B │
│ SUB E ; ACC = C*D + A*B - E │
│ STORE Z ; Z = result │
│  Instructions: 8 | Memory accesses: 6 reads + 2 writes          │
│ │
│ ZERO-ADDRESS / STACK (10 instructions): │
│ PUSH A; PUSH B; MUL; PUSH C; PUSH D; MUL; ADD; PUSH E; │
│ SUB; POP Z │
│  Instructions: 10 | Memory accesses: 5 reads (PUSH) + 1 write   │
└─────────────────────────────────────────────────────────────────┘

| Addressing Mode | Syntax | Effective Address | Description |
|---|---|---|---|
| Immediate | #5 | Operand itself | Operand is part of the instruction; no memory access |
| Register | R1 | Register R1 | Operand is in a register; fastest access |
| Direct (Absolute) | 1000 | 1000 | Address field = memory address of operand |
| Indirect | @1000 | M[1000] | Address field points to memory location containing the EA |
| Indexed | X(R1) | R1 + X | Base register + offset; used for arrays |
| Based | 1000(R1) | R1 + 1000 | Base register + displacement; position-independent code |
| Relative | $label | PC + offset | PC-relative; used for branches; supports code relocation |
| Implied / Implicit | CLR | (fixed register) | Operand is implied by opcode (e.g., accumulator, stack top) |
| Auto-increment | (R1)+ | R1; then R1 ← R1 + d | Access address in R1, then increment R1; good for array traversal |
| Auto-decrement | -(R1) | R1 ← R1 - d; then R1 | Decrement R1, then access; stack push emulation |
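The effective-address column above can be traced with a toy memory and register file (all addresses and contents here are made up for illustration):

```python
mem  = {1000: 2000, 2000: 42, 1008: 7}   # toy memory: address -> contents
regs = {'R1': 1000}
X = 8                                    # index / displacement constant

ea_direct   = 1000                       # Direct: address field IS the EA
ea_indirect = mem[1000]                  # Indirect: M[1000] holds the EA -> 2000
ea_indexed  = regs['R1'] + X             # Indexed: R1 + X -> 1008

# Auto-increment: use R1 as the EA, then bump R1 by the operand size d
d = 4
ea_auto = regs['R1']                     # access M[1000] first
regs['R1'] += d                          # R1 becomes 1004 afterwards
```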
| Feature | RISC | CISC |
|---|---|---|
| Instruction Count | Small (50-100) | Large (200-500+) |
| Instruction Length | Fixed (32-bit) | Variable (16-64 bit) |
| Execution Time | 1 clock cycle each | 1 to many clock cycles |
| Addressing Modes | Few (2-3) | Many (12-15+) |
| Registers | Many (32+) | Few (8-16) |
| Pipeline | Easy to pipeline | Difficult (variable length) |
| Microcode | No microcode | Microprogrammed control |
| Examples | ARM, MIPS, RISC-V, SPARC | x86/x64, VAX, IBM Z |
| Design Philosophy | Software does more (compiler) | Hardware does more (microcode) |
| Clock Speed | Higher (simple instructions) | Lower (complex instructions) |
| Code Density | Lower (more instructions) | Higher (complex ops) |
| Power Consumption | Lower | Higher |
| Stage | Abbreviation | Operation |
|---|---|---|
| Fetch | IF | Read instruction from memory using PC |
| Decode | ID | Decode opcode, read register operands |
| Execute | EX | Perform ALU operation (add, AND, shift) |
| Memory | MEM | Read/write data memory (if needed) |
| Write Back | WB | Write result back to register file |
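With the five stages above, the standard pipeline timing formulas reduce to one-liners (a sketch; the function names are illustrative):

```python
def pipelined_cycles(n, k, stalls=0):
    """Cycles for n instructions on a k-stage pipeline, plus optional stall cycles."""
    return k + (n - 1) + stalls

def speedup(n, k, stalls=0):
    """Speedup over a non-pipelined machine taking k cycles per instruction."""
    return (n * k) / pipelined_cycles(n, k, stalls)

pipelined_cycles(8, 5)          # 12 cycles
round(speedup(8, 5), 2)         # 3.33
pipelined_cycles(100, 5, 70)    # 174 cycles once stalls are added
```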
Indexed addressing (constant offset + register) selects an array element, while Based addressing (register + constant displacement) reaches a struct field or local variable. Relative addressing is key for position-independent code and is how branch, CALL, and JMP instructions work in most ISAs.
┌─────────────────────────────────────────────────────────────────┐
│ 5-STAGE PIPELINE TIMING DIAGRAM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT PIPELINE (Sequential): │
│ ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐ │
│ │ I1 │ I2 │ I3 │ I4 │ I5 │ I6 │ I7 │ I8 │ │
│ └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ │
│  Slot:   1      2      3      4      5      6      7      8     │
│  (each slot is one instruction = k = 5 cycles)                  │
│ Total = n × k = 8 × 5 = 40 cycles │
│ │
│ WITH PIPELINE (Overlapped): │
│ Cycle: 1 2 3 4 5 6 7 8 9 10 11 12│
│ I1: [IF] [ID] [EX] [MEM] [WB] │
│ I2: [IF] [ID] [EX] [MEM] [WB] │
│ I3: [IF] [ID] [EX] [MEM] [WB] │
│ I4: [IF] [ID] [EX] [MEM] [WB] │
│ I5: [IF] [ID] [EX] [MEM] [WB] │
│ I6: [IF] [ID] [EX] [MEM] [WB] │
│ I7: [IF] [ID] [EX] [MEM] [WB]│
│ I8: [IF] [ID] [EX] [MEM][WB]│
│ │
│ Total = k + (n-1) = 5 + 7 = 12 cycles │
│ Speedup = 40 / 12 = 3.33x │
│ Theoretical max = k = 5x │
│ │
│ FORMULA: │
│ Sequential time = n × k × t (where t = cycle time) │
│ Pipelined time = [k + (n - 1)] × t │
│ Speedup S = (n × k) / (k + n - 1) │
│ As n → ∞: S → k (number of pipeline stages) │
└─────────────────────────────────────────────────────────────────┘

| Hazard | Name | Description | Example |
|---|---|---|---|
| RAW | Read After Write | Instruction reads a register before a prior write completes | ADD R1,R2,R3 followed by SUB R4,R1,R5 |
| WAR | Write After Read | Instruction writes a register before a prior read completes | SUB R4,R1,R5 followed by ADD R1,R2,R3 |
| WAW | Write After Write | Two instructions write to the same register; order matters | ADD R1,R2,R3 followed by MUL R1,R4,R5 |
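The three dependence types above follow mechanically from each instruction's destination and source registers; a classifier sketch (the `(dest, sources)` tuple encoding is a made-up convenience):

```python
def hazards(first, second):
    """Classify data hazards between two (dest, sources) register tuples."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:  found.add('RAW')   # second reads what first writes
    if d2 in s1:  found.add('WAR')   # second writes what first reads
    if d1 == d2:  found.add('WAW')   # both write the same register
    return found

add = ('R1', ('R2', 'R3'))           # ADD R1,R2,R3
sub = ('R4', ('R1', 'R5'))           # SUB R4,R1,R5
hazards(add, sub)                    # {'RAW'}
hazards(sub, add)                    # {'WAR'}
```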
| Hazard Type | Cause | Solution |
|---|---|---|
| Control Hazard | Branch changes PC before pipeline is flushed | Branch prediction, delayed branch, branch target buffer |
| Structural Hazard | Two instructions need same hardware resource | Duplicate resources (separate IF/MEM), pipeline stall |
| Data Hazard (RAW) | Producer-consumer register dependency | Forwarding, stall insertion, compiler scheduling |
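The "branch prediction" remedy in the control-hazard row above is usually a 2-bit saturating counter in hardware; a minimal sketch (initial counter value is one common choice, not mandated):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken."""
    def __init__(self):
        self.counter = 2                   # start weakly-taken

    def predict(self):
        return self.counter >= 2           # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:         # a loop branch: taken 9x, then exits
    correct += (p.predict() == taken)
    p.update(taken)
# correct == 9: only the final not-taken loop exit is mispredicted
```

The saturation is what avoids the 1-bit predictor's double misprediction at every loop exit/re-entry.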
┌─────────────────────────────────────────────────────────────────┐
│ DATA FORWARDING (BYPASSING) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Problem: RAW hazard between consecutive instructions │
│ │
│ I1: ADD R1, R2, R3 ; R1 = R2 + R3 (writes in WB stage) │
│ I2: SUB R4, R1, R5 ; R4 = R1 - R5 (reads R1 in ID stage) │
│ │
│ Without forwarding: 2 stall cycles needed (wait for WB of I1) │
│ │
│ WITH FORWARDING: │
│ I1: [IF] [ID] [EX→ALU result] [MEM] [WB] │
│ I2: [IF] [ID←gets R1 from EX/MEM] [EX] [MEM] [WB] │
│ │
│ Forward ALU result directly from EX/MEM pipeline register │
│ to the ID/EX input — no stall needed! │
│ │
│  FORWARDING PATHS (classic 5-stage):                            │
│  ┌─────────────────┬─────────────────────────────────────┐      │
│  │ EX/MEM → EX     │ I1 result feeds the very next       │      │
│  │                 │ instruction (distance 1), 0 stalls  │      │
│  │ MEM/WB → EX     │ I1 result feeds the instruction two │      │
│  │                 │ behind (distance 2), 0 stalls       │      │
│  │ Register file   │ WB writes in the first half-cycle,  │      │
│  │                 │ ID reads in the second → covers     │      │
│  │                 │ distance 3 and beyond               │      │
│  └─────────────────┴─────────────────────────────────────┘      │
│ │
│ LOAD-USE HAZARD (cannot fully avoid with forwarding): │
│ I1: LW R1, 0(R2) ; loads from memory (data in MEM stage) │
│ I2: ADD R3, R1, R4 ; needs R1 in ID stage │
│ → Must stall 1 cycle even with forwarding (data not in EX) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ BRANCH PREDICTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BRANCH PENALTY: When branch is resolved in EX/MEM stage, │
│ already fetched instructions in pipeline must be flushed. │
│ Penalty = number of stages between fetch and branch resolve │
│ │
│ STATIC PREDICTION (no runtime data): │
│ ┌────────────────────┬────────────────────────────────────┐ │
│ │ Predict Not Taken │ Always fetch sequential; flush if taken│
│ │ Predict Taken │ Always fetch branch target │
│ │ Backward Taken │ Predict backward branches as taken │
│ │ │ (loops usually iterate) │
│ │ Profile-guided │ Use compiler profiling info │
│ └────────────────────┴────────────────────────────────────┘ │
│ │
│ DYNAMIC PREDICTION (runtime history): │
│ ┌────────────────────┬──────────────┬──────────────────────┐ │
│ │ 1-bit predictor │ Predict last │ ~60-70% accuracy │ │
│ │ 2-bit saturating │ Change only │ ~75-85% accuracy │ │
│ │ │ on 2 same │ (avoids oscillation │ │
│ │ │ predictions │ in nested loops) │ │
│ │ BHT (Branch │ 2-bit counter │ Indexed by PC │ │
│ │ History Table) │ per branch │ │ │
│ │ Global History │ Uses pattern │ Correlating │ │
│ │ Register │ of last N │ branches │ │
│ │ (GHR) │ branches │ │ │
│ │ Tournament │ Local vs │ Chooses better │ │
│ │ Predictor │ global │ predictor │ │
│ └────────────────────┴──────────────┴──────────────────────┘ │
│ │
│ BRANCH TARGET BUFFER (BTB): │
│ Cache of {PC → Target Address} pairs │
│ Fetch stage checks BTB; if hit, fetch from predicted target │
│ Eliminates 1-cycle penalty for taken branches │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE SPEEDUP — CALCULATION EXAMPLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Given: │
│ Pipeline stages (k) = 5 │
│ Number of instructions (n) = 100 │
│ Branch frequency = 20% │
│ Branch penalty = 3 cycles (with no prediction) │
│ Load-use stall frequency = 10% │
│ Load-use stall cycles = 1 │
│ │
│ Case 1: No stalls, ideal pipeline │
│ Time = (k + n - 1) × t = (5 + 99) × t = 104t │
│ Sequential = n × k × t = 100 × 5 × t = 500t │
│ Speedup = 500/104 = 4.81x │
│ │
│ Case 2: With stalls │
│ Branch stalls = 100 × 0.20 × 3 = 60 cycles │
│ Load-use stalls = 100 × 0.10 × 1 = 10 cycles │
│ Total stalls = 70 cycles │
│ │
│ Time = (k + n - 1 + stalls) × t = (5 + 99 + 70) × t = 174t │
│ Speedup = 500/174 = 2.87x │
│ │
│ Case 3: With perfect branch prediction (100% accuracy) │
│ Branch stalls = 0 cycles │
│ Load-use stalls = 10 cycles │
│ Time = (5 + 99 + 10) × t = 114t │
│ Speedup = 500/114 = 4.39x │
│ │
│ Case 4: With 85% accurate branch prediction │
│ Mispredictions = 100 × 0.20 × 0.15 = 3 instructions │
│ Branch stalls = 3 × 3 = 9 cycles │
│ Load-use stalls = 10 cycles │
│ Time = (5 + 99 + 9 + 10) × t = 123t │
│ Speedup = 500/123 = 4.07x │
└─────────────────────────────────────────────────────────────────┘

| Mapping | Lines/Set | Tag Bits | Search Method |
|---|---|---|---|
| Direct | 1 | log₂(blocks) - log₂(lines) | 1 comparison (direct index) |
| Fully Associative | All | log₂(blocks) | Parallel compare all tags |
| N-way Set Associative | N | log₂(blocks) - log₂(sets) | N comparisons per set |
| Mapping | Tag | Set/Index | Block Offset | Word Offset |
|---|---|---|---|---|
| Direct Mapped | Tag | Index | Block Offset | Word Offset |
| Fully Associative | Tag (widest) | (none) | Block Offset | Word Offset |
| N-way Set Assoc | Tag | Set Index | Block Offset | Word Offset |
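The field splits above reduce to shifts and masks; a sketch, using the 4 KB cache / 64 B block / 32-bit address parameters of the worked example that follows:

```python
def split_address(addr, offset_bits, index_bits):
    """Break an address into (tag, index, offset) cache fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index  = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag    = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Direct mapped: 64 lines -> 6 index bits; 64 B blocks -> 6 offset bits
split_address(0x00008A3C, 6, 6)   # (0x8, 40, 60): line 40, byte 60
# 4-way set associative: 16 sets -> 4 set bits
split_address(0x00008A3C, 6, 4)   # (0x22, 8, 60): set 8
```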
┌─────────────────────────────────────────────────────────────────┐
│ CACHE MAPPING — WORKED EXAMPLES │
├─────────────────────────────────────────────────────────────────┤
│ PROBLEM: │
│ Main memory size = 256 KB │
│ Cache size = 4 KB │
│ Block size = 64 bytes │
│   Address bus = 32 bits (tag widths below use the full 32-bit   │
│   address space, not just the 256 KB of installed memory)       │
│ │
│ Step 1: Calculate parameters │
│ Number of cache lines = Cache size / Block size │
│ = 4 KB / 64 B = 64 lines │
│ Number of memory blocks = Memory size / Block size │
│ = 256 KB / 64 B = 4096 blocks │
│ Offset bits = log₂(64) = 6 bits │
│   Words per block = 64 / 4 = 16 (assuming 4-byte words;         │
│   this fixes the word-offset subfield)                          │
│ │
│ ═══════════════════════════════════════════════════════ │
│ DIRECT MAPPED CACHE: │
│ ───────────────────── │
│ Index bits = log₂(64) = 6 bits │
│ Tag bits = 32 - 6 - 6 = 20 bits │
│ Address format: [TAG:20 | INDEX:6 | OFFSET:6] │
│ │
│   For address 0x00008A3C:                                       │
│   Binary: 0000 0000 0000 0000 1000 1010 0011 1100               │
│   Tag    = 00000000000000001000 = 0x00008                       │
│   Index  = 101000 = 40 (decimal)                                │
│   Offset = 111100 = 60 (decimal)                                │
│   → Maps to cache line 40                                       │
│ │
│ ═══════════════════════════════════════════════════════ │
│ 4-WAY SET ASSOCIATIVE CACHE: │
│ ───────────────────────────── │
│ Number of sets = 64 / 4 = 16 sets │
│ Set bits = log₂(16) = 4 bits │
│ Tag bits = 32 - 4 - 6 = 22 bits │
│ Address format: [TAG:22 | SET:4 | OFFSET:6] │
│ │
│   For address 0x00008A3C:                                       │
│   Set = 1000 = 8 (decimal)                                      │
│   Tag = 0000000000000000100010 = 0x22                           │
│   → Search 4 tags in set 8 for a match                          │
│ │
│ ═══════════════════════════════════════════════════════ │
│ FULLY ASSOCIATIVE CACHE: │
│ ───────────────────────── │
│ No index; entire cache is one set │
│ Tag bits = 32 - 6 = 26 bits │
│ Address format: [TAG:26 | OFFSET:6] │
│ → Compare tag with ALL 64 cache line tags in parallel │
│ → Most flexible, but most expensive (needs associative search)│
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ CACHE PERFORMANCE — CALCULATIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ KEY FORMULAS: │
│ ───────────── │
│ Hit Ratio (h) = Hits / (Hits + Misses) │
│ Miss Ratio (m) = 1 - h │
│ │
│ Effective Access Time (EAT): │
│ EAT = h × t_cache + m × t_memory │
│ │
│ With different miss levels: │
│ EAT = h₁×t_L1 + m₁×(h₂×t_L2 + m₂×(h₃×t_L3 + m₃×t_mem)) │
│ │
│ Average Memory Access Time (AMAT) = EAT │
│ │
│ ═══════════════════════════════════════════════════════ │
│ NUMERICAL EXAMPLE: │
│ ───────────────── │
│ L1 cache access time = 2 ns │
│ L2 cache access time = 10 ns │
│ Main memory access time = 100 ns │
│ L1 hit rate = 90% │
│ L2 local hit rate = 80% (of L1 misses) │
│ │
│ AMAT = h₁ × t_L1 + (1-h₁) × [h₂ × t_L2 + (1-h₂) × t_mem] │
│ AMAT = 0.90 × 2 + 0.10 × [0.80 × 10 + 0.20 × 100] │
│ AMAT = 1.8 + 0.10 × [8 + 20] │
│ AMAT = 1.8 + 0.10 × 28 │
│ AMAT = 1.8 + 2.8 = 4.6 ns │
│ │
│ Speedup over memory-only access = 100 / 4.6 = 21.7x │
│ │
│ ═══════════════════════════════════════════════════════ │
│ MISS PENALTY CALCULATION: │
│ ───────────────────────── │
│ Miss penalty = time to fetch block from lower level │
│ For main memory: miss penalty = memory latency + │
│ (block size / bus width) × memory cycle time │
│ │
│ Example: Memory latency = 50 ns │
│ Bus width = 8 bytes, memory cycle = 10 ns │
│ Block size = 64 bytes │
│ Transfer time = (64/8) × 10 = 80 ns │
│ Total miss penalty = 50 + 80 = 130 ns │
└─────────────────────────────────────────────────────────────────┘

| Policy | Write Hit | Write Miss | Pros | Cons |
|---|---|---|---|---|
| Write-Through | Write to cache + memory | Write to memory (no allocate) | Consistent | Slow writes (every write hits memory) |
| Write-Back | Write to cache only (mark dirty) | Fetch block into cache, then write | Fast writes (batch) | Complexity, dirty bit management |
| Write-Allocate | — | Load block into cache on write miss | Spatial locality benefit | Extra memory read on miss |
| No-Write-Allocate | — | Write directly to memory | Simpler | No cache benefit for writes |
| State | Name | Cache Line Status | Memory Status |
|---|---|---|---|
| M | Modified | Data is modified (dirty); only in this cache | Stale (out of date) |
| E | Exclusive | Clean copy; only in this cache; can write without bus transaction | Up to date |
| S | Shared | Clean copy; may exist in other caches | Up to date |
| I | Invalid | Line is not valid; must fetch on next access | — |
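A simplified transition table for the four states above, split into local CPU events and snooped bus events (a sketch only; real protocols carry extra bus signals this table omits):

```python
# (current state, event) -> next state, for a single cache line
MESI = {
    ('I', 'read_miss_shared'):    'S',  # another cache already holds a copy
    ('I', 'read_miss_exclusive'): 'E',  # no other copies exist
    ('I', 'write'):               'M',  # read-for-ownership, then modify
    ('E', 'write'):               'M',  # silent upgrade, no bus transaction
    ('S', 'write'):               'M',  # must invalidate other copies first
    ('M', 'snoop_read'):          'S',  # supply dirty data, demote to shared
    ('E', 'snoop_read'):          'S',
    ('M', 'snoop_write'):         'I',  # another CPU takes ownership
    ('E', 'snoop_write'):         'I',
    ('S', 'snoop_write'):         'I',
}

state = 'I'
for event in ['read_miss_shared', 'write', 'snoop_read', 'snoop_write']:
    state = MESI[(state, event)]
# walks I -> S -> M -> S -> I
```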
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF CACHE MISSES (3C Model) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┬──────────────────────────────────────────┐ │
│ │ Compulsory │ First access to a block (cold start) │ │
│ │ (Cold Miss) │ Cannot avoid — every block must be loaded │ │
│ │ │ Reduction: larger block size, prefetching │ │
│ ├──────────────┼──────────────────────────────────────────┤ │
│ │ Capacity │ Cache is too small; blocks evicted before │ │
│ │ Miss │ reuse │ │
│ │ │ Reduction: larger cache size │ │
│ ├──────────────┼──────────────────────────────────────────┤ │
│ │ Conflict │ Multiple memory blocks map to same cache │ │
│ │ (Collision) │ line/set; causes eviction of useful data │ │
│ │ Miss │ Reduction: higher associativity, different │ │
│ │ │ mapping, victim cache │ │
│ └──────────────┴──────────────────────────────────────────┘ │
│ │
│ (Some add a 4th C: Coherency miss — due to cache coherence │
│ protocol invalidations in multiprocessor systems) │
│ │
│ MISS RATE TRENDS: │
│ ┌─────────────────┬──────┬──────┬──────┐ │
│ │ Cache Size │ 1KB │ 4KB │ 64KB │ │
│ │ Direct Map │ 20% │ 12% │ 3% │ │
│ │ 4-way Set Assoc │ 15% │ 8% │ 2% │ │
│ │ Fully Assoc │ 12% │ 6% │ 1% │ │
│ └─────────────────┴──────┴──────┴──────┘ │
│ (Approximate; actual rates depend on workload) │
└─────────────────────────────────────────────────────────────────┘

| Level | Technology | Typical Size | Access Time | Cost/GB | Managed By |
|---|---|---|---|---|---|
| Registers | SRAM (on CPU) | 1-4 KB | < 1 ns | Highest | Compiler |
| L1 Cache | SRAM | 32-64 KB | 1-2 ns | Very High | Hardware |
| L2 Cache | SRAM | 256 KB - 1 MB | 3-10 ns | High | Hardware |
| L3 Cache | SRAM | 4-32 MB | 10-30 ns | Moderate | Hardware |
| Main Memory (RAM) | DRAM | 4-64 GB | 50-100 ns | Low | OS |
| SSD | NAND Flash | 256 GB - 4 TB | 50-250 μs | Very Low | OS/Controller |
| HDD | Magnetic | 1-20 TB | 5-10 ms | Lowest | OS/Controller |
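The access-time gaps in the table above are what the AMAT recursion from the cache-performance box quantifies; a sketch using the earlier example numbers (L1 = 2 ns at 90% hits, L2 = 10 ns at 80% local hits, memory = 100 ns):

```python
def amat(levels, t_mem):
    """Average memory access time.
    levels = [(hit_rate, access_time), ...] ordered fastest to slowest."""
    miss_path = t_mem                      # cost once every level has missed
    for hit, t in reversed(levels):        # fold levels back toward the CPU
        miss_path = hit * t + (1 - hit) * miss_path
    return miss_path

amat([(0.90, 2), (0.80, 10)], 100)         # ~4.6 ns, matching the worked example
```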
| Property | SRAM (Static) | DRAM (Dynamic) |
|---|---|---|
| Storage Element | 6 transistors per cell | 1 transistor + 1 capacitor |
| Refresh Required | No (static) | Yes (every few ms) |
| Speed | Faster (1-10 ns) | Slower (50-100 ns) |
| Density | Lower (fewer bits per area) | Higher (more bits per area) |
| Cost | More expensive | Less expensive |
| Power | Low (no refresh) | Higher (refresh circuitry) |
| Used In | Cache memory (L1, L2, L3) | Main memory (RAM) |
| Volatile | Yes | Yes |
| Concept | Description |
|---|---|
| Virtual Address | Address generated by CPU; what the program sees |
| Physical Address | Actual address on the memory bus |
| Page | Fixed-size block of virtual memory (typically 4 KB) |
| Frame | Fixed-size block of physical memory (same size as page) |
| Page Table | Maps virtual page numbers to physical frame numbers |
| TLB | Translation Lookaside Buffer; cache for page table entries |
| Page Fault | Access to a page not in physical memory; OS must load it from disk |
| Thrashing | Excessive paging; system spends more time paging than executing |
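For the 32-bit / 4 KB-page configuration in the terms above, translation is a split, a lookup, and a recombine (sketch; the page table here is a plain dict standing in for the real structure):

```python
PAGE_BITS = 12                       # 4 KB pages

def translate(vaddr, page_table):
    """Map a virtual address to a physical one via a VPN -> frame dict."""
    vpn    = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault on VPN {vpn:#x}")  # OS would load the page
    frame = page_table[vpn]
    return (frame << PAGE_BITS) | offset                  # concatenation, not addition

pt = {0x12345: 0x00042}
translate(0x12345678, pt)            # 0x42678: offset 0x678 passes through
```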
┌─────────────────────────────────────────────────────────────────┐
│ VIRTUAL ADDRESS TRANSLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Virtual Address Structure (32-bit, 4 KB page): │
│ ┌──────────────────────┬────────────────┐ │
│ │ Virtual Page No. │ Page Offset │ │
│ │ (20 bits) │ (12 bits) │ │
│ └──────────────────────┴────────────────┘ │
│ │
│ Page Table Entry (PTE): │
│ ┌──────────────────────┬─────┬─────┬─────┬─────┐ │
│ │ Frame Number (20) │ V │ D │ R │ W │ │
│ └──────────────────────┴─────┴─────┴─────┴─────┘ │
│ V = Valid, D = Dirty, R = Referenced, W = Write-protected │
│ │
│ Translation Process: │
│ 1. CPU generates virtual address │
│ 2. Split into VPN + Offset │
│ 3. Check TLB for VPN → Frame mapping │
│ ├─ TLB Hit: Get frame number from TLB → Physical address │
│ └─ TLB Miss: Walk page table in memory │
│ ├─ Valid = 1: Access frame, update TLB │
│ └─ Valid = 0: Page Fault → OS handles (load from disk) │
│ │
│  Physical Address = Frame Number concatenated with Page Offset  │
│ ┌──────────────────────┬────────────────┐ │
│ │ Physical Frame No. │ Page Offset │ │
│ └──────────────────────┴────────────────┘ │
│ │
│ MULTILEVEL PAGE TABLE (to reduce memory overhead): │
│ Level 1: Page Directory (outer) → Level 2: Page Table (inner) │
│ For 32-bit with 4KB pages: 10-bit dir + 10-bit page + 12 off │
│  For 64-bit with 4 levels: 9 bits per level (512 entries);      │
│  4 × 9 + 12 = 48-bit virtual address                            │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TLB & MEMORY HIERARCHY PERFORMANCE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TLB (Translation Lookaside Buffer): │
│ - Cache that stores recent virtual → physical translations │
│ - Typically 64-512 entries, fully associative or set-assoc │
│ - TLB hit: ~1 cycle (no page table walk needed) │
│ - TLB miss: ~10-30 cycles (walk page table in memory) │
│ │
│ EFFECTIVE ACCESS TIME WITH TLB: │
│ ┌───────────────────────────────────────────────────┐ │
│ │ EAT = h_tlb × t_tlb │ │
│ │ + (1 - h_tlb) × [ │ │
│ │ h_page × (t_ptw + h_cache × t_c + │ │
│ │ (1-h_cache) × t_mem) │ │
│ │ + (1-h_page) × t_pagefault │ │
│ │ ] │ │
│ └───────────────────────────────────────────────────┘ │
│ │
│ WORKED EXAMPLE: │
│ TLB hit rate = 95% │
│ TLB access time = 2 ns │
│ Page table access = 20 ns (in memory) │
│ Cache hit rate = 90% (after TLB miss, PTE in cache) │
│    Cache access time = 5 ns                                     │
│    Main memory access time = 100 ns                             │
│ Page fault rate = 0.1% │
│ Page fault service time = 8 ms = 8,000,000 ns │
│ │
│ TLB hit path = 2 ns │
│ TLB miss, cache hit (PTE) = 20 + 5 = 25 ns │
│ TLB miss, cache miss (PTE) = 20 + 100 = 120 ns │
│ Page fault = 8,000,000 ns │
│ │
│  On a TLB miss: P(PTE in cache) = 0.90,                         │
│  P(cache miss, no fault) = 0.10 - 0.001 = 0.099,                │
│  P(page fault) = 0.001 (the three paths sum to 1)               │
│ │
│ EAT = 0.95 × 2 │
│ + 0.05 × [0.90 × 25 + 0.099 × 120 + 0.001 × 8000000] │
│ = 1.9 + 0.05 × [22.5 + 11.88 + 8000] │
│ = 1.9 + 0.05 × 8034.38 │
│ = 1.9 + 401.72 │
│ = 403.62 ns │
│ │
│ (Note: Page faults dominate! TLB and cache improvements are │
│ negligible when page fault rate is significant.) │
│ │
│ INTERLEAVED MEMORY: │
│ - Memory divided into multiple modules (banks) │
│ - Consecutive addresses placed in different banks │
│  - K banks → effective cycle time ≈ (bank cycle time) / K for   │
│    sequential accesses; single-access latency is unchanged      │
│ - Improves bandwidth by overlapping memory accesses │
│ - Example: 4 banks, 100ns each → effective = 100ns for first, │
│ then 25ns for each subsequent word (pipeline) │
└─────────────────────────────────────────────────────────────────┘

| Component | Operation | Logic |
|---|---|---|
| Arithmetic | ADD, SUB, INC, DEC | Full adders chained; SUB via 2's complement addition |
| Logical | AND, OR, XOR, NOT | Bitwise gates on each bit position |
| Shift | SHL, SHR, ROT | Barrel shifter for O(1) shifts of any amount |
| Comparison | LT, GT, EQ, NE | Subtract + check flags (zero, sign, overflow) |
| Feature | Hardwired | Microprogrammed |
|---|---|---|
| Speed | Fast (combinational logic) | Slower (microcode ROM lookup) |
| Flexibility | Fixed; difficult to modify | Flexible; change microcode for new instructions |
| Complexity | Complex for large ISAs | Simpler hardware; complex for microcode |
| Design Time | Long | Short |
| Cost | Higher for complex ISAs | Lower (uses ROM) |
| Instruction Set | RISC typically | CISC typically (x86 uses hybrid) |
| Debugging | Harder to debug | Easier (microcode is software-like) |
| Examples | RISC-V, MIPS, early RISC | x86 (partially), VAX, IBM System/360 |
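Microprogrammed control sequences elementary register transfers; the microoperation notation that follows behaves like simple assignments over a register file (a sketch with an 8-bit register width assumed):

```python
regs = {'R1': 6, 'R2': 3, 'R3': 0}

# R3 <- R1 + R2   (add microoperation)
regs['R3'] = regs['R1'] + regs['R2']

# R2 <- shl R1    (logical shift left by 1; mask keeps the 8-bit width)
regs['R2'] = (regs['R1'] << 1) & 0xFF

# R1 <- R1 + 1    (increment microoperation)
regs['R1'] += 1

# regs is now {'R1': 7, 'R2': 12, 'R3': 9}
```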
┌─────────────────────────────────────────────────────────────────┐
│ MICROOPERATIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ A microoperation is an elementary operation performed on data │
│ stored in registers. Each instruction = sequence of microops. │
│ │
│ NOTATION: R1 ← R1 + R2 (register transfer) │
│ │
│ REGISTER TRANSFER MICROOPERATIONS: │
│ ┌─────────────────────────┬──────────────────────────────────┐ │
│ │ Transfer: R2 ← R1 │ Copy contents of R1 to R2 │ │
│ │ Add: R3 ← R1 + R2 │ R3 = sum of R1 and R2 │ │
│ │ Sub: R3 ← R1 - R2 │ R3 = R1 minus R2 │ │
│ │ AND: R3 ← R1 ∧ R2 │ Bitwise AND │ │
│ │ OR: R3 ← R1 ∨ R2 │ Bitwise OR │ │
│ │ XOR: R3 ← R1 ⊕ R2 │ Bitwise XOR │
│ │ NOT: R2 ← R1' │ Complement of R1 │ │
│ │ Shift L: R2 ← shl R1 │ Shift R1 left by 1 │ │
│ │ Shift R: R2 ← shr R1 │ Shift R1 right by 1 │ │
│ │ Increment: R1 ← R1 + 1 │ Add 1 to R1 │ │
│ │ Clear: R1 ← 0 │ Set R1 to zero │ │
│ └─────────────────────────┴──────────────────────────────────┘ │
│ │
│ BUS TRANSFER: │
│ When a system bus connects registers: │
│ BUS ← R1 ; R1 places data on bus │
│ R2 ← BUS ; R2 captures data from bus │
│ Combined: R2 ← BUS, BUS ← R1 (requires 2 control signals) │
│ │
│ MEMORY TRANSFER: │
│ MAR ← address ; Load Memory Address Register │
│ MDR ← M[MAR] ; Read from memory to Memory Data Register │
│ R1 ← MDR ; Copy MDR to R1 │
│ Combined: Read(R1, addr) │
│ │
│ MDR ← R1 ; Load data to write │
│ MAR ← address ; Set address │
│ M[MAR] ← MDR ; Write MDR to memory │
│ Combined: Write(R1, addr) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ BUS ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SINGLE BUS ARCHITECTURE: │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
│ │CPU │──│ ALU│──│Reg │──│ I/O│──│Mem │──│DMA │ │
│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
│ ═════════════ Shared System Bus ═════════════ │
│ │
│ Pros: Simple, low cost │
│ Cons: Bottleneck — only one transfer at a time │
│ │
│ ───────────────────────────────────────────── │
│ MULTI-BUS ARCHITECTURE: │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ CPU │────▶│ System Bus │────▶│ Memory │ │
│ │ (with │ │ (address + │ │ │ │
│ │ cache) │ │ data + ctrl)│ └──────────┘ │
│ └──────────┘ └──────┬───────┘ │
│ │ │
│ ┌──────▼───────┐ ┌──────────┐ │
│ │ I/O Bus │────▶│ I/O │ │
│ │ (PCI/PCIe) │ │ Devices │ │
│ └─────────────┘ └──────────┘ │
│ │
│ Bus Types: │
│ ┌──────────────────┬────────────────────────────────────┐ │
│ │ Data Bus │ Carries data (bidirectional) │ │
│ │ Address Bus │ Carries address (unidirectional, │ │
│ │ │ CPU → memory/I/O) │ │
│ │ Control Bus │ Carries signals: read/write, clock, │ │
│ │ │ interrupt, bus request/grant │ │
│ └──────────────────┴────────────────────────────────────┘ │
│ │
│ Bus Arbitration: │
│ • Centralized: Single arbiter (daisy chain, priority) │
│ • Distributed: Each device has arbiter logic │
│ • Protocols: Request → Grant → Transfer → Release │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ I/O ORGANIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. PROGRAMMED I/O (Polling): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CPU continuously checks I/O device status register │ │
│ │ while (status != READY) { check_status(); } │ │
│ │ data = read_port(); │ │
│ │ │ │
│ │ Pros: Simple hardware │ │
│ │ Cons: CPU wastes cycles polling; slow for I/O bound │ │
│ │ Best for: Simple embedded systems, infrequent I/O │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 2. INTERRUPT-DRIVEN I/O: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CPU starts I/O operation, then continues other work │ │
│ │ Device sends interrupt when data is ready │ │
│ │ │ │
│ │ Flow: │ │
│ │ CPU → Issue I/O command → Continue execution │ │
│ │ Device → Complete operation → Send IRQ │ │
│ │ CPU → Save context → Jump to ISR → Handle data │ │
│ │ CPU → Restore context → Resume execution │ │
│ │ │ │
│ │ Pros: CPU free during I/O wait; efficient │ │
│ │ Cons: Context switch overhead per interrupt │ │
│ │ Best for: General-purpose computing │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 3. DMA (Direct Memory Access): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ DMA controller transfers data directly between │ │
│ │ I/O device and memory without CPU involvement │ │
│ │ │ │
│ │ CPU sets up DMA: │ │
│ │ - Source address (I/O port or memory) │ │
│ │ - Destination address (memory or I/O port) │ │
│ │ - Transfer count (number of words/bytes) │ │
│ │ - Direction (read/write) │ │
│ │ │ │
│ │ DMA operates in cycle stealing: │ │
│ │ When DMA needs bus, it requests it from CPU │ │
│ │ CPU pauses for 1 cycle, DMA transfers 1 word │ │
│ │ CPU resumes; cycle repeated until transfer done │ │
│ │ │ │
│ │ Pros: Fastest; CPU free; bulk transfer capability │ │
│ │ Cons: Complex hardware; bus contention │ │
│ │ Best for: Disk I/O, network, display, bulk transfer │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ COMPARISON: │
│ ┌──────────────┬──────────┬──────────┬──────────┐ │
│ │ │ CPU Load │ Speed │ Hardware │ │
│ │ Programmed │ 100% │ Slowest │ Simplest │ │
│ │ Interrupt │ Low │ Medium │ Moderate │ │
│ │ DMA │ Minimal │ Fastest │ Complex │ │
│ └──────────────┴──────────┴──────────┴──────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Classification | Instruction Stream | Data Stream | Example |
|---|---|---|---|
| SISD | Single | Single | Traditional uniprocessor (Pentium 1) |
| SIMD | Single | Multiple | Vector processors, GPUs, Intel SSE/AVX |
| MISD | Multiple | Single | Rare; fault-tolerant systems, some crypto |
| MIMD | Multiple | Multiple | Multi-core CPUs, distributed systems, clusters |
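The SISD vs SIMD distinction in Flynn's taxonomy can be sketched as a toy model in plain Python; the lane-group loop stands in for a hardware vector instruction, and all names here are illustrative:

```python
# SISD: one instruction processes one data element per step.
# SIMD: one (conceptual) instruction processes a whole lane group in lock-step.

def sisd_add(a, b):
    """SISD: one add instruction per element, executed sequentially."""
    out = []
    for x, y in zip(a, b):          # each iteration = one instruction, one datum
        out.append(x + y)
    return out

def simd_add(a, b, lanes=4):
    """SIMD: each 'vector add' consumes `lanes` elements at once."""
    out = []
    for i in range(0, len(a), lanes):
        # a single conceptual vector instruction over one lane group
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

a, b = list(range(8)), [10] * 8
assert sisd_add(a, b) == simd_add(a, b) == [10, 11, 12, 13, 14, 15, 16, 17]
```

With 4 lanes, `simd_add` issues 2 conceptual instructions for 8 elements where `sisd_add` issues 8 — the same result with a quarter of the instruction stream, which is exactly why SSE/AVX/NEON pay off on array-heavy code.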
| Type | Full Form | Memory Access | Latency | Example |
|---|---|---|---|---|
| UMA | Uniform Memory Access | All processors see same memory latency | Uniform | Symmetric multiprocessing (SMP) |
| NUMA | Non-Uniform Memory Access | Local memory faster than remote | Variable | AMD Epyc, modern servers |
┌─────────────────────────────────────────────────────────────────┐
│ AMDAHL'S LAW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "The speedup of a parallel system is limited by the │
│ sequential portion of the program." │
│ │
│ FORMULA: │
│ │
│ 1 │
│ S = ───────────────────── │
│ (1 - P) + (P / N) │
│ │
│ Where: │
│ S = Speedup (ratio of original to parallel time) │
│ P = Fraction of program that can be parallelized (0 ≤ P ≤ 1)│
│ N = Number of processors (cores) │
│ │
│ Maximum speedup (N → ∞): │
│ │
│ 1 1 │
│ Smax = ───── = ───────── │
│ 1 - P sequential_fraction │
│ │
│ ═══════════════════════════════════════════════════════ │
│ WORKED EXAMPLE 1: │
│ ───────────────── │
│ Program has 60% parallelizable code (P = 0.6), 4 processors: │
│ │
│ S = 1 / [(1 - 0.6) + (0.6 / 4)] │
│ S = 1 / [0.4 + 0.15] │
│ S = 1 / 0.55 = 1.82x │
│ │
│ With 8 processors: │
│ S = 1 / [0.4 + 0.075] = 1 / 0.475 = 2.11x │
│ │
│ With 16 processors: │
│ S = 1 / [0.4 + 0.0375] = 1 / 0.4375 = 2.29x │
│ │
│ With ∞ processors: │
│ Smax = 1 / 0.4 = 2.5x │
│ → Adding more processors gives diminishing returns! │
│ │
│ ═══════════════════════════════════════════════════════ │
│ WORKED EXAMPLE 2: │
│ ───────────────── │
│ Program has 90% parallelizable code (P = 0.9), 8 processors: │
│ │
│ S = 1 / [(1 - 0.9) + (0.9 / 8)] │
│ S = 1 / [0.1 + 0.1125] │
│ S = 1 / 0.2125 = 4.71x │
│ │
│ With 64 processors: │
│ S = 1 / [0.1 + 0.0141] = 1 / 0.1141 = 8.77x │
│ │
│ With ∞ processors: │
│ Smax = 1 / 0.1 = 10x │
│ │
│ KEY INSIGHT: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ If P = 0.5 → max speedup = 2x regardless of cores! │ │
│ │ If P = 0.95 → max speedup = 20x │ │
│ │ If P = 0.99 → max speedup = 100x │ │
│ │ The sequential fraction is the ultimate bottleneck! │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF PARALLELISM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. INSTRUCTION-LEVEL PARALLELISM (ILP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Superscalar: Multiple instructions issued per clock│ │
│ │ (e.g., 4-wide: 4 instructions/clock) │ │
│ │ • Out-of-Order (OoO): Execution engine reorders │ │
│ │ instructions to avoid stalls │ │
│ │ • VLIW: Compiler schedules instructions (Itanium) │ │
│ │ • Speculative execution: Execute before knowing if │ │
│ │ branch taken; squash if wrong │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. DATA-LEVEL PARALLELISM (DLP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • SIMD: Single instruction operates on multiple data│ │
│ │ • Vector processors: Operate on arrays/vectors │ │
│ │ • GPU: Thousands of cores doing same operation │ │
│ │ on different data (shader programs) │ │
│ │ • Examples: SSE, AVX, NEON, CUDA │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 3. THREAD-LEVEL PARALLELISM (TLP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Multi-core: Multiple CPU cores on one chip │ │
│ │ • Multi-threading: Hardware threads (SMT/HT) │ │
│ │ • Simultaneous Multithreading (SMT): │ │
│ │ Intel Hyper-Threading = 2 threads per core │ │
│ │ Shares ALU, caches; duplicating registers/PC │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ PIPELINING vs PARALLEL PROCESSING: │
│ ┌────────────────────┬─────────────────────────────────┐ │
│ │ Pipelining │ Overlap stages of DIFFERENT │ │
│ │ │ instructions (temporal parallelism)│ │
│ │ Parallel Proc │ Execute MULTIPLE instructions │ │
│ │ │ simultaneously (spatial parallel) │ │
│ │ Combination │ Superscalar pipeline: pipeline + │ │
│ │ │ multiple execution units │ │
│ └────────────────────┴─────────────────────────────────┘ │
│ │
│ SPEEDUP COMPARISON: │
│ Pipeline (5 stages, 100 instr): S = 500/104 = 4.81x │
│ Parallel (4 cores, 100 instr): S = 400/100 = 4.0x │
│ Pipeline + Parallel: S ≈ 4.81 × 4 = 19.24x (ideal) │
└─────────────────────────────────────────────────────────────────┘
The maximum speedup is 1 / (1 - P). For P = 0.95, the maximum speedup is only 20x no matter how many cores you add. This is why algorithm efficiency matters: parallelism cannot overcome an inherently sequential algorithm.
RISC (Reduced Instruction Set Computer) uses a small set of simple, uniform instructions that each execute in one clock cycle. It relies on a large register file (32+ registers) and on the compiler to optimize instruction scheduling. Examples: ARM (mobile phones, Apple Silicon M1/M2), RISC-V (open-source, growing adoption), MIPS (networking, embedded). CISC (Complex Instruction Set Computer) has many complex instructions that can operate directly on memory, with variable instruction lengths. It uses microcode to decode complex instructions into simpler micro-operations internally. Example: x86/x64 (Intel, AMD desktop/server CPUs). Modern x86 CPUs are actually hybrids: they decode CISC instructions into RISC-like micro-ops (μops) before execution. Key trade-off: RISC is simpler, faster per instruction, and more power-efficient; CISC has better code density (fewer bytes per instruction) and backward compatibility.
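The "decode CISC into RISC-like micro-ops" idea can be sketched as a toy decode table; the mnemonics and μop splits below are purely illustrative, not real x86 decode rules:

```python
# Toy model of the modern-x86 hybrid approach: a CISC-style instruction
# with a memory operand is broken into RISC-like micro-ops (load/ALU/store),
# while a pure register instruction maps 1:1.

DECODE = {
    # CISC "ADD reg, [mem]" -> load the operand, then a register-only add
    "ADD_R_M": ["LOAD tmp, [mem]", "ADD reg, reg, tmp"],
    # CISC "ADD [mem], reg" -> load, add, store back (read-modify-write)
    "ADD_M_R": ["LOAD tmp, [mem]", "ADD tmp, tmp, reg", "STORE [mem], tmp"],
    # Register-to-register op needs no memory micro-ops
    "ADD_R_R": ["ADD reg, reg, reg2"],
}

def decode(instr):
    """Return the micro-op sequence for one CISC-style instruction."""
    return DECODE[instr]

print(decode("ADD_M_R"))       # 3 micro-ops: load, add, store
print(len(decode("ADD_R_R")))  # 1
```

The back end only ever sees the simple, uniform micro-ops, which is how x86 keeps CISC code density and backward compatibility while executing on RISC-style pipelines.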
Step 1: Calculate fields. Block size = 16 bytes → offset = log₂(16) = 4 bits. Number of sets = 64/4 = 16 sets → set index = log₂(16) = 4 bits. Tag = 32 - 4 - 4 = 24 bits. Address format: [Tag:24][Set:4][Offset:4]. Step 2: Map addresses. The set index is bits [7:4] of the address: 0x00 → Set 0, Tag 0; 0x10 → Set 1; 0x20 → Set 2; 0x30 → Set 3; 0x00 → Set 0 again. Each address starts a distinct 16-byte block (0x00-0x0F, 0x10-0x1F, 0x20-0x2F, 0x30-0x3F), so no two accesses share a cache line. Step 3: Access pattern: 0x00 = MISS (cold), 0x10 = MISS (cold), 0x20 = MISS (cold), 0x30 = MISS (cold), 0x00 = HIT (its block is still cached in Set 0). Hit ratio = 1/5 = 20%.
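The field split and hit count can be sketched as a few lines of Python, assuming the 4-bit offset / 4-bit index / 24-bit tag breakdown above; the tiny 5-access trace never fills a set, so this sketch ignores evictions:

```python
# Address breakdown: [Tag:24][Set:4][Offset:4] for a 32-bit address.
OFFSET_BITS = 4   # 16-byte blocks
INDEX_BITS = 4    # 16 sets

def split_address(addr):
    """Return (tag, set index, block offset) for one address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def hit_ratio(accesses):
    """Count hits over a short trace (no evictions occur in 5 accesses)."""
    cache = {}  # set index -> set of tags currently resident
    hits = 0
    for addr in accesses:
        tag, index, _ = split_address(addr)
        if tag in cache.get(index, set()):
            hits += 1
        else:
            cache.setdefault(index, set()).add(tag)
    return hits / len(accesses)

trace = [0x00, 0x10, 0x20, 0x30, 0x00]
print([split_address(a)[1] for a in trace])  # set indices: [0, 1, 2, 3, 0]
print(hit_ratio(trace))                      # 0.2
```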
Pipeline hazards are situations that prevent the next instruction from executing in its designated clock cycle. Three types: (1) Data hazards — RAW (Read After Write) is the most common: e.g., ADD R1,R2,R3 followed by SUB R4,R1,R5 — the ADD hasn't written R1 yet when SUB needs it. Solutions: data forwarding/bypassing (route the ALU result directly to the next instruction's input), stall insertion (insert NOP bubbles), compiler scheduling (reorder independent instructions). (2) Control hazards — branches change the PC before the pipeline knows the outcome. Solutions: branch prediction (static: predict not-taken; dynamic: 2-bit saturating counter, BTB), delayed branch (fill the branch delay slot with a useful instruction). (3) Structural hazards — two instructions need the same hardware unit in the same cycle. Solutions: resource duplication (e.g., Harvard architecture with separate instruction/data memories), pipeline stall.
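The RAW case above can be sketched as a toy hazard check. This assumes the classic 5-stage pipeline where the register file is written in the first half of a cycle and read in the second, so a back-to-back dependent pair costs 2 bubbles without forwarding and 0 with ALU-to-ALU forwarding; instruction encoding and names are illustrative:

```python
# Instructions are modeled as (dest, src1, src2) register tuples, e.g.
# ADD R1, R2, R3  ->  ("R1", "R2", "R3")

def raw_hazard(producer, consumer):
    """True if `consumer` reads a register `producer` has not yet written back."""
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

def stall_cycles(producer, consumer, forwarding):
    """Bubbles between back-to-back dependent ALU instructions:
    2 without forwarding (wait for the split-cycle register write),
    0 with forwarding."""
    if not raw_hazard(producer, consumer):
        return 0
    return 0 if forwarding else 2

add_i = ("R1", "R2", "R3")   # ADD R1, R2, R3
sub_i = ("R4", "R1", "R5")   # SUB R4, R1, R5  (reads R1 -> RAW hazard)
print(stall_cycles(add_i, sub_i, forwarding=False))  # 2
print(stall_cycles(add_i, sub_i, forwarding=True))   # 0
```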
Step 1: Sign = negative → S = 1. Step 2: Convert 6.5 to binary: 6 = 110₂; 0.5 = 0.1₂ → 6.5 = 110.1₂. Step 3: Normalize: 110.1 = 1.101 × 2². Step 4: Exponent = 2 + 127 = 129 = 10000001₂. Step 5: Mantissa = 10100000000000000000000 (take the fraction after the implicit 1, pad to 23 bits). Result: 1 10000001 10100000000000000000000. Hex: C0D00000. Verification: (-1)¹ × 1.101₂ × 2^(129-127) = -1 × (1 + 0.5 + 0.125) × 4 = -1 × 1.625 × 4 = -6.5 ✓. Special cases to mention: +0 = 00000000 (all zeros), -0 = 80000000 (sign bit set), +Infinity = 7F800000 (E = 255, M = 0), NaN = 7FC00000 (E = 255, M ≠ 0).
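The hand encoding can be cross-checked with Python's `struct` packing of IEEE 754 single precision:

```python
import struct

# Pack -6.5 as big-endian binary32, then reinterpret the 4 bytes as an int.
bits = struct.unpack(">I", struct.pack(">f", -6.5))[0]
print(f"{bits:08X}")             # C0D00000

sign = bits >> 31                # 1  (negative)
exponent = (bits >> 23) & 0xFF   # 129 = 2 + bias 127
mantissa = bits & 0x7FFFFF       # 0b101 followed by 20 zeros

# Reconstruct the value: (-1)^S x 1.M x 2^(E-127)
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
assert value == -6.5
```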
Part 1: Speedup with 4 processors. P = 0.70, N = 4. S = 1 / [(1 - 0.70) + (0.70 / 4)] = 1 / [0.30 + 0.175] = 1 / 0.475 = 2.105x. New execution time = 100 / 2.105 = 47.5 seconds. Part 2: Processors for 2x speedup. S = 2, P = 0.70. 2 = 1 / [0.30 + 0.70/N] → 0.30 + 0.70/N = 0.50 → 0.70/N = 0.20 → N = 0.70/0.20 = 3.5 → 4 processors (round up, since you can't have a fractional processor; with N = 3: S = 1 / [0.30 + 0.233] = 1.88x < 2x). Maximum theoretical speedup (N → ∞) = 1/0.30 = 3.33x. This means no matter how many cores you add, the program will never be more than 3.33x faster — the 30% sequential portion is the hard limit. Interview tip: Always mention Amdahl's Law when discussing multicore scaling, and its implication that improving the sequential fraction (reducing 1 - P) is often more effective than adding cores.
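The arithmetic in these worked examples can be reproduced with a small helper (function names are illustrative):

```python
# Amdahl's law: speedup for parallel fraction p on n processors.

def speedup(p, n):
    """S = 1 / ((1 - p) + p / n)."""
    return 1 / ((1 - p) + p / n)

def max_speedup(p):
    """Limit as n -> infinity: 1 / (1 - p)."""
    return 1 / (1 - p)

print(round(speedup(0.70, 4), 3))   # 2.105 -> the 100 s job drops to ~47.5 s
print(round(max_speedup(0.70), 2))  # 3.33  -> hard ceiling from the 30% serial part

# Smallest processor count reaching a 2x target for p = 0.70:
n = next(k for k in range(1, 100) if speedup(0.70, k) >= 2)
print(n)                            # 4
```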