Number systems, instruction cycles, pipelining, memory hierarchy and processor architecture essentials.
| Representation | Range (8-bit) | Description |
|---|---|---|
| Unsigned | 0 to 255 | No sign bit; all bits represent magnitude |
| Sign-Magnitude | -127 to +127 | MSB = sign (0=positive, 1=negative); two representations of zero (+0, -0) |
| 1's Complement | -127 to +127 | Invert all bits for negative; two representations of zero |
| 2's Complement | -128 to +127 | Invert all bits + 1 for negative; single zero; most widely used |
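The asymmetric 2's-complement range and the invert-and-add-one rule above can be checked with a short sketch (Python, 8-bit width assumed; function names are illustrative):

```python
def to_twos(value, bits=8):
    """Encode a signed integer into its two's-complement bit pattern."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise OverflowError(f"{value} not representable in {bits} bits")
    return value & ((1 << bits) - 1)   # e.g. -25 -> 256 - 25 = 231

def from_twos(pattern, bits=8):
    """Decode a two's-complement bit pattern back to a signed integer."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern

to_twos(-25)            # 231 = 0b11100111 (the 2^n - 25 shortcut)
from_twos(0b11100111)   # -25
```

Note the masking in `to_twos` is exactly the "subtract from 2^n" shortcut: `-25 & 0xFF` equals `256 - 25`.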
┌─────────────────────────────────────────────────────────────────┐
│ 2'S COMPLEMENT CONVERSION STEPS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Positive to Negative (e.g., +25 → 8-bit 2's complement): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Step 1: Write binary of +25 = 00011001 │ │
│ │ Step 2: Invert all bits = 11100110 (1's comp) │ │
│ │ Step 3: Add 1 = 11100111 (2's comp) │ │
│ │ -25 in 2's complement = 11100111 │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Negative to Positive (e.g., 11100111 → ?): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Step 1: Identify as negative (MSB = 1) │ │
│ │ Step 2: Invert all bits = 00011000 │ │
│ │ Step 3: Add 1 = 00011001 = +25 │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Shortcut: Subtract from 2^n │
│ -25 = 256 - 25 = 231 = 11100111 ✓ │
└─────────────────────────────────────────────────────────────────┘

| Field | Bits | Description |
|---|---|---|
| Sign (S) | 1 bit | 0 = positive, 1 = negative |
| Exponent (E) | 8 bits | Biased exponent; bias = 127 |
| Mantissa (M) | 23 bits | Fractional part; implicit leading 1 |
| Field | Bits | Description |
|---|---|---|
| Sign (S) | 1 bit | 0 = positive, 1 = negative |
| Exponent (E) | 11 bits | Biased exponent; bias = 1023 |
| Mantissa (M) | 52 bits | Fractional part; implicit leading 1 |
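The single-precision field layout above can be cross-checked with Python's `struct` module, which packs native IEEE 754 values (a small sketch; `float_fields` is an illustrative helper):

```python
import struct

def float_fields(x):
    """Split a value's single-precision encoding into (sign, exponent, mantissa)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8 biased exponent bits
    mantissa = bits & 0x7FFFFF            # 23 fraction bits
    return sign, exponent, mantissa

float_fields(-13.375)
# (1, 130, 0b10101100000000000000000): sign 1, exponent 3 + bias 127
struct.pack('>f', -13.375).hex()          # 'c1560000'
```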
┌─────────────────────────────────────────────────────────────────┐
│ IEEE 754 SINGLE PRECISION — WORKED EXAMPLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Convert -13.375 to IEEE 754 Single Precision: │
│ │
│ Step 1: Sign bit │
│ -13.375 is negative → S = 1 │
│ │
│ Step 2: Convert integer part (13) to binary │
│ 13 = 8 + 4 + 1 = 1101₂ │
│ │
│ Step 3: Convert fractional part (0.375) to binary │
│ 0.375 × 2 = 0.75 → 0 │
│ 0.75 × 2 = 1.5 → 1 │
│ 0.5 × 2 = 1.0 → 1 │
│ 0.375 = 0.011₂ │
│ │
│ Step 4: Combine = 1101.011 │
│ │
│ Step 5: Normalize (scientific notation in binary) │
│ 1101.011 = 1.101011 × 2^3 │
│ Mantissa (M) = 10101100000000000000000 (23 bits, pad with 0) │
│ Exponent = 3 + 127 = 130 = 10000010₂ │
│ │
│ Result: 1 10000010 10101100000000000000000 │
│ S E(8) M(23) │
│ │
│ Hex: C1560000 │
│ Verify: (-1)^1 × 1.101011 × 2^(130-127) │
│        = -1 × (1 + 0.5 + 0.125 + 0.03125 + 0.015625) × 8        │
│ = -1 × 1.671875 × 8 = -13.375 ✓ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ FLOATING POINT ADDITION / SUBTRACTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Example: Add 1.5 × 2^2 + 1.125 × 2^0 │
│ │
│ Step 1: Equalize exponents │
│ Smaller exponent adjusts to larger: │
│ 1.125 × 2^0 = 0.28125 × 2^2 (shift mantissa right by 2) │
│ │
│ Step 2: Add mantissas │
│ 1.5 + 0.28125 = 1.78125 │
│ Result: 1.78125 × 2^2 = 7.125 │
│ │
│ Step 3: Normalize result │
│ 1.78125 × 2^2 (already normalized, MSB = 1) │
│ │
│ Step 4: Round (if mantissa exceeds 23 bits) │
│ Apply rounding mode (round to nearest, toward zero, etc.) │
│ │
│ ───────────────────────────────────────────────────── │
│ │
│ Example: Subtract 1.5 × 2^2 - 1.375 × 2^2 │
│ │
│ Step 1: Exponents already equal (both 2) │
│ Step 2: Subtract mantissas │
│ 1.5 - 1.375 = 0.125 │
│  Step 3: Normalize → 1.0 × 2^(-1)  (0.125 × 2^2 = 0.5)          │
│  (shifted mantissa left 3 positions, exponent 2 - 3 = -1)       │
│ │
│ ───────────────────────────────────────────────────── │
│ FLOATING POINT MULTIPLICATION: │
│ 1. Add exponents: E = E1 + E2 - bias │
│ 2. Multiply mantissas: M = M1 × M2 │
│ 3. Normalize and round result │
│ 4. Set sign: S = S1 XOR S2 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ OVERFLOW / UNDERFLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ OVERFLOW: Result exceeds maximum representable value │
│ ───────────────────────────────────────────── │
│ 2's complement overflow detection (addition): │
│ • Positive + Positive = Negative → OVERFLOW │
│ • Negative + Negative = Positive → OVERFLOW │
│ • Rule: If carry-in to MSB ≠ carry-out from MSB → overflow │
│ │
│ Examples (8-bit 2's complement): │
│ 01111111 (+127) + 00000001 (+1) = 10000000 (-128) OVERFLOW! │
│ 10000000 (-128) + 10000000 (-128) = 00000000 (0) OVERFLOW! │
│ 01111111 (+127) + 11111111 (-1) = 01111110 (+126) OK │
│ 10000001 (-127) + 11111111 (-1) = 10000000 (-128) OK │
│ │
│ UNDERFLOW: Result is smaller than minimum representable value │
│ ───────────────────────────────────────────── │
│ In floating point: result is too close to zero → flushed to 0 │
│ Example: 1.0 × 2^(-126) × 0.5 × 2^(-126) = 0.5 × 2^(-252) │
│ This exceeds the minimum exponent → gradual/denormalized │
│ or sudden underflow to zero depending on implementation │
└─────────────────────────────────────────────────────────────────┘

The exponent bias is 127 for single precision and 1023 for double precision.

| Format | Address Fields | Example | # Operands in Memory |
|---|---|---|---|
| Zero-Address | 0 (Stack-based) | PUSH A; PUSH B; ADD | 2 (operands on stack) |
| One-Address | 1 (Accumulator) | ADD B (ACC ← ACC + B) | 1 (ACC implicit) |
| Two-Address | 2 (Register) | ADD R1, R2 (R1 ← R1 + R2) | 0 (all registers) |
| Three-Address | 3 (Register) | ADD R1, R2, R3 (R1←R2+R3) | 0 (all registers) |
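The zero-address row above can be made concrete with a tiny stack-machine sketch; `PUSH`/`POP` touch memory while the arithmetic ops work purely on the stack (variable names and the interpreter are illustrative):

```python
def run(program, mem):
    """Interpret a tiny zero-address (stack) instruction sequence."""
    stack = []
    for op, *arg in program:
        if op == 'PUSH':
            stack.append(mem[arg[0]])
        elif op == 'POP':
            mem[arg[0]] = stack.pop()
        else:                        # ADD / SUB / MUL pop two, push one
            b, a = stack.pop(), stack.pop()
            stack.append({'ADD': a + b, 'SUB': a - b, 'MUL': a * b}[op])

mem = {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6}
run([('PUSH', 'A'), ('PUSH', 'B'), ('MUL',),
     ('PUSH', 'C'), ('PUSH', 'D'), ('MUL',), ('ADD',),
     ('PUSH', 'E'), ('SUB',), ('POP', 'Z')], mem)
# mem['Z'] == 2*3 + 4*5 - 6 == 20
```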
┌─────────────────────────────────────────────────────────────────┐
│ INSTRUCTION FORMAT — SIDE BY SIDE COMPARISON │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Evaluate: Z = (A × B) + (C × D) - E │
│ ───────────────────────────────────────────── │
│ │
│  THREE-ADDRESS (4 instructions):                                │
│ MUL T1, A, B ; T1 = A * B │
│ MUL T2, C, D ; T2 = C * D │
│ ADD T3, T1, T2 ; T3 = T1 + T2 │
│ SUB Z, T3, E ; Z = T3 - E │
│  Instructions: 4 | Memory accesses: 5 reads (A-E) + 1 write     │
│ │
│  TWO-ADDRESS (6 instructions):                                  │
│ MOV R1, A ; R1 = A │
│ MUL R1, B ; R1 = A * B │
│ MOV R2, C ; R2 = C │
│ MUL R2, D ; R2 = C * D │
│ ADD R1, R2 ; R1 = A*B + C*D │
│ SUB R1, E ; R1 = A*B + C*D - E │
│  Instructions: 6 | Memory accesses: 5 reads (result left in R1) │
│ │
│ ONE-ADDRESS / ACCUMULATOR (8 instructions): │
│ LOAD A ; ACC = A │
│ MUL B ; ACC = A * B │
│ STORE T ; T = A * B │
│ LOAD C ; ACC = C │
│ MUL D ; ACC = C * D │
│ ADD T ; ACC = C*D + A*B │
│ SUB E ; ACC = C*D + A*B - E │
│ STORE Z ; Z = result │
│  Instructions: 8 | Memory accesses: 6 reads + 2 writes          │
│ │
│ ZERO-ADDRESS / STACK (10 instructions): │
│ PUSH A; PUSH B; MUL; PUSH C; PUSH D; MUL; ADD; PUSH E; │
│ SUB; POP Z │
│  Instructions: 10 | Memory accesses: 5 reads (PUSH) + 1 write   │
└─────────────────────────────────────────────────────────────────┘

| Addressing Mode | Syntax | Effective Address | Description |
|---|---|---|---|
| Immediate | #5 | Operand itself | Operand is part of the instruction; no memory access |
| Register | R1 | Register R1 | Operand is in a register; fastest access |
| Direct (Absolute) | 1000 | 1000 | Address field = memory address of operand |
| Indirect | @1000 | M[1000] | Address field points to memory location containing the EA |
| Indexed | X(R1) | R1 + X | Base register + offset; used for arrays |
| Based | 1000(R1) | R1 + 1000 | Base register + displacement; position-independent code |
| Relative | $label | PC + offset | PC-relative; used for branches; supports code relocation |
| Implied / Implicit | CLR | (fixed register) | Operand is implied by opcode (e.g., accumulator, stack top) |
| Auto-increment | (R1)+ | R1; then R1 ← R1 + d | Access address in R1, then increment R1; good for array traversal |
| Auto-decrement | -(R1) | R1 ← R1 - d; then R1 | Decrement R1, then access; stack push emulation |
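The effective-address column above can be traced with a toy memory and register file (all addresses and contents here are made up for illustration):

```python
mem  = {1000: 2000, 2000: 42, 1008: 7}   # toy memory: address -> contents
regs = {'R1': 1000}
X = 8                                    # index / displacement constant

ea_direct   = 1000                       # Direct: address field IS the EA
ea_indirect = mem[1000]                  # Indirect: M[1000] holds the EA -> 2000
ea_indexed  = regs['R1'] + X             # Indexed: R1 + X -> 1008

# Auto-increment: use R1 as the EA, then bump R1 by the operand size d
d = 4
ea_auto = regs['R1']                     # access M[1000] first
regs['R1'] += d                          # R1 becomes 1004 afterwards
```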
| Feature | RISC | CISC |
|---|---|---|
| Instruction Count | Small (50-100) | Large (200-500+) |
| Instruction Length | Fixed (32-bit) | Variable (16-64 bit) |
| Execution Time | 1 clock cycle each | 1 to many clock cycles |
| Addressing Modes | Few (2-3) | Many (12-15+) |
| Registers | Many (32+) | Few (8-16) |
| Pipeline | Easy to pipeline | Difficult (variable length) |
| Microcode | No microcode | Microprogrammed control |
| Examples | ARM, MIPS, RISC-V, SPARC | x86/x64, VAX, IBM Z |
| Design Philosophy | Software does more (compiler) | Hardware does more (microcode) |
| Clock Speed | Higher (simple instructions) | Lower (complex instructions) |
| Code Density | Lower (more instructions) | Higher (complex ops) |
| Power Consumption | Lower | Higher |
| Stage | Abbreviation | Operation |
|---|---|---|
| Fetch | IF | Read instruction from memory using PC |
| Decode | ID | Decode opcode, read register operands |
| Execute | EX | Perform ALU operation (add, AND, shift) |
| Memory | MEM | Read/write data memory (if needed) |
| Write Back | WB | Write result back to register file |
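With the five stages above, the standard pipeline timing formulas reduce to one-liners (a sketch; the function names are illustrative):

```python
def pipelined_cycles(n, k, stalls=0):
    """Cycles for n instructions on a k-stage pipeline, plus optional stall cycles."""
    return k + (n - 1) + stalls

def speedup(n, k, stalls=0):
    """Speedup over a non-pipelined machine taking k cycles per instruction."""
    return (n * k) / pipelined_cycles(n, k, stalls)

pipelined_cycles(8, 5)          # 12 cycles
round(speedup(8, 5), 2)         # 3.33
pipelined_cycles(100, 5, 70)    # 174 cycles once stalls are added
```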
Indexed addressing (constant offset + register) selects an array element, while Based addressing (register + constant displacement) reaches a struct field or local variable. Relative addressing is key for position-independent code and is how branch, CALL, and JMP instructions work in most ISAs.
┌─────────────────────────────────────────────────────────────────┐
│ 5-STAGE PIPELINE TIMING DIAGRAM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT PIPELINE (Sequential): │
│ ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐ │
│ │ I1 │ I2 │ I3 │ I4 │ I5 │ I6 │ I7 │ I8 │ │
│ └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ │
│  Slot:   1      2      3      4      5      6      7      8     │
│  (each slot is one instruction = k = 5 cycles)                  │
│ Total = n × k = 8 × 5 = 40 cycles │
│ │
│ WITH PIPELINE (Overlapped): │
│ Cycle: 1 2 3 4 5 6 7 8 9 10 11 12│
│ I1: [IF] [ID] [EX] [MEM] [WB] │
│ I2: [IF] [ID] [EX] [MEM] [WB] │
│ I3: [IF] [ID] [EX] [MEM] [WB] │
│ I4: [IF] [ID] [EX] [MEM] [WB] │
│ I5: [IF] [ID] [EX] [MEM] [WB] │
│ I6: [IF] [ID] [EX] [MEM] [WB] │
│ I7: [IF] [ID] [EX] [MEM] [WB]│
│ I8: [IF] [ID] [EX] [MEM][WB]│
│ │
│ Total = k + (n-1) = 5 + 7 = 12 cycles │
│ Speedup = 40 / 12 = 3.33x │
│ Theoretical max = k = 5x │
│ │
│ FORMULA: │
│ Sequential time = n × k × t (where t = cycle time) │
│ Pipelined time = [k + (n - 1)] × t │
│ Speedup S = (n × k) / (k + n - 1) │
│ As n → ∞: S → k (number of pipeline stages) │
└─────────────────────────────────────────────────────────────────┘

| Hazard | Name | Description | Example |
|---|---|---|---|
| RAW | Read After Write | Instruction reads a register before a prior write completes | ADD R1,R2,R3 followed by SUB R4,R1,R5 |
| WAR | Write After Read | Instruction writes a register before a prior read completes | SUB R4,R1,R5 followed by ADD R1,R2,R3 |
| WAW | Write After Write | Two instructions write to the same register; order matters | ADD R1,R2,R3 followed by MUL R1,R4,R5 |
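The three dependence types above follow mechanically from each instruction's destination and source registers; a classifier sketch (the `(dest, sources)` tuple encoding is a made-up convenience):

```python
def hazards(first, second):
    """Classify data hazards between two (dest, sources) register tuples."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:  found.add('RAW')   # second reads what first writes
    if d2 in s1:  found.add('WAR')   # second writes what first reads
    if d1 == d2:  found.add('WAW')   # both write the same register
    return found

add = ('R1', ('R2', 'R3'))           # ADD R1,R2,R3
sub = ('R4', ('R1', 'R5'))           # SUB R4,R1,R5
hazards(add, sub)                    # {'RAW'}
hazards(sub, add)                    # {'WAR'}
```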
| Hazard Type | Cause | Solution |
|---|---|---|
| Control Hazard | Branch changes PC before pipeline is flushed | Branch prediction, delayed branch, branch target buffer |
| Structural Hazard | Two instructions need same hardware resource | Duplicate resources (separate IF/MEM), pipeline stall |
| Data Hazard (RAW) | Producer-consumer register dependency | Forwarding, stall insertion, compiler scheduling |
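The "branch prediction" remedy in the control-hazard row above is usually a 2-bit saturating counter in hardware; a minimal sketch (initial counter value is one common choice, not mandated):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken."""
    def __init__(self):
        self.counter = 2                   # start weakly-taken

    def predict(self):
        return self.counter >= 2           # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:         # a loop branch: taken 9x, then exits
    correct += (p.predict() == taken)
    p.update(taken)
# correct == 9: only the final not-taken loop exit is mispredicted
```

The saturation is what avoids the 1-bit predictor's double misprediction at every loop exit/re-entry.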
┌─────────────────────────────────────────────────────────────────┐
│ DATA FORWARDING (BYPASSING) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Problem: RAW hazard between consecutive instructions │
│ │
│ I1: ADD R1, R2, R3 ; R1 = R2 + R3 (writes in WB stage) │
│ I2: SUB R4, R1, R5 ; R4 = R1 - R5 (reads R1 in ID stage) │
│ │
│ Without forwarding: 2 stall cycles needed (wait for WB of I1) │
│ │
│ WITH FORWARDING: │
│ I1: [IF] [ID] [EX→ALU result] [MEM] [WB] │
│ I2: [IF] [ID←gets R1 from EX/MEM] [EX] [MEM] [WB] │
│ │
│ Forward ALU result directly from EX/MEM pipeline register │
│ to the ID/EX input — no stall needed! │
│ │
│  FORWARDING PATHS (classic 5-stage):                            │
│  ┌─────────────────┬─────────────────────────────────────┐      │
│  │ EX/MEM → EX     │ I1 result feeds the very next       │      │
│  │                 │ instruction (distance 1), 0 stalls  │      │
│  │ MEM/WB → EX     │ I1 result feeds the instruction two │      │
│  │                 │ behind (distance 2), 0 stalls       │      │
│  │ Register file   │ WB writes in the first half-cycle,  │      │
│  │                 │ ID reads in the second → covers     │      │
│  │                 │ distance 3 and beyond               │      │
│  └─────────────────┴─────────────────────────────────────┘      │
│ │
│ LOAD-USE HAZARD (cannot fully avoid with forwarding): │
│ I1: LW R1, 0(R2) ; loads from memory (data in MEM stage) │
│ I2: ADD R3, R1, R4 ; needs R1 in ID stage │
│ → Must stall 1 cycle even with forwarding (data not in EX) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ BRANCH PREDICTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BRANCH PENALTY: When branch is resolved in EX/MEM stage, │
│ already fetched instructions in pipeline must be flushed. │
│ Penalty = number of stages between fetch and branch resolve │
│ │
│ STATIC PREDICTION (no runtime data): │
│ ┌────────────────────┬────────────────────────────────────┐ │
│ │ Predict Not Taken │ Always fetch sequential; flush if taken│
│ │ Predict Taken │ Always fetch branch target │
│ │ Backward Taken │ Predict backward branches as taken │
│ │ │ (loops usually iterate) │
│ │ Profile-guided │ Use compiler profiling info │
│ └────────────────────┴────────────────────────────────────┘ │
│ │
│ DYNAMIC PREDICTION (runtime history): │
│ ┌────────────────────┬──────────────┬──────────────────────┐ │
│ │ 1-bit predictor │ Predict last │ ~60-70% accuracy │ │
│ │ 2-bit saturating │ Change only │ ~75-85% accuracy │ │
│ │ │ on 2 same │ (avoids oscillation │ │
│ │ │ predictions │ in nested loops) │ │
│ │ BHT (Branch │ 2-bit counter │ Indexed by PC │ │
│ │ History Table) │ per branch │ │ │
│ │ Global History │ Uses pattern │ Correlating │ │
│ │ Register │ of last N │ branches │ │
│ │ (GHR) │ branches │ │ │
│ │ Tournament │ Local vs │ Chooses better │ │
│ │ Predictor │ global │ predictor │ │
│ └────────────────────┴──────────────┴──────────────────────┘ │
│ │
│ BRANCH TARGET BUFFER (BTB): │
│ Cache of {PC → Target Address} pairs │
│ Fetch stage checks BTB; if hit, fetch from predicted target │
│ Eliminates 1-cycle penalty for taken branches │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE SPEEDUP — CALCULATION EXAMPLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Given: │
│ Pipeline stages (k) = 5 │
│ Number of instructions (n) = 100 │
│ Branch frequency = 20% │
│ Branch penalty = 3 cycles (with no prediction) │
│ Load-use stall frequency = 10% │
│ Load-use stall cycles = 1 │
│ │
│ Case 1: No stalls, ideal pipeline │
│ Time = (k + n - 1) × t = (5 + 99) × t = 104t │
│ Sequential = n × k × t = 100 × 5 × t = 500t │
│ Speedup = 500/104 = 4.81x │
│ │
│ Case 2: With stalls │
│ Branch stalls = 100 × 0.20 × 3 = 60 cycles │
│ Load-use stalls = 100 × 0.10 × 1 = 10 cycles │
│ Total stalls = 70 cycles │
│ │
│ Time = (k + n - 1 + stalls) × t = (5 + 99 + 70) × t = 174t │
│ Speedup = 500/174 = 2.87x │
│ │
│ Case 3: With perfect branch prediction (100% accuracy) │
│ Branch stalls = 0 cycles │
│ Load-use stalls = 10 cycles │
│ Time = (5 + 99 + 10) × t = 114t │
│ Speedup = 500/114 = 4.39x │
│ │
│ Case 4: With 85% accurate branch prediction │
│ Mispredictions = 100 × 0.20 × 0.15 = 3 instructions │
│ Branch stalls = 3 × 3 = 9 cycles │
│ Load-use stalls = 10 cycles │
│ Time = (5 + 99 + 9 + 10) × t = 123t │
│ Speedup = 500/123 = 4.07x │
└─────────────────────────────────────────────────────────────────┘

| Mapping | Lines/Set | Tag Bits | Search Method |
|---|---|---|---|
| Direct | 1 | log₂(blocks) - log₂(lines) | 1 comparison (direct index) |
| Fully Associative | All | log₂(blocks) | Parallel compare all tags |
| N-way Set Associative | N | log₂(blocks) - log₂(sets) | N comparisons per set |
| Mapping | Tag | Set/Index | Block Offset | Word Offset |
|---|---|---|---|---|
| Direct Mapped | Tag | Index | Block Offset | Word Offset |
| Fully Associative | Tag (widest) | (none) | Block Offset | Word Offset |
| N-way Set Assoc | Tag | Set Index | Block Offset | Word Offset |
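The field splits above reduce to shifts and masks; a sketch, using the 4 KB cache / 64 B block / 32-bit address parameters of the worked example that follows:

```python
def split_address(addr, offset_bits, index_bits):
    """Break an address into (tag, index, offset) cache fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index  = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag    = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Direct mapped: 64 lines -> 6 index bits; 64 B blocks -> 6 offset bits
split_address(0x00008A3C, 6, 6)   # (0x8, 40, 60): line 40, byte 60
# 4-way set associative: 16 sets -> 4 set bits
split_address(0x00008A3C, 6, 4)   # (0x22, 8, 60): set 8
```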
┌─────────────────────────────────────────────────────────────────┐
│ CACHE MAPPING — WORKED EXAMPLES │
├─────────────────────────────────────────────────────────────────┤
│ PROBLEM: │
│ Main memory size = 256 KB │
│ Cache size = 4 KB │
│ Block size = 64 bytes │
│   Address bus = 32 bits (tag widths below use the full 32-bit   │
│   address space, not just the 256 KB of installed memory)       │
│ │
│ Step 1: Calculate parameters │
│ Number of cache lines = Cache size / Block size │
│ = 4 KB / 64 B = 64 lines │
│ Number of memory blocks = Memory size / Block size │
│ = 256 KB / 64 B = 4096 blocks │
│ Offset bits = log₂(64) = 6 bits │
│   Words per block = 64 / 4 = 16 (assuming 4-byte words;         │
│   this fixes the word-offset subfield)                          │
│ │
│ ═══════════════════════════════════════════════════════ │
│ DIRECT MAPPED CACHE: │
│ ───────────────────── │
│ Index bits = log₂(64) = 6 bits │
│ Tag bits = 32 - 6 - 6 = 20 bits │
│ Address format: [TAG:20 | INDEX:6 | OFFSET:6] │
│ │
│   For address 0x00008A3C:                                       │
│   Binary: 0000 0000 0000 0000 1000 1010 0011 1100               │
│   Tag    = 00000000000000001000 = 0x00008                       │
│   Index  = 101000 = 40 (decimal)                                │
│   Offset = 111100 = 60 (decimal)                                │
│   → Maps to cache line 40                                       │
│ │
│ ═══════════════════════════════════════════════════════ │
│ 4-WAY SET ASSOCIATIVE CACHE: │
│ ───────────────────────────── │
│ Number of sets = 64 / 4 = 16 sets │
│ Set bits = log₂(16) = 4 bits │
│ Tag bits = 32 - 4 - 6 = 22 bits │
│ Address format: [TAG:22 | SET:4 | OFFSET:6] │
│ │
│   For address 0x00008A3C:                                       │
│   Set = 1000 = 8 (decimal)                                      │
│   Tag = 0000000000000000100010 = 0x22                           │
│   → Search 4 tags in set 8 for a match                          │
│ │
│ ═══════════════════════════════════════════════════════ │
│ FULLY ASSOCIATIVE CACHE: │
│ ───────────────────────── │
│ No index; entire cache is one set │
│ Tag bits = 32 - 6 = 26 bits │
│ Address format: [TAG:26 | OFFSET:6] │
│ → Compare tag with ALL 64 cache line tags in parallel │
│ → Most flexible, but most expensive (needs associative search)│
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ CACHE PERFORMANCE — CALCULATIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ KEY FORMULAS: │
│ ───────────── │
│ Hit Ratio (h) = Hits / (Hits + Misses) │
│ Miss Ratio (m) = 1 - h │
│ │
│ Effective Access Time (EAT): │
│ EAT = h × t_cache + m × t_memory │
│ │
│ With different miss levels: │
│ EAT = h₁×t_L1 + m₁×(h₂×t_L2 + m₂×(h₃×t_L3 + m₃×t_mem)) │
│ │
│ Average Memory Access Time (AMAT) = EAT │
│ │
│ ═══════════════════════════════════════════════════════ │
│ NUMERICAL EXAMPLE: │
│ ───────────────── │
│ L1 cache access time = 2 ns │
│ L2 cache access time = 10 ns │
│ Main memory access time = 100 ns │
│ L1 hit rate = 90% │
│ L2 local hit rate = 80% (of L1 misses) │
│ │
│ AMAT = h₁ × t_L1 + (1-h₁) × [h₂ × t_L2 + (1-h₂) × t_mem] │
│ AMAT = 0.90 × 2 + 0.10 × [0.80 × 10 + 0.20 × 100] │
│ AMAT = 1.8 + 0.10 × [8 + 20] │
│ AMAT = 1.8 + 0.10 × 28 │
│ AMAT = 1.8 + 2.8 = 4.6 ns │
│ │
│ Speedup over memory-only access = 100 / 4.6 = 21.7x │
│ │
│ ═══════════════════════════════════════════════════════ │
│ MISS PENALTY CALCULATION: │
│ ───────────────────────── │
│ Miss penalty = time to fetch block from lower level │
│ For main memory: miss penalty = memory latency + │
│ (block size / bus width) × memory cycle time │
│ │
│ Example: Memory latency = 50 ns │
│ Bus width = 8 bytes, memory cycle = 10 ns │
│ Block size = 64 bytes │
│ Transfer time = (64/8) × 10 = 80 ns │
│ Total miss penalty = 50 + 80 = 130 ns │
└─────────────────────────────────────────────────────────────────┘

| Policy | Write Hit | Write Miss | Pros | Cons |
|---|---|---|---|---|
| Write-Through | Write to cache + memory | Write to memory (no allocate) | Consistent | Slow writes (every write hits memory) |
| Write-Back | Write to cache only (mark dirty) | Fetch block into cache, then write | Fast writes (batch) | Complexity, dirty bit management |
| Write-Allocate | — | Load block into cache on write miss | Spatial locality benefit | Extra memory read on miss |
| No-Write-Allocate | — | Write directly to memory | Simpler | No cache benefit for writes |
| State | Name | Cache Line Status | Memory Status |
|---|---|---|---|
| M | Modified | Data is modified (dirty); only in this cache | Stale (out of date) |
| E | Exclusive | Clean copy; only in this cache; can write without bus transaction | Up to date |
| S | Shared | Clean copy; may exist in other caches | Up to date |
| I | Invalid | Line is not valid; must fetch on next access | — |
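A simplified transition table for the four states above, split into local CPU events and snooped bus events (a sketch only; real protocols carry extra bus signals this table omits):

```python
# (current state, event) -> next state, for a single cache line
MESI = {
    ('I', 'read_miss_shared'):    'S',  # another cache already holds a copy
    ('I', 'read_miss_exclusive'): 'E',  # no other copies exist
    ('I', 'write'):               'M',  # read-for-ownership, then modify
    ('E', 'write'):               'M',  # silent upgrade, no bus transaction
    ('S', 'write'):               'M',  # must invalidate other copies first
    ('M', 'snoop_read'):          'S',  # supply dirty data, demote to shared
    ('E', 'snoop_read'):          'S',
    ('M', 'snoop_write'):         'I',  # another CPU takes ownership
    ('E', 'snoop_write'):         'I',
    ('S', 'snoop_write'):         'I',
}

state = 'I'
for event in ['read_miss_shared', 'write', 'snoop_read', 'snoop_write']:
    state = MESI[(state, event)]
# walks I -> S -> M -> S -> I
```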
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF CACHE MISSES (3C Model) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┬──────────────────────────────────────────┐ │
│ │ Compulsory │ First access to a block (cold start) │ │
│ │ (Cold Miss) │ Cannot avoid — every block must be loaded │ │
│ │ │ Reduction: larger block size, prefetching │ │
│ ├──────────────┼──────────────────────────────────────────┤ │
│ │ Capacity │ Cache is too small; blocks evicted before │ │
│ │ Miss │ reuse │ │
│ │ │ Reduction: larger cache size │ │
│ ├──────────────┼──────────────────────────────────────────┤ │
│ │ Conflict │ Multiple memory blocks map to same cache │ │
│ │ (Collision) │ line/set; causes eviction of useful data │ │
│ │ Miss │ Reduction: higher associativity, different │ │
│ │ │ mapping, victim cache │ │
│ └──────────────┴──────────────────────────────────────────┘ │
│ │
│ (Some add a 4th C: Coherency miss — due to cache coherence │
│ protocol invalidations in multiprocessor systems) │
│ │
│ MISS RATE TRENDS: │
│ ┌─────────────────┬──────┬──────┬──────┐ │
│ │ Cache Size │ 1KB │ 4KB │ 64KB │ │
│ │ Direct Map │ 20% │ 12% │ 3% │ │
│ │ 4-way Set Assoc │ 15% │ 8% │ 2% │ │
│ │ Fully Assoc │ 12% │ 6% │ 1% │ │
│ └─────────────────┴──────┴──────┴──────┘ │
│ (Approximate; actual rates depend on workload) │
└─────────────────────────────────────────────────────────────────┘

| Level | Technology | Typical Size | Access Time | Cost/GB | Managed By |
|---|---|---|---|---|---|
| Registers | SRAM (on CPU) | 1-4 KB | < 1 ns | Highest | Compiler |
| L1 Cache | SRAM | 32-64 KB | 1-2 ns | Very High | Hardware |
| L2 Cache | SRAM | 256 KB - 1 MB | 3-10 ns | High | Hardware |
| L3 Cache | SRAM | 4-32 MB | 10-30 ns | Moderate | Hardware |
| Main Memory (RAM) | DRAM | 4-64 GB | 50-100 ns | Low | OS |
| SSD | NAND Flash | 256 GB - 4 TB | 50-250 μs | Very Low | OS/Controller |
| HDD | Magnetic | 1-20 TB | 5-10 ms | Lowest | OS/Controller |
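The access-time gaps in the table above are what the AMAT recursion from the cache-performance box quantifies; a sketch using the earlier example numbers (L1 = 2 ns at 90% hits, L2 = 10 ns at 80% local hits, memory = 100 ns):

```python
def amat(levels, t_mem):
    """Average memory access time.
    levels = [(hit_rate, access_time), ...] ordered fastest to slowest."""
    miss_path = t_mem                      # cost once every level has missed
    for hit, t in reversed(levels):        # fold levels back toward the CPU
        miss_path = hit * t + (1 - hit) * miss_path
    return miss_path

amat([(0.90, 2), (0.80, 10)], 100)         # ~4.6 ns, matching the worked example
```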
| Property | SRAM (Static) | DRAM (Dynamic) |
|---|---|---|
| Storage Element | 6 transistors per cell | 1 transistor + 1 capacitor |
| Refresh Required | No (static) | Yes (every few ms) |
| Speed | Faster (1-10 ns) | Slower (50-100 ns) |
| Density | Lower (fewer bits per area) | Higher (more bits per area) |
| Cost | More expensive | Less expensive |
| Power | Low (no refresh) | Higher (refresh circuitry) |
| Used In | Cache memory (L1, L2, L3) | Main memory (RAM) |
| Volatile | Yes | Yes |
| Concept | Description |
|---|---|
| Virtual Address | Address generated by CPU; what the program sees |
| Physical Address | Actual address on the memory bus |
| Page | Fixed-size block of virtual memory (typically 4 KB) |
| Frame | Fixed-size block of physical memory (same size as page) |
| Page Table | Maps virtual page numbers to physical frame numbers |
| TLB | Translation Lookaside Buffer; cache for page table entries |
| Page Fault | Access to a page not in physical memory; OS must load it from disk |
| Thrashing | Excessive paging; system spends more time paging than executing |
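For the 32-bit / 4 KB-page configuration in the terms above, translation is a split, a lookup, and a recombine (sketch; the page table here is a plain dict standing in for the real structure):

```python
PAGE_BITS = 12                       # 4 KB pages

def translate(vaddr, page_table):
    """Map a virtual address to a physical one via a VPN -> frame dict."""
    vpn    = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault on VPN {vpn:#x}")  # OS would load the page
    frame = page_table[vpn]
    return (frame << PAGE_BITS) | offset                  # concatenation, not addition

pt = {0x12345: 0x00042}
translate(0x12345678, pt)            # 0x42678: offset 0x678 passes through
```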
┌─────────────────────────────────────────────────────────────────┐
│ VIRTUAL ADDRESS TRANSLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Virtual Address Structure (32-bit, 4 KB page): │
│ ┌──────────────────────┬────────────────┐ │
│ │ Virtual Page No. │ Page Offset │ │
│ │ (20 bits) │ (12 bits) │ │
│ └──────────────────────┴────────────────┘ │
│ │
│ Page Table Entry (PTE): │
│ ┌──────────────────────┬─────┬─────┬─────┬─────┐ │
│ │ Frame Number (20) │ V │ D │ R │ W │ │
│ └──────────────────────┴─────┴─────┴─────┴─────┘ │
│ V = Valid, D = Dirty, R = Referenced, W = Write-protected │
│ │
│ Translation Process: │
│ 1. CPU generates virtual address │
│ 2. Split into VPN + Offset │
│ 3. Check TLB for VPN → Frame mapping │
│ ├─ TLB Hit: Get frame number from TLB → Physical address │
│ └─ TLB Miss: Walk page table in memory │
│ ├─ Valid = 1: Access frame, update TLB │
│ └─ Valid = 0: Page Fault → OS handles (load from disk) │
│ │
│  Physical Address = Frame Number concatenated with Page Offset  │
│ ┌──────────────────────┬────────────────┐ │
│ │ Physical Frame No. │ Page Offset │ │
│ └──────────────────────┴────────────────┘ │
│ │
│ MULTILEVEL PAGE TABLE (to reduce memory overhead): │
│ Level 1: Page Directory (outer) → Level 2: Page Table (inner) │
│ For 32-bit with 4KB pages: 10-bit dir + 10-bit page + 12 off │
│  For 64-bit with 4 levels: 9 bits per level (512 entries);      │
│  4 × 9 + 12 = 48-bit virtual address                            │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TLB & MEMORY HIERARCHY PERFORMANCE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TLB (Translation Lookaside Buffer): │
│ - Cache that stores recent virtual → physical translations │
│ - Typically 64-512 entries, fully associative or set-assoc │
│ - TLB hit: ~1 cycle (no page table walk needed) │
│ - TLB miss: ~10-30 cycles (walk page table in memory) │
│ │
│ EFFECTIVE ACCESS TIME WITH TLB: │
│ ┌───────────────────────────────────────────────────┐ │
│ │ EAT = h_tlb × t_tlb │ │
│ │ + (1 - h_tlb) × [ │ │
│ │ h_page × (t_ptw + h_cache × t_c + │ │
│ │ (1-h_cache) × t_mem) │ │
│ │ + (1-h_page) × t_pagefault │ │
│ │ ] │ │
│ └───────────────────────────────────────────────────┘ │
│ │
│ WORKED EXAMPLE: │
│ TLB hit rate = 95% │
│ TLB access time = 2 ns │
│ Page table access = 20 ns (in memory) │
│ Cache hit rate = 90% (after TLB miss, PTE in cache) │
│    Cache access time = 5 ns                                     │
│    Main memory access time = 100 ns                             │
│ Page fault rate = 0.1% │
│ Page fault service time = 8 ms = 8,000,000 ns │
│ │
│ TLB hit path = 2 ns │
│ TLB miss, cache hit (PTE) = 20 + 5 = 25 ns │
│ TLB miss, cache miss (PTE) = 20 + 100 = 120 ns │
│ Page fault = 8,000,000 ns │
│ │
│  On a TLB miss: P(PTE in cache) = 0.90,                         │
│  P(cache miss, no fault) = 0.10 - 0.001 = 0.099,                │
│  P(page fault) = 0.001 (the three paths sum to 1)               │
│ │
│ EAT = 0.95 × 2 │
│ + 0.05 × [0.90 × 25 + 0.099 × 120 + 0.001 × 8000000] │
│ = 1.9 + 0.05 × [22.5 + 11.88 + 8000] │
│ = 1.9 + 0.05 × 8034.38 │
│ = 1.9 + 401.72 │
│ = 403.62 ns │
│ │
│ (Note: Page faults dominate! TLB and cache improvements are │
│ negligible when page fault rate is significant.) │
│ │
│ INTERLEAVED MEMORY: │
│ - Memory divided into multiple modules (banks) │
│ - Consecutive addresses placed in different banks │
│  - K banks → effective cycle time ≈ (bank cycle time) / K for   │
│    sequential accesses; single-access latency is unchanged      │
│ - Improves bandwidth by overlapping memory accesses │
│ - Example: 4 banks, 100ns each → effective = 100ns for first, │
│ then 25ns for each subsequent word (pipeline) │
└─────────────────────────────────────────────────────────────────┘

| Component | Operation | Logic |
|---|---|---|
| Arithmetic | ADD, SUB, INC, DEC | Full adders chained; SUB via 2's complement addition |
| Logical | AND, OR, XOR, NOT | Bitwise gates on each bit position |
| Shift | SHL, SHR, ROT | Barrel shifter for O(1) shifts of any amount |
| Comparison | LT, GT, EQ, NE | Subtract + check flags (zero, sign, overflow) |
| Feature | Hardwired | Microprogrammed |
|---|---|---|
| Speed | Fast (combinational logic) | Slower (microcode ROM lookup) |
| Flexibility | Fixed; difficult to modify | Flexible; change microcode for new instructions |
| Complexity | Complex for large ISAs | Simpler hardware; complex for microcode |
| Design Time | Long | Short |
| Cost | Higher for complex ISAs | Lower (uses ROM) |
| Instruction Set | RISC typically | CISC typically (x86 uses hybrid) |
| Debugging | Harder to debug | Easier (microcode is software-like) |
| Examples | RISC-V, MIPS, early RISC | x86 (partially), VAX, IBM System/360 |
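Microprogrammed control sequences elementary register transfers; the microoperation notation that follows behaves like simple assignments over a register file (a sketch with an 8-bit register width assumed):

```python
regs = {'R1': 6, 'R2': 3, 'R3': 0}

# R3 <- R1 + R2   (add microoperation)
regs['R3'] = regs['R1'] + regs['R2']

# R2 <- shl R1    (logical shift left by 1; mask keeps the 8-bit width)
regs['R2'] = (regs['R1'] << 1) & 0xFF

# R1 <- R1 + 1    (increment microoperation)
regs['R1'] += 1

# regs is now {'R1': 7, 'R2': 12, 'R3': 9}
```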
┌─────────────────────────────────────────────────────────────────┐
│ MICROOPERATIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ A microoperation is an elementary operation performed on data │
│ stored in registers. Each instruction = sequence of microops. │
│ │
│ NOTATION: R1 ← R1 + R2 (register transfer) │
│ │
│ REGISTER TRANSFER MICROOPERATIONS: │
│ ┌─────────────────────────┬──────────────────────────────────┐ │
│ │ Transfer: R2 ← R1 │ Copy contents of R1 to R2 │ │
│ │ Add: R3 ← R1 + R2 │ R3 = sum of R1 and R2 │ │
│ │ Sub: R3 ← R1 - R2 │ R3 = R1 minus R2 │ │
│ │ AND: R3 ← R1 ∧ R2 │ Bitwise AND │ │
│ │ OR: R3 ← R1 ∨ R2 │ Bitwise OR │ │
│ │ XOR: R3 ← R1 ⊕ R2 │ Bitwise XOR │
│ │ NOT: R2 ← R1' │ Complement of R1 │ │
│ │ Shift L: R2 ← shl R1 │ Shift R1 left by 1 │ │
│ │ Shift R: R2 ← shr R1 │ Shift R1 right by 1 │ │
│ │ Increment: R1 ← R1 + 1 │ Add 1 to R1 │ │
│ │ Clear: R1 ← 0 │ Set R1 to zero │ │
│ └─────────────────────────┴──────────────────────────────────┘ │
│ │
│ BUS TRANSFER: │
│ When a system bus connects registers: │
│ BUS ← R1 ; R1 places data on bus │
│ R2 ← BUS ; R2 captures data from bus │
│ Combined: R2 ← BUS, BUS ← R1 (requires 2 control signals) │
│ │
│ MEMORY TRANSFER: │
│ MAR ← address ; Load Memory Address Register │
│ MDR ← M[MAR] ; Read from memory to Memory Data Register │
│ R1 ← MDR ; Copy MDR to R1 │
│ Combined: Read(R1, addr) │
│ │
│ MDR ← R1 ; Load data to write │
│ MAR ← address ; Set address │
│ M[MAR] ← MDR ; Write MDR to memory │
│ Combined: Write(R1, addr) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ BUS ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SINGLE BUS ARCHITECTURE: │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
│ │CPU │──│ ALU│──│Reg │──│ I/O│──│Mem │──│DMA │ │
│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
│ ═════════════ Shared System Bus ═════════════ │
│ │
│ Pros: Simple, low cost │
│ Cons: Bottleneck — only one transfer at a time │
│ │
│ ───────────────────────────────────────────── │
│ MULTI-BUS ARCHITECTURE: │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ CPU │────▶│ System Bus │────▶│ Memory │ │
│ │ (with │ │ (address + │ │ │ │
│ │ cache) │ │ data + ctrl)│ └──────────┘ │
│ └──────────┘ └──────┬───────┘ │
│ │ │
│ ┌──────▼───────┐ ┌──────────┐ │
│ │ I/O Bus │────▶│ I/O │ │
│ │ (PCI/PCIe) │ │ Devices │ │
│ └─────────────┘ └──────────┘ │
│ │
│ Bus Types: │
│ ┌──────────────────┬────────────────────────────────────┐ │
│ │ Data Bus │ Carries data (bidirectional) │ │
│ │ Address Bus │ Carries address (unidirectional, │ │
│ │ │ CPU → memory/I/O) │ │
│ │ Control Bus │ Carries signals: read/write, clock, │ │
│ │ │ interrupt, bus request/grant │ │
│ └──────────────────┴────────────────────────────────────┘ │
│ │
│ Bus Arbitration: │
│ • Centralized: Single arbiter (daisy chain, priority) │
│ • Distributed: Each device has arbiter logic │
│ • Protocols: Request → Grant → Transfer → Release │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ I/O ORGANIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. PROGRAMMED I/O (Polling): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CPU continuously checks I/O device status register │ │
│ │ while (status != READY) { check_status(); } │ │
│ │ data = read_port(); │ │
│ │ │ │
│ │ Pros: Simple hardware │ │
│ │ Cons: CPU wastes cycles polling; slow for I/O bound │ │
│ │ Best for: Simple embedded systems, infrequent I/O │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 2. INTERRUPT-DRIVEN I/O: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CPU starts I/O operation, then continues other work │ │
│ │ Device sends interrupt when data is ready │ │
│ │ │ │
│ │ Flow: │ │
│ │ CPU → Issue I/O command → Continue execution │ │
│ │ Device → Complete operation → Send IRQ │ │
│ │ CPU → Save context → Jump to ISR → Handle data │ │
│ │ CPU → Restore context → Resume execution │ │
│ │ │ │
│ │ Pros: CPU free during I/O wait; efficient │ │
│ │ Cons: Context switch overhead per interrupt │ │
│ │ Best for: General-purpose computing │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 3. DMA (Direct Memory Access): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ DMA controller transfers data directly between │ │
│ │ I/O device and memory without CPU involvement │ │
│ │ │ │
│ │ CPU sets up DMA: │ │
│ │ - Source address (I/O port or memory) │ │
│ │ - Destination address (memory or I/O port) │ │
│ │ - Transfer count (number of words/bytes) │ │
│ │ - Direction (read/write) │ │
│ │ │ │
│ │ DMA operates in cycle stealing: │ │
│ │ When DMA needs bus, it requests it from CPU │ │
│ │ CPU pauses for 1 cycle, DMA transfers 1 word │ │
│ │ CPU resumes; cycle repeated until transfer done │ │
│ │ │ │
│ │ Pros: Fastest; CPU free; bulk transfer capability │ │
│ │ Cons: Complex hardware; bus contention │ │
│ │ Best for: Disk I/O, network, display, bulk transfer │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ COMPARISON: │
│ ┌──────────────┬──────────┬──────────┬──────────┐ │
│ │ │ CPU Load │ Speed │ Hardware │ │
│ │ Programmed │ 100% │ Slowest │ Simplest │ │
│ │ Interrupt │ Low │ Medium │ Moderate │ │
│ │ DMA │ Minimal │ Fastest │ Complex │ │
│ └──────────────┴──────────┴──────────┴──────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Classification | Instruction Stream | Data Stream | Example |
|---|---|---|---|
| SISD | Single | Single | Traditional uniprocessor (Pentium 1) |
| SIMD | Single | Multiple | Vector processors, GPUs, Intel SSE/AVX |
| MISD | Multiple | Single | Rare; fault-tolerant systems, some crypto |
| MIMD | Multiple | Multiple | Multi-core CPUs, distributed systems, clusters |
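The SISD vs SIMD distinction in Flynn's taxonomy can be sketched as a toy model in plain Python; the lane-group loop stands in for a hardware vector instruction, and all names here are illustrative:

```python
# SISD: one instruction processes one data element per step.
# SIMD: one (conceptual) instruction processes a whole lane group in lock-step.

def sisd_add(a, b):
    """SISD: one add instruction per element, executed sequentially."""
    out = []
    for x, y in zip(a, b):          # each iteration = one instruction, one datum
        out.append(x + y)
    return out

def simd_add(a, b, lanes=4):
    """SIMD: each 'vector add' consumes `lanes` elements at once."""
    out = []
    for i in range(0, len(a), lanes):
        # a single conceptual vector instruction over one lane group
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

a, b = list(range(8)), [10] * 8
assert sisd_add(a, b) == simd_add(a, b) == [10, 11, 12, 13, 14, 15, 16, 17]
```

With 4 lanes, `simd_add` issues 2 conceptual instructions for 8 elements where `sisd_add` issues 8 — the same result with a quarter of the instruction stream, which is exactly why SSE/AVX/NEON pay off on array-heavy code.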
| Type | Full Form | Memory Access | Latency | Example |
|---|---|---|---|---|
| UMA | Uniform Memory Access | All processors see same memory latency | Uniform | Symmetric multiprocessing (SMP) |
| NUMA | Non-Uniform Memory Access | Local memory faster than remote | Variable | AMD Epyc, modern servers |
┌─────────────────────────────────────────────────────────────────┐
│ AMDAHL'S LAW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "The speedup of a parallel system is limited by the │
│ sequential portion of the program." │
│ │
│ FORMULA: │
│ │
│ 1 │
│ S = ───────────────────── │
│ (1 - P) + (P / N) │
│ │
│ Where: │
│ S = Speedup (ratio of original to parallel time) │
│ P = Fraction of program that can be parallelized (0 ≤ P ≤ 1)│
│ N = Number of processors (cores) │
│ │
│ Maximum speedup (N → ∞): │
│ │
│ 1 1 │
│ Smax = ───── = ───────── │
│ 1 - P sequential_fraction │
│ │
│ ═══════════════════════════════════════════════════════ │
│ WORKED EXAMPLE 1: │
│ ───────────────── │
│ Program has 60% parallelizable code (P = 0.6), 4 processors: │
│ │
│ S = 1 / [(1 - 0.6) + (0.6 / 4)] │
│ S = 1 / [0.4 + 0.15] │
│ S = 1 / 0.55 = 1.82x │
│ │
│ With 8 processors: │
│ S = 1 / [0.4 + 0.075] = 1 / 0.475 = 2.11x │
│ │
│ With 16 processors: │
│ S = 1 / [0.4 + 0.0375] = 1 / 0.4375 = 2.29x │
│ │
│ With ∞ processors: │
│ Smax = 1 / 0.4 = 2.5x │
│ → Adding more processors gives diminishing returns! │
│ │
│ ═══════════════════════════════════════════════════════ │
│ WORKED EXAMPLE 2: │
│ ───────────────── │
│ Program has 90% parallelizable code (P = 0.9), 8 processors: │
│ │
│ S = 1 / [(1 - 0.9) + (0.9 / 8)] │
│ S = 1 / [0.1 + 0.1125] │
│ S = 1 / 0.2125 = 4.71x │
│ │
│ With 64 processors: │
│ S = 1 / [0.1 + 0.0141] = 1 / 0.1141 = 8.77x │
│ │
│ With ∞ processors: │
│ Smax = 1 / 0.1 = 10x │
│ │
│ KEY INSIGHT: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ If P = 0.5 → max speedup = 2x regardless of cores! │ │
│ │ If P = 0.95 → max speedup = 20x │ │
│ │ If P = 0.99 → max speedup = 100x │ │
│ │ The sequential fraction is the ultimate bottleneck! │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF PARALLELISM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. INSTRUCTION-LEVEL PARALLELISM (ILP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Superscalar: Multiple instructions issued per clock│ │
│ │ (e.g., 4-wide: 4 instructions/clock) │ │
│ │ • Out-of-Order (OoO): Execution engine reorders │ │
│ │ instructions to avoid stalls │ │
│ │ • VLIW: Compiler schedules instructions (Itanium) │ │
│ │ • Speculative execution: Execute before knowing if │ │
│ │ branch taken; squash if wrong │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. DATA-LEVEL PARALLELISM (DLP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • SIMD: Single instruction operates on multiple data│ │
│ │ • Vector processors: Operate on arrays/vectors │ │
│ │ • GPU: Thousands of cores doing same operation │ │
│ │ on different data (shader programs) │ │
│ │ • Examples: SSE, AVX, NEON, CUDA │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 3. THREAD-LEVEL PARALLELISM (TLP): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Multi-core: Multiple CPU cores on one chip │ │
│ │ • Multi-threading: Hardware threads (SMT/HT) │ │
│ │ • Simultaneous Multithreading (SMT): │ │
│ │ Intel Hyper-Threading = 2 threads per core │ │
│ │ Shares ALU, caches; duplicating registers/PC │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ PIPELINING vs PARALLEL PROCESSING: │
│ ┌────────────────────┬─────────────────────────────────┐ │
│ │ Pipelining │ Overlap stages of DIFFERENT │ │
│ │ │ instructions (temporal parallelism)│ │
│ │ Parallel Proc │ Execute MULTIPLE instructions │ │
│ │ │ simultaneously (spatial parallel) │ │
│ │ Combination │ Superscalar pipeline: pipeline + │ │
│ │ │ multiple execution units │ │
│ └────────────────────┴─────────────────────────────────┘ │
│ │
│ SPEEDUP COMPARISON: │
│ Pipeline (5 stages, 100 instr): S = 500/104 = 4.81x │
│ Parallel (4 cores, 100 instr): S = 400/100 = 4.0x │
│ Pipeline + Parallel: S ≈ 4.81 × 4 = 19.24x (ideal) │
└─────────────────────────────────────────────────────────────────┘
The maximum speedup is 1 / (1 - P). For P = 0.95, the maximum speedup is only 20x no matter how many cores you add. This is why algorithm efficiency matters: parallelism cannot overcome an inherently sequential algorithm.
RISC (Reduced Instruction Set Computer) uses a small set of simple, uniform instructions that each execute in one clock cycle. It relies on a large register file (32+ registers) and on the compiler to optimize instruction scheduling. Examples: ARM (mobile phones, Apple Silicon M1/M2), RISC-V (open-source, growing adoption), MIPS (networking, embedded). CISC (Complex Instruction Set Computer) has many complex instructions that can operate directly on memory, with variable instruction lengths. It uses microcode to decode complex instructions into simpler micro-operations internally. Example: x86/x64 (Intel, AMD desktop/server CPUs). Modern x86 CPUs are actually hybrids: they decode CISC instructions into RISC-like micro-ops (μops) before execution. Key trade-off: RISC is simpler, faster per instruction, and more power-efficient; CISC has better code density (fewer bytes per instruction) and backward compatibility.
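The "decode CISC into RISC-like micro-ops" idea can be sketched as a toy decode table; the mnemonics and μop splits below are purely illustrative, not real x86 decode rules:

```python
# Toy model of the modern-x86 hybrid approach: a CISC-style instruction
# with a memory operand is broken into RISC-like micro-ops (load/ALU/store),
# while a pure register instruction maps 1:1.

DECODE = {
    # CISC "ADD reg, [mem]" -> load the operand, then a register-only add
    "ADD_R_M": ["LOAD tmp, [mem]", "ADD reg, reg, tmp"],
    # CISC "ADD [mem], reg" -> load, add, store back (read-modify-write)
    "ADD_M_R": ["LOAD tmp, [mem]", "ADD tmp, tmp, reg", "STORE [mem], tmp"],
    # Register-to-register op needs no memory micro-ops
    "ADD_R_R": ["ADD reg, reg, reg2"],
}

def decode(instr):
    """Return the micro-op sequence for one CISC-style instruction."""
    return DECODE[instr]

print(decode("ADD_M_R"))       # 3 micro-ops: load, add, store
print(len(decode("ADD_R_R")))  # 1
```

The back end only ever sees the simple, uniform micro-ops, which is how x86 keeps CISC code density and backward compatibility while executing on RISC-style pipelines.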
Step 1: Calculate fields. Block size = 16 bytes → offset = log₂(16) = 4 bits. Number of sets = 64/4 = 16 sets → set index = log₂(16) = 4 bits. Tag = 32 - 4 - 4 = 24 bits. Address format: [Tag:24][Set:4][Offset:4]. Step 2: Map addresses. The set index is bits [7:4] of the address: 0x00 → Set 0, Tag 0; 0x10 → Set 1; 0x20 → Set 2; 0x30 → Set 3; 0x00 → Set 0 again. Each address starts a distinct 16-byte block (0x00-0x0F, 0x10-0x1F, 0x20-0x2F, 0x30-0x3F), so no two accesses share a cache line. Step 3: Access pattern: 0x00 = MISS (cold), 0x10 = MISS (cold), 0x20 = MISS (cold), 0x30 = MISS (cold), 0x00 = HIT (its block is still cached in Set 0). Hit ratio = 1/5 = 20%.
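The field split and hit count can be sketched as a few lines of Python, assuming the 4-bit offset / 4-bit index / 24-bit tag breakdown above; the tiny 5-access trace never fills a set, so this sketch ignores evictions:

```python
# Address breakdown: [Tag:24][Set:4][Offset:4] for a 32-bit address.
OFFSET_BITS = 4   # 16-byte blocks
INDEX_BITS = 4    # 16 sets

def split_address(addr):
    """Return (tag, set index, block offset) for one address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def hit_ratio(accesses):
    """Count hits over a short trace (no evictions occur in 5 accesses)."""
    cache = {}  # set index -> set of tags currently resident
    hits = 0
    for addr in accesses:
        tag, index, _ = split_address(addr)
        if tag in cache.get(index, set()):
            hits += 1
        else:
            cache.setdefault(index, set()).add(tag)
    return hits / len(accesses)

trace = [0x00, 0x10, 0x20, 0x30, 0x00]
print([split_address(a)[1] for a in trace])  # set indices: [0, 1, 2, 3, 0]
print(hit_ratio(trace))                      # 0.2
```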
Pipeline hazards are situations that prevent the next instruction from executing in its designated clock cycle. Three types: (1) Data hazards — RAW (Read After Write) is the most common: e.g., ADD R1,R2,R3 followed by SUB R4,R1,R5 — the ADD hasn't written R1 yet when SUB needs it. Solutions: data forwarding/bypassing (route the ALU result directly to the next instruction's input), stall insertion (insert NOP bubbles), compiler scheduling (reorder independent instructions). (2) Control hazards — branches change the PC before the pipeline knows the outcome. Solutions: branch prediction (static: predict not-taken; dynamic: 2-bit saturating counter, BTB), delayed branch (fill the branch delay slot with a useful instruction). (3) Structural hazards — two instructions need the same hardware unit in the same cycle. Solutions: resource duplication (e.g., Harvard architecture with separate instruction/data memories), pipeline stall.
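The RAW case above can be sketched as a toy hazard check. This assumes the classic 5-stage pipeline where the register file is written in the first half of a cycle and read in the second, so a back-to-back dependent pair costs 2 bubbles without forwarding and 0 with ALU-to-ALU forwarding; instruction encoding and names are illustrative:

```python
# Instructions are modeled as (dest, src1, src2) register tuples, e.g.
# ADD R1, R2, R3  ->  ("R1", "R2", "R3")

def raw_hazard(producer, consumer):
    """True if `consumer` reads a register `producer` has not yet written back."""
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

def stall_cycles(producer, consumer, forwarding):
    """Bubbles between back-to-back dependent ALU instructions:
    2 without forwarding (wait for the split-cycle register write),
    0 with forwarding."""
    if not raw_hazard(producer, consumer):
        return 0
    return 0 if forwarding else 2

add_i = ("R1", "R2", "R3")   # ADD R1, R2, R3
sub_i = ("R4", "R1", "R5")   # SUB R4, R1, R5  (reads R1 -> RAW hazard)
print(stall_cycles(add_i, sub_i, forwarding=False))  # 2
print(stall_cycles(add_i, sub_i, forwarding=True))   # 0
```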
Step 1: Sign = negative → S = 1. Step 2: Convert 6.5 to binary: 6 = 110₂; 0.5 = 0.1₂ → 6.5 = 110.1₂. Step 3: Normalize: 110.1 = 1.101 × 2². Step 4: Exponent = 2 + 127 = 129 = 10000001₂. Step 5: Mantissa = 10100000000000000000000 (take the fraction after the implicit 1, pad to 23 bits). Result: 1 10000001 10100000000000000000000. Hex: C0D00000. Verification: (-1)¹ × 1.101₂ × 2^(129-127) = -1 × (1 + 0.5 + 0.125) × 4 = -1 × 1.625 × 4 = -6.5 ✓. Special cases to mention: +0 = 00000000 (all zeros), -0 = 80000000 (sign bit set), +Infinity = 7F800000 (E = 255, M = 0), NaN = 7FC00000 (E = 255, M ≠ 0).
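The hand encoding can be cross-checked with Python's `struct` packing of IEEE 754 single precision:

```python
import struct

# Pack -6.5 as big-endian binary32, then reinterpret the 4 bytes as an int.
bits = struct.unpack(">I", struct.pack(">f", -6.5))[0]
print(f"{bits:08X}")             # C0D00000

sign = bits >> 31                # 1  (negative)
exponent = (bits >> 23) & 0xFF   # 129 = 2 + bias 127
mantissa = bits & 0x7FFFFF       # 0b101 followed by 20 zeros

# Reconstruct the value: (-1)^S x 1.M x 2^(E-127)
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
assert value == -6.5
```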
Part 1: Speedup with 4 processors. P = 0.70, N = 4. S = 1 / [(1 - 0.70) + (0.70 / 4)] = 1 / [0.30 + 0.175] = 1 / 0.475 = 2.105x. New execution time = 100 / 2.105 = 47.5 seconds. Part 2: Processors for 2x speedup. S = 2, P = 0.70. 2 = 1 / [0.30 + 0.70/N] → 0.30 + 0.70/N = 0.50 → 0.70/N = 0.20 → N = 0.70/0.20 = 3.5 → 4 processors (round up, since you can't have a fractional processor; with N = 3: S = 1 / [0.30 + 0.233] = 1.88x < 2x). Maximum theoretical speedup (N → ∞) = 1/0.30 = 3.33x. This means no matter how many cores you add, the program will never be more than 3.33x faster — the 30% sequential portion is the hard limit. Interview tip: Always mention Amdahl's Law when discussing multicore scaling, and its implication that improving the sequential fraction (reducing 1 - P) is often more effective than adding cores.
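The arithmetic in these worked examples can be reproduced with a small helper (function names are illustrative):

```python
# Amdahl's law: speedup for parallel fraction p on n processors.

def speedup(p, n):
    """S = 1 / ((1 - p) + p / n)."""
    return 1 / ((1 - p) + p / n)

def max_speedup(p):
    """Limit as n -> infinity: 1 / (1 - p)."""
    return 1 / (1 - p)

print(round(speedup(0.70, 4), 3))   # 2.105 -> the 100 s job drops to ~47.5 s
print(round(max_speedup(0.70), 2))  # 3.33  -> hard ceiling from the 30% serial part

# Smallest processor count reaching a 2x target for p = 0.70:
n = next(k for k in range(1, 100) if speedup(0.70, k) >= 2)
print(n)                            # 4
```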