Why Data Layout Matters More Inside an SGX Enclave: A Rust Perspective


You've probably come across the term TEE (Trusted Execution Environment). At its core, a TEE lets you run code inside a protected region of memory that even the operating system and hypervisor cannot access. Intel offers SGX and TDX, AMD has SEV — the implementations differ but the principle is the same. To understand why this is remarkable, you need to understand how privilege works on a modern CPU.

Privilege Rings

x86 processors organize software into privilege rings — numbered 0 through 3, where lower numbers mean higher privilege.

Ring 3 is where your normal applications live — browsers, games, your Rust binary. They have the least privilege and can only access their own memory.

Ring 0 is the kernel. It has full access to all of physical memory, all hardware devices, and can read or write any process's memory. When your application makes a syscall, it's asking Ring 0 to do something on its behalf.

Below Ring 0, modern systems add Ring -1 for the hypervisor — the software that manages virtual machines. The hypervisor can inspect and modify everything the kernel sees, and the kernel can inspect everything your application sees. It's a hierarchy of total visibility downward.

This means if your kernel is compromised, every application's memory is exposed. If the hypervisor is compromised — as can happen in a cloud environment where you don't control the host — the attacker can read the kernel and every application running on it. In a traditional architecture, there is no way for user code to hide from the layers below it.

SGX breaks this model.

Intel SGX introduces a protected memory region called the Enclave Page Cache (EPC), carved out of physical RAM. Code and data inside the EPC are encrypted by a hardware component called the Memory Encryption Engine (MEE), using a key generated by the hardware itself at boot and never exposed to software at any privilege level.

When the CPU executes enclave code, it enters enclave mode. In this mode, the privilege hierarchy is effectively inverted for the enclave's memory region — Ring 0 cannot read it, Ring -1 cannot read it, even a physical attacker probing the memory bus sees only ciphertext. The only place the data exists as plaintext is inside the CPU package itself — in the cache and registers.

This is what makes TEEs fundamentally different from every other isolation mechanism. Containers, VMs, process isolation — they all rely on a more privileged layer enforcing boundaries. SGX enforces the boundary in hardware, against all software, regardless of privilege.

But this hardware-enforced isolation comes at a cost. The MEE encryption that protects your data in DRAM doesn't disappear just because your code is running correctly — it's always there, on every memory access that reaches beyond the CPU cache. Understanding where that cost shows up and how it compounds is what the rest of this article will explore.

Understanding Cache

To understand why cache misses are more expensive inside an enclave, you first need to understand how the CPU accesses data.

Your code is compiled down to assembly instructions. When the CPU executes them, it needs both the instructions and the data they operate on. This data lives in a hierarchy of storage — each level faster but smaller than the one below it.

At the top are the CPU registers — they're part of the execution pipeline itself, effectively zero cost. Below them are three levels of cache: L1 is the smallest and fastest at around 4 cycles, L2 is larger at around 12 cycles, and L3 is shared across cores at around 40 cycles. At the bottom is main memory (DRAM), which costs around 200 cycles or more to access.

When the CPU needs a piece of data, it checks each level in order — registers first, then L1, L2, L3, and finally DRAM. If the data is found in L1, you pay 4 cycles. If it's not found until DRAM, you pay 200+. A cache miss is what happens when the data isn't in any cache level and the CPU has to go all the way to main memory.

The performance implication is straightforward: the more your program accesses data that's already in cache, the faster it runs. Writing code with cache-friendly access patterns — sequential memory access, keeping hot data contiguous, avoiding pointer chasing across the heap — is how high-performance systems minimize these misses. This is data-oriented design in a nutshell, and it matters regardless of whether you're running inside a TEE or not.

But inside an enclave, the penalty for a cache miss gets worse.

When the CPU fetches data from the EPC in DRAM, it doesn't just read the bytes — the Memory Encryption Engine must decrypt the data and verify its integrity before it enters the cache. This involves AES decryption operations on every cache line, a Merkle tree traversal to verify the data hasn't been tampered with, and counter updates that cause additional write amplification even on reads. The decrypted plaintext then enters the cache, where it's accessible at normal speed — and still protected from the OS and hypervisor by the enclave mode we discussed earlier.

The critical insight is that cache hits cost the same inside and outside an enclave. The MEE is only involved when data crosses the boundary between the CPU package and DRAM. Here's what that looks like:

| Access | Normal execution | Inside enclave |
| --- | --- | --- |
| Register | ~0 cycles | ~0 cycles |
| L1 cache hit | ~4 cycles | ~4 cycles |
| L2 cache hit | ~12 cycles | ~12 cycles |
| L3 cache hit | ~40 cycles | ~40 cycles |
| L3 miss → DRAM | ~200 cycles | ~200 cycles + MEE decryption + integrity verification |

Every row except the last is identical. The entire performance overhead of running inside an enclave is concentrated in that single row — the L3 cache miss. This means every cache miss you prevent by designing better data layouts saves you not just the normal DRAM latency, but also the MEE overhead on top of it. The same optimization that gives you a 2x improvement outside an enclave might give you a 3-4x improvement inside one, because each avoided miss was costing you more to begin with.

For most applications, occasional cache misses are absorbed across millions of operations and nobody notices. Inside an enclave, they compound. And if your application's data structures are designed in a way that causes frequent cache misses — which, as we'll see in the next section, idiomatic Rust often encourages — the performance impact becomes severe.

Cache Lines

To understand how data layout affects cache performance, we need to understand cache lines — the fundamental unit of data transfer between main memory and the CPU cache.

L1, L2, and L3 caches are all organized into fixed-size blocks called cache lines. On most x86 machines, a cache line is 64 bytes. When the CPU needs a piece of data that isn't already in cache, it doesn't fetch just that byte or that field — it fetches the entire aligned 64-byte block that contains it. If you access a single byte at address 70, the CPU loads bytes 64 through 127 into a cache line. The remaining 63 bytes come along for free.

This is where data layout becomes a performance decision. The CPU doesn't know what data is "related" — it blindly fetches 64-byte blocks based on the addresses your code touches. If the data you need next happens to fall within the same 64-byte block, it's already in cache. If it doesn't, you pay for another cache line fetch. How your data is arranged in memory determines which of these outcomes you get.

When data doesn't fit cleanly within a single cache line, it spans multiple lines. Consider a struct that is 80 bytes. The first 64 bytes occupy one cache line, and the remaining 16 bytes spill into the next. Every time you access this struct, you potentially pull in two cache lines — 128 bytes of cache space for 80 bytes of data.

For a single struct, this is a minor cost. But consider iterating over a thousand of these structs. Each one touches two cache lines. That's two thousand cache line fetches where, with a different layout, one thousand might have sufficed. And inside an enclave, every one of those extra fetches that misses L3 carries the added MEE overhead from Section 2. The waste compounds.

Worse still, you may not need all 80 bytes on every access. If your hot loop only reads a single 8-byte field from each struct, you're pulling in 128 bytes to use 8 — filling the cache with 120 bytes of data you don't need, potentially evicting data you do.

The Cost of Data Layout

Now that we understand cache lines and how the CPU fetches data, let's see what happens in practice. We'll compare two layouts for the same data — a collection of financial transactions — and measure the difference in timing, cache misses, and generated assembly.

The setup

Consider a transaction record with 120 bytes of data:

#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum TxStatus {
    Pending = 0,
    Completed = 1,
    Failed = 2,
    Cancelled = 3,
}

#[repr(C)] // pin field order; the manual padding below assumes C layout
pub struct Transaction {
    pub amount: f64,         //  8 bytes  ← hot field
    pub timestamp: u64,      //  8 bytes
    pub account_id: u64,     //  8 bytes
    pub status: TxStatus,    //  1 byte   ← hot field
    pub _pad: [u8; 7],       //  7 bytes
    pub category: u64,       //  8 bytes
    pub metadata: [u8; 80],  // 80 bytes  ← cold, never read in hot path
}

This is the Array of Structs (AoS) layout. A collection is simply Vec<Transaction>. In memory, each 120-byte record sits next to the previous one, so record i begins at byte offset 120 × i from the start of the allocation.

Now the same data in Struct of Arrays (SoA) layout — each field gets its own contiguous array:

pub struct Transactions {
    pub amounts: Vec<f64>,
    pub timestamps: Vec<u64>,
    pub account_ids: Vec<u64>,
    pub statuses: Vec<TxStatus>,
    pub categories: Vec<u64>,
    pub metadata: Vec<[u8; 80]>,
}

We test three common queries. First, summing all completed transaction amounts — the most typical aggregation in any ledger system:

// AoS: loads 120 bytes per record to read 9 bytes (amount + status)
pub fn sum_completed_aos(txns: &[Transaction]) -> f64 {
    let mut sum = 0.0f64;
    for tx in txns {
        if tx.status == TxStatus::Completed {
            sum += tx.amount;
        }
    }
    sum
}

// SoA: only amounts and statuses arrays enter the cache
pub fn sum_completed_soa(txns: &Transactions) -> f64 {
    let mut sum = 0.0f64;
    for i in 0..txns.len() {
        if txns.statuses[i] == TxStatus::Completed {
            sum += txns.amounts[i];
        }
    }
    sum
}

Second, counting transactions in a time range — only needs the timestamp field:

// AoS: loads 120 bytes to read 8 bytes
pub fn count_in_range_aos(txns: &[Transaction], start: u64, end: u64) -> usize {
    let mut count = 0usize;
    for tx in txns {
        if tx.timestamp >= start && tx.timestamp <= end {
            count += 1;
        }
    }
    count
}

// SoA: only timestamps array is loaded — 8 values per cache line
pub fn count_in_range_soa(txns: &Transactions, start: u64, end: u64) -> usize {
    let mut count = 0usize;
    for i in 0..txns.len() {
        if txns.timestamps[i] >= start && txns.timestamps[i] <= end {
            count += 1;
        }
    }
    count
}

Third, total volume — the purest test, touching only a single field:

// AoS: loads 120 bytes to read 8 bytes
pub fn total_volume_aos(txns: &[Transaction]) -> f64 {
    let mut sum = 0.0f64;
    for tx in txns {
        sum += tx.amount;
    }
    sum
}

// SoA: tight loop over contiguous f64s — perfect for the prefetcher
pub fn total_volume_soa(txns: &Transactions) -> f64 {
    let mut sum = 0.0f64;
    for i in 0..txns.len() {
        sum += txns.amounts[i];
    }
    sum
}

Both versions compute the same result. The algorithm is identical. The only difference is where the data lives in memory.

The benchmark results

Tested on AMD x86 with 1 million transactions:

| Operation | AoS | SoA | Speedup |
| --- | --- | --- | --- |
| sum_completed (amount + status) | 6.74 ms | 2.29 ms | 2.9x |
| count_in_range (timestamp only) | 5.89 ms | 892 µs | 6.6x |
| total_volume (amount only) | 5.13 ms | 891 µs | 5.8x |

At 1,000 elements, both layouts performed identically — confirming that when data fits in cache, layout is irrelevant. The gap opens once the dataset exceeds cache capacity:

| Size | total_volume AoS | total_volume SoA | Speedup |
| --- | --- | --- | --- |
| 1K | 799 ns | 799 ns | 1.0x |
| 10K | 8.46 µs | 8.20 µs | 1.0x |
| 100K | 508 µs | 82.6 µs | 6.1x |
| 1M | 5.13 ms | 891 µs | 5.8x |

Same algorithm, same Big O complexity — the only variable is whether the data fits in cache.

The cache miss evidence

Timing shows the "what" but not the "why." To prove this is a cache problem, we built standalone binaries that run the hot loops under perf:

// bench_aos.rs
use soa_perf_rs::*;

fn main() {
    let n = 1_000_000;
    let txns = generate_aos(n);

    let mut total = 0.0f64;
    for _ in 0..100 {
        total += sum_completed_aos(&txns);
        total += total_volume_aos(&txns);
    }
    println!("AoS total: {total}");
}
// bench_soa.rs
use soa_perf_rs::*;

fn main() {
    let n = 1_000_000;
    let txns = generate_soa(n);

    let mut total = 0.0f64;
    for _ in 0..100 {
        total += sum_completed_soa(&txns);
        total += total_volume_soa(&txns);
    }
    println!("SoA total: {total}");
}

Running these under perf stat:

perf stat -e cache-misses,cache-references,L1-dcache-load-misses \
    ./target/release/bench_aos

perf stat -e cache-misses,cache-references,L1-dcache-load-misses \
    ./target/release/bench_soa

The results:

| Metric | AoS | SoA | Ratio |
| --- | --- | --- | --- |
| Cache misses | 182,369,275 | 1,527,908 | 119x fewer |
| Cache references | 416,787,330 | 64,746,058 | 6.4x fewer |
| L1 data cache misses | 245,272,982 | 31,354,700 | 7.8x fewer |
| Cache miss rate | 43.76% | 2.36% | — |

AoS misses 43.76% of all cache references — nearly every other memory access goes all the way to DRAM. SoA misses 2.36% — almost everything is served from cache. 182 million versus 1.5 million cache misses, for the same computation producing the same result.

Looking at the generated assembly for total_volume confirms what's happening at the instruction level.

AoS inner loop:

addsd xmm0, qword ptr [rax + 80]
addsd xmm0, qword ptr [rax + 200]
addsd xmm0, qword ptr [rax + 320]
addsd xmm0, qword ptr [rax + 440]
addsd xmm0, qword ptr [rax + 560]
addsd xmm0, qword ptr [rax + 680]
addsd xmm0, qword ptr [rax + 800]
addsd xmm0, qword ptr [rax + 920]
add rax, 960

SoA inner loop:

addsd xmm0, qword ptr [rcx + 8*rsi]
addsd xmm0, qword ptr [rcx + 8*rsi + 8]
addsd xmm0, qword ptr [rcx + 8*rsi + 16]
addsd xmm0, qword ptr [rcx + 8*rsi + 24]
addsd xmm0, qword ptr [rcx + 8*rsi + 32]
addsd xmm0, qword ptr [rcx + 8*rsi + 40]
addsd xmm0, qword ptr [rcx + 8*rsi + 48]
addsd xmm0, qword ptr [rcx + 8*rsi + 56]
add rsi, 8

The compiler unrolled both loops 8 times — it did its best in both cases. But look at the stride:

In AoS, each load is 120 bytes apart (200 - 80 = 120). Eight unrolled iterations span 960 bytes — 15 cache lines, most of which contain metadata, account_ids, and other fields the loop never reads.

In SoA, each load is 8 bytes apart. Eight iterations span 64 bytes — exactly one cache line, and every byte in it is an amount value that gets used.

Neither version is SIMD-vectorized — both use addsd (scalar double addition) because the running sum creates a data dependency that prevents parallel execution. This means the entire 5.8x speedup is purely from cache behavior. The compiler generated equally good code for both — it simply cannot fix a data layout that wastes cache lines.

What this means inside a TEE

These benchmarks ran on normal hardware. Inside an SGX enclave or TDX confidential VM, every L3 cache miss that reaches DRAM triggers MEE decryption and integrity verification on top of the ~200 cycle DRAM latency. The 182 million cache misses from the AoS layout would each pay that additional penalty. The 1.5 million misses from SoA would too — but there are 119 times fewer of them.

The performance gap that's already 5.8x on normal hardware would widen inside a TEE, because the per-miss cost is higher. The same data layout optimization that saves you milliseconds outside an enclave saves you disproportionately more inside one.

Data layout is not a micro-optimization. It is, for any performance-sensitive application running inside a trusted execution environment, a first-order architectural decision.

Reproduce it yourself

All benchmarks, perf binaries, and assembly inspection instructions are available at github.com/felixfaisal/soa-perf-rs. Clone it, run cargo bench, and see the numbers on your own hardware.

What's next

In a follow-up article, I'll run these same benchmarks on real SGX hardware to measure the actual MEE overhead on cache misses. I'll also compare the performance across the three major frameworks for writing SGX applications — Gramine, Apache Teaclave SGX SDK, and Fortanix EDP — to see whether the framework abstraction layer itself introduces additional cache pressure, or whether the layout penalty is purely hardware-bound as the architecture suggests.