Excellent! Now I have all the pieces. Let me summarize:
Key structures:
- tls_context: 328 bytes, tx_conf at offset 20 (bits 0-2), rx_conf at offset 20 (bits 3-5), sk_proto at offset 152, rcu at offset 312
- Qdisc: 320 bytes base + privdata, enqueue at offset 0, dequeue at offset 8, ops at offset 24, hash at offset 40 (this is important for the stack pivot!)
- bucket_table: ~576 bytes with 64 initial entries → kmalloc-1024
tls_sk_proto_close is the critical path: it reads tx_conf and rx_conf from the controlled context, potentially calls tls_sw_cancel_work_tx if TLS_SW is set, locks the socket, and then either frees the context or calls cleanup depending on the configuration values before invoking the original close handler.
The key insight is that when close(tls2) is called on the freed memory, it dereferences sk_proto at offset 152 to call the original protocol's close function. Without INIT_ON_FREE, the freed chunk keeps its old data; only the SLUB freelist pointer is written into it, at a random offset (SLAB_FREELIST_RANDOM). So most of the tls_context data, including sk_proto, is still valid and the call succeeds rather than crashing. CONFIG_INIT_ON_ALLOC_DEFAULT_ON only matters if the chunk gets reallocated first: the new allocation is zeroed, which would wipe sk_proto and cause the crash. As long as the freelist pointer doesn't overlap with sk_proto (or tx_conf/rx_conf), the function should work.
With SLAB_FREELIST_RANDOM, the freelist pointer is at a random but fixed offset within the slab. For kmalloc-512 objects, the offset could be anywhere in the first 512 bytes. There's a chance it overlaps with sk_proto, but it's random.
The reference exploit handles this by grooming the heap carefully so that the freed tls_context chunk is at a specific position in the slab, closing the first socket, waiting for RCU grace period, then closing the second one. The freed chunk's data remains mostly intact. Since lock_sock only locks the socket itself and not the tls_context, and the function only reads from specific offsets in the freed chunk (tx_conf, rx_conf, sk_proto, sk_write_space), as long as the freelist pointer doesn't corrupt those fields, the exploit succeeds and eventually triggers kfree_rcu on the context.
Now I'm planning the ROP chain and overall exploit strategy. I found a stack-pivot gadget that does push rsi; jmp [rsi+0x41], but I need to verify whether 0x41 falls within a controllable field of the Qdisc structure. Looking at the Qdisc layout, the hash field is at offset 40 (16 bytes), so 0x41 is just past that, into the dev_queue field:
- offset 0: enqueue (8 bytes)
- offset 8: dequeue (8 bytes)
- offset 16: flags (4 bytes)
- offset 20: limit (4 bytes)
- offset 24: ops (8 bytes)
- offset 32: stab (8 bytes)
- offset 40: hash (16 bytes) → hlist_node has next and pprev
- offset 56: handle (4 bytes)
- offset 60: parent (4 bytes)
- offset 64: dev_queue (8 bytes)
So [rsi+0x41] = offset 65 falls within dev_queue, which starts at offset 64. Since we're controlling the entire structure through user_key_payload, the actual field boundaries don't matter—I just need to place the right bytes at the right offsets. At offset 0x41, I'll write the address of my pop rsp; pop r15; ret gadget (0xffffffff810a4f1e).
When the pivot executes, it pushes rsi (our Qdisc pointer), then jumps to that gadget. The pop rsp sets the stack pointer to rsi, pop r15 discards the pivot gadget address from offset 0, and ret pops the first real ROP gadget from offset 8 to continue the chain.
Now I'm checking the actual struct layout—user_key_payload has an rcu_head (16 bytes), a 2-byte datalen, 6 bytes of padding, and the data array starting at offset 24. Looking at key_alloc, it subtracts the size of the user_key_payload structure from the requested size before allocating, so when the reference exploit passes 512+1 bytes, the actual payload data ends up being 489 bytes. The exploit author defines their own version of the structure with different field sizes than pahole shows, which changes how that size calculation works out. The 513-byte payload lands in dyn-kmalloc-1k on kernel 6.1 or kmalloc-1024 on 5.15, and I'm tracing through the user_key_payload layout to pin down where datalen sits relative to the flexible array.
For the KASLR leak, the attack creates a UAF on the bucket_table by triggering the fqdir overlap, then replaces it with a user key of matching size. When the fqdir gets freed, the user key gets freed too, and the exploit sprays Qdisc objects to reclaim that memory slot—reading the key then exposes the Qdisc's function pointers for the leak.
Now I'm checking the size calculations for the tbf Qdisc objects. The struct size plus the private data would put it in kmalloc-512, but that doesn't overlap with the kmalloc-1024 user key. Looking at the reference exploit on 6.1, the tbf Qdisc is actually allocated in dyn-kmalloc-1k, so I need the actual priv_size for tbf to confirm the total allocation size. From the struct definitions, psched_ratecfg holds rate_bytes_ps, mult, shift, overhead, and linklayer—together around 20 bytes—and qdisc_watchdog has an hrtimer, which is quite substantial on its own with the timerqueue_node, softexpires, function pointer, base pointer, and various state flags.
So the total size of tbf_sched_data comes to roughly 174 bytes when I add up all the components, and combined with the base Qdisc structure of 320 bytes, that's about 494 bytes total—which means it'll allocate from the kmalloc-512 bucket on most kernel versions, though on newer kernels with dynamic kmalloc it might spill into the 1k bucket depending on exact cache boundaries.
The bucket_table needs around 576 bytes and the user key is 513 bytes, both landing in kmalloc-1024, but since the tbf Qdisc itself is in kmalloc-512, I can't directly overlap them the way the reference exploit does. This breaks the KASLR leak strategy because the cache sizes don't align.
I need to either find a different kmalloc-1024 object with kernel pointers I can leak, or adjust the user key size to fit in kmalloc-512 so it overlaps with the tbf Qdisc instead—let me reconsider the approach here.
Actually, pipe_buffer arrays are perfect for this. When you create a pipe and write to it, the pipe_buffer array gets allocated in kmalloc-1024, and each buffer has an ops pointer that references kernel text, which I can leak through the user key.
But wait—if the key gets freed, can I still read it with keyctl_read? The kernel should reject reads on freed keys since it tracks key lifecycle, so I need to think through whether this actually works or if I'm missing something about how the key lookup and validation happens. Hmm, actually the way the exploit works is different. The fqdir teardown calls kfree on the memory that was the user_key_payload. But the kernel's keyring system still has a reference to this key (since it was added to KEY_SPEC_PROCESS_KEYRING). The key's memory was freed by kfree but the keyring still thinks it's valid. So keyctl_read would try to read from the freed (now reused) memory → we get the pipe_buffer data!
The key struct itself persists separately from the payload data it points to. When we call keyctl_read, it dereferences the payload pointer, which now points to freed memory that's been reallocated to something else like a pipe_buffer. So we're reading that new object's data instead. This is the classic use-after-free pattern where the kernel object metadata outlives the data it references.
Now I need to verify the exact freeing mechanism—whether bucket_table_free uses RCU or synchronous freeing, since that affects the timing window for the exploit. bucket_table_free is called during fqdir teardown: rhashtable_free_and_destroy iterates through the bucket table, calling the destroy callback on each entry before freeing the table itself. So the user_key_payload gets destroyed first, then the bucket_table is freed with kvfree.
The critical insight is that if we can make the user_key_payload look like a bucket_table structure, we control what gets interpreted as the size field. Since the rcu_head callback_head sits at the start of user_key_payload, those first few bytes determine how many buckets the code thinks it needs to iterate through.
Examining the callback_head structure, I see the reference exploit pre-conditions the chunk before allocating the key so that the rcu_head area is controlled. It first uses setxattr to allocate and zero a temporary buffer in the target slot, then immediately allocates the user_key_payload in that same memory slot—relying on INIT_ON_ALLOC to zero the structure and on the previous fill operation to ensure the residual data is clean. This way, the first two qwords of the rcu_head end up as zeros, making the kernel treat bucket_table->size and bucket_table->nest as zero.
Once the user_key_payload is freed, I need to spray kernel objects into that same kmalloc-1024 slot that contain kernel text pointers. The pipe_buffer approach works well for this: I create a pipe, write data to fill pipe_buffers, and each one contains a pointer to anon_pipe_buf_ops in kernel text.
Each pipe_buffer is 40 bytes, and with 16 buffers per pipe that's 640 bytes total, landing in kmalloc-1024. The ops pointer sits at offset 16 in each buffer, so in the sprayed allocation I can predict exactly where these kernel pointers land. When I read back the user_key_payload data starting at offset 24, I'll be able to leak those kernel text addresses.
But wait—to actually read the ops pointer at byte 16, I'd need to access data before the key's data region starts, which keyctl_read won't give me. The rcu_head sits before the data payload, so those bytes are inaccessible through the normal read interface.
Actually, I'm realizing the second pipe_buffer's ops pointer lands at offset 56 in the allocation, which maps to byte 32 in the key data I can read. Similarly, the third pipe_buffer's ops would be at byte 72 of the key data. So by reading the key, I can extract the ops pointers from the subsequent pipe_buffers that overlap with the key's data region.
But there's a critical problem: the user_key_payload's datalen field at offset 16 is only 2 bytes, and when keyctl_read is called, it only returns datalen bytes of data. If the freed memory gets replaced by pipe_buffers, the low 2 bytes of the first pipe_buffer's ops pointer (which sits at offset 16-17) become the new datalen value. Since ops points to kernel text with a high address, those low bytes might be small or zero, meaning keyctl_read could return nothing or very little data—not enough to reach the ops pointers I need at offsets 32 and 72.
However, I realize the datalen doesn't need to be exact; keyctl_read returns min(datalen, requested_size) bytes. So as long as the low 16 bits of the ops pointer are large enough (at least 33 bytes) to reach the second pipe_buffer's ops pointer, I can leak it. The question is whether anon_pipe_buf_ops's address has sufficiently large low bytes to make this work.
Actually, let me step back and reconsider the whole approach. The pipe_buffer method has several issues: the datalen field might be wrong, alignment is tricky, and it's more complex than needed. I should explore simpler alternatives like using simple_xattr objects with known kernel text pointers, or msg_msg structures for the leak, or even skipping the cache transfer entirely since tls_context, fqdir, and Qdisc are all in kmalloc-512.
But looking at the reference exploit more carefully, I see the real problem: on 5.15, the tbf Qdisc ends up in kmalloc-512 instead of kmalloc-1024, which breaks the whole chain. Instead of forcing everything through bucket_table, I should find a different object that's actually in kmalloc-1024 to use for the KASLR leak.
Actually, maybe I'm overcomplicating this. What if I just adjust the user key size to land in kmalloc-512 instead? That would let me overlap it with the tbf Qdisc directly, though then I'd lose the bucket_table overlap. The real constraint is that I need to work within 5.15's actual cache layout rather than trying to force the reference exploit's approach.
Let me step back and think about staying entirely within kmalloc-512 for this version. The double-free from closing the first tls context gives me a starting point there.
So the sequence would be: allocate tls_context in kmalloc-512, both sockets reference it through clone_ulp, close the first socket which triggers the RCU-delayed free, wait for that to complete so the slot goes back on the freelist, then close the second socket which operates on the now-freed memory. At that point the tx_conf and rx_conf fields are zeroed out from the original allocation, and I can work with the sk_proto pointer at offset 152.
The key insight is that after the first free completes, I can spray user_key_payload objects (sized to fit in kmalloc-512) to reclaim that same slot, and then when the second kfree_rcu fires from closing the second socket, it will free the slot again while it's occupied by my spray object—creating the use-after-free condition I need.
But there's a fundamental problem: when I close the second socket, it tries to call tls_sk_proto_close on what's now my user key data, and I'd need to forge a valid kernel pointer at offset 152 to survive that—except I don't have any kernel addresses yet since I haven't bypassed KASLR. The reference exploit handles this differently by closing the first socket instead, which suggests I need to rethink which socket to close and when.
Actually, looking at the next part: the trick is that close(tls1) frees the tls_context via kfree_rcu, but the memory still contains the original tls_context data since INIT_ON_FREE is disabled. So when close(tls2) runs and reads from that freed memory, the sk_proto pointer at offset 152 is still valid—it's the original kernel pointer, not something I need to forge. The spray objects then reclaim the freed slot, and when the second kfree_rcu fires, they get freed too.
But there's a complication: SLAB_FREELIST_HARDENED means the allocator writes an obfuscated freelist pointer at a random offset within the freed object, which corrupts some of the old tls_context data. If that corruption happens to hit offset 152 where sk_proto lives, close(tls2) crashes—though with random offsets the probability is only around 1.6%, so the reference exploit likely just worked through trial and error. There's also CONFIG_INIT_ON_ALLOC_DEFAULT_ON to worry about: if another object gets allocated from the freelist and then freed before step 4, the memory gets zeroed and the old tls_context data disappears entirely.
The real vulnerability window is between when kfree_rcu fires and close(tls2) executes—if anything allocates from kmalloc-512 during those 6 seconds, INIT_ON_ALLOC will zero the corrupted data. The reference exploit mitigates this by saturating kmalloc-512 with cbq_class and pfifo Qdisc objects to prevent other allocations from interfering.
But on 5.15 with INIT_ON_ALLOC enabled, this becomes problematic: if the kernel does grab the freed chunk (call it X) and zero it, then close(tls2) sees all fields as zero, including sk_proto, which causes a NULL pointer dereference. I need to find a way to exploit this on 5.15 without relying on the RCU grace period window being clean—maybe by closing tls2 immediately after tls1 without waiting for RCU to fire.
Actually, wait. If I close tls2 right away while the first kfree_rcu is still pending, X is still valid memory and tls_sk_proto_close runs normally. But then close(tls2) also calls kfree_rcu on the same object. The problem is that tls_context only has one rcu_head, so the second kfree_rcu call would overwrite the first one's callback pointer—meaning only the second callback gets queued, not both.
But actually, if both kfree_rcu calls happen before any grace period completes, the rcu_head is already on the callback list from the first call, and trying to add it again could corrupt the list structure itself. That's a crash waiting to happen.
So the reference exploit's approach of waiting for the first kfree_rcu to complete makes sense. The real challenge with INIT_ON_ALLOC is keeping X's data intact between when it's freed and when close(tls2) runs. Any kmalloc-512 allocation during that window would zero out X's memory. The heap grooming strategy is to saturate the allocator so no new allocations can happen in that critical window.
I need to verify the modules are actually available on the system though—let me check if modprobe can load cbq and tbf.
For the KASLR leak on 5.15, the tbf approach won't work since tbf_qdisc_ops is in a loadable module and the tbf Qdisc object lands in a different cache than my UAF'd key. I'm considering using pipe_buffer arrays instead, which are 640 bytes and fit in kmalloc-1024, and they contain function pointers I could leak. Pipes are universally available in Linux, so I can create them with a simple pipe() syscall and then write to trigger allocation of the pipe_buffer array. Looking at the struct definition for pipe_buffer on kernel 5.15, it contains a page pointer, offset, and length field, which should fit nicely into a kmalloc-1024 allocation.
Now I'm examining how the pipe_buffer operations pointer works—each active buffer points to either anon_pipe_buf_ops or page_cache_pipe_buf_ops, both located in kernel text. For the user key payload UAF, I need to account for the fact that a 513-byte user key also lands in kmalloc-1024, with its data starting at offset 24, and keyctl_read exposes that data from byte 0 onward.
Looking at the memory layout, the overlapping pipe_buffer array has its ops pointers at offsets 16, 56, and 96 in the allocation—which means the first one sits right before the user key data begins. The datalen field at offset 16 would overlap with the low 2 bytes of that first ops pointer, so if anon_pipe_buf_ops has a non-zero low 16-bit value, datalen becomes readable. I should verify the actual address of anon_pipe_buf_ops to confirm this will work, but the pipe_buffer approach seems promising even if datalen ends up being zero.
Actually, I'm realizing there's a cleaner path: I could use setxattr to pre-fill the kmalloc-1024 slot with controlled data that sets datalen to a large value, then free it and spray pipe_buffers to reclaim that memory. But that won't help because the pipe_buffer write would just overwrite the datalen field anyway.
Let me step back and try the pipe_buffer spray directly. The datalen field at offset 16 would get the low 2 bytes of the ops pointer from the first pipe_buffer, and since anon_pipe_buf_ops is a kernel text address, those low bits should be non-zero with high probability. I'm also thinking msg_msg structures might work better for both the spray and leak since they have a predictable layout and come in various sizes.
For a message queue approach, I'd allocate msg_msg structures with a 48-byte header plus data payload. Sending a 976-byte message would fit exactly into a kmalloc-1024 slab. The key fields are m_list which points to heap addresses, m_type and m_ts which are user-controlled, and the security field which either points to a SELinux blob or is NULL—none of these give me kernel text pointers directly, so I need to think about which field would actually leak something useful.
Let me pivot to pipe_buffers instead. I'd use the fqdir overlap technique to trigger a use-after-free on the bucket_table, then replace it with a user_key_payload structure. After freeing the key through fqdir teardown, I'd create pipes that write data into the freed slot, causing pipe_buffer arrays to reclaim it. Reading back through the pipe would leak the anon_pipe_buf_ops pointer from the second pipe_buffer's ops field, which sits at offset 56 in the allocation—or offset 32 within the user key's data region starting at offset 24.
The challenge is that the datalen field needs to be at least 40 to read far enough to capture the full 8-byte ops pointer, which means the low 16 bits of anon_pipe_buf_ops would need to be >= 40. But actually, there's a simpler path here—I could use a different object entirely for the KASLR leak instead of relying on pipe_buffers, maybe leveraging the Qdisc ops pointer after the user_key_payload is freed. Let me reconsider the whole exploit strategy for 5.15 from scratch.
What if I skip the bucket_table step entirely and just trigger a UAF directly in kmalloc-512 using the tls_context double-free? The Qdisc base struct is 320 bytes, which lands in kmalloc-512, so if I can overlap a user_key_payload with a Qdisc, the ops pointer at offset 24 would align perfectly with key_data[0..7]. Since built-in qdiscs like pfifo or noqueue have ops pointers in kernel text, this would give me a direct KASLR leak. The issue is that pfifo_fast is larger at around 896 bytes, so I need to verify which Qdisc size actually fits.
Let me check the FIFO qdiscs—pfifo and bfifo both use a minimal fifo_sched_data structure with just a 4-byte limit field, keeping the total Qdisc size at 320 bytes, which fits perfectly in kmalloc-512.
Since the ops pointer at offset 24 in the Qdisc structure points to kernel text addresses like pfifo_qdisc_ops, I can skip the cache transfer entirely and instead double-free the tls_context in kmalloc-512, replace it with a user_key_payload object, then spray pfifo Qdisc objects into that same cache by creating network interfaces with pfifo qdiscs attached—this lets me read the leaked kernel pointer from the user key.
The tricky part is that after closing the second TLS socket, the freed tls_context memory needs to stay valid enough to not crash during cleanup, which works because INIT_ON_FREE is disabled and the original data persists. I also need to time the spray carefully so something reclaims the freed slot before the kfree_rcu callback fires and frees the user key. Crucially, close(tls2)'s kfree_rcu is queued after tls_sk_proto_close returns, and the RCU grace period elapses later—so we have time to spray after close(tls2).
Actually, let me re-read the flow:
- close(tls1) → tls_sk_proto_close → kfree_rcu(X, rcu) → RCU callback queued
- Wait for RCU → callback fires → kfree(X) → X freed
- close(tls2) → tls_sk_proto_close on freed X → kfree_rcu(X, rcu) again → RCU callback queued
- Now X is on the freelist (from step 2) AND has a pending kfree_rcu (from step 3)
- Spray user_key_payload → one of them reclaims X
- RCU grace period passes → kfree(X) fires → user key at X is freed
- User key is freed but keyring still references it → UAF on user key
- Spray pfifo Qdisc → reclaims X
- Read user key → get Qdisc data → leak ops pointer → KASLR bypass
This is much simpler than the reference exploit's 3-stage approach! But there's a subtlety when close(tls2) calls tls_sk_proto_close on freed X—it reads X's data (which is the old tls_context data minus freelist pointer corruption) and calls the close function at offset 152 of X. If the freelist pointer corrupts that offset, the call might go to an invalid address.
Also, when we spray user_key_payload, INIT_ON_ALLOC zeros the memory at X before the user key header is written, and our data is copied starting at offset 24. The kernel sets the first 24 bytes (rcu_head + datalen). When the second kfree_rcu fires, X has been reclaimed by a user key, so kfree(X) frees the user key. But the kfree_rcu uses ctx->rcu which is at offset 312 of the original tls_context layout, while X is now a user_key_payload of 512 bytes—so the RCU head location doesn't match up anymore.
Actually, the RCU subsystem doesn't look at the rcu_head when executing the callback; it already saved the callback and object pointer, so it just calls kfree(ptr) where ptr = X regardless of what's there. The real problem is that the kfree_rcu from step 3 registered a callback with a pointer to X, and when we spray and reclaim X as a user key, we overwrite the rcu_head at offset 312 with user key data—but the RCU subsystem's callback list still has that pointer to X queued up.
When the grace period ends and the RCU subsystem tries to process that corrupted callback, it could crash the kernel. To prevent this, I need to either avoid overwriting the rcu_head at offset 312 or find a way to keep it intact while still reclaiming the memory. The reference exploit works by writing to freed memory at offset 312 through kfree_rcu, which still succeeds because the memory is accessible even after being freed. Then it sprays fqdir allocations to reclaim that slot, though there's a size mismatch since fqdir is 320 bytes but the allocation lives in kmalloc-512.
The key insight is that offset 312 falls within the fqdir's actual structure (bytes 0-319), so initializing the fqdir would normally corrupt the RCU callback. The timing must be crucial here—the reference exploit queues the kfree_rcu callback before the fqdir spray happens, so the RCU list gets corrupted in a controlled way.
Then the exploit waits 6 seconds for the RCU grace period to complete and the callback to fire, which attempts to free what's now the fqdir object. The corrupted rcu_head at offset 312 is the vulnerability—it's been overwritten with fqdir data, so when the callback tries to use it, things go wrong in a way that benefits the exploit. When the fqdir was allocated (via kzalloc + INIT_ON_ALLOC), the entire slot was zeroed. This zeroes the rcu_head at offset 312. Then the fqdir initialization writes its own data into the first 320 bytes.
The rcu_head callback list is corrupted because:
- close(tls2) wrote the rcu_head at offset 312 (next pointer + func)
- The RCU system's callback list links through this rcu_head
- fqdir allocation zeros offset 312 (zeroing the rcu_head)
- The RCU system's callback list now has a zeroed next pointer
If the RCU system traverses the list and hits a zeroed rcu_head, it might interpret the null function pointer as a valid callback and crash when trying to invoke it. Let me trace through how kfree_rcu actually works in kernel 5.15—newer kernels have bulk structures (kvfree_rcu_bulk_data) that batch many kfree_rcu calls without using the per-object rcu_head at all. Let me check if 5.15 uses this.
In 5.15, there's kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) and also kfree_rcu() which might use the batched version.
Actually, looking at the 5.15 implementation of kfree_rcu:
#define kfree_rcu(ptr, rhf) kvfree_rcu(ptr, rhf)
And kvfree_rcu uses the embedded rcu_head. So it does write into the object.
This is problematic. Let me trace through the reference exploit to see how it handles this. After closing tls2, the other tasks call unshare(CLONE_NEWNET) which allocates an fqdir object in the freed tls_context slot. But the kfree_rcu from close(tls2) already wrote to the rcu_head at offset 312, and then kzalloc zeros the entire 512-byte allocation including that offset.
So the rcu_head is corrupted before the callback fires. When the callback eventually fires, it dereferences a zeroed rcu_head, which means the func pointer is NULL or garbage. This causes a crash or arbitrary code execution depending on what's at that offset after fqdir_init writes its fields.
The key insight is that the timing window between the kfree_rcu and the callback execution is large enough that the fqdir allocation and initialization completes before the RCU callback fires. This corrupts the callback's function pointer.
Let me trace through the kfree_rcu implementation in 5.15 to understand how the rcu_head is actually used...
Examining the macro definition...
Continuing with the kvfree_call_rcu implementation...
The batching mechanism tries to add the pointer to a bulk queue, and if successful, returns early without using the rcu_head at all—instead storing the object pointer in a separate kvfree_rcu_bulk_data array. If batching fails, it falls back to a regular call_rcu. This batching was introduced in kernel 5.12, which means if the exploit targets a vulnerable kernel with this mechanism, overwriting the rcu_head would be safe since it's not actually part of the callback chain when batching succeeds.
Now I'm mapping out the exploit phases for kernel 5.15, starting with namespace setup and memory grooming to position the tls_context allocation in kmalloc-512, then moving into the trigger phase.
For the vulnerability itself, I'll clone the ULP to create two sockets sharing the same tls_context, close the first socket to initiate a kfree_rcu, wait for the grace period to complete, then close the second socket which will attempt another kfree_rcu on already-freed memory. I'll spray user_key_payload objects to reclaim that freed memory and create a use-after-free condition. Then in the KASLR bypass phase, I'll set up dummy network interfaces with qdisc objects in the same kmalloc-512 pool, read through the UAF'd user key to leak the qdisc ops pointer, and calculate the kernel base from there.
For RIP control, I need to free the user key to trigger the UAF on the Qdisc object, then spray new user keys with a controlled ROP chain to overwrite the Qdisc structure. When I send packets that trigger the qdisc->enqueue() function, it'll execute my stack pivot gadget and chain into code execution for privilege escalation.
Wait—I'm realizing there's an issue with the key being freed in phase 2 but still being readable in phase 3. The key subsystem should reject reads on a freed key, but the struct key object itself might still exist in memory even though its payload was freed, which is what creates the UAF window I'm exploiting.
The struct key remains valid and still holds a pointer to the freed user_key_payload memory. When keyctl_read looks up the key by ID, it finds the struct key, dereferences the payload pointer, and reads whatever data is now occupying that freed memory region—which in my spray phase is the Qdisc object I've allocated there.
The key subsystem doesn't realize the payload was freed externally, so it never updates its internal state. The real complication is that when the Qdisc gets allocated over the freed user_key_payload, it overwrites the datalen field at offset 16 with the low 16 bits of the Qdisc's flags value, which determines how much data gets read back. The key's data region starts at allocation offset 24—exactly where the Qdisc's ops pointer sits—so the ops pointer is readable as long as the forged datalen covers at least 8 bytes. Since flags is typically set to values like TCQ_F_ONETXQUEUE (16) for leaf qdisc instances, the low 16 bits would give datalen = 16, enough to read the full ops pointer and extract it. I need to trace through what flags a pfifo qdisc actually gets when created explicitly via netlink, since the initial allocation starts with flags at zero and various code paths then set them. Let me check the qdisc creation and enqueue logic to see which flags end up set.
Looking at the fifo initialization, I can see it's determining whether to set the TCQ_F_CAN_BYPASS flag based on the queue limit—for byte FIFOs it checks if the limit is at least the MTU, and for packet FIFOs it just needs a limit of 1 or more. If the bypass condition is met, it sets the flag, otherwise it clears it.
Now I'm realizing the critical constraint: the datalen field gets set to the low 16 bits of the flags value, which would be 4 (TCQ_F_CAN_BYPASS) in the bypass case. So when reading the ops pointer (which occupies 8 bytes), we'd only get its lower 4 bytes—on little-endian x86_64 that's a partial kernel address like 0x81xxxxxx instead of the full pointer.
The key insight is that kernel text addresses always have 0xffffffff as their high 32 bits due to the kernel's virtual address space layout, so even with just the lower 4 bytes, we can reconstruct the complete ops pointer by ORing with 0xffffffff00000000. From there, we can calculate the kernel base by subtracting the known offset of the ops structure.
Now I'm realizing the actual constraint: the u16 datalen field at bytes 16-17 limits us to reading only 4 bytes of data, which happens to be exactly the low 4 bytes of the ops pointer at offset 24 of the Qdisc structure. This gives us just enough to leak the pointer and derive the kernel base address.
The challenge is that KASLR randomizes the kernel base within a range, but all valid kernel addresses have 0xffffffff as their high 32 bits, so I can reconstruct the full address by combining the leaked low 32 bits with the known high bits.
However, this approach only gives me 4 bytes per read, which is tight for leaking a pointer. I'm reconsidering whether I can extract more data, maybe by manipulating the datalen field through a different mechanism, like spraying user keys with a specific pattern that sets a larger datalen value before triggering the leak. Still, the approach should work as-is: I can get the low 4 bytes of the ops pointer from keyctl_read and, combined with the known high bits, reconstruct the full kernel address. Let me verify that pfifo_qdisc_ops or bfifo_qdisc_ops are in the symbols and find their addresses, then check what keyctl_read actually returns with datalen=4 to confirm I'm reading the right bytes from the Qdisc structure. Now I'm moving on to Phase 4 and how to control RIP. The plan, building on the KASLR bypass:
- Free the UAF'd user key → this frees the Qdisc (UAF on Qdisc!)
- Wait for RCU (since key_free does kfree_rcu on the payload)
- Spray new user keys with ROP chain data → reclaim the Qdisc slot
- The Qdisc's enqueue pointer (offset 0) is now our stack pivot gadget
- Send packets to the interface → triggers enqueue → pivot → ROP → root
So when we revoke the key, the user key type's revoke method calls kfree_rcu on the Qdisc memory itself. After the RCU grace period completes, the Qdisc gets freed, creating the use-after-free. Then we spray new user keys with controlled data to reclaim that freed slot, placing our ROP chain at offset 0 where the enqueue function pointer lives.
The trick is that user key data starts at offset 24 in the allocation, so we can't directly control offset 0. The reference exploit handles this by using setxattr to write the first 24 bytes (including the enqueue pointer), then key_alloc to write the rest starting at offset 24.
But on 5.15 with INIT_ON_ALLOC enabled, kmalloc zeros the entire buffer before key_alloc writes to it, destroying the setxattr data at offsets 0-23. This means the setxattr trick won't work on this kernel configuration.
I need to find a different object type that can control those first 24 bytes. msg_msg might be the answer—it gives me full control starting from offset 48, but I need to check what's at the beginning of the structure to see if I can use it for the stack pivot instead. The INIT_ON_ALLOC feature zeros memory on every allocation through the slab allocator, not just at page allocation time. So the setxattr trick that relies on uninitialized memory won't work here since each kmalloc call explicitly memsets the allocated object to zero.
But wait—the kernel writes the key structure's header (rcu_head and datalen fields) after the zeroing happens, and the user-controlled data gets copied in at offset 24. So the first 24 bytes contain kernel-initialized fields and padding, while the rest comes from our buffer. I need to trace through what user_key_instantiate actually does to understand the exact memory layout. I'm realizing the user_key_payload structure doesn't give me control over offset 0 since the rcu_head gets zeroed by INIT_ON_ALLOC, so I need a different approach for Phase 4. I'm considering using sendmsg with MSG_DONTWAIT to allocate skb data buffers where I can control all 512 bytes from the start, or exploring the big_key type as an alternative primitive. Another option is setxattr with userfaultfd to pause the copy operation, leaving the allocation live with user-controlled data at offset 0. But that's getting complicated—let me reconsider. I'm realizing setxattr actually gives me full control over the entire 512-byte chunk: it allocates memory, copies my data into it starting at offset 0, then attempts to set the xattr attribute, which fails because I'm using XATTR_CREATE on an already-existing attribute.
The buffer gets freed after that failure, but the key point is that my controlled data was written into the kmalloc-512 slot. When key_alloc immediately grabs the same slot afterward, INIT_ON_ALLOC zeros the memory before the key structure is written—which blocks my exploit. However, I could use userfaultfd to interrupt the copy_from_user call mid-way: the allocation happens but the copy stalls, then I free the Qdisc in another thread to reclaim that slot, and when I resume the userfaultfd, the copy completes and overwrites the Qdisc with my data.
But that won't work because setxattr allocates its own buffer rather than reusing the Qdisc slot. I need a different angle—maybe leveraging add_key with a key type that doesn't have the header offset problem, or finding a way to write controlled data directly at offset 0 where the Qdisc's function pointers live. The real constraint is that user_key_payload only lets me control data starting at offset 24, but I need to overwrite the enqueue pointer at offset 0.
Actually, I'm realizing I could use a double-free strategy here. After the KASLR bypass, I have the kernel base and a UAF user key overlapping a Qdisc in kmalloc-512. I could free the user key to trigger the Qdisc's kfree_rcu, wait for RCU to complete, then allocate a new user key to reclaim that same slot—this gives me another chance to control the allocation and potentially reach offset 0.
But the INIT_ON_ALLOC problem persists: even if I use setxattr to write controlled data before freeing, the next allocation will zero everything out. Unless I can somehow leave the data in place without triggering another allocation afterward, which seems impossible since the kernel needs to use these slots.
Wait—what if I take a different approach entirely? Instead of trying to overwrite the Qdisc, I could use keyctl_update to modify the user key's payload in place. The key's payload was freed externally through the double-free, but the keyring still holds a reference to the old address. If I can update the key while it points to the Qdisc's memory, I might be able to write controlled data directly into the Qdisc structure without needing another allocation.
Actually, that won't work either—keyctl_update would try to free the old payload (which is now the Qdisc) and allocate a new one, putting me right back in the same situation where I can't control the initial offset.
Let me step back and look at how the reference exploit handles this on 6.1 with the dyn-kmalloc-1k allocator. They're using setxattr to fill a chunk and then allocating a user key in the same slab. The key insight is probably in how they're controlling the layout and what happens when those objects overlap.
On 6.1 without INIT_ON_ALLOC, the setxattr call writes controlled data starting at offset 0, then frees. When key_alloc grabs the same 512-byte slot, the first 24 bytes still contain the old setxattr data, and the user key payload writes its own data starting at offset 24. This gives them full control over the entire allocation. But on 5.15 with INIT_ON_ALLOC enabled, that stale data gets zeroed out, which breaks the exploit.
So I need a different strategy—I need to find a way to write non-zero data at offset 0 of a kmalloc-512 allocation. Either I allocate an object that naturally writes at offset 0 (like a Qdisc with a controlled enqueue function), or I find another heap manipulation technique that works around the zeroing.
The keyctl_update approach is tempting: free the Qdisc, then race to allocate a new key that lands in the same slot, potentially overwriting the enqueue pointer with my ROP chain. But that's circular—I'd need to already control the key data to write the pointer, which is what I'm trying to achieve in the first place.
Let me pivot to the fqdir technique instead. I can create a cross-cache UAF on the bucket_table in kmalloc-1024, then find an object I can fully control from offset 0 to overwrite it. The bucket_table is 576 bytes, and I can create a UAF on a user key in kmalloc-1024 by crafting the right data size. Now I'm looking for what objects in kmalloc-1024 give me full control from the start—sk_buff linear data is promising since I can control the size and send data via sendmsg, but I need to verify the allocation path.
Actually, I'm realizing the standard approaches like pipe_buffer and msg_msg don't give me clean control from offset 0 with INIT_ON_ALLOC enabled, since they have kernel structures at the beginning that I can't overwrite.
Let me pivot entirely—instead of trying to control the Qdisc enqueue function, what if I target modprobe_path? If I can write to it, I can point it to a script that reads the flag, and triggering a module load with an invalid header will execute that script as root. The user key UAF gives me a freed object I can replace and read from, but I need to figure out what else I can do with keyctl_update when the payload has been swapped out.
Actually, there's another angle: exploiting the tls_context UAF more directly. The tls_context contains a sk_proto pointer to a struct proto, and when the socket closes, it calls the close function through that pointer. If I can replace the tls_context with controlled data before closing, I control which function gets called. The challenge is that the tls_context lives in kmalloc-512, which limits what I can use to replace it.
The real blocker is that without knowing kernel addresses, I can't construct a fake struct proto with a meaningful close function pointer. Even partial overwrites won't help since the low bits of kernel addresses aren't fixed enough. Looking at how the reference exploit handles this, it uses a two-stage approach: first leaking a kernel text address to bypass KASLR, then using that to control RIP.
For 5.15 with INIT_ON_ALLOC, the KASLR bypass works by reading a freed user key overlapping with a pfifo Qdisc to leak the ops pointer. But the RIP control stage is where I'm stuck—I can't control offset 0 through the user_key_payload. So I'm exploring a different angle: overwriting modprobe_path instead. Once I have the kernel base from the KASLR bypass, I know where modprobe_path lives and could potentially redirect it to execute arbitrary code.
The challenge is that the user key only gives me a read primitive via keyctl_read, not a write. keyctl_update would free and reallocate, which breaks the overlap I need. So I'm reconsidering how to leverage the overlapping objects—maybe by using keyctl_revoke to mark the key as revoked and then exploring what happens during the cleanup phase.
When the key is revoked, the user key type's revoke method calls kfree_rcu on the payload, which frees the Qdisc memory. After waiting for RCU, that kmalloc-512 slot becomes available again. I can use setxattr to write controlled data at offset 0, placing a stack pivot gadget there. But here's the problem: setxattr allocates, writes the data, then immediately frees it, so the slot goes back on the freelist. Even if the Qdisc was tied to a network interface and I send packets to trigger enqueue(), the kernel shouldn't actually call it on freed memory.
Actually, wait—the Qdisc is freed in terms of memory, but the network interface's dev_queue still holds a pointer to it. The interface doesn't know the Qdisc was freed, so it has a dangling pointer. When I send a packet, the kernel dereferences that pointer and calls qdisc->enqueue(). If I can control what's at that address through setxattr or another allocation, I control the execution. So the refined approach is to free the key, wait for RCU, then use setxattr to place my gadget in the freed slot before sending packets to trigger the dangling pointer dereference.
But there's a problem: setxattr allocates, writes the data, and immediately frees it when the syscall returns. The data won't stay alive long enough. I need a different primitive that keeps the data resident—maybe msgsnd to queue a message that occupies the same kmalloc-512 slot and persists until I dequeue it.
Looking at the msg_msg structure, most fields are kernel-controlled (the list pointers, timestamps, security context), but I can control m_type and the message data itself starting at offset 48. The issue is I still can't touch the first 48 bytes. I'm exploring whether add_key or setsockopt might give me a better primitive where user-controlled data starts at offset 0.
Most setsockopt options seem to have kernel headers too—IP_OPTIONS, SO_ATTACH_FILTER, they all wrap the user data with their own structures. But I'm wondering if there's a netlink-specific option that might work differently.
Actually, I'm pivoting to a completely different strategy using userfaultfd to race a page fault. The approach would be to mmap a region with userfaultfd, then call setxattr with a pointer into that region, so when the kernel tries to copy the data into a kmalloc buffer, the page fault handler can intercept and manipulate what gets copied.
The tricky part is timing: I need to free the Qdisc allocation first so that when the kernel allocates its buffer, it reuses that same slot. Then I can resolve the page fault with controlled data, let setxattr fail and free the buffer, but the network interface still holds a dangling pointer to that address. Before anything else claims that memory, I can send packets to trigger the use-after-free. So kvmalloc with 512 bytes will use kmalloc, which means it grabs that freed Qdisc slot and zeros it via INIT_ON_ALLOC, then copy_from_user writes our controlled data directly into the buffer starting at offset 0 — exactly what we need.
The critical part is the race window: after kvfree returns the buffer to the freelist, the network interface still has a dangling pointer to it. When we send a packet and the kernel invokes qdisc->enqueue(), it dereferences offset 0 and executes our stack pivot gadget. But if anything else allocates from kmalloc-512 in that tiny window, INIT_ON_ALLOC will zero our payload. I can use userfaultfd to stall copy_from_user and get precise control over when the data lands in memory, then immediately trigger the packet send to minimize the race.
Actually, the simpler approach might work: just free the Qdisc slot, call setxattr to allocate and write our controlled data, then immediately send the packet. On a single CPU with no other threads, the interference window is small enough. But there's a wrinkle—SLAB_FREELIST_RANDOM writes the freelist pointer at a random offset within the freed buffer, which could corrupt our stack pivot gadget or the payload data at offset 0x41, breaking the exploit.
For kmalloc-512, the freelist pointer offset is randomly chosen at boot and could land at offset 0 with about 1.6% probability, which would overwrite our stack pivot gadget. Since 0x41 isn't 8-byte aligned, it can't be the freelist offset, so the payload there is safer. The real risk is if offset 0 gets hit.
With a 98.4% success rate, this approach seems viable. But I'm wondering if I can eliminate the freelist pointer collision entirely using userfaultfd to pause the copy_from_user operation mid-way, keeping the allocation live and off the freelist while I work with it.
The strategy would be: set up a user fault handler on a page within the setxattr buffer, trigger the kvmalloc which lands in the Qdisc slot, let copy_from_user proceed until it hits the faulted page and blocks, then from another thread send the packet to invoke enqueue while the buffer still contains my controlled data at offset 0. This bypasses the freelist pointer issue entirely since the buffer never gets freed.
The key constraint is that my gadget needs to fit within the first page of the user_data region—since copy_from_user will fill the buffer with data from the normal page before hitting the fault, I need to ensure the stack pivot gadget and the pop_rsp address land in the portion that gets copied before the page fault occurs. So with the buffer positioned at a page boundary, the copy would succeed for the first 256 bytes (containing my controlled gadget data), then fault when trying to access the uffd page. While the fault blocks, another thread can send a packet that reads from offset 0 and executes my hijacked code. This avoids the race condition entirely, though it adds complexity with userfaultfd. I think I'll start with the simpler setxattr approach since the 98.4% success rate should be acceptable for a CTF, and I can always fall back to this if needed.
Now I need to think through the ROP chain layout. When the Qdisc's enqueue function is called, the registers will be set up with the skb pointer in rdi and the qdisc pointer in rsi, which I can control.
The stack pivot gadget pushes rsi onto the kernel stack and jumps to an address I've placed at offset 0x41 in my controlled data. That address points to a pop-rsp-pop-r15-ret gadget that pivots the stack to my qdisc address and then starts executing my ROP chain from offset 0x08.
So I need to lay out my controlled data with the stack pivot gadget at offset 0x00, the ROP chain starting at offset 0x08, and the intermediate gadget address at offset 0x41. I see the issue now—the ROP chain entries are laid out sequentially from offset 0x08, each gadget address taking 8 bytes, so the overlapping region at 0x41-0x48 collides with chain data that's supposed to be there. I need to figure out which byte value at position 0x40 will work with the address bytes that follow it: bytes 0x41-0x47 must hold the pivot address, leaving only the first byte of that entry free, so I'm working backwards from the required pivot address to determine what that byte needs to be. Mapping the chain as 8-byte entries, each one either a gadget or an argument, the 7th entry at offset 0x40 is heavily constrained—since bytes 0x41-0x47 must contain a pop_rsp address, it's impossible to use a standard ROP gadget there.
A better strategy is to make entry 7 a "don't care" value by positioning it as an argument to be loaded into a register rather than executed as code—for instance, if entry 6 is pop rdi; ret, then entry 7 becomes the value popped into rdi, where I can freely choose byte 0x40 while the upper 7 bytes are fixed by the pivot address, giving me a u64 value that's mostly determined but still usable as long as it's a valid argument for whatever rdi is used for next.
The problem is that with byte_0x40 = 0x00, the resulting value 0xffffff810a4f1e00 doesn't look useful as a pointer or operand, so I need to explore different pivot gadgets that avoid this alignment constraint. I'm searching for alternatives like push rsi; jmp [rsi+N] where N is 8-byte aligned, checking offsets like 0x00, 0x08, and so on to find a gadget that doesn't create this awkward value situation. I'm exploring different stack pivot techniques since the displacement-based jumps aren't yielding aligned values. Let me consider using an add rsp gadget instead to shift the stack pointer, then leverage the Qdisc's dequeue pointer as a secondary gadget when the kernel invokes the enqueue function.
Actually, the push rsi; jmp [rsi+0x41] approach with the unaligned offset is still viable—I just need to structure the ROP chain to work within that constraint. Let me map out the memory layout more carefully to see how the gadget addresses and subsequent ROP chain entries would be positioned.
The pivot mechanism works by having the enqueue call execute that gadget, which pushes rsi onto the stack and then jumps to the address stored at offset 0x41 from rsi. That jump lands on a pop_rsp_pop_r15_ret gadget, which pivots the stack pointer back to the beginning of my controlled buffer and then continues executing the ROP chain from there.
Now I'm working through the constraint that the pop_rsp_pop_r15_ret address must occupy bytes 0x41-0x48, which means the 7th and 8th ROP chain entries overlap with this critical gadget address. I need to carefully choose what goes at those positions so that the required bytes line up while still maintaining a valid ROP chain. Entry 7 shares 7 bytes with the pop_rsp_pop_r15_ret address, so let's make entry 7 a VALUE rather than a GADGET. For example, if entry 6 is pop rdi; ret, then entry 7 is the value popped into rdi.
pop_rsp_pop_r15_ret = 0xffffffff810a4f1e. Entry 7 as u64 (LE) has bytes 0x41-0x47 fixed to the gadget address, with only byte 0x40 free, giving me a value like 0xffffff810a4f1eXX depending on what I put there. If I use pop rdi; ret as entry 6, rdi gets set to this value, but that won't work for prepare_kernel_cred(0) which needs rdi = 0. However, I could use this approach for other functions like find_task_by_vpid where rdi can be a small number, or repurpose entry 6 entirely.