Lecture 06: Thursday, February 13 2025


Announcements

  1. Initial submission for the page_walk assignment is due next Tuesday at midnight

  2. Once again we would like to encourage use of camera and microphone when possible

    1. Keep in mind that it allows us to gauge how you are doing and better tailor the lecture to your needs and reactions

Review

  1. Execution contexts

  2. setjmp/longjmp

  3. Define kernelspace and userspace

  4. Kernel representation of a process or thread

Lecture overview

  1. What do we want out of a system call?

  2. The five steps of a system call

Slides

System Calls

Notes

Syscalls

Learning Objective: Understand the story of a system call
  1. What is a syscall?

    1. A userspace request for kernel action

    2. The mechanism to escalate privileges

    3. Note: "syscall" and "system call" are used interchangeably

    4. demo: this riscv program will KILL you

.text
.globl _start
_start:
csrr a0, mtvec

Compile in container with:

riscv64-linux-gnu-gcc -pie -ffreestanding -nostdlib kill.s -static -march=rv64im_zicsr -mabi=lp64 -shared -o kill

RISC-V privileged ISA releases

    5. demo: this arm64 program will KILL you

.text
.globl _start
_start:
msr VBAR_EL1, x0
  1. Overview

    1. Execution contexts

    2. Define kernelspace and userspace

    3. Kernel representation of a process or thread

    4. What do we want out of a system call?

    5. The five steps of a system call

  2. Background information

    1. What is execution context?

      1. An execution context is a CPU register state

      2. First example: {set,long}jmp(3) library functions

        1. setjmp: save current state

        2. longjmp: restore saved register state

        3. demo: setjmp/longjmp example program

      3. Threads in a program

        1. threads share: address space (heap, code, static data)

        2. threads have their own: stack (stack pointer register), other registers, IP

    2. Kernelspace and userspace are distinct execution contexts

      1. Primary difference: privilege level

      2. This is a different type of context difference than thread vs. thread

      3. Normal programs run in userspace

      4. Kernelspace is the privileged execution context

        1. Registers, stack, memory are familiar

        2. Key difference: CPU capabilities are unrestricted

      5. Kernelspace can be further subdivided into contexts

        1. Kernel code may be running on behalf of a particular userspace process

        2. Other kernel code runs on its own behalf

        3. This material will be covered when we discuss interrupts

    3. Define context switch and its relation to syscalls

      1. A change of CPU register state

      2. Some may be familiar with process to process context switches

      3. Each of these can be broken down further

      4. First, the transition from userspace to kernelspace

      5. Save registers and switch stack, raise privileges

      6. Second, the transition from kernelspace to userspace

      7. Restore registers and switch stack, drop privileges

      8. We will refer to each of these distinct transitions as a "context switch"

    4. Important kernel code for syscall discussion

      1. What is a struct task_struct?

      2. This is Linux's Process Control Block (PCB)

      3. Each pid has a unique struct task_struct

      4. demo: see include/linux/sched.h

      5. current macro refers to struct task_struct of process in this context

      6. riscv64: arch/riscv/include/asm/current.h

      7. arm64: arch/arm64/include/asm/current.h

      8. x86: arch/x86/include/asm/current.h

      9. pid in kernel vs userspace

        kernelspace name   userspace name
        pid                tid
        tgid               pid
      10. demo: see get{p,t}id(2) in kernel/sys.c

      11. simplified call stack

        1. task_tgid_vnr() in include/linux/pid.h

        2. ... (namespaces are taken into account): return to this later

        3. ... locking: return to this later

        4. task_pid_nr() { ... tsk->pid ... }

      12. Before Linux 2.6, there were only pids

        1. The clone(2) call could share address space between processes

          1. This allowed thread-like behavior

          2. These processes were too independent

          3. e.g. no shared signals

        2. NPTL implements threads as specified by POSIX

          1. This required both userspace and kernelspace changes

            1. The C library was hardened with locking

            2. The C library introduced the tid concept

            3. The tid subdivides a pid

          2. On the kernel side:

            1. The kernel introduced the tgid concept

            2. The tgid groups kernel pids together

            3. Each pid corresponds to unique struct task_struct

            4. The tgid and pid values are stored here

  3. All useful programs depend on system calls

    1. demo: let's trace a program's syscall usage with strace

    2. demo: does int main(void) {} use syscalls?

    3. demo: a non-trivial syscall-free program: detect prime numbers

    4. qemu-system-riscv64 -machine virt -bios none -nographic -no-reboot -net none -kernel arch/riscv/boot/Image -initrd ../rootfs.cpio -append 'panic=-1 -- 4'

    5. qemu-system-riscv64 -machine virt -bios none -nographic -no-reboot -net none -kernel arch/riscv/boot/Image -initrd ../rootfs.cpio -append 'panic=-1 -- 3'

  4. What do we want out of syscalls?

    1. We need speed

    2. We want re-entrancy

      1. Interrupt/resume execution of any number of concurrent threads
    3. We need security

      1. e.g. we validate that pointer arguments fall within the expected address range

      2. confused deputy problem [1]

    4. Linux provides a stable syscall API

      1. This sets Linux apart from other OSes
  5. How does Linux implement syscalls?

    1. A syscall can be broken down into 5 distinct steps

      1. Userspace invocation

      2. Hardware-assisted privilege escalation

      3. Kernel code handler

      4. Hardware-assisted privilege drop

      5. Userspace program continues

  6. Each step boundary is a transfer of responsibility between software and hardware

  7. Userspace program preparation and invocation

    1. All programs make system calls

      1. excluding trivial example programs
    2. e.g. a shell as an abstraction over many syscalls

    3. demo: see /proc/PID/syscall, e.g. pid 1

      1. Multi-arch syscall number table
    4. when possible prefer library-provided syscall wrapper

      1. glibc provides wrapper functions for many syscalls

        1. Main benefit: speed

        2. We want to minimize the high-overhead syscalls

        3. Checks like input validation avoid syscalls if possible

      2. We use e.g. write(3) instead of write(2)

      3. Common manual number section notation

        1. e.g. write(2)

        2. Number in parenthesis refers to manual page section number

        3. Section 2 has system calls and section 3 has library calls

        4. See man man for more information

      4. demo: ltrace: like strace for library calls

    5. Invocation of a syscall is architecture-specific

      1. First, specify the syscall and arguments

      2. Second, give up control to the hardware

      3. On riscv: ecall instruction gives up control

        1. Specify syscall number in a7

        2. Specify arguments 1-6 in a0, a1, a2, a3, a4, a5

        3. Return value will be in a0

      4. On arm64: svc #0 gives up control

        1. Specify syscall number in x8

        2. Specify arguments 1-6 in x0, x1, x2, x3, x4, x5

        3. Return value will land in x0

      5. On x86_64: syscall gives up control

        1. Specify syscall number in rax

        2. args 1-6 in: rdi, rsi, rdx, r10, r8, r9

        3. Note: this differs from normal function calling convention

          1. The syscall instruction clobbers rcx

          2. Use r10 instead of rcx

        4. Return value will land in rax

    6. With arguments chosen and syscall selected

      1. Give up control to the hardware
  8. Hardware-assisted privilege escalation

    1. This step is done by hardware

    2. Rewind to boot:

      1. Linux installs its syscall handler address into the CPU

      2. On riscv: see _start:

        1. Set CSR_TVEC to address of handle_exception()

        2. CSR_TVEC: Control and Status Register: Trap Vector Base Address Register

      3. On arm64: see __primary_switched

        1. Set VBAR_EL1 to address of vector table

        2. Vector table defined in entry.S

        3. VBAR_EL1: Vector Base Address Register (Exception Level 1)

      4. On x86_64: see syscall_init()

        1. Set MSR_LSTAR to entry_SYSCALL_64 address

        2. MSR_LSTAR: Model Specific Register: Long System Target Address Register

    3. Back to the present syscall

      1. The CPU is preconfigured to correctly transfer control

      2. This makes privilege escalation safe

      3. On riscv: switch to machine mode

      4. On arm64: elevate execution level

      5. On x86_64: change to ring 0

      6. All of these states are stored in a particular register

  9. Kernel code handles request

    1. Part architecture-specific and part architecture generic

    2. Execution resumes at an address specified by a hardware register

      1. Higher on call stack is more generic code

      2. Mostly assembly and C macros

      3. On riscv, start in handle_exception

        1. Assembly jumps into a particular offset in excp_vect_table

        2. Get the syscall number in do_trap_ecall_u()

        3. Index into the sys_call_table array and call the function

        4. sys_call_table populated by an include statement

        5. syscall_table_64.h generated at build time

        6. An included Makefile generates this from a common table

        7. scripts/syscalls.tbl enumerates the syscalls and compatibility information

      4. On arm64, start in VBAR_EL1

        1. Hardware jumps to particular offset in vectors

        2. A function defined by macro in entry.S calls into C code

        3. The first C function is el0t_64_sync_handler()

        4. This takes us to el0_svc_common()

        5. The invoke_syscall() indexes into jump table of handlers

        6. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

      5. On x86_64: start at entry_SYSCALL_64

        1. Assembly calls into the do_syscall_64() C function

        2. Using a few helper functions, index into jump table of system call handlers

        3. We have another article on x86_64 syscall implementation details

        4. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

    3. Run SYSCALL_DEFINE*() handler

      1. Defined in include/linux/syscalls.h

      2. Resolve to __SYSCALL_DEFINEx(x,...

      3. Five functions generated

      4. See __do_sys##name(...

      5. Note: no SYSCALL_DEFINE7 and above

      6. Indicate error using the errno macros

      7. Return to assembly for exit

      8. On riscv: do_trap_ecall_u() returns to assembly code

        1. After finishing the function found in excp_vect_table, jump to ret_from_exception

        2. Place userspace program counter in mepc: machine exception program counter CSR

        3. mret gives up control to the hardware

        4. sret used in supervisor mode while mret used in machine mode

      9. On arm64: el0t_64_sync() calls ret_to_user()

        1. Which calls kernel_exit 0, defined as a macro

        2. Restore registers, including the stack pointer

        3. eret gives up control to the hardware once again

      10. On x86_64: entry_SYSCALL_64 prepares to return

        1. First check whether we can use the faster sysret instruction

        2. sysret is faster than iret

        3. Via either instruction we give up control to the hardware

  10. Hardware-assisted privilege drop

    1. Less dangerous operation than escalation

    2. Restore old register and stack

    3. Drop privileges

    4. Set program counter to userspace return address

    5. On riscv:

      1. The ecall instruction places the return address in the mepc control and status register (in m-mode)

      2. The mret instruction sets the program counter to the value in mepc

    6. On arm64:

      1. The svc #0 instruction saves a return address in hardware

      2. The eret instruction sets the program counter to this value

    7. On x86_64:

      1. iret loads the return address from the stack

      2. sysret returns to rcx

    8. Software takes control of execution

  11. Userspace program continues

    1. Always check for an error

    2. The errno in kernel vs userspace context

      1. Kernel functions return -errno

      2. C library wrappers check for error

        1. Store original error in errno

        2. Convert return code to -1

        3. example: musl syscall return

      3. See: man 3 errno

      4. Demo: errno utility from moreutils package

    3. The program continues execution

  12. The story of a system call: Summary

    1. Linux provides a stable system call API

    2. Most programs run in user execution context ("userspace")

    3. Linux code runs in several execution contexts (all "kernelspace")

    4. Hardware plays two key roles in system calls

      1. First, raising privileges and entering kernel execution context

      2. Second, dropping privileges and entering user execution context

    5. Many syscall implementation details are architecture specific

    6. The kernel defines the main syscall handler using SYSCALL_DEFINE

      1. These macros are used to define system call implementations
    7. The C library defines wrapper functions for many syscalls

      1. These hide architecture-specific details

      2. Provide POSIX-compatible behavior by hiding Linux eccentricities

    8. Always check for an error after making a syscall

