Understand the story of a system call
A userspace request for kernel action
The mechanism to escalate privileges
The terms "syscall" and "system call" are used interchangeably
.text
.globl _start
_start:
    csrr a0, mtvec
this arm64 program will KILL you
.text
.globl _start
_start:
    msr VBAR_EL1, x0
Execution contexts
Define kernelspace and userspace
Kernel representation of a process or thread
What do we want out of a system call?
The five steps of a system call
An execution context is a CPU register state
The {set,long}jmp(3) library functions
setjmp: save current state
longjmp: restore saved register state
example use of {set,long}jmp
Threads in a program
Threads share: address space (heap, code, static data)
Threads have their own: stack (located by the stack pointer register), registers, instruction pointer
Kernelspace and userspace are distinct execution contexts
Normal programs run in userspace
Kernelspace is the privileged execution context
Registers, stack, memory are familiar
Key difference: CPU capabilities are unrestricted
Kernelspace execution context can be further subdivided
Kernel code may be running on behalf of a particular userspace process
Other kernel code runs on its own behalf
A capture of the CPU register state
A load of a saved CPU register state
Switching to kernel context is like other context switches
Main difference: privileges escalated
Switching back to userspace is similar, but privileges are dropped
The term "context switch" is sometimes used to refer to task-switching
"A computer program or subroutine is called reentrant if multiple invocations can safely run concurrently on multiple processors" (source)
struct task_struct
This is Linux's Process Control Block (PCB)
pid
A quick look at include/linux/sched.h
current
Refers to struct task_struct of process in current execution context
A quick look at three files:
riscv64: arch/riscv/include/asm/current.h
arm64: arch/arm64/include/asm/current.h
x86: arch/x86/include/asm/current.h
see get{p,t}id(2) in kernel/sys.c
getpid(2)
getpid(2) calls functions ... namespaces are taken into account ... locking is done task_pid_nr() { ... tsk->pid ... }
tgid
Why do we have different names?
Before Linux 2.6, there were only pids
The clone(2) call could share address space between processes
This allowed thread-like behavior
These processes were too independent
NPTL implements threads as specified by POSIX
The C library was hardened for concurrency
The C library introduced the tid concept
tid
The tid subdivides a pid
The kernel introduced the tgid concept
The tgid groups kernel pids together
Each pid corresponds to unique struct task_struct
Can a program do anything useful without making any syscalls?
All useful programs depend on system calls
Let's trace a program's syscall usage with strace
Syscall-free prime-number detector program
speed
security
stability
re-entrancy
confused deputy problem
One example: validate address range of any pointer arguments
Linux provides a stable syscall API
A syscall can be broken down into 5 distinct steps
Userspace invocation
Hardware-assisted privilege escalation
Kernel code handler
Hardware-assisted privilege drop
Userspace program continues
Each step is separated from the next by a handoff of control between software and hardware
All programs make system calls
Example: a shell as an abstraction over many syscalls
see /proc/PID/syscall
The C library provides wrapper functions for many syscalls
Main benefit: speed
We want to minimize the high-overhead syscalls
Checks such as input validation can run in userspace and avoid a syscall where possible
Avoid architecture specific details
Example: write(2) vs write(3)
Number in parentheses refers to manual page section number
Section 2 has system calls and section 3 has library calls
See man man for more information
ltrace: like strace for library calls
Common across architectures:
specify the syscall and arguments
give up control to the hardware
Specify syscall number in a7
Specify arguments 1-6 in a0, a1, a2, a3, a4, a5
Return value will be in a0
ecall gives up control to hardware
Specify syscall number in x8
Specify arguments 1-6 in x0, x1, x2, x3, x4, x5
Return value will land in x0
svc #0 gives up control to hardware
Specify syscall number in rax
Args 1-6 in: rdi, rsi, rdx, r10, r8, r9
syscall gives up control to hardware
Return value will land in rax
Difference from normal function calling convention
The syscall instruction clobbers rcx (it stores the return address there) and r11 (saved flags)
Use r10 instead of rcx
With arguments chosen and syscall selected
This step is handled by hardware
Rewind to boot
See _start:
Set CSR_TVEC to address of handle_exception()
CSR_TVEC: Control and Status Register: Trap Vector Base Address Register
See __primary_switched()
Set VBAR_EL1 to address of vector table
Vector table defined in entry.S
VBAR_EL1: Vector Base Address Register (Exception Level 1)
See syscall_init()
Set MSR_LSTAR to entry_SYSCALL_64 address
MSR_LSTAR: Model Specific Register: Long System Target Address Register
Back to the present
The CPU is preconfigured to correctly transfer control
This makes privilege escalation safe
On riscv: switch to supervisor mode (machine mode when the kernel itself runs in m-mode)
On arm64: elevate execution level
On x86_64: change to ring 0
Both of these are stored in a particular register
Part architecture-specific
Part architecture-generic
Execution resumes from a hardware-specified register state
At bottom, mostly assembly and C macros
Higher on call stack is more generic code
riscv
Start in handle_exception
excp_vect_table
Get the syscall number in do_trap_ecall_u()
sys_call_table
syscall_table_64.h generated at build time
An included Makefile generates this from a common table
scripts/syscalls.tbl enumerates the syscalls and compatibility information
arm64
Start in VBAR_EL1
vectors
A function defined by macro in entry.S calls into C code
Execution reaches el0_svc_common()
The invoke_syscall() indexes into jump table of handlers
This architecture-generic handler is defined by a SYSCALL_DEFINE* macro
x86_64
Start at entry_SYSCALL_64
do_syscall_64()
Using a few helper functions, index into jump table of system call handlers
A closer look at the SYSCALL_DEFINE*() handlers
SYSCALL_DEFINE_*
Defined in include/linux/syscalls.h
Resolve to __SYSCALL_DEFINEx(x,...
Five functions generated
See __do_sys##name(...
No SYSCALL_DEFINE7 and above
Take a look at the SYSCALL_DEFINE macro definition in include/linux/syscalls.h
Indicate error using the errno macros
Return to assembly for another context switch
do_trap_ecall_u() returns to assembly code
After finishing the function found in excp_vect_table, jump to ret_from_exception
Place userspace program counter in mepc: machine exception program counter CSR
mret gives up control to the hardware
sret is used in supervisor mode, while mret is used in machine mode
el0t_64_sync() calls ret_to_user()
Which calls kernel_exit 0, defined as a macro
eret gives up control to the hardware once again
entry_SYSCALL_64() prepares to return
First check whether we can use the faster sysret instruction
sysret is faster than iret
Via either instruction we give up control to the hardware once again
Less dangerous operation than escalation
Restore old register and stack
Drop privileges
Set program counter to userspace return address
The ecall instruction places the return address in the mepc control and status register (in m-mode)
The mret instruction sets the program counter to the value in mepc
The svc #0 instruction saves a return address in hardware
The eret instruction sets the program counter to this value
iret loads the return address from the stack
sysret returns to the address held in rcx
Software takes control of execution
Always check for an error
Kernel functions return -errno
C library wrappers check for error
Store original error in errno
Convert return code to -1
Example: musl syscall return
The errno utility from moreutils package
See man 3 errno
The system call is complete
Linux provides a stable system call API
Most programs run in user execution context ("userspace")
Kernel code runs in several execution contexts (all "kernelspace")
Hardware plays two key roles in system calls
Raising privileges and entering kernel execution context
Dropping privileges and entering user execution context
Many syscall implementation details are architecture-specific
The kernel defines the main syscall handler using a SYSCALL_DEFINE* macro
The C library defines wrapper functions for many syscalls
These hide architecture-specific details
Provide POSIX-compatible behavior by hiding Linux eccentricities
Always check for an error after making a syscall