Syscalls

Learning Objective

Understand the story of a system call

What is a syscall?

  1. A userspace request for kernel action

  2. The mechanism to escalate privileges

  3. The terms "syscall" and "system call" are used interchangeably

demo: this riscv program will KILL you

.text
.globl _start
_start:
csrr a0, mtvec

riscv privileged isa releases

demo

this arm64 program will KILL you

.text
.globl _start
_start:
msr VBAR_EL1, x0

Overview

  1. Execution contexts

  2. Define kernelspace and userspace

  3. Kernel representation of a process or thread

  4. What do we want out of a system call?

  5. The five steps of a system call

Execution context

An execution context is a CPU register state

Execution context example 1

The {set,long}jmp(3) library functions

  1. setjmp: save current state

  2. longjmp: restore saved register state

demo

example use of {set,long}jmp
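The original demo is not reproduced here. As a minimal sketch of the save/restore idea, the following C fragment saves a register state with setjmp(3) and restores it with longjmp(3). The names saved_state, bail_out and demo_setjmp are made up for illustration.

```c
#include <setjmp.h>

static jmp_buf saved_state;      /* holds the captured register state */

/* Never returns normally: restores the state captured by setjmp(). */
static void bail_out(void)
{
    longjmp(saved_state, 1);
}

/* Returns 1 once execution has resumed at the setjmp() call site. */
int demo_setjmp(void)
{
    if (setjmp(saved_state) == 0) {
        /* First pass: the state has just been saved. */
        bail_out();
        return 0;                /* not reached */
    }
    /* Second pass: longjmp() restored the saved state. */
    return 1;
}
```

Calling demo_setjmp() returns 1, because control comes back out of setjmp() a second time instead of returning from bail_out().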

Execution context example 2

Threads in a program

  1. Threads share: address space (heap, code, static data)

  2. Threads have their own: stack (and stack pointer), registers, IP
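A rough sketch of the shared/private split, assuming POSIX threads and a build with -pthread. The names shared_counter, thread_body and demo_threads are illustrative only.

```c
#include <pthread.h>
#include <stdint.h>

static int shared_counter;             /* static data: one copy for all threads */

static void *thread_body(void *arg)
{
    int local = 0;                     /* lives on this thread's own stack */
    *(uintptr_t *)arg = (uintptr_t)&local;
    shared_counter++;                  /* visible to the creating thread */
    return NULL;
}

/* Returns 1 if the new thread updated the shared global while keeping
 * its local variable at a different address than ours, -1 on failure. */
int demo_threads(void)
{
    pthread_t thread;
    uintptr_t their_local = 0;
    int local = 0;

    shared_counter = 0;
    if (pthread_create(&thread, NULL, thread_body, &their_local) != 0)
        return -1;
    pthread_join(thread, NULL);

    return shared_counter == 1 && their_local != (uintptr_t)&local;
}
```

Printing the two addresses would typically show them far apart, since each thread gets its own stack mapping.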

Kernelspace and userspace

Kernelspace and userspace are distinct execution contexts

  • Primary difference: privilege level

Kernelspace and userspace

Normal programs run in userspace

Kernelspace is the privileged execution context

  1. Registers, stack, memory are familiar

  2. Key difference: CPU capabilities are unrestricted

Kernelspace simplification

Kernelspace execution context can be further subdivided

  1. Kernel code may be running on behalf of a particular userspace process

  2. Other kernel code runs on its own behalf

Definition of a context switch

  1. A capture of the CPU register state

  2. A load of a saved CPU register state
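The two halves of this definition can be tried out in userspace. The sketch below assumes glibc's ucontext functions, where swapcontext(3) captures the current register state and loads a saved one in a single call. The names coroutine_body and demo_context_switch are made up for illustration.

```c
#include <ucontext.h>

static ucontext_t main_ctx, other_ctx;
static char other_stack[64 * 1024];    /* separate stack for the second context */
static int steps;

static void coroutine_body(void)
{
    steps = 1;
    swapcontext(&other_ctx, &main_ctx);   /* capture ours, load main's */
    steps = 2;
    /* Returning here resumes uc_link, which is main_ctx. */
}

/* Returns 12: the first switch observed steps == 1, the second steps == 2. */
int demo_context_switch(void)
{
    getcontext(&other_ctx);
    other_ctx.uc_stack.ss_sp = other_stack;
    other_ctx.uc_stack.ss_size = sizeof other_stack;
    other_ctx.uc_link = &main_ctx;
    makecontext(&other_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &other_ctx);   /* runs the body up to its swap */
    int first = steps;
    swapcontext(&main_ctx, &other_ctx);   /* resumes the body until it returns */
    return first * 10 + steps;
}
```

No privilege change is involved here; this only illustrates the capture-and-load part of a context switch.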

Context switching and the kernel

Switching to kernel context is like other context switches

  1. Main difference: privileges escalated

  2. Switching back to userspace is similar but privileges are dropped

Note on terminology

The term "context switch" is sometimes used to refer to task-switching

  1. Each task-switch involves several context switches

Definition of re-entrancy

"A computer program or subroutine is called reentrant if multiple invocations can safely run concurrently on multiple processors" (source)

  1. Important concept for kernel code
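A small userspace illustration of the difference, not kernel code. The function names task_label and task_label_r are hypothetical. The first keeps hidden static state, so concurrent or nested invocations would trample each other; the second keeps all state in caller-supplied storage.

```c
#include <stdio.h>
#include <stddef.h>

/* Not reentrant: every caller shares one hidden static buffer. */
const char *task_label(int id)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "task-%d", id);
    return buf;
}

/* Reentrant: all state lives in storage supplied by the caller. */
const char *task_label_r(int id, char *buf, size_t len)
{
    snprintf(buf, len, "task-%d", id);
    return buf;
}
```

Two calls to task_label() return the same pointer, so the second call overwrites the first result. Two calls to task_label_r() with separate buffers do not interfere.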

Introducing the struct task_struct

This is Linux's Process Control Block (PCB)

  1. Each kernel pid has a unique struct task_struct

demo

A quick look at include/linux/sched.h

Introducing the current macro

Refers to the struct task_struct of the process in the current execution context

demo

A quick look at three files:

  1. riscv64: arch/riscv/include/asm/current.h

  2. arm64: arch/arm64/include/asm/current.h

  3. x86: arch/x86/include/asm/current.h

The meaning of pid

kernelspace name    userspace name
pid                 tid
tgid                pid

demo

see get{p,t}id(2) in kernel/sys.c
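The naming table can be checked from userspace. This sketch assumes /proc is mounted and that the caller is the main thread of its process. It compares the kernel's Tgid and Pid fields in /proc/self/status with getpid(2) and gettid(2). The helper names status_field and demo_pid_names are made up.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read a numeric field such as "Tgid" from /proc/self/status, or -1. */
static long status_field(const char *name)
{
    char line[256];
    long value = -1;
    size_t n = strlen(name);
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, name, n) == 0 && line[n] == ':') {
            sscanf(line + n + 1, "%ld", &value);
            break;
        }
    }
    fclose(f);
    return value;
}

/* Returns 1 if kernel tgid == userspace pid and kernel pid == userspace tid,
 * 0 if they differ, -1 if /proc could not be read. */
int demo_pid_names(void)
{
    long kernel_tgid = status_field("Tgid");
    long kernel_pid = status_field("Pid");

    if (kernel_tgid < 0 || kernel_pid < 0)
        return -1;
    return kernel_tgid == (long)getpid() &&
           kernel_pid == syscall(SYS_gettid);
}
```

In a single-threaded program all four numbers are equal; the distinction only becomes visible in additional threads.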

Simplified getpid(2) call stack

getpid(2) calls functions

... namespaces are taken into account

... locking is done

task_pid_nr() { ... tsk->pid ... }

The pid/tgid distinction

Why do we have different names?

History of tgid: the dilemma

Before Linux 2.6, there were only pids

  1. The clone(2) call could share address space between processes

  2. This allowed thread-like behavior

  3. These processes were too independent

    1. Example: no shared signals

History of tgid: the proposal

NPTL implements threads as specified by POSIX

  1. This required both userspace and kernelspace changes

History of tgid: userspace changes

  1. The C library was hardened for concurrency

  2. The C library introduced the tid concept

  3. The tid subdivides a pid

History of tgid: kernelspace changes

  1. The kernel introduced the tgid concept

  2. The tgid groups kernel pids together

History of tgid: present day

Each pid corresponds to a unique struct task_struct

  1. The tgid and pid values are stored here

Syscalls: are they necessary?

Can a program do anything useful without making any syscalls?

Syscalls: highly necessary

All useful programs depend on system calls

demo

Let's trace a program's syscall usage with strace

demo

Syscall-free prime-number detector program
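The demo program itself is not included in this outline. As a guess at what its core might look like, here is a primality test that needs no help from the kernel. The name is_prime is illustrative.

```c
/* Pure computation: no library calls and no syscalls. */
int is_prime(unsigned long n)
{
    if (n < 2)
        return 0;
    /* d <= n / d avoids overflow that d * d <= n could hit. */
    for (unsigned long d = 2; d <= n / d; d++) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}
```

The computation is syscall-free, but reporting the answer is not: printing it needs write(2), and even exiting with a status code needs a syscall.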

Desirable properties of syscalls

  1. speed

  2. security

  3. stability

  4. re-entrancy

Security concerns

  1. confused deputy problem

  2. One example: validate address range of any pointer arguments
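Pointer validation can be observed from userspace. In this sketch, write(2) is handed an address that an ordinary process does not have mapped, and the kernel is expected to refuse with EFAULT instead of reading from it. A pipe is used as the target on the assumption that writing to it forces the kernel to copy from the user buffer. The function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Returns the errno produced by writing from a bogus user pointer,
 * 0 if the write unexpectedly succeeded, or -1 if pipe() failed. */
int write_from_bad_pointer(void)
{
    int fds[2];
    volatile uintptr_t bogus_addr = 1;    /* not mapped in an ordinary process */

    if (pipe(fds) != 0)
        return -1;

    ssize_t n = write(fds[1], (const void *)bogus_addr, 16);
    int err = (n == -1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```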

Stability

Linux provides a stable syscall API

  1. This sets Linux apart from other OSes

Syscall implementation

A syscall can be broken down into 5 distinct steps

  1. Userspace invocation

  2. Hardware-assisted privilege escalation

  3. Kernel code handler

  4. Hardware-assisted privilege drop

  5. Userspace program continues

Each step boundary marks a transfer of responsibility between software and hardware

Userspace invocation

All programs make system calls

  1. excluding trivial example programs

Example: a shell as an abstraction over many syscalls

demo

see /proc/PID/syscall

  1. Multi-arch syscall number table
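A process can also inspect its own entry. In this sketch, which assumes /proc is mounted and the kernel exposes the file, the process reads /proc/self/syscall. Since it is inside read(2) at that moment, the first field should be the number of the read syscall. The function name is made up.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the first field of /proc/self/syscall, or -1 if unavailable. */
long current_syscall_number(void)
{
    char buf[256];
    int fd = open("/proc/self/syscall", O_RDONLY);

    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf - 1);   /* we are "in" read(2) here */
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return strtol(buf, NULL, 10);
}
```

Comparing the result against SYS_read from <sys/syscall.h> ties the number back to the syscall table for the running architecture.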

Userspace invocation: library wrappers

The C library provides wrapper functions for many syscalls

  1. Main benefit: speed

  2. We want to minimize the high-overhead syscalls

  3. Checks like input validation avoid syscalls if possible

  4. Avoid architecture specific details
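To see a wrapper next to the underlying call, this sketch compares the getpid() wrapper with the generic syscall(2) entry point, which takes the syscall number explicitly. Note that syscall() is itself a C library function; it is used here only to skip the per-syscall wrapper. The function name is illustrative.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the wrapper and the explicit syscall agree. */
int wrapper_matches_raw(void)
{
    long via_wrapper = getpid();              /* per-syscall wrapper */
    long via_number = syscall(SYS_getpid);    /* syscall selected by number */

    return via_wrapper == via_number;
}
```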

Common notation: manual page section numbers

Example: write(2) vs write(3)

  1. Number in parentheses refers to manual page section number

  2. Section 2 has system calls and section 3 has library calls

See man man for more information

demo

ltrace: like strace for library calls

Userspace invocation: architecture-specific

Common across architectures:

  1. specify the syscall and arguments

  2. give up control to the hardware

Userspace invocation: riscv

  1. Specify syscall number in a7

  2. Specify arguments 1-6 in a0, a1, a2, a3, a4, a5

  3. Return value will be in a0

  4. ecall gives up control to hardware

Userspace invocation: arm64

  1. Specify syscall number in x8

  2. Specify arguments 1-6 in x0, x1, x2, x3, x4, x5

  3. Return value will land in x0

  4. svc #0 gives up control to hardware

Userspace invocation: x86_64

  1. Specify syscall number in rax

  2. Args 1-6 in: rdi, rsi, rdx, r10, r8, r9

  3. syscall gives up control to hardware

  4. Return value will land in rax

Userspace invocation: x86_64 fine print

Difference from normal function calling convention

  1. The syscall instruction clobbers rcx

  2. Use r10 instead of rcx

Userspace invocation: wrap up

With arguments chosen and syscall selected

  1. Give up control to the hardware
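The register conventions from the preceding slides can be written down as GCC-style inline assembly. This is a sketch for three-argument syscalls only, assuming a GCC or Clang toolchain on Linux. The function name raw_syscall3 is made up, and the final branch is a fallback through syscall(2) for other architectures. On x86_64 the clobber list names rcx and also r11, which the syscall instruction overwrites as well.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the raw kernel result: a value >= 0, or a negative errno. */
long raw_syscall3(long nr, long arg1, long arg2, long arg3)
{
#if defined(__x86_64__)
    long ret;
    /* number in rax; args in rdi, rsi, rdx; result in rax */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(arg1), "S"(arg2), "d"(arg3)
                      : "rcx", "r11", "memory");
    return ret;
#elif defined(__aarch64__)
    /* number in x8; args in x0..x2; result in x0 */
    register long x8 __asm__("x8") = nr;
    register long x0 __asm__("x0") = arg1;
    register long x1 __asm__("x1") = arg2;
    register long x2 __asm__("x2") = arg3;
    __asm__ volatile ("svc #0"
                      : "+r"(x0)
                      : "r"(x8), "r"(x1), "r"(x2)
                      : "memory");
    return x0;
#elif defined(__riscv) && __riscv_xlen == 64
    /* number in a7; args in a0..a2; result in a0 */
    register long a7 __asm__("a7") = nr;
    register long a0 __asm__("a0") = arg1;
    register long a1 __asm__("a1") = arg2;
    register long a2 __asm__("a2") = arg3;
    __asm__ volatile ("ecall"
                      : "+r"(a0)
                      : "r"(a7), "r"(a1), "r"(a2)
                      : "memory");
    return a0;
#else
    /* Fallback: let the C library issue the call, then undo its errno step. */
    long ret = syscall(nr, arg1, arg2, arg3);
    return ret == -1 ? -errno : ret;
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) yields the same number as getpid(), and raw_syscall3(SYS_close, -1, 0, 0) yields -EBADF rather than -1.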

Hardware-assisted privilege escalation

This step is handled by hardware

  1. How does hardware know what to do?

Hardware-assisted privilege escalation

Rewind to boot

  1. Linux installs the address of its syscall entry code into our CPU

Hardware-assisted privilege escalation: riscv

See _start:

  1. Set CSR_TVEC to address of handle_exception()

  2. CSR_TVEC: Control and Status Register: Trap Vector Base Address Register

Hardware-assisted privilege escalation: arm64

See __primary_switched()

  1. Set VBAR_EL1 to address of vector table

  2. Vector table defined in entry.S

  3. VBAR_EL1: Vector Base Address Register (Exception Level 1)

Hardware-assisted privilege escalation: x86_64

See syscall_init()

  1. Set MSR_LSTAR to entry_SYSCALL_64 address

  2. MSR_LSTAR: Model Specific Register: Long mode SYSCALL Target Address Register

Hardware-assisted privilege escalation: at invocation

Back to the present

  1. The CPU is preconfigured to correctly transfer control

  2. This makes privilege escalation safe

Hardware-assisted privilege escalation

  1. On riscv: switch to supervisor mode (machine mode if the kernel itself runs in M-mode)

  2. On arm64: elevate exception level

  3. On x86_64: change to ring 0

  4. On arm64 and x86_64, the current level is stored in a particular register

Kernel handles request

  1. Part architecture-specific

  2. Part architecture-generic

Kernel handles request: starting point

Execution resumes from a hardware-specified register state

  1. At bottom, mostly assembly and C macros

  2. Higher on call stack is more generic code

Kernel handles request: riscv

Start in handle_exception

  1. Assembly jumps into a particular offset in excp_vect_table

Kernel handles request: riscv

Get the syscall number in do_trap_ecall_u()

  1. Index into the sys_call_table array and call the function
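The index-and-call step can be modeled in a few lines of ordinary C. This is a simplified userspace model, not the kernel's code: the handler type, the table and the demo_ functions are all hypothetical, and unknown numbers are answered with -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*handler_fn)(long, long, long);   /* hypothetical handler type */

static long demo_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

static long demo_max(long a, long b, long c)
{
    long m = a > b ? a : b;
    return m > c ? m : c;
}

/* Slot 1 is deliberately left empty to model an unimplemented syscall. */
static const handler_fn demo_call_table[] = {
    [0] = demo_add,
    [2] = demo_max,
};

long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    size_t count = sizeof demo_call_table / sizeof demo_call_table[0];

    /* Bounds check first: the number comes from untrusted userspace. */
    if (nr >= count || demo_call_table[nr] == NULL)
        return -ENOSYS;
    return demo_call_table[nr](a, b, c);
}
```

demo_dispatch(0, 2, 3, 0) returns 5, while demo_dispatch(1, 0, 0, 0) and demo_dispatch(99, 0, 0, 0) both return -ENOSYS.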

Kernel handles request: riscv

syscall_table_64.h generated at build time

  1. An included Makefile generates this from a common table

  2. scripts/syscall.tbl enumerates the syscalls and compatibility information

Kernel handles request: arm64

Start at the address in VBAR_EL1

  1. Hardware jumps to particular offset in vectors

Kernel handles request: arm64 reaches C

A function defined by macro in entry.S calls into C code

  1. The first C function is el0t_64_sync_handler()

Kernel handles request: arm64 goes generic

Execution reaches el0_svc_common()

  1. invoke_syscall() indexes into the jump table of handlers

  2. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

Kernel handles request: x86_64 entry

Start at entry_SYSCALL_64

  1. Assembly calls into the do_syscall_64() C function

Kernel handles request: x86_64 goes generic

Using a few helper functions, index into jump table of system call handlers

  1. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

Further reading on x86_64 syscall implementation details

  1. We have another article about this available

Kernel handles request: architecture-generic

A closer look at the SYSCALL_DEFINE*() handlers

Kernel handles request: SYSCALL_DEFINE*

Defined in include/linux/syscalls.h

  1. Resolve to __SYSCALL_DEFINEx(x,...

  2. Five functions generated

  3. See __do_sys##name(...
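The general shape of such a macro can be imitated in userspace. The sketch below is a heavily simplified model with made-up names (MODEL_SYSCALL_DEFINE2, model_sys_, model_do_). It generates two functions rather than the kernel's full set: an outer one that receives every argument as a long, and a typed inner body that the outer one calls after casting.

```c
/* Generates model_sys_<name>(long, long), which casts its arguments to the
 * declared types and forwards them to the typed body model_do_<name>(). */
#define MODEL_SYSCALL_DEFINE2(name, t1, a1, t2, a2)          \
    static long model_do_##name(t1 a1, t2 a2);               \
    long model_sys_##name(long a1, long a2)                  \
    {                                                        \
        return model_do_##name((t1)a1, (t2)a2);              \
    }                                                        \
    static long model_do_##name(t1 a1, t2 a2)

/* The body is written with its natural argument types. */
MODEL_SYSCALL_DEFINE2(scale, int, value, unsigned char, factor)
{
    return (long)value * factor;
}
```

model_sys_scale(3, 2) returns 6, and so does model_sys_scale(3, 258), because the second argument is narrowed to unsigned char before the body sees it.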

A note on syscall arguments

No SYSCALL_DEFINE7 and above

demo

Take a look at the SYSCALL_DEFINE macro definition in include/linux/syscalls.h

Kernel handles request: return imminent

Indicate error using the errno macros

Return to assembly for another context switch

Kernel handles request: riscv returns

do_trap_ecall_u() returns to assembly code

  1. After finishing the function found in excp_vect_table, jump to ret_from_exception

  2. Place userspace program counter in mepc: machine exception program counter CSR

  3. mret gives up control to the hardware

  4. sret used in supervisor mode while mret used in machine mode

Kernel handles request: arm64 returns

el0t_64_sync() calls ret_to_user()

  1. Which calls kernel_exit 0, defined as a macro

  2. eret gives up control to the hardware once again

Kernel handles request: x86_64 returns

entry_SYSCALL_64() prepares to return

  1. First check whether we can use the faster sysret instruction

  2. sysret is faster than iret

  3. Via either instruction we give up control to the hardware once again

Hardware-assisted privilege drop

Less dangerous operation than escalation

  1. Restore old registers and stack

  2. Drop privileges

  3. Set program counter to userspace return address

Hardware-assisted privilege drop: riscv

The ecall instruction saves its own address in the mepc control and status register (in m-mode); the kernel advances it past the ecall to form the return address

The mret instruction sets the program counter to the value in mepc

Hardware-assisted privilege drop: arm64

The svc #0 instruction saves a return address in hardware

The eret instruction sets the program counter to this value

Hardware-assisted privilege drop: x86_64

iret loads the return address from the stack

sysret returns to the address held in rcx

Hardware-assisted privilege drop: completed

Software takes control of execution

Userspace program continues

Always check for an error
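A minimal sketch of the habit, using open(2) through its C library wrapper. The helper name try_open is made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, or the errno value explaining the failure. */
int try_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;    /* read errno right away, before later calls change it */
    close(fd);
    return 0;
}
```

Opening a path whose parent directory does not exist yields ENOENT, which strerror(3) turns into a readable message.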

Userspace program continues: errno

Kernel functions return -errno

C library wrappers check for error

  1. Store original error in errno

  2. Convert return code to -1

  3. Example: musl syscall return
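The conversion described in this list fits in one small function. This is a sketch modeled on the idea rather than a copy of any C library's source, and the name demo_syscall_ret is hypothetical. It assumes the kernel reports errors as values in the range -4095 to -1.

```c
#include <errno.h>

/* Turn a raw kernel result into the C library convention. */
long demo_syscall_ret(unsigned long raw)
{
    if (raw > -4096UL) {          /* i.e. -4095..-1 when viewed as signed */
        errno = (int)-raw;        /* store the original error in errno */
        return -1;                /* convert the return code to -1 */
    }
    return (long)raw;             /* success values pass through unchanged */
}
```

demo_syscall_ret((unsigned long)-EBADF) returns -1 and leaves errno set to EBADF, while demo_syscall_ret(5) returns 5.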

demo

The errno utility from moreutils package

errno further reading

See man 3 errno

Userspace program continues

The system call is complete

The story of a system call: A summary

Linux provides a stable system call API

The story of a system call: A summary

  1. Most programs run in user execution context ("userspace")

  2. Kernel code runs in several execution contexts (all "kernelspace")

The story of a system call: A summary

Hardware plays two key roles in system calls

  1. Raising privileges and entering kernel execution context

  2. Dropping privileges and entering user execution context

The story of a system call: A summary

Many syscall implementation details are architecture-specific

The story of a system call: A summary

The kernel defines the main syscall handler using a SYSCALL_DEFINE* macro

  1. These macros are used to define system call implementations

The story of a system call: A summary

The C library defines wrapper functions for many syscalls

  1. These hide architecture-specific details

  2. Provide POSIX-compatible behavior by hiding Linux eccentricities

The story of a system call: A summary

Always check for an error after making a syscall

End