Initial submission for the new_syscall assignment is due today at midnight
Once again we would like to encourage use of camera and microphone when possible
Using GDB with the kernel
Printing to the console
Device Tree
Finish Barebones Kernel
Begin syscalls
Start debugging the panic
Why are we panicking?
Let's figure it out
First: brain dead
Give up and change majors
Second: small brain
Grep for text "no working init found"
grep -rnw -e <pattern>
-r: recursive search
-n: display the line number of each match within its file
-w: only match full words
-i: be case insensitive
-e: specify pattern after this argument (optional when the pattern is the first non-option argument)
Third: Small-medium brain
git grep
Optimized search using git's database
Fourth: medium brain
Look for the function in the source
Fifth: big brain
Use addr2line on address found in GDB output
addr2line <address> -e vmlinux
Sixth: Galaxy brain
Get good at GDB
Run the kernel and interrupt it
c
...Ctrl+c
Do a backtrace and select a frame
bt
frame <n>
Disassemble the current function or list the source
disas
list
Switch between various Text-user-interface formats
layout asm
layout src
layout next
tui focus next
tui disable
What is try_to_run_init_process("/etc/init")?
Why does this fail?
Note: kernel is optimized during build so some code is skipped or inlined
Interesting functions to explore
kernel_execve()
do_filp_open()
The essential functionality of the Linux kernel is quite minimal
Cross compiling the kernel can be relatively simple
GDB is a powerful and versatile tool with advanced features
The kernel is just another program that you can debug
The kernel is highly configurable and customizable
Because the kernel is open source, you can read the complete source and documentation
There is no magic in the Linux kernel, just engineering
What is a syscall?
A userspace request for kernel action
The mechanism to escalate privileges
Note: "syscall" and "system call" are used interchangeably
demo: this riscv program will KILL you
.text
.globl _start
_start:
csrr a0, mtvec
Compile in container with:
riscv64-linux-gnu-gcc -pie -ffreestanding -nostdlib kill.s -static -march=rv64imzicsr -mabi=lp64 -shared -o kill
demo: this arm64 program will KILL you
.text
.globl _start
_start:
msr VBAR_EL1, x0
Overview
Execution contexts
Define kernelspace and userspace
Kernel representation of a process or thread
What do we want out of a system call?
The five steps of a system call
Background information
What is execution context?
An execution context is a CPU register state
First example: {set,long}jmp(3)
library functions
setjmp: save current state
longjmp: restore saved register state
demo: setjmp/longjmp example program
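A minimal sketch of the setjmp/longjmp idea, not the demo program from the lecture. The names `saved_context`, `jump_back`, and `setjmp_demo` are made up for illustration. It counts how many times `setjmp` returns: once when the context is saved, and once more when `longjmp` restores it.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf saved_context;       /* register state captured by setjmp */
static volatile int setjmp_returns; /* how many times setjmp has returned */

/* Restore the saved register state; execution resumes inside setjmp_demo. */
static void jump_back(void)
{
    longjmp(saved_context, 1);
}

/* Returns the number of times setjmp returned during one call. */
int setjmp_demo(void)
{
    setjmp_returns = 0;
    if (setjmp(saved_context) == 0) {
        /* First return from setjmp: the context was just saved. */
        setjmp_returns++;
        jump_back();
        assert(!"unreachable: longjmp does not return");
    }
    /* Second return from setjmp: longjmp restored the saved context. */
    setjmp_returns++;
    return setjmp_returns;
}
```

Calling `setjmp_demo()` should report two returns from a single `setjmp` call site, which is the sense in which an execution context is just saved register state.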
Threads in a program
threads share: address space (heap, code, static data)
threads have their own: stack (and stack pointer register), registers, instruction pointer (IP)
Kernelspace and userspace are distinct execution contexts
Primary difference: privilege level
A different kind of context difference than thread vs thread
Normal programs run in userspace
Kernelspace is the privileged execution context
Registers, stack, memory are familiar
Key difference: CPU capabilities are unrestricted
Kernelspace can be further subdivided into contexts
Kernel code may be running on behalf of a particular userspace process
Other kernel code runs on its own behalf
This material will be covered when we discuss interrupts
Define context switch and its relation to syscalls
A change of CPU register state
Some may be familiar with process to process context switches
Each of these can be broken down further
First, the transition from userspace to kernelspace
Save registers and switch stack, raise privileges
Second, the transition from kernelspace to userspace
Restore registers and switch stack, drop privileges
We will refer to each of these distinct transitions as a "context switch"
Important kernel code for syscall discussion
What is the struct task_struct
This is Linux's Process Control Block (PCB)
each pid has unique struct task_struct
demo: see include/linux/sched.h
current macro refers to struct task_struct of process in this context
riscv64: arch/riscv/include/asm/current.h
pid in kernel vs userspace
kernelspace name | userspace name |
---|---|
pid | tid |
tgid | pid |
demo: see get{p,t}id(2) in kernel/sys.c
simplified call stack
task_tgid_vnr() in include/linux/pid.h
... (namespaces are taken into account): return to this later
... locking: return to this later
task_pid_nr() { ... tsk->pid ... }
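The naming table above can be checked from userspace. This is a small sketch, assuming a Linux system where `SYS_gettid` is available through `syscall(2)`; the helper names `userspace_tid` and `ids_match_in_main_thread` are hypothetical. In a single-threaded program the only task is the thread group leader, so both numbers should coincide.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Userspace "tid": corresponds to the kernel's per-task pid field. */
static pid_t userspace_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Returns 1 when the userspace pid (kernel tgid) equals the
 * userspace tid (kernel pid) for the calling thread. */
int ids_match_in_main_thread(void)
{
    pid_t pid = getpid();        /* kernel tgid */
    pid_t tid = userspace_tid(); /* kernel pid */

    assert(pid > 0 && tid > 0);
    return pid == tid;
}
```

Called from any thread other than the initial one, the same function would be expected to return 0, since each additional thread gets its own kernel pid while sharing the tgid.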
Before Linux 2.6, there were only pids
The clone(2) call could share address space between processes
This allowed thread-like behavior
These processes were too independent
e.g. no shared signals
NPTL implements threads as specified by POSIX
This required both userspace and kernelspace changes
The C library was hardened with locking
The C library introduced the tid concept
The tid subdivides a pid
On the kernel side:
The kernel introduced the tgid concept
The tgid groups kernel pids together
Each pid corresponds to a unique struct task_struct
The tgid and pid values are stored here
All useful programs depend on system calls
demo: let's trace a program's syscall usage with strace
demo: does int main(void) {} use syscalls?
demo: a non-trivial syscall-free program: detect prime numbers
qemu-system-riscv64 -machine virt -bios none -nographic -no-reboot -net none -kernel arch/riscv/boot/Image -initrd ../rootfs.cpio -append 'panic=-1 -- 4'
qemu-system-riscv64 -machine virt -bios none -nographic -no-reboot -net none -kernel arch/riscv/boot/Image -initrd ../rootfs.cpio -append 'panic=-1 -- 3'
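The computation at the heart of a prime detector needs no kernel services at all. Below is a sketch of that computation only, under the assumption that the trailing `4` and `3` in the QEMU command lines are the numbers being tested. The function name `is_prime` is ours, and how the lecture's demo reports its result is not shown here.

```c
/* Trial division; returns 1 if n is prime, 0 otherwise.
 * Pure arithmetic on registers and stack: no syscalls involved. */
int is_prime(unsigned long n)
{
    if (n < 2)
        return 0;
    /* d <= n / d is the overflow-safe form of d * d <= n. */
    for (unsigned long d = 2; d <= n / d; d++) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}
```

Anything beyond the arithmetic, such as printing the answer or exiting cleanly, is where a normal program would start to depend on the kernel.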
What do we want out of syscalls?
We need speed
We want re-entrancy
We need security
e.g. we validate ptr args in expected addr range
confused deputy problem [1]
Linux provides a stable syscall API
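To make the pointer-validation point concrete, here is a toy range check written as ordinary C. It is an illustration of the idea, not the kernel's implementation (the kernel has its own helpers such as access_ok() and copy_from_user()). `TOY_USER_LIMIT` and `toy_user_range_ok` are hypothetical names, the limit value is an arbitrary assumption, and a 64-bit `uintptr_t` is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range; real limits are
 * architecture- and configuration-specific. Assumes 64-bit uintptr_t. */
#define TOY_USER_LIMIT ((uintptr_t)1 << 47)

/* Returns 1 if [addr, addr + len) lies entirely below TOY_USER_LIMIT. */
int toy_user_range_ok(uintptr_t addr, size_t len)
{
    /* Phrased as a subtraction so that addr + len can never wrap around. */
    return len <= TOY_USER_LIMIT && addr <= TOY_USER_LIMIT - len;
}
```

A check like this is what keeps the kernel from acting as a confused deputy: a userspace caller cannot pass a pointer into kernel memory and have the privileged side read or write it on the caller's behalf.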
How does Linux implement syscalls?
A syscall can be broken down into 5 distinct steps
Userspace invocation
Hardware-assisted privilege escalation
Kernel code handler
Hardware-assisted privilege drop
Userspace program continues
Each step boundary is a handoff of responsibility between software and hardware
Userspace program preparation and invocation
All programs make system calls
e.g. a shell as an abstraction over many syscalls
demo: see /proc/PID/syscall, e.g. pid 1
when possible prefer library-provided syscall wrapper
glibc provides wrapper functions for many syscalls
Main benefit: speed
We want to minimize the high-overhead syscalls
Checks like input validation avoid syscalls if possible
We use e.g. write(3) instead of write(2)
Common manual number section notation
e.g. write(2)
Number in parenthesis refers to manual page section number
Section 2 has system calls and section 3 has library calls
See man man for more information
demo: ltrace: like strace for library calls
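As a sketch of the difference between a library wrapper and the underlying syscall, the function below sends the same message through a pipe twice: once with the C library's `write()` wrapper and once through the generic `syscall()` entry point with `SYS_write`. The function name `write_both_ways` is hypothetical, and a Linux system with glibc or musl is assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes msg into a pipe via the libc wrapper and via syscall(),
 * then reads everything back. Returns bytes read, or -1 on failure. */
long write_both_ways(const char *msg)
{
    int fds[2];
    char buf[128];
    size_t len = strlen(msg);

    assert(len > 0 && 2 * len <= sizeof(buf));
    if (pipe(fds) != 0)
        return -1;

    ssize_t via_wrapper = write(fds[1], msg, len);           /* library wrapper */
    long via_syscall = syscall(SYS_write, fds[1], msg, len); /* by syscall number */
    ssize_t got = read(fds[0], buf, sizeof(buf));

    close(fds[0]);
    close(fds[1]);
    if (via_wrapper != (ssize_t)len || via_syscall != (long)len)
        return -1;
    return (long)got;
}
```

Both paths end in the same kernel handler; the wrapper simply hides the syscall number and the architecture-specific invocation.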
Invocation of a syscall is architecture-specific
First, specify the syscall and arguments
Second, give up control to the hardware
On riscv: ecall instruction gives up control
Specify syscall number in a7
Specify arguments 1-6 in a0, a1, a2, a3, a4, a5
Return value will be in a0
On arm64: svc #0 gives up control
Specify syscall number in x8
Specify arguments 1-6 in x0, x1, x2, x3, x4, x5
Return value will land in x0
On x86_64: syscall gives up control
Specify syscall number in rax
Specify arguments 1-6 in rdi, rsi, rdx, r10, r8, r9
Note: this differs from normal function calling convention
The syscall instruction clobbers rcx
Use r10 instead of rcx
Return value will land in rax
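The register conventions listed above can be exercised directly with inline assembly. What follows is a hedged sketch using GCC/Clang extended asm for an argument-free syscall (`getpid`), with a fallback to `syscall(2)` on other architectures. The function name `raw_getpid` is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid without the C library wrapper. */
long raw_getpid(void)
{
    long ret;
#if defined(__x86_64__)
    /* Number in rax, result in rax; syscall clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
#elif defined(__aarch64__)
    /* Number in x8, result in x0. */
    register long x8 __asm__("x8") = SYS_getpid;
    register long x0 __asm__("x0");
    __asm__ volatile ("svc #0" : "=r"(x0) : "r"(x8) : "memory");
    ret = x0;
#elif defined(__riscv)
    /* Number in a7, result in a0. */
    register long a7 __asm__("a7") = SYS_getpid;
    register long a0 __asm__("a0");
    __asm__ volatile ("ecall" : "=r"(a0) : "r"(a7) : "memory");
    ret = a0;
#else
    ret = syscall(SYS_getpid); /* portable fallback */
#endif
    assert(ret > 0);
    return ret;
}
```

The result should match what `getpid()` returns, since the wrapper ultimately performs the same instruction sequence.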
With arguments chosen and syscall selected
Hardware-assisted privilege escalation
This step is done by hardware
Rewind to boot:
Linux installs its syscall entry point into our CPU
On riscv: see _start:
On arm64: see __primary_switched
Set VBAR_EL1 to address of vector table
Vector table defined in entry.S
VBAR_EL1: Vector Base Address Register (Exception Level 1)
On x86_64: see syscall_init()
Set MSR_LSTAR to entry_SYSCALL_64 address
MSR_LSTAR: Model Specific Register: Long-mode SYSCALL Target Address Register
Back to the present syscall
The CPU is preconfigured to correctly transfer control
This makes privilege escalation safe
On riscv: switch to machine mode
On arm64: elevate exception level
On x86_64: change to ring 0
All of these states are stored in a particular register
Kernel code handles request
Part architecture-specific and part architecture generic
Execution resumes from a hardware-specified register state
Higher on call stack is more generic code
Mostly assembly and C macros
On riscv, start in handle_exception
Assembly jumps into a particular offset in excp_vect_table
Get the syscall number in do_trap_ecall_u()
Index into the sys_call_table array and call the function
sys_call_table is populated by an include statement
syscall_table_64.h is generated at build time
An included Makefile generates this from a common table
scripts/syscalls.tbl enumerates the syscalls and compatibility information
On arm64, start in VBAR_EL1
Hardware jumps to particular offset in vectors
A function defined by macro in entry.S calls into C code
The first C function is el0t_64_sync_handler()
This takes us to el0_svc_common()
invoke_syscall() indexes into jump table of handlers
This architecture-generic handler is defined by a SYSCALL_DEFINE* macro
On x86_64: start at entry_SYSCALL_64
Assembly calls into the do_syscall_64() C function
Using a few helper functions, index into jump table of system call handlers
We have another article on x86_64 syscall implementation details
This architecture-generic handler is defined by a SYSCALL_DEFINE* macro
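All three architectures end up doing the same thing: using the syscall number as an index into a table of handler functions. A userspace toy model of that dispatch step is sketched below. Every name prefixed with `toy_` is hypothetical and none of this is kernel code. The bounds check mirrors the kernel convention of answering unknown numbers with a negative errno value.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*toy_syscall_fn)(long arg);

static long toy_sys_double(long arg)    { return 2 * arg; }
static long toy_sys_increment(long arg) { return arg + 1; }

/* Toy stand-in for a syscall table: array index == syscall number. */
static const toy_syscall_fn toy_call_table[] = {
    toy_sys_double,    /* nr 0 */
    toy_sys_increment, /* nr 1 */
};

/* Look up the handler for nr and call it, or fail with -ENOSYS. */
long toy_invoke_syscall(unsigned long nr, long arg)
{
    size_t count = sizeof(toy_call_table) / sizeof(toy_call_table[0]);

    if (nr >= count)
        return -ENOSYS; /* unknown syscall number */
    return toy_call_table[nr](arg);
}
```

The real tables differ in how they are generated and in the handler signature, but the lookup-then-call shape is the part shared across architectures.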
Run SYSCALL_DEFINE*() handler
Defined in include/linux/syscalls.h
Resolve to __SYSCALL_DEFINEx(x,...
Five functions generated
See __do_sys##name(...
Note: no SYSCALL_DEFINE7 and above
Indicate error using the errno macros
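To give a feel for what a SYSCALL_DEFINE-style macro does, here is a deliberately simplified userspace imitation. It is not the kernel's definition from include/linux/syscalls.h, which generates more functions and handles more cases. `TOY_SYSCALL_DEFINE1` and the `halve` syscall are invented for illustration. The macro emits an entry point that takes a raw `long`, casts it, and forwards to a typed body.

```c
#include <errno.h>

/* Toy model only: generate toy_sys_<name>(long), which casts its raw
 * argument and forwards to the typed body toy_do_sys_<name>(type). */
#define TOY_SYSCALL_DEFINE1(name, type, arg)      \
    static long toy_do_sys_##name(type arg);      \
    long toy_sys_##name(long raw)                 \
    {                                             \
        return toy_do_sys_##name((type)raw);      \
    }                                             \
    static long toy_do_sys_##name(type arg)

/* Hypothetical syscall: return half of an even, non-negative int. */
TOY_SYSCALL_DEFINE1(halve, int, value)
{
    if (value < 0 || value % 2 != 0)
        return -EINVAL; /* errors are reported as negative errno values */
    return value / 2;
}
```

The braces after the macro invocation become the body of the last function the macro declares, which is the same trick that lets kernel handlers read like ordinary function definitions.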
Return to assembly for exit
On riscv: do_trap_ecall_u() returns to assembly code
After finishing the function found in excp_vect_table, jump to ret_from_exception
Place userspace program counter in mepc: machine exception program counter CSR
mret gives up control to the hardware
sret is used in supervisor mode while mret is used in machine mode
On arm64: el0t_64_sync() calls ret_to_user()
On x86_64: entry_SYSCALL_64() prepares to return
Hardware-assisted privilege drop
Less dangerous operation than escalation
Restore old register and stack
Drop privileges
Set program counter to userspace return address
On riscv:
The ecall instruction places the return address in the mepc control and status register (in m-mode)
The mret instruction sets the program counter to the value in mepc
On arm64:
The svc #0 instruction saves a return address in hardware (the ELR_EL1 register)
The eret instruction sets the program counter to this value
On x86_64:
Software takes control of execution
Userspace program continues
Always check for an error
The errno in kernel vs userspace context
Kernel functions return -errno
C library wrappers check for error
Store original error in errno
Convert return code to -1
example: musl syscall return
See: man 3 errno
Demo: errno utility from moreutils package
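The conversion from the kernel's -errno convention to the userspace convention fits in a few lines. The sketch below is modeled on the approach musl takes in its syscall return helper, but `toy_syscall_ret` is our own name and this is an illustration, not a copy of any library's source. Raw results in the range -4095 to -1 are treated as negated error codes.

```c
#include <errno.h>

/* Convert a raw kernel return value to the userspace convention:
 * on error, store the positive code in errno and return -1. */
long toy_syscall_ret(unsigned long raw)
{
    /* As unsigned values, -4095..-1 are the largest 4095 numbers. */
    if (raw > -4096UL) {
        errno = (int)-raw;
        return -1;
    }
    return (long)raw;
}
```

For example, a raw result of -ENOENT becomes a return value of -1 with `errno` set to `ENOENT`, while a non-error result such as 5 passes through unchanged. This is why callers check for -1 and only then inspect `errno`.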
The program continues execution
The story of a system call: Summary
Linux provides a stable system call API
Most programs run in user execution context ("userspace")
Linux code runs in several execution contexts (all "kernelspace")
Hardware plays two key roles in system calls
First, raising privileges and entering kernel execution context
Second, dropping privileges and entering user execution context
Many syscall implementation details are architecture specific
The kernel defines the main syscall handler using SYSCALL_DEFINE
The C library defines wrapper functions for many syscalls
These hide architecture-specific details
Provide POSIX-compatible behavior by hiding Linux eccentricities
Always check for an error after making a syscall