VFS: write syscall


Learning objective:

Deepen your understanding of file descriptors by comparing read and write


Overview

  1. Userspace and kernel entry points

  2. Contrast with read(2)

  3. A look at security hooks

  4. Superblocks and filesystem snapshotting


In the beginning

SYSCALL_DEFINE3(write,...)

  1. All it does is call ksys_write()

  2. Only one other caller in s390 compat code

  3. Originally there were more callers
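
A minimal userspace sketch of reaching this entry point, assuming Linux with glibc: a raw syscall(SYS_write, ...) lands in the same SYSCALL_DEFINE3(write, ...) that the libc write() wrapper uses. The helper names raw_write() and raw_write_roundtrip() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: issue write(2) directly, bypassing the libc wrapper. */
static ssize_t raw_write(int fd, const void *buf, size_t count)
{
	return syscall(SYS_write, fd, buf, count);
}

/*
 * Push three bytes through a pipe using the raw syscall, then read them
 * back. Returns the number of bytes read, or -1 on failure.
 */
static ssize_t raw_write_roundtrip(char out[3])
{
	int fds[2];
	ssize_t n;

	assert(out != NULL);
	if (pipe(fds) != 0)
		return -1;
	n = raw_write(fds[1], "vfs", 3);
	if (n == 3)
		n = read(fds[0], out, 3);
	else
		n = -1;
	close(fds[0]);
	close(fds[1]);
	return n;
}
```

Calling raw_write_roundtrip() with a three-byte buffer should fill it with the bytes that went through the pipe.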


Where did these callers go?

While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct files

ksys_write() removed from init/initramfs.c

ksys_write() removed from init/do_mounts_rd.c

  1. Notice that ksys_lseek() is restricted to static linkage

The other kernel interface

kernel_write()

  1. Verify the write operation

  2. Acquire a filesystem resource

  3. Perform the underlying operation

  4. Release the filesystem resource

Almost a simplified vfs_write()


Callable from userspace and the kernel

ksys_write()

  1. Obtain a reference to the file (locking its position if needed) or bail

  2. Create a local copy of the file position

  3. Perform virtual filesystem (VFS) write

  4. If needed, update the file position

  5. Drop any held references
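
The position update in step 4 can be observed from userspace. This is a minimal sketch, assuming /tmp is writable; the function name offset_after_two_writes() is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write "abc" then "de" to a fresh temporary file and report the file
 * position afterwards. Returns -1 on any failure.
 */
static off_t offset_after_two_writes(void)
{
	char path[] = "/tmp/vfs_write_XXXXXX";	/* assumes /tmp is writable */
	off_t pos = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the file alive */
	if (write(fd, "abc", 3) == 3 && write(fd, "de", 2) == 2)
		pos = lseek(fd, 0, SEEK_CUR);	/* query without moving */
	close(fd);
	return pos;
}
```

Each write(2) advances the shared file position, so after writing 3 and then 2 bytes the reported offset is 5.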


Spot the difference

ksys_write()

How does the function differ from ksys_read()?


Keeping these slides DRY

  1. DRY: "Don't Repeat Yourself"

  2. See the slides on read

  3. We will skip right to vfs_write()


Right into the meat

vfs_write()

  1. Verify and validate the operation

  2. Acquire filesystem resources

  3. Perform the write operation

  4. Account for the operation

  5. Release filesystem resources


First validation

vfs_write()

  1. Make sure the file is open for writing (FMODE_WRITE)

  2. Make sure writing makes sense (FMODE_CAN_WRITE)

  3. Make sure buf is a userspace address range
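
The first of these checks is easy to trigger from userspace. A minimal sketch, assuming Linux pipe semantics, where the read end of a pipe is not open for writing; the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try to write to the read end of a pipe, which is not open for writing.
 * Returns the errno reported by write(2), 0 if the write unexpectedly
 * succeeds, or -1 if the pipe cannot be created.
 */
static int errno_for_write_on_read_end(void)
{
	int fds[2];
	int err = 0;

	if (pipe(fds) != 0)
		return -1;
	if (write(fds[0], "x", 1) < 0)
		err = errno;
	close(fds[0]);
	close(fds[1]);
	return err;
}
```

On Linux the failed FMODE_WRITE check surfaces as EBADF.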


Verifying the target

rw_verify_area()

  1. Disallow count values with top bit set

  2. Sanity check the file position

    1. Signed offsets may wrap or exceed bounds
  3. Verify write access
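
The first two checks can be modeled in userspace. This is a simplified sketch, not the kernel function: verify_range_sketch() is a hypothetical name, it assumes 64-bit offsets and counts, and it leaves out files that permit unsigned offsets as well as the security hook call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the range checks.
 * Returns 0 if the range is acceptable, or a negative errno value.
 */
static int verify_range_sketch(int64_t pos, uint64_t count)
{
	/* A count with the top bit set would be negative as a signed size. */
	if (count > (uint64_t)INT64_MAX)
		return -EINVAL;
	/* Negative positions are rejected for ordinary files. */
	if (pos < 0)
		return -EINVAL;
	/* pos + count must not wrap past the largest signed offset. */
	if (count > (uint64_t)(INT64_MAX - pos))
		return -EINVAL;
	return 0;
}
```

For example, verify_range_sketch(0, 4096) is accepted, while a position of INT64_MAX with a count of 1 is rejected because the sum would wrap.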


Security checks

security_file_permission()

  1. Use MAY_WRITE as our mask

  2. Call an arbitrary number of file_permission security hooks


The hook caller

call_int_hook()

  1. __label__ declares a local label

    1. Why?
  2. RC = LSM_RET_DEFAULT(NAME) sets the initial return code, which is kept unless a hook returns something else

    1. Where is file_permission_default defined?
  3. Call each hook and stop if one fails

  4. Statement expression evaluates to return code
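
The shape of such a macro can be sketched in userspace with GNU C extensions (gcc or clang). Everything here is hypothetical: CALL_TWO_HOOKS, the two hooks, and the use of 0 as the default return code. It illustrates one answer to the "Why?" above: without __label__, expanding the macro twice in one function would define the label OUT twice.

```c
#include <assert.h>
#include <errno.h>

static int hook_calls;	/* counts how many hooks actually ran */

/* Two hypothetical hooks: 0 means "allow", anything else is an error. */
static int hook_allow(int mask) { (void)mask; hook_calls++; return 0; }
static int hook_deny(int mask)  { (void)mask; hook_calls++; return -EACCES; }

/*
 * A statement expression whose value is RC. The local label keeps OUT
 * private to each expansion of the macro.
 */
#define CALL_TWO_HOOKS(first, second, mask) \
({ \
	__label__ OUT; \
	int RC = 0; \
 \
	RC = (first)(mask); \
	if (RC != 0) \
		goto OUT; \
	RC = (second)(mask); \
OUT: \
	RC; \
})

/* Expands the macro twice in one function and reports how many hooks ran. */
static int calls_until_stop(int deny_first)
{
	int rc;

	hook_calls = 0;
	if (deny_first)
		rc = CALL_TWO_HOOKS(hook_deny, hook_allow, 0x2);
	else
		rc = CALL_TWO_HOOKS(hook_allow, hook_deny, 0x2);
	assert(rc == -EACCES);
	return hook_calls;
}
```

When the denying hook comes first, only one hook runs; when it comes second, both run.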


unroll

LSM_LOOP_UNROLL()

  1. Recursively defined macro

  2. #define UNROLL(...

  3. Changed from hlist iteration in Summer 2024 by 417c5643cd67a

  4. Macro counting done for MAX_LSM_COUNT
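
A minimal sketch of the unrolling idea, with hypothetical names (UNROLL_SKETCH, SLOT_COUNT, ADD_SLOT) and a depth limit of 3 chosen only for illustration. The preprocessor cannot truly recurse, so each depth gets its own definition that expands the one below it.

```c
#include <assert.h>
#include <stddef.h>

/* Token pasting helpers so that N is expanded before being concatenated. */
#define PASTE_(a, b) a##b
#define PASTE(a, b)  PASTE_(a, b)

/* Each level expands the level below it and then emits one more MACRO(i). */
#define UNROLL_SKETCH_0(MACRO, arg)
#define UNROLL_SKETCH_1(MACRO, arg) UNROLL_SKETCH_0(MACRO, arg) MACRO(0, arg)
#define UNROLL_SKETCH_2(MACRO, arg) UNROLL_SKETCH_1(MACRO, arg) MACRO(1, arg)
#define UNROLL_SKETCH_3(MACRO, arg) UNROLL_SKETCH_2(MACRO, arg) MACRO(2, arg)
#define UNROLL_SKETCH(N, MACRO, arg) PASTE(UNROLL_SKETCH_, N)(MACRO, arg)

#define SLOT_COUNT 3	/* stands in for a compile-time hook count */

/* Body to repeat: add table[i] to the running total. */
#define ADD_SLOT(i, table) total += (table)[i];

static int sum_slots(const int table[SLOT_COUNT])
{
	int total = 0;

	assert(table != NULL);
	/* Expands to three straight-line additions, with no loop. */
	UNROLL_SKETCH(SLOT_COUNT, ADD_SLOT, table)
	return total;
}
```

Because the count is known at compile time, the generated code is a fixed sequence of statements rather than a list walk.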


LSM XARGS

union security_list_options

  1. Define a macro in a particular way

  2. Resolve many instances of this macro

    1. Specifically LSM_HOOK(..., file_permission, ...)
  3. Undefine the macro to allow later re-use

    1. This technique is called xmacros

demo

xmacros_example
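
A minimal sketch of the define, resolve, undefine cycle. It is not the demo referenced above: HOOK_DEFS_SKETCH stands in for a shared header of hook definitions, and HOOK_SKETCH takes fewer arguments than the real LSM_HOOK.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical hook list; stands in for a header included several times. */
#define HOOK_DEFS_SKETCH \
	HOOK_SKETCH(int, file_permission) \
	HOOK_SKETCH(int, file_open) \
	HOOK_SKETCH(int, inode_create)

/* First pass: define the macro so each entry becomes an enum constant. */
#define HOOK_SKETCH(ret, name) HOOK_##name,
enum hook_id { HOOK_DEFS_SKETCH HOOK_COUNT };
#undef HOOK_SKETCH

/* Second pass: redefine it so each entry becomes a string. */
#define HOOK_SKETCH(ret, name) #name,
static const char *const hook_names[] = { HOOK_DEFS_SKETCH };
#undef HOOK_SKETCH

/* Look up a hook name; the two tables cannot drift apart. */
static const char *hook_name(enum hook_id id)
{
	assert((int)id >= 0 && (int)id < HOOK_COUNT);
	return hook_names[id];
}
```

Adding a line to the list updates both the enum and the name table, which is the main appeal of the technique.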


Example file_permission hooks

selinux_file_permission()

  1. Security Enhanced Linux: Fine-grained mandatory access control (MAC)

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()

  4. Quick demo: ls -lZ


Example file_permission hooks

apparmor_file_permission()

  1. AppArmor: Per-program security profiles

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()


More information about LSM

Upstream documentation


Back to the VFS

vfs_write()

One last check:

  1. If count > MAX_RW_COUNT, clamp it to MAX_RW_COUNT

  2. Ensures maximum value is rounded down to page boundary

  3. Exactly the same as read
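
A small sketch of the clamp, assuming 4 KiB pages and a 32-bit int. The _SKETCH names are hypothetical stand-ins for the kernel's PAGE_SIZE, PAGE_MASK and MAX_RW_COUNT.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096UL	/* assumption: 4 KiB pages */
#define PAGE_MASK_SKETCH (~(PAGE_SIZE_SKETCH - 1))
/* Largest int value rounded down to a page boundary. */
#define MAX_RW_COUNT_SKETCH ((size_t)INT_MAX & PAGE_MASK_SKETCH)

static_assert(MAX_RW_COUNT_SKETCH % PAGE_SIZE_SKETCH == 0,
	      "the limit must sit on a page boundary");

/* Requests above the limit are shortened, not rejected. */
static size_t clamp_rw_count(size_t count)
{
	if (count > MAX_RW_COUNT_SKETCH)
		count = MAX_RW_COUNT_SKETCH;
	return count;
}
```

Under these assumptions the limit is 0x7ffff000, so an oversized write returns a short count instead of failing.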


Acquire filesystem resources

file_start_write()

  1. Check whether this is a regular file

  2. A regular file is 0 or more bytes on disk

  3. What are some examples of files that are NOT regular?

  4. Not regular: character devices, directories, links

  5. S_ISREG()
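
The same S_ISREG() test is available in userspace through stat(2). A minimal sketch; path_is_regular() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 for a regular file, 0 for anything else, -1 if stat() fails. */
static int path_is_regular(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return -1;
	return S_ISREG(st.st_mode) ? 1 : 0;
}
```

A directory such as "/" reports 0, and a character device such as /dev/null would as well.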


Acquire filesystem resources

sb_start_write()

  1. Calls __sb_start_write()

  2. Acquire superblock write access

  3. Each filesystem has one superblock

  4. Contains meta-information about filesystem

  5. Only relevant for regular files


Don't freeze me!

SB_FREEZE_WRITE and struct super_block

  1. Freezing enables snapshot fs backups

  2. Select from an array of percpu reader-writer locks

  3. Read is CPU local, write is cross-core
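
One userspace route to freezing is the FIFREEZE and FITHAW ioctl pair from <linux/fs.h>. This is a hedged sketch with a hypothetical helper name; it needs CAP_SYS_ADMIN and a filesystem that supports freezing, and it is not claimed to match any particular demo program.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>

/*
 * Freeze (freeze != 0) or thaw the filesystem containing the open fd.
 * Returns 0 on success or a negative errno value.
 */
static int set_frozen(int fd, int freeze)
{
	if (ioctl(fd, freeze ? FIFREEZE : FITHAW, 0) != 0)
		return -errno;
	return 0;
}
```

While a filesystem is frozen, a writer that reaches sb_start_write() blocks until the thaw, so only try this on a scratch filesystem such as a loop device.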


demo

Freezing a filesystem

free.c and make_loop.sh


Back to the VFS

vfs_write()

Now we can actually write!

  1. f_op->write() calls into the filesystem or module

  2. Like read, fallback to f_op->write_iter

  3. We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
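
The dispatch order can be sketched with plain function pointers. struct fops_sketch is a hypothetical, heavily reduced stand-in for struct file_operations; the real write_iter takes different arguments and is reached through a helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in with only the two members of interest. */
struct fops_sketch {
	ssize_t (*write)(const char *buf, size_t count);
	ssize_t (*write_iter)(const char *buf, size_t count);
};

/* Prefer ->write, fall back to ->write_iter, otherwise report -EINVAL. */
static ssize_t dispatch_write(const struct fops_sketch *f_op,
			      const char *buf, size_t count)
{
	assert(f_op != NULL);
	if (f_op->write)
		return f_op->write(buf, count);
	if (f_op->write_iter)
		return f_op->write_iter(buf, count);
	return -EINVAL;
}

/* Toy implementations whose return values reveal which path was taken. */
static ssize_t toy_write(const char *buf, size_t count)
{
	(void)buf;
	return (ssize_t)count;
}

static ssize_t toy_write_iter(const char *buf, size_t count)
{
	(void)buf;
	return (ssize_t)count + 1000;	/* offset marks the fallback path */
}
```

A table with neither member set is the case that yields -EINVAL, which FMODE_CAN_WRITE is meant to rule out before this point.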


Back to the VFS

vfs_write()

When we write some bytes:

  1. Notify of file modification

  2. Account for bytes written by this task


Back to the VFS

vfs_write()

Unconditionally:

  1. Account for write syscall count by this task

  2. Release any filesystem resources acquired earlier

  3. Return bytes written or errno to userspace

This concludes write(2)


Summary

Writing is quite similar to reading, but a bit more complex


Summary

The Linux Security Modules (LSM) framework provides a flexible way to enforce security policies at the kernel level


Summary

Minimizing memory footprint in the kernel is critical, which justifies hlist: its list head holds one pointer instead of two


Summary

Kernel internal use of system call functionality is still evolving


End

