VFS: write syscall

Learning objective:

Gain a deeper understanding of file descriptors by comparing read and write

Overview

  1. Userspace and kernel entry points

  2. Contrast with read(2)

  3. A look at security hooks

  4. Superblocks and filesystem snapshotting

In the beginning

SYSCALL_DEFINE3(write,...)

  1. All it does is ksys_write()

  2. Only one other caller in s390 compat code

  3. Originally there were more callers
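
From userspace, this entry point can be reached through the libc wrapper or by invoking the system call number directly. A minimal sketch, assuming Linux with glibc; `raw_write` is a hypothetical helper name used only for illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: invoke write(2) by syscall number, bypassing the
 * libc write() wrapper. Returns bytes written, or -1 with errno set. */
ssize_t raw_write(int fd, const void *buf, size_t count)
{
	return syscall(SYS_write, fd, buf, count);
}
```

Either route should land in the same SYSCALL_DEFINE3(write, ...) entry point on the kernel side.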

Where did these callers go?

While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct files

ksys_write() removed from init/initramfs.c

ksys_write() removed from init/do_mounts_rd.c

  1. Notice that ksys_lseek() is restricted to static linkage

The other kernel interface

kernel_write()

  1. Verify the write operation

  2. Acquire a filesystem resource

  3. Perform the underlying operation

  4. Release the filesystem resource

Almost a simplified vfs_write()

Callable from userspace and the kernel

ksys_write()

  1. Obtain a reference to the file position or bail

  2. Create a local copy of the file position

  3. Perform virtual filesystem (VFS) write

  4. If needed, update the file position

  5. Drop any held references
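
The five steps above can be sketched as a userspace model. This is not kernel code: `struct model_file`, `model_vfs_write()` and `model_ksys_write()` are hypothetical stand-ins, and the reference counting is reduced to a plain integer.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct file: only what this sketch needs. */
struct model_file {
	long long f_pos;   /* shared file position */
	int has_pos;       /* 0 models stream-like files with no position */
	int refs;          /* simplified reference count */
};

/* Hypothetical stand-in for vfs_write(): pretends every byte is written
 * and advances the caller's position copy, if there is one. */
ssize_t model_vfs_write(struct model_file *f, size_t count, long long *ppos)
{
	(void)f;
	if (ppos)
		*ppos += (long long)count;
	return (ssize_t)count;
}

ssize_t model_ksys_write(struct model_file *f, size_t count)
{
	ssize_t ret = -EBADF;
	long long pos = 0, *ppos;

	if (!f)                     /* 1. nothing behind the descriptor: bail */
		return ret;
	f->refs++;                  /*    otherwise hold a reference */

	ppos = f->has_pos ? &f->f_pos : NULL;
	if (ppos) {                 /* 2. work on a local copy of the position */
		pos = *ppos;
		ppos = &pos;
	}
	ret = model_vfs_write(f, count, ppos);   /* 3. the VFS write */
	if (ret >= 0 && ppos)       /* 4. publish the new position on success */
		f->f_pos = pos;
	f->refs--;                  /* 5. drop the reference */
	return ret;
}
```

Working on a local copy means a failed write leaves the shared position untouched.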

Spot the difference

ksys_write()

How does the function differ from ksys_read()?

  • vfs_write() instead of vfs_read()

  • const char __user * buf instead of char __user * buf

Keeping these slides DRY

  1. DRY: "Don't Repeat Yourself"

  2. See the slides on read

  3. We will skip right to vfs_write()

Right into the meat

vfs_write()

  1. Verify and validate the operation

  2. Acquire filesystem resources

  3. Perform the write operation

  4. Account for the operation

  5. Release filesystem resources

First validation

vfs_write()

  1. Make sure the file is open for writing (FMODE_WRITE)

  2. Make sure the file supports writing at all (FMODE_CAN_WRITE)

  3. Make sure buf is a userspace address range
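
The first of these checks is observable from userspace: the read end of a pipe is not open for writing, so a write to it fails with EBADF. A small sketch; `write_errno` is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: attempt a one-byte write and report the errno it
 * produced, or 0 if the write succeeded. */
int write_errno(int fd)
{
	char byte = 'x';

	errno = 0;
	if (write(fd, &byte, 1) == 1)
		return 0;
	return errno;
}
```

Calling it on both ends of a fresh pipe(2) pair should give 0 for the write end and EBADF for the read end.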

Verifying the target

rw_verify_area()

  1. Disallow count values with top bit set

  2. Sanity check the file position

    1. Signed offsets may wrap or exceed bounds
  3. Verify write access
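
The count and position checks can be modeled in a few lines of userspace C. This is a simplified sketch, not the kernel function: `model_verify_area` is a hypothetical name, files that opt in to unsigned offsets are left out, and the access check in step 3 is only noted in a comment.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Userspace model of the sanity checks on count and file position. */
int model_verify_area(int64_t pos, size_t count)
{
	if ((ssize_t)count < 0)      /* 1. top bit of count is set */
		return -EINVAL;
	if (pos < 0)                 /* 2. negative file position */
		return -EINVAL;
	if ((uint64_t)count > (uint64_t)(INT64_MAX - pos))
		return -EINVAL;      /* 2.1 pos + count would wrap */
	return 0;                    /* 3. the access check would follow here */
}
```

The overflow test is written as a comparison against INT64_MAX - pos so that the sketch itself never performs a signed addition that could wrap.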

Security checks

security_file_permission()
  1. Use MAY_WRITE as our mask

  2. Call an arbitrary number of file_permission security hooks

The hook caller

call_int_hook()

  1. __label__ to declare local label.

    1. Why?
  2. RC = LSM_RET_DEFAULT(NAME): the initial return code, kept if every hook returns the default (0 for file_permission)

    1. Where is file_permission_default defined?
  3. Call each hook and stop if one fails

  4. Statement expression evaluates to return code
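
The shape of this macro can be imitated in userspace with GNU C extensions. The sketch below is an assumption-laden model: every name is hypothetical, and it walks an array with a loop where the kernel unrolls static calls. It also suggests one answer to the "Why?" above: a label declared with __label__ is local to the statement expression, so the macro can be expanded more than once in a single function.

```c
#include <errno.h>

/* Needs GNU C (statement expressions, __label__): GCC or Clang.
 * Every name below is hypothetical and only mirrors the steps above. */
typedef int (*model_hook_t)(int mask);

#define MODEL_RET_DEFAULT 0
#define MODEL_MAY_WRITE 2

int model_hook_calls;   /* counts hook invocations, for illustration */

int model_allow(int mask) { (void)mask; model_hook_calls++; return 0; }
int model_deny(int mask)  { (void)mask; model_hook_calls++; return -EACCES; }

#define model_call_int_hook(HOOKS, N, MASK) \
({ \
	__label__ out; /* 1. label is local to this expansion */ \
	int rc = MODEL_RET_DEFAULT; /* 2. initial return code */ \
	for (int i = 0; i < (N); i++) { \
		rc = (HOOKS)[i](MASK); \
		if (rc != MODEL_RET_DEFAULT) \
			goto out; /* 3. stop at the first failure */ \
	} \
out: \
	rc; /* 4. value of the statement expression */ \
})

/* Two expansions in one function: this compiles only because each
 * expansion declares its own `out` label. */
int model_check_twice(model_hook_t *a, int na, model_hook_t *b, int nb)
{
	int first = model_call_int_hook(a, na, MODEL_MAY_WRITE);
	int second = model_call_int_hook(b, nb, MODEL_MAY_WRITE);

	return first ? first : second;
}
```

With an ordinary label, the second expansion would redefine `out` in the same function and the build would fail.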

unroll

LSM_LOOP_UNROLL()

  1. Recursively defined macro

  2. #define UNROLL(...

  3. Replaced hlist iteration in summer 2024 (commit 417c5643cd67a)

  4. Macro counting done for MAX_LSM_COUNT
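
A cut-down version of the unrolling idea, as a sketch: each level is defined in terms of the previous one, and token pasting selects the level from a count. The `MODEL_` names and the fixed count of 3 are assumptions for illustration; the kernel's own macros differ in detail and derive the count from the enabled LSMs.

```c
/* Hypothetical, simplified unroll macros. */
#define MODEL_UNROLL_0(M, ...)
#define MODEL_UNROLL_1(M, ...) MODEL_UNROLL_0(M, __VA_ARGS__) M(0, __VA_ARGS__)
#define MODEL_UNROLL_2(M, ...) MODEL_UNROLL_1(M, __VA_ARGS__) M(1, __VA_ARGS__)
#define MODEL_UNROLL_3(M, ...) MODEL_UNROLL_2(M, __VA_ARGS__) M(2, __VA_ARGS__)

/* Two-step paste so that N is expanded before it is glued on. */
#define MODEL_CAT_(a, b) a##b
#define MODEL_CAT(a, b) MODEL_CAT_(a, b)
#define MODEL_UNROLL(N, M, ...) MODEL_CAT(MODEL_UNROLL_, N)(M, __VA_ARGS__)

#define MODEL_MAX_HOOKS 3   /* fixed here; an assumption of this sketch */

/* One "iteration": add slot NUM of the array to the accumulator. */
#define ADD_SLOT(NUM, ARR, ACC) (ACC) += (ARR)[NUM];

int sum_slots(const int *arr)
{
	int acc = 0;

	/* Expands to three straight-line statements, no loop. */
	MODEL_UNROLL(MODEL_MAX_HOOKS, ADD_SLOT, arr, acc)
	return acc;
}
```

Because the count is known at preprocessing time, the compiler sees a fixed sequence of statements rather than a list walk.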

LSM XARGS

union security_list_options

  1. Define a macro in a particular way

  2. Resolve many instances of this macro

    1. Specifically LSM_HOOK(..., file_permission, ...)
  3. Undefine the macro to allow later re-use

    1. This technique is called xmacros
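
A minimal sketch of the define, resolve, undefine cycle. The hook list here is hypothetical and lives in a macro for brevity (the kernel instead re-includes a header of LSM_HOOK(...) lines), and `MODEL_HOOK` takes fewer arguments than the real macro.

```c
#include <stddef.h>

/* Hypothetical hook list: one line per hook. */
#define MODEL_HOOK_LIST \
	MODEL_HOOK(int, file_permission, int mask) \
	MODEL_HOOK(int, file_open, int flags)

/* Pass 1: define the macro so each line becomes a function pointer member. */
#define MODEL_HOOK(RET, NAME, ...) RET (*NAME)(__VA_ARGS__);
union model_hook_options {
	MODEL_HOOK_LIST
};
#undef MODEL_HOOK   /* undefine to allow re-use */

/* Pass 2: same list, now resolved into a table of hook names. */
#define MODEL_HOOK(RET, NAME, ...) #NAME,
const char *const model_hook_names[] = {
	MODEL_HOOK_LIST
};
#undef MODEL_HOOK

size_t model_hook_count(void)
{
	return sizeof(model_hook_names) / sizeof(model_hook_names[0]);
}
```

Adding a hook then means adding one line to the list; every structure generated from it picks the new entry up on the next build.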

demo

xmacros_example

Example file_permission hooks

selinux_file_permission()

  1. Security Enhanced Linux: Fine-grained mandatory access control (MAC)

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()

  4. Quick demo: ls -lZ

Example file_permission hooks

apparmor_file_permission()

  1. AppArmor: Per-program security profiles

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()

More information about LSM

Upstream documentation

Back to the VFS

vfs_write()

One last check:

  1. If count > MAX_RW_COUNT, clamp it to MAX_RW_COUNT

  2. Ensures maximum value is rounded down to page boundary

  3. Exactly the same as read
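
The clamp can be sketched in userspace. The 4 KiB page size is an assumption of this sketch (the kernel uses the real page mask), and the `MODEL_`/`model_` names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/* INT_MAX rounded down to a page boundary. */
#define MODEL_MAX_RW_COUNT ((size_t)INT_MAX & MODEL_PAGE_MASK)

size_t model_clamp_count(size_t count)
{
	if (count > MODEL_MAX_RW_COUNT)
		count = MODEL_MAX_RW_COUNT;
	return count;
}
```

With 4 KiB pages the limit works out to 0x7ffff000 bytes, so an oversized request is shortened rather than rejected.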

Acquire filesystem resources

file_start_write()

  1. Check whether this is a regular file

  2. A regular file is 0 or more bytes on disk

  3. What are some examples of files that are NOT regular?

  • Not regular: character devices, directories, symbolic links

  • S_ISREG()
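
The same S_ISREG() test is available in userspace through the mode bits that stat(2) returns. A small sketch; both helper names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if a file with this mode is regular (the case where the
 * superblock step applies), 0 otherwise. */
int model_takes_sb_write(mode_t mode)
{
	return S_ISREG(mode) ? 1 : 0;
}

/* Same question for a path; returns -1 if stat() fails. */
int path_is_regular(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return model_takes_sb_write(st.st_mode);
}
```

Note that stat() follows symbolic links, so a link to a regular file reports as regular; lstat() would be needed to see the link itself.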

Acquire filesystem resources

sb_start_write()

  1. Calls __sb_start_write()

  2. Acquire superblock write access

  3. Each filesystem has one superblock

  4. Contains meta-information about filesystem

  5. Only relevant for regular files

Don't freeze me!

SB_FREEZE_WRITE and struct super_block
  1. Freezing enables snapshot fs backups

  2. Select from an array of percpu reader-writer locks

  3. Read is CPU local, write is cross-core
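
From userspace, a freeze can be requested with the FIFREEZE ioctl and lifted with FITHAW. The wrappers below are a hedged sketch with hypothetical names; they need CAP_SYS_ADMIN and a filesystem that supports freezing, and they are not necessarily what the demo that follows uses.

```c
#include <errno.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>

/* Freeze the filesystem that fd lives on. Returns 0 or a negative errno. */
int fs_freeze(int fd)
{
	return ioctl(fd, FIFREEZE, 0) == 0 ? 0 : -errno;
}

/* Thaw a previously frozen filesystem. Returns 0 or a negative errno. */
int fs_thaw(int fd)
{
	return ioctl(fd, FITHAW, 0) == 0 ? 0 : -errno;
}
```

While a filesystem is frozen, writers are expected to block when they try to acquire write access to its superblock, so it is safer to experiment on a scratch filesystem such as a loop device than on the root filesystem.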

demo

Freezing a filesystem

free.c and make_loop.sh

Back to the VFS

vfs_write()

Now we can actually write!

  1. f_op->write() calls into the filesystem or module

  2. Like read, fall back to f_op->write_iter

  3. We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
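
The dispatch order can be modeled with a small table of function pointers. This is a userspace sketch with hypothetical names and simplified signatures; in the kernel the write_iter path goes through a synchronous wrapper rather than being called directly.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, cut-down file_operations with only the two write methods. */
struct model_fops {
	ssize_t (*write)(const char *buf, size_t count);
	ssize_t (*write_iter)(const char *buf, size_t count);
};

/* Stub methods that only report which path was taken. */
ssize_t stub_write(const char *buf, size_t count)
{
	(void)buf; (void)count;
	return 1;
}

ssize_t stub_write_iter(const char *buf, size_t count)
{
	(void)buf; (void)count;
	return 2;
}

ssize_t model_dispatch_write(const struct model_fops *fops,
			     const char *buf, size_t count)
{
	if (fops->write)                 /* 1. preferred method */
		return fops->write(buf, count);
	if (fops->write_iter)            /* 2. fallback */
		return fops->write_iter(buf, count);
	return -EINVAL;                  /* 3. neither method exists */
}
```

The -EINVAL branch stays as a safety net: a file with neither method should not have been marked FMODE_CAN_WRITE in the first place.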

Back to the VFS

vfs_write()

When we write some bytes:

  1. Notify of file modification

  2. Account for bytes written by this task

Back to the VFS

vfs_write()

Unconditionally:

  1. Account for write syscall count by this task

  2. Release any filesystem resources acquired earlier

  3. Return bytes written or errno to userspace
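
The conditional and unconditional bookkeeping on the last two slides can be combined into one userspace model. The struct and function names are hypothetical; the counter names `wchar` and `syscw` are borrowed from the fields shown in /proc/<pid>/io on kernels built with task I/O accounting.

```c
#include <sys/types.h>

/* Hypothetical per-task counters for this sketch. */
struct model_task_io {
	unsigned long long wchar;     /* bytes written */
	unsigned long long syscw;     /* write syscalls issued */
	unsigned long modify_events;  /* stand-in for modification notifications */
};

ssize_t model_finish_write(struct model_task_io *io, ssize_t ret)
{
	if (ret > 0) {                /* only when some bytes were written */
		io->modify_events++;
		io->wchar += (unsigned long long)ret;
	}
	io->syscw++;                  /* unconditionally, even on error */
	return ret;                   /* bytes written or a negative errno */
}
```

A failed or zero-length write therefore still counts as a write syscall, but adds no bytes and raises no modification event.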

This concludes write(2)

Summary

Writing is quite similar to reading, but a bit more complex

Summary

Linux Security Modules (LSM) provides a flexible way to enforce sets of security policies at the kernel level

Summary

Minimizing memory footprint is critical in the kernel, which justifies hlist: its head holds one pointer instead of two

Summary

Kernel internal use of system call functionality is still evolving

End