Gain a deeper understanding of file descriptors by comparing read and write
Userspace and kernel entry points
Contrast with read(2)
read(2)
A look at security hooks
Superblocks and filesystem snapshotting
SYSCALL_DEFINE3(write,...)
All it does is call ksys_write()
ksys_write()
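From userspace, the same kernel entry point can be reached through the libc wrapper or by raw system call number. The sketch below is Linux-only, and the helper names `write_via_libc` and `write_via_syscall` are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall(2) under strict -std modes */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: go through the libc wrapper for write(2). */
ssize_t write_via_libc(int fd, const void *buf, size_t count)
{
    return write(fd, buf, count);
}

/* Hypothetical helper: enter the same system call by number. */
ssize_t write_via_syscall(int fd, const void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_write, fd, buf, count);
}
```

Both helpers should behave identically on the same descriptor, since the wrapper adds essentially nothing beyond errno handling.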
Only one other caller in s390 compat code
Originally there were more callers
While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct files
struct file
ksys_write() removed from init/initramfs.c
ksys_write() removed from init/do_mounts_rd.c
ksys_lseek()
kernel_write()
Verify the write operation
Acquire a filesystem resource
Perform the underlying operation
Release the filesystem resource
Almost a simplified vfs_write()
vfs_write()
Obtain a reference to the file position or bail
Create a local copy of the file position
Perform virtual filesystem (VFS) write
If needed, update the file position
Drop any held references
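The position handling listed above is visible from userspace. A minimal sketch, assuming a writable /tmp: after a successful write the file position has advanced by the byte count, while a pipe has no position to update at all. The function names and the temporary file template are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for mkstemp(3) under strict -std modes */
#endif
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to a fresh temporary file and report the file position
 * afterwards. Returns -1 on any failure.
 */
off_t position_after_write(const char *data, size_t len)
{
    char path[] = "/tmp/write_pos_XXXXXX";  /* hypothetical template */
    int fd = mkstemp(path);
    off_t pos = -1;

    if (fd < 0)
        return -1;
    unlink(path);               /* the open descriptor keeps the file alive */

    if (write(fd, data, len) == (ssize_t)len)
        pos = lseek(fd, 0, SEEK_CUR);   /* query the position, do not move it */

    close(fd);
    return pos;
}

/* Returns 1 if a pipe reports a file position, 0 if not, -1 on error. */
int pipe_has_position(void)
{
    int p[2];
    int seekable;

    if (pipe(p) != 0)
        return -1;
    seekable = lseek(p[1], 0, SEEK_CUR) != (off_t)-1;  /* fails with ESPIPE */
    close(p[0]);
    close(p[1]);
    return seekable;
}
```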
How does the function differ from ksys_read()?
ksys_read()
vfs_write() instead of vfs_read()
vfs_read()
const char __user * buf instead of char __user * buf
const char __user * buf
char __user * buf
DRY: "Don't Repeat Yourself"
See the slides on read
We will skip right to vfs_write()
Verify and validate the operation
Acquire filesystem resources
Perform the write operation
Account for the operation
Release filesystem resources
Make sure the file is open for writing (FMODE_WRITE)
FMODE_WRITE
Make sure writing makes sense (FMODE_CAN_WRITE)
FMODE_CAN_WRITE
Make sure buf is a userspace address range
buf
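Two of these checks can be observed from userspace through errno. This is a sketch with hypothetical function names. Note that the second case uses a buffer that is inside the userspace address range but unreadable, so it is rejected later, when the kernel tries to copy from it, rather than by the range check itself. The errno is EFAULT either way.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* errno from writing one byte to a descriptor opened read-only, -1 on setup failure. */
int errno_write_to_readonly_fd(void)
{
    int fd = open("/dev/null", O_RDONLY);
    int err = 0;

    if (fd < 0)
        return -1;
    if (write(fd, "x", 1) < 0)
        err = errno;            /* no FMODE_WRITE on this file: EBADF */
    close(fd);
    return err;
}

/* errno from writing out of a PROT_NONE mapping, -1 on setup failure. */
int errno_write_from_unreadable_buf(void)
{
    int p[2];
    int err = 0;
    void *buf = mmap(NULL, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return -1;
    if (pipe(p) != 0) {
        munmap(buf, 4096);
        return -1;
    }
    if (write(p[1], buf, 1) < 0)
        err = errno;            /* the copy from userspace faults: EFAULT */
    close(p[0]);
    close(p[1]);
    munmap(buf, 4096);
    return err;
}
```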
rw_verify_area()
Disallow count values with top bit set
Sanity check the file position
Verify write access
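The first two checks can be modelled in plain C. This is a simplified userspace sketch, not the kernel function: it assumes a file with ordinary signed offsets, leaves out the security hook entirely, and `model_verify_area` is a made-up name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the (pos, count) pair looks sane, or a negative errno. */
int model_verify_area(int64_t pos, size_t count)
{
    /* A count with its top bit set would be negative as a signed size. */
    if (count > SIZE_MAX / 2)
        return -EINVAL;

    /* Negative positions are rejected for ordinary files. */
    if (pos < 0)
        return -EINVAL;

    /* pos + count must not run past the largest representable offset. */
    if ((uint64_t)count > (uint64_t)(INT64_MAX - pos))
        return -EINVAL;

    return 0;
}
```

The overflow test is written as a comparison against the remaining headroom so that the model never performs a signed addition that could overflow.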
security_file_permission()
Use MAY_WRITE as our mask
MAY_WRITE
Call an arbitrary number of file_permission security hooks
file_permission
call_int_hook()
__label__ declares a local label
__label__
RC = LSM_RET_DEFAULT(NAME) is the initial return code, returned if all hooks return 0
LSM_RET_DEFAULT(NAME)
file_permission_default
Call each hook and stop if one fails
Statement expression evaluates to return code
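The shape of such a macro can be sketched in userspace with GNU C extensions (statement expressions and `__label__`, supported by GCC and Clang). Everything prefixed `demo_` or `DEMO_` is hypothetical, and a plain loop over function pointers stands in for the kernel's unrolled calls.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_RET_DEFAULT 0      /* hypothetical default return code */

typedef int (*demo_hook_t)(int mask);

int demo_hooks_called;          /* counts invocations, for illustration only */

int demo_allow(int mask) { (void)mask; return 0; }
int demo_deny(int mask)  { (void)mask; return -EACCES; }

/*
 * The value of a statement expression is its last expression, here RC.
 * __label__ makes OUT local to this block, so the macro can be expanded
 * more than once in the same function without a duplicate label.
 */
#define demo_call_int_hook(HOOKS, N, MASK)                  \
({                                                          \
    __label__ OUT;                                          \
    int RC = DEMO_RET_DEFAULT;                              \
    for (size_t i = 0; i < (N); i++) {                      \
        demo_hooks_called++;                                \
        RC = (HOOKS)[i](MASK);                              \
        if (RC != DEMO_RET_DEFAULT)                         \
            goto OUT;       /* first failing hook wins */   \
    }                                                       \
OUT:                                                        \
    RC;                                                     \
})

int demo_check(const demo_hook_t *hooks, size_t n, int mask)
{
    return demo_call_int_hook(hooks, n, mask);
}
```

With hooks `{ demo_allow, demo_deny, demo_allow }`, the third hook is never called, because the second one returns a non-default code.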
LSM_LOOP_UNROLL()
Recursively defined macro
#define UNROLL(...
Changed from hlist iteration in Summer 2024 by 417c5643cd67a
417c5643cd67a
Macro counting done for MAX_LSM_COUNT
MAX_LSM_COUNT
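A reduced sketch of the unrolling idea: each level is defined in terms of the previous one, and a compile-time count selects the level. The names are hypothetical, and `DEMO_COUNT` is a fixed stand-in for a count that the kernel derives with further macro machinery.

```c
/* Each level expands the previous level, then the body macro M once more. */
#define DEMO_UNROLL_0(M, ...)
#define DEMO_UNROLL_1(M, ...) DEMO_UNROLL_0(M, __VA_ARGS__) M(0, __VA_ARGS__)
#define DEMO_UNROLL_2(M, ...) DEMO_UNROLL_1(M, __VA_ARGS__) M(1, __VA_ARGS__)
#define DEMO_UNROLL_3(M, ...) DEMO_UNROLL_2(M, __VA_ARGS__) M(2, __VA_ARGS__)

/*
 * N is macro-expanded when it is substituted into DEMO_UNROLL, so by the
 * time DEMO_PASTE joins the tokens it is already a plain number.
 */
#define DEMO_PASTE(a, b) a##b
#define DEMO_UNROLL(N, M, ...) DEMO_PASTE(DEMO_UNROLL_, N)(M, __VA_ARGS__)

#define DEMO_COUNT 3            /* stand-in for a compile-time count */

/* Body macro: receives the slot number first, then the extra arguments. */
#define DEMO_ADD_SLOT(NUM, ARR, SUM) SUM += (ARR)[NUM];

int demo_sum3(const int *arr)
{
    int sum = 0;

    /* Expands to: sum += arr[0]; sum += arr[1]; sum += arr[2]; */
    DEMO_UNROLL(DEMO_COUNT, DEMO_ADD_SLOT, arr, sum)
    return sum;
}
```

Because the slot number is a literal in each expansion, every step can refer to its own per-slot symbol, which a runtime loop index could not do.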
union security_list_options
Define a macro in particular way
Resolve many instances of this macro
LSM_HOOK(..., file_permission, ...)
Undefine the macro to allow later re-use
xmacros_example
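The define, resolve, undefine cycle can be sketched in one file. This is an illustration only: the hook list is hypothetical and is kept in a list macro, whereas the kernel keeps its `LSM_HOOK(...)` lines in a header that is included once per pass.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook list: (return type, default value, name). */
#define DEMO_HOOK_LIST                      \
    DEMO_HOOK(int, 0, file_permission)      \
    DEMO_HOOK(int, 0, file_open)            \
    DEMO_HOOK(int, 1, demo_other)

/* Pass 1: one function pointer member per hook. */
#define DEMO_HOOK(RET, DEFAULT, NAME) RET (*NAME)(void);
union demo_hook_options {
    DEMO_HOOK_LIST
};
#undef DEMO_HOOK

/* Pass 2: one NAME_default constant per hook. */
#define DEMO_HOOK(RET, DEFAULT, NAME) NAME##_default = (DEFAULT),
enum demo_hook_defaults {
    DEMO_HOOK_LIST
};
#undef DEMO_HOOK

/* Pass 3: a table of hook names as strings. */
#define DEMO_HOOK(RET, DEFAULT, NAME) #NAME,
const char *const demo_hook_names[] = {
    DEMO_HOOK_LIST
};
#undef DEMO_HOOK

/* Returns the index of a hook by name, or -1 if it is not in the list. */
int demo_hook_index(const char *name)
{
    size_t n = sizeof(demo_hook_names) / sizeof(demo_hook_names[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(demo_hook_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

Adding a hook means adding one line to the list; the union member, the default constant, and the name table entry all follow from it.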
selinux_file_permission()
Security Enhanced Linux: Fine-grained mandatory access control (MAC)
Associated with file_permission hook here
Registered with security subsystem by security_add_hooks()
security_add_hooks()
Quick demo: ls -lZ
ls -lZ
apparmor_file_permission()
AppArmor: Per-program security profiles
Upstream documentation
One last check:
If count > MAX_RW_COUNT, clamp it to MAX_RW_COUNT
MAX_RW_COUNT
Ensures maximum value is rounded down to page boundary
Exactly the same as read
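The arithmetic behind the limit, as a userspace sketch. The 4 KiB page size is an assumption (it varies by architecture and configuration), and the `DEMO_` names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL                   /* assumption: 4 KiB pages */
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))  /* clears the in-page bits */

/* INT_MAX rounded down to a page boundary: 0x7ffff000 with 4 KiB pages. */
#define DEMO_MAX_RW_COUNT ((size_t)INT_MAX & DEMO_PAGE_MASK)

/* Oversized requests are shortened, not rejected. */
size_t demo_clamp_count(size_t count)
{
    return count > DEMO_MAX_RW_COUNT ? DEMO_MAX_RW_COUNT : count;
}
```

A caller asking for more than this in one call simply gets a short write and is expected to loop.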
file_start_write()
Check whether this is a regular file
A regular file is 0 or more bytes on disk
What are some examples of files that are NOT regular?
Not regular: character devices, directories, links
S_ISREG()
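The same mode test is available in userspace through stat(2). A small sketch, with `is_regular` as a hypothetical helper name:

```c
#include <sys/stat.h>

/* Returns 1 if path names a regular file, 0 if not, -1 if stat(2) fails. */
int is_regular(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return S_ISREG(st.st_mode) ? 1 : 0;
}
```

On a typical Linux system, `is_regular("/dev/null")` (a character device) and `is_regular("/")` (a directory) should both return 0.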
sb_start_write()
Calls __sb_start_write()
__sb_start_write()
Acquire superblock write access
Each filesystem has one superblock
Contains meta-information about filesystem
Only relevant for regular files
SB_FREEZE_WRITE
struct super_block
Freezing enables snapshot fs backups
Select from an array of percpu reader-writer locks
Read is CPU local, write is cross-core
Freezing a filesystem
free.c and make_loop.sh
free.c
make_loop.sh
Now we can actually write!
f_op->write() calls into the filesystem or module
f_op->write()
Like read, fall back to f_op->write_iter
f_op->write_iter
We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
-EINVAL
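The dispatch order can be modelled with a reduced operations table. This is a userspace sketch with hypothetical `demo_` types. In particular, the real write_iter callback takes different arguments and is reached through a wrapper; here both callbacks share one signature to keep the model short.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

struct demo_file;

/* Heavily reduced stand-in for an operations table. */
struct demo_file_operations {
    ssize_t (*write)(struct demo_file *f, const char *buf, size_t count);
    ssize_t (*write_iter)(struct demo_file *f, const char *buf, size_t count);
};

struct demo_file {
    const struct demo_file_operations *f_op;
    size_t bytes_seen;
};

/* Example callback: accept everything and count the bytes. */
ssize_t demo_count_bytes(struct demo_file *f, const char *buf, size_t count)
{
    (void)buf;
    f->bytes_seen += count;
    return (ssize_t)count;
}

/* Prefer write, fall back to write_iter, otherwise fail with -EINVAL. */
ssize_t demo_do_write(struct demo_file *f, const char *buf, size_t count)
{
    if (f->f_op->write)
        return f->f_op->write(f, buf, count);
    if (f->f_op->write_iter)
        return f->f_op->write_iter(f, buf, count);
    return -EINVAL;
}
```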
When we write some bytes:
Notify of file modification
Account for bytes written by this task
Unconditionally:
Account for write syscall count by this task
Release any filesystem resources acquired earlier
Return bytes written or errno to userspace
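On kernels built with task I/O accounting, the per-task counters are exposed in /proc/self/io, where `wchar` tracks bytes written and `syscw` tracks write system calls. The sketch below samples one counter around a single write. The function names are hypothetical, and the helpers return -1 when the file or counter is unavailable instead of assuming a particular kernel configuration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one named counter from /proc/self/io, or -1 if unavailable. */
long long read_io_counter(const char *name)
{
    FILE *fp = fopen("/proc/self/io", "r");
    char key[32];
    long long value;
    long long found = -1;

    if (!fp)
        return -1;
    /* Each line has the form "key: value". */
    while (fscanf(fp, " %31[^:]: %lld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            found = value;
            break;
        }
    }
    fclose(fp);
    return found;
}

/* Growth of a counter across one 10-byte write to /dev/null, or -1. */
long long counter_growth_across_write(const char *name)
{
    int fd = open("/dev/null", O_WRONLY);
    long long before, after;
    ssize_t n;

    if (fd < 0)
        return -1;
    before = read_io_counter(name);
    n = write(fd, "0123456789", 10);
    after = read_io_counter(name);
    close(fd);

    if (before < 0 || after < 0 || n != 10)
        return -1;
    return after - before;
}
```

Where the counters are available, `counter_growth_across_write("syscw")` would be expected to report at least one call and `counter_growth_across_write("wchar")` at least ten bytes.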
This concludes write(2)
write(2)
Writing is quite similar to reading, but a bit more complex
Linux Security Modules (LSM) provides a flexible way to enforce sets of security policies at the kernel level
Minimizing memory footprint in the kernel is critical, which justifies hlist: its list head holds one pointer instead of two
hlist
Kernel internal use of system call functionality is still evolving