Gain a deeper understanding of file descriptors by comparing read and write
Userspace and kernel entry points
Contrast with read(2)
A look at security hooks
Superblocks and filesystem snapshotting
SYSCALL_DEFINE3(write,...)
All it does is call ksys_write()
Only one other caller in s390 compat code
Originally there were more callers
While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct file objects
ksys_write() removed from init/initramfs.c
ksys_write() removed from init/do_mounts_rd.c
ksys_lseek() is restricted to static linkage
kernel_write()
Verify the write operation
Acquire a filesystem resource
Perform the underlying operation
Release the filesystem resource
Almost a simplified vfs_write()
ksys_write()
Obtain a reference to the file position or bail
Create a local copy of the file position
Perform virtual filesystem (VFS) write
If needed, update the file position
Drop any held references
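The position update in these steps can be observed from userspace. Below is a minimal sketch, assuming a POSIX system where tmpfile() can create a temporary file. The helper name offset_after_write() is hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to a fresh temporary file and
 * return the file position reported afterwards, or -1 on error.
 */
static off_t offset_after_write(const char *buf, size_t len)
{
	FILE *tmp = tmpfile();	/* anonymous temporary file, removed on close */
	off_t pos = -1;
	int fd;

	if (!tmp)
		return -1;
	fd = fileno(tmp);

	/* write(2) enters the kernel, which advances the file position on success */
	if (write(fd, buf, len) == (ssize_t)len)
		pos = lseek(fd, 0, SEEK_CUR);

	fclose(tmp);
	return pos;
}
```

Calling offset_after_write("hello", 5) on a new file should report a position of 5, since the position starts at 0 and moves forward by the number of bytes written.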
ksys_write()
How does the function differ from ksys_read()?
vfs_write() instead of vfs_read()
const char __user *buf instead of char __user *buf
DRY: "Don't Repeat Yourself"
See the slides on read
We will skip right to vfs_write()
vfs_write()
Verify and validate the operation
Acquire filesystem resources
Perform the write operation
Account for the operation
Release filesystem resources
vfs_write()
Make sure the file is open for writing (FMODE_WRITE)
Make sure writing makes sense (FMODE_CAN_WRITE)
Make sure buf is a userspace address range
rw_verify_area()
Disallow count values with top bit set
Sanity check the file position
Verify write access
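The "top bit set" rule amounts to rejecting any count that turns negative when reinterpreted as a signed size. The sketch below is a simplified userspace model of that single check, not the kernel source. The name model_verify_count() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Simplified model (an assumption, not kernel code): a count whose top
 * bit is set becomes negative as ssize_t, so it could never be returned
 * as a valid byte count and is rejected up front.
 */
static int model_verify_count(size_t count)
{
	if ((ssize_t)count < 0)
		return -EINVAL;
	return 0;
}
```

The check matters because write(2) reports its result as a signed value, where negative numbers are reserved for errors.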
security_file_permission()
Use MAY_WRITE as our mask
Call an arbitrary number of file_permission security hooks
call_int_hook()
__label__ to declare a local label
RC = LSM_RET_DEFAULT(NAME): initial return code if all hooks return 0
file_permission_default defined?
Call each hook and stop if one fails
Statement expression evaluates to return code
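The combination of a statement expression and a local label can be sketched in userspace. This is a simplified model that loops over an array of function pointers, whereas the kernel macro unrolls its calls. It requires GNU C (gcc or clang). All names here (hook_fn, CALL_HOOKS, run_hooks, allow, deny) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*hook_fn)(int mask);	/* hypothetical hook signature */

static int hook_calls;			/* counts how many hooks actually ran */

static int allow(int mask) { (void)mask; hook_calls++; return 0; }
static int deny(int mask)  { (void)mask; hook_calls++; return -EACCES; }

/*
 * Statement expression: the ({ ... }) block evaluates to its last
 * expression, here rc. __label__ makes "out" local to each expansion,
 * so the macro can be used more than once in the same function.
 */
#define CALL_HOOKS(hooks, n, mask)			\
({							\
	__label__ out;					\
	int rc = 0;					\
	for (size_t i = 0; i < (n); i++) {		\
		rc = (hooks)[i](mask);			\
		if (rc != 0)				\
			goto out;			\
	}						\
out:							\
	rc;						\
})

static int run_hooks(const hook_fn *hooks, size_t n, int mask)
{
	return CALL_HOOKS(hooks, n, mask);
}
```

With hooks { allow, deny, allow }, run_hooks() returns -EACCES after two calls and the third hook never runs.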
LSM_LOOP_UNROLL()
Recursively defined macro
Changed from hlist iteration in Summer 2024 by 417c5643cd67a
Macro counting done for MAX_LSM_COUNT
union security_list_options
Define a macro in a particular way
Resolve many instances of this macro
LSM_HOOK(..., file_permission, ...)
Undefine the macro to allow later re-use
xmacros_example
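The define, expand, undefine cycle above is the X-macro technique. A minimal sketch follows, assuming a made-up hook list held in a list macro rather than in a separately included header. DEMO_HOOKS, demo_hook_id and demo_hook_names are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical hook list: one X(...) entry per hook */
#define DEMO_HOOKS(X)			\
	X(int, 0, file_permission)	\
	X(int, 0, file_open)		\
	X(int, 0, inode_permission)

/* First expansion: an enum with one entry per hook */
#define X(ret, def, name) DEMO_##name,
enum demo_hook_id { DEMO_HOOKS(X) DEMO_HOOK_COUNT };
#undef X	/* undefine so X can be redefined below */

/* Second expansion: a table of hook names in the same order */
#define X(ret, def, name) #name,
static const char *const demo_hook_names[] = { DEMO_HOOKS(X) };
#undef X
```

Because both expansions come from one list, adding a hook in one place keeps the enum and the name table in sync.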
file_permission hooks
selinux_file_permission()
Security Enhanced Linux: Fine-grained mandatory access control (MAC)
Associated with file_permission hook here
Registered with security subsystem by security_add_hooks()
Quick demo: ls -lZ
file_permission hooks
apparmor_file_permission()
AppArmor: Per-program security profiles
Associated with file_permission hook here
Registered with security subsystem by security_add_hooks()
vfs_write()
One last check:
count >= MAX_RW_COUNT
Ensures maximum value is rounded down to page boundary
Exactly the same as read
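The clamp can be worked through with concrete numbers. This sketch assumes 4 KiB pages and models the limit as INT_MAX rounded down to a page boundary. The DEMO_ names and clamp_count() are hypothetical, and the page size is an assumption that differs between architectures.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE		4096UL	/* assumption: 4 KiB pages */
#define DEMO_PAGE_MASK		(~(DEMO_PAGE_SIZE - 1))
/* INT_MAX with the low page-offset bits cleared: 0x7ffff000 here */
#define DEMO_MAX_RW_COUNT	((size_t)INT_MAX & DEMO_PAGE_MASK)

/* Clamp a requested byte count to the page-aligned maximum */
static size_t clamp_count(size_t count)
{
	return count > DEMO_MAX_RW_COUNT ? DEMO_MAX_RW_COUNT : count;
}
```

Comparing with > or >= gives the same result, since a count equal to the limit is left at the limit either way. A caller asking for more simply gets a short write and can retry with the remainder.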
file_start_write()
Check whether this is a regular file
A regular file is 0 or more bytes on disk
What are some examples of files that are NOT regular?
Not regular: character devices, directories, links
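The same regular-file test is available in userspace through stat(2) and the S_ISREG() macro. A minimal sketch follows. The helper name is_regular_file() is hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>

/* Return 1 if path names a regular file, 0 if not, -1 if stat() fails. */
static int is_regular_file(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return S_ISREG(st.st_mode) ? 1 : 0;
}
```

Note that stat() follows symbolic links, so a link to a regular file is reported as regular. Use lstat() to inspect the link itself.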
sb_start_write()
Calls __sb_start_write()
Acquire superblock write access
Each filesystem has one superblock
Contains meta-information about filesystem
Only relevant for regular files
SB_FREEZE_WRITE and struct super_block
Freezing enables snapshot fs backups
Select from an array of percpu reader-writer locks
Read is CPU local, write is cross-core
Freezing a filesystem
free.c and make_loop.sh
vfs_write()
Now we can actually write!
f_op->write() calls into the filesystem or module
Like read, fall back to f_op->write_iter
We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
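The dispatch order can be modeled with plain function pointers. This is a simplified userspace sketch, not the kernel's struct file_operations. struct demo_file_ops, demo_dispatch() and the other demo_ names are hypothetical, and the real write_iter takes different arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the two write entry points */
struct demo_file_ops {
	ssize_t (*write)(const char *buf, size_t count);
	ssize_t (*write_iter)(const char *buf, size_t count);
};

static int demo_path;	/* records which entry point ran: 1 = write, 2 = write_iter */

static ssize_t demo_write(const char *buf, size_t count)
{
	(void)buf;
	demo_path = 1;
	return (ssize_t)count;
}

static ssize_t demo_write_iter(const char *buf, size_t count)
{
	(void)buf;
	demo_path = 2;
	return (ssize_t)count;
}

/* Prefer write, fall back to write_iter, otherwise fail with -EINVAL */
static ssize_t demo_dispatch(const struct demo_file_ops *ops,
			     const char *buf, size_t count)
{
	if (ops->write)
		return ops->write(buf, count);
	if (ops->write_iter)
		return ops->write_iter(buf, count);
	return -EINVAL;
}
```

In this model the -EINVAL branch is reached only when both pointers are NULL, which is the situation an earlier capability check is meant to rule out.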
vfs_write()
When we write some bytes:
Notify of file modification
Account for bytes written by this task
vfs_write()
Unconditionally:
Account for write syscall count by this task
Release any filesystem resources acquired earlier
Return bytes written or errno to userspace
This concludes write(2)
Writing is quite similar to reading, but a bit more complex
Linux Security Modules (LSM) provides a flexible way to enforce sets of security policies at the kernel level
Memory footprint minimization in the kernel is critical, and this justifies hlist, whose head stores one pointer instead of two
Kernel internal use of system call functionality is still evolving