Gain a broad overview of many aspects of the kernel by understanding what's necessary to close a file descriptor
Peel back the layers of close(2)
close(2)
Removing entries of the FDT
Scheduling work to be done later
Execution context design considerations
Several more concurrency techniques
Execution-context-sensitive code
Invalidate int fd index in FDT
int fd
Close the struct file * if needed
struct file *
Verify with strace that close(3) indeed calls close(2)
strace
close(3)
SYSCALL_DEFINE1(close)
Cannot restart the syscall since the struct file is gone
struct file
If the file fails to be closed, the data may be hosed
close_fd()
Use int fd argument to index into FDT
Obtain underlying struct file
What benefit is there to using spin_lock here?
spin_lock
file_close_fd_locked()
Index into FDT properly
Do bounds checking on input value
Use array_index_nospec() macro for security
Use RCU to safely NULLify FDT entry without locks
NULL
Concurrent readers of the FDT will see a value that makes sense
array_index_mask_nospec()
Create bitmask based on index
All 1s if within bounds, else 0
1
0
Bitwise AND index to zero if out-of-bounds
Speculative indexing into the array always within bounds
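The masking trick above can be tried out in userspace. This is a minimal sketch modeled on the generic (non-architecture-specific) form of the technique; the names index_mask_nospec() and clamp_index() are hypothetical, and the code assumes an arithmetic right shift on signed long (as GCC and Clang provide) and that index and size both fit in a signed long.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * Hypothetical userspace stand-in for the kernel helper.
 * Returns all 1s when index < size and 0 otherwise, without a branch.
 * Assumes arithmetic right shift of negative signed values.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    /* The top bit of (index | (size - 1 - index)) is clear only in bounds. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* AND the index with the mask: out-of-bounds values collapse to 0. */
static size_t clamp_index(size_t index, size_t size)
{
    return index & index_mask_nospec(index, size);
}
```

Because the clamp is data-dependent arithmetic rather than a conditional branch, a mispredicted bounds check cannot steer a speculative load outside the array.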
__put_unused_fd()
put_unused_fd()
files->file_lock
__clear_open_fd()
Update bitmaps holding open file info
Full- and low-resolution maps used
BITS_PER_LONG-sized ranges checked if all fds in use
Smallest available fd stored for next open(2)
fd
open(2)
This free may require updating smallest fd
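The bookkeeping above can be sketched as a toy model. This is not the kernel's code: struct fd_bitmaps, alloc_fd_sketch(), release_fd_sketch(), and the four-word table size are all hypothetical names and values chosen for illustration. It only shows how a per-fd bitmap, a per-word "full" bitmap, and a cached lowest candidate fd fit together.

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define FD_WORDS 4u                        /* hypothetical tiny table */
#define MAX_FDS (FD_WORDS * BITS_PER_LONG)

struct fd_bitmaps {
    unsigned long open_fds[FD_WORDS]; /* full resolution: one bit per fd */
    unsigned long full_fds_bits;      /* low resolution: one bit per word */
    unsigned int next_fd;             /* smallest fd that might be free */
};

/* Find and claim the lowest free fd, or return -1 if the table is full. */
static int alloc_fd_sketch(struct fd_bitmaps *m)
{
    unsigned int fd = m->next_fd;

    while (fd < MAX_FDS) {
        unsigned int word = fd / BITS_PER_LONG;

        if (m->full_fds_bits & (1UL << word)) {
            fd = (word + 1) * BITS_PER_LONG; /* skip a fully used range */
            continue;
        }
        if (!(m->open_fds[word] & (1UL << (fd % BITS_PER_LONG))))
            break;
        fd++;
    }
    if (fd >= MAX_FDS)
        return -1;

    m->open_fds[fd / BITS_PER_LONG] |= 1UL << (fd % BITS_PER_LONG);
    if (m->open_fds[fd / BITS_PER_LONG] == ~0UL)
        m->full_fds_bits |= 1UL << (fd / BITS_PER_LONG);
    m->next_fd = fd + 1;
    return (int)fd;
}

/* Clear both bitmaps for fd and lower the cached candidate if needed. */
static void release_fd_sketch(struct fd_bitmaps *m, unsigned int fd)
{
    unsigned int word = fd / BITS_PER_LONG;

    m->open_fds[word] &= ~(1UL << (fd % BITS_PER_LONG));
    m->full_fds_bits &= ~(1UL << word);
    if (fd < m->next_fd)
        m->next_fd = fd;
}
```

Releasing fd 2 while fds 0 through 4 are open lowers next_fd to 2, so the next allocation reuses 2 rather than handing out 5.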
Return the struct file associated with the open fd
Return NULL if fd not open
No file? Then -EBADF
-EBADF
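From userspace, the -EBADF path is easy to observe: the libc wrapper returns -1 and sets errno to EBADF. A minimal sketch, where close_and_report() is a hypothetical helper name:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, or the errno value reported for close(2). */
static int close_and_report(int fd)
{
    return close(fd) == 0 ? 0 : errno;
}
```

Closing the same descriptor twice in a single-threaded program should yield 0 the first time and EBADF the second, since the FDT entry is already gone.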
Lastly, return whatever filp_close() returns
filp_close()
filp_open()
Sanity check reference count
Never should be 0
Use CHECK_DATA_CORRUPTION() macro which may call BUG() on kernels configured to do so
CHECK_DATA_CORRUPTION()
BUG()
ASM_BUG_FLAGS() generates assembly from preprocessor macros
ASM_BUG_FLAGS()
Why use high numbers in assembly labels?
If implemented, call the ->flush() file operation
->flush()
Flush performs pre-closure cleanup
Example: writing buffered data to storage medium
Can open(2) a file with O_PATH
O_PATH
Lightweight reference to filesystem path entry
No I/O
Example usage: permission checks, change of ownership
For files with I/O context
Flush directory notifications using the dnotify system
dnotify
Remove POSIX locks associated with this file
First Linux filesystem event notification system
Added in 2001 in Linux 2.4.0
Monitor CRUD changes in directory
Notified via SIGIO usually
SIGIO
Only directory granularity
Signal handling can be tricky
Need open fd
Not much info about events
No longer used
Kept for legacy reasons
Replacement: inotify
inotify
fcntl(2)
Example POSIX locks program
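A possible shape for such a program, as a sketch rather than the presenter's original: it takes a whole-file write lock with fcntl(2) and lets a forked child probe it with F_GETLK. The helper names lock_whole_file() and child_sees_lock() are hypothetical. It illustrates why every close must remove POSIX locks: they belong to the process, so closing any descriptor for the file drops them.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a non-blocking write lock on the whole file; 0 on success. */
static int lock_whole_file(int fd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0, /* 0 means "through end of file" */
    };
    return fcntl(fd, F_SETLK, &fl);
}

/*
 * Fork a child that asks, via F_GETLK, whether it could lock the file.
 * Returns 1 if the child sees a conflicting lock, 0 if not, -1 on error.
 */
static int child_sees_lock(const char *path)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        /* F_GETLK rewrites l_type to F_UNLCK when nothing conflicts. */
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

If the parent locks the file through one descriptor, then opens and closes a second descriptor for the same file, the child should stop seeing the lock even though the first descriptor is still open.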
Call fput() to finish the job
fput()
No error code from fput()
Return value nonzero only when flush fails
flush
Decrement the file's reference count (file->f_count)
file->f_count
Use atomic_long_dec_and_test()
No other action taken when result is nonzero
If count reaches zero, instigate the real work
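The decrement-and-test pattern can be sketched in userspace with C11 atomics. This is an analogy, not the kernel's implementation: struct file_sketch and both helpers are hypothetical, and atomic_fetch_sub() stands in for atomic_long_dec_and_test().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference-counted part of struct file. */
struct file_sketch {
    atomic_long f_count;
    bool released;
};

static void get_file_sketch(struct file_sketch *f)
{
    atomic_fetch_add(&f->f_count, 1);
}

/* Returns true only for the caller that drops the last reference. */
static bool put_file_sketch(struct file_sketch *f)
{
    /* atomic_fetch_sub() returns the old value: 1 means we reached zero. */
    if (atomic_fetch_sub(&f->f_count, 1) != 1)
        return false;      /* others still hold references: nothing to do */

    f->released = true;    /* the real teardown would be scheduled here */
    return true;
}
```

Because the decrement and the zero test are a single atomic operation, exactly one caller observes the transition to zero, no matter how many drop references concurrently.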
Why rush? Schedule a future callback
First method: only for process context
Second method: for any context
in_interrupt()
A deprecated macro
Transitively defined by irq_count()
irq_count()
Bitwise OR three shifted values
NMI, softirq, and hardirq counts
Nonzero when any count is nonzero
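A userspace sketch of the bit arithmetic follows. The field widths (8 preempt bits, 8 softirq bits, 4 hardirq bits, 4 NMI bits) are an assumption about the layout and have varied between kernel versions; the _sketch function names are hypothetical.

```c
/* Assumed field widths; check include/linux/preempt.h for a given kernel. */
#define PREEMPT_BITS 8
#define SOFTIRQ_BITS 8
#define HARDIRQ_BITS 4
#define NMI_BITS     4

#define SOFTIRQ_SHIFT PREEMPT_BITS
#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT     (HARDIRQ_SHIFT + HARDIRQ_BITS)

#define FIELD_MASK(bits, shift) (((1UL << (bits)) - 1) << (shift))
#define SOFTIRQ_MASK FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)
#define NMI_MASK     FIELD_MASK(NMI_BITS, NMI_SHIFT)

/* OR together the NMI, hardirq, and softirq fields of a preempt count. */
static unsigned long irq_count_sketch(unsigned long preempt_count)
{
    return (preempt_count & NMI_MASK) |
           (preempt_count & HARDIRQ_MASK) |
           (preempt_count & SOFTIRQ_MASK);
}

/* Nonzero when any of the three interrupt-related counts is nonzero. */
static int in_interrupt_sketch(unsigned long preempt_count)
{
    return irq_count_sketch(preempt_count) != 0;
}
```

A count that only records preemption-disable nesting (the low bits) leaves all three fields at zero, so it still reads as process context.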
preempt_count()
Architecture-specific data source
Value stored in current->thread_info
current->thread_info
Can directly cast current since struct thread_info is first member
current
struct thread_info
READ_ONCE() prevents racy compiler re-ordering
READ_ONCE()
Process context without userspace
Can sleep, be preempted
Can call most kernel functions
No userspace memory to access
likely()
Generates branch prediction hints
Not on all CPUs
unlikely() does the inverse
unlikely()
Faster true case
Slower false case
Helpful only when very likely true
Otherwise considered harmful
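With GCC or Clang, both hints are commonly built on __builtin_expect(). A minimal sketch: parse_digit() is a hypothetical example function, and the macro definitions mirror the widely used form rather than quoting any particular kernel version.

```c
/* Branch hints built on the GCC/Clang builtin; !! normalizes to 0 or 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: the error path is marked as the cold branch. */
static int parse_digit(int c)
{
    if (unlikely(c < '0' || c > '9'))
        return -1;      /* rare case, laid out off the hot path */
    return c - '0';     /* common case falls straight through */
}
```

The hints change only code layout and prediction defaults, never the result, which is why a wrong guess costs speed rather than correctness.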
Schedule callback to run on current's behalf
init_task_work() wraps callback struct member assignment
init_task_work()
task_work_add() schedules the work
task_work_add()
____fput()
If this fails, just fallback to the other method
TWA_SIGNAL interrupts target task
TWA_SIGNAL
TWA_SIGNAL_NO_IPI is more chill
TWA_SIGNAL_NO_IPI
TWA_RESUME is the most relaxed
TWA_RESUME
Global delayed work queue
Create a list of files to pass to callback
Run them all in a jiffy (next timer tick)
Use schedule_delayed_work() to access global queue
schedule_delayed_work()
Uses structure defined with DECLARE_DELAYED_WORK()
DECLARE_DELAYED_WORK()
Do work after delay timer ticks pass
delay
Avoid any extra scheduling
Conditionally call schedule_delayed_work()
Only on first list append
Resulting work will empty this queue
llist: lockless linked list implementation optimized for concurrent access
llist
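The "only on first list append" rule falls out naturally from a lock-free push that reports whether the list was empty. Here is a userspace sketch using C11 atomics; struct lnode, struct lhead, lpush(), and ltake_all() are hypothetical names, not the kernel's llist API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct lnode {
    struct lnode *next;
};

struct lhead {
    _Atomic(struct lnode *) first;
};

/*
 * Push a node with a compare-and-swap loop.
 * Returns true if the list was empty beforehand, which is the caller's
 * cue to schedule the work that will drain the list.
 */
static bool lpush(struct lhead *h, struct lnode *n)
{
    struct lnode *old = atomic_load(&h->first);

    do {
        n->next = old; /* on CAS failure, old is refreshed and we retry */
    } while (!atomic_compare_exchange_weak(&h->first, &old, n));

    return old == NULL;
}

/* Detach the whole list in one atomic exchange, leaving it empty. */
static struct lnode *ltake_all(struct lhead *h)
{
    return atomic_exchange(&h->first, (struct lnode *)NULL);
}
```

Producers never take a lock, and the consumer claims every queued node at once, so a single scheduled callback is enough for any number of appends made before it runs.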
Two possible callers of __fput()
__fput()
____fput() when using task work
delayed_fput when using global delayed work
delayed_fput
Uses the container_of() macro
container_of()
Use struct member offset subtraction
Pass containing struct file to __fput()
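The offset subtraction can be written in portable C with offsetof(). In this sketch the struct layouts and the names callback_head_sketch, file_sketch, and fput_callback_sketch() are hypothetical; the kernel's own macro adds type checking that is omitted here.

```c
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical callback handle embedded in a larger object. */
struct callback_head_sketch {
    void (*func)(struct callback_head_sketch *);
};

struct file_sketch {
    long f_count;
    struct callback_head_sketch f_task_work; /* embedded, not pointed to */
    int closed;
};

/* The callback receives only the embedded member, yet finds its file. */
static void fput_callback_sketch(struct callback_head_sketch *work)
{
    struct file_sketch *f =
        container_of(work, struct file_sketch, f_task_work);

    f->closed = 1;
}
```

Embedding the handle means no extra allocation and no back-pointer: the address of the member plus a compile-time offset is all the callback needs.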
delayed_fput()
Detach list of files from caller handle
Use special llist iterator
Pass each file to __fput()
What do we need to do?
Clean up file-associated resources
Drop references held by file
Free allocated memory
Is the file really open?
Check FMODE_OPENED flag in file->f_mode
FMODE_OPENED
file->f_mode
Set by do_dentry_open() in open(2) path
do_dentry_open()
Without this flag, skip to memory freeing
A debugging helper: might_sleep()
might_sleep()
Spread the news of this closure
fsnotify provides fs event info to other kernel systems
fsnotify
e.g. inotify consumes this data
Call eventpoll_release() to clean up all resources associated with event polling on this struct file
eventpoll_release()
Safe to release the file's locks: locks_remove_file()
locks_remove_file()
Integrity Measurement Architecture (IMA)
Detects tampering with file contents
Allocates resources for each file
Cleanup with ima_file_free()
ima_file_free()
Handle pending asynchronous operations
Only if file has FASYNC flag set
FASYNC
Call fasync() handler defined by underlying file implementation
fasync()
Call any extant release() file operation
release()
Release reference to a character device and file operations
Only if file is backed by one
Dropping the fops reference also drops the reference to any module that implements them
Drop reference to pid of file owner
pid
Contained in struct pid in struct fown_struct
struct pid
struct fown_struct
Which is a member of struct file
Use put_file_access() to perform access mode specific tasks to clean up access to the file
put_file_access()
Drop a reference to the dentry for this file with dput()
dentry
dput()
Some file modes will require an unmount at this point
Handled by dissolve_on_fput()
dissolve_on_fput()
May cover later material on namespaces
mntput() drops the reference to the struct file's struct vfsmount
mntput()
struct vfsmount
Finish the job with file_free()
file_free()
Notify Linux Security Modules (LSM) framework users to clean up with security_file_free()
security_file_free()
Decrement open file counter
Directly decrement local percpu counter
Global total periodically calculated
Drop reference to file's struct cred
struct cred
If the file is a backing store for a device or file, drop reference to associated struct path
struct path
Example: a loopback device
Last step before freeing memory
Free the last structure's memory
Backing files free their backing file structure
Otherwise, return the struct file to its kmem_cache()
kmem_cache()
Back to whence it came
We return to userspace, concluding the close(2) implementation
The close(2) system call contains plenty of complexity and many layers
Many different types of in-kernel resources may be associated with a file
The kernel employs creative lock-avoidant techniques to implement correct concurrency
Correct reference counting is essential
The codepath can invoke several file operations, including release(), flush(), and fasync()
flush()