VFS: read syscall


Learning objective

Deepen our understanding of file descriptors by tracing how read(2) uses them


Overview

  1. read(2) entry

  2. Advanced reference count optimization

  3. Reading through the virtual filesystem


Entry point

SYSCALL_DEFINE3(read, ...)

  1. Just calls ksys_read()

  2. Only one other caller in s390 compat code

  3. Originally there were more callers


Callable from userspace and the kernel

ksys_read()

  1. Obtain a reference to the file (and its position lock, if needed) or bail with -EBADF

  2. Create a local copy of the file position

  3. Perform virtual filesystem (vfs) read

  4. If needed, update the file position

  5. Drop any held references


Optimizing the references

fdget_pos()

What is this struct fd and why might we want something more than just the struct file?


Optimizing the references

fd_file()

  1. struct fd wraps a single unsigned long (word)

  2. fd_file() masks off the two low flag bits to recover the struct file pointer

#define fd_file(f) ((struct file *)((f).word & ~(FDPUT_FPUT|FDPUT_POS_UNLOCK)))


Optimizing the references

fdget_pos()

  1. First, get the file: do we need our own reference?

  2. Then, do we need the file position lock?


Optimizing the references

fdget()

Get a reference to the file behind a file descriptor, unless it was opened in path mode (O_PATH)


Get what's needed

__fget_light()

  1. If the file descriptor table's refcount (files->count) is 1, we can borrow its reference to the file

  2. Otherwise, we need our own reference

    1. And we will need to drop it later


Get what's needed

__fget_light()

  1. Use atomic_read_acquire() to get the current reference count

  2. Call files_lookup_fd_raw() directly

  3. The unsigned long return value will be cast


Layers surrounding increment

  1. __fget_files()

  2. __fget_files_rcu()


Get what's needed

__fget_light()

If we cannot borrow, set FDPUT_FPUT in the low bits of the pointer so we remember to drop our reference


Optimizing the references

fdget_pos()

Question: what is fd_file() doing?


Check if we need the fpos lock

file_needs_f_pos_lock()

When do we need the file position lock?


It is standardized

Any regular file or directory has FMODE_ATOMIC_POS set

  1. in do_dentry_open()

  2. POSIX.1-2017 2.9.7

In addition, we only take the lock if file_count() is greater than 1 or the file has a shared iterator


Optimizing the references

fdget_pos()

To finish up, lock and set another bit if needed


Back where we came from

ksys_read()

First check whether we actually got a file, using fd_empty()

  1. Recent patchset by maintainer

  2. Introduced fd_empty() and fd_file()

  3. Previously the code checked f.file directly


No position in a stream

file_ppos()

Returns NULL for a stream (FMODE_STREAM); otherwise, this just gets the address of the file position


The meat of the read

vfs_read()

Overview:

  1. Validate the operation and its inputs

  2. Execute the specific read handler

  3. Notify of completion


The meat of the read

vfs_read()

First three checks

  1. Make sure the file is open for reading (FMODE_READ), else -EBADF

  2. Make sure that the file can be read (FMODE_CAN_READ), else -EINVAL

  3. Make sure the output buffer is a sane address (access_ok()), else -EFAULT


Check the area to read from

rw_verify_area()

  1. Sanity check the file position

    1. Signed offsets may wrap or exceed bounds

  2. Verify read access

    1. security_file_permission()


The meat of the read

vfs_read()

Clamp count if it is too big

  1. If count > MAX_RW_COUNT, set count to MAX_RW_COUNT

  2. MAX_RW_COUNT is INT_MAX rounded down to a page boundary


The meat of the read

vfs_read()

Call the actual read!

  1. If the file operations provide read(), call it

  2. Otherwise, call read_iter() if it is provided, else fail with -EINVAL


The meat of the read

vfs_read()

If we read any bytes (ret > 0):

  1. Tell fsnotify to let others know of this access

  2. Account for the task's bytes read (add_rchar())

Unconditionally:

  1. Account for the task's read system call (inc_syscr())

See struct task_io_accounting


Back where we came from

ksys_read()

Last steps to wrap up

  1. Update the file position if relevant

  2. Drop any references we may have

  3. Return the number of bytes read or an error


Drop any references we may have

fdput_pos()

Question: When does control flow call this function?


Drop any references we may have

fdput_pos()

  1. If we locked the file position (FDPUT_POS_UNLOCK): __f_unlock_pos()

  2. If we took our own file reference (FDPUT_FPUT): fdput() calls fput()


Summary

read() doesn't need to do as much as open() or write()


Summary

Small optimizations on file descriptor operations add up to significant performance improvements


Summary

Watch out for data storage in unexpected places like the lower bits of a pointer!


End

