VFS: ioctl and lseek syscalls


Learning objective

Revisit familiar patterns and round out our understanding of file descriptors and the VFS


Overview

  1. ioctl(2)

    1. Background and history

    2. Entry point and codepath

    3. IOCTLs common to all file descriptors

  2. lseek(2)

    1. History and offset extension

    2. Entry point and codepath


IOCTL

  1. Commonly pronounced "eye-ock-toll"

  2. Abbreviation: Input/Output Control

  3. General purpose interface


Origins

  1. Introduced in Unix version 7

    1. Released in 1979
  2. Operations beyond read/write

  3. Became standard device-specific communication method

  4. Replaced the (now unimplemented) stty and gtty syscalls


Standardization

  1. Included in POSIX.1-2001

  2. Widely used in Linux and friends

  3. Compare to DeviceIoControl() in Win32


An unusual interface

int ioctl(int fildes, int request, ... /* arg */);

  1. Variable number of arguments!

  2. From current standard

  3. "For non-STREAMS devices, the functions performed by this call are unspecified"

    1. STREAMS is an obsolete character device protocol
  4. "The ioctl() function may be removed in a future version."


In Linux

  1. man 2 ioctl

  2. in glibc

int
__ioctl (int fd, unsigned long int request, ...)

  1. Relies on crazy macros

  2. Notice that any arguments after the first optional arg are ignored
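
To make the variadic behavior concrete, here is a minimal sketch of a wrapper that accepts a variable argument list but only ever reads one pointer-sized value from it. The names my_ioctl() and pipe_bytes_pending() are hypothetical, and this is not glibc's actual implementation, only an illustration of the same idea using syscall(2).

```c
#define _GNU_SOURCE
#include <stdarg.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: variadic like ioctl(2), but it reads exactly one
 * pointer-sized argument. Callers should pass that argument as a void *. */
int my_ioctl(int fd, unsigned long request, ...)
{
	va_list ap;
	void *arg;

	va_start(ap, request);
	arg = va_arg(ap, void *);	/* anything after this is never read */
	va_end(ap);

	return (int)syscall(SYS_ioctl, fd, request, arg);
}

/* Write len bytes into a fresh pipe, then ask FIONREAD how many are unread.
 * Returns the count, or -1 on any error. */
int pipe_bytes_pending(const char *msg, size_t len)
{
	int fds[2];
	int pending = -1;

	if (pipe(fds) < 0)
		return -1;
	if (write(fds[1], msg, len) == (ssize_t)len &&
	    my_ioctl(fds[0], FIONREAD, (void *)&pending) < 0)
		pending = -1;
	close(fds[0]);
	close(fds[1]);
	return pending;
}
```

Calling pipe_bytes_pending("hello", 5) exercises the wrapper end to end: the single pointer argument is all the kernel ever sees.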


Entering the kernel

SYSCALL_DEFINE3(ioctl,...)

  1. unsigned long int from userspace implicitly converted to unsigned int

  2. unsigned long arg can be used to hold pointer

  3. No ksys_ioctl() here!

    1. Used to exist but was removed years ago

Overview

SYSCALL_DEFINE3(ioctl,...)

  1. Validate and take reference to file

  2. Check security modules to validate operation

  3. Perform underlying IOCTL

  4. Release the file reference


fdget() covered elsewhere

See the slides on read

  1. This check makes sure fd is valid

Security check

security_file_ioctl()

  1. Similar to the file_permission hook covered in the write slides

  2. Checks depend on cmd

  3. Example in selinux

  4. Not present in apparmor


First, the common

do_vfs_ioctl()

  1. Common to any file descriptor

  2. Not specific to any filesystem or device


First, the common

do_vfs_ioctl()

FIOCLEX and FIONCLEX: Set or clear the "close-on-exec" flag

  1. Can also do this with fcntl(2) and open(2) with the O_CLOEXEC flag

  2. The fd is closed if the current task succeeds at execve(2)
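
As a small userspace sketch of these two commands, the helpers below (names are hypothetical) toggle close-on-exec through ioctl(2) and read the flag back through fcntl(2), which shows that both interfaces act on the same per-descriptor flag.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) close-on-exec via ioctl.
 * Neither command takes an argument. */
int set_cloexec_ioctl(int fd, int on)
{
	return ioctl(fd, on ? FIOCLEX : FIONCLEX);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if clear, -1 on error. */
int cloexec_is_set(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}
```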


First, the common

do_vfs_ioctl()

FIONBIO: Uses ioctl_fionbio() to set or clear the nonblocking IO flag

  1. Note single cmd here
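
Unlike the close-on-exec pair, a single command covers both directions here: the int that the argument points to selects set or clear. A minimal sketch, with hypothetical helper names, that checks the result through fcntl(2):

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* One cmd for both directions: *arg != 0 sets O_NONBLOCK, 0 clears it. */
int set_nonblocking_ioctl(int fd, int on)
{
	return ioctl(fd, FIONBIO, &on);
}

/* Returns 1 if O_NONBLOCK is set on fd, 0 if clear, -1 on error. */
int nonblocking_is_set(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	return flags < 0 ? -1 : !!(flags & O_NONBLOCK);
}
```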

First, the common

do_vfs_ioctl()

FIOASYNC: Uses ioctl_fioasync() to enable or disable asynchronous IO notifications

  1. Note -ENOTTY means this IOCTL doesn't apply to this fd

  2. Makes sense: f_op->fasync() must be defined
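
A rough userspace sketch of the same behavior follows; the helper names are hypothetical. It assumes a pipe as the test fd, since pipes provide a fasync handler. On an fd whose file operations lack one, the first helper would be expected to return -ENOTTY.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Enable (on != 0) or disable (on == 0) async notification.
 * Returns 0 on success or -errno on failure. */
int set_async_ioctl(int fd, int on)
{
	return ioctl(fd, FIOASYNC, &on) < 0 ? -errno : 0;
}

/* Returns 1 if O_ASYNC is set in the file status flags, 0 if not, -1 on error. */
int async_is_set(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	return flags < 0 ? -1 : !!(flags & O_ASYNC);
}
```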


First, the common

do_vfs_ioctl()

FIOQSIZE: get file's size

  1. Works for directories and links, not just regular files

  2. A directory's size is the sum of all entries
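
A minimal sketch of calling FIOQSIZE from userspace, with a hypothetical helper name. The kernel writes a 64-bit value, so the buffer is a long long. The exact number is filesystem-dependent and may reflect allocated storage rather than logical length, so treat it as an approximation. File types other than regular files, directories, and links are expected to fail with ENOTTY.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the FIOQSIZE result for fd, or -errno on failure. */
long long qsize(int fd)
{
	long long size = 0;	/* matches the kernel's loff_t */

	if (ioctl(fd, FIOQSIZE, &size) < 0)
		return -errno;
	return size;
}
```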


First, the common

do_vfs_ioctl()

FIFREEZE and FITHAW: freeze or thaw a filesystem

  1. Useful for snapshotting and backups

  2. Interaction with write covered in the write slides

  3. Uses ioctl_fsfreeze() and ioctl_fsthaw()


First, the common

do_vfs_ioctl()

FS_IOC_FIEMAP: Get the physical layout of a file on disk

  1. Useful for optimization and defragmentation

  2. See ioctl_fiemap() for more info


First, the common

do_vfs_ioctl()

FIGETBSZ: get the block size of a filesystem

  1. Check the superblock of this inode

  2. Not always relevant

  3. A simple operation
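
Because the operation is so simple, the userspace side is equally short. A sketch with a hypothetical helper name; FIGETBSZ comes from <linux/fs.h> and the result is written into an int:

```c
#include <errno.h>
#include <linux/fs.h>	/* FIGETBSZ */
#include <sys/ioctl.h>

/* Returns the block size of the filesystem backing fd, or -errno. */
int fs_block_size(int fd)
{
	int bsz = 0;

	if (ioctl(fd, FIGETBSZ, &bsz) < 0)
		return -errno;
	return bsz;
}
```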


First, the common

do_vfs_ioctl()

FICLONE, FICLONERANGE, and FIDEDUPERANGE: Copy-on-write file cloning

  1. First can clone a whole file (ioctl_file_clone())

  2. Second can clone part of a file (ioctl_file_clone_range())

  3. Third can de-duplicate data across multiple files (ioctl_file_dedupe_range())
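
As an illustration of the first command, here is a hedged sketch of a copy routine that tries FICLONE and falls back to a plain read/write loop when the filesystem cannot share extents. The function name clone_or_copy() is hypothetical, and this is not the course's ioctl_copy.c, only one way such a tool might be structured.

```c
#include <fcntl.h>
#include <linux/fs.h>	/* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Copy src to dst. Returns 1 if the file was cloned, 0 if the fallback
 * read/write copy was used, and -1 on error. */
int clone_or_copy(const char *src, const char *dst)
{
	char buf[4096];
	ssize_t n;
	int cloned = 1, rc = -1;
	int in, out;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (out < 0) {
		close(in);
		return -1;
	}

	/* FICLONE takes the source fd by value and is issued on the destination. */
	if (ioctl(out, FICLONE, in) < 0) {
		cloned = 0;
		while ((n = read(in, buf, sizeof(buf))) > 0) {
			for (ssize_t off = 0; off < n; ) {
				ssize_t w = write(out, buf + off, (size_t)(n - off));
				if (w < 0)
					goto done;
				off += w;	/* handle short writes */
			}
		}
		if (n < 0)
			goto done;
	}
	rc = cloned;
done:
	close(in);
	close(out);
	return rc;
}
```

On filesystems without reflink support the ioctl is expected to fail and the function quietly degrades to an ordinary copy.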


demo

A simple cp implementation in ioctl_copy.c


First, the common

do_vfs_ioctl()

FIONREAD: How many bytes left to read in a file?

  1. This is one place IOCTL may call into a filesystem and/or module

  2. For regular file, this is simple subtraction
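
The subtraction for regular files (file size minus current offset) can be observed from userspace. A minimal sketch with a hypothetical helper name:

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Bytes between the current offset and end of file according to FIONREAD,
 * or -errno on failure. */
int bytes_remaining(int fd)
{
	int left = 0;

	if (ioctl(fd, FIONREAD, &left) < 0)
		return -errno;
	return left;
}
```

For example, after writing 10 bytes to a new file the offset sits at the end, so nothing is left to read; seeking back to offset 4 should leave 6.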


First, the common

do_vfs_ioctl()

FS_IOC_GETFLAGS and FS_IOC_SETFLAGS: Set and get file flags

  1. Different than those that can be set with open(2) or fcntl(2)

  2. Many are persistent beyond this fd

  3. E.g. FS_APPEND_FL makes a file append-only

  4. E.g. FS_IMMUTABLE_FL makes a file immutable

  5. Uses ioctl_getflags() and ioctl_setflags()


First, the common

do_vfs_ioctl()

FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR: Get and set extended filesystem-level attributes

  1. Multiple uses, including project quota IDs and extent size hints

  2. Stored separately from main file information

  3. Related to, but different than xattrs


Regular files

file_ioctl()

A couple of commands only relevant for regular files, including:

  1. Mapping logical to physical block numbers

  2. Allocate uninitialized space for a file

  3. Deallocate the physical space for a file

  4. Zero out a file range


Next, the specific implementation

vfs_ioctl()

  1. Call f_op->unlocked_ioctl() if it exists

  2. Unlocked == the big kernel lock (BKL) is not taken

  3. The BKL was removed long ago, so there is no locked variant left

This concludes ioctl(2)


LSEEK

  1. Short for "long seek"

  2. Change offset of an open file

  3. Implies a historical non-long seek


History of lseek

  1. In the beginning (~1970), there was seek()

  2. Used signed, 16-bit offset

  3. Very limited!

  4. 2^15 bytes per file


History of lseek

  1. lseek() was introduced to support larger files

  2. Now, the offset was a signed 32-bit integer

  3. Files could be an entire 2GB!

  4. POSIX standardized lseek() but not seek()

  5. Therefore, seek() found the dustbin of history


Current standard

"...off_t shall be [a] signed integer [type]" -- POSIX

  1. off_t => __kernel_off_t in <linux/types.h>

  2. __kernel_off_t => __kernel_long_t in <asm-generic/posix_types.h>

  3. Finally: __kernel_long_t => long in the same file, per POSIX


Longer offsets

  1. An loff_t, however, is a long long (64-bit)

  2. On 64-bit systems, the long type is 64 bits

  3. 2^63 bytes = 8,589,934,592 GiB (8 EiB)

  4. This should be enough for all humans
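
The limits quoted for 16-bit, 32-bit, and 64-bit offsets all follow from the same formula: a signed integer of width n tops out at 2^(n-1) - 1. A small sketch that works through the arithmetic, with hypothetical function names:

```c
#include <stdint.h>

/* Largest offset representable in a signed integer of the given width.
 * Valid for bits in the range 1..64. */
int64_t max_signed_offset(unsigned bits)
{
	return (int64_t)((UINT64_C(1) << (bits - 1)) - 1);
}

/* Whole gibibytes (2^30 bytes) that fit at or below that limit. */
int64_t max_offset_gib(unsigned bits)
{
	return max_signed_offset(bits) >> 30;
}
```

Because the largest offset is one byte short of the power of two, max_offset_gib(64) comes out one below 2^33.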


Back to the code

SYSCALL_DEFINE3(lseek,...)

  1. Another ksys_* instance

  2. Used by the 32-bit compatibility entry point too


A familiar pattern

ksys_lseek()

  1. Get a valid reference to the file descriptor or exit

  2. Make sure the whence is within range

    1. This value selects how the offset is interpreted (e.g. SEEK_SET, SEEK_CUR, SEEK_END)
  3. Perform the operation

  4. Check for overflow: the result is downcast to off_t and upcast again, and a mismatch means -EOVERFLOW

  5. Release the reference and return
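
The whence check and the downcast/upcast test can be modeled in plain userspace C. The sketch below is not kernel code: narrow_off_t, wide_off_t, MODEL_SEEK_MAX, and model_lseek_return() are hypothetical stand-ins, with a 32-bit type playing the role of off_t on a 32-bit platform and the 64-bit type playing the role of loff_t.

```c
#include <errno.h>
#include <stdint.h>

typedef int32_t narrow_off_t;	/* stand-in for a 32-bit off_t */
typedef int64_t wide_off_t;	/* stand-in for loff_t */

/* Assumption: mirrors the highest valid whence, SEEK_HOLE (4). */
#define MODEL_SEEK_MAX 4

/* res models the wide value handed back by the underlying seek.
 * Returns the narrowed result, or a negative errno value. */
long model_lseek_return(wide_off_t res, unsigned int whence)
{
	narrow_off_t retval;

	if (whence > MODEL_SEEK_MAX)
		return -EINVAL;

	/* Downcast: high bits may be dropped (implementation-defined in C,
	 * wraps on gcc and clang). */
	retval = (narrow_off_t)res;

	/* Upcast and compare: a mismatch means the value did not fit. */
	if (res != (wide_off_t)retval)
		return -EOVERFLOW;

	return retval;
}
```

Negative error values from the underlying seek fit in the narrow type, so they pass through the round trip unchanged.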


The long becomes longer

vfs_llseek()

  1. Bail if this is a pipe, socket, or FIFO

  2. ESPIPE is a specific error for seeking on a pipe

  3. If all goes well, call into the filesystem or module

  4. f_op->llseek: long long seek (64-bit)

    1. Fallback implementation default_llseek()
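
These outcomes are easy to observe from userspace. A short sketch with a hypothetical helper that reports either the new offset or the negated errno:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the new file offset, or -errno if the seek fails. */
long long try_seek(int fd, long long off, int whence)
{
	off_t r = lseek(fd, (off_t)off, whence);

	return r == (off_t)-1 ? -(long long)errno : (long long)r;
}
```

On a pipe this is expected to yield -ESPIPE, while on a regular file the usual SEEK_SET and SEEK_END arithmetic applies and an out-of-range whence yields -EINVAL.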

This concludes lseek(2)


Summary

Many system calls have a varied and interesting history that explains many of their quirks


Summary

ioctl(2) provides a versatile way to implement all sorts of interfaces to kernel modules


Summary

Though quite a simple syscall, understanding lseek(2) provides insight into Linux, Unix, and computer history.


Summary

After seeing six syscall implementations, many common patterns should become apparent


Summary

This code is being actively worked on upstream. Contribute!


End

