Revisit familiar patterns and round out our understanding of file descriptors and the VFS
ioctl(2)
Background and history
Entry point and codepath
IOCTLs common to all file descriptors
lseek(2)
History and offset extension
Commonly pronounced "eye-ock-toll"
Abbreviation: Input/Output Control
General purpose interface
Introduced in Unix version 7
Operations beyond read/write
Became standard device-specific communication method
Replaced the (now unimplemented) stty & gtty
Included in POSIX.1-2001
Widely used in Linux and friends
Compare to DeviceIoControl() in Win32
int ioctl(int fildes, int request, ... /* arg */);
Variable number of arguments!
From current standard
"For non-STREAMS devices, the functions performed by this call are unspecified"
"The ioctl() function may be removed in a future version."
man 2 ioctl
in glibc
int __ioctl (int fd, unsigned long int request, ...)
Relies on crazy macros
Notice that args after arg are ignored
SYSCALL_DEFINE3(ioctl,...)
unsigned long int from userspace implicitly converted to unsigned int
unsigned long arg can be used to hold pointer
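A minimal sketch of the narrowing described above, assuming an LP64 platform (64-bit unsigned long, 32-bit unsigned int). The helper name cmd_seen_by_kernel is hypothetical; it only models the conversion and makes no system call.

```c
/* Hypothetical helper: models how a 64-bit request value from userspace
 * narrows to the unsigned int cmd parameter of the syscall.
 * Assumes LP64; no system call is made. */
unsigned int cmd_seen_by_kernel(unsigned long request)
{
    return (unsigned int)request; /* only the low 32 bits survive */
}
```

Under that assumption, any bits set above bit 31 of the request never reach the handler.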
No ksys_ioctl() here!
Validate and take reference to file
Check security modules to validate operation
Perform underlying IOCTL
Release the file reference
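The first of these steps is easy to observe from userspace: a descriptor that does not refer to an open file fails before any request-specific code runs. A small sketch, where probe_ioctl is a hypothetical helper name and FIONREAD is used only as a harmless request:

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Issue a harmless FIONREAD request. Returns 0 on success or the negated
 * errno on failure, mirroring the kernel's return convention. */
int probe_ioctl(int fd)
{
    int pending = 0;

    if (ioctl(fd, FIONREAD, &pending) == -1)
        return -errno;
    return 0;
}
```

Calling probe_ioctl(-1) is expected to return -EBADF, since no file reference can be taken for that descriptor.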
fdget()
See the slides on read
fd
security_file_ioctl()
Similar to the file_permission hook covered in the write slides
Checks depend on cmd
Example in selinux
Not present in apparmor
do_vfs_ioctl()
Common to any file descriptor
Not specific to any filesystem or device
FIOCLEX and FIONCLEX: Set or clear the "close-on-exec" flag
Can also do this with fcntl(2) and open(2) with the O_CLOEXEC flag
Close fd if current succeeds at execve(2)
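As a sketch of the close-on-exec requests, the following sets or clears the flag with ioctl(2) and reads it back with fcntl(2). The helper names set_cloexec_ioctl and get_cloexec are made up for illustration.

```c
#include <fcntl.h>
#include <sys/ioctl.h>

/* Set (on != 0) or clear (on == 0) close-on-exec via ioctl(2).
 * FIOCLEX and FIONCLEX take no argument. Returns 0 or -1 like ioctl(). */
int set_cloexec_ioctl(int fd, int on)
{
    return ioctl(fd, on ? FIOCLEX : FIONCLEX);
}

/* Read the flag back through fcntl(2): 1 if set, 0 if clear, -1 on error. */
int get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Both interfaces touch the same per-descriptor flag, so a change made through one is visible through the other.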
FIONBIO: Uses ioctl_fionbio() to set or clear the nonblocking IO flag
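A short sketch of FIONBIO from the caller's side. The request takes a pointer to an int: nonzero sets the nonblocking flag, zero clears it. The helper names are hypothetical; the result is checked through fcntl(2) and O_NONBLOCK.

```c
#include <fcntl.h>
#include <sys/ioctl.h>

/* Set (on != 0) or clear (on == 0) nonblocking mode via FIONBIO.
 * Returns 0 or -1 like ioctl(). */
int set_nonblocking_ioctl(int fd, int on)
{
    return ioctl(fd, FIONBIO, &on);
}

/* 1 if O_NONBLOCK is set on the open file, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```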
FIOASYNC: Uses ioctl_fioasync() to enable or disable asynchronous IO notifications
Note -ENOTTY means this IOCTL doesn't apply to this fd
Makes sense: f_op->fasync() must be defined
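A hedged sketch of how the -ENOTTY case looks to a caller. The helper name set_async_ioctl is hypothetical, and which file types implement fasync() depends on the kernel and filesystem in use.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h> /* pipe(), close() for callers */

/* Enable (on != 0) or disable (on == 0) asynchronous IO notifications via
 * FIOASYNC. Returns 0 on success or the negated errno. -ENOTTY means the
 * request does not apply to this fd (no fasync() method behind it). */
int set_async_ioctl(int fd, int on)
{
    if (ioctl(fd, FIOASYNC, &on) == -1)
        return -errno;
    return 0;
}
```

On the read end of a pipe this is expected to return 0, because pipes implement fasync(). On a typical regular file it is expected to return -ENOTTY.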
FIOQSIZE: get file's size
Works for directories and links, not just regular files
A directory's size is the sum of all entries
FIFREEZE and FITHAW: freeze or thaw a filesystem
Useful for snapshotting and backups
Interaction with write is covered in the write slides
Uses ioctl_fsfreeze() and ioctl_fsthaw()
FS_IOC_FIEMAP: Get the physical layout of a file on disk
Useful for optimization and defragmentation
See ioctl_fiemap() for more info
FIGETBSZ: get the block size of a filesystem
Check the superblock of this inode
Not always relevant
A simple operation
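From userspace the operation is equally simple. A minimal sketch, assuming a Linux system where <linux/fs.h> provides FIGETBSZ; the helper name fs_block_size is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>      /* open() for callers */
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIGETBSZ */

/* Returns the block size of the filesystem backing fd, or the negated
 * errno if the request fails or does not apply. */
int fs_block_size(int fd)
{
    int bsz = 0;

    if (ioctl(fd, FIGETBSZ, &bsz) == -1)
        return -errno;
    return bsz;
}
```

The value comes from the filesystem's superblock, so every file on the same filesystem should report the same number.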
FICLONE, FICLONERANGE, and FIDEDUPERANGE: Copy-on-write file cloning
First can clone a whole file (ioctl_file_clone())
Second can clone part of a file (ioctl_file_clone_range())
Third can de-duplicate data across multiple files (ioctl_file_dedupe_range())
A simple cp implementation in ioctl_copy.c
FIONREAD: How many bytes left to read in a file?
This is one place IOCTL may call into a filesystem and/or module
For regular file, this is simple subtraction
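A small sketch of FIONREAD on a file type where the request is passed down to the file's own handler (a pipe). The helper name bytes_pending is made up for illustration.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the number of bytes currently available to read on fd,
 * or the negated errno on failure. */
int bytes_pending(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) == -1)
        return -errno;
    return n;
}
```

For a pipe, writing 5 bytes to the write end makes bytes_pending() on the read end report 5. For a regular file, the result is the file size minus the current offset.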
FS_IOC_GETFLAGS and FS_IOC_SETFLAGS: Set and get file flags
Different than those that can be set with open(2) or fcntl(2)
Many are persistent beyond this fd
E.g. FS_APPEND_FL makes a file append-only
E.g. FS_IMMUTABLE_FL makes a file immutable
Uses ioctl_getflags and ioctl_setflags
FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR: Get and set extended filesystem-level attributes
Multiple uses, including SELinux labels
Stored separately from main file information
Related to, but different than xattrs
file_ioctl()
A couple of commands only relevant for regular files, including:
Mapping logical to physical block numbers
Allocate uninitialized space for a file
Deallocate the physical space for a file
Zero out a file range
vfs_ioctl()
Call f_op->unlocked_ioctl() if it exists
Unlocked == the Big Kernel Lock (BKL) is not taken
The BKL was removed long ago, so this is the only variant left
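The dispatch can be modelled in userspace as a function-pointer call with a fallback. This is a sketch, not kernel code: the struct definitions are hypothetical stand-ins with only the fields needed here, model_vfs_ioctl and demo_ioctl are made-up names, and the translation of the kernel-internal ENOIOCTLCMD code to -ENOTTY reflects my reading of the kernel source rather than anything stated on the slides.

```c
#include <errno.h>

#define ENOIOCTLCMD 515 /* kernel-internal code, not visible to userspace */

/* Hypothetical stand-ins for the kernel's struct file and
 * struct file_operations. */
struct file;
struct file_operations {
    long (*unlocked_ioctl)(struct file *filp, unsigned int cmd,
                           unsigned long arg);
};
struct file {
    const struct file_operations *f_op;
};

/* Model of the dispatch: a missing handler, or a handler that does not
 * recognise cmd, both surface as -ENOTTY. */
long model_vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    long error = -ENOTTY;

    if (!filp->f_op->unlocked_ioctl)
        return error;
    error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
    if (error == -ENOIOCTLCMD)
        error = -ENOTTY;
    return error;
}

/* Hypothetical driver handler that understands a single command. */
#define DEMO_CMD 0x1234u
long demo_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    (void)filp;
    if (cmd != DEMO_CMD)
        return -ENOIOCTLCMD;
    return (long)arg + 1;
}
```

In this model, a file whose f_op has no handler yields -ENOTTY for every request, which matches the "this IOCTL doesn't apply to this fd" meaning of that error.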
This concludes ioctl(2)
Short for "long seek"
Change offset of an open file
Implies an historical non-long seek
In the beginning (~1970), there was seek()
Used signed, 16-bit offset
Very limited!
2^15 bytes per file
lseek() was introduced to lift that limit
Now, the offset was a signed 32-bit integer
Files could be an entire 2GB!
POSIX standardized lseek() but not seek()
Therefore, seek() found the dustbin of history
"...off_t shall be [a] signed integer [type]" -- POSIX
off_t => __kernel_off_t in <linux/types.h>
__kernel_off_t => __kernel_long_t in <asm-generic/posix_types.h>
Finally: __kernel_long_t => long in the same file, per POSIX
An loff_t, however, is a long long (64-bit)
On 64-bit Linux systems, the long type is 64 bits wide
2^63 bytes = 8,589,934,592 GiB (8 EiB)
This should be enough for all humans
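The offset limits across the three eras follow from one formula: a signed integer of width n can hold offsets up to 2^(n-1) - 1. A small sketch, with max_signed_offset as a hypothetical helper name:

```c
#include <stdint.h>

/* Largest offset representable in a signed integer of the given width:
 * 2^(bits-1) - 1. Valid for 2 <= bits <= 64. */
int64_t max_signed_offset(unsigned bits)
{
    return (int64_t)((UINT64_C(1) << (bits - 1)) - 1);
}
```

For 16 bits this gives 32,767 (the seek() era), for 32 bits 2,147,483,647 (just under 2 GiB), and for 64 bits 9,223,372,036,854,775,807.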
SYSCALL_DEFINE3(lseek,...)
Another ksys_* instance
Used by the 32-bit compatibility entry point too
ksys_lseek()
Get a valid reference to the file descriptor or exit
Make sure the whence is within range
Perform the operation
Check for errors (downcast and upcast)
Release the reference and return
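The first two checks in this list are visible from userspace as distinct error codes. A minimal sketch, where try_lseek is a hypothetical wrapper that returns the negated errno in the kernel's style:

```c
#include <errno.h>
#include <fcntl.h>      /* open() for callers */
#include <sys/types.h>
#include <unistd.h>

/* Returns the new file offset, or the negated errno on failure. */
long long try_lseek(int fd, off_t offset, int whence)
{
    off_t pos = lseek(fd, offset, whence);

    if (pos == (off_t)-1)
        return -(long long)errno;
    return (long long)pos;
}
```

An invalid descriptor is expected to give -EBADF, and an out-of-range whence on a valid, seekable descriptor is expected to give -EINVAL.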
vfs_llseek()
Bail if this is a pipe, socket, or FIFO
ESPIPE is a specific error for seeking on a pipe
If all goes well, call into the filesystem or module
f_op->llseek: long long seek (64-bit)
default_llseek()
This concludes lseek(2)
Many system calls have a varied and interesting history that explains many of their quirks
ioctl(2) provides a versatile way to implement all sorts of interfaces to kernel modules
Though quite a simple syscall, understanding lseek(2) provides insight into Linux, Unix, and computer history.
After seeing six syscall implementations, many common patterns should become apparent
This code is being actively worked on upstream. Contribute!