Revisit familiar patterns and round out our understanding of file descriptors and the VFS
ioctl(2)
Background and history
Entry point and codepath
IOCTLs common to all file descriptors
lseek(2)
History and offset extension
Entry point and codepath
Commonly pronounced "eye-ock-toll"
Abbreviation: Input/Output Control
General purpose interface
Introduced in Unix version 7
Operations beyond read/write
Became standard device-specific communication method
Included in POSIX.1-2001
Widely used in Linux and friends
Compare to DeviceIoControl() in Win32
int ioctl(int fildes, int request, ... /* arg */);
Variable number of arguments!
From current standard
"For non-STREAMS devices, the functions performed by this call are unspecified"
"The ioctl() function may be removed in a future version."
int __ioctl (int fd, unsigned long int request, ...)
Relies on crazy macros
Notice that args after arg are ignored
SYSCALL_DEFINE3(ioctl,...)
unsigned long int from userspace implicitly converted to unsigned int
unsigned long arg can be used to hold pointer
No ksys_ioctl() here!
SYSCALL_DEFINE3(ioctl,...)
Validate and take reference to file
Check security modules to validate operation
Perform underlying IOCTL
Release the file reference
fdget() covered elsewhere
See the slides on read
Ensure fd is valid
security_file_ioctl()
Similar to file_permission hook covered in write slides
Checks depend on cmd
Example in selinux
Not present in apparmor
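The first step of the syscall, validating the file descriptor, can be observed from userspace. The following is a minimal sketch, assuming a Linux system with glibc; the helper name ioctl_errno() is hypothetical and exists only for illustration.

```c
#include <errno.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: issue FIONREAD on fd.
 * Returns 0 on success, or the errno value the call failed with.
 */
static int ioctl_errno(int fd)
{
    int pending = 0;

    errno = 0;
    if (ioctl(fd, FIONREAD, &pending) == -1)
        return errno;
    return 0;
}
```

Calling ioctl_errno(-1) should report EBADF: the kernel cannot take a reference to a file that does not exist, so it returns before any security hook or IOCTL handler runs.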
do_vfs_ioctl()
Common to any file descriptor
Not specific to any filesystem or device
do_vfs_ioctl()
FIOCLEX and FIONCLEX: Set or clear the "close-on-exec" flag
Can also do this with fcntl(2), or at open(2) time with the O_CLOEXEC flag
Close fd if current succeeds at execve(2)
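As a rough illustration of the two commands, the sketch below toggles close-on-exec on the read end of a pipe with ioctl(2) and reads the flag back with fcntl(2). The helper names cloexec_is_set() and cloexec_demo() are hypothetical, and the sketch assumes Linux with glibc.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if close-on-exec is set on fd, 0 if clear, -1 on error. */
static int cloexec_is_set(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/*
 * Hypothetical demo: record the flag on a fresh pipe, after FIOCLEX,
 * and after FIONCLEX. Returns 0 on success, -1 on failure.
 */
static int cloexec_demo(int state[3])
{
    int fds[2];
    int ret = -1;

    if (pipe(fds) == -1)
        return -1;

    state[0] = cloexec_is_set(fds[0]);      /* pipe(2) leaves it clear */
    if (ioctl(fds[0], FIOCLEX) == -1)       /* set close-on-exec */
        goto out;
    state[1] = cloexec_is_set(fds[0]);
    if (ioctl(fds[0], FIONCLEX) == -1)      /* clear it again */
        goto out;
    state[2] = cloexec_is_set(fds[0]);
    ret = 0;
out:
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

Neither command takes an argument, which is why the third parameter of ioctl(2) is simply omitted here.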
do_vfs_ioctl()
FIONBIO: Uses ioctl_fionbio() to set or clear the nonblocking IO flag
cmd here
do_vfs_ioctl()
FIOASYNC: Uses ioctl_fioasync() to enable or disable asynchronous IO notifications
Note -ENOTTY means this IOCTL doesn't apply to this fd
Makes sense: f_op->fasync() must be defined
do_vfs_ioctl()
FIOQSIZE: get file's size
Works for directories and links, not just regular files
A directory's size is the sum of all entries
do_vfs_ioctl()
FIFREEZE and FITHAW: freeze or thaw a filesystem
Useful for snapshotting and backups
Interaction with write covered in write slides
Uses ioctl_fsfreeze() and ioctl_fsthaw()
do_vfs_ioctl()
FS_IOC_FIEMAP: Get the physical layout of a file on disk
Useful for optimization and defragmentation
See ioctl_fiemap() for more info
do_vfs_ioctl()
FIGETBSZ: get the block size of a filesystem
Check the superblock of this inode
Not always relevant
A simple operation
do_vfs_ioctl()
FICLONE, FICLONERANGE, and FIDEDUPERANGE: Copy-on-write file cloning
First can clone a whole file (ioctl_file_clone())
Second can clone part of a file (ioctl_file_clone_range())
Third can de-duplicate data across multiple files (ioctl_file_dedupe_range())
A simple cp implementation in ioctl_copy.c
do_vfs_ioctl()
FIONREAD: How many bytes left to read in a file?
This is one place IOCTL may call into a filesystem and/or module
For regular file, this is simple subtraction
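To see the non-regular-file path, the sketch below queries FIONREAD on a pipe, where the answer comes from the pipe's own ioctl handler rather than from a size-minus-position subtraction. The helper names bytes_pending() and fionread_demo() are hypothetical, and the sketch assumes Linux with glibc.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Bytes currently readable on fd according to FIONREAD, or -1 on error. */
static int bytes_pending(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}

/*
 * Hypothetical demo: write 5 bytes into a pipe, read 2 back, and record
 * what FIONREAD reports after each step. Returns 0 on success, -1 on error.
 */
static int fionread_demo(int pending[2])
{
    int fds[2];
    char buf[2];
    int ret = -1;

    if (pipe(fds) == -1)
        return -1;

    if (write(fds[1], "hello", 5) != 5)
        goto out;
    pending[0] = bytes_pending(fds[0]);     /* all 5 bytes are queued */

    if (read(fds[0], buf, sizeof(buf)) != 2)
        goto out;
    pending[1] = bytes_pending(fds[0]);     /* 3 bytes remain */
    ret = 0;
out:
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```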
do_vfs_ioctl()
FS_IOC_GETFLAGS and FS_IOC_SETFLAGS: Get and set file flags
Different from those that can be set with open(2) or fcntl(2)
Many are persistent beyond this fd
E.g. FS_APPEND_FL makes a file append-only
E.g. FS_IMMUTABLE_FL makes a file immutable
Uses ioctl_getflags() and ioctl_setflags()
do_vfs_ioctl()
FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR: Get and set extended filesystem-level attributes
Multiple uses, including project quota IDs and extent size hints
Stored separately from main file information
Related to, but different than xattrs
file_ioctl()
A couple of commands only relevant for regular files, including:
Mapping logical to physical block numbers
Allocate uninitialized space for a file
Deallocate the physical space for a file
Zero out a file range
vfs_ioctl()
Call f_op->unlocked_ioctl() if it exists
Unlocked == Big Kernel Lock (BKL) not taken
The BKL was removed long ago, so there is no locked variant anymore
This concludes ioctl(2)
Short for "long seek"
Change offset of an open file
Implies an historical non-long seek
In the beginning (~1970), there was seek()
Used signed, 16-bit offset
Very limited!
2^15 bytes per file
lseek() was introduced to expand computer potential
Now, the offset was a signed 32-bit integer
Files could be an entire 2GB!
POSIX standardized lseek() but not seek()
Therefore, seek() found the dustbin of history
"...off_t shall be [a] signed integer [type]" -- POSIX
off_t => __kernel_off_t in <linux/types.h>
__kernel_off_t => __kernel_long_t in <asm-generic/posix_types.h>
Finally: __kernel_long_t => long in the same file, per POSIX
An loff_t, however, is a long long (64-bit)
On 64-bit systems, the long type is 64 bits
2^63 bytes = 2^33 GiB = 8,589,934,592 GiB (8 EiB)
This should be enough for all humans
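The offset limits from the 16-bit seek() era through today's 64-bit off_t can be checked with a few lines of arithmetic. This is a small sketch; the helper names max_signed_offset(), limit_in_gib(), and off_t_bits() are hypothetical, and bits is assumed to be between 1 and 64.

```c
#include <stdint.h>
#include <sys/types.h>

/* Largest offset a signed integer of the given width can hold: 2^(bits-1) - 1. */
static int64_t max_signed_offset(unsigned bits)
{
    /* Compute in uint64_t so that bits == 64 does not overflow. */
    return (int64_t)(((uint64_t)1 << (bits - 1)) - 1);
}

/* Whole GiB covered by 2^(bits-1) bytes (0 when the range is under 1 GiB). */
static uint64_t limit_in_gib(unsigned bits)
{
    return ((uint64_t)1 << (bits - 1)) >> 30;
}

/* Width of off_t on the system this is compiled for. */
static unsigned off_t_bits(void)
{
    return (unsigned)(sizeof(off_t) * 8);
}
```

With these, max_signed_offset(16) is 32767, limit_in_gib(32) is 2, and limit_in_gib(64) is 8589934592, matching the figures on the slides.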
SYSCALL_DEFINE3(lseek,...)
Another ksys_* instance
Used by the 32-bit compatibility entry point too
ksys_lseek()
Get a valid reference to the file descriptor or exit
Make sure the whence is within range
Perform the operation
Check for errors (downcast and upcast)
Release the reference and return
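The first two checks in the list above are visible from userspace as distinct errno values. The following is a minimal sketch, assuming Linux with glibc and that /dev/null is available; the helper names lseek_errno() and lseek_checks_demo() are hypothetical, and 1000 is simply an arbitrary out-of-range whence.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if lseek() succeeds, otherwise the errno it failed with. */
static int lseek_errno(int fd, off_t offset, int whence)
{
    errno = 0;
    if (lseek(fd, offset, whence) == (off_t)-1)
        return errno;
    return 0;
}

/*
 * Hypothetical demo of the checks, in order:
 *   result[0]: bad descriptor     (no file reference can be taken)
 *   result[1]: out-of-range whence on a valid descriptor
 *   result[2]: a valid seek
 * Returns 0 on success, -1 if /dev/null cannot be opened.
 */
static int lseek_checks_demo(int result[3])
{
    int fd = open("/dev/null", O_RDONLY);

    if (fd == -1)
        return -1;

    result[0] = lseek_errno(-1, 0, SEEK_SET);
    result[1] = lseek_errno(fd, 0, 1000);
    result[2] = lseek_errno(fd, 0, SEEK_SET);

    close(fd);
    return 0;
}
```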
vfs_llseek()
Bail if this is a pipe, socket, or FIFO
ESPIPE is a specific error for seeking on a pipe
If all goes well, call into the filesystem or module
f_op->llseek: long long seek (64-bit)
default_llseek()
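The early bail-out for unseekable files is easy to trigger. The sketch below, assuming Linux with glibc, attempts a seek on the read end of a pipe; the helper name seek_pipe_errno() is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to seek on a pipe.
 * Returns the errno from lseek(), 0 if the seek unexpectedly succeeds,
 * or -1 if the pipe could not be created.
 */
static int seek_pipe_errno(void)
{
    int fds[2];
    int err = 0;

    if (pipe(fds) == -1)
        return -1;

    errno = 0;
    if (lseek(fds[0], 0, SEEK_CUR) == (off_t)-1)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On Linux this should return ESPIPE, since a pipe has no file position to move.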
This concludes lseek(2)
Many system calls have a varied and interesting history that explains many of their quirks
ioctl(2) provides a versatile way to implement all sorts of interfaces to kernel modules
Though quite a simple syscall, understanding lseek(2) provides insight into Linux, Unix, and computer history.
After seeing six syscall implementations, many common patterns should become apparent
This code is being actively worked on upstream. Contribute!