GNU coreutils – MaiZure's Projects

October 2018 updated: September 2019

coreutils brought to you by the GNU project

This is a long-term project to decode all of the GNU coreutils in version 8.3.

This resource is for novice programmers exploring the design of command-line utilities. It is best used as a companion that provides background while you read the source code of a utility that interests you. This is not a user guide; please see the applicable man pages for instructions on using these utilities.

Status: Complete!


The GNU Core Utilities

I'll link the utility pages here at the top. Click the command name for the detailed page decoding that utility. The discussion, source code, and walkthroughs are available on each page. Bolded utilities have been expanded as part of phase 2. Enjoy!

Helpful background for code reading

The GNU coreutils has its foibles. Many of these utilities are approaching 30 years old and include revisions by many people over the years. Here are some things to keep in mind when reading the code:

Basic design

Most CLI utilities look something close to this:

General CLI procedure

The key ideas:

This is the framework I'll use to organize the decoding of each utility. We'll see that each utility has its own variant of this idea, ranging from a few lines to thousands of lines. I'd categorize the variants in three groups: trivial utilities, wrappers, and full utilities.

Trivial utilities
Trivial utilities have a unique setup phase which defines a macro in a couple of lines. They then 'include' the source of another utility, in which the macro forces a specific control flow. Examples include: arch, dir, and vdir.
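Here is a minimal single-file sketch of that "define a macro, then include another source" pattern. To stay self-contained it does not actually #include a second file; the lower half stands in for the included source. The names EXIT_STATUS and shared_entry_point() are used here for illustration only and are assumptions, not a copy of any coreutils file.

```c
#include <stdlib.h>

/* --- Part 1: what the short "trivial" file would contribute. --- */
/* It pins the macro before the shared source is pulled in. */
#define EXIT_STATUS EXIT_FAILURE

/* --- Part 2: stand-in for the included source of another utility. --- */
/* The shared source only supplies a default when nobody defined the
   macro first, so the definition above wins. */
#ifndef EXIT_STATUS
# define EXIT_STATUS EXIT_SUCCESS
#endif

/* Hypothetical name: in a real utility this logic would live in main(). */
int
shared_entry_point (void)
{
  /* The macro decides the control flow (here, just the result). */
  return EXIT_STATUS;
}
```

Removing the #define in Part 1 makes the same function return EXIT_SUCCESS, which is the whole trick: one body of code, two behaviors selected at compile time.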

Wrapper utilities
Wrappers perform setup and parse command-line options, which are passed directly as arguments to a syscall. The result of the syscall is the result of the utility. These utilities do little processing of their own. Examples include: link, whoami, hostid, logname, and more.
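As a rough sketch of the wrapper idea, the function below does nothing except call a libc routine and format its result. It is loosely modeled on what a hostid-style utility needs to do; format_hostid() is a hypothetical helper name, and the 32-bit masking and 8-digit hex format are assumptions made for the example.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: ask the system for the host identifier and
   format it as 8 lowercase hex digits.  Returns the number of
   characters snprintf() would write (8), or a negative value on error. */
int
format_hostid (char *buf, size_t size)
{
  /* The wrapped call does all of the real work. */
  unsigned long id = (unsigned long) gethostid ();

  /* Keep only the low 32 bits so the output width is stable. */
  id &= 0xffffffffUL;

  return snprintf (buf, size, "%08lx", id);
}
```

A caller would pass a buffer of at least 9 bytes and print the result; everything else in such a utility is setup and option parsing.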

Full utilities
The diagram above shows the design of full utilities: a setup phase, an option/argument parsing phase, and an execution phase. Execution means processing input data, and may invoke many syscalls along the way until all of the data is handled. Most utilities fall into this category.


Digging deeper

Let's go through the most common ideas shared across many of the utilities. Knowing these concepts beforehand should speed up code reading.

Utility Initialization

All utilities have a short initialization procedure near the beginning of main():

initialize_main (&argc, &argv);
set_program_name (argv[0]);
setlocale (LC_ALL, "");
bindtextdomain (PACKAGE, LOCALEDIR);
textdomain (PACKAGE);

atexit (close_stdout);

This preamble handles a few administrative issues, the most important of which are internationalization and registering the exit action. I'll go through each of these lines below. These lines don't affect the specific action of a utility.
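The real preamble depends on Gnulib and gettext, so it won't compile on its own. The sketch below approximates the same shape using only standard C: it records the program name, selects the user's locale, and registers an exit handler that flushes stdout and reports a write error. The names init_utility() and flush_stdout_at_exit() are hypothetical stand-ins, the handler is a simplification of what close_stdout does, and the bindtextdomain()/textdomain() calls are left out.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the saved program name (assumption for this sketch). */
static const char *prog_name = "";

/* Simplified stand-in for close_stdout: flush buffered output and
   fail loudly if the data could not be written. */
static void
flush_stdout_at_exit (void)
{
  if (fflush (stdout) != 0 || ferror (stdout))
    {
      fprintf (stderr, "%s: write error\n", prog_name);
      _Exit (EXIT_FAILURE);
    }
}

/* Hypothetical helper mirroring the preamble.  Returns 0 on success,
   nonzero if the exit handler could not be registered. */
int
init_utility (const char *argv0)
{
  prog_name = argv0;          /* roughly: set_program_name (argv[0]) */
  setlocale (LC_ALL, "");     /* honor the user's locale settings */
  return atexit (flush_stdout_at_exit);
}
```

A utility would call init_utility (argv[0]) as the first statement of main() and then move on to option parsing.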

Parsing with Getopt

Ever wonder why command line utilities have had the same look and feel for the past 40 years? You can thank the Getopt toolset. The bare minimum you need to know to follow the coreutils is:
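A minimal sketch of a getopt_long() loop is shown below. The option set (-n/--number and -h/--help), the struct demo_options type, and the parse_demo_options() helper are all made up for illustration; real utilities define their own tables and usually handle the options directly inside main().

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct demo_options
{
  bool number;        /* -n or --number was given */
  bool help;          /* -h or --help was given */
  int first_operand;  /* index in argv of the first non-option */
};

/* Long option table: name, argument requirement, flag, short equivalent. */
static const struct option long_options[] = {
  {"number", no_argument, NULL, 'n'},
  {"help", no_argument, NULL, 'h'},
  {NULL, 0, NULL, 0}
};

/* Returns 0 on success, -1 if an unrecognized option was seen. */
int
parse_demo_options (int argc, char **argv, struct demo_options *out)
{
  int c;

  out->number = false;
  out->help = false;
  optind = 1;  /* start scanning at the first argument */

  while ((c = getopt_long (argc, argv, "nh", long_options, NULL)) != -1)
    {
      switch (c)
        {
        case 'n':
          out->number = true;
          break;
        case 'h':
          out->help = true;
          break;
        default:
          /* getopt_long already printed a diagnostic to stderr. */
          return -1;
        }
    }

  /* Whatever remains from optind onward are operands (e.g. file names). */
  out->first_operand = optind;
  return 0;
}
```

The switch inside the while loop is the part to recognize; nearly every utility page will show a larger version of it.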

Traversing the file system with fts

Unix-like systems often support the fts library to easily manage walking through the file system. The basic hand-waved details are:
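To make the fts flow concrete, here is a small sketch using the fts functions found in glibc and the BSDs: open a traversal, read entries one at a time, and close it. Coreutils goes through its own Gnulib version with extra wrappers, so treat this as an approximation. The helper count_regular_files() is hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stddef.h>

/* Hypothetical helper: count the regular files at or below ROOT.
   Returns -1 if the traversal could not be started. */
long
count_regular_files (const char *root)
{
  /* fts_open() takes a NULL-terminated array of starting paths. */
  char *paths[] = { (char *) root, NULL };
  FTS *tree = fts_open (paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
  FTSENT *entry;
  long count = 0;

  if (tree == NULL)
    return -1;

  /* Each fts_read() call yields the next entry in the walk. */
  while ((entry = fts_read (tree)) != NULL)
    {
      /* fts_info classifies the entry; FTS_F means regular file. */
      if (entry->fts_info == FTS_F)
        count++;
    }

  fts_close (tree);
  return count;
}
```

Directories show up twice in such a walk (once before and once after their contents), which is why utilities switch on fts_info rather than assuming one visit per path.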

Syscall wrappers, and helpers

coreutils often invokes syscalls through wrappers and helpers beyond those provided by libc. Many are linked through the Gnulib project.

write

libc provides many text-writing functions, such as fwrite() for buffered stream access and the write() syscall wrapper. Coreutils also brings in non-standard functions such as full_write(). The full_write() function keeps retrying until all of the data is written or a hard failure occurs. It relies on safe_write() to retry the write() syscall across interrupts. Other write-related helpers are used in only a single utility, such as iwrite() in dd and cwrite() in split. I'll discuss those within the utilities themselves.
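The layering described above can be sketched in plain POSIX C. The functions below are named sketch_safe_write() and sketch_full_write() to make clear they are illustrations of the idea, not the Gnulib sources; details such as treating a zero-byte write as "no space" are assumptions of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Retry write() when it is interrupted by a signal before
   transferring any data. */
static ssize_t
sketch_safe_write (int fd, const void *buf, size_t count)
{
  for (;;)
    {
      ssize_t n = write (fd, buf, count);
      if (n >= 0 || errno != EINTR)
        return n;
    }
}

/* Keep writing until COUNT bytes are out or a hard error occurs.
   Returns the number of bytes actually written; a short count
   signals failure, with errno describing the cause. */
size_t
sketch_full_write (int fd, const void *buf, size_t count)
{
  const char *p = buf;
  size_t total = 0;

  while (count > 0)
    {
      ssize_t n = sketch_safe_write (fd, p, count);
      if (n < 0)
        break;                /* hard failure: give up */
      if (n == 0)
        {
          errno = ENOSPC;     /* no progress: report as out of space */
          break;
        }
      total += (size_t) n;    /* partial write: advance and retry */
      p += n;
      count -= (size_t) n;
    }

  return total;
}
```

Callers compare the return value against the requested length; anything shorter means the output is incomplete.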

Common functions

All utilities use at least three functions: main(), usage(), and _().

The usage() function displays help for the utility that includes a list of input parameters, their meaning, and appropriate syntax.

The _() function is really a macro defined in system.h that binds simple strings to the Native Language Support capability in GNU gettext.h. If it's a string meant to be shown to the user, it's probably wrapped with this function.
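As a sketch of how such a macro can work, the snippet below maps _() onto gettext() from <libintl.h>. This assumes a system where libintl is part of libc (glibc) or otherwise available; the greeting() function is a hypothetical example of a user-facing string.

```c
#include <libintl.h>

/* Mark a string as translatable and look it up at run time. */
#define _(msgid) gettext (msgid)

/* Hypothetical user-facing message.  When no translation catalog is
   available for the current locale, gettext() returns the original
   string unchanged. */
const char *
greeting (void)
{
  return _("Hello, world");
}
```

The short name is deliberate: it keeps the code readable when nearly every string literal in usage() is wrapped.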

Common code lines

The following code lines occur in most non-trivial utilities:

#include "system.h"
This header defines system-dependent macros, variables, and useful non-standard functions. It provides the 'translations' necessary to allow coreutils to build on as many architectures as possible. Overall, this header is a patchwork of corner cases lacking serious organization, but it works! Many C standard and POSIX headers are included within this header, such as: unistd.h, limits.h, ctype.h, time.h, string.h, errno.h, stdbool.h, stdlib.h, fcntl.h, inttypes.h, and locale.h.

#define PROGRAM_NAME "cat"
Defines the official name for the utility. Used in the 'version' check.

#define AUTHORS proper_name ("Richard M. Stallman")
Defines the authors for the utility. Used in the 'version' check.

emit_try_help ()
Prints a suggestion to try --help after a usage error. This appears at the beginning of usage(), on the failure path.

emit_ancillary_info (PROGRAM_NAME)
Prints common extra help info after the command-specific output. Includes a link to the online documents. This appears close to the end of usage()

exit (status)
Library call (not itself a syscall) that ends execution with the given status after running any handlers registered with atexit(). This appears at the end of usage().

initialize_main(&argc, &argv)
Special handler for VMS that forces built-in wildcard expansion. This is defined away on most other operating systems.

set_program_name(argv[0]);
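Saves the program name from the first input argument for use in diagnostics. The name is kept largely as invoked; only libtool build artifacts (a '.libs/' directory and an 'lt-' prefix) are stripped from argv[0].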

setlocale(LC_ALL, "");
Sets up internationalization options during execution. Provided by libc in <locale.h>

bindtextdomain (PACKAGE, LOCALEDIR);
Sets the directory that holds the message catalogs used for internationalization, via GNU gettext.

textdomain (PACKAGE);
Sets the text domain to enable i18n.

atexit(close_stdout);
Registers the close_stdout function to be called when the program ends. This flushes the buffered stream in addition to closing it.

IF_LINT(something);
Suppresses GCC warnings when using a linter by including the code within the parens. Usually this is a no-op.

C idioms

There are a few idioms buried in the coreutils source that may be unfamiliar to beginners.

!!
The double exclamation point is exactly what you see: a double unary NOT operation. The purpose is to coerce a value into a boolean (0 or 1). It's often used to make a flag from a function return value.
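A tiny sketch of the idiom follows. The function names and the DEMO_VERBOSE bit are hypothetical; the point is only that any nonzero value collapses to 1 while zero stays 0.

```c
/* Hypothetical bit used for the example. */
#define DEMO_VERBOSE 0x08

/* Normalize any integer to exactly 0 or 1.
   The first ! maps nonzero to 0 and zero to 1; the second flips it back. */
int
as_flag (int value)
{
  return !!value;
}

/* Typical use: turn a masked bit (which would be 0 or 8 here)
   into a clean 0/1 flag. */
int
has_verbose_bit (unsigned int mode_bits)
{
  return !!(mode_bits & DEMO_VERBOSE);
}
```

Without the !!, has_verbose_bit() would return 8 rather than 1, which matters when the result is compared against another flag or stored in a one-bit field.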

do { ... } while (0)
The non-loop often encloses a multi-statement macro to ensure proper tokenization after preprocessor substitution. The core use-case is as a consequent:

if (condition) MACRO; else something else

Note the lack of a semicolon after while. It's added manually after the macro in the C code.
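Here is a short sketch showing why the wrapper matters. SWAP_INTS and order_pair() are hypothetical names made up for this example. Because the macro body is a single do/while statement, the semicolon written at the call site completes it and the following else still pairs with the if.

```c
/* Multi-statement macro wrapped in do { ... } while (0).
   No trailing semicolon here: the caller supplies it. */
#define SWAP_INTS(a, b) \
  do \
    { \
      int tmp_ = (a); \
      (a) = (b); \
      (b) = tmp_; \
    } \
  while (0)

/* Put the smaller value in *lo and the larger in *hi.
   Returns 1 if a swap was needed, 0 otherwise. */
int
order_pair (int *lo, int *hi)
{
  if (*lo > *hi)
    SWAP_INTS (*lo, *hi);   /* expands to one statement plus our ';' */
  else
    return 0;

  return 1;
}
```

If the macro were written with bare braces instead, the semicolon after the call would terminate the if statement early and the else would no longer compile.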


Utility Maintenance

An active project like coreutils is always evolving. In general, updates proceed across three arcs:

For curious readers, I've included an 'evolution' view within each utility page to visualize utility changes over time.

Contributing

People interested in contributing should read everything on the GNU project page. The contribution guidelines and list of rejected features are especially enlightening. Finally, go through the mailing list archives to get an idea of what contributions are most valuable. A very short list of things to consider before writing any code:

Not sure? Send your concerns to the community on the mailing list.


Fun stuff

Veteran developers looking for a reason to peek inside these utilities may want to start their journey here.

Trivia

Shortest utility: false (2 lines - tied with arch, dir, and vdir)
Shortest standalone utility: true (80 lines) -- the first version is almost a minimum C program!
Longest utility: ls (5308 lines)

Interesting implementations

There are a few standalone code snippets within coreutils worth investigating:


FAQ

Nice project! How can I donate to support this effort?
Thanks for the thought! Unfortunately, I'm not set up to receive personal donations, but feel free to share your time or money with the Free Software Foundation. That's where all the collaborative efforts happen!