How I hacked my own Unix bootup and how easy it is
I’m writing this text because of how surprisingly easy this turned out to be in hindsight, and how much it increased my understanding of how a Unix bootup process works. There are so many things one encounters that seem daunting at first, but once you’ve done them and look back, they were actually quite easy. All too often, concepts that appear frightening and complex at first seem intuitive and trivial five days later. I’d like to urge anyone who wants to learn something but thinks it’s too difficult to give it a try anyway; you’ll probably be surprised how easy it is when you dive into it. Things like writing kernel modules, Linux From Scratch, monads, first-class continuations: they all seem daunting at first, but once you dive in and get your hands dirty it all seems so easy after a while.
The modern Unix bootup process
Note: I am purposefully writing this part from the extent of what I knew before I started hacking on my own bootup process.
How it works:
- First, the firmware executes a hardcoded piece of code found at a fixed location on the storage device; we will only concern ourselves with BIOS here. This section of the disk is called the MBR, and the code in it is loaded and executed by the motherboard’s firmware itself.
- That code in the MBR is responsible for loading the bootloader.
- The bootloader is a piece of software that typically gives you a nice menu and, once you make a choice, executes a kernel. Note that it replaces itself with the kernel; the bootloader disappears from memory afterwards.
- The kernel is also given a couple of arguments when executed. The two most important of these are a descriptor pointing towards the root filesystem and a filepath on that root filesystem pointing towards the init (see the example after this list).
- The kernel then boots itself and, once its own booting is complete, executes the init. This is the first process the kernel executes; it can in theory be any process, but we of course want one that brings the system online and eventually culminates in a login.
- The init process, like the kernel and unlike the bootloader, will remain running until the computer shuts down. In theory, after this point all the init process needs to do is some periodic cleanup actions.
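For illustration, a bootloader entry handing those two arguments to a kernel might look like this in a GRUB configuration; the device, kernel path and init path are made up:

# hypothetical line from a GRUB menu entry: root= tells the kernel where
# the root filesystem lives, init= which program to run as the first process
linux /boot/vmlinuz root=/dev/sda2 ro init=/sbin/init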
While the init brings the system online, it performs things like mounting the appropriate filesystems, checking the drives for integrity and what-not.
Most modern bootup processes involve first mounting the root filesystem as read-only; only once the init is satisfied with the integrity of the filesystems will it remount it as read-write.
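In its simplest form, that part of an init boils down to something like this rough sketch; the exact fsck flags vary between systems:

# check all filesystems listed in /etc/fstab while root is still read-only
fsck -A -p
# only once that succeeds, remount the root filesystem read-write
mount -o remount,rw /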
The modern Unix shutdown process
The shutdown process is also typically handled by the init, which is instructed to kill all processes and cleanly unmount filesystems; once that is all done, it executes the reboot() system call, which instructs the kernel to stop its operations and tell the hardware via ACPI to shut itself down.
Note that during all of this the init itself runs as root. In order to log in as a normal user, the init spawns a login mechanism, which itself runs as root, which allows you to log in and then creates a session for your normal user.
Runit
The init process I used for all of this is Runit. It is a very simple and flexible framework that lets you scrape your own system together with simple shell scripts. The only Unix distribution I know of that uses it by default is Void Linux.
Runit’s bootup-to-shutdown sequence is composed of three stages, simply called 1, 2 and 3. These are executable programs that can be whatever you want, though shell scripts are the norm. How Runit works is simple:
- At bootup, it executes 1.
- Once 1 is done running and exits without failure, it executes 2. 2 must keep running forever, and the system crashes if 2 crashes.
- When Runit is told to shut down/reboot, it will terminate the 2 stage and execute the 3 stage. Once 3 has completed, it will instruct the kernel to shut down.
Obviously, what you put inside each of these stages is what determines how your system works. Again, these are typically normal shell scripts, so 1 must be responsible for initializing the system, checking filesystems, bringing udev online, etc.
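To make that concrete, here is a deliberately minimal, hypothetical set of stages; a real system needs far more than this, but this is the entire contract Runit asks you to fulfill:

#!/bin/sh
# /etc/runit/1: one-shot initialization (hypothetical minimal sketch)
mount -t proc proc /proc
mount -o remount,rw /

#!/bin/sh
# /etc/runit/2: must run forever; here it just supervises /etc/service
exec runsvdir /etc/service

#!/bin/sh
# /etc/runit/3: one-shot cleanup before halt or reboot
mount -o remount,ro /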
Implementing our own
Runit stage 1
So, let’s look at what Void Linux did here:
If we look at their implementation of 1, we’ll notice that the core-services directory contains the tasks needed to initialize the system. So let’s have a look in there.
What we see is a bunch of shell scripts, executed in order, that bring the system online. They mount the pseudo-filesystems at /proc, /dev, etc., load the kernel modules and bring udev online.
However, we notice one obvious flaw: this is done sequentially, not in parallel with a dependency mechanism. It is just one giant shell script spread over multiple files that brings the system up. That’s not optimal. So here’s my runit/1:
#!/bin/bash
# system one time tasks
. /etc/runit/output
. /etc/runit/functions
PATH=/sbin:/usr/sbin:/bin:/usr/bin
announce 'initializing system with OpenRC'
openrc sysinit ||
    fatal 'OpenRC failed the sysinit stage'
openrc boot ||
    fatal 'OpenRC failed the boot stage'
# OpenRC should have made the system read-write by now
attempt 'creating stopit file'
touch /etc/runit/stopit &&
    chmod 0 /etc/runit/stopit
end || fatal 'could not create stopit file'
All the announce and attempt stuff is just output; it can be ignored. The important parts are the openrc sysinit and openrc boot invocations. Yes, that’s OpenRC. You may have heard of it. It’s commonly called an “init system”, but it is not. It’s an RC, a run-command. It must be used together with an actual init system, typically sysvinit, but in this case Runit. OpenRC is capable of grouping commands together in hierarchies and types, expressing dependencies between them, executing them, tracking their results and resources with cgroups, and doing it all in parallel. It can be used to actually launch services, but in this case it is only used to initialize the system and perform tasks similar to the core-services of Void Linux’s implementation of Runit. Except that it does so in parallel.
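OpenRC groups services into runlevels such as sysinit and boot, and services are assigned to them with its rc-update tool. Assuming a service named udev is installed, that looks like this:

# add the udev service to the sysinit runlevel
rc-update add udev sysinit
# list the services currently in that runlevel
rc-update show sysinit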
There are clearly two levels to this process, the sysinit level and the boot level; the former must be completed before the latter starts, but within each level the tasks are executed in parallel via a dependency system. Let’s take a look at the sysinit level:
$ ls -1 /etc/runlevels/sysinit/
devfs
dmesg
kmod-static-nodes
sysfs
tmpfiles.dev
udev
udev-trigger
Those look similar to the Void Linux core services in name, and they do similar things. Let’s just open one up and see what’s inside; I cut out the irrelevant part, as it’s long:
#!/sbin/openrc-run
command_args="--daemon ${udev_opts}"
description="udev manages device permissions and symbolic links in /dev"
extra_started_commands="reload"
description_reload="Reload the udev rules and databases"
depend()
{
    need sysfs dev-mount
    before checkfs fsck
    keyword -lxc -systemd-nspawn -vserver
}
# more stuff here that is cut out
The dependencies are the relevant part; these scripts are not just executed in lexical order like Void does it. The dependency specification makes it clear that this script needs sysfs and dev-mount and must be executed before checkfs and fsck, but aside from that, OpenRC is free to rearrange the order of execution and will attempt to start as many things as it can in parallel.
The parallel part is relevant because, say, one service is at a point where it is only limited by disk I/O and does not require a lot of CPU time; you’d want that CPU time to go to something else rather than having all the other services wait in line until it is finished.
Runit stage 2
As we remember, stage two has to keep running forever until a shutdown. So let’s again look at what Void Linux did:
As you can see, there’s some code in there to detect arguments from the kernel command line; some settings and folders are changed based on that. That’s not all that relevant and can be left out. The magic is this part:
exec env - PATH=$PATH \
runsvdir -P /run/runit/runsvdir/current 'log: ...........................................................................................................................................................................................................................................................................................................................................................................................................'
The exec shell builtin performs an exec() system call. This is one of the most powerful parts of Unix that does not exist on Windows: it instructs a process to load another program and replace itself with it. Effectively, the new process becomes the old process and inherits much of its state, such as its process ID. As we recall, stage 2 must run forever. So this is what this does: the env - PATH=$PATH runsvdir -P /run/runit/runsvdir/current <log_descriptor> process becomes the runit stage 2.
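You can see the effect of exec with a trivial script; the echo never runs, because the shell that would have run it has already been replaced:

#!/bin/sh
# the shell calls exec(): sleep replaces it, keeping the same PID
exec sleep 1000
echo "never reached"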
The relevant part here is runsvdir. This is basically a process that gets pointed at a directory containing descriptions of services that need to keep running forever; it will start those services, provide supervision for them, and respawn them should they ever fail. This is the part that hooks up to the login process, because unlike in a lot of traditional systems, the login prompts, called gettys, are part of these services in a typical runit setup. It starts the login prompt as a normal service. So this runsvdir process becomes our stage 2 at this point and continues to run forever.
Other services that are supervised by runsvdir are things like your sshd daemon, your wireless connection via wpa_supplicant, your cron daemon and really whatever you want. I have other esoteric stuff in there too, like a small daemon that monitors my network mounts and ensures that whenever a computer that I have a network mount on comes online on the network, it gets automatically mounted.
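Each such service is just a directory containing an executable run script that execs the daemon in the foreground. Two hypothetical examples, with illustrative paths and options:

#!/bin/sh
# hypothetical /etc/sv/sshd/run: run sshd in the foreground so it can be supervised
exec /usr/sbin/sshd -D

#!/bin/sh
# hypothetical /etc/sv/getty-tty1/run: the login prompt is a service like any other
exec agetty tty1 38400 linux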
So here is my implementation:
#!/bin/bash
. /etc/runit/output
. /etc/runit/functions
export PATH=/command:/usr/local/bin:/usr/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin
svbase=/etc/runit/runlevels
export SVDIR=$svbase/default
# check for any other runlevel
runlevel=$(get_cmdline softlevel)
if [[ "$runlevel" ]] ; then
runlevel_path=$svbase/$runlevel
if [[ -d "$runlevel_path" ]] ; then
announce "using runlevel: $runlevel"
SVDIR=$runlevel_path
else
warning "runlevel: $runlevel_path does not exist, defaulting to $SVDIR"
fi
fi
attempt "linking $SVDIR to /var/service..."
ln -sf "$SVDIR" /var/service
end $?
SVDIR=/var/service
announce "starting runsvdir with: $SVDIR"
exec runsvdir -P "$SVDIR" &> /dev/null
fatal "exec some-how failed horribly"
This also checks the kernel command line, and it also does that whole “runlevel” magic, but both in a different way than Void does.
The important part is that the directories Void Linux uses are completely different from mine. My runlevels live inside /etc/runit/runlevels, whereas with Void Linux they exist in /etc/runit/runsvdir. Their active runlevel is symlinked from /run/runit/runsvdir/current, while mine is from /var/service. The kernel command line specification is different too. Their implementation simply requires that a directory with a name matching a kernel command line parameter exists, so if you have a kernel command line parameter root=/dev/sda2 to specify your root filesystem, and a directory /etc/runit/runsvdir/root=/dev/sda2, it will consider it a runlevel. I feel this approach is too fragile and does not allow for useful error reporting on a typo. So my implementation demands that the kernel command line parameter is of the form softlevel=<runlevel>; if you make a typo, you get a warning that the runlevel doesn’t exist, and it defaults to the default one.
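The get_cmdline helper lives in /etc/runit/functions, which isn’t shown in this post; a plausible sketch of how such a helper could be written, purely for illustration:

# hypothetical implementation of get_cmdline:
# print the value of a key=value parameter on the kernel command line
get_cmdline() {
    local word
    for word in $(< /proc/cmdline) ; do
        case $word in
            "$1"=*) printf '%s\n' "${word#*=}" ; return 0 ;;
        esac
    done
    return 1
}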
This highlights a lot of the strengths of the simplicity of Runit. As I said before, it just executes the three stages to boot and shut down your system. You can put into those scripts whatever you want. The support for mapping the kernel command line onto runlevels is not hardcoded, and if you don’t like how they did it, you can change it. In this sense, Runit is quite a flexible framework. It’s not really an init; it provides a set of tools to make one however you please.
Runit stage 3
So of course, we want to shut down cleanly as well. As said, to do this, Runit stops stage 2 and starts stage 3.
Let’s again look at how Void Linux does this. Quite unremarkable again: it saves the random seed, it stops all processes, it stops udev, it remounts the filesystems as read-only before it exits, and finally it performs the sync call. What sync does is write any last pending changes to the filesystems, if the filesystems are mounted in an asynchronous way (almost all modern filesystems are). No, this is not done in the wrong order: a filesystem that is mounted read-only cannot gain new pending writes, but old pending writes can still be synced to it. A synchronous filesystem immediately writes whenever you execute an operation; an asynchronous filesystem may keep these in a buffer to improve performance. For some things, such as shutting down the computer or removing a removable drive, we of course have to sync first. For most things, the kernel will be smart enough to figure it out for us.
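The tail end of such a shutdown therefore looks roughly like this (illustrative only):

# remount read-only first: no new writes can be queued after this point
mount -o remount,ro /
# then flush whatever was still buffered out to disk
sync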
Of course, this is all again done sequentially and not in parallel, so again, OpenRC to the rescue in my setup:
#!/bin/bash
. /etc/runit/output
. /etc/runit/functions
exec >/dev/console 2>&1
export PATH=/sbin:/usr/sbin:/bin:/usr/bin
attempt "Waiting for services to stop"
sv force-stop /var/service/* &&
sv exit /var/service/*
end
attempt "unlinking /var/service"
unlink /var/service
end
attempt "obtaining all users in users group"
IFS=, users=( $(awk -F: '/^users/ {print $4;}' /etc/group ) )
end || fatal 'could not obtain users in user group'
announce "sending a TERM signal to all processes by users in users group"
for user in "${users[@]}"; do
pkill -u "$user"
done
patience=5
attempt "waiting a maximum of $patience seconds for processes to die"
now=$(date +%s)
timeout=$((now + patience))
while true ; do
    remaining_pids=()
    for user in "${users[@]}" ; do
        remaining_pids+=( $(ps -u "$user" -o pid=) )
    done
    remaining_count=${#remaining_pids[@]}
    if [[ $remaining_count -eq 0 || $(date +%s) -gt $timeout ]]; then
        break
    fi
    # give the processes a moment instead of busy-spinning
    sleep 0.2
done
if [[ $remaining_count -gt 0 ]]; then
    end 1 || warning "$remaining_count processes remaining"
    announce "sending KILL signal to remaining processes"
    kill -s 9 "${remaining_pids[@]}"
else
    end 0
fi
if [[ -x /etc/runit/reboot ]] ; then
    announce "rebooting"
    announce "telling OpenRC to reboot"
    openrc reboot
else
    announce "shutting down"
    announce "telling OpenRC to shut down"
    openrc shutdown
fi
The first relevant thing it does is stop all the services in /var/service, which is basically a symlink to a directory listing the currently running services.
Since this includes the login gettys, any children they spawn are killed automatically too. So if you issue a shutdown command from your graphical session, you will see it boom away before your eyes, as your X session is terminated because its parent getty was terminated before it.
Finally, there’s a rather complex piece of code. This code is actually in a separate file, but I placed it here for illustration. It first sends a TERM signal to any process owned by any user in the users group, then waits a maximum of 5 seconds for them to die gracefully on their own, and if there are still processes left after that point, it just brutalizes them with the KILL signal.
I find this implementation superior to the standard sysvinit way of doing this, which always waits a fixed interval before sending the KILL signal to any remaining processes. This implementation is faster if they actually do exit gracefully before that time, but is more generous and gives them more time if they need it.
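For comparison, the traditional approach amounts to something like this sketch; the exact delay varies between implementations:

# classic sysvinit-style shutdown: signal everything, wait a fixed
# interval, then kill whatever is left, whether it needed the time or not
killall5 -15
sleep 5
killall5 -9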
After that, the OpenRC magic starts again. Depending on whether the system was configured for a reboot or a shutdown, we tell OpenRC to finalize itself. This involves OpenRC doing the inverse of all the things it did during system initialization: it will unmount all filesystems, stop udev, unmount cgroups, unmount pseudo-filesystems, remount the root filesystem read-only, and finally exit, after which point the entire stage 3 exits and runit tells the kernel to shut the system down.
And yes, doing this properly required me to implement a lot of OpenRC scripts that initialize the system and shut it down cleanly. A lot was already written by the Gentoo team, but a lot of it I had to implement myself to make sure it all went well. It was a lot of work and time-consuming, yes, but it wasn’t that difficult at all.
So is it really all that easy?
Yeah, I thought there was more to it as well. But really, you can easily build a system initialization with a couple of shell scripts. You just need to check the filesystems, bring a bunch of stuff up, and do the reverse when the system shuts down. You can look at Void Linux’s core services to see just how easy and understandable it can be; my implementation using OpenRC is already more complex than that. You can pretty much make your entire init a shell script if you want. All it needs to do is run forever and perform some routine cleanup of zombie processes, and that is it, really.