Both pieces of this assignment ask you to print a list of files. The word
“file” is often used to refer specifically to a regular file, which is the
“normal” kind of file that holds data and shows up in ls
with a type of -
.
However, the strict POSIX definition of “file” encompasses not only regular
files but also directories, symlinks, devices, and every other thing that can
go inside a directory. For this assignment, we are referring to the POSIX
definition whenever we say “file.” Your implementations of whats-new.sh
and
myls.c
should consider directories, symlinks, and all other types of file
when producing their output.
whats-new.sh
: finding newly-added files on the homework serverThe server that Tufts uses to host home directories has a special feature to
help prevent data loss: in every directory, the server adds a hidden, read-only
directory named .snapshot/
. Inside .snapshot/
is a set of directories with
names like daily.2021-09-22_0010
. Each of these directories holds a copy of
the contents of the original directory from a certain point in time. Take a
moment to SSH to the server and see for yourself!
The first piece of each snapshot’s name—daily
, in this example—represents
the frequency at which those snapshots are taken. The second
part—2021-09-22_0010
—is the date and time at which it was taken.
Snapshots that happen more frequently are also deleted more aggressively,
meaning that the granularity of snapshots goes down the further you go back.
Note that, to prevent overwhelming tools that recursively traverse a directory
tree, the .snapshot/
directory only shows up in listings (ls -a
) in your
top-level home directory. However, it is also present in every subdirectory if
you specifically ask for it:
$ ls -a Documents/ # No .snapshot/ listed!
. .. do-not-read top-secret-file
$ ls Documents/.snapshot/
daily.2021-09-19_0010 every_four_hours.2021-09-23_1205
daily.2021-09-20_0010 every_four_hours.2021-09-23_1605
daily.2021-09-21_0010 every_four_hours.2021-09-23_2005
daily.2021-09-22_0010 every_four_hours.2021-09-24_0005
daily.2021-09-23_0010 weekly.2021-09-05_0015
daily.2021-09-24_0010 weekly.2021-09-12_0015
every_four_hours.2021-09-23_0405 weekly.2021-09-19_0015
every_four_hours.2021-09-23_0805
$
This is not a standard feature of Linux or POSIX—it’s specific to the network
servers that Tufts uses. Nevertheless, .snapshot/
can be very useful when you
accidentally remove or overwrite an important file and want to get the old
version back. (In the next module, we’ll learn about version control systems,
which are like .snapshot/
but supercharged.) In this part of the assignment,
you’ll write a shell script that automates the process of finding what files
have been added to a directory since its most recent snapshot.
Your script, named whats-new.sh
, should take as its only argument a path to a
directory to find new files in. If the argument is missing, or does not point
to a directory that has been snapshotted, or the script is provided too many
arguments, your script should print an error message and exit with a nonzero
exit status. (The exit
command immediately exits a shell script and takes the
exit status as an argument.)
If the argument is valid, your script should compare the contents of the
directory with the contents of the most recent snapshot. We define the most
recent snapshot as the one with the highest timestamp in its name, regardless
of whether it’s daily
, weekly
, or something else. We define the contents
of a directory as all the non-hidden files (see note above) that are directly
contained within it.
In other words, you do not need to show new hidden files (ones that start with
.
), nor do you need to recurse into subdirectories. You may do either of
these if you so choose, however. (If you do decide to handle hidden files, try
to not list .snapshot/
itself as a new file!)
Your output should be a list of files that are present in the directory but not
in the snapshot, one per line. In other words, print out the files that have
been added since the snapshot was taken. These should be valid paths relative
to the given directory (so either file1
or ./file1
is acceptable). You may
find the comm
utility useful in generating this list; see man comm
.
Please include a #!
line at the beginning of your script that indicates it
should be executed with either /bin/sh
or /bin/bash
. (/bin/sh
points to
different shells on different systems but is always guaranteed to be a
POSIX-compliant shell; /bin/bash
is always Bash specifically.) And use the
error handling practices we encouraged in lecture.
Your program may write data to files as part of generating its result. If it
does, please use the mktemp
utility to create them. mktemp
generates a new
file with a random name inside /tmp
, a system directory designed to hold
temporary files that don’t need to stick around. We encourage you to remove any
temporary files you use once your script finishes, but you will not lose points
if you don’t.
We discussed in lecture how system calls (also known as “syscalls”) are the
primary way for userspace programs to communicate with the Linux kernel. Now
it’s time to get your hands dirty and make a syscall of your own. You’re going
to write a stripped-down version of ls
that prints the contents of a
directory.
This piece of the assignment involves writing C code, but we’re confident that
you’ll be able to do it with what you know from CS 15 and the
our-friendly-cat
implementation we showed in class. C is very similar to the
pieces of C++ you already know, and you can take a look at this old CS 40
lab to learn some of the
notable differences (plus details on argc
and argv
).
You’ll be calling some system functions from C, including one called fputs()
to print to stdout as well as a number of syscall shim functions (described
below). These functions are part of the C standard library (a.k.a. libc),
which is a set of function implementations that are available to any C program
running on a POSIX system. Nearly all C library functions have their own man
pages; don’t be afraid to use them!
Your program, myls.c
, should take as its only argument a path to a directory
and print the name of each file (see note above) inside that directory, one per
line. You may or may not include files starting with .
at your discretion. If
your program is run without a directory name or the given name is not a
directory, your program should print an error and return a nonzero exit code.
Internally, GNU’s implementation of ls
calls the readdir()
function from
libc. readdir()
behaves as a transparent wrapper around the readdir
syscall1. Read the man pages for opendir
(man 3p
opendir
2), readdir
(man 3p readdir
), and closedir
(man 3p
closedir
) to get an idea for how you might write your program. Although these
three functions are syscall shims, you don’t have to implement your entire
program directly using syscalls. Specifically, you’ll probably want to use
fputs
(man 3 fputs
) to print the filenames.
These functions can return a whole host of different errors in different conditions. If any of them returns a value that signals an error (check the man page for each one), you should print a helpful error message and exit immediately with a nonzero exit code.3
You can start with the following skeleton, or you can write your own:
#include <stdio.h>
int main(int argc, char **argv) {
if (argc != 2) {
fprintf(stderr, "Expected one argument but got %d instead.\n", argc-1);
return 1;
}
// your code here
}
As an extension, you may allow the user to omit the argument and have your program list the contents of the current working directory. This is not required for full marks.
Please submit your two files, whats-new.sh
and myls.c
, with provide
comp50isdt cli-constructive whats-new.sh myls.c
. You must be logged into the
homework server to use Provide.
At this point, you are done with the assignment. You need not read anything past this point if you don’t want to. However, if you’re looking for a challenge, or if you want to learn some tricks involving syscalls and shared libraries in Linux, feel free to take a stab at the following. Expect to do a lot of Googling here, as we have not taken care to define every term we use.
LD_PRELOAD
to intercept opendir
opendir()
and readdir()
are not themselves syscalls; instead, they’re
functions in the C standard library that invoke syscalls. When you run myls, it
loads the implementations of those functions from a shared object containing
the C standard library. (This shared object is generally located at
/usr/lib/libc.so.6
on GNU/Linux systems.)
In this exercise, you’ll use a fun feature of the GNU dynamic loader (whose job
it is to load shared objects at runtime) to replace the system’s
implementation of opendir()
with one you write yourself. To do this, you’ll
set an environment variable named LD_PRELOAD
to the path of a shared object
you write yourself in C.
When LD_PRELOAD
is set, the loader looks in that library first for any
library function your program uses. So, by setting it to a library you write
containing a function named opendir
, you can have your own code run when myls
(or any other program!) calls opendir()
.
Here’s how to compile a shared library that contains functions from a source
file named shim.c
and run myls with that library preloaded:
$ clang -shared -fpic shim.c -o shim.so
$ LD_PRELOAD="$(pwd)/shim.so" ./myls
<...>
$
Please note that the $(pwd)/
is essential, as LD_PRELOAD
does not allow
relative paths.
Try creating a shim with an opendir
implementation that first prints to the
terminal its name
argument and then calls the real libc version of opendir
,
returning its result so that programs like myls still get what they asked for.
Calling the real version of opendir()
is a bit tricky, since the name
opendir()
now refers to your shim. You can get around this using a function
called dlsym
(see man dlsym
).
When you run myls with this shim, you should see the path of the directory
you’re listing printed out before the rest of its output. This is your shim’s
code running when myls initially calls opendir()
!
Spoiler alert: David Knifehands did not kill Alexander Henshawe. But you’ve just heard that he’s been framed by the real murderer, and the authorities are after him! You have to save Dave by hiding the evidence before he’s caught by Mike Bauer & co, our eagle-eyed sysadmins. Mike will go spelunking through the directory tree (using system ls, not myls), looking for clues like a murder weapon.
For inexplicable reasons, Mike will load your shim when he runs ls
. In order
to interfere with his investigation, you must use your new LD_PRELOAD
skills
for good. This time, instead of just logging a function call, you’ll alter the
results!
Change your shim to intercept readdir()
instead of opendir()
, then see if
you can write an implementation of readdir()
that skips returning any
directory entries that contain the word “trapdoor”. That is, if ls (or myls)
would normally produce4:
carpet
trapdoor
another-trapdoor-wow
table
Instead, it should now produce:
carpet
table
Good luck!
What happens when you run find
, tree
, grep -r
, or some other command that
does directory traversals? Does your shim work? How do you know?
For example, try LD_PRELOAD="$(pwd)/shim.so" tree
/comp/50ISDT/murder-mystery/
.
In reality, some implementations use a different syscall called
getdents
under the hood, just like open()
uses openat
. This is an
implementation detail that shouldn’t concern you. ↩
Manual pages are divided into sections. Since some commands like
chmod
are both shell utilities and syscalls, referencing man pages by
name alone can be ambiguous. According to man man
, section 3 of the man
pages are for library calls (C functions). There is also a section for
POSIX specifications, section 3p. ↩
We won’t require you to use it, but you may find the perror
function (man 3 perror
) helpful. ↩
This example has a slightly different directory listing than the actual directories from the murder mystery, but it is only meant to be illustrative. ↩