Prerequisites
Topics
Extras
Shell Extras
Pipes and Filters
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it
lets us combine existing programs in new ways. We’ll start with the intro-to-yens/
directory that contains a text file
numbers.txt
that contains 1000 lines of randomly generated numbers from 1 to 10.
Let’s go into that directory with cd
and run an example command wc numbers.txt
:
$ cd ~/Desktop/intro-to-yens/
$ wc numbers.txt
1000 1000 2086 numbers.txt
wc
is the ‘word count’ command: it counts the number of lines, words, and characters in files (from left to right,
in that order).
If we run wc -l
instead of just wc
, the output shows only the number of lines per file:
$ wc -l numbers.txt
1000 numbers.txt
Capturing output from commands
Which of these files contains the fewest lines? It’s an easy question to answer when there are only a few files, but what if there were 1000? Our first step toward a solution is to run the command:
$ wc -l *.txt > lengths.txt
The greater than symbol, >
, tells the shell to redirect the command’s output to a file instead of printing it to the screen.
(This is why there is no screen output: everything that wc
would have printed has gone into the file lengths.txt
instead.)
The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead
to data loss and thus requires some caution. ls
confirms that the file exists:
$ ls
We can now send the content of lengths.txt
to the screen using cat lengths.txt
. The cat
command gets its name from
‘concatenate’ i.e. join together, and it prints the contents of files one after another. There’s only one file in this case,
so cat
just shows us what it contains:
$ cat lengths.txt
1000 numbers.txt
3000 numbers2.txt
500 numbers3.txt
4500 total
Filtering output
Next we’ll use the sort
command to sort the contents of the lengths.txt
file. Note that
sort -n
option specifies a numerical rather than an alphanumerical sort. sort
does not change the file; instead,
it sends the sorted result to the screen:
$ sort -n lengths.txt
500 numbers3.txt
1000 numbers.txt
3000 numbers2.txt
4500 total
We can put the sorted list of lines in another temporary file called sorted-lengths.txt
by putting > sorted-lengths.txt
after the command, just as we used > lengths.txt
to put the output of wc
into lengths.txt
. Once we’ve done that,
we can run another command called head
to get the first few lines in sorted-lengths.txt
:
$ sort -n lengths.txt > sorted-lengths.txt
$ head -n 1 sorted-lengths.txt
500 numbers3.txt
Using -n 1
with head
tells it that we only want the first line of the file; -n 20
would get the first 20, and so on.
Since sorted-lengths.txt
contains the lengths of our files ordered from least to greatest, the output of head -n 1
must be the file with the fewest lines.
The head
command prints lines from the start of a file. tail
is similar, but prints lines from the end of a file instead.
Note, it’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. Doing something like this may give you incorrect results and/or delete the contents of the file.
We have seen the use of >
, but there is a similar operator >>
which works slightly differently.
We’ll learn about the differences between these two operators by printing some strings. We can use the echo
command
to print strings
$ echo hello
Run each command twice to see the difference between the two operators:
$ echo hello > testfile01.txt
$ echo hello >> testfile02.txt
Examine the output files. The >>
operator appends to the file if it already exists instead of overwriting the file as >
does.
Passing output to another command
In our example of finding the file with the fewest lines, we are using two intermediate files to store output.
This is a confusing way to work because even once you understand what wc
, sort
, and head
do, those intermediate
files make it hard to follow what’s going on. We can make it easier to understand by running sort
and head
together:
$ sort -n lengths.txt | head -n 1
500 numbers3.txt
The vertical bar, |
, between the two commands is called a pipe. It tells the shell that we want to use the output
of the command on the left as the input to the command on the right. This has removed the need for the sorted-lengths.txt
file.
A filter is a program like wc
or sort
that transforms a stream of input into a stream of output.
Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input,
do something with what they’ve read, and write to standard output.
Combining multiple commands
Nothing prevents us from chaining pipes consecutively. We can for example send the output of wc
directly to sort
,
and then the resulting output to head
. This removes the need for any intermediate files.
$ rm lengths.txt sorted-lengths.txt
We’ll start by using a pipe to send the output of wc
to sort
:
$ wc -l *.txt | sort -n
500 numbers3.txt
1000 numbers.txt
3000 numbers2.txt
4500 total
We can then send that output through another pipe, to head
, so that the full pipeline becomes:
$ wc -l *.txt | sort -n | head -n 1
500 numbers3.txt
Summary
wc
counts lines, words, and characters in its inputs.cat
displays the contents of its inputs.sort
sorts its inputs.head
displays the first 10 lines of its input.tail
displays the last 10 lines of its input.command > [file]
redirects a command’s output to a file (overwriting any existing content).command >> [file]
appends a command’s output to a file.[first] | [second]
is a pipeline: the output of the first command is used as the input to the second.- The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
Loops
Loops are a programming construct which allow us to repeat a command or set of commands for each item in a list. As such they are key to productivity improvements through automation. Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).
Suppose we would like to print out the number on the tenth line for each text file in intro-to-yens
directory. For each file,
we would need to execute the command head
and pipe this to tail -n 1
. We’ll use a loop to solve this problem,
but first let’s look at the general form of a loop:
for thing in list_of_things
do
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
and we can apply this to our example like this:
$ for filename in numbers.txt numbers2.txt numbers3.txt
> do
> head $filename | tail -n 1
> done
2
2
10
The shell prompt changes from $
to >
and back again as we were typing in our loop. The second prompt, >
,
is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;
, can be used to
separate two commands written on a single line.
When the shell sees the keyword for
, it knows to repeat a command (or group of commands) once for each item in a list.
Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the
commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the
variable’s value by putting $
in front of it. The $
tells the shell interpreter to treat the variable as a variable
name and substitute its value in its place, rather than treat it as text or an external command.
When using variables it is also possible to put the names into curly braces to clearly delimit the variable name:
$filename
is equivalent to ${filename}
. You may find this notation in other people’s programs.
Here’s a slightly more complicated loop:
$ for filename in *.txt
> do
> echo $filename
> head $filename | tail -n 1
> done
numbers.txt
2
numbers2.txt
2
numbers3.txt
10
The shell starts by expanding *.txt
to create the list of files it will process. The loop body then executes two
commands for each of those files. The first command, echo
, prints its command-line arguments (the name of the file)
to standard output. Then, the head
and tail
combination selects tenth line from each file.
Another way to repeat previous work is to use the history
command to get a list of the last few hundred commands
that have been executed, and then to use !123
(where ‘123’
is replaced by the command number) to repeat one of
those commands.
Summary
- A
for
loop repeats commands once for every thing in a list. - Every
for
loop needs a variable to refer to the thing it is currently operating on. - Use
$name
to expand a variable (i.e., get its value).${name}
can also be used. - Use
history
to display recent commands, and![number]
to repeat a command by number.
Finding Things
In the same way that many of us now use ‘Google’ as a verb meaning ‘to find’, Unix programmers often use the word grep
.
grep
is a contraction of ‘global/regular expression/print’, a common sequence of operations in early Unix text editors.
It is also the name of a very useful command-line program.
grep
finds and prints lines in files that match a pattern. For our example, we will use a file
swiss-parallel-bootstrap.R
containing bootstrapping code in intro-to-yens
directory.
$ grep ncore swiss-parallel-bootstrap.R
ncore <- detectCores()
print(paste('running on', ncore, 'cores'))
# register parallel backend to limit threads to the value specified in ncore variable
registerDoParallel(ncore)
Here, ncore
is the pattern we’re searching for. The grep
command searches through the file, looking for matches to
the pattern specified. To use it type grep
, then the pattern we’re searching for and finally the name of the file (or files)
we’re searching in.
The output is the four lines in the file that contain the letters ncore
.
By default, grep
searches for a pattern in a case-sensitive way.
A useful option is -n
, which numbers the lines that match:
$ grep -n ncore swiss-parallel-bootstrap.R
9:ncore <- detectCores()
11:print(paste('running on', ncore, 'cores'))
13:# register parallel backend to limit threads to the value specified in ncore variable
14:registerDoParallel(ncore)
Here, we can see that lines 9, 11, 13 and 14 contain the letters ncore
.
If we use the -r
(recursive) option, grep
can search for a pattern recursively through a set of files in subdirectories.
While grep
finds lines in files, the find
command finds files themselves.
Let’s run find .
. Remember to run this command from the intro-to-yens
directory.
$ find .
As always, the .
on its own means the current working directory, which is where we want our search to start.
find
’s output is the names of every file and directory under the current working directory.
Now let’s try matching by name:
$ find . -name "*.txt"
Put *.txt
in quotes to prevent the shell from expanding the *
wildcard. This command gives us a list of all text files
in or below the current directory.
How can we combine that with wc -l
to count the lines in all those files?
The simplest way is to put the find
command inside $()
:
$ wc -l $(find . -name "*.txt")
When the shell executes this command, the first thing it does is run whatever is inside the $()
.
It then replaces the $()
expression with that command’s output.
Summary
find
finds files with specific properties that match patterns.grep
selects lines in files that match patterns.$([command])
inserts a command’s output in place.
Shell Scripts
We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script.
We can put the for
loop we wrote earlier in a shell script called print_tenth_line.sh
.
The shell script should contain:
for filename in *txt
do
echo $filename
head $filename | tail -n 1
done
Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash
,
so we run the following command:
$ bash print_tenth_line.sh
numbers.txt
2
numbers2.txt
2
numbers3.txt
10
Sure enough, our script’s output is exactly what we would get if we ran that loop directly.
Connect with us