Piping hot commands
Pipes let programs work together by connecting the output from one to be the input for another. The term "output" has a precise meaning here: it is what the program writes to the standard output, via C program statements such as printf or the equivalent, and normally it appears on the terminal screen. And "input" is the standard input, usually coming from the keyboard. Pipes are built using a vertical bar ("|") as the pipe symbol.
Say you help your eccentric Aunt Hortense manage her private book collection. You have a file named books containing a list of her holdings, one per line, in the format "author:title", something like this:
$ cat books Carroll, Lewis:Through the Looking-Glass Shakespeare, William:Hamlet Bartlett, John:Familiar Quotations Mill, John Stuart:On Nature London, Jack:John Barleycorn Bunyan, John:Pilgrim's Progress, The Defoe, Daniel:Robinson Crusoe Mill, John Stuart:System of Logic, A Milton, John:Paradise Lost Johnson, Samuel:Lives of the Poets Shakespeare, William:Julius Caesar Mill, John Stuart:On Liberty Bunyan, John:Saved by Grace
This is somewhat untidy, as they are in no particular order. But we can use the
sort command to straighten that out:
$ sort books Bartlett, John:Familiar Quotations Bunyan, John:Pilgrim's Progress, The Bunyan, John:Saved by Grace Carroll, Lewis:Through the Looking-Glass Defoe, Daniel:Robinson Crusoe Johnson, Samuel:Lives of the Poets London, Jack:John Barleycorn Mill, John Stuart:On Liberty Mill, John Stuart:On Nature Mill, John Stuart:System of Logic, A Milton, John:Paradise Lost Shakespeare, William:Hamlet Shakespeare, William:Julius Caesar
Ah, now you have a list nicely sorted by author. How about getting a list just of authors, without titles? You can do that with the
$ cut -d: -f1 books Carroll, Lewis Shakespeare, William Bartlett, John Mill, John Stuart London, Jack Bunyan, John Defoe, Daniel Mill, John Stuart Milton, John Johnson, Samuel Shakespeare, William Mill, John Stuart Bunyan, John
A little explanation here. The
-d option chose a colon as the delimiter (separator). This tells
cut to break up each line wherever a delimiter appears, and each separate part of the line is called a field. In our format, the author's name appears as the first field, so we have put a 1 with the
-f option to tell
cut that we want to see just that field.
But you'll notice the list is unsorted again. Pipelines to the rescue!
$ sort books | cut -d: -f1 Bartlett, John Bunyan, John Bunyan, John Carroll, Lewis Defoe, Daniel Johnson, Samuel London, Jack Mill, John Stuart Mill, John Stuart Mill, John Stuart Milton, John Shakespeare, William Shakespeare, William
Voila! You've taken the alphabetized list, which is the output of the
sort command, and fed it as input to the
cut command. Don't give the
cut command a filename to use, because you want it to operate on the text that's piped out of the
Pipes are just that simple--text flows down the pipe from one command to the next.
How about if you wanted a sorted list of titles instead? Since the title is the second field, let's try using
-f2 with the
cut command instead of
$ sort books | cut -d: -f2 Familiar Quotations Pilgrim's Progress, The Saved by Grace Through the Looking-Glass Robinson Crusoe Lives of the Poets John Barleycorn On Liberty On Nature System of Logic, A Paradise Lost Hamlet Julius Caesar
Oops. What happened? When looking at a pipeline, you need to go left-to-right. In this case, we sorted the file first before extracting the titles. So it dutifully sorted the lines starting with the author at the beginning of each line. To get the titles in the proper order, you need to do the sort after extracting them:
$ cut -d: -f2 books | sort Familiar Quotations Hamlet John Barleycorn Julius Caesar Lives of the Poets On Liberty On Nature Paradise Lost Pilgrim's Progress, The Robinson Crusoe Saved by Grace System of Logic, A Through the Looking-Glass
Much better. Now this is all very nice, but you may be thinking you could have done these things with a spreadsheet. For simpler tasks, this is probably true. But suppose that Aunt Hortense is in the habit of asking odd questions about her collection. For example, she wants to know how many books she has from each author named John. A spreadsheet or other graphical program may have difficulty handling a request that wasn't anticipated by the program's authors. But the shell offers us many small, simple commands that can be combined in unforeseen ways to accomplish a complex task.
To find a particular string in a line of text, use the
grep command. Now remember that when you combine commands, they need to go in the proper order. You can't run
grep against the file first, because it will match the title "John Barleycorn" in addition to authors named John. So add it to the end of the pipeline:
$ cut -d: -f1 books | sort | grep "John" Bartlett, John Bunyan, John Bunyan, John Johnson, Samuel Mill, John Stuart Mill, John Stuart Mill, John Stuart Milton, John
This gets us close, but you don't want to get "Samuel Johnson" on the list and make Aunt Hortense angry. Often when working with
grep you will need to refine the matching text to get exactly what you need.
grep happens to offer a
-w option that will let it match "John" only when "John" is a complete word, not when it's part of "Johnson". But we'll solve this particular dilemma by adding a comma and space on the front of the string to match, so it will match only when John is a first name:
$ cut -d: -f1 books | sort | grep ", John" Bartlett, John Bunyan, John Bunyan, John Mill, John Stuart Mill, John Stuart Mill, John Stuart Milton, John
Ah, that's better. Now you just need to total up the number of books for each author. A little command called
uniq will work nicely. It removes duplicate lines (duplicates must be on consecutive lines, so be sure your text is sorted first), and when used with the
-c option also provides a count:
$ cut -d: -f1 books | sort | grep ", John" | uniq -c 1 Bartlett, John 2 Bunyan, John 3 Mill, John Stuart 1 Milton, JohnAnd there you are! A nicely sorted list of Johns and the number of books from each. For our example set this is a simple job, one you could even do with pencil and paper. But this very same pipeline can be used to process far more data--it won't blink even if Aunt Hortense has hundreds of thousands of books stored in the barn.
System administrators often use pipelines like these to deal with log files generated by web and mail servers. Such files can grow to tens or hundreds of megabytes in size, and a command pipeline can be a quick way to generate summary statistics without trying to read through the entire log.
A nice thing about building pipelines is that you can do it one command at a time, seeing exactly what effect each one has on the output. This can help you discover when you might need to tweak options or rearrange the order of commands. For instance, to put the authors in ranking order, you can just add a
sort -nr to the previous pipeline:
$ cut -d: -f1 books | sort | grep ", John" | uniq -c | sort -nr 3 Mill, John Stuart 2 Bunyan, John 1 Milton, John 1 Bartlett, John