
Small tools

In this part we will talk about a bunch of very simple tools. Unlike sed or awk, all of them are easy to learn and use, which is why we have collected them in one chapter. We will discuss the most frequently used options -- for more details please refer to the man pages.

In this part we cover the following tools: cut, grep, head, join, nl, paste, sort, split, tail and tr.


cut

The cut command is a command-line utility for cutting sections from each line of input and writing the result to standard output. It can cut parts of a line by byte position (-b), by character (-c), or by field (-f, with -d to specify a delimiter other than the default tab character). In each case a range must be provided, consisting of one of N, N-M, N- (N to the end of the line), or -M (beginning of the line to M), where N and M are counted from 1 (there is no zeroth value). Several ranges can be combined in a comma-separated list, for example 1,3-5.

The most useful options are listed below (--help and --version are skipped, as they are present in most UNIX commands)

  • -b, --bytes=RANGE Select only the bytes from each line as specified in RANGE. RANGE specifies a byte, a set of bytes, or a range of bytes, as described above.
  • -c, --characters=RANGE Select only the characters from each line as specified in RANGE.
  • -d, --delimiter=DELIM use character DELIM instead of a tab for the field delimiter.
  • -f, --fields=RANGE Select only the fields from each line as specified in RANGE. Also print any line that contains no delimiter character, unless the -s option is specified.
  • --complement complement the set of selected bytes, characters or fields.
  • -s, --only-delimited do not print lines not containing delimiters.
  • --output-delimiter=STRING use STRING as the output delimiter string. The default is to use the input delimiter.
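The range forms and the -d, -f, -s and -c options described above can be tried with a minimal sketch like the one below. The sample records and the variable names (data, fields, chars) are made up for illustration; only options available on both Linux and MacOS are used.

```shell
# Hypothetical sample data: three colon-separated records
# and one line that contains no delimiter at all.
data='alice:x:1001:Alice Smith
bob:x:1002:Bob Jones
carol:x:1003:Carol White
plain line without delimiter'

# Fields 1 and 3; the line without ':' is printed unchanged.
printf '%s\n' "$data" | cut -d ':' -f 1,3

# The same selection, but -s drops lines that contain no delimiter.
fields=$(printf '%s\n' "$data" | cut -d ':' -f 1,3 -s)
printf '%s\n' "$fields"

# Open-ended range N-: from field 3 to the end of the line.
printf '%s\n' "$data" | cut -d ':' -s -f 3-

# Range -M by character position: the first three characters of each line.
chars=$(printf '%s\n' "$data" | cut -c -3)
printf '%s\n' "$chars"
```

Note that the selected fields are always printed in their original order, so -f 3,1 gives the same result as -f 1,3.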


cut -- usage examples

  • To cut by byte position
  • To cut by character
    When the input stream is character based, -c can be a better option than selecting by bytes with -b, as characters are often more than one byte long. For example, the Polish letter Ą -- Latin Capital Letter A with Ogonek -- has the Unicode code point U+0104, which is encoded in UTF-8 as two bytes (c4 and 84).

    By using the -c option the character can be correctly selected along with any other characters that are of interest.

    This option seems to work incorrectly on Linux: GNU cut currently treats -c the same as -b, so a multibyte character may be split.

    --complement does not work on MacOS, but should work on Linux
  • To cut based on a delimiter (to cut by field)

    --output-delimiter does not work on MacOS, but should work on Linux
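The difference between bytes and characters mentioned above can be made visible with a small sketch. It builds the letter Ą from its two UTF-8 bytes (written in octal as \304\204, i.e. c4 84) and uses od to show what -b really selects; the variable names are hypothetical.

```shell
# "Ą" (U+0104) as its two UTF-8 bytes, followed by the letters "bc".
word=$(printf '\304\204bc')

# -b 1 selects only the first byte, i.e. half of the letter.
# od prints it in hex together with the trailing newline (0a).
first_byte=$(printf '%s\n' "$word" | cut -b 1 | od -An -tx1 | tr -d ' \n')
echo "$first_byte"

# -b 1-2 selects both bytes, so the whole letter survives.
letter=$(printf '%s\n' "$word" | cut -b 1-2)
echo "$letter"
```

Whether -c 1 returns the whole letter depends on the cut implementation and the locale, which is why it is not shown here.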


grep

The name grep comes from the ed command g/re/p (globally search for a regular expression and print the matching lines), but it is easier to think of grep as a search command for Unix systems. It’s used to search for text strings or, more generally, regular expressions within one or more files or an input stream.

grep is a simple tool, but despite this it has a lot of options. Listing all of them here would be pointless, as our goal is not to copy the man pages. I think it’s easiest to learn how to use the grep command from examples, so this is what I'm going to show next.


grep -- usage examples

For all of the examples, we’ll be using the following test file named data03.txt.

  • Search for a string in one or more files
  • Case-insensitive (with -i option) search for a string
  • Search for strings matching a regular expression
  • Reverse the meaning with -v option
  • Search for multiple patterns (note the use of egrep, equivalent to grep -E, in this case)
  • Show matching line numbers
  • Display matching filenames
  • Lines before and after grep match
  • Highlighting the search using --color option

    FoO should be highlighted somehow. On my terminal it's red.

  • Counting the lines that match
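The cases listed above can be sketched as follows. The file sample.txt is a hypothetical stand-in created in a temporary directory (it is not the data03.txt file), and the variable names are made up for illustration.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical test file with mixed-case content.
cat > "$tmp/sample.txt" <<'EOF'
foo bar
Foo Bar
FoO baz
nothing here
bar only
EOF

grep 'foo' "$tmp/sample.txt"                       # case-sensitive: one line
grep -i 'foo' "$tmp/sample.txt"                    # case-insensitive: three lines
grep '^[Ff]oo' "$tmp/sample.txt"                   # regular expression anchored at line start
inverted=$(grep -v -i 'foo' "$tmp/sample.txt")     # lines that do NOT match
grep -E 'baz|only' "$tmp/sample.txt"               # either of two patterns
numbered=$(grep -n 'bar' "$tmp/sample.txt")        # matches prefixed with line numbers
grep -l 'bar' "$tmp"/*.txt                         # only the names of matching files
grep -A 1 'FoO' "$tmp/sample.txt"                  # the match plus one line after it
count=$(grep -c -i 'foo' "$tmp/sample.txt")        # number of matching lines

printf '%s\n' "$inverted" "$numbered" "$count"
```

The -B and -C options work like -A, printing lines before the match or on both sides of it.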

That was a short overview of typical grep usage. More options can be found in the documentation.



head

head is a program on Unix systems used to display the beginning of a text file or a stream of data (by default it prints the first 10 lines). The general command syntax is typical and there are just a few options.

  • -c [-]K print the first K bytes of each file; with the leading -, print all but the last K bytes of each file.
  • -n [-]K print the first K lines instead of the first 10; with the leading -, print all but the last K lines of each file.
  • -q never print headers giving file names.
  • -v always print headers giving file names.

K may have a multiplier suffix

  • b 512,
  • kB 1000, K 1024,
  • MB 1000*1000, M 1024*1024,
  • GB 1000*1000*1000, G 1024*1024*1024,
  • and so on for T, P, E, Z, Y.

The complementary command for head is tail.


head -- usage examples

-n option with negative values does not work in MacOS

but works in Linux

-v option does not work in MacOS

but works in Linux
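A minimal sketch of the options that should behave the same on Linux and MacOS is given below. It assumes the seq command is available (it is not part of POSIX, but both systems ship it); the variable names are hypothetical.

```shell
# First three lines of a 20-line stream.
first3=$(seq 1 20 | head -n 3)
printf '%s\n' "$first3"

# First five bytes of the input.
first5bytes=$(printf 'abcdefghij\n' | head -c 5)
printf '%s\n' "$first5bytes"

# Without options head prints the first 10 lines.
default_count=$(seq 1 20 | head | wc -l | tr -d ' ')
echo "$default_count"

# GNU only (see the notes above): all but the last 5 lines.
# seq 1 20 | head -n -5
```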


join

The join command combines two files based on matching content found in the lines of each file. Using join is quite straightforward, yet it can save a lot of time and effort. To join two files, both must contain a common join field with identical values. The default join field is the first field, delimited by blanks (space or tab). join expects both files to be sorted on the join field before joining.

The most frequently used options include

  • -1 FIELD Join on this FIELD of file 1.
  • -2 FIELD Join on this FIELD of file 2.
  • -t CHAR Use CHAR as input and output field separator.
  • -o FORMAT Use FORMAT while constructing output line.
  • -j FIELD Equivalent to -1 FIELD -2 FIELD.
  • -i Ignore differences in case when comparing fields.
  • -a FILENUM Also, print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2.


join -- usage examples

  • The basic usage of the join command is without any options. All that is required is to specify two files as arguments. Having two files data06_A.txt and data06_B.txt with the following content

    the result is as below
  • Choosing field
    When the default first field no longer matches, we can modify the default behavior and join both files on other fields. For files data06_AA.txt and data06_BB.txt with the following content
  • Overriding default join format

    On Linux the following version (without multiple -o) should work
  • Dealing with non-pairable lines
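The four cases above can be sketched with small made-up files. The file names (a.txt, b.txt, c.txt) and their content are hypothetical and are not the data06 files; all inputs are already sorted on the join field.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical inputs, sorted on the first field.
printf '1 apple\n2 banana\n3 cherry\n' > "$tmp/a.txt"
printf '1 red\n3 dark-red\n4 green\n'  > "$tmp/b.txt"
# Same fruit data with the key in the second field.
printf 'apple 1\ncherry 3\n' > "$tmp/c.txt"

# Basic usage: join on field 1; only keys present in both files are printed.
joined=$(join "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$joined"

# Choosing the field: field 2 of the first file against field 1 of the second.
by_field=$(join -1 2 -2 1 "$tmp/c.txt" "$tmp/b.txt")
printf '%s\n' "$by_field"

# Overriding the output format: FILENUM.FIELD, here colour first, then fruit.
picked=$(join -o 2.2,1.2 "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$picked"

# Non-pairable lines: -a 1 also prints lines of the first file without a partner.
with_unpaired=$(join -a 1 "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$with_unpaired"
```

In the default output format the join field is printed first, followed by the remaining fields of the first file and then those of the second.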


nl

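In theory, nl numbers the lines in a file. In practice it does much more: it can choose which lines are numbered (-b), and set the number format (-n), width (-w), separator (-s), starting value (-v) and increment (-i).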


nl -- usage examples
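A minimal sketch of a few nl options, using made-up input; the variable names are hypothetical.

```shell
# Hypothetical three-line input with an empty line in the middle.
text='first

last'

# Default behavior: only non-empty lines get a number.
printf '%s\n' "$text" | nl
numbered_default=$(printf '%s\n' "$text" | nl | grep -c '[0-9]')

# -b a numbers all lines, including the empty one.
printf '%s\n' "$text" | nl -b a
numbered_all=$(printf '%s\n' "$text" | nl -b a | grep -c '[0-9]')

# -w sets the number width, -s the separator,
# -v the first number and -i the increment.
custom=$(printf 'x\ny\nz\n' | nl -w 2 -s '. ' -v 10 -i 5)
printf '%s\n' "$custom"
```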


paste

The paste command merges the corresponding lines of multiple files side-by-side.


paste -- usage examples

  • To display the contents of data06_AAA.txt and data06_BBB.txt, side-by-side, with the corresponding lines of each file separated by a tab we can use paste command in the following way
  • With -d we can change the delimiter placed between the merged lines
  • With the -s option the paste command pastes one file at a time instead of in parallel, which means that the files are merged sequentially. It reads all the lines from a single file and merges them into a single line, with the original lines separated by tabs. The resulting lines, one per file, are separated by newlines.

    On MacOS result is odd

    while on Linux seems to be correct

    The -s option is much clearer for one-column files
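The parallel and serial modes can be compared with a short sketch. The one-column files letters.txt and numbers.txt are hypothetical and are not the data06 files.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two hypothetical one-column files.
printf 'a\nb\nc\n' > "$tmp/letters.txt"
printf '1\n2\n3\n' > "$tmp/numbers.txt"

# Default: corresponding lines side by side, separated by a tab.
paste "$tmp/letters.txt" "$tmp/numbers.txt"

# -d replaces the tab with another delimiter.
side=$(paste -d ',' "$tmp/letters.txt" "$tmp/numbers.txt")
printf '%s\n' "$side"

# -s processes one file at a time: each file becomes a single output line.
serial=$(paste -s -d ',' "$tmp/letters.txt" "$tmp/numbers.txt")
printf '%s\n' "$serial"
```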


sort

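The sort command rearranges the lines in a text file so that they are sorted numerically or alphabetically.

By default, the rules for sorting are as follows (they depend on the locale; the third rule holds in locales such as en_US.UTF-8, while in the C locale lines are compared byte by byte, so uppercase letters come before lowercase ones)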

  • Lines starting with a number will appear before lines starting with a letter.
  • Lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
  • Lines starting with a lowercase letter will appear before lines starting with the same letter in uppercase.

sort has many options -- please refer to the man pages to get to know all of them. Only some of the most common examples are given below.


sort -- usage examples

Consider the following data07.txt file

  • To sort the lines in this file alphabetically, use the following command

    We can use the -o option to save the sorted result in a file
  • To sort the lines in reverse order
  • Checking for sorted order
  • Sorting based on selected fields of data
    Normally, sort decides how to sort lines based on the entire line: it compares every character from the first character in a line to the last one. Even leading whitespace matters

    To ignore leading blanks, use the -b option

    If we want sort to compare a limited subset of every line, we can specify which fields to compare using the -k option (fields are defined as anything separated by whitespace unless we specify another character with the -t option).

    Keep in mind that -k 3 means sort starting with column 3 rather than sort based (only) on column 3. If -k 3 is used, the sort key begins at column 3 and extends to the end of the line, spanning all the fields in between. If we want to sort based only on column 3, we should specify the starting field as well as the ending field (-k 3,3)

    We can do even more, and specify a start and end position by character in every field

    We may also sort the contents of a file based upon more than one column

    Because the following seems not to sort based on the first character from field 4

    we can try

    Next
  • To sort the contents numerically
  • Remove duplicates with -u option
  • Sort using human readable numbers
  • Merge already sorted files
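Several of the cases above can be sketched on a small made-up data set (it is not the data07.txt file). LC_ALL=C is set for every call so that the result does not depend on the locale of the machine; the variable names are hypothetical.

```shell
# Hypothetical data: name, quantity, colour (one duplicated line).
data='pear 12 green
apple 3 red
banana 101 yellow
apple 3 red'

# Alphabetical order and reverse order.
alpha=$(printf '%s\n' "$data" | LC_ALL=C sort)
reverse=$(printf '%s\n' "$data" | LC_ALL=C sort -r)

# -u removes duplicate lines.
unique=$(printf '%s\n' "$data" | LC_ALL=C sort -u)

# -k 2,2 limits the key to field 2 only. Compared as text, "101" < "12" < "3".
by_text=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2)

# Adding n to the key compares field 2 as a number: 3 < 12 < 101.
by_number=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2n)

# -c only checks the order and reports it through the exit status.
if printf '%s\n' "$data" | LC_ALL=C sort -c 2>/dev/null; then
    check='sorted'
else
    check='not sorted'
fi

printf '%s\n' "$alpha" "$by_text" "$by_number" "$check"
```

The contrast between by_text and by_number shows why a numeric key is needed whenever the column holds numbers of different lengths.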


split

The split command is used to split a file into pieces. By default a large file is divided into a set of smaller files of 1000 lines each, named with the prefix x followed by the suffixes aa, ab, ac, etc. (so the full file names are xaa, xab, xac, etc.).

Typically split accepts the following options

  • -a N use suffixes of length N (default 2)
  • -b SIZE put SIZE bytes per output file
  • -C SIZE put at most SIZE bytes of lines per output file
  • -d use numeric suffixes instead of alphabetic
  • -l NUMBER put NUMBER lines per output file
  • -x use hex suffixes instead of alphabetic
  • -n CHUNKS generate CHUNKS output files

SIZE is an integer optionally followed by one of the following multipliers: KB=1000 bytes, K=1024 bytes, MB=1000*1000 bytes, M=1024*1024 bytes, and so on for G, T, P, E, Z, Y.

CHUNKS may be

  • N split into N files based on size of input
  • K/N output Kth of N to stdout
  • l/N split into N files without splitting lines/records
  • l/K/N output Kth of N to stdout without splitting lines/records
  • r/N like l but use round robin distribution
  • r/K/N likewise but only output Kth of N to stdout

On MacOS another option is also available

  • -p PATTERN The file is split whenever an input line matches PATTERN, which is interpreted as an extended regular expression. The matching line will be the first line of the next output file.


split -- usage examples

  • Create dummy files
    • Two files with random human readable bytes
    • File with a line foo bar repeated 256 times

    The wc command used above displays the number of lines, words, and bytes contained in input file.

    Very nice information about generating dummy files can be found in How To Quickly Generate A Large File On The Command Line (With Linux) and How To Create Files Of A Certain Size In Linux.

  • Split file into pieces with a custom number of lines

  • Split file into pieces with a custom number of bytes

  • Create files with numeric suffix instead of alphabetic
    Unfortunately this option doesn't work on MacOS; should work on Linux

  • Create files with customized prefix

  • Divide file into chunks
    Unfortunately this option doesn't work on MacOS; should work on Linux

  • Create files with a custom suffix length
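The portable cases from the list above (-l, -b, -a and a custom prefix) can be sketched as follows. The file name dummy.txt and the prefixes part_ and chunk_ are hypothetical; everything happens in a temporary directory.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Dummy file: the line "foo bar" repeated 256 times (256 * 8 = 2048 bytes).
yes 'foo bar' | head -n 256 > "$tmp/dummy.txt"
wc "$tmp/dummy.txt"

# At most 100 lines per piece with the prefix "part_":
# part_aa (100 lines), part_ab (100 lines), part_ac (56 lines).
split -l 100 "$tmp/dummy.txt" "$tmp/part_"
wc -l "$tmp"/part_*

# 1024 bytes per piece with three-character suffixes: chunk_aaa, chunk_aab.
split -b 1024 -a 3 "$tmp/dummy.txt" "$tmp/chunk_"
ls "$tmp"/chunk_*

# Concatenating the pieces in name order restores the original file.
cat "$tmp"/part_* | cmp -s - "$tmp/dummy.txt" && echo 'pieces match the original'
```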


tail

The tail command is a command-line utility for printing the last part of files. By default tail returns the last ten lines of each file that it is given. Compared to head, tail has a few more options and one very useful feature: it can monitor changes to a file in real time.

The general syntax is as follows

  • -c [+|-]K Output the last K bytes. Numbers with a leading plus sign (+) are relative to the beginning of the input. Numbers with a leading minus sign (-) or no explicit sign are relative to the end of the input.
  • -n [+|-]K Output the last K lines, instead of the default last 10. A leading + or - sign has the same meaning as described for -c.
  • -f or --follow[={name|descriptor}] Output appended data as the file grows. This option causes tail to loop forever, checking for new data at the end of the file(s). When new data appears, it is printed. If we follow more than one file, a header is printed to indicate which file's data is being printed. If the file shrinks instead of growing, tail lets us know with a message. If we specify name, the file with that name is followed, regardless of its file descriptor. If we specify descriptor, the same file is followed, even if it is renamed. This is the default behavior.
    -f, --follow, and --follow=descriptor are equivalent.
  • --retry Keep trying to open a file even when it is or becomes inaccessible; useful when following by name, i.e., with --follow=name.
  • -F Same as --follow=name --retry.
  • -q Never output headers giving file names.
  • -v Always output headers giving file names.

Again, as for head, K may have a multiplier suffix

  • b 512,
  • kB 1000, K 1024,
  • MB 1000*1000, M 1024*1024,
  • GB 1000*1000*1000, G 1024*1024*1024,
  • and so on for T, P, E, Z, Y.

The complementary command for tail is head.


tail -- usage examples

The same but with option -F instead of -f
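The -n and -c options, including the leading + form, can be sketched as below. The sketch assumes the seq command is available; the variable names and the log file name in the final comment are hypothetical.

```shell
# Last three lines of a 20-line stream.
last3=$(seq 1 20 | tail -n 3)
printf '%s\n' "$last3"

# A leading + counts from the beginning: everything from line 18 onwards.
from18=$(seq 1 20 | tail -n +18)
printf '%s\n' "$from18"

# Last four bytes of the input.
last4bytes=$(printf 'abcdefghij' | tail -c 4)
printf '%s\n' "$last4bytes"

# Following a growing file runs until interrupted with Ctrl-C,
# so it is only shown here as a comment:
# tail -f some.log
```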


tr

The tr command is used to translate specified characters into other characters. It can also be used to delete specified characters or to squeeze repeated characters.

In contrast to many command line programs, tr does not accept file names as arguments (i.e., as input data). Instead, it only reads standard input, which can be redirected from a file or piped from the output of another program; it writes to standard output.

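The general syntax of tr (GNU version) is

    tr [OPTION]... SET1 [SET2]

In particular, on MacOS we have

    tr [-Ccsu] string1 string2
    tr [-Ccu] -d string1
    tr [-Ccu] -s string1
    tr [-Ccu] -ds string1 string2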

The first, designated set1, lists the characters in the text to be replaced or removed. The second, set2, lists the characters that are to be substituted for the characters listed in the first argument. If both set1 and set2 are specified and the -d option is not, the command replaces each character in set1 with the character in the same position in set2. Input characters in the string set1 are mapped to the corresponding characters in the string set2, so it is reasonable that both set1 and set2 should have equal length. If this is not the case, no error is generated, but two rules are applied to make them equal

  • If set2 is shorter than set1, then set2 is extended to the length of set1 by repeating its last character as many times as necessary.
  • If set2 is longer than set1, the excess characters in set2 are ignored.
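Both length rules can be observed with a two-line sketch. The variable names are hypothetical, and the padding rule is the behavior of the GNU and BSD (MacOS) versions of tr described above; other implementations may differ.

```shell
# set2 ("xy") is shorter than set1 ("abcd"): its last character is repeated,
# so a->x, b->y, c->y, d->y.
padded=$(printf 'abcd\n' | tr 'abcd' 'xy')
echo "$padded"

# set2 ("xyz") is longer than set1 ("ab"): the extra "z" is ignored,
# so only a and b are translated.
trimmed=$(printf 'abcd\n' | tr 'ab' 'xyz')
echo "$trimmed"
```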

More precisely, both sets can be specified in several ways

  • Enumeration of characters (see example below)
  • Using character ranges (see example below)
  • Using POSIX character classes. Each consists of a word (or abbreviation) surrounded by colons and then enclosed in a set of square brackets. So the sequence [:class:] represents all characters belonging to the defined character class, and class names are
    • alnum alphanumeric characters,
    • alpha alphabetic characters,
    • cntrl control (non-printing) characters,
    • digit numeric characters,
    • graph graphic characters,
    • lower lower-case alphabetic characters,
    • print printable characters,
    • punct punctuation characters,
    • space whitespace characters,
    • upper upper-case characters,
    • xdigit hexadecimal characters 0-9 A-F a-f.

    They can be used as shown in the example below

    Classes can be combined to form a more complex set, for example '[:lower:][:digit:]' (see example below)

We can also mix all of the above methods (see example below).

Typically tr accepts three options

  • -c Converts the set to the complement of the listed characters, i.e., operations apply to characters not in the given set.
  • -d Delete characters in the first set from the output.
  • -s Squeeze multiple occurrences of the characters listed in the last operand (either set1 or set2) in the input into a single instance of the character. This occurs after all deletion and translation is completed.

On MacOS two more options (-C, -u) are available (however, the -c option has a different meaning; -C on MacOS = -c on Linux)

  • -C Complement the set of characters in set1.
  • -c Same as -C but complement the set of values in string1.
  • -u Guarantee that any output is unbuffered.



tr -- usage examples

We will use the following test file data12.txt

  • Replace : with -

    Alternatively we can use pipe
  • Replace using enumeration of characters (replace more than one character)
  • Replace using character ranges
  • Delete specified characters
  • Squeeze repetition of characters
  • Complement the sets
  • Using POSIX character classes and mixed set specification
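  • Difference between -c and -C (on MacOS -C complements the set of characters while -c complements the set of byte values, so the two can differ only in multibyte locales; GNU tr treats both as the same option)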
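Most of the cases listed above can be sketched on a single made-up line (it is not the data12.txt file); the variable names are hypothetical.

```shell
# Hypothetical input line with colons, digits and repeated spaces.
line='id42:name:  Home   Dir'

# Replace : with - (input comes through a pipe; "tr ':' '-' < file" works too).
dashed=$(printf '%s\n' "$line" | tr ':' '-')

# Character ranges and POSIX classes: both convert lower case to upper case.
upper_range=$(printf '%s\n' "$line" | tr 'a-z' 'A-Z')
upper_class=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]')

# -d deletes the listed characters, here all digits.
no_digits=$(printf '%s\n' "$line" | tr -d '[:digit:]')

# -s squeezes runs of the same character into one, here spaces.
squeezed=$(printf '%s\n' "$line" | tr -s ' ')

# -c complements the set: delete everything that is NOT a letter or a newline.
letters=$(printf '%s\n' "$line" | tr -cd '[:alpha:]\n')

printf '%s\n' "$dashed" "$upper_range" "$no_digits" "$squeezed" "$letters"
```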