In this part we look at a collection of very simple tools. Unlike sed or awk, all of them are easy to learn and use, which is why we have collected them in one chapter. We discuss only the most frequently used options; for more details, please refer to the man pages.
In this part we cover the following topics:
- cut
- dd ???
- grep
- head
- join
- less ???
- nl ???
- od ???
- paste
- seq ???
- sort
- split
- tail
- tee ???
- tr
- uniq ???
- wc ???
cut
The cut command in UNIX is a command-line utility for cutting sections from each line of input and writing the result to standard output. It can cut parts of a line by byte position (-b), by character (-c), or by field (-f, with -d to specify a delimiter other than the default tab character). A range must be provided in each case, consisting of one of N, N-M, N- (N to the end of the line), or -M (beginning of the line to M), where N and M are counted from 1 (there is no zeroth value).
Below is a list of all usable options (except --help and --version, which are skipped because they are present in most UNIX commands):
-b, --bytes=RANGE
  Select only the bytes from each line as specified in RANGE. RANGE specifies a byte, a set of bytes, or a range of bytes, as described above.
-c, --characters=RANGE
  Select only the characters from each line as specified in RANGE.
-d, --delimiter=DELIM
  Use character DELIM instead of a tab for the field delimiter.
-f, --fields=RANGE
  Select only the fields from each line as specified in RANGE. Also print any line that contains no delimiter character, unless the -s option is specified.
--complement
  Complement the set of selected bytes, characters, or fields.
-s, --only-delimited
  Do not print lines not containing delimiters.
--output-delimiter=STRING
  Use STRING as the output delimiter string. The default is to use the input delimiter.
cut -- usage examples
- To cut by byte position
- To cut by character
  When the input stream is character based, -c can be a better option than selecting bytes with -b, because characters are often more than one byte long. In the following example the Polish letter Ą (Latin Capital Letter A with Ogonek) has the Unicode code point U+0104, which is encoded in UTF-8 as two bytes (c4 and 84).
  By using the -c option, the character can be selected correctly along with any other characters of interest.
  This option seems to work incorrectly on Linux.
  --complement does not work on MacOS, but should work on Linux.
- To cut based on a delimiter (to cut by field)
  --output-delimiter does not work on MacOS, but should work on Linux.
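The three selection modes can be sketched as follows. This is a minimal illustration, assuming a hypothetical colon-separated sample line; the sample text and the variable names are made up for this sketch and are not taken from the chapter's data files. Only ASCII input is used, so bytes and characters coincide.

```shell
# Hypothetical sample line; fields are separated by ':'
line='alpha:beta:gamma:delta'

# -b: bytes 1-5 of the line
by_bytes=$(printf '%s\n' "$line" | cut -b 1-5)

# -c: characters 7 to the end of the line (same as bytes for ASCII input)
by_chars=$(printf '%s\n' "$line" | cut -c 7-)

# -f with -d: fields 1 and 3, using ':' instead of the default tab delimiter
by_fields=$(printf '%s\n' "$line" | cut -d ':' -f 1,3)

printf '%s\n' "$by_bytes" "$by_chars" "$by_fields"
```

Several ranges can be combined in a comma-separated list, as in -f 1,3 above.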
grep
The name grep comes from the ed command g/re/p (globally search for a regular expression and print), but it is easier to think of grep as the search command for Unix systems. It is used to search for text strings or, more generally, regular expressions within one or more files or an input stream.
grep is a simple tool, but it nevertheless has a lot of options. Listing all of them here would be pointless, as our goal is not to copy the man pages. I think it is easiest to learn how to use the grep command from examples, so this is what I'm going to show next.
grep -- usage examples
For all of the examples, we'll be using the following test file named data03.txt.
- Search for a string in one or more files
- Case-insensitive search for a string (with the -i option)
- Search for a string matching a regular expression
- Reverse the meaning with the -v option
- Search for multiple patterns (mind the egrep usage in this case)
- Show matching line numbers
- Display matching filenames
- Lines before and after a grep match
- Highlighting the match using the --color option
  FoO should be highlighted somehow. On my terminal it's red.
- Counting the matching lines
That was a short overview of typical grep usage. More options can be found in the documentation.
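A few of the searches listed above can be sketched like this. The sample file is hypothetical (it is created on the fly and is not the chapter's data03.txt), and the variable names are only illustrative.

```shell
# Hypothetical sample file, created only for this sketch
sample=$(mktemp)
printf '%s\n' 'foo one' 'FoO two' 'bar three' 'baz four' > "$sample"

# Case-sensitive and case-insensitive (-i) search
exact=$(grep 'foo' "$sample")
nocase=$(grep -i 'foo' "$sample")

# Regular expression: lines starting with "ba"
regex=$(grep '^ba' "$sample")

# -v inverts the match and -c counts the selected lines
inverted_count=$(grep -v -c -i 'foo' "$sample")

# -n prefixes each matching line with its line number
numbered=$(grep -n 'bar' "$sample")

# Multiple patterns; grep -E is the current spelling of egrep
either=$(grep -E 'foo|baz' "$sample")

printf '%s\n' "$exact" "$nocase" "$regex" "$inverted_count" "$numbered" "$either"
rm -f "$sample"
```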
head
head is a program on Unix systems used to display the beginning of a text file or a stream of data (by default it prints the first 10 lines). The general command syntax is typical and there are just a few options.
-c [-]K
  Print the first K bytes of each file; with the leading -, print all but the last K bytes of each file.
-n [-]K
  Print the first K lines instead of the first 10; with the leading -, print all but the last K lines of each file.
-q
  Never print headers giving file names.
-v
  Always print headers giving file names.
K may have a multiplier suffix:
- b 512,
- kB 1000, K 1024,
- MB 1000*1000, M 1024*1024,
- GB 1000*1000*1000, G 1024*1024*1024,
- and so on for T, P, E, Z, Y.
The complementary command for head is the tail command.
head -- usage examples
The -n option with negative values does not work on MacOS but works on Linux.
The -v option does not work on MacOS but works on Linux.
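The options described above can be tried with a small sketch. It assumes a hypothetical 12-line input generated with seq; the file and variable names are illustrative. The negative line count is wrapped so that the sketch also runs where that form is not supported.

```shell
# Hypothetical input: the numbers 1..12, one per line
numbers=$(mktemp)
seq 1 12 > "$numbers"

# Default: the first 10 lines (wc -l counts them; tr strips padding)
default_count=$(head "$numbers" | wc -l | tr -d ' ')

# -n K: the first K lines
first_three=$(head -n 3 "$numbers")

# -c K: the first K bytes (here "1", newline, "2", newline)
first_bytes=$(head -c 4 "$numbers")

# -n -K (all but the last K lines) is not available everywhere,
# so a failure is tolerated here
all_but_last_ten=$(head -n -10 "$numbers" 2>/dev/null || true)

printf '%s\n' "$default_count" "$first_three" "$first_bytes" "$all_but_last_ten"
rm -f "$numbers"
```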
join
The join command combines two files based on matching content of the lines found in each file. Using the join command is quite straightforward, and it can save a lot of time and effort. To join two files with the join command, the files must share a common join field. The default join field is the first field, delimited by blanks (space or tab). join expects both files to be sorted on the join field before joining.
The most frequently used options include:
-1 FIELD
  Join on this FIELD of file 1.
-2 FIELD
  Join on this FIELD of file 2.
-t CHAR
  Use CHAR as input and output field separator.
-o FORMAT
  Use FORMAT while constructing the output line.
-j FIELD
  Equivalent to -1 FIELD -2 FIELD.
-i
  Ignore differences in case when comparing fields.
-a FILENUM
  Also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2.
join -- usage examples
- Basic usage of the join command is usage without any options. All that is required is to specify two files as arguments. Given two files, data06_A.txt and data06_B.txt, with the following content
  the result is as below
- Choosing a field
  When the default first join field no longer matches, we can modify the default behavior and join both files on other fields. For files data06_AA.txt and data06_BB.txt with the following content
- Overriding the default join format
  On Linux the following version (without multiple -o) should work
- Dealing with non-pairable lines
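The cases above can be sketched with two small files. The files, their content, and the variable names are hypothetical (they are not the chapter's data06_*.txt files); both inputs are already sorted on the join field.

```shell
# Hypothetical input files, both sorted on the first field (the join field)
left=$(mktemp)
right=$(mktemp)
printf '%s\n' '1 apple' '2 banana' '3 cherry' > "$left"
printf '%s\n' '1 red' '2 yellow' '4 purple' > "$right"

# Default: join on field 1 of both files
paired=$(join "$left" "$right")

# -a 1 also prints lines of the first file that have no partner
with_unpaired=$(join -a 1 "$left" "$right")

# -o selects output fields, each written as FILENUM.FIELD
names_only=$(join -o 1.2,2.2 "$left" "$right")

printf '%s\n' "$paired" "$with_unpaired" "$names_only"
rm -f "$left" "$right"
```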
nl
In theory, nl numbers the lines in a file. In practice it does much more.
nl -- usage examples
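A minimal sketch of a few nl features, assuming a hypothetical three-line input with an empty line in the middle. It shows the body-numbering style (-b), the number width (-w), and the separator (-s); the variable names are illustrative.

```shell
# Hypothetical input: two text lines with an empty line between them
text=$(printf 'first\n\nsecond')

# Default body numbering (-b t) skips empty lines: two lines get a number
numbered_default=$(printf '%s\n' "$text" | nl | grep -c '[0-9]')

# -b a numbers every line, including the empty one
numbered_all=$(printf '%s\n' "$text" | nl -b a | grep -c '[0-9]')

# -w sets the width of the number, -s the separator placed after it
custom=$(printf 'first\nsecond\n' | nl -w 1 -s ': ')

printf '%s\n' "$numbered_default" "$numbered_all" "$custom"
```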
paste
The paste command merges the corresponding lines of multiple files side by side.
paste -- usage examples
- To display the contents of data06_AAA.txt and data06_BBB.txt side by side, with the corresponding lines of each file separated by a tab, we can use the paste command in the following way
- With -d we can change the delimiter placed between the merged lines
- With the -s option, the paste command pastes one file at a time instead of in parallel; that is, the files are merged sequentially. paste reads all the lines of a single file and merges them into a single line, with the original lines separated by tabs. The resulting single lines, one per file, are separated by newlines.
  On MacOS the result is odd
  while on Linux it seems to be correct
  The -s option is much clearer for one-column files
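The parallel and serial modes can be compared with a short sketch. It assumes two hypothetical one-column files created on the fly (not the chapter's data06_*.txt files); the variable names are illustrative, and the serial result is the one expected on Linux.

```shell
# Hypothetical one-column input files
names=$(mktemp)
colors=$(mktemp)
printf '%s\n' 'apple' 'banana' > "$names"
printf '%s\n' 'red' 'yellow' > "$colors"

# Default: corresponding lines side by side, separated by a tab
side_by_side=$(paste "$names" "$colors")

# -d replaces the tab with another delimiter
with_comma=$(paste -d ',' "$names" "$colors")

# -s merges each file into one line of its own
serial=$(paste -s -d ',' "$names" "$colors")

printf '%s\n' "$side_by_side" "$with_comma" "$serial"
rm -f "$names" "$colors"
```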
sort
The sort command rearranges the lines in a text file so that they are sorted numerically or alphabetically.
By default (the exact order depends on the current locale), the rules for sorting are:
- Lines starting with a number will appear before lines starting with a letter.
- Lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
- Lines starting with a lowercase letter will appear before lines starting with the same letter in uppercase.
sort has many options; please refer to the man pages to get to know all of them. Below, only some of the most common examples are given.
sort -- usage examples
Consider the following data07.txt file
- To sort the lines in this file alphabetically, use the following command
  We can use the -o option to save the sorting result in a file
- To sort the lines in reverse order
- Checking for sorted order
- Sorting based on selected fields of data
  Normally, sort decides how to sort lines based on the entire line: it compares every character from the first character in a line to the last one. Even leading whitespace matters.
  To ignore leading blanks, use the -b option.
  If we want sort to compare a limited subset of every line, we can specify which fields to compare using the -k option (fields are defined as anything separated by whitespace, unless we specify another character with the -t option).
  Keep in mind that -k 3 means "sort starting with column 3" rather than "sort based (only) on column 3". If -k 3 is used, the sort key begins at column 3 and extends to the end of the line, spanning all the fields in between. If we want to sort based only on column 3, we should specify the starting field as well as the ending field.
  We can do even more, and specify a start and end position by character within every field.
  We may also sort the contents of a file based on more than one column.
  Because the following does not seem to sort based on the first character of field 4
  we can try
- To sort the contents numerically
- Remove duplicates with the -u option
- Sort using human-readable numbers
- Merge already sorted files
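Some of the cases above can be sketched as follows. The sample data (name, department, salary) and the variable names are hypothetical and not the chapter's data07.txt; LC_ALL=C is set only to make the ordering independent of the user's locale.

```shell
# Hypothetical sample data: name, department, salary
data=$(mktemp)
printf '%s\n' 'carol sales 300' 'alice tech 1200' 'bob sales 45' 'alice tech 1200' > "$data"

# Whole-line comparison, ascending and reversed (-r)
alphabetical=$(LC_ALL=C sort "$data")
reversed=$(LC_ALL=C sort -r "$data")

# -k 3,3 restricts the key to field 3 only; -n compares it numerically
by_salary=$(LC_ALL=C sort -k 3,3 -n "$data")

# -u drops duplicate lines
unique=$(LC_ALL=C sort -u "$data")

# Two keys: department first, then name within a department
by_dept_then_name=$(LC_ALL=C sort -k 2,2 -k 1,1 "$data")

# -c only checks the order and fails if the input is not sorted
if LC_ALL=C sort -c "$data" 2>/dev/null; then is_sorted=yes; else is_sorted=no; fi

printf '%s\n' "$by_salary" "$unique" "$is_sorted"
rm -f "$data"
```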
split
The split command is used to split a file into pieces. By default, a large file is divided into a set of smaller files of 1000 lines each, named with the default prefix x followed by the suffixes aa, ab, ac, etc. (so the full file names are xaa, xab, xac, etc.).
Typically split accepts the following options:
-a N
  Use suffixes of length N (default 2).
-b SIZE
  Put SIZE bytes per output file.
-C SIZE
  Put at most SIZE bytes of lines per output file.
-d
  Use numeric suffixes instead of alphabetic.
-l NUMBER
  Put NUMBER lines per output file.
-x
  Use hex suffixes instead of alphabetic.
-n CHUNKS
  Generate CHUNKS output files.
SIZE may be (or may be an integer optionally followed by) one of the following: KB = 1000 bytes, K = 1024 bytes, MB = 1000*1000 bytes, M = 1024*1024 bytes, and so on for G, T, P, E, Z, Y.
CHUNKS may be:
N
  Split into N files based on the size of the input.
K/N
  Output the Kth of N to stdout.
l/N
  Split into N files without splitting lines/records.
l/K/N
  Output the Kth of N to stdout without splitting lines/records.
r/N
  Like l, but use round-robin distribution.
r/K/N
  Likewise, but only output the Kth of N to stdout.
On MacOS another option is also available:
-p PATTERN
  The file is split whenever an input line matches PATTERN, which is interpreted as an extended regular expression. The matching line will be the first line of the next output file.
split -- usage examples
- Create dummy files
  - Two files with random human-readable bytes
  - A file with the line foo bar repeated 256 times
  The wc command used above displays the number of lines, words, and bytes contained in the input file. Very useful information about generating dummy files can be found in How To Quickly Generate A Large File On The Command Line (With Linux) and How To Create Files Of A Certain Size In Linux.
- Split a file into pieces with a custom number of lines
- Split a file into pieces with a custom number of bytes
- Create files with numeric suffixes instead of alphabetic ones
  Unfortunately this option doesn't work on MacOS; it should work on Linux.
- Create files with a customized prefix
- Divide a file into chunks
  Unfortunately this option doesn't work on MacOS; it should work on Linux.
- Create files with a custom suffix length
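A minimal sketch of line-based and byte-based splitting, with a custom prefix and suffix length. It works in a scratch directory and assumes a hypothetical input file input.txt holding the line foo bar 256 times (2048 bytes); the prefixes part_ and bytes_ and the variable names are made up for this sketch.

```shell
# Work in a scratch directory so the generated pieces are easy to remove
origdir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: the line "foo bar" repeated 256 times (8 bytes each)
awk 'BEGIN { for (i = 0; i < 256; i++) print "foo bar" }' > input.txt

# -l: 100 lines per piece, custom prefix "part_"
split -l 100 input.txt part_
line_pieces=$(echo part_*)

# -b: 1024 bytes per piece; -a 3: suffixes of length 3
split -b 1024 -a 3 input.txt bytes_
byte_pieces=$(echo bytes_*)

# Concatenating the pieces in name order restores the original file
roundtrip=no
if cat part_* | cmp -s - input.txt; then roundtrip=yes; fi

printf '%s\n' "$line_pieces" "$byte_pieces" "$roundtrip"
cd "$origdir" && rm -rf "$workdir"
```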
tail
The tail command is a command-line utility for printing the last part of files. By default tail returns the last ten lines of each file that it is given. Compared to head, tail has a few more options and one very useful feature that allows it to be used for monitoring file changes in real time.
The general syntax is as follows
-c [+|-]K
  Output the last K bytes. Numbers with a leading plus sign (+) are relative to the beginning of the input. Numbers with a leading minus sign (-) or no explicit sign are relative to the end of the input.
-n [+|-]K
  Output the last K lines, instead of the default last 10. A leading + or - sign may be used with the meaning described for -c.
-f or --follow[={name|descriptor}]
  Output appended data as the file grows. This option causes tail to loop forever, checking for new data at the end of the file(s). When new data appears, it is printed. If we follow more than one file, a header is printed to indicate which file's data is being printed. If the file shrinks instead of growing, tail lets us know with a message. If we specify name, the file with that name is followed, regardless of its file descriptor. If we specify descriptor, the same file is followed, even if it is renamed. This is the default behavior. -f, --follow, and --follow=descriptor are equivalent.
--retry
  Keep trying to open a file even when it is or becomes inaccessible; useful when following by name, i.e., with --follow=name.
-F
  Same as --follow=name --retry.
-q
  Never output headers giving file names.
-v
  Always output headers giving file names.
Again, as for head, K may have a multiplier suffix:
- b 512,
- kB 1000, K 1024,
- MB 1000*1000, M 1024*1024,
- GB 1000*1000*1000, G 1024*1024*1024,
- and so on for T, P, E, Z, Y.
The complementary command for tail is the head command.
tail -- usage examples
The same, but with the -F option instead of -f
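The -n and -c options with and without a leading + can be sketched as follows, assuming a hypothetical 12-line input generated with seq (the file and variable names are illustrative). Following a file with -f is only mentioned in a comment, because it would keep running until interrupted.

```shell
# Hypothetical input: the numbers 1..12, one per line
numbers=$(mktemp)
seq 1 12 > "$numbers"

# -n K: the last K lines
last_three=$(tail -n 3 "$numbers")

# -n +K: everything from line K to the end
from_line_11=$(tail -n +11 "$numbers")

# -c K: the last K bytes (here "12" and the final newline)
last_bytes=$(tail -c 3 "$numbers")

# tail -f "$numbers" would print new lines as they are appended
# and run until interrupted, so it is left out of this sketch

printf '%s\n' "$last_three" "$from_line_11" "$last_bytes"
rm -f "$numbers"
```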
tr
The tr command is used to translate specified characters into other characters. It can also be used to delete specified characters or to squeeze repeated characters.
In contrast to many command-line programs, tr does not accept file names as arguments (i.e., as input data). Instead, it only accepts input from standard input, for example through redirection or from the output of another program via a pipe; it writes to standard output.
The general syntax of tr is
in particular, on MacOS we have
The first argument, designated set1, lists the characters in the text to be replaced or removed. The second, set2, lists the characters that are to be substituted for the characters listed in the first argument. If both set1 and set2 are specified and the -d option is not, the command replaces each character in set1 with the character in the same position in set2. Since input characters in set1 are mapped to the corresponding characters in set2, it is reasonable that set1 and set2 should have equal length. If this is not the case, no error is generated, but two rules are applied to make them equal:
- If the length of set2 exceeds the length of set1, the excess characters in set2 are ignored.
- If the length of set2 is less than the length of set1, set2 is extended to the length of set1 by repeating its last character as many times as necessary.
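The two length rules can be checked with a short sketch. The input string abc and the variable names are illustrative only; the padding behaviour shown is the one described above.

```shell
# set2 longer than set1: the extra character z is ignored
truncated=$(printf 'abc\n' | tr 'ab' 'xyz')

# set2 shorter than set1: its last character x is repeated
padded=$(printf 'abc\n' | tr 'abc' 'x')

printf '%s\n' "$truncated" "$padded"
```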
More precisely, both sets can be specified not only by individual characters but also by
- Enumeration of characters (see example below)
- Character ranges (see example below)
- POSIX character classes. Each consists of a word (or abbreviation) surrounded by colons and then enclosed in a set of square brackets. So the sequence [:class:] represents all characters belonging to the defined character class, and the class names are:
  - alnum: alphanumeric characters,
  - alpha: alphabetic characters,
  - cntrl: control (non-printing) characters,
  - digit: numeric characters,
  - graph: graphic characters,
  - lower: lower-case alphabetic characters,
  - print: printable characters,
  - punct: punctuation characters,
  - space: whitespace characters,
  - upper: upper-case characters,
  - xdigit: hexadecimal digits (0-9, A-F, a-f).
  They can be used as shown in the example below.
  Classes can be combined to form a more complex set, for example '[:lower:][:digit:]' (see example below).
We can also mix all of the above methods (see example below).
Typically tr accepts three options:
-c
  Use the complement of set1, i.e., operations apply to characters not in the given set.
-d
  Delete characters in the first set from the output.
-s
  Squeeze multiple occurrences of the characters listed in the last operand (either set1 or set2) in the input into a single instance of the character. This occurs after all deletion and translation is completed.
On MacOS two more options (-C, -u) are available (however, the -c option has a different meaning there; -C on MacOS = -c on Linux):
-C
  Complement the set of characters in set1.
-c
  Same as -C, but complement the set of values in set1.
-u
  Guarantee that any output is unbuffered.
tr -- usage examples
We will use the following test file data12.txt
- Replace : with a -
  Alternatively we can use a pipe
- Replace using enumeration of characters (replace more than one character)
- Replace using character ranges
- Delete specified characters
- Squeeze repetition of characters
- Complement the sets
- Using POSIX character classes and mixed set specification
- Difference between -c and -C (who can explain this?)
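Most of the cases above can be sketched on a single line of text. The sample string is hypothetical (it is not the content of data12.txt) and contains two consecutive spaces on purpose; the variable names are illustrative.

```shell
# Hypothetical sample text with two spaces before the digits
text='user:name  123'

# Replace ':' with '-'
replaced=$(printf '%s\n' "$text" | tr ':' '-')

# Character classes: lower case to upper case
upper=$(printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]')

# Character range: shift a-c to x-z
shifted=$(printf 'abcd\n' | tr 'a-c' 'x-z')

# -d deletes; here a class mixed with an enumerated character
mixed=$(printf '%s\n' "$text" | tr -d '[:digit:]:')

# -s squeezes runs of the listed character into one
squeezed=$(printf '%s\n' "$text" | tr -s ' ')

# -c complements set1: delete everything that is not a digit
only_digits=$(printf '%s\n' "$text" | tr -cd '[:digit:]')

printf '%s\n' "$replaced" "$upper" "$shifted" "$mixed" "$squeezed" "$only_digits"
```

Note that tr -cd '[:digit:]' also removes the trailing newline, because a newline is not a digit.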