In this part we cover the following topics:

awk

awk is not an abbreviation for "awkward", as many people think after the first few minutes they spend with this tool. In fact, it is an elegant and simple language. The word awk is derived from the initials of the language's three developers: A. Aho, B. W. Kernighan and P. Weinberger.
awk is an excellent filter and report writer, and in many cases it is easier to use awk than "conventional" programming languages like C or Python. We may wonder what distinguishes awk from sed.
sed is a stream editor. It works with streams of characters on a per-line basis. It uses pattern matching and address matching to take actions. It has a primitive programming language that includes goto-style loops and simple conditionals. There are essentially only two "variables": pattern space and hold space. Mathematical operations are barely possible, while string functions are not available at all. sed can be used when there are patterns in the text. For example, we could replace all the negative numbers in some text that are in the form "minus sign followed by a sequence of digits" (e.g., "-123.45") with their absolute values (e.g., "123.45").
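As a sketch (not from the original text), such a replacement could look like this in sed with extended regular expressions, assuming the numbers match the simple pattern described:

```shell
# Replace negative numbers such as "-123.45" with their absolute values:
# match a minus sign followed by digits (with an optional decimal part)
# and keep only the captured digits, dropping the sign.
echo "balance: -123.45 and -7" | sed -E 's/-([0-9]+(\.[0-9]+)?)/\1/g'
# prints: balance: 123.45 and 7
```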
awk is oriented toward delimited fields on a per-line basis. There is complete support for variables and one-dimensional associative arrays. There are some mathematical operations as well as some very basic string functions. It has C-style printf, allows us to define our own functions and has programming constructs including conditionals (if/else) and loops (for, while and do/while). It also uses pattern matching to take actions. awk can be used when the "text" has a rows-and-columns structure. For example, we could sum all negative values from the second column.
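Summing all negative values from the second column could be sketched as a one-liner (the sample data here is invented for illustration):

```shell
# Add $2 to the running sum only when it is negative,
# then print the total after all input has been read.
printf '%s\n' 'a -2 x' 'b 5 y' 'c -3 z' \
  | awk '$2 < 0 { sum += $2 } END { print sum }'
# prints: -5
```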
AWK follows a simple read, execute, and repeat workflow given below:
1. Execute commands from the BEGIN block.
2. Read a line from the input stream.
3. Execute commands on the previously read line.
4. If it's not the end of file, go to step 2.
5. Execute commands from the END block.
Looking at this workflow it should be clear why awk
is a perfect tool to generate simple formatted reports.
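A minimal sketch of this workflow as a report: a header printed in BEGIN, one action per line, and a summary in END (the data is invented for illustration):

```shell
# BEGIN runs once before any input is read, the middle block runs
# once per input line, and END runs once after the last line.
printf '%s\n' 10 20 30 | awk '
BEGIN { print "values:" }
      { total += $1; print "  " $1 }
END   { print "total: " total }'
```

This prints a "values:" header, each of the three numbers, and finally "total: 60".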
With the awk scripting language, we can:
- use variables;
- use string and most arithmetic operators we know from C language;
- use control flow and loops.
To be more precise, in awk we can use a lot of elements well known from the classic C language:
- the printf function for pretty printing, with
  - escape sequences,
  - format specifiers,
  - minimum field width specifiers,
  - left justification,
  - field precision value specifiers.
We can also send output to a named file instead of the standard output, with the following format:

printf([FORMAT], [ARGUMENTS]) > [OUTPUT_FILE]

- Flow control with next and exit. We can leave an awk script using the exit command. The second command, next, also changes the flow of the program: it stops processing of the current line; the program reads in the next line and starts executing the commands again on that new line.
- Numerical functions: cos, exp, int, log, sin, sqrt (some versions also provide atan, rand, srand).
- String functions:

index(string, search)
length(string)
split(string, array, separator)
substr(string, position)
substr(string, position, max)

- if for control flow; while and for for loops.
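A few of these string functions in action (a quick sketch; note that positions in awk are 1-based):

```shell
echo "hello world" | awk '{
    print index($0, "world")     # position of the substring: 7
    print length($0)             # 11
    print substr($0, 1, 5)       # hello
    n = split($0, parts, " ")    # n == 2, parts[2] == "world"
    print n, parts[2]
}'
```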
Below, a list and the syntax of all awk commands is given:

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ... }
variable = expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
next
exit
The awk command is used like this:

awk options program file
awk refers to the rows and columns as records and fields. Note that traditional awk implementations give us access only to the first 99 fields in a single line.
In awk there are two kinds of variables:
- User defined. A user defined variable is one we create.
- Positional. A positional variable is not a special variable, but a function triggered by the dollar sign $.
User defined variables can be defined before script execution and used throughout the execution of the script:

MBAPF:textdataprocessing fulmanp$ awk -v foo=3 'BEGIN {print foo}'
3
In this example:
- the -v option assigns a value to a variable; it allows assignment before the program execution,
- foo=3 is a definition of the foo variable,
- BEGIN {print foo} is a BEGIN block with only one command: print, intended to print the value of the foo variable.
They can also be defined inside one of the "regular" blocks:

MBAPF:textdataprocessing fulmanp$ awk 'BEGIN {foo=7; print foo}'
7
Positional variables allow us to access specified fields of the currently processed line. The variable $0 refers to the entire line that awk reads in. Having a data file as given below
MBAPF:textdataprocessing fulmanp$ echo '11 12 13
> 21 22 23
> 31 32 33
> 41 42 43' > data01.txt
MBAPF:textdataprocessing fulmanp$ cat data01.txt
11 12 13
21 22 23
31 32 33
41 42 43
we can write
MBAPF:textdataprocessing fulmanp$ awk '{print $0}' data01.txt
11 12 13
21 22 23
31 32 33
41 42 43
to print all of them.
Variables of the form $[POSITIVE NATURAL] address the POSITIVE NATURAL-th field of our data (remember the limit of 99 fields in a single line):
MBAPF:textdataprocessing fulmanp$ awk '{print $1,$3}' data01.txt
11 13
21 23
31 33
41 43
Notice that the last two examples can also be completed with the commands
MBAPF:textdataprocessing fulmanp$ awk '{print}' data01.txt
11 12 13
21 22 23
31 32 33
41 42 43
MBAPF:textdataprocessing fulmanp$ awk '{$2=""; print}' data01.txt
11  13
21  23
31  33
41  43
Notice that in the second case the results are similar, but not identical: the number of spaces between the values varies. There are two reasons for this. First, the actual number of fields does not change: setting a positional variable to an empty string does not delete the field; it is still there, but its contents have been deleted. Second, this is how AWK outputs the entire line: it places one field separator between each pair of fields, so the two separators surrounding the empty field end up side by side. As a result we get two spaces.
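We can verify that the field count is untouched by printing NF after clearing a field (a small check, not part of the original examples):

```shell
# NF is still 3 after $2 is emptied; rebuilding $0 puts one OFS
# between each pair of fields, hence the double space.
echo '11 12 13' | awk '{$2=""; print NF, "["$0"]"}'
# prints: 3 [11  13]
```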
A useful variable related to fields is NF (the number of fields):
MBAPF:textdataprocessing fulmanp$ awk '{print $NF}' data01.txt
13
23
33
43
MBAPF:textdataprocessing fulmanp$ awk '{print $(NF-1)}' data01.txt
12
22
32
42
With NF it is easy to print the last element of every line even if the number of fields differs from line to line:
MBAPF:textdataprocessing fulmanp$ echo '11
> 21 22
> 31 32 33
> 42 42 43 44' > data02.txt
MBAPF:textdataprocessing fulmanp$ cat data02.txt
11
21 22
31 32 33
42 42 43 44
MBAPF:textdataprocessing fulmanp$ awk '{print $NF}' data02.txt
11
22
33
44
Another "counter", the NR
(the number of records), tells us the number of records, or the line number. With this we can work with certain lines
MBAPF:textdataprocessing fulmanp$ awk '{if(NR>2){print $0}}' data01.txt
31 32 33
41 42 43
With awk we can process many different text files and, quite obviously, not all of them use whitespace as a field separator. We can easily change the field separator to any other character using the -F command line option.
MBAPF:textdataprocessing fulmanp$ echo '11 12,13,14 15
> 21 22,23,24 25
> 31 32,33,34 35' > data03.txt
MBAPF:textdataprocessing fulmanp$ cat data03.txt
11 12,13,14 15
21 22,23,24 25
31 32,33,34 35
MBAPF:textdataprocessing fulmanp$ awk '{print $3}' data03.txt
15
25
35
MBAPF:textdataprocessing fulmanp$ awk -F, '{print $3}' data03.txt
14 15
24 25
34 35
However, there is a way to do this without the command line option: instead, the variable FS can be set.
MBAPF:textdataprocessing fulmanp$ awk 'BEGIN{FS=","} {print $3}' data03.txt
14 15
24 25
34 35
Notice that if FS is not defined in the BEGIN block, the result is different:
MBAPF:textdataprocessing fulmanp$ awk '{FS=","; print $3}' data03.txt
15
24 25
34 35
The explanation for this is quite clear once we realize how awk works. It processes the file line by line: first it reads the whole line and splits it into fields, and only then executes the commands on it. If we change the field separator before a line is read, the change affects how that line is split. If we change it after the line has been read, the fields of the current line are not recomputed; the new separator only takes effect from the next line.
Consider the following examples
MBAPF:textdataprocessing fulmanp$ awk '{print $2 $3}' data01.txt
1213
2223
3233
4243
MBAPF:textdataprocessing fulmanp$ awk '{print $2, $3}' data01.txt
12 13
22 23
32 33
42 43
In the first case, the two positional parameters are concatenated and output without a space. In the second case, two fields are printed, and the output field separator is placed between them. By default this separator is a single space, but we can change it by modifying the variable OFS.
MBAPF:textdataprocessing fulmanp$ awk '{OFS=":"; print $2, $3}' data01.txt
12:13
22:23
32:33
42:43
awk reads one line (called a record in awk) at a time, and breaks up the line into fields. We can change awk's definition of a line by setting the RS variable. Note that if we set it to an empty string, awk switches to "paragraph mode", in which records are separated by blank lines.
MBAPF:textdataprocessing fulmanp$ awk 'BEGIN{RS=" "} {print ">"$0"<"}' data01.txt
>11<
>12<
>13
21<
>22<
>23
31<
>32<
>33
41<
>42<
>43
<
The default output record separator is a newline. It can be set to any sequence of characters with the ORS variable.
MBAPF:textdataprocessing fulmanp$ awk 'BEGIN{RS=" "} {print $0}' data01.txt | awk 'BEGIN{ORS="::"} {print $0}' > res.txt
MBAPF:textdataprocessing fulmanp$ cat res.txt
11::12::13::21::22::23::31::32::33::41::42::43::::
Notice that while ORS can be a sequence of characters like :: in the example above, RS traditionally takes only one character; if a longer string is given, only its first character is used, which is why RS="22" below behaves exactly like RS="2" (GNU awk, by contrast, treats a multi-character RS as a regular expression):
MBAPF:textdataprocessing fulmanp$ awk 'BEGIN{RS="2"} {print ">"$0"<"}' res.txt
>11::1<
>::13::<
>1::<
><
>::<
>3::31::3<
>::33::41::4<
>::43::::<
MBAPF:textdataprocessing fulmanp$ awk 'BEGIN{RS="22"} {print ">"$0"<"}' res.txt
>11::1<
>::13::<
>1::<
><
>::<
>3::31::3<
>::33::41::4<
>::43::::<
In awk we can use one-dimensional associative arrays. Associativity is good news, because it reduces coding time and makes difficult problems much simpler. Let's write a simple program which counts the number of word occurrences in a file. First we create a file
MBAPF:textdataprocessing fulmanp$ echo '11
> 22
> 22
> 33
> 33
> 33' > data04.txt
MBAPF:textdataprocessing fulmanp$ cat data04.txt
11
22
22
33
33
33
then we can count words with the following script, saved under the name script01.awk
{
    username[$1]++;
}

END {
    for (i in username) {
        print i":"username[i];
    }
}
Up to now we have given the awk program directly on the command line. Fortunately, awk provides the ability to read the program from a file with the -f parameter, which we will use now.
MBAPF:textdataprocessing fulmanp$ awk -f script01.awk data04.txt
22:2
11:1
33:3
Using pipes with the sort and head commands, we can select the two most frequent words this way
MBAPF:textdataprocessing fulmanp$ awk -f script01.awk data04.txt | sort -r | head -n 2
33:3
22:2
Imagine now that we have a file profits.txt
with our profits from the programs we sell in iTunes
January app01 10
February app02 2
February app03 15
March app02 3
April app01 7
May app03 12
May app03 5
May app02 8
June app01 10
July app01 20
August app02 5
August app03 15
September app03 4
October app02 15
November app01 8
December app02 17
December app02 3
December app03 5
December app01 9
and we want to calculate the most profitable month, the most profitable application and the most profitable application in every month.
- The most profitable month. The script profits_m.awk:

{
    data[$1] += $3;
}

END {
    for (i in data) {
        print i, data[i];
    }
}

MBAPF:textdataprocessing fulmanp$ awk -f profits_m.awk profits.txt | sort -nrk 2,2
December 34
May 25
July 20
August 20
February 17
October 15
June 10
January 10
November 8
April 7
September 4
March 3

- The most profitable application. The script profits_a.awk:

{
    data[$2] += $3;
}

END {
    for (i in data) {
        print i, data[i];
    }
}

MBAPF:textdataprocessing fulmanp$ awk -f profits_a.awk profits.txt | sort -nrk 2,2
app01 64
app03 56
app02 53

- The most profitable application in every month. To simplify our solution, two scripts will be used. First profits_ma_1.awk:

{
    data[$1":"$2] += $3;
}

END {
    for (i in data) {
        print i, data[i];
    }
}

and a second, profits_ma_2.awk (note that substr positions in awk are 1-based):

{
    if ($2 > data[substr($1,1,3)]) {
        data[substr($1,1,3)] = $2;
        name[substr($1,1,3)] = $1
    }
}

END {
    map["Jan"] = 1
    map["Feb"] = 2
    map["Mar"] = 3
    map["Apr"] = 4
    map["May"] = 5
    map["Jun"] = 6
    map["Jul"] = 7
    map["Aug"] = 8
    map["Sep"] = 9
    map["Oct"] = 10
    map["Nov"] = 11
    map["Dec"] = 12
    for (i in data) {
        print map[i], i, substr(name[i], index(name[i],":")+1), data[i];
    }
}

We can call them as shown below:

MBAPF:textdataprocessing fulmanp$ awk -f profits_ma_1.awk profits.txt | awk -f profits_ma_2.awk | sort -nk 1,1
1 Jan app01 10
2 Feb app03 15
3 Mar app02 3
4 Apr app01 7
5 May app03 17
6 Jun app01 10
7 Jul app01 20
8 Aug app03 15
9 Sep app03 4
10 Oct app02 15
11 Nov app01 8
12 Dec app02 20
So far we have used only two patterns: the special words BEGIN and END, without even calling them patterns. Patterns seem to be indispensable in a text processing tool, as we saw in the sed part. When we realize how they are used, we can conclude that patterns work like conditions in an environment that has no conditional statement. But we do have conditions in awk, so strictly speaking we don't need patterns: we can duplicate any of them with an if statement.
A pattern (or condition) is simply an abbreviated test. If the condition is true, the action is performed. All relational tests can be used as a pattern.
MBAPF:textdataprocessing fulmanp$ awk '{if(NR<=3){print}}' profits.txt
January app01 10
February app02 2
February app03 15
If we prefer, we can change the if statement into a condition, which shortens the code
MBAPF:textdataprocessing fulmanp$ awk 'NR<3 {print}' profits.txt
January app01 10
February app02 2
Besides conditional tests, we can also use regular expressions. Printing all lines that contain the sequence # comment from a file pattern.txt
# comment 1 line 1
# comment 1 line 2
begin
block 1 line 1
block 1 line 2
block 1 line 3
end
# comment 2 line 1
# comment 2 line 2
begin
block 2 line 1
block 2 line 2
block 2 line 3
end
is possible with the following command
MBAPF:textdataprocessing fulmanp$ awk '{if($0 ~ /# comment/) {print}}' pattern.txt
# comment 1 line 1
# comment 1 line 2
# comment 2 line 1
# comment 2 line 2
or more briefly
MBAPF:textdataprocessing fulmanp$ awk '$0 ~ /# comment/ {print}' pattern.txt
# comment 1 line 1
# comment 1 line 2
# comment 2 line 1
# comment 2 line 2
Truth be told, this type of test is so common that awk allows a third, even shorter format
MBAPF:textdataprocessing fulmanp$ awk '/# comment/ {print}' pattern.txt
# comment 1 line 1
# comment 1 line 2
# comment 2 line 1
# comment 2 line 2
Tests can be combined with the and (&&), or (||) and not (!) operators. Parentheses can also be added if we want to change the order of operators or to make a complex statement clearer.
A very useful variant of pattern is called the comma separated pattern and takes the form
/[TRIGGER_ACTION_PATTERN]/,/[STOP_ACTION_PATTERN]/ [ACTION]
This form defines, in one line, the condition to turn the action on and the condition to turn it off. That is, when a line containing TRIGGER_ACTION_PATTERN is seen, the ACTION is performed. Every line afterwards is also processed by ACTION, until a line containing STOP_ACTION_PATTERN is seen. That line is processed too, as the last one.
The following prints all lines between line containing begin
and another line containing end
.
MBAPF:textdataprocessing fulmanp$ awk '/begin/,/end/ {print}' pattern.txt
begin
block 1 line 1
block 1 line 2
block 1 line 3
end
begin
block 2 line 1
block 2 line 2
block 2 line 3
end
The following prints all lines between 4 and 6 (inclusively):
MBAPF:textdataprocessing fulmanp$ awk '(NR==4),(NR==6) {print}' pattern.txt
block 1 line 1
block 1 line 2
block 1 line 3
Note that we can have several patterns in a script and each one is independent of the others.
MBAPF:textdataprocessing fulmanp$ awk '/block/ {print}
> /block 1/ {print}' pattern.txt
block 1 line 1
block 1 line 1
block 1 line 2
block 1 line 2
block 1 line 3
block 1 line 3
block 2 line 1
block 2 line 2
block 2 line 3
In
awk
we define functions according to the following general format
function [NAME]([ARGUMENT1], ..., [ARGUMENTN]) {
    [BODY]
}
Consider the following function addLeadingDots, saved as function.awk
function addLeadingDots(line, lineLength) {
    currentLineLength = length(line)
    diff = lineLength - currentLineLength
    str = ""
    if (diff > 0) {
        for (i = 0; i < diff; i++) {
            str = str"."
        }
    }
    return str line
}

{
    print addLeadingDots($0, 11)
}
On executing this code, we get the following result
MBAPF:textdataprocessing fulmanp$ cat data02.txt
11
21 22
31 32 33
42 42 43 44
MBAPF:textdataprocessing fulmanp$ awk -f function.awk data02.txt
.........11
......21 22
...31 32 33
42 42 43 44
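One caveat worth adding (a common awk convention, not covered in the original text): variables used inside a function body, like str, diff and i above, are global unless they are declared as extra parameters. The usual idiom is to list them after the real arguments, separated by extra spaces, and never pass them at call sites:

```shell
# str, diff and i are extra parameters, so each call gets fresh
# local copies instead of sharing globals with the rest of the script.
printf '%s\n' 11 '21 22' | awk '
function addLeadingDots(line, lineLength,    str, diff, i) {
    diff = lineLength - length(line)
    str = ""
    for (i = 0; i < diff; i++) str = str "."
    return str line
}
{ print addLeadingDots($0, 8) }'
# prints:
# ......11
# ...21 22
```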