

In this part we cover the following topics


awk is not an abbreviation for awkward, as many people think after the first few minutes they spend with this tool. In fact, it is an elegant and simple language. The word awk is derived from the initials of the language's three developers: Alfred Aho, Peter Weinberger and Brian W. Kernighan.

awk is an excellent filter and report writer, and in many cases it is easier to use awk than "conventional" programming languages like C or Python. One may wonder: how does awk differ from sed?

sed is a stream editor. It works with streams of characters on a per-line basis. It uses pattern matching and address matching to take actions. It has a primitive programming language that includes goto-style loops and simple conditionals. There are essentially only two "variables": the pattern space and the hold space. Mathematical operations are barely possible, while string functions are absent altogether. sed can be used when there are patterns in the text. For example, we could replace all the negative numbers in some text that are in the form "minus sign followed by a sequence of digits" (e.g., "-123.45") with their absolute values (e.g., "123.45").

awk is oriented toward delimited fields on a per-line basis. There is complete support for variables and one-dimensional associative arrays. There are some mathematical operations as well as some very basic string functions. It has a C-style printf, allows user-defined functions, and has programming constructs including conditionals (if/else) and loops (for, while and do/while). It also uses pattern matching to take actions. awk can be used when the "text" has a rows-and-columns structure. For example, we could sum all negative values from the second column.
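For instance, a minimal sketch of that last task (the file name data.txt and its contents are invented here for illustration) could look like this:

```shell
# Sum all negative values found in the second column of a
# whitespace-separated file (data.txt is a hypothetical example).
printf '%s\n' "a 5" "b -3" "c -2.5" "d 1" > data.txt
awk '$2 < 0 { sum += $2 } END { print sum }' data.txt
# prints: -5.5
```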

AWK follows a simple Read, Execute, and Repeat workflow given below

  1. Execute commands from BEGIN block.
  2. Read a line from input stream.
  3. Execute commands on the line just read.
  4. If it is not the end of the file, go to step 2.
  5. Execute commands from END block.

Looking at this workflow, it should be clear why awk is a perfect tool for generating simple formatted reports.
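The workflow can be sketched with a toy report (the input data is invented for illustration):

```shell
# BEGIN runs once before any input, the middle block once per line,
# and END once after the last line -- exactly the workflow listed above.
printf '%s\n' "alice 10" "bob 20" |
awk 'BEGIN { print "name value" }
           { print $1, $2; total += $2 }
     END   { print "total", total }'
# prints a header, the two data lines, and finally: total 30
```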

With the awk scripting language, we can

  • use variables;
  • use strings and most of the arithmetic operators we know from the C language;
  • use control flow and loops.

To be more precise, in awk we can use many elements well known from the classic C language

  • printf function to pretty print with
    • escape sequences,
    • format specifiers,
    • minimum field width specifiers,
    • left justification,
    • field precision value specifiers.

    We can also send output to a named file instead of standard output, with the following format
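Although the original format line is not reproduced here, the standard awk redirection syntax works like this (out.txt is an assumed file name):

```shell
# Redirect printf output to a named file instead of standard output,
# combining a left-justified width-10 string with a precision-2 float.
echo "3.14159" | awk '{ printf "%-10s|%6.2f\n", "pi", $1 > "out.txt" }'
cat out.txt
# prints: pi        |  3.14
```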

  • Flow Control with next and exit
    We can exit from an awk script using the exit command. The second command, next, also changes the flow of the program. It causes processing of the current record to stop. The program reads in the next line, and starts executing the commands again with the new line.
  • Numerical functions cos, exp, int, log, sin, sqrt (for some versions there are also atan, rand, srand)
  • String functions

  • if for control flow
  • while and for for loops
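The next and exit commands described above can be sketched as follows (the sample lines are invented):

```shell
# next skips the remaining commands for the current line;
# exit stops reading input altogether (an END block, if present, still runs).
printf '%s\n' "# comment" "keep1" "STOP" "keep2" |
awk '/^#/ { next } /STOP/ { exit } { print }'
# prints only: keep1
```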

Below, a list of all awk commands and their syntax is given

The awk command is used like this
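The exact command line shown originally is missing here; the standard general form is:

```shell
# General form: awk 'pattern { action }' file ...
# Patterns select records; actions say what to do with them.
# A quick sanity check of the form:
echo "hello world" | awk '{ print $1 }'
# prints: hello
```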

awk refers to the rows and columns as records and fields. Note that awk gives us access to the first 99 fields in a single line


In awk there are two kinds of variables

  • User defined A user defined variable is one we create.
  • Positional A positional variable is not a special variable, but a function triggered by the dollar sign $.

User defined variables can be defined before script execution and used throughout the execution of the script

In this example

  • -v is an option assigning a value to a variable. It allows assignment before the program execution.
  • foo=3 is a definition of the foo variable.
  • BEGIN {print foo} is a BEGIN block with only one command: print, intended to print the value of the foo variable.
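Putting the pieces described above together, the command was presumably:

```shell
# Assign foo before the program runs, then print it in the BEGIN block.
awk -v foo=3 'BEGIN { print foo }'
# prints: 3
```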

They can also be defined inside one of the "regular" blocks

Positional variables allow us to access the specified fields of the currently processed line. The variable $0 refers to the entire line that awk reads in. Having a data file as given below

we can write

to print all of them.
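Since the data file and command are not reproduced here, a stand-in version (data.txt contents invented) is:

```shell
printf '%s\n' "one two" "three four" > data.txt
# $0 is the whole current line, so this copies the file to the output.
awk '{ print $0 }' data.txt
```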

Variables of the form $[POSITIVE NATURAL] address the corresponding field of our data (remember the limit of 99 fields in a single line)

Notice that the last two examples can also be accomplished with the commands

Notice that in the second case the results are similar, but not identical: the number of spaces between the values varies. There are two reasons for this. First, the actual number of fields does not change: setting a positional variable to an empty string does not delete the field. It is still there, but its contents have been deleted. The other reason is the way AWK outputs the entire line. The first example outputs three fields, while the second outputs two. Between each pair of fields there is a space; as a result we get two spaces.

A useful variable related to fields is NF (the number of fields)

With NF it is easy to print the last element of every line, even if the number of fields differs from line to line
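A sketch, with invented input of varying width:

```shell
# $NF is the value of the last field, whatever NF happens to be on that line.
printf '%s\n' "a b c" "x y" | awk '{ print $NF }'
# prints: c   then   y
```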

Another "counter", NR (the number of records), tells us the number of records read so far, i.e. the current line number. With this we can work with specific lines
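For example (input invented):

```shell
# NR is the current record (line) number; here we select line 2 only.
printf '%s\n' "first" "second" "third" | awk 'NR == 2 { print NR ": " $0 }'
# prints: 2: second
```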


The input field separator variable

With awk we can process many different text files and, quite obviously, not all of them use whitespace as a field separator. We can easily change the field separator to any other character using the -F command line option.

However, there is a way to do this without the command line option: instead, the variable FS can be set.

Notice that if FS is not defined in the BEGIN block, the result is different

The explanation for this is quite clear once we realize how awk works. It processes the file line by line: first it reads the whole line, prepares it for processing, and after that processes it (executes all commands). If we change the field separator before the line is read, the change affects how the line is split. If we change it after the line has been read, the line already read is not re-split; only the following lines are affected.
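This difference can be demonstrated with colon-separated data (a made-up two-line input):

```shell
# FS set in BEGIN affects every line, including the first...
printf 'a:b\nc:d\n' | awk 'BEGIN { FS=":" } { print $1 }'
# prints: a   then   c

# ...while FS set in the main block is one line too late for line 1:
printf 'a:b\nc:d\n' | awk '{ FS=":"; print $1 }'
# prints: a:b   then   c
```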

The output field separator variable

Consider the following examples

In the first case, the two positional parameters are concatenated together and output without a space. In the second case, two fields are printed, and the output field separator is placed between them. By default this separator is a single space, but we can change it by modifying the variable OFS.
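A sketch of both cases:

```shell
# A comma in print inserts OFS between fields; juxtaposition concatenates.
printf 'a b\n' | awk 'BEGIN { OFS="-" } { print $1, $2 }'
# prints: a-b
printf 'a b\n' | awk '{ print $1 $2 }'
# prints: ab
```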

The record separator variable

awk reads one line (called a record in awk) at a time, and breaks the line up into fields. We can change awk's definition of a line by setting the RS variable. If we set it to an empty string, awk switches to "paragraph mode": records are then separated by one or more blank lines.
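Paragraph mode can be sketched like this (a blank line separates the two records; note that in this mode a newline also acts as a field separator):

```shell
printf 'a\nb\n\nc\nd\n' |
awk 'BEGIN { RS="" } { print "record " NR ": " $1, $2 }'
# prints: record 1: a b   then   record 2: c d
```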

The output record separator variable

The default output record separator is a newline. It can be set to any sequence of characters with the ORS variable.

Notice that while ORS can be a sequence of characters like :: in the example above, RS traditionally can take only one character (some implementations, such as gawk, also allow a regular expression)
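An ORS sketch matching the :: separator mentioned above:

```shell
# Every print now ends with "::" instead of a newline.
printf 'a\nb\nc\n' | awk 'BEGIN { ORS="::" } { print }'
# prints: a::b::c::
```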


In awk we can use one-dimensional associative arrays. Associativity is good news, because it reduces coding time and makes difficult problems much simpler. Let's write a simple program which counts the number of word occurrences in a file. First we create a file

then we can count words with the following script, saved under the name script01.awk

Up to now we have given the awk program directly on the command line. Fortunately, awk provides the ability to read the program from a file with the -f parameter, which we will use now.
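The original script01.awk is not reproduced here; a plausible sketch of such a word counter (a reconstruction, not the original) is:

```shell
# script01.awk -- count occurrences of each word in the input.
cat > script01.awk <<'EOF'
{
    for (i = 1; i <= NF; i++)
        count[$i]++             # associative array keyed by the word itself
}
END {
    for (word in count)         # iteration order is unspecified
        print word, count[word]
}
EOF
printf 'a b a\nb a\n' > words.txt
awk -f script01.awk words.txt
```

Note that the order in which `for (word in count)` visits the keys is unspecified, so the output lines may appear in any order.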

Using pipes and the sort and head commands, we can select this way the two most frequently occurring words
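Assuming the counter prints "word count" pairs (as in the sketch above, with made-up input), the pipeline could be:

```shell
# Sort numerically, descending, on the count column; keep the top two.
printf 'a b a\nb a\nc\n' |
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print w, count[w] }' |
sort -k2,2nr | head -2
# prints: a 3   then   b 2
```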

Imagine now that we have a file profits.txt with our profits from the programs we sell on iTunes

and we want to calculate the most profitable month, the most profitable application and the most profitable application in every month.

  • The most profitable month

  • The most profitable application

  • The most profitable application every month
    To simplify our solution, two scripts will be used. First profits_ma_1.awk

    and a second profits_ma_2.awk

    We can call them as shown below
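Since the original scripts are not reproduced here, a hedged sketch of the first task (the most profitable month) gives the flavor; note that the layout of profits.txt below is a guess, as the real file is not shown:

```shell
# Hypothetical profits.txt layout: month, application, profit per line.
cat > profits.txt <<'EOF'
Jan app1 100
Jan app2 50
Feb app1 200
EOF
# Most profitable month: sum profit per month, then take the maximum.
awk '{ profit[$1] += $3 }
     END { for (m in profit)
               if (profit[m] > best) { best = profit[m]; month = m }
           print month, best }' profits.txt
# prints: Feb 200
```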


So far we have only used two patterns: the special words BEGIN and END, even without calling them patterns. Patterns seem to be indispensable in a text processing tool, as we saw in the sed part. When we realize how they are used, we may conclude that patterns act like conditions in an environment where conditional statements do not exist. But we do have conditions in awk, so strictly speaking we do not need patterns, because we can duplicate any of them using an if statement.

A pattern (or condition) is simply an abbreviated test. If the condition is true, the action is performed. All relational tests can be used as a pattern.

If we prefer, we can change the if statement into a condition, which shortens the code

Besides conditional tests, you can also use regular expressions. Printing all lines that contain the sequence # comment from a file pattern.txt

is possible with the following command

or more briefly

Truth be told, this type of test is so common that awk allows a third, shorter format
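All three equivalent forms side by side (the contents of pattern.txt are invented here):

```shell
printf '%s\n' "code here" "x = 1 # comment" "more code" > pattern.txt
awk '{ if ($0 ~ /# comment/) print $0 }' pattern.txt   # explicit if
awk '$0 ~ /# comment/ { print $0 }' pattern.txt        # regex as a pattern
awk '/# comment/' pattern.txt                          # shortest form
# each prints: x = 1 # comment
```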

Tests can be combined with the and (&&), or (||) and not (!) operators. Parentheses can also be added if we want to change operator precedence or to make a complex statement clearer.

A very useful variant of pattern is called the comma separated pattern and takes the form

This form defines, in one line, the condition that turns the action on and the condition that turns it off. That is, when a line containing TRIGGER_ACTION_PATTERN is seen, the ACTION is performed. Every line afterwards is also processed by the ACTION, until a line containing STOP_ACTION_PATTERN is seen. This line is processed too, as the last one.

The following prints all lines between a line containing begin and another line containing end.
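A sketch with invented input:

```shell
# The range pattern turns printing on at /begin/ and off after /end/.
printf '%s\n' "skip" "begin here" "inside" "end here" "after" |
awk '/begin/,/end/'
# prints: begin here / inside / end here
```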

The following prints all lines between 4 and 6 (inclusively):
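For example, with a seven-line input:

```shell
# NR-based range: lines 4 through 6, inclusive.
printf '%s\n' 1 2 3 4 5 6 7 | awk 'NR==4,NR==6'
# prints: 4 / 5 / 6
```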

Note that we can have several patterns in a script and each one is independent of the others.


In awk we define functions according to the following general format
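The standard general format (not necessarily the exact listing originally shown) is `function name(parameters) { body }`; a minimal working example:

```shell
awk '
function add(a, b) {        # user-defined function with two parameters
    return a + b            # the return statement is optional in general
}
BEGIN { print add(2, 3) }'
# prints: 5
```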

Consider the following addLeadingDots saved as function.awk
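The original function.awk is not reproduced here; the sketch below is only a guess at what addLeadingDots did (padding its argument with leading dots up to a fixed width), with all names and the width chosen for illustration:

```shell
cat > function.awk <<'EOF'
# addLeadingDots: pad str with leading dots up to width characters
# (a reconstruction; the original behavior was not shown).
# The extra parameter pad serves as a local variable, by awk convention.
function addLeadingDots(str, width,    pad) {
    pad = ""
    while (length(pad str) < width)
        pad = pad "."
    return pad str
}
{ print addLeadingDots($1, 10) }
EOF
echo "abc" | awk -f function.awk
# prints: .......abc
```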

On executing this code, we get the following result

Various examples