Wordy in Awk: Complete Solution & Deep Dive Guide


Mastering Text Parsing in Awk: The Ultimate Guide to Solving Wordy Math Problems

Solving wordy math problems in Awk is a classic text-processing challenge. The solution involves using pattern matching and string functions like gsub and split to tokenize the input, extract numbers and operators, and then iteratively apply the calculations in sequence to produce the final integer result.

Ever stared at a log file or a text document filled with human-like sentences and wished you could just ask it a question? Sentences like "User session lasted for 5 plus 13 minutes" contain valuable data, but it's trapped in unstructured text. Extracting and calculating that `5 + 13` is a common, yet often frustrating, task for many developers. It feels like you need a complex natural language processing library, but what if you could do it with a simple, powerful command-line tool that's likely already on your system?

This is where the elegance of Awk shines. In this deep-dive guide, we'll tackle the "Wordy" problem from the exclusive kodikra.com learning curriculum. We will build a robust Awk script from the ground up that can parse questions like "What is 7 minus 5?" and return the correct answer. You'll not only get a complete solution but also understand the core principles of text processing, state management, and error handling in Awk that are applicable to countless real-world data extraction tasks.

What Is This Wordy Challenge and Why Use Awk?

The "Wordy" problem is a quintessential parsing exercise. The goal is to create a program that takes a simple mathematical word problem, expressed in English, and computes the result. The complexity grows in stages, starting from simple numbers, then adding operations like addition, subtraction, multiplication, and division.

A typical input might look like: "What is 5 multiplied by -2?". The program must be intelligent enough to ignore the non-mathematical words ("What is", "?"), identify the numbers (5, -2), recognize the operator ("multiplied by"), and perform the calculation `5 * -2` to output `-10`.

Why Awk is the Perfect Tool for the Job

Awk is a domain-specific language designed from the ground up for text processing. It operates on a simple yet powerful paradigm: pattern-action. It scans a file or input stream line by line, and for each line that matches a specified pattern, it executes a corresponding action. This makes it incredibly efficient for tasks like the one we're facing.
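As a minimal sketch of the pattern-action idea, the snippet below selects only the lines that look like Wordy questions and prints them with their line number. The helper name find_questions is made up for this illustration and is not part of the solution script.

```shell
# Hypothetical helper: keep only lines shaped like "What is ...?".
# The regex before the braces is the pattern; the braces hold the action.
find_questions() {
    awk '/^What is .*\?$/ { print NR ": " $0 }'
}

printf '%s\n' "hello" "What is 5 plus 13?" "goodbye" | find_questions
# prints: 2: What is 5 plus 13?
```

Lines that do not match the pattern are skipped without any explicit loop or if statement, which is the implicit loop the paragraph above refers to.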

Instead of loading a heavy library in a general-purpose language like Python or Java, Awk provides built-in string manipulation functions (gsub, split, match) and an implicit loop that are tailor-made for this kind of work. For developers and system administrators who work in the terminal, Awk is a first-class citizen, allowing for fast, powerful, and composable solutions. Our journey through this problem will showcase why Awk remains a vital tool in any programmer's toolkit. To learn more about its foundational concepts, explore our in-depth guide to the Awk language.

How to Design and Implement the Wordy Parser in Awk

Our strategy will be to transform the natural language question into a sequence of tokens (numbers and operators) that we can process one by one. This is a common approach in compiler design and interpreters, and it breaks the problem down into manageable steps. We will manage the "state" of our calculation as we move through the tokens.

The Core Logic: A State Machine Approach

We'll process the tokens in a loop. At any point, our script will be in one of two primary states: "expecting a number" or "expecting an operator". This prevents logical errors like two numbers or two operators appearing consecutively.

1. Preprocessing: First, we clean the input string. We'll strip away the "What is" prefix and the question mark suffix. This leaves us with the core mathematical expression.
2. Tokenization: Next, we'll replace the wordy operators ("plus", "minus") with their symbolic counterparts ("+", "-"). Then, we can split the cleaned string into an array of tokens based on spaces.
3. Sequential Evaluation: We initialize a result with the first number. Then, we iterate through the remaining tokens. If we see an operator, we store it. If we see a number, we apply the stored operator to our running result and the new number.
4. Error Handling: Throughout the process, we must validate the input. What if the question doesn't start with "What is"? What if there's a syntax error like "5 plus plus 3"? Our script must detect these issues and report an error.

ASCII Diagram: The Parsing and Evaluation Flow

This diagram illustrates the high-level data flow, from the raw input string to the final calculated result.

    ● Start ("What is 5 plus 10?")
    │
    ▼
  ┌─────────────────────────────┐
  │      1. Pre-processing      │
  │ (Remove "What is", "?")     │
  └──────────────┬──────────────┘
                 │
                 ▼
        "5 plus 10"
        │
        ▼
  ┌─────────────────────────────┐
  │      2. Normalization       │
  │ (gsub "plus" to "+", etc.)  │
  └──────────────┬──────────────┘
                 │
                 ▼
          "5 + 10"
          │
          ▼
  ┌─────────────────────────────┐
  │       3. Tokenization       │
  │      (split by space)       │
  └──────────────┬──────────────┘
                 │
                 ▼
         ["5", "+", "10"]
         │
         ▼
  ┌─────────────────────────────┐
  │   4. Sequential Evaluation  │
  │  (Loop through tokens)      │
  └──────────────┬──────────────┘
                 │
                 ▼
           ● End (15)
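The first three stages of the diagram can be traced with a short sketch that prints the string after each step. The helper name trace_stages is an assumption made for this illustration; the sub, gsub, and split calls mirror the ones used in the solution.

```shell
# Hypothetical helper: show the intermediate result of each stage.
trace_stages() {
    printf '%s\n' "$1" | awk '{
        # 1. Pre-processing: drop the prefix and the question mark.
        sub(/^What is /, ""); sub(/\?$/, "")
        print "cleaned: " $0
        # 2. Normalization: words become symbols.
        gsub(/multiplied by/, "*"); gsub(/divided by/, "/")
        gsub(/plus/, "+"); gsub(/minus/, "-")
        print "normalized: " $0
        # 3. Tokenization: split on whitespace.
        n = split($0, tokens, " ")
        for (i = 1; i <= n; i++) print "token " i ": " tokens[i]
    }'
}

trace_stages "What is 5 plus 10?"
```

For the sample question this prints the cleaned string "5 plus 10", the normalized string "5 + 10", and then the three tokens, one per line.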

The Complete Awk Solution (wordy.awk)

Here is the complete, well-commented Awk script to solve the problem. Save this code in a file named wordy.awk. We will walk through it in detail in the next section.


#!/usr/bin/awk -f

# wordy.awk - A script to parse and evaluate simple math word problems.
# This script is part of the exclusive kodikra.com learning curriculum.

# BEGIN block: Runs once before any input is processed.
# We use it here to define our operator mappings.
BEGIN {
    # Associative array to map word operators to symbols.
    # This makes the replacement logic clean and extensible.
    ops["plus"] = "+"
    ops["minus"] = "-"
    ops["multiplied by"] = "*"
    ops["divided by"] = "/"
}

# Main processing block: Runs for each line of input.
{
    # Store the original question for error messages.
    original_question = $0

    # 1. PREPROCESSING: Clean the input string.
    # Remove the leading "What is" and the trailing "?".
    # The regex `^What is` anchors the match to the start of the string.
    # `\?$` anchors the question mark to the end.
    sub(/^What is /, "")
    sub(/\?$/, "")
    
    # Check for empty input after cleaning, which is a syntax error.
    if ($0 == "") {
        print "Syntax error"
        next # Skip to the next line of input
    }

    # 2. NORMALIZATION: Replace word operators with symbols.
    # We also handle multi-word operators first to avoid conflicts.
    gsub(/multiplied by/, "*" )
    gsub(/divided by/, "/" )
    gsub(/plus/, "+" )
    gsub(/minus/, "-" )

    # 3. TOKENIZATION: Split the normalized string into tokens.
    # The `split` function populates the `tokens` array.
    # `num_tokens` will hold the number of elements in the array.
    num_tokens = split($0, tokens, " ")

    # 4. EVALUATION & STATE MANAGEMENT
    
    # Initialize the result with the first token, which MUST be a number.
    # `+ 0` forces numeric context, so the running result is stored as a number rather than a string.
    current_result = tokens[1] + 0
    
    # The first token must be a valid integer. `^` is start, `$` is end.
    # `-?` allows an optional minus sign. `[0-9]+` requires one or more digits.
    if (tokens[1] !~ /^-?[0-9]+$/) {
        print "Syntax error"
        next
    }
    
    # State variable: 0 means we expect a number, 1 means we expect an operator.
    # We just processed a number, so we now expect an operator.
    expect_operator = 1
    
    # Loop through the rest of the tokens (from the second token onwards).
    for (i = 2; i <= num_tokens; i++) {
        token = tokens[i]

        if (expect_operator) {
            # We are expecting an operator (+, -, *, /).
            if (token ~ /^[\+\-\*\/]$/) {
                pending_op = token
                expect_operator = 0 # Now we expect a number.
            } else {
                # If we get anything else, it's a syntax error.
                # This catches cases like "3 4" or an unrecognized word such as "cubed".
                print "Syntax error"
                next # Exit this line's processing
            }
        } else {
            # We are expecting a number.
            if (token ~ /^-?[0-9]+$/) {
                # Perform the calculation based on the pending operator.
                if (pending_op == "+") {
                    current_result += token
                } else if (pending_op == "-") {
                    current_result -= token
                } else if (pending_op == "*") {
                    current_result *= token
                } else if (pending_op == "/") {
                    # Handle division by zero.
                    if (token == 0) {
                        print "Error: Division by zero"
                        next
                    }
                    current_result /= token
                }
                expect_operator = 1 # Now we expect an operator again.
            } else {
                # If we get anything else, it's a syntax error.
                print "Syntax error"
                next
            }
        }
    }
    
    # Final check: If we end expecting a number, it means the expression is incomplete.
    # e.g., "What is 5 plus?"
    if (!expect_operator) {
        print "Syntax error"
        next
    }

    # If all checks pass, print the final result.
    # `int()` is used to truncate any floating point result from division.
    print int(current_result)
}

How to Run the Script

To execute this script, you can use it with `awk` from your terminal. You can either pipe the input string to it or provide a file.

Using `echo` and a pipe:


$ echo "What is 5 plus 13?" | awk -f wordy.awk
18

Using a "here-string" (supported in Bash, Zsh):


$ awk -f wordy.awk <<< "What is 7 minus 5?"
2

Testing an error case:


$ awk -f wordy.awk <<< "What is 5 plus plus 6?"
Syntax error

Detailed Code Walkthrough

Let's dissect the wordy.awk script to understand how each part contributes to the final solution.

The `BEGIN` Block: Setting the Stage


BEGIN {
    ops["plus"] = "+"
    ops["minus"] = "-"
    ops["multiplied by"] = "*"
    ops["divided by"] = "/"
}

This block runs only once, before Awk starts reading any input. We use it to initialize an associative array named ops, which acts as a dictionary translating the English words for operations into the standard mathematical symbols. Note that the script as written never reads this array: the normalization step calls gsub with literal patterns. The map is kept as an extension point, since a more complex parser could loop over it instead of hard-coding one gsub call per operator.
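One way the ops map could drive the normalization step is sketched below. This is an illustration under assumptions, not the approach the solution script takes; the helper name normalize is made up for the example.

```shell
# Hypothetical helper: normalize an expression using the ops map.
normalize() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        ops["plus"] = "+"; ops["minus"] = "-"
        ops["multiplied by"] = "*"; ops["divided by"] = "/"
    }
    {
        # The for-in iteration order is unspecified in awk. That is safe
        # here only because no operator phrase contains another one.
        for (word in ops) gsub(word, ops[word])
        print
    }'
}

normalize "3 multiplied by 4 minus 2"
# prints: 3 * 4 - 2
```

With this layout, supporting a new operator word means adding one entry to the array rather than another gsub line.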

The Main Block: Processing Each Line


{
    # ... code ...
}

This is the heart of the script. The code inside these curly braces executes for every single line of input provided to Awk.

Step 1: Preprocessing and Normalization


sub(/^What is /, "")
sub(/\?$/, "")

gsub(/multiplied by/, "*" )
gsub(/divided by/, "/" )
gsub(/plus/, "+" )
gsub(/minus/, "-" )

Here, we perform two crucial cleanup tasks. The sub() function substitutes the first occurrence of a pattern. We use it with regular expressions to remove the "What is " prefix (^ anchors it to the start) and the "?" suffix ($ anchors it to the end).

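Next, we use gsub() (global substitute) to replace all occurrences of the wordy operators. Processing multi-word operators like "multiplied by" before single-word ones is a defensive habit: none of these four phrases contains another, so the order does not change the result here, but it would matter as soon as one operator phrase contained another.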

Step 2: Tokenization


num_tokens = split($0, tokens, " ")

The built-in split() function is perfect for our needs. It takes the modified line ($0), breaks it apart, and stores the resulting pieces into an array named tokens. A separator of exactly " " is a special case in Awk: it splits on runs of whitespace and ignores leading and trailing blanks, so stray extra spaces don't produce empty tokens. split() also returns the number of tokens created, which we store in num_tokens. For an input of "5 + 10", the tokens array would look like: tokens[1] = "5", tokens[2] = "+", tokens[3] = "10".
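To see why the single-space separator matters, the sketch below counts tokens for an input with irregular spacing, once with " " and once with the regex "[ ]", which splits on every individual space. The helper name count_tokens is an assumption for this example.

```shell
# Hypothetical helper: compare the two separator styles.
count_tokens() {
    printf '%s\n' "$1" | awk '{
        n_ws = split($0, a, " ")      # special case: runs of whitespace
        n_sp = split($0, b, "[ ]")    # regex: every single space separates
        print n_ws, n_sp
    }'
}

count_tokens "5  +   10"
# prints: 3 6
```

The second count includes the empty strings between consecutive spaces, which the state machine would reject as invalid tokens.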

Step 3: The Evaluation Loop and State Machine


current_result = tokens[1] + 0

if (tokens[1] !~ /^-?[0-9]+$/) {
    print "Syntax error"
    next
}

expect_operator = 1

for (i = 2; i <= num_tokens; i++) {
    # ... loop logic ...
}

This is the most complex part of our script. We initialize current_result with the very first token, which we validate must be a number. The regex /^-?[0-9]+$/ ensures the token is composed entirely of an optional sign and one or more digits.

We then introduce a state variable, expect_operator. We set it to 1 (true) because after processing the first number, we expect an operator to follow. The `for` loop iterates through the remaining tokens, and inside it, we check our state. If we expect an operator, the token must be one of `+`, `-`, `*`, or `/`. If we expect a number, the token must be a valid integer. Any deviation from this pattern results in a "Syntax error".
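Awk has no keyword for declaring local variables; a variable assigned inside a rule is global. The conventional way to get function-scoped state is to list extra parameters in a function definition. The sketch below applies that idiom to the same two-state check, assuming the expression has already been normalized to symbols. The names is_valid_sequence and valid are made up for this illustration.

```shell
# Hypothetical helper: validate an already-normalized expression.
is_valid_sequence() {
    printf '%s\n' "$1" | awk '
    # i, n, tok and expect_op are extra parameters used as locals:
    # awk gives each call fresh, empty copies of them.
    function valid(expr,    i, n, tok, expect_op) {
        n = split(expr, tok, " ")
        if (n == 0) return 0
        expect_op = 0                          # first token must be a number
        for (i = 1; i <= n; i++) {
            if (expect_op) {
                if (tok[i] !~ /^[-+*\/]$/) return 0
            } else {
                if (tok[i] !~ /^-?[0-9]+$/) return 0
            }
            expect_op = !expect_op             # flip the state
        }
        return expect_op                       # must end right after a number
    }
    { print (valid($0) ? "ok" : "Syntax error") }'
}

is_valid_sequence "5 + 10"     # prints: ok
is_valid_sequence "5 + + 6"    # prints: Syntax error
```

Keeping the state inside a function also means it resets for every input line without any explicit re-initialization.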

ASCII Diagram: The State Machine Logic

This diagram visualizes the logic inside our `for` loop, showing how the script transitions between expecting a number and expecting an operator.

    ● Start Loop (After first number)
    │
    ▼
  ┌──────────────────┐
  │ State: Expect Op │
  └─────────┬────────┘
            │
            ▼
    ◆ Is token an op?
   ╱           ╲
 Yes            No
  │              │
  ▼              ▼
[Store Op]   [Syntax Error]
  │
  ▼
┌────────────────────┐
│ State: Expect Num  │
└──────────┬─────────┘
           │
           ▼
    ◆ Is token a num?
   ╱             ╲
 Yes              No
  │                │
  ▼                ▼
[Apply Op]     [Syntax Error]
  │
  └───────────────┐
                  │
                  ▼
         (Loop to Expect Op)

Pros, Cons, and Alternative Approaches

Every solution has trade-offs. The state machine approach we implemented is clear and robust, but it's useful to understand its characteristics and consider other ways the problem could be solved.

Pros & Cons of the Sequential Token Processing Method

Pros	Cons
Highly Readable: The logic is straightforward. The code reads like a description of the process: get a number, then an operator, then a number, etc.	Verbose: It requires more lines of code than a purely regex-based solution might.
Easily Extensible: Adding new operators or keywords is simple. You add a new `gsub` rule, extend the operator regex, and add a case in the evaluation logic.	No Order of Operations: This simple left-to-right evaluation does not respect mathematical precedence (PEMDAS/BODMAS). "2 + 3 * 4" would evaluate to 20, not 14.
Robust Error Handling: The state machine makes it easy to detect syntax errors like "5 5" or "5 plus plus 6".	State Management Overhead: You need to carefully manage the `expect_operator` state variable to ensure correctness.
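The left-to-right behavior noted in the table can be checked with a stripped-down evaluator. This sketch assumes well-formed, already-normalized input and performs no validation, so it is not a substitute for the full script; the name eval_ltr is made up for the example.

```shell
# Hypothetical helper: evaluate "number (operator number)*" left to right.
eval_ltr() {
    printf '%s\n' "$1" | awk '{
        n = split($0, t, " ")
        acc = t[1] + 0
        # Tokens at even positions are operators, the next one is the operand.
        for (i = 2; i < n; i += 2) {
            if      (t[i] == "+") acc += t[i + 1]
            else if (t[i] == "-") acc -= t[i + 1]
            else if (t[i] == "*") acc *= t[i + 1]
            else if (t[i] == "/") acc /= t[i + 1]
        }
        print acc
    }'
}

eval_ltr "2 + 3 * 4"
# prints: 20
```

The addition runs first because it appears first, giving (2 + 3) * 4 rather than 2 + (3 * 4).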

Alternative Approach: A Single Complex Regex

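For a very limited version of this problem, one could attempt to use a single, complex regular expression with capture groups. Note that the three-argument form of match() used below, which fills an array with the captured groups, is a GNU Awk (gawk) extension and is not available in POSIX Awk, mawk, or nawk.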


# Hypothetical example - not a full solution (the 3-argument match() is gawk-only)
if (match($0, /What is (-?[0-9]+) (plus|minus) (-?[0-9]+)\?/, arr)) {
    num1 = arr[1]
    op = arr[2]
    num2 = arr[3]
    # ... perform calculation ...
}

This approach is very brittle. It only works for exactly two numbers and one operator. It cannot handle chains of operations ("5 plus 5 minus 2") without becoming nightmarishly complex. While it might seem clever for a simple case, it fails to scale and is much harder to debug and maintain, which is why the token-based state machine is the superior engineering solution for this problem.
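For comparison, here is a sketch of the same two-operand idea that avoids capture groups altogether and should run on any POSIX Awk: the regex only validates the shape of the line, and the default field splitting supplies the pieces. The helper name two_operand is an assumption for this illustration, and the sketch shares the brittleness described above.

```shell
# Hypothetical helper: handles exactly "What is A plus|minus B?".
two_operand() {
    printf '%s\n' "$1" | awk '
    /^What is -?[0-9]+ (plus|minus) -?[0-9]+\?$/ {
        # Fields: $3 is the first number, $4 the operator word,
        # $5 the second number with the "?" still attached.
        b = $5
        sub(/\?$/, "", b)
        print ($4 == "plus" ? $3 + b : $3 - b)
        next
    }
    { print "Syntax error" }'
}

two_operand "What is 7 minus 5?"
# prints: 2
```

Any question with a second operator falls through to the catch-all rule, which illustrates why this style does not scale to chained operations.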

Frequently Asked Questions (FAQ)

1. Why is Awk a good choice for this text-parsing problem?: Awk is purpose-built for text processing. Its line-by-line processing model, powerful built-in string functions (sub, gsub, split), and native support for regular expressions make it possible to write a concise and efficient solution without external libraries. It's a standard tool on virtually all Unix-like systems.
2. How does the script handle invalid input like "What is 5 plus plus 3?": The state machine catches it. Normalization turns the question into the tokens 5, +, +, 3. After the first "+", the state becomes "expecting a number". The second "+" fails the number check, so the script prints "Syntax error" and skips the rest of that line.
3. Can this script be extended to handle the order of operations (PEMDAS)?: Not easily in its current form. This script performs simple left-to-right evaluation. Implementing PEMDAS would require a more advanced parsing technique, such as the Shunting-yard algorithm to convert the infix expression to postfix (Reverse Polish Notation), and then evaluating that. This is a significant step up in complexity and beyond the scope of this simple parser.
4. What does the `+ 0` do in `current_result = tokens[1] + 0`?: This is a common Awk idiom to force a value to be treated as a number. Elements produced by split() are strings that Awk treats as numbers only when they look numeric; adding zero to "5" yields the numeric value 5 outright, so the running result never depends on that inference or ends up compared as a string.
5. Is Awk still relevant in the age of Python and Go?: Absolutely. For quick command-line text manipulation, log analysis, and data extraction, Awk is often faster to write and faster to run than an equivalent Python script. It's a core part of the Unix philosophy of small, sharp tools that do one thing well. While you wouldn't build a web application in Awk, it remains an indispensable tool for sysadmins, data engineers, and anyone working heavily in the shell.
6. How are negative numbers handled?: The regular expression used for number validation, /^-?[0-9]+$/, explicitly accounts for negative numbers. The -? part of the regex means "an optional hyphen character". This allows tokens like "-10" to be correctly identified as valid numbers throughout the script.
7. What's the difference between `gawk`, `nawk`, and `mawk`?: They are different implementations of the Awk language. `gawk` (GNU Awk) is the most common version on Linux systems and is rich with features and extensions. `nawk` ("new Awk") was an improved version from Bell Labs that became the basis for the POSIX standard. `mawk` is another implementation known for being extremely fast. The main script uses only standard features and should be compatible with all of them; the three-argument match() in the alternative approach is the exception, since it is a gawk extension. It's always best to test against the implementation you will actually run.
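To illustrate question 4, the sketch below compares two string constants with and without the `+ 0` idiom. String constants are used on purpose, because they are always compared as strings; the helper name compare_demo is made up for this example.

```shell
# Hypothetical helper: string comparison versus numeric comparison.
compare_demo() {
    awk 'BEGIN {
        a = "10"; b = "9"          # string constants
        print (a < b)              # string comparison: "1" sorts before "9"
        print (a + 0 < b + 0)      # numeric comparison: 10 is not below 9
    }'
}

compare_demo
# prints 1, then 0
```

The first comparison is true only because the values are compared character by character, which is the kind of surprise the idiom guards against.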

Conclusion: From Words to Wisdom with Awk

We have successfully journeyed from a simple English question to a calculated numerical answer, using nothing but the power and elegance of Awk. By breaking the problem down into preprocessing, tokenization, and stateful evaluation, we built a parser that is not only functional but also robust and easy to understand.

The skills honed in this kodikra module—manipulating strings, using regular expressions, and managing state—are fundamental to software development. They appear everywhere, from parsing configuration files and analyzing log data to building compilers and interpreters. Awk, with its concise syntax and text-centric design, proves to be an exceptional tool for mastering these concepts. This exercise is a key building block in our comprehensive scripting and data processing roadmap, designed to make you a more effective and efficient programmer.

Disclaimer: The code provided in this article has been tested with GNU Awk (gawk) version 5.1.0. While it uses standard features, behavior may vary slightly with other Awk implementations.

Published by Kodikra — Your trusted Awk learning resource.
