Rectangles in Awk: Complete Solution & Deep Dive Guide

Tabs labeled

The Complete Guide to Counting ASCII Rectangles with Awk

This guide provides a comprehensive solution for counting rectangles within an ASCII diagram using Awk. We explore a robust algorithm that identifies top-left corners and verifies corresponding edges, leveraging Awk's powerful text-processing capabilities to solve this complex pattern recognition challenge from the exclusive kodikra.com curriculum.


The Challenge: Seeing Shapes in a Sea of Text

Imagine staring at a log file, a piece of legacy documentation, or a command-line tool's output formatted with ASCII art. At first glance, it's just a jumble of characters: plus signs, minus signs, and pipes. But hidden within this text are structures, patterns, and shapes. Your task is to teach a program to see these shapes, specifically, to count every single rectangle present in the diagram.

This is a classic problem in computational thinking. It tests not just your coding ability but your capacity to break down a visual problem into a concrete, step-by-step algorithm. Many developers might immediately reach for a general-purpose language like Python or JavaScript, but what if there's a tool, born from the Unix philosophy, that is perfectly suited for this kind of line-by-line, character-by-character grid analysis?

In this deep-dive tutorial, we will unlock the surprising power of Awk to solve this challenge elegantly and efficiently. You will learn how to transform a grid of characters into a data structure, how to systematically search for patterns, and how to write a script that is both concise and powerful. By the end, you'll not only have a solution but a new appreciation for one of command-line's most versatile tools.


What is the ASCII Rectangle Counting Problem?

The core task is to analyze a multi-line string of text representing an ASCII diagram and return the total number of rectangles it contains. The diagram is constructed using three primary characters: + for corners, - for horizontal edges, and | for vertical edges.

Consider the following input diagram:


+--+
|  |
+--+

This simple diagram contains exactly one rectangle. However, the diagrams can be much more complex, with overlapping and nested shapes:


+--+--+
|  |  |
+--+--+
|  |  |
+--+--+

The above diagram contains not just the two small squares, but also the larger rectangle that encompasses them, for a total of three rectangles. The challenge from the kodikra learning path requires an algorithm that is robust enough to find all valid rectangles, regardless of their size or position.

A rectangle is defined by four + corners that are properly connected by - and | characters. For example, a single ++ on one line connected to another ++ on a line below it also forms a valid, albeit very thin, rectangle.


Why Use Awk for This Grid-Based Problem?

While modern languages like Python or Go could certainly solve this, using Awk offers a unique blend of conciseness and power rooted in the Unix philosophy of "do one thing and do it well." Awk's primary purpose is advanced text processing, making it a natural fit for this problem.

  • Implicit Looping: Awk automatically reads input line by line, eliminating the need for boilerplate file-handling code. The main action block {...} executes for each line, simplifying the data ingestion process.
  • Field and Character Manipulation: Awk excels at splitting lines into fields or, in our case, individual characters. The split() function is a built-in powerhouse for deconstructing strings into arrays.
  • Associative Arrays: Awk's associative arrays are incredibly flexible. While it doesn't have native multi-dimensional arrays, we can easily simulate a 2D grid using composite keys (e.g., grid[row, col]), which is perfect for storing our ASCII diagram in memory.
  • END Block for Post-Processing: The special END block executes only after all input lines have been processed. This is the ideal place to run our main counting algorithm, as we are guaranteed to have the entire grid loaded into our associative array.

Using Awk connects us to a rich history of command-line data manipulation. It forces a different way of thinking that is less about complex objects and classes and more about efficient data streams and transformations. Mastering it provides a powerful tool for system administrators, data scientists, and anyone who works extensively in a terminal environment.


How the Rectangle Counting Algorithm Works

Our strategy will be a systematic, brute-force approach. While this term can sometimes sound inefficient, it is often the clearest and most reliable way to ensure every possibility is checked. The core idea is to iterate through every possible pair of corners and then verify if they form a valid rectangle.

The logic can be broken down into three main phases: Data Ingestion, Pattern Detection, and Validation.

Phase 1: Ingesting and Storing the Grid

Before we can count rectangles, we need to read the ASCII diagram from the input and store it in a structure we can easily access. An associative array is perfect for this. The Awk script will read the input line by line. For each line, it will:

  1. Use the built-in variable NR (Number of Record) as the current row index.
  2. Use the split($0, chars, "") function to break the entire line ($0) into an array of individual characters.
  3. Loop through this character array and store each character in our main associative array, grid, using a key like grid[NR, col_index].

This entire process happens in the main action block of our Awk script. Once the last line is read, the END block will trigger, and our grid array will contain the complete diagram.

Phase 2: The Four-Corner Detection Loop

This is the heart of the algorithm. Inside the END block, we will implement a set of four nested loops to find all possible pairs of top-left and bottom-right corners.


// Pseudocode for the detection loop
for r1 from 1 to num_rows
  for c1 from 1 to num_cols
    // Found a potential top-left corner
    if grid[r1, c1] is '+'
      // Now search for a potential bottom-right corner
      for r2 from r1 + 1 to num_rows
        for c2 from c1 + 1 to num_cols
          // If we find another '+', we might have a rectangle
          if grid[r2, c2] is '+'
            // Proceed to Phase 3: Validation
          end if
        end for
      end for
    end if
  end for
end for

This nested loop structure ensures that we check every character as a potential top-left corner (r1, c1) and then pair it with every possible bottom-right corner (r2, c2) that is located below and to the right of it.

Phase 3: Edge and Corner Validation

Just finding two diagonal + characters isn't enough. A valid rectangle requires all four corners to be + and the connecting lines to be valid. Once our detection loop finds a potential top-left (r1, c1) and bottom-right (r2, c2) corner, we must perform a rigorous validation check:

  1. Check the Other Two Corners: First, a quick and crucial check. Does a + exist at the top-right (r1, c2) and bottom-left (r2, c1) positions? If not, we can immediately discard this candidate and continue our search.
  2. Validate the Top Edge: Iterate from column c1 + 1 to c2 - 1 along row r1. Every character in this path must be either a - (a line segment) or a + (part of another intersecting rectangle).
  3. Validate the Bottom Edge: Similarly, iterate from column c1 + 1 to c2 - 1 along row r2. All characters must be - or +.
  4. Validate the Left Edge: Iterate from row r1 + 1 to r2 - 1 along column c1. Every character must be a | or a +.
  5. Validate the Right Edge: Finally, iterate from row r1 + 1 to r2 - 1 along column c2. All characters must be | or +.

If, and only if, all five of these conditions are met, we can confidently say we've found a valid rectangle. We then increment our global rectangle_count and continue the search.

This detailed validation prevents us from counting incomplete or broken shapes.

    ● Start Search at (r1, c1)
    │
    ▼
  ┌───────────────────────┐
  │ Is grid[r1, c1] == '+'? │
  └──────────┬────────────┘
             │ Yes
             ▼
    ● Search for (r2, c2) where r2 > r1, c2 > c1
    │
    ▼
  ┌───────────────────────┐
  │ Is grid[r2, c2] == '+'? │
  └──────────┬────────────┘
             │ Yes
             ▼
    ◆ Check Other Corners
     (r1, c2) & (r2, c1) must be '+'
   ╱                       ╲
  Yes                       No
  │                          │
  ▼                          ▼
┌───────────┐             [Discard]
│ Validate Edges │
└─────┬─────┘
      │
      ▼
 ◆ All Edges Valid?
   ╱           ╲
  Yes           No
  │              │
  ▼              ▼
[Increment Count] [Continue Search]
  │
  └───────────┐
              ▼
             ● End of Loop

Where to Implement the Logic: The Awk Script Walkthrough

Now, let's translate our algorithm into a working Awk script. The script is neatly divided into the main action block for data loading and the END block for processing.

The Complete Awk Solution

Here is the full, commented source code for solving this kodikra module challenge. Save this code in a file named count_rectangles.awk.


# count_rectangles.awk
# A solution from the kodikra.com exclusive curriculum.

# Main action block: Executed for each line of input.
# Its sole purpose is to read the ASCII grid into an associative array.
{
    # NR is a built-in Awk variable for the current line number (1-indexed).
    # We use it as our row coordinate.
    
    # The split() function is key here. It divides the entire line ($0)
    # into the 'line_chars' array. The empty string "" as a delimiter
    # tells split() to separate by each character.
    split($0, line_chars, "")
    
    # We capture the number of columns. The problem assumes all lines
    # have equal length, so we can set this on each line read.
    num_cols = length($0)
    
    # This loop populates our simulated 2D grid.
    # The key 'NR, i' combines the row and column to uniquely identify a cell.
    for (i = 1; i <= num_cols; i++) {
        grid[NR, i] = line_chars[i]
    }
}

# END block: This block executes only once, after all lines are read.
# The main counting logic is placed here, ensuring the full 'grid' is available.
END {
    num_rows = NR
    rectangle_count = 0

    # Loop 1 & 2: Iterate through every possible top-left corner (r1, c1).
    for (r1 = 1; r1 <= num_rows; r1++) {
        for (c1 = 1; c1 <= num_cols; c1++) {
            
            # Optimization: If the starting point isn't a '+', skip it.
            if (grid[r1, c1] != "+") {
                continue
            }

            # Loop 3 & 4: Iterate through every possible bottom-right corner (r2, c2)
            # that is located down and to the right of (r1, c1).
            for (r2 = r1 + 1; r2 <= num_rows; r2++) {
                for (c2 = c1 + 1; c2 <= num_cols; c2++) {
                    
                    # Phase 3 Validation starts here.
                    # First, a quick check for the other two corners. This is a massive
                    # optimization, as it filters out most candidates immediately.
                    if (grid[r1, c2] == "+" && grid[r2, c1] == "+" && grid[r2, c2] == "+") {
                        
                        is_valid_rectangle = 1 # Flag: Assume it's valid.

                        # Check top edge (row r1, between c1 and c2)
                        for (i = c1 + 1; i < c2; i++) {
                            if (grid[r1, i] != "-" && grid[r1, i] != "+") {
                                is_valid_rectangle = 0; break
                            }
                        }
                        if (!is_valid_rectangle) continue

                        # Check bottom edge (row r2, between c1 and c2)
                        for (i = c1 + 1; i < c2; i++) {
                            if (grid[r2, i] != "-" && grid[r2, i] != "+") {
                                is_valid_rectangle = 0; break
                            }
                        }
                        if (!is_valid_rectangle) continue

                        # Check left edge (col c1, between r1 and r2)
                        for (j = r1 + 1; j < r2; j++) {
                            if (grid[j, c1] != "|" && grid[j, c1] != "+") {
                                is_valid_rectangle = 0; break
                            }
                        }
                        if (!is_valid_rectangle) continue

                        # Check right edge (col c2, between r1 and r2)
                        for (j = r1 + 1; j < r2; j++) {
                            if (grid[j, c2] != "|" && grid[j, c2] != "+") {
                                is_valid_rectangle = 0; break
                            }
                        }
                        
                        # If the flag is still 1, all checks passed.
                        if (is_valid_rectangle) {
                            rectangle_count++
                        }
                    }
                }
            }
        }
    }
    
    # Finally, print the total count to standard output.
    print rectangle_count
}

How to Run the Script

To execute this solution, you first need to save your ASCII diagram into a text file, for example, input.txt. Then, run the Awk script from your terminal, telling it to use your script file (-f) on your input file.


$ awk -f count_rectangles.awk input.txt

The script will process the file and print a single number: the total count of rectangles found.

The data flow is simple and powerful, embodying the Unix stream-based processing model.

  ┌───────────┐
  │ input.txt │
  └─────┬─────┘
        │
        ▼
  ┌──────────────┐
  │   awk -f     │
  │ count_rects  │
  └─────┬────────┘
        │
        ├─ 1. Main Block: Reads each line...
        │     └─ Stores characters in `grid[row,col]`
        │
        ├─ 2. All lines read.
        │
        ▼
  ┌──────────────┐
  │   END Block  │
  │  (Processing)│
  └─────┬────────┘
        │
        ├─ 1. Iterates through `grid` with 4 nested loops.
        │     └─ Finds corner pairs.
        │
        ├─ 2. Validates edges for each pair.
        │     └─ Increments `rectangle_count`.
        │
        ▼
  ┌──────────────┐
  │ print count  │
  └─────┬─────┘
        │
        ▼
  ● Final Output (stdout)

Alternative Approaches and Considerations

While our brute-force algorithm is correct and clear, it's worth considering its performance and other potential strategies. This demonstrates a deeper understanding of algorithmic trade-offs.

Time Complexity

The provided solution has a time complexity of roughly O(R² * C²), where R is the number of rows and C is the number of columns. This is because we have four nested loops, two iterating through rows and two through columns, that dominate the runtime. For the typical size of ASCII diagrams found in this type of problem, this is perfectly acceptable. However, for extremely large grids, it could become slow.

Alternative Algorithm Idea

A more optimized approach could involve a different strategy, such as counting top-left and bottom-left corner pairs. 1. Iterate through all columns `c1` and `c2` where `c2 > c1`. 2. For this pair of columns, scan down the rows. 3. Whenever you find a row `r` where `grid[r, c1]` is `+` and `grid[r, c2]` is `+` and the line between them is valid (all `-` or `+`), you have found a valid horizontal segment. 4. You can then count how many previous valid horizontal segments you've found for this same column pair. Each previous segment can form a rectangle with the current one.

This approach can reduce the complexity but often requires more complex state management (e.g., storing counts of horizontal segments). For clarity and robustness, the four-corner brute-force method is an excellent and highly defensible starting point.

Pros and Cons of the Awk Approach

To provide a balanced view, let's compare our chosen tool with a more general-purpose language like Python.

Aspect Awk Solution Python Solution
Conciseness Extremely concise. No boilerplate for file I/O. The entire logic is self-contained and focused. More verbose. Requires explicit file opening, reading lines into a list, and closing the file.
Readability Can be cryptic for developers unfamiliar with Awk's syntax (e.g., $0, NR, associative arrays). Generally considered more readable for a wider audience due to its explicit, English-like syntax.
Performance Very fast for its domain. Awk is written in C and highly optimized for text processing. Performance is excellent, but the interpreter has more overhead. A C-extension like NumPy would be faster.
Ecosystem & Portability Available by default on nearly every Linux, macOS, and Unix-like system. Highly portable for shell scripting. Requires a Python interpreter to be installed. Excellent library ecosystem for more complex tasks.

Frequently Asked Questions (FAQ)

1. Why not just use regular expressions to find rectangles?
Regular expressions are powerful for finding patterns within a single line of text. However, they are not well-suited for matching two-dimensional patterns that span multiple lines. A rectangle is a 2D structure, requiring coordinated checks across different rows, which is beyond the scope of standard regex.

2. How does Awk handle 2D arrays?
Awk does not have native 2D arrays. Instead, it uses a powerful feature called associative arrays, where keys can be any string. We simulate a 2D array by creating a unique string key for each cell, typically by concatenating the row and column indices with a special separator, like grid[row, col]. Awk's internal separator character (SUBSEP) handles this combination seamlessly.

3. What happens if the input file is extremely large?
This solution reads the entire grid into memory before processing. For exceptionally large files (e.g., gigabytes of ASCII art), this could lead to high memory consumption. In such a scenario, a streaming algorithm that processes the file without storing it all would be necessary, but that would be a significantly more complex problem.

4. What is the difference between `awk`, `gawk`, and `nawk`?
awk is the original program from Bell Labs. nawk ("new awk") was an improved version that became the standard. gawk is the GNU Project's implementation of Awk, which is standard on most Linux systems. gawk is largely compatible with the POSIX standard for Awk and includes many useful extensions, like the `split()` function's ability to split by character with an empty delimiter.

5. Could this algorithm be adapted to find other shapes?
Absolutely. The core methodology of iterating through potential anchor points and then validating the surrounding structure is highly adaptable. To find triangles or other polygons, you would change the corner-finding logic and, most importantly, the edge validation rules to match the geometry of the new shape.

6. What does `continue` do in the loops?
The continue statement immediately stops the current iteration of the innermost loop and begins the next iteration. We use it as an optimization. For example, if a starting point (r1, c1) is not a +, there's no need to run the inner loops searching for a bottom-right corner, so we continue to the next potential starting point.

7. What is the time complexity of this solution?
The algorithm's time complexity is dominated by the four nested loops and the edge validation loops. In the worst case, it is O(R² * C²), where R is the number of rows and C is the number of columns. This is because for each of the O(R*C) possible top-left corners, we might iterate through O(R*C) possible bottom-right corners, and the validation takes additional time proportional to the rectangle's perimeter.

Conclusion: Beyond the Characters

We have successfully developed a complete, robust Awk script to solve the ASCII rectangle counting problem. This journey took us from understanding the problem's visual nature to designing a concrete, multi-phase algorithm: ingest the data, systematically search for corner pairs, and rigorously validate the connecting edges. By leveraging Awk's line-by-line processing and powerful associative arrays, we created a solution that is both elegant and effective.

More importantly, this exercise from the kodikra.com curriculum highlights a crucial skill for any programmer: the ability to see data structures and algorithms in unconventional places. You've learned to look at a block of text not as a simple string, but as a grid—a 2D data structure ripe for analysis. This way of thinking is invaluable, whether you're parsing log files, manipulating configuration data, or tackling your next complex coding challenge.

To continue honing your skills, we encourage you to explore more challenges in the Awk 6 roadmap module and dive deeper into the full Awk learning path available on our platform.

Disclaimer: The provided Awk solution has been tested with GNU Awk (gawk) 5.1.0+. While it uses standard features, behavior in other Awk implementations may vary.


Published by Kodikra — Your trusted Awk learning resource.