
String Matching & Parsing

Pattern matching, parsing, and text processing — from regex-style matching to building simple parsers and interpreters.


String matching and parsing problems show up constantly in AI coding interviews, and for good reason. They're the kind of problems that naturally decompose into multiple components: a tokenizer here, a parser there, maybe a state machine to tie it all together. That decomposition is exactly what interviewers want to see you manage. Can you break a messy text-processing task into clean pieces? Can you tell the AI what to build for each piece and then wire them together correctly?
Think log parsers that extract structured data from messy output. Template engines that expand variables inside strings. Config file interpreters. Regex-like matchers. Simple expression evaluators. These aren't contrived puzzles — they're things real engineers build all the time, which makes them perfect for an interview format that's trying to simulate real work.
Here's the thing that trips people up: nobody expects you to walk in knowing the KMP algorithm or Aho-Corasick by heart. Interviewers care about whether you can recognize what kind of string problem you're facing, pick a reasonable approach, and implement it correctly with AI assistance. The algorithmic knowledge matters less than the engineering judgment.

Pattern matching fundamentals

The most basic string matching question is: does this pattern appear in this text, and where? Start with the brute force approach because it's often good enough, and knowing when "good enough" is actually good enough is itself an important skill.
Brute force string search slides a window across the text, checking at each position whether the pattern matches. It's O(nm) in the worst case, where n is the text length and m is the pattern length. But here's the practical reality: for most inputs you'll see in an interview, the average case is much better because mismatches happen early and you skip ahead quickly.
[Diagram: brute force search slides the pattern "ABE" across the text "ABCABDABCABEF" — the window at position 3 mismatches at index 2, and a match is found at position 9.]
When is brute force not enough? When you're searching for many patterns simultaneously, or when the text is enormous and the pattern has repetitive structure that causes worst-case behavior. In practice during an AI coding interview, you'll rarely need to implement KMP or Rabin-Karp from scratch. But you should know they exist and when to reach for them — the interviewer might ask you to describe how you'd optimize if the naive approach became a bottleneck.
def brute_force_search(text: str, pattern: str) -> list[int]:
    matches = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            matches.append(i)
    return matches
Short, readable, correct. For a single pattern against a reasonably sized text, this is the right answer. Don't overcomplicate it.
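If the interviewer does ask how you'd optimize, Rabin-Karp is usually the easiest of the faster algorithms to sketch from memory: hash the pattern once, keep a rolling hash over the current window, and only compare strings when the hashes agree. Here's a hedged sketch — the base and modulus are conventional choices, not the only ones:

```python
def rabin_karp_search(text: str, pattern: str, base: int = 256, mod: int = 10**9 + 7) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Precompute base^(m-1) mod mod, used to remove the window's leading character.
    high = pow(base, m - 1, mod)

    p_hash = 0  # hash of the pattern
    w_hash = 0  # rolling hash of the current window
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod

    matches = []
    for i in range(n - m + 1):
        # Hash collisions are possible, so verify candidates with a direct comparison.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

Most positions are now rejected in O(1), giving an expected O(n + m) running time — but note the brute force version above is still the right first answer for a single small pattern.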

Trie-based multi-pattern matching

When the problem involves matching against many patterns — autocomplete, dictionary lookup, filtering log lines against a set of known error signatures — a trie is the right data structure. Instead of searching for each pattern individually (which multiplies your cost by the number of patterns), you build a trie once and walk the text against it.
[Diagram: a trie holding the words "app", "ant", "top", and "ti".]
The trie above holds the words "app", "ant", "top", and "ti". Double-bordered nodes mark the end of a complete word. To check if a string matches, you walk character by character from the root — if you reach a terminal node, you've found a match. If you fall off the trie at any point, the string isn't in your set.
class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def search(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_end

    def starts_with(self, prefix: str) -> bool:
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True
In an interview, the trie itself is rarely the whole problem. It's a building block. You might use it inside a log parser to match known patterns, or as the dictionary backing an autocomplete system. The interviewer wants to see that you know when a trie is the right choice and that you can integrate it into a larger system.
Tries are a great candidate for AI generation. The structure is well-known and the AI will produce a correct implementation almost every time. Ask the AI to build the trie, then spend your time on the interesting part: how you're using it in the larger problem.
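As one illustrative sketch of that integration (the function names and the condensed dict-based trie here are hypothetical, chosen so the example stands alone), here's a scanner that reports which known error signatures appear anywhere in a log line:

```python
END = object()  # sentinel key marking the end of a complete pattern

def build_trie(patterns: list[str]) -> dict:
    root: dict = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = pattern  # store the full pattern at its terminal node
    return root

def find_patterns(text: str, root: dict) -> set[str]:
    """Return every pattern that occurs anywhere in text.

    Naive multi-pattern scan: restart the trie walk at each position.
    Aho-Corasick avoids the restarts, but this is fine at interview scale.
    """
    found = set()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node:
                break  # fell off the trie; no pattern continues from here
            node = node[ch]
            if END in node:
                found.add(node[END])
    return found
```

The trie is built once, so adding another thousand signatures doesn't change the per-line scanning code at all — that's the payoff over searching for each pattern individually.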

Recursive descent parsing

This is the pattern that shows up most often in AI coding interviews when the problem involves parsing. Expression evaluators, config file readers, template engines, mini-language interpreters — they all follow the same basic shape. You define a grammar, then write functions that consume tokens according to that grammar, calling each other recursively.
The idea is straightforward. You have a string of tokens. Each grammar rule becomes a function. The function for "expression" calls the function for "term", which calls the function for "factor", and so on. Each function consumes the tokens it's responsible for and returns the result.
Here's a parser for simple arithmetic expressions that handles +, -, *, /, and parentheses with correct operator precedence:
class Parser:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def parse(self) -> float:
        result = self.expression()
        if self.pos != len(self.text):
            raise ValueError(f"Unexpected character at position {self.pos}")
        return result

    def expression(self) -> float:
        result = self.term()
        while self.pos < len(self.text) and self.text[self.pos] in "+-":
            op = self.text[self.pos]
            self.pos += 1
            right = self.term()
            if op == "+":
                result += right
            else:
                result -= right
        return result

    def term(self) -> float:
        result = self.factor()
        while self.pos < len(self.text) and self.text[self.pos] in "*/":
            op = self.text[self.pos]
            self.pos += 1
            right = self.factor()
            if op == "*":
                result *= right
            else:
                if right == 0:
                    raise ValueError("Division by zero")
                result /= right
        return result

    def factor(self) -> float:
        if self.pos < len(self.text) and self.text[self.pos] == "(":
            self.pos += 1
            result = self.expression()
            if self.pos >= len(self.text) or self.text[self.pos] != ")":
                raise ValueError("Missing closing parenthesis")
            self.pos += 1
            return result

        start = self.pos
        if self.pos < len(self.text) and self.text[self.pos] in "+-":
            self.pos += 1
        while self.pos < len(self.text) and (self.text[self.pos].isdigit() or self.text[self.pos] == "."):
            self.pos += 1
        if self.pos == start:
            raise ValueError(f"Expected number at position {self.pos}")
        return float(self.text[start:self.pos])
A few things to notice. The grammar's precedence is encoded in the call structure: expression handles + and -, term handles * and /, and factor handles numbers and parenthesized sub-expressions. This naturally gives multiplication higher precedence than addition without needing any explicit priority table.
The error handling is basic but present. In an interview, you don't need industrial-strength error messages, but showing that you think about malformed input — missing parentheses, unexpected characters, division by zero — is a strong signal.
Interviewers don't expect you to have parser theory memorized. What they want to see is that you can take a description of a grammar or format and systematically turn it into code. If you've never written a recursive descent parser before, the pattern is simple enough to pick up in the interview itself — and that's fine. Showing that you can assimilate a new technique quickly is exactly what they're evaluating.

Tokenization as a separate step

For more complex parsers, you'll want to separate tokenization from parsing. The tokenizer (or lexer) converts raw text into a stream of typed tokens, and the parser operates on tokens instead of characters. This separation makes both pieces simpler and easier to test independently.
from enum import Enum
from dataclasses import dataclass

class TokenType(Enum):
    NUMBER = "NUMBER"
    PLUS = "PLUS"
    MINUS = "MINUS"
    STAR = "STAR"
    SLASH = "SLASH"
    LPAREN = "LPAREN"
    RPAREN = "RPAREN"
    STRING = "STRING"
    IDENT = "IDENT"
    EOF = "EOF"

@dataclass
class Token:
    type: TokenType
    value: str

def tokenize(text: str) -> list[Token]:
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i].isdigit() or (text[i] == "." and i + 1 < len(text) and text[i + 1].isdigit()):
            start = i
            while i < len(text) and (text[i].isdigit() or text[i] == "."):
                i += 1
            tokens.append(Token(TokenType.NUMBER, text[start:i]))
        elif text[i].isalpha() or text[i] == "_":
            start = i
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
            tokens.append(Token(TokenType.IDENT, text[start:i]))
        elif text[i] == "+":
            tokens.append(Token(TokenType.PLUS, "+"))
            i += 1
        elif text[i] == "-":
            tokens.append(Token(TokenType.MINUS, "-"))
            i += 1
        elif text[i] == "*":
            tokens.append(Token(TokenType.STAR, "*"))
            i += 1
        elif text[i] == "/":
            tokens.append(Token(TokenType.SLASH, "/"))
            i += 1
        elif text[i] == "(":
            tokens.append(Token(TokenType.LPAREN, "("))
            i += 1
        elif text[i] == ")":
            tokens.append(Token(TokenType.RPAREN, ")"))
            i += 1
        else:
            raise ValueError(f"Unexpected character: {text[i]}")
    tokens.append(Token(TokenType.EOF, ""))
    return tokens
This is the kind of code that AI generates really well. The tokenizer is mechanical — you're just classifying characters into categories. Let the AI produce it, verify it handles your edge cases (whitespace, multi-character numbers, string literals if needed), and move on to the parser where your judgment matters more.

Finite state machines

When a problem says "validate this format" or "match strings that follow this pattern," think state machines. A finite state machine (FSM) is just a set of states, a set of transitions between them based on input characters, and one or more accepting states. If you end in an accepting state after processing all input, the string matches.
[Diagram: a four-state machine — S0 reads a digit to S1; S1 loops on digits and takes "." to S2; S2 reads a digit to S3; S3 loops on digits and accepts. Validates decimal numbers like "3.14" and "42.0".]
The FSM above validates decimal numbers. S0 is the start state, S3 (double-bordered) is the accepting state. The machine reads digits, then a dot, then more digits. If it ends in S3, the input is a valid decimal.
In code, state machines are usually just an integer (or enum) tracking the current state and a transition table:
from enum import Enum

class State(Enum):
    START = 0
    INTEGER = 1
    DOT = 2
    DECIMAL = 3
    INVALID = 4

def is_valid_decimal(s: str) -> bool:
    state = State.START
    for ch in s:
        if state == State.START:
            if ch.isdigit():
                state = State.INTEGER
            else:
                state = State.INVALID
        elif state == State.INTEGER:
            if ch.isdigit():
                state = State.INTEGER
            elif ch == ".":
                state = State.DOT
            else:
                state = State.INVALID
        elif state == State.DOT:
            if ch.isdigit():
                state = State.DECIMAL
            else:
                state = State.INVALID
        elif state == State.DECIMAL:
            if ch.isdigit():
                state = State.DECIMAL
            else:
                state = State.INVALID
        if state == State.INVALID:
            return False
    return state == State.DECIMAL
State machines show up in interview problems more often than you'd expect. Validating IP addresses, parsing CSV with quoted fields, matching simplified glob patterns — these all map cleanly to FSMs. The pattern is recognizable: if the problem describes a format with distinct modes or phases (reading the integer part vs. the decimal part, inside a quoted string vs. outside), a state machine is probably the right model.
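To illustrate the CSV case, a two-value state ("inside quotes" or not) is enough to split a line correctly. This is a sketch that treats doubled quotes as escapes but deliberately skips other CSV corner cases like embedded newlines:

```python
def split_csv_line(line: str) -> list[str]:
    # Two-state machine: in_quotes flips on an unescaped quote.
    # A comma only terminates a field when we are outside quotes.
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    current.append('"')  # doubled quote = escaped quote
                    i += 1               # consume the second quote too
                else:
                    in_quotes = False
            else:
                current.append(ch)
        else:
            if ch == '"':
                in_quotes = True
            elif ch == ",":
                fields.append("".join(current))
                current = []
            else:
                current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```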
When you spot a state machine problem, sketch the states and transitions on paper (or describe them to the interviewer) before writing code. Getting the state diagram right is 90% of the work. The code is just a mechanical translation of that diagram, and the AI can handle the translation if you describe the states clearly.
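To make that "mechanical translation" concrete, the decimal validator can also be written as a literal transition table — one dictionary entry per edge in the state diagram, with any missing entry acting as an implicit reject. This sketch is behaviorally equivalent to the if/elif version above:

```python
# Each (state, input class) pair maps to the next state.
TRANSITIONS = {
    ("START", "digit"): "INTEGER",
    ("INTEGER", "digit"): "INTEGER",
    ("INTEGER", "dot"): "DOT",
    ("DOT", "digit"): "DECIMAL",
    ("DECIMAL", "digit"): "DECIMAL",
}
ACCEPTING = {"DECIMAL"}

def classify(ch: str) -> str:
    if ch.isdigit():
        return "digit"
    if ch == ".":
        return "dot"
    return "other"

def is_valid_decimal(s: str) -> bool:
    state = "START"
    for ch in s:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False  # no edge for this input: reject
    return state in ACCEPTING
```

The table form scales better as the diagram grows: adding a state or edge is one new dictionary entry, not another branch of nested conditionals.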

Regular expressions

Regex is a polarizing topic in interviews. Some people love it, some people avoid it entirely. In AI coding interviews, regex occupies a useful middle ground: it's great for quick extraction and validation of well-defined patterns, but it's not a substitute for a proper parser when the grammar gets complex.
Good uses of regex in an interview:
  • Extracting timestamps, IPs, or error codes from log lines
  • Validating formats like email addresses or version strings (approximately — regex can't fully validate emails but can handle the common cases)
  • Splitting structured text into fields
  • Quick find-and-replace in template expansion
Bad uses of regex in an interview:
  • Parsing nested structures (regex fundamentally can't handle balanced parentheses or nested tags)
  • Anything where you need to maintain state across matches
  • Complex grammars where a recursive descent parser would be clearer
import re

def parse_log_line(line: str) -> dict[str, str] | None:
    pattern = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)"
    match = re.match(pattern, line)
    if not match:
        return None
    return {
        "timestamp": match.group(1),
        "level": match.group(2),
        "message": match.group(3),
    }
In an AI coding interview, regex is useful for the mechanical parts of a larger problem. If you're building a log analyzer and the first step is extracting structured fields from each line, regex handles that quickly. The AI is also very good at writing regex patterns — often better than most humans. Describe what you want to match in plain English, and the AI will usually produce a working pattern on the first try. Just make sure you test it against edge cases.
Be careful with AI-generated regex. The AI loves to produce clever, dense patterns that are hard to read and harder to debug. If you can't explain what a regex does at a glance, ask the AI to simplify it or break it into named groups. In an interview, maintainability matters — a slightly longer but readable pattern beats a cryptic one-liner.

Combining the pieces

Real interview problems rarely use just one of these techniques. A config file parser might use regex to tokenize lines, a trie for keyword lookup, and recursive descent for evaluating expressions in values. A log analysis tool might use a state machine to handle multi-line log entries, regex to extract fields, and tries to classify error types.
The skill the interviewer is evaluating isn't whether you know each technique in isolation. It's whether you can look at a problem, identify which parts call for which approach, and wire them together into a working system. That decomposition — "I'll use regex for tokenization, a recursive descent parser for the expression grammar, and a simple hash map for variable lookup" — is exactly the kind of plan that impresses interviewers.
Here's a small example that combines tokenization and recursive descent to evaluate expressions with variables:
def evaluate(expression: str, variables: dict[str, float]) -> float:
    tokens = tokenize(expression)
    pos = [0]  # wrapped in a list so the nested parsing functions can mutate it

    def peek() -> Token:
        return tokens[pos[0]]

    def advance() -> Token:
        token = tokens[pos[0]]
        pos[0] += 1
        return token

    def parse_expression() -> float:
        result = parse_term()
        while peek().type in (TokenType.PLUS, TokenType.MINUS):
            op = advance()
            right = parse_term()
            if op.type == TokenType.PLUS:
                result += right
            else:
                result -= right
        return result

    def parse_term() -> float:
        result = parse_factor()
        while peek().type in (TokenType.STAR, TokenType.SLASH):
            op = advance()
            right = parse_factor()
            if op.type == TokenType.STAR:
                result *= right
            else:
                result /= right
        return result

    def parse_factor() -> float:
        token = peek()
        if token.type == TokenType.NUMBER:
            advance()
            return float(token.value)
        if token.type == TokenType.IDENT:
            advance()
            if token.value not in variables:
                raise ValueError(f"Undefined variable: {token.value}")
            return variables[token.value]
        if token.type == TokenType.LPAREN:
            advance()
            result = parse_expression()
            if peek().type != TokenType.RPAREN:
                raise ValueError("Expected closing parenthesis")
            advance()
            return result
        raise ValueError(f"Unexpected token: {token.value}")

    result = parse_expression()
    if peek().type != TokenType.EOF:
        raise ValueError("Unexpected tokens after expression")
    return result
This reuses the tokenizer from earlier and adds variable resolution. Notice how each layer of the system is independently testable: you can test the tokenizer alone, test the parser with hardcoded tokens, and test the whole thing end-to-end. That kind of modularity is exactly what interviewers want to see.

What interviewers expect

String matching and parsing problems are evaluated differently than pure algorithm problems. The interviewer isn't timing how fast you can implement a trie. They're watching how you decompose a text-processing problem into components and whether you can reason about correctness.
Break the problem into layers. Almost every parsing problem has a natural decomposition: raw text to tokens, tokens to structure, structure to result. Identify those layers before you start coding. Tell the interviewer what they are. This plan is more important than the code itself.
Verify AI-generated parser code carefully. Parsers have specific failure modes that AI tends to get wrong:
  • Off-by-one errors in position tracking (consuming one character too many or too few)
  • Missing error handling for unexpected input or premature end of input
  • Incorrect operator precedence in expression parsers
  • Failure to handle edge cases like empty strings, leading/trailing whitespace, or nested structures at maximum depth
Test incrementally. Don't write the whole parser and then test it. Build the tokenizer, test it. Build one grammar rule, test it. Add the next rule, test it. Each step should produce working code before you move on. This is both good engineering and good interviewing — it gives the interviewer confidence that you know what you're building at each stage.
Know when you're overengineering. If the problem is "parse lines of the form key=value", you don't need a recursive descent parser. A simple split("=", 1) is the right answer. Reaching for heavy machinery when a simple approach works is a negative signal. Match the complexity of your solution to the complexity of the problem.
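For the key=value case, the entire "parser" can be a few lines — a sketch using `partition` (equivalent to `split("=", 1)`), with hypothetical handling for blank lines and comments:

```python
def parse_config(text: str) -> dict[str, str]:
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore malformed lines with no '='
        config[key.strip()] = value.strip()
    return config
```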
Parsing problems are one of the best places to let AI handle boilerplate while you focus on the interesting parts. The tokenizer, the basic trie structure, the scaffolding of a recursive descent parser — these are all well-known patterns the AI will produce correctly. Your value-add is in defining the grammar, handling edge cases, and connecting the components. Spend your brainpower there.
One last thing worth emphasizing: interviewers don't expect you to have built parsers before. Most candidates haven't. What they expect is that you can look at a grammar description, understand the structure, and systematically translate it into code — with AI help. The willingness to work through an unfamiliar pattern methodically, rather than panicking or guessing, is one of the strongest signals a candidate can give.
