Word Boundaries (Anchor)

  • The metacharacter \b is an anchor, just like the caret (^) and the dollar sign ($): it matches a position, not a character.
  • Note : "Word" here means a run of word characters, i.e. anything the shorthand \w matches: letters, digits and the underscore.
  • There are 4 positions that count as word boundaries :
    • Before the first character in the string, if the first character is a word character (FYI : digits and the underscore count as word characters; spaces, punctuation and other special characters do not)
    • After the last character in the string, if the last character is a word character
    • Between a word character and a non-word character (\W)
    • Between a non-word character (\W) and a word character
    • Note : The position between two word characters (\w) is not a boundary, and neither is the position between two non-word characters
  • Word boundaries help to search for a whole word using the regex \bword\b
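
The rules above can be checked with a minimal sketch in Python's built-in re module. The helper name boundary_positions and the sample strings are made up for illustration; other regex flavors may treat non-ASCII characters differently.

```python
import re

def boundary_positions(text):
    # Return every index at which the zero-width anchor \b matches in text.
    return [m.start() for m in re.finditer(r"\b", text)]

# Start of string and the position between "t" and "!" are boundaries.
# The end of the string is not, because "!" is not a word character.
print(boundary_positions("cat!"))   # [0, 3]

# "-" is a non-word character, so there is a boundary on both sides of it.
print(boundary_positions("a-b"))    # [0, 1, 2, 3]

# Digits are word characters, so "42" starts and ends with a boundary.
print(boundary_positions("42 is"))  # [0, 2, 3, 5]

# \bcat\b only finds "cat" as a whole word, not inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_words)                  # ['cat', 'cat']
```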

Negation of \b is \B

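  • \B matches at exactly the positions where \b does not.
  • In other words, \B matches between two word characters, between two non-word characters, and at the start or end of the string when the adjacent character is not a word character.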
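
As a rough illustration, again assuming Python's re module and made-up sample strings, the positions matched by \b and \B can be collected and compared. For a non-empty string they should never overlap, and together they should cover every position.

```python
import re

text = "cat!"

# Collect the positions matched by each anchor.
b_positions = {m.start() for m in re.finditer(r"\b", text)}
B_positions = {m.start() for m in re.finditer(r"\B", text)}

print(sorted(b_positions))  # [0, 3]
print(sorted(B_positions))  # [1, 2, 4]

# The two sets are disjoint and together cover positions 0..len(text).
print(b_positions & B_positions)                            # set()
print(b_positions | B_positions == set(range(len(text) + 1)))  # True

# \Bis finds "is" only where it is preceded by another word character,
# i.e. inside "This", not at the start of "island" or the word "is".
inside = [m.start() for m in re.finditer(r"\Bis", "This island is beautiful")]
print(inside)  # [2]
```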

Q & A

Text : This island is beautiful
Find : is (using \b)
Regex : \bis\b

How does the regex engine process the \b metacharacter?

Let’s see what happens when we apply the regex «\bis\b» to the string “This island is beautiful”. The engine starts with the first token «\b» at the first character “T”. Since this token is zero-length, the position before the character is inspected. «\b» matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal «i». The engine does not advance to the next character in the string, because the previous regex token was zero-width. «i» does not match “T”, so the engine retries the first token at the next character position.

«\b» cannot match at the position between the “T” and the “h”, because both are word characters. For the same reason it cannot match between the “h” and the “i”, nor between the “i” and the “s”.

The next character in the string is a space. «\b» matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the «i» which does not match with the space.

Advancing a character and restarting with the first regex token, «\b» matches between the space and the second “i” in the string. Continuing, the regex engine finds that «i» matches „i” and «s» matches „s”. Now, the engine tries to match the second «\b» at the position before the “l”. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the “s” in “island”. Again, the «\b» fails to match and continues to do so until the second space is reached. It matches there, but matching the «i» fails.

But «\b» matches at the position before the third “i” in the string. The engine continues, and finds that «i» matches „i” and «s» matches „s”. The last token in the regex, «\b», also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word „is” in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression «is», it would have matched the „is” in “This”.
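
The walkthrough can be reproduced with a short sketch, assuming Python's re module. The variable names are illustrative, and the trace only records which start positions succeed, not the engine's internal steps.

```python
import re

text = "This island is beautiful"
pattern = re.compile(r"\bis\b")

# Positions where the leading \b matches: the only places a match can begin.
starts = [m.start() for m in re.finditer(r"\b", text)]
print(starts)  # [0, 4, 5, 11, 12, 14, 15, 24]

# For each candidate, show the next two characters and whether \bis\b matches there.
matched = {m.start() for m in pattern.finditer(text)}
trace = [(pos, text[pos:pos + 2], pos in matched) for pos in starts]
for row in trace:
    print(row)
# Position 5 ("is" in "island") fails at the second \b; position 12 succeeds.

# Compare the anchored pattern with the plain literal.
whole = [m.span() for m in pattern.finditer(text)]
plain = [m.span() for m in re.finditer(r"is", text)]
print(whole)  # [(12, 14)]
print(plain)  # [(2, 4), (5, 7), (12, 14)]
```

Without the anchors, the first match is the „is” in “This”; with them, only the standalone word is found.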
