Use Round brackets for Grouping

  • By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together.
  • This allows you to apply a regex operator, e.g. a repetition operator, to the entire group.
  • Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.
  • Besides grouping part of a regular expression together, round brackets also create a “backreference”.
  • A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.
  • That is, unless you use non-capturing parentheses. Remembering part of the regex match in a backreference slows down the regex engine because it has more work to do.
  • If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.
  • The regex «Set(Value)?» matches „Set” or „SetValue”. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain „Value”.
  • If you do not use the backreference, you can optimize this regular expression into «Set(?:Value)?».
  • The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference.
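A minimal sketch of the two variants, assuming Python's built-in re module (the notes above do not name a specific regex flavor). Note that Python reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r"Set(Value)?")        # group 1 stores "Value" when present
non_capturing = re.compile(r"Set(?:Value)?")  # groups, but stores nothing

m_short = capturing.fullmatch("Set")
m_long = capturing.fullmatch("SetValue")
m_plain = non_capturing.fullmatch("SetValue")

print(m_short.group(1))      # group 1 did not participate in the match
print(m_long.group(1))       # group 1 holds the optional part
print(capturing.groups)      # number of capturing groups in the pattern
print(non_capturing.groups)  # no capturing groups at all
```

Both patterns accept exactly the same strings; only the bookkeeping differs.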

Use Back reference in regex

  • A backreference reuses the text matched by an earlier group later in the same pattern.
  • We define a group by using round brackets.
    Example : (\d)(\w) where (\d) is the first group and (\w) is the second group
  • We can refer to these groups with \1 and \2, where \1 is the text matched by (\d), a digit, and \2 is the text matched by (\w), a word character.
  • Let us look at an example to make this clearer.

Q & A

Find any digit in the search text that is repeated 4 or more times in a row
Text :
123 13222234
3534534 214333332432
Search Result expected : 2222 and 33333
==================================================================
Regex : (\d)\1{3,}
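
The Q & A above can be checked with a short sketch, assuming Python's re module. The group (\d) captures one digit and \1{3,} requires that same digit at least three more times, so the whole match is a run of four or more identical digits.

```python
import re

text = "123 13222234\n3534534 214333332432"

# group(0) is the whole run; group(1) would be the single repeated digit.
runs = [m.group(0) for m in re.finditer(r"(\d)\1{3,}", text)]
print(runs)  # ['2222', '33333']
```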

Optional Items ? in Regex

The question mark makes the preceding token in the regular expression optional. E.g.: «colou?r» matches both „colour” and „color”.

You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket. E.g.: «Nov(ember)?» will match „Nov” and „November”.
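
As a small illustration, assuming Python's re module, fullmatch shows exactly which spellings each pattern accepts. The rejected samples ("colouur", "Novem") are made up for the sketch.

```python
import re

colour = re.compile(r"colou?r")      # the "u" is optional
month = re.compile(r"Nov(ember)?")   # the whole group "ember" is optional

colours = [w for w in ["color", "colour", "colouur"] if colour.fullmatch(w)]
months = [w for w in ["Nov", "November", "Novem"] if month.fullmatch(w)]

print(colours)  # ['color', 'colour']
print(months)   # ['Nov', 'November']
```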

Important Regex Concept: Greediness

The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex «Feb 23(rd)?» to the string “Today is Feb 23rd, 2003”, the match will always be „Feb 23rd” and not „Feb 23”. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.
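
The difference between the greedy and the lazy question mark can be sketched as follows, assuming Python's re module, which supports the ?? syntax.

```python
import re

text = "Today is Feb 23rd, 2003"

greedy = re.search(r"Feb 23(rd)?", text).group(0)   # tries to match "rd" first
lazy = re.search(r"Feb 23(rd)??", text).group(0)    # tries to skip "rd" first

print(greedy)  # Feb 23rd
print(lazy)    # Feb 23
```

The lazy version stops at "Feb 23" because nothing after the optional group forces the engine to go back and consume "rd".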

Pipe Symbol in regex | (Alternation)

  • You can use alternation, written with the pipe symbol, to match any one of several regular expressions.
  • The pipe symbol works like the OR operator in the Java language.
  • If you want to search for bikes or cars with the same regex, separate the two tokens with a pipe symbol to search for both together.
  • Text : Tonight Bikes and Cars are going to have a race on highway
    Find : Text “Bikes” and “Cars” in the same regex.
    ==================================
    Regex : Bikes|Cars
  • Now suppose you want to apply an additional token such as \b to the whole alternation; then you need to group the alternatives using round brackets.
    Example : \b(Bikes|Cars)\b
  • If we had omitted the round brackets, the regex engine would have searched for either a word boundary followed by Bikes, or Cars followed by a word boundary.
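
A short sketch of why the brackets matter, assuming Python's re module. The second sample string ("Bikeshed and MiniCars") is hypothetical and chosen so that the ungrouped and grouped patterns disagree. A non-capturing group is used so that findall returns whole matches.

```python
import re

text = "Tonight Bikes and Cars are going to have a race on highway"
found = re.findall(r"Bikes|Cars", text)
print(found)  # ['Bikes', 'Cars']

tricky = "Bikeshed and MiniCars"  # hypothetical sample text

# Without brackets: (\bBikes) OR (Cars\b), each boundary guards one side only.
ungrouped = re.findall(r"\bBikes|Cars\b", tricky)
# With brackets: both boundaries apply to whichever alternative matches.
grouped = re.findall(r"\b(?:Bikes|Cars)\b", tricky)

print(ungrouped)  # ['Bikes', 'Cars']
print(grouped)    # []
```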

Word Boundaries (Anchor)

  • The metacharacter \b is an anchor, just like the caret and the dollar symbol.
  • Note : Word characters here are the ones matched by the shorthand \w: letters, digits and the underscore
  • There are 4 positions that count as word boundaries :
    • Before the first character in the string, if the first character is a word character (FYI : Digits and the underscore count as word characters; other special characters do not)
    • After the last character in the string, if the last character is a word character
    • Between a word character and a non-word character (\W)
    • Between a non-word character (\W) and a word character
    • Note : The position between two word characters (\w) is not considered a boundary
  • Word boundaries help to search for a whole word using the regex \bword\b
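
To see where the boundaries fall, the sketch below lists every position at which \b matches, assuming Python's re module. The sample strings are made up for the illustration. In Python, \w covers letters, digits and the underscore, so no boundary appears inside "a1_b".

```python
import re

def boundary_positions(text):
    # \b is zero-width, so each match is an empty string at some position.
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("Hi, Bob!"))  # [0, 2, 4, 7]
print(boundary_positions("a1_b"))      # [0, 4]
```

In "Hi, Bob!" there is no boundary at the very end, because the last character "!" is not a word character.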

Negation of \b is \B

  • \B matches at exactly the positions where \b does not.
  • That is, \B matches at every position that is not a word boundary, for example between two word characters or between two non-word characters.
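
A small sketch of the contrast, assuming Python's re module: the same letters "is" are found at different places depending on whether \b or \B comes first.

```python
import re

text = "This island is beautiful"

# "is" that starts in the middle of a word (no boundary before it).
inside = [m.start() for m in re.finditer(r"\Bis", text)]
# "is" that starts at the beginning of a word.
at_start = [m.start() for m in re.finditer(r"\bis", text)]

print(inside)    # [2]      -> the "is" in "This"
print(at_start)  # [5, 12]  -> "island" and "is"
```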

Q & A

Text : This island is beautiful
Find : is (using \b)
==============================================================
Regex : \bis\b
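
The answer can be verified with a short sketch, assuming Python's re module. Without the boundaries the letters "is" are found three times; with them only the standalone word is found.

```python
import re

text = "This island is beautiful"

plain = [m.start() for m in re.finditer(r"is", text)]
whole_word = [m.start() for m in re.finditer(r"\bis\b", text)]

print(plain)       # [2, 5, 12]
print(whole_word)  # [12]
```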

How Does the Regex Engine Process the \b Meta Character?

Let’s see what happens when we apply the regex «\bis\b» to the string “This island is beautiful”. The engine starts with the first token «\b» at the first character “T”. Since this token is zero-length, the position before the character is inspected. «\b» matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal «i». The engine does not advance to the next character in the string, because the previous regex token was zero-width. «i» does not match “T”, so the engine retries the first token at the next character position.

«\b» cannot match at the position between the “T” and the “h”. It cannot match between the “h” and the “i” either, and neither between the “i” and the “s”.

The next character in the string is a space. «\b» matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the «i» which does not match with the space.

Advancing a character and restarting with the first regex token, «\b» matches between the space and the second “i” in the string. Continuing, the regex engine finds that «i» matches „i” and «s» matches „s”. Now, the engine tries to match the second «\b» at the position before the “l”. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the “s” in “island”. Again, the «\b» fails to match and continues to do so until the second space is reached. It matches there, but matching the «i» fails.

But «\b» matches at the position before the third “i” in the string. The engine continues, and finds that «i» matches „i” and «s» matches „s”. The last token in the regex, «\b», also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word „is” in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression «is», it would have matched the „is” in “This”.
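
The walk-through above can be imitated with a hand-written trace. This is a simplified sketch, not how a real regex engine is implemented: the helper names is_word_char and is_boundary are invented for the illustration, and the word-character test only approximates \w. It records each start position where the first «\b» succeeds and what happens next.

```python
text = "This island is beautiful"

def is_word_char(ch):
    # Approximation of \w for this sample: letters, digits, underscore.
    return ch.isalnum() or ch == "_"

def is_boundary(pos):
    # A boundary exists when exactly one neighbour is a word character.
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after

trace = []
for start in range(len(text) + 1):
    if not is_boundary(start):
        continue  # first \b fails here, try the next position
    if not text.startswith("is", start):
        trace.append((start, "literal fails"))
    elif not is_boundary(start + 2):
        trace.append((start, "second \\b fails"))
    else:
        trace.append((start, "match"))
        break

for step in trace:
    print(step)
```

The trace shows the attempt before “T”, the attempt at the first space, the failed attempt on “island”, the attempt at the second space, and finally the match at the standalone „is”.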

Start of String and End of String Anchors ^ and $

  • Unlike a character set, which matches a character, anchors match a position: before, after or between characters
  • They are used to “Anchor” the regex match at a certain position.
  • The caret “^” matches the position before the first character of a line
    Example : In text “abc” regex “^a” matches the letter “a”
  • The dollar “$” matches the position after the last character of a line
    Example : In text “xyz” regex “z$” matches the letter “z”
  • Note : Anchors work line by line and not word by word, so they are well suited to validating single-value user input in applications, such as an email id or a number-only field.
  • Searching for caret “^” and dollar “$” in a multi-line text (Note : the CR LF line ending is \r\n on Windows)
    Since ^ is a position, Notepad++ points to it
    Since $ is a position, Notepad++ points to it
  • Example 2:
    Input : Text that should contain digits only, e.g. “746746746”
    Regex : ^\d+$
  • Example 3:
    Input : Find the leading and trailing spaces in a paragraph
    Regex : ^\s+|\s+$
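
The anchor examples can be tried out with the sketch below, assuming Python's re module. One flavor difference to keep in mind: in Python, ^ and $ only work line by line when the re.MULTILINE flag is passed, whereas the notes above describe the line-by-line behavior seen in Notepad++. The sample inputs other than “746746746” are made up.

```python
import re

digits_only = re.compile(r"^\d+$")
valid = bool(digits_only.search("746746746"))
invalid = bool(digits_only.search("7467a46746"))
print(valid, invalid)  # True False

# Strip leading and trailing whitespace with the alternation of both anchors.
trimmed = re.sub(r"^\s+|\s+$", "", "   some padded text \t ")
print(repr(trimmed))  # 'some padded text'

lines = "abc\nxyz"
first_only = re.findall(r"^\w", lines)                # start of string only
every_line = re.findall(r"^\w", lines, re.MULTILINE)  # start of every line
print(first_only, every_line)  # ['a'] ['a', 'x']
```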

Permanent Start and End of String

  • As discussed above, “^” and “$” work line by line when the text has multiple lines. If you instead want the start or the end of the whole string in a multi-line file, you can use “\A” and “\Z”.
  • \A matches the first position of the first line of a multi-line text.
  • \Z matches the last position of the last line of a multi-line text.
    Note : In flavors that distinguish the two (such as Java and .NET), \Z also matches just before a final line break “\n”, while “\z” matches only at the very end of the string, after that line break
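
A minimal sketch of the difference between the line anchors and the string anchors, assuming Python's re module and a made-up three-line text. One caveat for this flavor: Python's \Z matches only at the very end of the string, so it behaves like the “\z” of Java or .NET described in the note.

```python
import re

text = "alpha one\nbeta two\ngamma three"  # hypothetical sample

line_starts = re.findall(r"^\w+", text, re.MULTILINE)    # every line
string_start = re.findall(r"\A\w+", text, re.MULTILINE)  # whole string only
line_ends = re.findall(r"\w+$", text, re.MULTILINE)      # every line
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)    # whole string only

print(line_starts)   # ['alpha', 'beta', 'gamma']
print(string_start)  # ['alpha']
print(line_ends)     # ['one', 'two', 'three']
print(string_end)    # ['three']
```

Even with re.MULTILINE switched on, \A and \Z stay pinned to the start and end of the whole string.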

Dot meta-character in regex

  • The Dot meta-character matches any single character, without caring what that character is
  • Note : Dot does not match the newline character.
  • Dot is not a meta-character inside a character set [ ]
  • Dot is a very powerful regex metacharacter and it allows you to be lazy, i.e. you put a dot and almost anything matches
  • It is advised to use the dot sparingly: because it matches almost any character, it lets the regex match characters that we don’t want to match

Example : If you want to match the date format 03/11/2035 you can write a regex like “\d\d.\d\d.\d\d\d\d”, but it also matches 03-11-2015 and 03.11.2015
We can tighten the regex by using a character class like this: “\d\d[/]\d\d[/]\d\d\d\d”

With the dot we have no control over which characters can appear as the separator in the date, which is why the dot is risky to use in a regex.
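
The over-matching can be demonstrated with a quick sketch, assuming Python's re module. The last sample string ("03a11b2015") is invented to show that even letters are accepted as separators by the dot.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d\d\d")       # dot as separator
strict = re.compile(r"\d\d[/]\d\d[/]\d\d\d\d")  # only "/" as separator

samples = ["03/11/2035", "03-11-2015", "03.11.2015", "03a11b2015"]

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # all four samples
print(strict_hits)  # ['03/11/2035']

# The dot skips newlines unless re.DOTALL is set.
print(re.fullmatch(r"a.b", "a\nb"))                    # None
print(bool(re.fullmatch(r"a.b", "a\nb", re.DOTALL)))   # True
```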

Use Character Sets Instead of the Dot

  • The dot is mostly used when you know a special character can appear at some point in the search text. Instead, you can whitelist or blacklist the characters allowed there by using a character set or a negated character set
  • In the date example above we can use a negated character set to match the date

Q & A

Input : Hello world this is a date 03/11/2035 which has to be found using Regex
But it should not match 03\11\2035 or 03 11 2035 or 03-11-2035
Search : 03/11/2035
================================================================
Regex Using Character set : \d\d[/]\d\d[/]\d\d\d\d
Regex Using Negated Character set : \d\d[^a-zA-Z0-9\\\-. ]\d\d[^a-zA-Z0-9\\\-. ]\d\d\d\d
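
Both approaches can be compared on the Q & A text with the sketch below, assuming Python's re module. A negated set rejects only the characters it lists, so to keep 03\11\2035 out, the set in this sketch also lists the backslash (written \\ inside the class). The variable names are illustrative only.

```python
import re

text = (r"Hello world this is a date 03/11/2035 which has to be found using Regex "
        r"But it should not match 03\11\2035 or 03 11 2035 or 03-11-2035")

allow_slash = r"\d\d[/]\d\d[/]\d\d\d\d"
# Negated set that rejects letters, digits, backslash, hyphen, dot and space.
deny_listed = r"\d\d[^a-zA-Z0-9\\\-. ]\d\d[^a-zA-Z0-9\\\-. ]\d\d\d\d"
# Same set without the backslash: the backslash date slips through.
deny_no_backslash = r"\d\d[^a-zA-Z0-9\-. ]\d\d[^a-zA-Z0-9\-. ]\d\d\d\d"

print(re.findall(allow_slash, text))        # ['03/11/2035']
print(re.findall(deny_listed, text))        # ['03/11/2035']
print(re.findall(deny_no_backslash, text))  # also finds the backslash date
```

The whitelist version is usually easier to reason about, because every character it accepts is written out explicitly.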

Difference between Encoding and Character set.

To better understand encoding and character sets, the explanation and examples below use the UTF-8 encoding and the Unicode character set.

Data is stored on a hard disk in binary format, i.e. as 1s and 0s. A system has to read these bits and decode them to understand what the data is. If text was written to the hard disk using the UTF-8 encoding, it has to be decoded by a UTF-8 decoder; any other decoder may turn the text into gibberish. The decoded numbers are then mapped through a character set, which turns them into characters that humans can read.

Example : Suppose the characters 1 2 3 4 have to be written to a hard disk. The Unicode character set assigns them the numbers 49 50 51 52, and the UTF-8 encoder writes these as the bytes 00110001 00110010 00110011 00110100.

Example : Suppose “01101000 01100101 01101100 01101100 01101111” is the data stored on the hard disk. The UTF-8 decoding algorithm decodes it into the numbers “104 101 108 108 111”. When these numbers are looked up in the Unicode character set, they translate into “hello”.
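
The round trip between characters, numbers and bits can be sketched in Python using only built-in string and bytes methods. The sample character “é” in the last step is chosen for the illustration; it shows what decoding with the wrong decoder looks like.

```python
data = "hello".encode("utf-8")              # characters -> bytes

numbers = list(data)                        # each byte as a number
bits = [format(b, "08b") for b in data]     # each byte as 8 bits

print(numbers)  # [104, 101, 108, 108, 111]
print(bits)     # ['01101000', '01100101', '01101100', '01101100', '01101111']
print(data.decode("utf-8"))                 # hello

digit_bytes = list("1234".encode("utf-8"))  # digit characters, not the values 1-4
print(digit_bytes)  # [49, 50, 51, 52]

# Decoding UTF-8 bytes with a different decoder produces gibberish.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)  # Ã©
```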

Literal Character Search in Regex

Suppose you have a word to search for in a paragraph. This is how a regex engine will search for it.

Text : Hello World you are learning regex and trying to search a Word in this line.
Regex Search : Word

In the text “Hello World you are learning regex” you are searching for “Word”. The regex engine first compares the letter “W” of the regex with the letter “H” of the text. Since this is not a match, it tries “W” against the next letter, “e”, which is still not a match, then against “l”, and so on until it reaches the letter “W” at position 7 (counting the space). Since this is a match, the engine compares the next regex letter “o” with the letter right after “W” in the text, which is “o”: still a match. The next letter “r” also matches. But the next regex letter is “d”, while the letter after “r” in the text is “l”, which is not a match. The engine therefore gives up on this attempt, goes back to the letter right after the “W” and again compares each letter of the text with the first letter of the regex, “W”. None of them matches, so the search moves on until it meets the letter “W” again.

Note : Regex engines are case sensitive by default
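
The search described above can be imitated with a naive scan. This is only a sketch of the idea, with an invented helper name naive_find; real regex engines are considerably more sophisticated. It is compared against Python's re module, which is assumed here as the regex flavor.

```python
import re

text = "Hello World you are learning regex and trying to search a Word in this line."

def naive_find(pattern, text):
    """Try the literal pattern at each start position, left to right."""
    for start in range(len(text) - len(pattern) + 1):
        for offset, ch in enumerate(pattern):
            if text[start + offset] != ch:
                break          # mismatch: give up on this start position
        else:
            return start       # every letter matched
    return -1

position = naive_find("Word", text)
print(text[position:position + 4])  # Word

# Case sensitivity: "word" is not found unless the flag is set.
print(re.search("word", text))                          # None
print(re.search("word", text, re.IGNORECASE).group(0))  # Word
```

The scan skips past “World” because its fourth letter is “l”, not “d”, and only succeeds at the later “Word”.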

Directory Structure of Linux file system

  • bin – Essential command binaries are stored here
  • sbin – Essential system binaries
  • etc – Contains configuration files
  • home – Users’ home directories
  • lib – Contains common libraries
  • mnt – Temporarily mounted file systems, such as pen drives and CD drives, come under this directory
  • proc – Contains a virtual file system that exposes kernel info
  • tmp – Temporary files are stored in this directory; files in this dir typically get deleted after a restart
  • usr – Contains user programs and user data
  • var – (variable) Files and folders that the system writes to during operation, for example system logs