Use Extended Regular Expressions with the grep command

The grep command accepts a text string and a location as arguments and searches the location for that string. The string is called a pattern or expression. Grep assigns special meanings to several symbols so that users can build flexible patterns. This tutorial explains these special-meaning symbols and how they work.

Regular and Extended regular expressions

Grep assigns special meanings to a few characters. These characters are known as metacharacters. Initially, grep treated only the following characters as metacharacters. They are also known as the default metacharacters.

 ^ $ . [ ] *

Later, grep added the following characters to this list.

( ) { } ? + |

Use the -E option with the grep command to enable the special meanings of the later-added characters. Without the -E option, grep treats them as literals. A character is a literal when the command uses its original meaning. When the command uses the special meaning of a character instead, the character is a metacharacter.

For example, if you use the + sign without the -E option, grep searches for the + (plus) sign itself. Here, the plus is a literal. However, if you use it with the -E option, grep uses its special meaning: it matches one or more occurrences of the item that precedes it. Here, the plus is a metacharacter.
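
The difference between the two meanings is easy to check on a throwaway file. The following is a minimal sketch; the temporary file and its four sample lines are made up for illustration and are not part of the original tutorial.

```shell
# Create a small sample file (the contents are illustrative only).
sample=$(mktemp)
printf '%s\n' 'a+b' 'aab' 'ab' 'b' > "$sample"

# Without -E, + is a literal: only the line that contains "a+b" matches.
literal_matches=$(grep 'a+b' "$sample")

# With -E, + means "one or more of the preceding item", so the pattern
# matches one or more a characters followed directly by b.
extended_matches=$(grep -E 'a+b' "$sample")

echo "without -E:"
echo "$literal_matches"
echo "with -E:"
echo "$extended_matches"

rm -f "$sample"
```

Without -E only the line a+b is printed. With -E the lines aab and ab are printed, and a+b is not, because its a is followed by a plus sign rather than a b.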

Let us take another example.

The original implementation uses the pipe sign (|) as a regular character, while the new implementation defines it as a metacharacter. As a metacharacter, it allows you to search for multiple words.

Suppose you want to search for two user accounts, sanjay and rick, in the local database file. The /etc/passwd file stores local user accounts. The following command will not work for this.

#grep "sanjay|rick" /etc/passwd

Without the -E option, grep will treat it as a single word. It will search for the string sanjay|rick instead of searching for two separate words: sanjay and rick. Use the -E option to use the pipe sign for the special meaning. The following command searches for two words: sanjay and rick separately.

#grep -E "sanjay|rick" /etc/passwd
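
If you would rather not experiment on /etc/passwd, the same behavior can be reproduced on a small passwd-style file. This is only a sketch: the file contents, including the patrick account, are invented for the demonstration. The last command is an assumption about what you may want in practice. It uses the ^ anchor and the ( ) grouping characters from the lists above to match the user name field only.

```shell
# Build a small passwd-style file (all entries are made up for this sketch).
accounts=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sanjay:x:1001:1001::/home/sanjay:/bin/bash' \
  'rick:x:1002:1002::/home/rick:/bin/bash' \
  'patrick:x:1003:1003::/home/patrick:/bin/bash' > "$accounts"

# Without -E the pipe is a literal, so no line matches.
# grep exits with status 1 when nothing matches; "|| true" ignores that.
basic_count=$(grep -c 'sanjay|rick' "$accounts" || true)

# With -E the pipe separates two alternatives.
# Note that "rick" also matches inside "patrick".
extended_count=$(grep -Ec 'sanjay|rick' "$accounts")

# Anchoring the alternatives to the start of the line and to the
# following colon restricts the match to the user name field.
anchored_count=$(grep -Ec '^(sanjay|rick):' "$accounts")

echo "basic: $basic_count, extended: $extended_count, anchored: $anchored_count"
rm -f "$accounts"
```

The counts show that the unanchored pattern matches any line that contains either word anywhere, which may be more than the two accounts you had in mind.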

The -E option instructs grep to use the special meaning of the pipe. In its special meaning, the pipe separates alternatives: grep prints a line if the line matches any one of them. You can use the pipe sign multiple times in a pattern to search for several words at once. For example, the following command searches for the words abc, fgh, xyz, mno and jkl.

#grep -E "abc|fgh|xyz|mno|jkl" [source file]

The grep command regex example

This section presents a small project-based example of regexes. This project extracts all links from an HTML source file. You can use the source code of any webpage to create an HTML source file for practice.

Start a web browser and open the webpage from which you want to extract links. Press CTRL + U to display the source code of the web page.

Press CTRL + A to select all of the code, and then press CTRL + C to copy it.

Open a terminal and create a new text file in a text editor. Right-click inside the editor window and select the paste option.

Save the file.

The following command extracts all links from the file.

#grep -Eoi '<a[^>]+>.*</a>' html_file

The following outline explains the above command.

Options

-E  This option tells grep that the search pattern uses the later-added metacharacters.

-o  By default, grep prints the entire line that contains a match. This option makes it print only the matching part of the line.

-i  This option tells grep to ignore case while matching the pattern.
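
The effect of each option can be seen by switching them on one at a time. The sketch below uses a one-line sample file with an uppercase anchor tag; the file and the example.com URL are assumptions made for the demonstration.

```shell
# One-line sample with an uppercase tag (illustrative content only).
page=$(mktemp)
printf '%s\n' 'Visit <A HREF="https://example.com/">Example</A> today.' > "$page"

# -E alone: the match is case-sensitive, so the uppercase <A is missed.
# "|| true" ignores the non-zero exit status grep returns on no match.
case_sensitive=$(grep -E '<a[^>]+>' "$page" || true)

# Adding -i ignores case; grep still prints the whole matching line.
whole_line=$(grep -Ei '<a[^>]+>' "$page")

# Adding -o prints only the part of the line that matched the pattern.
only_match=$(grep -Eoi '<a[^>]+>' "$page")

echo "case-sensitive: [$case_sensitive]"
echo "whole line:     [$whole_line]"
echo "match only:     [$only_match]"
rm -f "$page"
```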

Special meaning metacharacters

  • <a  Starting point of the anchor tag.
  • [^>]  Match any single character except the > symbol.
  • +  Match the preceding item one or more times.
  • >  Ending point of the opening anchor tag.

The above pattern searches for text that starts with <a and then matches every character that follows until it reaches a > sign. The + sign repeats the preceding bracket expression [^>] one or more times, so at least one character must appear between <a and >. The > sign closes the opening anchor tag. Together, <a[^>]+> matches the whole opening tag, from <a to >.

  • .*  A dot (.) matches any single character. A star (*) repeats the preceding item zero or more times. Together, they match any run of characters, so this part of the pattern matches the text between the opening and closing anchor tags.
  • </a>  This is the closing anchor tag.

Collectively, the above pattern matches a text string that starts with <a, continues up to the first >, contains any text after it, and ends with </a>.
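
One consequence of .* is worth checking on a small file: it is greedy, so when a line holds several links, the match runs from the first <a to the last </a> on that line. The sketch below illustrates this with two made-up lines of HTML. The second pattern is not from the original tutorial; it is one possible variant that assumes the link text contains no nested tags.

```shell
# Two sample lines; the second one holds two links (illustrative content).
page=$(mktemp)
printf '%s\n' \
  '<p><a href="https://example.com/one">One</a></p>' \
  '<p><a href="https://example.com/two">Two</a> and <a href="https://example.com/three">Three</a></p>' > "$page"

# Tutorial pattern: .* is greedy, so the second line yields a single long
# match that spans both links.
greedy=$(grep -Eoi '<a[^>]+>.*</a>' "$page")
greedy_count=$(printf '%s\n' "$greedy" | wc -l | tr -d ' ')

# Variant (an assumption): [^<]* stops at the next <, so each link is
# reported separately as long as the link text has no nested tags.
separate=$(grep -Eoi '<a[^>]+>[^<]*</a>' "$page")
separate_count=$(printf '%s\n' "$separate" | wc -l | tr -d ' ')

echo "tutorial pattern: $greedy_count matches"
echo "variant pattern:  $separate_count matches"
rm -f "$page"
```

Which pattern suits you depends on the page. The variant splits links that share a line, but it skips links whose text contains other tags such as <b> or <img>.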

Displaying only anchor tags

If you need only the opening anchor tags, use the following command. It drops the part of the expression that matches the linked text and the closing tag.

#grep -Eoi '<a[^>]+>' html_file
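
As a quick check, the sketch below runs this pattern on a single made-up line that contains two links. Because [^>]+ cannot cross a > sign, each opening tag is printed on its own output line.

```shell
# One sample line with two links (illustrative content only).
page=$(mktemp)
printf '%s\n' \
  '<p><a href="https://example.com/two">Two</a> and <a href="https://example.com/three">Three</a></p>' > "$page"

# Each match ends at the first > after <a, so the two opening tags are
# reported separately even though they sit on the same line.
tags=$(grep -Eoi '<a[^>]+>' "$page")

echo "$tags"
rm -f "$page"
```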

Extracting all links and saving them in a text file

Combine the following three commands to extract all links or URLs from an HTML file.

#grep -Eoi '<a[^>]+>' html_file
#grep -Eo 'href="[^"]+"'
#grep -Eo 'https://[^"]+' > link-only

The following syntax combines the above commands.

#grep -Eoi '<a[^>]+>' html_file | grep -Eo 'href="[^"]+"' | grep -Eo 'https://[^"]+'

To save all links in a text file, redirect the final output to the file.

#grep -Eoi '<a[^>]+>' html_file | grep -Eo 'href="[^"]+"' | grep -Eo 'https://[^"]+' > link-only

  • The first command receives its input from the file named html_file. The second command receives its input from the first command. The third command receives its input from the second command.
  • The first command extracts all opening anchor tags from the file and sends its output to the second command.
  • The second command extracts all href attributes from the output of the first command and sends the result to the third command.
  • The third command extracts all links that start with https:// from the second command's output and saves them to the text file.
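
The three stages can be tried end to end on a small sample before running them on a real page. In the sketch below, the sample HTML, the temporary file names, and the example.com and example.org URLs are all made up. The last command is a variant that is not part of the original tutorial: it uses the ? metacharacter to make the s optional so that http:// links are kept as well.

```shell
# Sample page with an https link, an http link, and a relative link.
page=$(mktemp)
links=$(mktemp)
printf '%s\n' \
  '<a href="https://example.com/docs">Docs</a>' \
  '<a class="nav" href="http://example.org/old">Old</a>' \
  '<a href="/about">About</a>' > "$page"

# Tutorial pipeline: opening tags, then href attributes, then https URLs.
# The final output is redirected to a file.
grep -Eoi '<a[^>]+>' "$page" | grep -Eo 'href="[^"]+"' | grep -Eo 'https://[^"]+' > "$links"
https_links=$(cat "$links")

# Variant (an assumption): s? matches zero or one s, so both http:// and
# https:// links pass the last stage.
all_links=$(grep -Eoi '<a[^>]+>' "$page" | grep -Eo 'href="[^"]+"' | grep -Eo 'https?://[^"]+')

echo "https only:"
echo "$https_links"
echo "http and https:"
echo "$all_links"
rm -f "$page" "$links"
```

The relative link /about is dropped by both versions because the last stage keeps only absolute URLs. The second stage is also case-sensitive, so an attribute written as HREF would be skipped unless you add the -i option there.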

This tutorial is part of the series "The grep command in Linux: - usage, options, and syntax explained through examples.". The parts of this series are as follows:

Chapter 1  grep options, regex, parameters and regular expressions
Chapter 2  Grep Command in Linux Explained with Practical Examples
Chapter 3  Use Extended Regular Expressions with the grep command
Chapter 4  grep regex Practical Examples of Regular Expressions

Conclusion

The grep command is a powerful tool for extracting specific data from a text source. This tutorial explained how extended regular expressions work with grep through an example that extracts all links from an HTML source file.
