Walking in the park with Regular Expression

Vicente G. Reyes
·Sep 9, 2021·


I. What is RegEx?

II. Common uses of RegExes

III. Benefits of using RegExes

IV. Examples of Metacharacters of Regular Expressions

What is RegEx or a Regular Expression?

A RegEx, or regular expression, is a pattern used to match text. A RegEx lets us describe a string or a piece of structured text abstractly and then check whether other strings match that description. RegExes can get complicated and hard to follow, but they are very useful because they let us accomplish a lot through pattern matching.

I'll try to show examples of how powerful RegExes are, using Python's re module.
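As a first taste, here is a minimal sketch with Python's built-in re module. The pattern and the sample strings are made up for illustration: re.search looks for the pattern anywhere in the text, while re.fullmatch requires the whole string to fit the pattern.

```python
import re

# Hypothetical pattern: three digits, a dash, then four digits.
pattern = r"\d{3}-\d{4}"

# re.search returns a match object if the pattern occurs anywhere in the text.
found = re.search(pattern, "Call 555-1234 today")
print(found.group())  # 555-1234

# re.fullmatch succeeds only if the entire string fits the pattern.
print(bool(re.fullmatch(pattern, "555-1234")))             # True
print(bool(re.fullmatch(pattern, "Call 555-1234 today")))  # False
```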

Common uses of RegEx

  • Validating data, such as a phone number that contains only digits, brackets and dashes, or an email address that allows any combination of A-Z in both upper and lower case plus a few special characters such as: ! # $ % & ' * + - / = ? ^ _ ` { | } ~

    • Reference: https://www.abstractapi.com/tools/email-regex-guide

      i. Email validation example in Python:

      import re

      # Raw string in double quotes, so the apostrophe and the backslash need no extra escaping.
      regex = r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"

      def check(email):
          if re.search(regex, email):
              print("Valid Email")
          else:
              print("Invalid Email")
      

      ii. Email validation in Javascript:

      function ValidateEmail(inputText)
      {
        var mailformat = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;
        if (inputText.value.match(mailformat))
        {
            return true;
        }
        // Only complain when the input does NOT match the pattern.
        alert("This is not a valid email address");
        return false;
      }
      

iii. Email validation in Go

package main
import (
    "fmt"
    "regexp"
)
var emailRegex = regexp.MustCompile("^[a-zA-Z0-9.!#$%&'*+\\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$")

func main() {
    e := "test@testing.com"
    if isEmailValid(e) {
        fmt.Println(e + " is a valid email")
    }
    if !isEmailValid("just text") {
        fmt.Println("not a valid email")
    }
}
func isEmailValid(e string) bool {
    if len(e) < 3 || len(e) > 254 {
        return false
    }
    return emailRegex.MatchString(e)
}
  • Replacement: To find and replace a word or set of words with other characters.

    1. Transforming a title into a URL slug

       >>> content = "This is a title."
       >>> slug = content.replace(' ', '-').lower()
       >>> slug
       'this-is-a-title.'
      
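      // JavaScript: a string pattern replaces only the first space, so use a global regex
      const content = "This is a title.";
      const slug = content.toLocaleLowerCase().replace(/ /g, '-');
      // slug is 'this-is-a-title.'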
      
      package main
      
      import (
       "fmt"
       "strings"
      )
      
      func main() {
        title := "This is a title."
        // Lowercase first, then replace every space with a hyphen.
        slug := strings.ReplaceAll(strings.ToLower(title), " ", "-")
        fmt.Println(slug)
      }
      this-is-a-title.
      
  • Scraping: Finding occurrences of text in a website, PDF, Excel/CSV file and the like.

  • String parsing to retrieve data from structured strings such as URLs or logs.
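The four uses listed above can each be sketched in a few lines with Python's re module. Everything named here (the helper functions, the phone rule, and the log layout) is a hypothetical illustration, not a production-ready validator or parser.

```python
import re

def is_phone_like(text):
    """Validation: only digits, brackets, dashes and spaces, with at least one digit."""
    return (re.fullmatch(r"[0-9()\- ]+", text) is not None
            and re.search(r"\d", text) is not None)

def slugify(title):
    """Replacement: lowercase, then turn each run of non-alphanumerics into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def find_years(text):
    """Scraping: collect every standalone four-digit number in a block of text."""
    return re.findall(r"\b\d{4}\b", text)

# Assumed log layout for this sketch: "<LEVEL> <HH:MM:SS> <message>"
LOG_LINE = re.compile(r"(?P<level>[A-Z]+) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.+)")

def parse_log_line(line):
    """Parsing: pull named fields out of a structured string, or None if it does not fit."""
    m = LOG_LINE.fullmatch(line)
    return m.groupdict() if m else None

print(is_phone_like("(555) 123-4567"))                     # True
print(is_phone_like("555-CALL-NOW"))                       # False
print(slugify("This is a title."))                         # this-is-a-title
print(find_years("Founded in 1998, relaunched in 2021."))  # ['1998', '2021']
print(parse_log_line("ERROR 12:30:05 Disk full"))
```

Unlike a plain replace on spaces, the re.sub version also drops the trailing period, because any run of characters outside a-z and 0-9 becomes a single hyphen before the edges are trimmed.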

Benefits of RegExes

“Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.” – Jamie Zawinski. Excerpt from: Jaime Buelta, “Python Automation Cookbook.” Apple Books.

Keep it simple, stupid.

Overcomplicating a pattern will just wear out your brain and waste your time. Regular expressions work best when they are kept very simple. If another tool can finish the job in five minutes, use that instead. Unless you don't know how bad you are at estimating time; in that case, do as you wish.

Examples of RegExes: Metacharacters

What is a RegEx metacharacter?

  • RegEx metacharacters let you change how you match data. A regex without metacharacters simply matches the exact substring.

  • ".": A period matches any character other than a newline. Example:

# match any single character followed by "abb"
>>> import pandas as pd
>>> words = pd.Series(['abaa', 'cabb', 'Abaa', 'sabb', 'dcbb'])
>>> words.str.contains(".abb")
  1. "[ ]": Square brackets specify a set of characters, any one of which may match at that position.
>>> words.str.contains("[Aa]baa")
  2. "^": Outside of square brackets, the caret matches at the beginning of a string.
>>> sentence = pd.Series(["Where did he go", "He went to the shop", "He is good"])
>>> sentence.str.contains("^(He|he)")
  3. "( )": Parentheses are used for grouping and for enforcing the order of operations, just as in math and logical expressions.

  4. "*": An asterisk matches 0 or more copies of the preceding character.

  5. "?": A question mark matches 0 or 1 copy of the preceding character.
  6. "+": A plus sign matches 1 or more copies of the preceding character.
  7. "{ }": Curly braces match the preceding element a specified number of times.
    1. "{a}": The preceding element is matched exactly a times.
    2. "{a,}": The preceding element is matched a or more times.
    3. "{m,n}": The preceding element is matched between m and n times.
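The quantifiers and grouping rules above can be checked quickly with the re module. This is a small sketch; the sample words and the matching helper are made up for the illustration, and re.fullmatch is used so that a pattern must account for the whole word.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"ca*t"))      # ['ct', 'cat', 'caat', 'caaat']  0 or more a
print(matching(r"ca?t"))      # ['ct', 'cat']                   0 or 1 a
print(matching(r"ca+t"))      # ['cat', 'caat', 'caaat']        1 or more a
print(matching(r"ca{2}t"))    # ['caat']                        exactly 2 a
print(matching(r"ca{2,}t"))   # ['caat', 'caaat']               2 or more a
print(matching(r"ca{1,2}t"))  # ['cat', 'caat']                 between 1 and 2 a
print(matching(r"(ca)+t"))    # ['cat']                         the group "ca" repeated
print(matching(r"c.t"))       # ['cat']                         any one character between c and t
```

Note that Python's re expects no space inside the braces: "{2,}" is a quantifier, while "{2, }" is treated as literal text.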

Special characters that allow us to specify certain common character types

  • [a-z]: Match any lowercase letter
  • [A-Z]: Match any uppercase letter
  • [0-9]: Match any digit
  • [a-zA-Z0-9]: Match any letter or digit

    Adding "^" as the first character inside square brackets matches any character not in the set

    • [^a-z]: Match any character that is not a lowercase letter
    • [^A-Z]: Match any character that is not an uppercase letter
    • [^0-9]: Match any character that is not a digit
    • [^a-zA-Z0-9]: Match any character that is not a letter or a digit

      Shorthand regex for common sequences:

    • \d: Match any digit

    • \D: Match any non-digit
    • \w: Match a word character (a letter, a digit or an underscore)
    • \W: Match a non-word character
    • \s: Match whitespace (spaces, tabs, newlines and so on.)
    • \S: Match non-whitespace
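To see the sets and shorthands above side by side, here is a short sketch using re.findall and re.split. The sample sentence is invented for the example.

```python
import re

text = "Order #42 shipped to Box_7 on 2021-09-09"

# Explicit character sets
print(re.findall(r"[A-Z]", text))   # ['O', 'B']
print(re.findall(r"[0-9]+", text))  # ['42', '7', '2021', '09', '09']

# Shorthands: \d gives the same result as [0-9] on this ASCII text
print(re.findall(r"\d+", text))     # ['42', '7', '2021', '09', '09']
print(re.findall(r"\w+", text))     # ['Order', '42', 'shipped', 'to', 'Box_7', 'on', '2021', '09', '09']

# Negated set: anything that is not a letter, a digit or whitespace
print(re.findall(r"[^a-zA-Z0-9\s]", text))  # ['#', '_', '-', '-']

# \s+ splits on runs of whitespace
print(re.split(r"\s+", text))  # ['Order', '#42', 'shipped', 'to', 'Box_7', 'on', '2021-09-09']
```

The underscore shows up in the \w+ result (Box_7 stays in one piece) but also in the negated-set result, because \w counts it as a word character while [a-zA-Z0-9] does not.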

      Reference: Hands-On Exploratory Data Analysis with Python - Packt Publishing