Walking in the park with Regular Expression
I. What is RegEx?
II. Common uses of RegExes
III. Benefits of using RegExes
IV. Examples of Metacharacters of Regular Expressions
What is RegEx or a Regular Expression?
A RegEx, or regular expression, is a pattern used to match text. It lets us describe a set of strings abstractly and then check whether other strings match that description. RegExes can get extremely complicated and hard to follow, but they are very useful because they let us accomplish a lot through pattern matching.
I'll show some examples, using Python's re module, to illustrate how powerful RegExes are.
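As a first taste, here is a minimal sketch of the two re functions that come up most often. The sample text is made up for illustration, and the pattern uses shorthand that is explained later in this post.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2024-05-17."

# re.search returns a match object for the first match, or None if there is none.
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
date = match.group() if match else None
print(date)  # 2024-05-17

# re.findall returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['66', '2024', '05', '17']
```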
Common uses of RegEx
Validating data, such as a phone number that may contain only digits, brackets and dashes, or an email address that may contain any combination of the letters A-Z in upper and lower case, digits, and a few special characters such as:
! # $ % & ' * + - / = ? ^ _ ` { | } ~
Reference: https://www.abstractapi.com/tools/email-regex-guide
i. Email validation example in Python:
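import re

# Raw, double-quoted string: the backslash reaches the regex engine unchanged
# and the apostrophe inside the character set needs no escaping.
regex = r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"

def check(email):
    # re.search returns a match object on success and None otherwise.
    if re.search(regex, email):
        print("Valid Email")
    else:
        print("Invalid Email")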
ii. Email validation in JavaScript:
function ValidateEmail(inputText) {
  var mailformat = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;
  // Accept the input when it matches; otherwise warn and reject it.
  if (inputText.value.match(mailformat)) {
    return true;
  }
  alert("This is not a valid email address");
  return false;
}
iii. Email validation in Go:
package main

import (
	"fmt"
	"regexp"
)

var emailRegex = regexp.MustCompile("^[a-zA-Z0-9.!#$%&'*+\\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$")

func main() {
	e := "test@testing.com"
	if isEmailValid(e) {
		fmt.Println(e + " is a valid email")
	}
	if !isEmailValid("just text") {
		fmt.Println("not a valid email")
	}
}

func isEmailValid(e string) bool {
	// Reject addresses that are too short or too long before running the regex.
	// The two conditions must be joined with ||; with && the check could never be true.
	if len(e) < 3 || len(e) > 254 {
		return false
	}
	return emailRegex.MatchString(e)
}
Replacement: finding and replacing a word or set of words with other characters.
Example: transforming a title into a URL slug.
i. In Python:
>>> content = "This is a title."
>>> slug = content.replace(' ', '-').lower()
>>> slug
'this-is-a-title.'
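ii. In JavaScript:
const content = "This is a title.";
// A string pattern replaces only the first space, so use a global regex instead.
const slug = content.toLocaleLowerCase().replace(/ /g, '-');
console.log(slug); // this-is-a-title.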
iii. In Go:
package main

import (
	"fmt"
	"strings"
)

func main() {
	fmt.Println(
		// Lowercase the title.
		strings.ToLower("This is a title."),
		// Replace up to 4 spaces with hyphens.
		strings.Replace("this is a title", " ", "-", 4),
	)
}

Output:
this is a title. this-is-a-title
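The snippets above rely on plain string replacement, which leaves punctuation such as the trailing period in place. As a regex-based alternative, here is a minimal sketch using re.sub. The slugify helper is a hypothetical name introduced for this example, and it assumes the slug should contain only a-z, 0-9 and hyphens.

```python
import re

def slugify(title: str) -> str:
    """Turn a title into a URL slug (hypothetical helper for illustration)."""
    lowered = title.lower()
    # Replace every run of characters other than a-z or 0-9 with a single hyphen.
    hyphenated = re.sub(r"[^a-z0-9]+", "-", lowered)
    # Drop hyphens left over at either end.
    return hyphenated.strip("-")

print(slugify("This is a title."))  # this-is-a-title
```

Because the pattern matches runs of unwanted characters, repeated spaces or punctuation collapse into one hyphen.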
Scraping: finding occurrences of text in a website, PDF, Excel/CSV file and the like.
- String parsing to retrieve data from structured strings such as URLs or logs.
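To make the last two uses concrete, here is a minimal sketch that finds every matching line in a block of text and pulls named fields out of each one. The log format and its contents are invented for this example.

```python
import re

# Hypothetical log lines in a made-up format.
log = """\
2024-05-17 10:15:02 ERROR disk full on /dev/sda1
2024-05-17 10:15:07 INFO retrying write
2024-05-17 10:16:40 ERROR write failed
"""

# Named groups label each field; re.MULTILINE lets ^ and $ match at every line.
line_pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$",
    re.MULTILINE,
)

# One dictionary per log line, keyed by group name.
records = [m.groupdict() for m in line_pattern.finditer(log)]
errors = [r["message"] for r in records if r["level"] == "ERROR"]
print(errors)  # ['disk full on /dev/sda1', 'write failed']
```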
Benefits of RegExes
“Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.” – Jamie Zawinski
Excerpt from: Jaime Buelta, “Python Automation Cookbook.” Apple Books.
Keep it simple, stupid.
Overcomplicating a pattern will just waste your time. Regular expressions are best used when they are kept very simple. If another tool exists that can finish the job in five minutes, use that. Of course, if you already know how bad you are at estimating time, do as you wish.
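As one illustration of reaching for another tool, URLs can be taken apart with Python's standard urllib.parse module instead of a hand-written pattern. This is a minimal sketch, and the URL is made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration.
url = "https://example.com/posts/regex-intro?page=2&sort=new"

# urlsplit breaks the URL into scheme, host, path, query and fragment.
parts = urlsplit(url)
print(parts.netloc)  # example.com
print(parts.path)    # /posts/regex-intro

# parse_qs turns the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)  # {'page': ['2'], 'sort': ['new']}
```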
Examples of RegExes: Metacharacters
What is a RegEx metacharacter?
RegEx metacharacters let you change how you match data. A regex without metacharacters simply matches the exact substring.
- ".": A period matches any character other than a newline. Example:
# match any substring made of one character followed by "abb"
>>> import pandas as pd
>>> words = pd.Series(['abaa', 'cabb', 'Abaa', 'sabb', 'dcbb'])
>>> words.str.contains(".abb")
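The difference between a literal pattern and one containing a metacharacter can also be seen with the re module directly. This is a minimal sketch on a made-up sentence.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "The bill is 3.50 dollars, not 3950."

# Without metacharacters the pattern matches the exact substring.
literal_hits = re.findall(r"bill", text)
print(literal_hits)  # ['bill']

# An unescaped "." is a metacharacter: it matches any character except a newline.
dot_hits = re.findall(r"3.50", text)
print(dot_hits)  # ['3.50', '3950']

# Escaping the period, by hand or with re.escape, makes it literal again.
escaped_hits = re.findall(r"3\.50", text)
auto_escaped_hits = re.findall(re.escape("3.50"), text)
print(escaped_hits, auto_escaped_hits)  # ['3.50'] ['3.50']
```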
- "[ ]": Square brackets specify a set of characters, any one of which may match at that position.
# match "abaa" or "Abaa"
>>> words.str.contains("[Aa]baa")
- "^": Outside of square brackets, the caret matches only at the beginning of a string.
>>> sentence = pd.Series(["Where did he go", "He went to the shop", "He is good"])
>>> sentence.str.contains("^(He|he)")
- "( )": Parentheses are used for grouping and for enforcing the proper order of operations, just as in math and logical expressions.
- "*": An asterisk matches 0 or more copies of the preceding character.
- "?": A question mark matches 0 or 1 copy of the preceding character.
- "+": A plus sign matches 1 or more copies of the preceding character.
- "{ }": Curly braces match the preceding element a specified number of times:
  - "{a}": The preceding element is matched exactly a times.
  - "{a,}": The preceding element is matched a times or more.
  - "{m,n}": The preceding element is matched between m and n times.
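The quantifiers and grouping described above can be compared side by side. This is a minimal sketch: the matches helper is a hypothetical name introduced for this example, and it uses re.fullmatch so that a word counts only when the whole word fits the pattern.

```python
import re

def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["ct", "cat", "caat", "caaat"]

print(matches(r"ca*t", words))      # ['ct', 'cat', 'caat', 'caaat']
print(matches(r"ca?t", words))      # ['ct', 'cat']
print(matches(r"ca+t", words))      # ['cat', 'caat', 'caaat']
print(matches(r"ca{2}t", words))    # ['caat']
print(matches(r"ca{2,}t", words))   # ['caat', 'caaat']
print(matches(r"ca{1,2}t", words))  # ['cat', 'caat']

# Parentheses make the quantifier apply to the whole group "ab".
print(matches(r"(ab)+", ["ab", "abab", "abb"]))  # ['ab', 'abab']
```

Note that in Python the counts inside the braces are written without spaces, for example {2,} rather than {2, }.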
Character ranges inside square brackets allow us to specify certain common character types:
- [a-z]: Match any lowercase letter
- [A-Z]: Match any uppercase letter
- [0-9]: Match any digit
- [a-zA-Z0-9]: Match any letter or digit
Adding "^" as the first character inside square brackets matches any character not in the set:
- [^a-z]: Match any character that is not a lowercase letter
- [^A-Z]: Match any character that is not an uppercase letter
- [^0-9]: Match any character that is not a digit
- [^a-zA-Z0-9]: Match any character that is not a letter or a digit
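Here is a minimal sketch of these sets and their negated forms at work with the re module. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Room 42B, Floor 7!"

uppercase = re.findall(r"[A-Z]", text)
print(uppercase)  # ['R', 'B', 'F']

digits = re.findall(r"[0-9]", text)
print(digits)  # ['4', '2', '7']

# A "+" after the set matches whole runs of letters and digits.
tokens = re.findall(r"[a-zA-Z0-9]+", text)
print(tokens)  # ['Room', '42B', 'Floor', '7']

# The negated set matches everything that is not a letter or a digit.
others = re.findall(r"[^a-zA-Z0-9]", text)
print(others)  # [' ', ',', ' ', ' ', '!']

# Deleting every non-digit keeps only the digits.
only_digits = re.sub(r"[^0-9]", "", text)
print(only_digits)  # 427
```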
Shorthand regex for common sequences:
- \d: Match any digit
- \D: Match any non-digit
- \w: Match a word character (a letter, digit, or underscore)
- \W: Match a non-word character
- \s: Match whitespace (spaces, tabs, newlines and so on)
- \S: Match non-whitespace
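The shorthand classes can be tried out the same way. This is a minimal sketch on a made-up string. Note that for Python str patterns, \d, \w and \s also match non-ASCII digits, letters and whitespace unless the re.ASCII flag is passed.

```python
import re

# Hypothetical sample text containing a tab, used only for illustration.
text = "Call 555-0134\tbefore 9 am"

digit_runs = re.findall(r"\d+", text)
print(digit_runs)  # ['555', '0134', '9']

word_runs = re.findall(r"\w+", text)
print(word_runs)  # ['Call', '555', '0134', 'before', '9', 'am']

# Splitting on runs of whitespace handles the tab and the spaces alike.
fields = re.split(r"\s+", text)
print(fields)  # ['Call', '555-0134', 'before', '9', 'am']

# Removing every non-digit leaves only the digits.
digits_only = re.sub(r"\D", "", text)
print(digits_only)  # 55501349

non_word = re.findall(r"\W", text)
print(non_word)  # [' ', '-', '\t', ' ', ' ']
```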
Reference: Hands-On Exploratory Data Analysis with Python - Packt Publishing