An Alternative to Confusing RegEx

RegEx (short for Regular Expression) is a tool for matching specific patterns of text inside a larger body of text. It is extremely useful for people doing Text Analytics or Natural Language Processing.

However, it is confusing and hard to use because it relies on so many cryptic meta-characters. Luckily, a package was created to help with this problem. It is called VerbalExpressions.

RegEx

Before we talk about VerbalExpressions, let's review basic Regular Expression syntax first.

[az] - match either a or z
[a-z] - match any character from a to z
[A-Z] - match any character from A to Z
[a-zA-Z] - match any character from a to z or A to Z
[1-4] - match any digit from 1 to 4
[^a-z] - match any character that is NOT from a to z
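
To see these character classes in action, here is a small sketch using Python's built-in re module. The sample string is made up for illustration.

```python
import re

# Made-up sample text for illustration
sample = "a z m Q 3 7"

only_a_or_z = re.findall(r"[az]", sample)     # the characters a or z only
lowercase = re.findall(r"[a-z]", sample)      # any lowercase letter
digits_1_to_4 = re.findall(r"[1-4]", sample)  # any digit from 1 to 4
# The leading ^ negates the class: anything that is NOT a lowercase letter or a space
not_lowercase = re.findall(r"[^a-z ]", sample)

print(only_a_or_z)    # ['a', 'z']
print(lowercase)      # ['a', 'z', 'm']
print(digits_1_to_4)  # ['3']
print(not_lowercase)  # ['Q', '3', '7']
```

Note that a ^ inside square brackets negates the class, while a ^ outside brackets anchors the match to the start of the line.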

^ - Matches the start of the line
$ - Matches the end of the line

* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly n occurrences
{n,} - n or more occurrences
{n,m} - Between n and m occurrences
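
The anchors and quantifiers are easier to remember once you see them filter a list. Below is a minimal sketch with the built-in re module; the word list is invented for this example.

```python
import re

# Invented inputs: the letter "o" repeated 0, 1, 2, and 3 times
words = ["ggle", "gogle", "google", "gooogle"]

def keep(pattern):
    """Return the words that the pattern matches."""
    return [w for w in words if re.search(pattern, w)]

# ^ and $ force the pattern to cover the whole string
zero_or_more = keep(r"^go*gle$")     # * allows zero or more o's
one_or_more = keep(r"^go+gle$")      # + needs at least one o
zero_or_one = keep(r"^go?gle$")      # ? allows at most one o
exactly_two = keep(r"^go{2}gle$")    # {2} needs exactly two o's
two_or_more = keep(r"^go{2,}gle$")   # {2,} needs two or more o's
one_to_two = keep(r"^go{1,2}gle$")   # {1,2} needs one or two o's

print(zero_or_more)  # ['ggle', 'gogle', 'google', 'gooogle']
print(one_or_more)   # ['gogle', 'google', 'gooogle']
print(zero_or_one)   # ['ggle', 'gogle']
print(exactly_two)   # ['google']
print(two_or_more)   # ['google', 'gooogle']
print(one_to_two)    # ['gogle', 'google']
```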

As we can see, there is a lot to remember before we can use Regular Expressions comfortably. Most of the time, I end up searching Google or StackOverflow whenever I have a RegEx question.

Luckily, we can use the VerbalExpressions package to help us with this process.

VerbalExpressions

import re

from verbalexpressions import VerEx

Let's start with character ranges in RegEx. We can easily replace the bracket syntax with .range().

verbal_expression.range("az")            --> [az]
verbal_expression.range("a","z")         --> [a-z]
verbal_expression.range("A","Z")         --> [A-Z]
verbal_expression.range("a","z","A","Z") --> [a-zA-Z]

string = "a b c A B D 1 2 3 ab"

verbal_expression = VerEx()
verbal_expression.range("a","c")
print(verbal_expression.source())
re.findall(verbal_expression.source(),string)

([a-c])
['a', 'b', 'c', 'a', 'b']

verbal_expression = VerEx()
verbal_expression.range("A","C")
print(verbal_expression.source())
re.findall(verbal_expression.source(),string)

([A-C])
['A', 'B']

verbal_expression = VerEx()
verbal_expression.range("a","c","A","C","1","2")  # pass the digits as strings, like the letters
print(verbal_expression.source())
re.findall(verbal_expression.source(),string)

([a-cA-C1-2])
['a', 'b', 'c', 'A', 'B', '1', '2', 'a', 'b']

We can also anchor a match to the start or the end of a line with VerbalExpressions.

verbal_expression.start_of_line() --> ^
verbal_expression.end_of_line()   --> $

In addition, we can state which strings must appear and which strings are optional. We can also exclude characters that we do not want.

verbal_expression.find()          --> find a string
verbal_expression.maybe()         --> optionally find a string
verbal_expression.anything_but()  --> match anything except the given characters
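
To get a feel for the kind of pattern fragments these three methods stand for, here is a minimal sketch in plain Python. The helper functions below are hypothetical stand-ins written for this article. They are not the library's implementation, and the sample strings are made up.

```python
import re

# Hypothetical helpers (NOT the VerbalExpressions source code)
def find(value):
    # Required literal text, escaped so that characters like "." stay literal
    return "(" + re.escape(value) + ")"

def maybe(value):
    # Optional literal text: the trailing ? means zero or one occurrence
    return "(" + re.escape(value) + ")?"

def anything_but(chars):
    # Zero or more characters that are NOT in `chars`
    return "([^" + re.escape(chars) + "]*)"

# "cat", optionally followed by "s", then anything that is not "!"
pattern = "^" + find("cat") + maybe("s") + anything_but("!") + "$"

results = {text: bool(re.match(pattern, text))
           for text in ["cat", "cats", "cats are nice", "cats!"]}
for text, ok in results.items():
    print(f"{text!r} -> {ok}")
```

Only "cats!" is rejected, because the exclamation mark is the one character that anything_but("!") refuses to consume.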

Let's try to check whether a URL is well-formed. We want to accept links such as these:

  + http://google.com
  + http://www.google.com
  + https://google.com
  + https://www.google.com
  + https://www.google.com/doodles/

verbal_expression = VerEx()
tester = (verbal_expression.
            start_of_line().
            find('http').
            maybe('s').
            find('://').
            maybe('www.').
            anything_but(' ').
            end_of_line()
)

# The equivalent RegEx pattern that we would otherwise have to write ourselves
print(tester.source())

^(http)(s)?(://)(www\.)?([^\ ]*)$

# Create some example URLs
test_url = ["http://www.google.com", "https://www.google.com",
            "http://google.com", "https://google.com",
            "http://www.google.com/doodles/", "http://www.google.com /doodles/"]

# Test whether each URL is valid
for text in test_url:
    if tester.match(text):
        print(f"{text} is a Valid URL")
    else:
        print(f"{text} is not a Valid URL")

http://www.google.com is a Valid URL
https://www.google.com is a Valid URL
http://google.com is a Valid URL
https://google.com is a Valid URL
http://www.google.com/doodles/ is a Valid URL
http://www.google.com /doodles/ is not a Valid URL
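
Since the printed pattern is an ordinary regular expression, the same check can be reproduced with the built-in re module alone. The sketch below copies the pattern from the printed output above. The extra "http://" test case is my own addition, to show that this pattern is fairly permissive.

```python
import re

# Pattern copied from the output of tester.source() shown earlier
url_pattern = re.compile(r"^(http)(s)?(://)(www\.)?([^\ ]*)$")

candidates = [
    "http://www.google.com",
    "https://google.com",
    "http://www.google.com /doodles/",  # contains a space
    "ftp://google.com",                 # wrong scheme
    "http://",                          # added case: nothing after the scheme
]

verdicts = {url: bool(url_pattern.match(url)) for url in candidates}
for url, ok in verdicts.items():
    print(f"{url!r}: {'valid' if ok else 'not valid'}")
```

The bare "http://" passes because anything_but(' ') also accepts zero characters. This pattern checks the general shape of a link, so it should not be treated as a full URL validator.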

To be honest, the Python version of VerbalExpressions is not complete yet. Some components are still missing. Remember the quantifiers that we have in RegEx?

* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly n occurrences
{n,} - n or more occurrences
{n,m} - Between n and m occurrences

For now, VerbalExpressions does not have functions that cover this syntax. To use it, we need to add the quantifier manually as a string with .add().

verbal_expression.add("{0,}")  --> * - Zero or more
verbal_expression.add("{1,}")  --> + - One or more
verbal_expression.add("{0,1}") --> ? - Zero or one
verbal_expression.add("{n}")   --> {n} - Exactly n occurrences
verbal_expression.add("{n,}")  --> {n,} - n or more occurrences
verbal_expression.add("{n,m}") --> {n,m} - Between n and m occurrences
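
If typing the braces by hand feels error-prone, one option is to wrap them in tiny helper functions and pass the result to .add(). The helper names below (times, at_least, between) are hypothetical and are not part of VerbalExpressions. The sketch tests the generated strings with the built-in re module only.

```python
import re

# Hypothetical helpers that build quantifier strings (not part of the library)
def times(n):
    return "{%d}" % n            # exactly n occurrences

def at_least(n):
    return "{%d,}" % n           # n or more occurrences

def between(n, m):
    return "{%d,%d}" % (n, m)    # between n and m occurrences

# The strings could then be passed on, e.g. verbal_expression.add(at_least(2))
print(times(3), at_least(2), between(2, 3))  # {3} {2,} {2,3}

# Quick check with plain re: "ab" repeated two or three times
pattern = "(ab)" + between(2, 3)
matches = [s for s in ["ab", "abab", "ababab", "abababab"]
           if re.fullmatch(pattern, s)]
print(matches)  # ['abab', 'ababab']
```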

Let's find strings that consist of two or more repetitions of "very".

verbal_expression = VerEx()
tester = (verbal_expression.
            find("very").
            add("{2,}")
)

print(tester.source())

(very){2,}

test_very = ["","very","veryvery","veryveryvery"]

# Test if the word is correct
for text in test_very:
    if tester.match(text):
        print(f"{text} is a correct word")
    else:
        print(f"{text} is not a correct word")

 is not a correct word
very is not a correct word
veryvery is a correct word
veryveryvery is a correct word
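
One caveat, shown here with the built-in re module: the pattern above has no end-of-line anchor. Assuming tester.match behaves like re.match (which only anchors at the start of the string), extra text after the repeated "very" would still be accepted. The sample strings are made up for this sketch.

```python
import re

pattern = r"(very){2,}"  # same pattern as printed by tester.source()

# re.match anchors only at the start, so trailing text is accepted
starts_ok = bool(re.match(pattern, "veryvery good"))

# A single "very" followed by a space and another "very" is not a repetition
spaced = bool(re.match(pattern, "very very"))

# re.fullmatch requires the whole string to fit the pattern
full_with_tail = bool(re.fullmatch(pattern, "veryvery good"))
full_exact = bool(re.fullmatch(pattern, "veryveryvery"))

print(starts_ok)       # True
print(spaced)          # False
print(full_with_tail)  # False
print(full_exact)      # True
```

If the whole string must consist of repeated "very", adding end_of_line() to the chain (or using re.fullmatch on the generated pattern) is the safer choice.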