Regular expressions (regex) are a powerful tool for manipulating and analyzing text. In Python, we use the re
module to work with regex.
import re  # No need to pip install; it's in the standard library
Patterns
Basic Characters
a: Exact match. re.search('a', 'apple')
.: Matches any character (except newline). re.search('.', 'apple')
\d: Matches any digit (0-9). re.search(r'\d', 'apple2')
\D: Matches any non-digit. re.search(r'\D', '1234a')
\s: Matches any whitespace. re.search(r'\s', 'apple pie')
\S: Matches any non-whitespace. re.search(r'\S', ' apple')
\w: Matches any word character: letters, digits, and underscore (a-z, A-Z, 0-9, _; for str patterns, Unicode letters and digits as well). re.search(r'\w', '@apple!')
\W: Matches any character that is not a word character. re.search(r'\W', 'apple@')
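The character classes above can be checked quickly in a Python session. The following is a minimal sketch: the `samples` and `first_matches` names are illustrative assumptions, and the inputs are the same strings used in the list. Raw strings (r'...') are used so the backslashes reach the regex engine unchanged.

```python
import re

# Illustrative inputs: pattern -> text to search.
samples = {
    r'\d': 'apple2',
    r'\D': '1234a',
    r'\s': 'apple pie',
    r'\S': ' apple',
    r'\w': '@apple!',
    r'\W': 'apple@',
}

# re.search returns the leftmost match; .group() is the matched text.
first_matches = {p: re.search(p, t).group() for p, t in samples.items()}

for pattern, found in first_matches.items():
    print(f'{pattern!r} -> {found!r}')
```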
Special Characters
\t: Tab. re.search('\t', 'apple\t')
\n: Newline. re.search('\n', 'apple\npie')
\r: Carriage Return. re.search('\r', 'apple\r\n')
\\: Backslash. re.search('\\\\', 'apple\\')
Quantifiers
*: Zero or more of the previous item. re.search('a*', 'aaapple')
+: One or more of the previous item. re.search('a+', 'aaapple')
?: Zero or one of the previous item. re.search('a?', 'aaapple')
{n}: Exactly n of the previous item. re.search('a{2}', 'aaapple')
{n,}: n or more of the previous item. re.search('a{2,}', 'aaapple')
{,m}: Up to m of the previous item. re.search('a{,2}', 'aaapple')
{n,m}: Between n and m of the previous item. re.search('a{2,3}', 'aaaapple')
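To see what each quantifier actually captures, the sketch below runs several of them against the same input. The `quantifier_results` name is an illustrative assumption. Quantifiers are greedy by default, so each one takes as many characters as it is allowed to.

```python
import re

text = 'aaapple'

# re.search returns the leftmost match; greedy quantifiers take as much as allowed.
quantifier_results = {
    pattern: re.search(pattern, text).group()
    for pattern in ('a*', 'a+', 'a?', 'a{2}', 'a{2,}', 'a{,2}')
}

for pattern, found in quantifier_results.items():
    print(f'{pattern:6} -> {found!r}')
```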
Groups and Ranges
[abc]: Matches any of the enclosed characters. re.search('[abc]', 'apple')
[^abc]: Matches any character not enclosed. re.search('[^abc]', 'apple')
(abc): Defines a group. re.search('(abc)', 'abcapple')
(a|b): Matches either a or b. re.search('(a|p)', 'apple')
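A short sketch of how groups and character sets show up on the match object. The variable names are illustrative assumptions; the patterns and inputs are the ones from the list above.

```python
import re

m = re.search('(abc)', 'abcapple')
whole = m.group(0)   # the entire match
first = m.group(1)   # the contents of the first group

# Alternation inside a group: the leftmost alternative that matches wins.
alt = re.search('(a|p)', 'apple').group(1)

# A negated set skips 'a' and matches the first character outside the set.
negated = re.search('[^abc]', 'apple').group()

print(whole, first, alt, negated)
```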
Anchors
^abc: Matches pattern abc at the start of a string. re.search('^abc', 'abcapple')
abc$: Matches pattern abc at the end of a string. re.search('abc$', 'appleabc')
\babc: Word boundary (matches abc at the start of a word). re.search('\\babc', 'abc apple')
abc\b: Word boundary (matches abc at the end of a word). re.search('abc\\b', 'appleabc pie')
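The effect of a word boundary is easiest to see next to a case where it fails. In this sketch the non-matching inputs ('xabc apple', 'abcapple pie') are assumptions added for contrast.

```python
import re

# \b requires a word boundary, so 'abc' must begin a word here.
starts_word = re.search(r'\babc', 'abc apple')
inside_word = re.search(r'\babc', 'xabc apple')   # 'abc' is preceded by 'x'

# Here 'abc' must end a word.
ends_word = re.search(r'abc\b', 'appleabc pie')
not_at_end = re.search(r'abc\b', 'abcapple pie')  # 'abc' is followed by 'a'

print(bool(starts_word), bool(inside_word), bool(ends_word), bool(not_at_end))
```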
Flags
re.I or re.IGNORECASE: Makes matching case-insensitive. re.search('a', 'APPLE', re.I)
re.M or re.MULTILINE: Makes ^ and $ match at the start and end of each line. re.search('^b', 'apple\nbanana', re.M)
re.S or re.DOTALL: Makes . match any character, including newlines. re.search('a.p', 'a\np', re.S)
re.X or re.VERBOSE: Allows multiline regular expressions and ignores whitespace and comments in the pattern. re.search("""a # this is a comment\nb""", 'ab', re.X)
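Flags can be combined with the bitwise OR operator, and re.X is mostly useful for longer patterns. The sketch below illustrates both; the sample text and the `date_pattern` name are assumptions chosen for the example.

```python
import re

# Combine flags with |: case-insensitive and line-by-line anchoring.
lines = 'Apple\navocado\nbanana'
a_words = re.findall(r'^a\w*', lines, re.I | re.M)

# With re.X, whitespace and '#' comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
date_parts = date_pattern.search('released 2023-07-10').groups()

print(a_words)
print(date_parts)
```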
Back References
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string.
\1: Matches the contents of group 1. re.search('(a)b\\1', 'aba')
\2: Matches the contents of group 2. re.search('(a)(b)\\2', 'abb')
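Backreferences work both inside a pattern and inside a replacement string passed to re.sub(). A minimal sketch, with sample strings that are assumptions for illustration:

```python
import re

# (\w)\1 matches a word character followed by a copy of itself.
doubled = re.findall(r'(\w)\1', 'aardvark balloon')

# In a replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

print(doubled)
print(swapped)
```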
Lookahead and Lookbehind
Lookahead and lookbehind assertions make a match succeed or fail based on what is just ahead of (to the right of) or just behind (to the left of) the current position in the string, without consuming those characters.
a(?=b): Positive lookahead: Matches 'a' only if 'a' is followed by 'b'. re.search('a(?=b)', 'ab')
a(?!b): Negative lookahead: Matches 'a' only if 'a' is not followed by 'b'. re.search('a(?!b)', 'ac')
(?<=b)a: Positive lookbehind: Matches 'a' only if 'a' is preceded by 'b'. re.search('(?<=b)a', 'ba')
(?<!b)a: Negative lookbehind: Matches 'a' only if 'a' is not preceded by 'b'. re.search('(?<!b)a', 'ca')
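The point that lookarounds check their context without including it in the match can be seen from the span of the result. This is a small sketch; the price string is an assumed example.

```python
import re

# The lookahead requires 'b' after 'a', but only 'a' is part of the match.
m = re.search('a(?=b)', 'ab')
matched_text, matched_span = m.group(), m.span()

# The negative lookahead rejects 'a' when 'b' follows.
rejected = re.search('a(?!b)', 'ab')

# The lookbehind keeps only digits that directly follow a dollar sign.
prices = re.findall(r'(?<=\$)\d+', 'cost: $40, weight: 15 kg')

print(matched_text, matched_span, rejected, prices)
```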
Python’s re module
Python’s re module provides several functions to work with regex. Here are the most used beyond re.search():
re.match()
This function checks for a match only at the beginning of the string.
print(re.match('abc', 'abcdef')) # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc')) # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc').group()) # abc
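The difference from re.search() only shows when the pattern occurs later in the string. A minimal sketch with an assumed input:

```python
import re

text = 'xyzabc'

# re.match only tries at position 0, so it fails here.
at_start = re.match('abc', text)

# re.search scans the whole string and finds the later occurrence.
anywhere = re.search('abc', text)
found_span = anywhere.span() if anywhere else None

print(at_start)     # None
print(found_span)
```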
re.findall()
This function returns all non-overlapping matches of pattern in string, as a list of strings.
print(re.findall('abc', 'abcdefabc')) # ['abc', 'abc']
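When the pattern contains capturing groups, re.findall() returns the groups rather than the whole match, and re.finditer() yields match objects when positions are needed. The sketch below uses an assumed sample string to illustrate both.

```python
import re

# With two groups, each result is a tuple of the captured parts.
pairs = re.findall(r'(\w+)@(\w+)\.com', 'a@x.com b@y.com')

# finditer yields match objects, which expose span() and group().
spans = [m.span() for m in re.finditer('abc', 'abcdefabc')]

print(pairs)
print(spans)
```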
re.sub()
This function replaces all occurrences of the pattern in string with repl, unless the optional count argument limits the number of replacements.
print(re.sub('abc', '123', 'abcdefabc')) # 123def123
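A short sketch of limiting replacements with the count argument of re.sub(), plus re.subn(), which also reports how many replacements were made. The variable names are illustrative assumptions.

```python
import re

# count=1 replaces only the first occurrence.
first_only = re.sub('abc', '123', 'abcdefabc', count=1)

# re.subn returns a (new_string, number_of_replacements) tuple.
replaced, n_replaced = re.subn('abc', '123', 'abcdefabc')

print(first_only)
print(replaced, n_replaced)
```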
re.split()
This function splits the source string by the occurrences of the pattern.
print(re.split(r'\d+', 'apple123banana45cherry6')) # ['apple', 'banana', 'cherry', '']
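The trailing empty string appears because the input ends with a delimiter. The sketch below shows one way to drop it, along with the maxsplit argument and the effect of a capturing group in the pattern. The variable names are illustrative assumptions.

```python
import re

text = 'apple123banana45cherry6'

# Filter out empty pieces left by a delimiter at the end of the string.
parts = [p for p in re.split(r'\d+', text) if p]

# maxsplit limits how many splits are made.
first_split = re.split(r'\d+', text, maxsplit=1)

# A capturing group keeps the delimiters in the result.
with_delims = re.split(r'(\d+)', text)

print(parts)
print(first_split)
print(with_delims)
```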
Popular Examples
import re
# Find all substrings that match a pattern
text = "Hello, my name is John Doe. I live in New York."
matches = re.findall(r'\b\w{4}\b', text)
# matches: ['name', 'John', 'live', 'York']
# This code finds all 4-letter words in the text.
# Split a string by multiple delimiters
text = "apple;banana-orange:peach"
result = re.split(r'[;:-]', text)
# result: ['apple', 'banana', 'orange', 'peach']
# This code splits the text by either a semicolon, a dash, or a colon.
# Replace substrings that match a pattern
text = "I have 3 cats, 4 dogs, and 5 fishes."
result = re.sub(r'\d', 'many', text)
# result: 'I have many cats, many dogs, and many fishes.'
# This code replaces all digits in the text with the word 'many'.
# Check if a string starts with a pattern
text = "Hello, world!"
result = bool(re.match(r'^Hello', text))
# result: True
# This code checks if the text starts with 'Hello'.
# Extract email addresses from a string
text = "Contact us at info@example.com or support@example.net."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
# emails: ['info@example.com', 'support@example.net']
# This code extracts all email addresses from the text.
# Find all dates in YYYY-MM-DD format
text = "I was born on 2000-01-01. I graduated on 2020-05-15."
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
# dates: ['2000-01-01', '2020-05-15']
# This code extracts all dates in YYYY-MM-DD format from the text.
# Capture groups in a match
text = "The event will be held on 2023-07-10 at 18:00."
match = re.search(r'(\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2})', text)
date, time = match.groups()
# date: '2023-07-10', time: '18:00'
# This code extracts the date and time from the text.
# Match a pattern multiple times
text = "I love apples, apples are my favorite fruit."
matches = re.findall(r'(apples)', text)
# matches: ['apples', 'apples']
# This code finds all occurrences of 'apples' in the text.
# Match a pattern and replace it with a function's result
def replace_with_length(match):
    # group(1) is the word that follows 'a '
    return str(len(match.group(1)))
text = "I have a cat, a dog, and a horse."
result = re.sub(r'\ba (\w+)\b', replace_with_length, text)
# result: 'I have 3, 3, and 5.'
# This code replaces all 'a [word]' with the length of '[word]'.
# Match the innermost brackets in nested text
text = "foo(bar(baz))blim"
matches = re.findall(r'\(([^()]*)\)', text)
# matches: ['baz']
# This code finds all text within the innermost brackets.
# Find duplicate words
text = "This is is a test test sentence."
dupes = re.findall(r'\b(\w+)\s+\1\b', text)
# dupes: ['is', 'test']
# This code finds all duplicate words in the text.
# Match a pattern except in specific contexts
text = "100 dollars, but not 100 cents"
matches = re.findall(r'100(?!\s+cents)', text)
# matches: ['100']
# This code finds '100' except when it is followed by ' cents'.
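# Match balanced parentheses
# The standard re module does not support recursive patterns such as (?R);
# this example assumes the third-party regex module (pip install regex).
import regex
text = "((()))()()(((())))"
matches = regex.findall(r'\((?:[^()]|(?R))*\)', text)
# matches: ['((()))', '()', '()', '(((())))']
# This code matches balanced parentheses in the text.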
# Validate a password with certain rules
password = "StrongPass1!"
is_valid = bool(re.match(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$', password))
# is_valid: True
# This code validates the password, which must contain at least one digit, one lowercase letter, one uppercase letter, one special character, and be at least 8 characters long.
# Extract the domain name from a URL
url = "https://www.example.com/path?query#fragment"
domain = re.search(r'https?://([A-Za-z_0-9.-]+).*', url).group(1)
# domain: 'www.example.com'
# This code extracts the domain name from a URL.
# Match words containing non-ASCII (Unicode) letters
text = "Résumé"
matches = re.findall(r'\w+', text)
# matches: ['Résumé']
# This code finds all words in the text, even if they contain Unicode characters.
# Match repeating words
text = "This is a a test."
repeated_words = re.findall(r'\b(\w+)\s+\1\b', text)
# repeated_words: ['a']
# This code finds all words that are immediately repeated.
# Match words that are palindromes
text = "A man, a plan, a canal, Panama"
palindromes = [word for word in re.findall(r'\b\w+\b', text) if word == word[::-1]]
# palindromes: ['A', 'a', 'a']
# This code finds all palindromic words in the text. The comparison is case-sensitive, so only the single-letter words qualify here.
# Match words containing 'q' not followed by 'u'
text = "Iraq is a country in the Middle East."
q_not_u_words = re.findall(r'\b\w*q(?!u)\w*\b', text)
# q_not_u_words: ['Iraq']
# This code finds all words in the text that contain 'q' not followed by 'u'.
# Extract all words within double quotes
text = 'He said, "Hello, world!"'
quoted = re.findall(r'"(.*?)"', text)
# quoted: ['Hello, world!']
# This code extracts the text inside each pair of double quotes.
https://medium.com/@theom/the-ultimate-python-regex-cheat-sheet-f202e99ac21d