Regular expressions (regex) are a powerful tool for manipulating and analyzing text. In Python, we use the re module to work with regex.

import re  # No need to pip install; it's in the standard library

Patterns

Basic Characters

a: Exact match. re.search('a', 'apple')
.: Matches any character (except newline). re.search('.', 'apple')
\d: Matches any digit (0-9). re.search(r'\d', 'apple2')
\D: Matches any non-digit. re.search(r'\D', '1234a')
\s: Matches any whitespace. re.search(r'\s', 'apple pie')
\S: Matches any non-whitespace. re.search(r'\S', ' apple')
\w: Matches any alphanumeric character and underscore (a-z, A-Z, 0-9, _). re.search(r'\w', '@apple!')
\W: Matches any character that is not alphanumeric or an underscore. re.search(r'\W', 'apple@')
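
To make the list above concrete, here is a minimal sketch that runs several of those patterns and reads the matched text with .group(). The sample strings come from the list; the variable names are illustrative only.

```python
import re

# re.search returns a Match object (or None); .group() gives the matched text.
first_digit = re.search(r'\d', 'apple2').group()
first_non_digit = re.search(r'\D', '1234a').group()
first_space = re.search(r'\s', 'apple pie').group()
first_word_char = re.search(r'\w', '@apple!').group()
first_non_word = re.search(r'\W', 'apple@').group()

print(first_digit, first_non_digit, repr(first_space), first_word_char, first_non_word)
```

Raw strings (the r prefix) keep Python from interpreting the backslash before the regex engine sees it.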

Special Characters

\t: Tab. re.search('\t', 'apple\t')
\n: Newline. re.search('\n', 'apple\npie')
\r: Carriage Return. re.search('\r', 'apple\r\n')
\\: Backslash. re.search('\\\\', 'apple\\')
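
The backslash entry is the easiest one to get wrong, because both Python and the regex engine treat the backslash as an escape character. The sketch below compares a normal string literal with a raw string; the variable names are illustrative.

```python
import re

# In a normal string literal, '\\\\' is two backslash characters,
# which the regex engine reads as one escaped (literal) backslash.
normal = re.search('\\\\', 'apple\\')

# A raw string expresses the same pattern with half the escaping.
raw = re.search(r'\\', 'apple\\')

same_span = normal.span() == raw.span()
print(normal.span(), raw.span(), same_span)
```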

Quantifiers

*: Zero or more of the previous item. re.search('a*', 'aaapple')
+: One or more of the previous item. re.search('a+', 'aaapple')
?: Zero or one of the previous item. re.search('a?', 'aaapple')
{n}: Exactly n of the previous item. re.search('a{2}', 'aaapple')
{n,}: n or more of the previous item. re.search('a{2,}', 'aaapple')
{,m}: Up to m of the previous item. re.search('a{,2}', 'aaapple')
{n,m}: Between n and m of the previous item. re.search('a{2,3}', 'aaaapple')
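
Quantifiers are greedy by default, meaning they take as many characters as they can. A small sketch of what the patterns above actually capture follows; the lazy variant a+? is added here for contrast and is not part of the list above.

```python
import re

text = 'aaapple'

star = re.search(r'a*', text).group()           # greedy: takes all three a's
plus = re.search(r'a+', text).group()           # greedy: takes all three a's
optional = re.search(r'a?', text).group()       # at most one a
exactly_two = re.search(r'a{2}', text).group()  # exactly two a's
up_to_two = re.search(r'a{,2}', text).group()   # zero to two a's
lazy_plus = re.search(r'a+?', text).group()     # lazy: stops after one a

# a* can also succeed with an empty match when there is no leading 'a'.
empty = re.search(r'a*', 'pple').group()

print(star, plus, optional, exactly_two, up_to_two, lazy_plus, repr(empty))
```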

Groups and Ranges

[abc]: Matches any of the enclosed characters. re.search('[abc]', 'apple')
[^abc]: Matches any character not enclosed. re.search('[^abc]', 'apple')
(abc): Defines a group. re.search('(abc)', 'abcapple')
(a|b): Matches either a or b. re.search('(a|p)', 'apple')
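
The difference between a character set and a group shows up when you read the result back. This is a minimal sketch; the 'user@host' sample string is made up for illustration.

```python
import re

in_set = re.search(r'[abc]', 'apple').group()       # first char that is a, b, or c
not_in_set = re.search(r'[^abc]', 'apple').group()  # first char that is none of them

# Groups are numbered from 1; group(0) is the whole match.
m = re.search(r'(\w+)@(\w+)', 'user@host')
whole, left, right = m.group(0), m.group(1), m.group(2)

# Alternation inside a group: the first alternative that matches wins.
either = re.search(r'(a|p)', 'apple').group(1)

print(in_set, not_in_set, whole, left, right, either)
```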

Anchors

^abc: Matches pattern abc at the start of a string. re.search('^abc', 'abcapple')
abc$: Matches pattern abc at the end of a string. re.search('abc$', 'appleabc')
\babc: Word boundary (matches abc at the start of a word). re.search('\\babc', 'abc apple')
abc\b: Word boundary (matches abc at the end of a word). re.search('abc\\b', 'appleabc pie')
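
Anchors match positions rather than characters, so the span of the match is the easiest way to see their effect. The sketch below uses illustrative sample strings.

```python
import re

at_start = re.search(r'^abc', 'abcapple')    # 'abc' is at index 0
not_at_start = re.search(r'^abc', 'appleabc')  # None: 'abc' is not at the start
at_end = re.search(r'abc$', 'appleabc')      # 'abc' is the last three characters

# \b only matches between a word character and a non-word character,
# so the 'abc' inside 'xabc' is skipped and the standalone 'abc' is found.
word_start = re.search(r'\babc', 'xabc abc')

print(at_start.span(), not_at_start, at_end.span(), word_start.span())
```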

Flags

re.I or re.IGNORECASE: Makes matching case insensitive. re.search('a', 'APPLE', re.I)
re.M or re.MULTILINE: Makes ^ and $ match start and end of each line. re.search('^a', 'apple\nbanana', re.M)
re.S or re.DOTALL: Makes . match any character, including newlines. re.search('a.p', 'a\np', re.S)
re.X or re.VERBOSE: Allows multiline regular expressions and ignores whitespace and comments in the pattern. re.search("""a # this is a comment\nb""", 'ab', re.X)
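
Each flag is easier to understand next to the same call without it. The following sketch does that, and also combines two flags with the | operator; the sample strings are illustrative.

```python
import re

ignore_case = re.search(r'a', 'APPLE', re.I).group()

# Without re.M, ^ only matches at the very start of the string.
first_line_only = re.findall(r'^\w+', 'apple\nbanana')
every_line = re.findall(r'^\w+', 'apple\nbanana', re.M)

# Without re.S, the dot refuses to match the newline.
dot_plain = re.search(r'a.p', 'a\np')
dot_all = re.search(r'a.p', 'a\np', re.S)

# re.X lets the pattern be laid out over several lines with comments.
year_month = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", re.X)
verbose_match = year_month.search('2024-06').group()

# Flags can be combined with |.
combined = re.findall(r'^a\w*', 'Apple\navocado', re.I | re.M)

print(ignore_case, first_line_only, every_line, dot_plain, verbose_match, combined)
```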

Back References

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string.

\1: Matches the contents of group 1. re.search('(a)b\\1', 'aba')
\2: Matches the contents of group 2. re.search('(a)(b)\\2', 'abb')
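
A backreference repeats whatever text the group captured, not the group's pattern. Here is a short sketch using raw strings, so a single backslash is enough; the doubled-letter example is an added illustration.

```python
import re

matched = re.search(r'(a)b\1', 'aba').group()    # \1 must be 'a' again
no_match = re.search(r'(a)b\1', 'abc')           # 'c' is not what group 1 captured
second = re.search(r'(a)(b)\2', 'abb').group()   # \2 must be 'b' again

# (\w)\1 finds any character immediately followed by itself.
doubled = re.search(r'(\w)\1', 'apple').group()

print(matched, no_match, second, doubled)
```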

Lookahead and Lookbehind

Lookahead and lookbehind assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the current string position.

a(?=b): Positive lookahead: Matches 'a' only if 'a' is followed by 'b'. re.search('a(?=b)', 'ab')
a(?!b): Negative lookahead: Matches 'a' only if 'a' is not followed by 'b'. re.search('a(?!b)', 'ac')
(?<=b)a: Positive lookbehind: Matches 'a' only if 'a' is preceded by 'b'. re.search('(?<=b)a', 'ba')
(?<!b)a: Negative lookbehind: Matches 'a' only if 'a' is not preceded by 'b'. re.search('(?<!b)a', 'ca')
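
Lookarounds check the surrounding text without including it in the match. The sketch below shows this through the match span and a made-up price string.

```python
import re

# The lookahead checks for 'b' but does not consume it: the span covers only 'a'.
ahead = re.search(r'a(?=b)', 'ab')

# Only the 'a' in 'ac' and 'ad' passes the negative lookahead.
not_before_b = re.findall(r'a(?!b)', 'ab ac ad')

# Digits directly after a dollar sign; the '$' itself is not returned.
prices = re.findall(r'(?<=\$)\d+', 'cost: $30, qty: 5')

# Whole numbers that are not directly after a dollar sign.
non_prices = re.findall(r'(?<!\$)\b\d+', 'cost: $30, qty: 5')

print(ahead.group(), ahead.span(), not_before_b, prices, non_prices)
```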

Python’s `re` module

Python’s re module provides several functions to work with regex. Here are the most used beyond re.search():

`re.match()`

This function checks for a match only at the beginning of the string.

print(re.match('abc', 'abcdef'))  # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc'))  # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc').group())  # abc
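
Because re.match() returns None when the pattern is not at the start, calling .group() on the result without a check raises AttributeError. This minimal sketch contrasts re.match() with re.search() and re.fullmatch() and guards against None; the sample strings are illustrative.

```python
import re

at_start = re.match(r'abc', 'abcdef')        # matches at index 0
not_at_start = re.match(r'abc', 'xabcdef')   # None: 'abc' is not at index 0
anywhere = re.search(r'abc', 'xabcdef')      # search scans the whole string
whole = re.fullmatch(r'abc', 'abcdef')       # None: trailing 'def' is left over

# Guard against None before calling .group().
result = not_at_start.group() if not_at_start else 'no match'

print(at_start.group(), anywhere.span(), whole, result)
```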

`re.findall()`

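This function returns all non-overlapping matches of pattern in string, as a list of strings. If the pattern contains capturing groups, it returns the group contents instead of the whole match.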

print(re.findall('abc', 'abcdefabc'))  # ['abc', 'abc']
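
The shape of the list re.findall() returns depends on how many capturing groups the pattern has. The sketch below shows the three cases, and uses re.finditer() when positions are needed; the 'x=1, y=22' string is made up for illustration.

```python
import re

text = 'x=1, y=22'

no_groups = re.findall(r'\w=\d+', text)       # whole matches
one_group = re.findall(r'\w=(\d+)', text)     # only the group's text
two_groups = re.findall(r'(\w)=(\d+)', text)  # one tuple per match

# finditer yields Match objects, which also carry positions.
spans = [m.span() for m in re.finditer(r'\w=\d+', text)]

print(no_groups, one_group, two_groups, spans)
```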

`re.sub()`

This function replaces occurrences of the pattern in string with repl. All occurrences are replaced unless the optional count argument limits the number of substitutions.

print(re.sub('abc', '123', 'abcdefabc'))  # 123def123
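
Two options of re.sub() come up often: limiting the number of replacements with count, and reusing captured groups in the replacement string. Here is a short sketch, with re.subn() added to report how many replacements were made; the date string is illustrative.

```python
import re

# count=1 replaces only the first occurrence.
first_only = re.sub(r'abc', '123', 'abcdefabc', count=1)

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-06-20')

# subn returns the new string together with the number of substitutions.
new_text, n_subs = re.subn(r'abc', '123', 'abcdefabc')

print(first_only, reordered, new_text, n_subs)
```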

`re.split()`

This function splits the source string by the occurrences of the pattern.

print(re.split(r'\d+', 'apple123banana45cherry6'))  # ['apple', 'banana', 'cherry', '']
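
The trailing empty string appears because the text ends with a delimiter. This sketch shows one way to drop it, along with the maxsplit argument and the effect of putting the delimiter in a capturing group.

```python
import re

text = 'apple123banana45cherry6'

parts = re.split(r'\d+', text)
non_empty = [p for p in parts if p]             # drop empty strings

limited = re.split(r'\d+', text, maxsplit=1)    # split at the first delimiter only
with_delims = re.split(r'(\d+)', text)          # a group keeps the delimiters

print(non_empty, limited, with_delims)
```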

Popular Examples

import re

# Find all substrings that match a pattern
text = "Hello, my name is John Doe. I live in New York."
matches = re.findall(r'\b\w{4}\b', text)  
# matches: ['name', 'John', 'live', 'York']
# This code finds all 4-letter words in the text.

# Split a string by multiple delimiters
text = "apple;banana-orange:peach"
result = re.split(r'[;:-]', text)  
# result: ['apple', 'banana', 'orange', 'peach']
# This code splits the text by either a semicolon, a dash, or a colon.

# Replace substrings that match a pattern
text = "I have 3 cats, 4 dogs, and 5 fishes."
result = re.sub(r'\d', 'many', text)  
# result: 'I have many cats, many dogs, and many fishes.'
# This code replaces all digits in the text with the word 'many'.

# Check if a string starts with a pattern
text = "Hello, world!"
result = bool(re.match(r'^Hello', text))  
# result: True
# This code checks if the text starts with 'Hello'.

# Extract email addresses from a string
text = "Contact us at info@example.com or support@example.net."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)  
# emails: ['info@example.com', 'support@example.net']
# This code extracts all email addresses from the text.

# Find all dates in YYYY-MM-DD format
text = "I was born on 2000-01-01. I graduated on 2020-05-15."
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)  
# dates: ['2000-01-01', '2020-05-15']
# This code extracts all dates in YYYY-MM-DD format from the text.

# Capture groups in a match
text = "The event will be held on 2023-07-10 at 18:00."
match = re.search(r'(\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2})', text)
date, time = match.groups()
# date: '2023-07-10', time: '18:00'
# This code extracts the date and time from the text.

# Match a pattern multiple times
text = "I love apples, apples are my favorite fruit."
matches = re.findall(r'(apples)', text)  
# matches: ['apples', 'apples']
# This code finds all occurrences of 'apples' in the text.

# Match a pattern and replace it with a function's result
def replace_with_length(match):
    # group(1) is the word that follows 'a '
    return str(len(match.group(1)))

text = "I have a cat, a dog, and a horse."
result = re.sub(r'\ba (\w+)\b', replace_with_length, text)  
# result: 'I have 3, 3, and 5.'
# This code replaces each 'a [word]' with the length of '[word]'.

# Match the innermost brackets in nested text
text = "foo(bar(baz))blim"
matches = re.findall(r'\(([^()]*)\)', text)  
# matches: ['baz']
# This code finds all text within the innermost brackets.

# Find duplicate words
text = "This is is a test test sentence."
dupes = re.findall(r'\b(\w+)\s+\1\b', text)  
# dupes: ['is', 'test']
# This code finds all duplicate words in the text.

# Match a pattern except in specific contexts
text = "100 dollars, but not 100 cents"
matches = re.findall(r'100(?!\s+cents)', text)  
# matches: ['100']
# This code finds '100' except when it is followed by ' cents'.

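# Match balanced parentheses
# Note: the standard re module does not support recursive patterns such as (?R).
# This example assumes the third-party regex module (pip install regex).
import regex
text = "((()))()()(((())))"
matches = [m.group() for m in regex.finditer(r'\((?:[^()]|(?R))*\)', text)]
# matches: ['((()))', '()', '()', '(((())))']
# This code matches each balanced run of parentheses in the text.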

# Validate a password with certain rules
password = "StrongPass1!"
is_valid = bool(re.match(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$', password))  
# is_valid: True
# This code validates the password, which must contain at least one digit, one lowercase letter, one uppercase letter, one special character, and be at least 8 characters long.

# Extract the domain name from a URL
url = "https://www.example.com/path?query#fragment"
domain = re.search(r'https?://([A-Za-z_0-9.-]+).*', url).group(1)  
# domain: 'www.example.com'
# This code extracts the domain name from a URL.

# Match a Unicode character
text = "Résumé"
matches = re.findall(r'\w+', text)  
# matches: ['Résumé']
# This code finds all words in the text, even if they contain Unicode characters.

# Match repeating words
text = "This is a a test."
repeated_words = re.findall(r'\b(\w+)\s+\1\b', text)  
# repeated_words: ['a']
# This code finds all words that are immediately repeated.

# Match words that are palindromes
text = "A man, a plan, a canal, Panama"
palindromes = [word for word in re.findall(r'\b\w+\b', text) if word == word[::-1]]
# palindromes: ['A', 'a', 'a']
# This code finds all words that read the same backwards (case-sensitive); here only the single-letter words qualify.

# Match words containing 'q' not followed by 'u'
text = "Iraq is a country in the Middle East."
q_not_u_words = re.findall(r'\b\w*q(?!u)\w*\b', text)  
# q_not_u_words: ['Iraq']
# This code finds all words in the text that contain 'q' not followed by 'u'.

# Extract all words within double quotes
text = 'He said, "Hello, world!"'
quoted = re.findall(r'"(.*?)"', text)  
# quoted: ['Hello, world!']

# This code extracts the text inside each pair of double quotes.
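
Python's built-in re module does not support recursive patterns such as (?R), so matching balanced parentheses with the standard library alone needs a different approach. One option is a simple depth counter, sketched below. The helper name balanced_groups is hypothetical, and the sketch assumes that unmatched closing parentheses should simply be skipped.

```python
def balanced_groups(text):
    """Return each top-level balanced (...) run found in text."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i          # remember where this top-level run begins
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:         # the run that began at start is now closed
                groups.append(text[start:i + 1])
    return groups


print(balanced_groups("((()))()()(((())))"))
print(balanced_groups("foo(bar(baz))blim"))
```

The counter only tracks nesting depth, so it runs in a single pass and returns the outermost groups with their nested content intact.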

https://medium.com/@theom/the-ultimate-python-regex-cheat-sheet-f202e99ac21d

Learning Python

Thursday, June 20, 2024

Python Regular expressions full guide
