Thursday, June 20, 2024

Python Regular Expressions Full Guide

Regular expressions (regex) are a powerful tool for manipulating and analyzing text. In Python, we use the re module to work with regex.

import re  # No need to pip install; it's in the standard library

Patterns

Basic Characters

  • a: Exact match. re.search('a', 'apple')
  • .: Matches any character (except newline). re.search('.', 'apple')
  • \d: Matches any digit (0-9). re.search(r'\d', 'apple2')
  • \D: Matches any non-digit. re.search(r'\D', '1234a')
  • \s: Matches any whitespace. re.search(r'\s', 'apple pie')
  • \S: Matches any non-whitespace. re.search(r'\S', ' apple')
  • \w: Matches any word character: letters, digits, and underscore (a-z, A-Z, 0-9, _; for str patterns, Unicode letters and digits too). re.search(r'\w', '@apple!')
  • \W: Matches any non-word character (anything \w does not match). re.search(r'\W', 'apple@')
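
The classes above can be tried together on one string. This is only a small sketch: the sample text and variable names are made up for illustration, and the patterns are written as raw strings (r'...') so Python leaves the backslashes for the regex engine.

```python
import re

sample = "apple pie 42!"  # made-up sample text

first_digit = re.search(r"\d", sample).group()   # first digit found
first_space = re.search(r"\s", sample).start()   # index of the first whitespace
non_word = re.findall(r"\W", sample)             # every non-word character
first_any = re.search(r".", sample).group()      # "." takes the first character

print(first_digit, first_space, non_word, first_any)
# 4 5 [' ', ' ', '!'] a
```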

Special Characters

  • \t: Tab. re.search('\t', 'apple\t')
  • \n: Newline. re.search('\n', 'apple\npie')
  • \r: Carriage Return. re.search('\r', 'apple\r\n')
  • \\: Backslash. re.search('\\\\', 'apple\\')
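
The backslash entry is the one that most often trips people up, because both Python string literals and the regex engine treat the backslash as an escape character. The sketch below (the path value is an arbitrary example) shows the same pattern written as a normal string and as a raw string.

```python
import re

path = "C:\\temp"  # the 7-character string C:\temp (one real backslash)

# Normal string: four backslashes in source become the two-character regex \\
plain = re.search("\\\\", path)
# Raw string: the regex escape \\ is written exactly as the engine sees it
raw = re.search(r"\\", path)

print(plain.span(), raw.span())  # (2, 3) (2, 3)
```

Raw strings are usually the easier habit for regex patterns, since what you type is what the engine receives.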

Quantifiers

  • *: Zero or more of the previous item. re.search('a*', 'aaapple')
  • +: One or more of the previous item. re.search('a+', 'aaapple')
  • ?: Zero or one of the previous item. re.search('a?', 'aaapple')
  • {n}: Exactly n of the previous item. re.search('a{2}', 'aaapple')
  • {n,}: n or more of the previous item. re.search('a{2,}', 'aaapple')
  • {,m}: From zero up to m of the previous item. re.search('a{,2}', 'aaapple')
  • {n,m}: Between n and m of the previous item. re.search('a{2,3}', 'aaaapple')
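
To see what each quantifier actually captures, it helps to print the matched text rather than the Match object. A minimal sketch using the same 'aaapple' string as the list above; the lazy form a+? is included for comparison, and the variable names are illustrative.

```python
import re

text = "aaapple"
patterns = [r"a*", r"a+", r"a?", r"a{2}", r"a{2,}", r"a{,2}", r"a+?"]

# Map each pattern to the text of its first match.
first_match = {p: re.search(p, text).group() for p in patterns}
for pattern, matched in first_match.items():
    print(f"{pattern!r:10} -> {matched!r}")

# "a*" can succeed without consuming anything: here it matches the empty
# string at position 0 because "banana" does not start with "a".
empty = re.search(r"a*", "banana")
print(empty.span())  # (0, 0)
```

Quantifiers are greedy by default (they take as much as they can); adding ? after a quantifier makes it lazy, so a+? stops after a single 'a'.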

Groups and Ranges

  • [abc]: Matches any of the enclosed characters. re.search('[abc]', 'apple')
  • [^abc]: Matches any character not enclosed. re.search('[^abc]', 'apple')
  • (abc): Defines a group. re.search('(abc)', 'abcapple')
  • (a|b): Matches either a or b. re.search('(a|p)', 'apple')
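
Groups become useful once you read them back from the match, and character classes also accept ranges such as [a-c]. A short sketch with made-up inputs:

```python
import re

# Two groups: an alternation, then one or more 'p'.
m = re.search(r"(a|p)(p+)", "apple")
whole, first, second = m.group(0), m.group(1), m.group(2)
print(whole, first, second)  # app a pp

# Negated class: every character that is not a vowel.
consonants = re.findall(r"[^aeiou]", "apple")
print(consonants)  # ['p', 'p', 'l']

# Range inside a class: any character from 'a' to 'c'.
in_range = re.findall(r"[a-c]", "abcapple")
print(in_range)  # ['a', 'b', 'c', 'a']
```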

Anchors

  • ^abc: Matches pattern abc at the start of a string. re.search('^abc', 'abcapple')
  • abc$: Matches pattern abc at the end of a string. re.search('abc$', 'appleabc')
  • \babc: Word boundary (matches abc at the start of a word). re.search('\\babc', 'abc apple')
  • abc\b: Word boundary (matches abc at the end of a word). re.search('abc\\b', 'appleabc pie')
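
Anchors match positions, not characters, so the easiest way to compare them is to look at where each pattern matches. A small sketch on an arbitrary sample string:

```python
import re

text = "abc appleabc abcd"

# \b before abc: only where abc begins a word.
starts_word = [m.start() for m in re.finditer(r"\babc", text)]
# \b after abc: only where abc ends a word.
ends_word = [m.start() for m in re.finditer(r"abc\b", text)]

at_start = re.search(r"^abc", text) is not None  # text begins with abc
at_end = re.search(r"abc$", text) is not None    # text ends with abcd, not abc

print(starts_word, ends_word, at_start, at_end)
# [0, 13] [0, 9] True False
```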

Flags

  • re.I or re.IGNORECASE: Makes matching case insensitive. re.search('a', 'APPLE', re.I)
  • re.M or re.MULTILINE: Makes ^ and $ match start and end of each line. re.search('^a', 'apple\nbanana', re.M)
  • re.S or re.DOTALL: Makes . match any character, including newlines. re.search('a.p', 'a\np', re.S)
  • re.X or re.VERBOSE: Allows multiline regular expressions and ignores whitespace and comments in the pattern. re.search("""a # this is a comment\nb""", 'ab', re.X)
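
Flags can be combined with the | operator, and re.VERBOSE is easiest to appreciate on a pattern laid out over several lines. The sketch below uses re.compile() to build the pattern once; the sample strings and names are made up for illustration.

```python
import re

# Verbose mode: whitespace and "# ..." comments inside the pattern are ignored.
date_pattern = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
    """,
    re.VERBOSE,
)
parts = date_pattern.search("Released on 2024-06-20.").groups()
print(parts)  # ('2024', '06', '20')

# Combined flags: case-insensitive and line-by-line anchoring.
a_lines = re.findall(r"^a\w*", "Apple\nbanana\navocado", re.I | re.M)
print(a_lines)  # ['Apple', 'avocado']
```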

Back References

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string.

  • \1: Matches the contents of group 1. re.search('(a)b\\1', 'aba')
  • \2: Matches the contents of group 2. re.search('(a)(b)\\2', 'abb')
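
Two typical uses of backreferences are spotting doubled characters and requiring a closing tag to match its opening tag. A minimal sketch with invented inputs:

```python
import re

# (\w)\1: a word character immediately followed by the same character.
doubled = re.findall(r"(\w)\1", "balloon apple")
print(doubled)  # ['l', 'o', 'p']

# The closing tag must repeat whatever group 1 captured,
# so the mismatched <i>...</b> pair is skipped.
tag = re.search(r"<(\w+)>.*?</\1>", "<i>oops</b> <b>bold</b>")
print(tag.group(), tag.group(1))  # <b>bold</b> b
```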

Lookahead and Lookbehind

Lookahead and lookbehind assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the current string position.

  • a(?=b): Positive lookahead: Matches 'a' only if 'a' is followed by 'b'. re.search('a(?=b)', 'ab')
  • a(?!b): Negative lookahead: Matches 'a' only if 'a' is not followed by 'b'. re.search('a(?!b)', 'ac')
  • (?<=b)a: Positive lookbehind: Matches 'a' only if 'a' is preceded by 'b'. re.search('(?<=b)a', 'ba')
  • (?<!b)a: Negative lookbehind: Matches 'a' only if 'a' is not preceded by 'b'. re.search('(?<!b)a', 'ca')
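
Lookarounds check the surrounding text without including it in the match. The sketch below applies them to a made-up price string; note that the negative lookbehind is paired with \b, since without it the engine could still match the trailing '0' of '$40'.

```python
import re

text = "cost: $40, weight: 40kg, discount: $5"

# Positive lookbehind: digits that come right after a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)
# Positive lookahead: digits that are followed by "kg".
kilos = re.findall(r"\d+(?=kg)", text)
# Negative lookbehind: whole numbers that do not follow a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+", text)

print(dollars, kilos, not_dollars)  # ['40', '5'] ['40'] ['40']

# The lookahead itself is not part of the match.
m = re.search(r"a(?=b)", "ab")
print(m.group(), m.span())  # a (0, 1)
```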

Python’s re module

Python’s re module provides several functions for working with regex. Here are the most commonly used ones besides re.search():

re.match()

This function checks for a match only at the beginning of the string.

print(re.match('abc', 'abcdef'))  # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc')) # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc').group()) # abc
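
The difference from re.search() shows up as soon as the pattern does not sit at position 0. The sketch below also includes re.fullmatch(), which requires the whole string to match; the inputs are arbitrary examples.

```python
import re

text = "xabc"

at_start = re.match(r"abc", text)          # None: 'abc' is not at position 0
anywhere = re.search(r"abc", text).span()  # search scans the whole string
partial = re.fullmatch(r"abc", "abcdef")   # None: trailing 'def' is left over
exact = re.fullmatch(r"abc", "abc").group()

print(at_start, anywhere, partial, exact)  # None (1, 4) None abc

# These functions return None when nothing matches, so check before .group().
label = at_start.group() if at_start else "no match"
print(label)  # no match
```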

re.findall()

This function returns all non-overlapping matches of pattern in string, as a list of strings.

print(re.findall('abc', 'abcdefabc'))  # ['abc', 'abc']
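
One detail worth knowing: when the pattern contains capturing groups, re.findall() returns the groups instead of the whole match. A short sketch on a made-up string, with re.finditer() shown for cases where the Match objects themselves are needed:

```python
import re

text = "x=1, y=22, z=333"

whole = re.findall(r"\w=\d+", text)        # no groups: full matches
one_group = re.findall(r"\w=(\d+)", text)  # one group: just that group
pairs = re.findall(r"(\w)=(\d+)", text)    # several groups: tuples

print(whole)      # ['x=1', 'y=22', 'z=333']
print(one_group)  # ['1', '22', '333']
print(pairs)      # [('x', '1'), ('y', '22'), ('z', '333')]

# finditer yields Match objects, which also carry positions.
starts = [m.start() for m in re.finditer(r"\w=\d+", text)]
print(starts)  # [0, 5, 11]
```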

re.sub()

This function replaces occurrences of the pattern in string with repl. All occurrences are replaced unless the count argument is given.

print(re.sub('abc', '123', 'abcdefabc'))  # 123def123
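
A few variations on the call above, as a sketch: limiting replacements with count, reusing captured groups in the replacement string, and re.subn(), which also reports how many replacements were made. The date string is an arbitrary example.

```python
import re

# count=1 replaces only the first occurrence.
first_only = re.sub(r"abc", "123", "abcdefabc", count=1)
print(first_only)  # 123defabc

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-06-20")
print(reordered)  # 20/06/2024

# subn returns a (new_string, number_of_replacements) tuple.
new_text, n = re.subn(r"abc", "123", "abcdefabc")
print(new_text, n)  # 123def123 2
```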

re.split()

This function splits the source string by the occurrences of the pattern.

print(re.split(r'\d+', 'apple123banana45cherry6'))  # ['apple', 'banana', 'cherry', '']
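
The trailing empty string in the output above appears because the input ends with a delimiter. The sketch below shows one way to drop it, along with maxsplit and the effect of putting the delimiter in a capturing group.

```python
import re

text = "apple123banana45cherry6"

# Filter out empty pieces produced by a leading or trailing delimiter.
words = [piece for piece in re.split(r"\d+", text) if piece]
print(words)  # ['apple', 'banana', 'cherry']

# maxsplit limits the number of splits.
once = re.split(r"\d+", text, maxsplit=1)
print(once)  # ['apple', 'banana45cherry6']

# A capturing group keeps the delimiters in the result.
with_delims = re.split(r"(\d+)", text)
print(with_delims)  # ['apple', '123', 'banana', '45', 'cherry', '6', '']
```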

Popular Examples

import re

# Find all substrings that match a pattern
text = "Hello, my name is John Doe. I live in New York."
matches = re.findall(r'\b\w{4}\b', text)
# matches: ['name', 'John', 'live', 'York']
# This code finds all 4-letter words in the text ('Hello' has 5 letters, so it is not matched).

# Split a string by multiple delimiters
text = "apple;banana-orange:peach"
result = re.split(r'[;:-]', text)
# result: ['apple', 'banana', 'orange', 'peach']
# This code splits the text by either a semicolon, a dash, or a colon.

# Replace substrings that match a pattern
text = "I have 3 cats, 4 dogs, and 5 fishes."
result = re.sub(r'\d', 'many', text)
# result: 'I have many cats, many dogs, and many fishes.'
# This code replaces all digits in the text with the word 'many'.

# Check if a string starts with a pattern
text = "Hello, world!"
result = bool(re.match(r'^Hello', text))
# result: True
# This code checks if the text starts with 'Hello'.

# Extract email addresses from a string
text = "Contact us at info@example.com or support@example.net."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
# emails: ['info@example.com', 'support@example.net']
# This code extracts all email addresses from the text.

# Find all dates in YYYY-MM-DD format
text = "I was born on 2000-01-01. I graduated on 2020-05-15."
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
# dates: ['2000-01-01', '2020-05-15']
# This code extracts all dates in YYYY-MM-DD format from the text.

# Capture groups in a match
text = "The event will be held on 2023-07-10 at 18:00."
match = re.search(r'(\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2})', text)
date, time = match.groups()
# date: '2023-07-10', time: '18:00'
# This code extracts the date and time from the text.

# Match a pattern multiple times
text = "I love apples, apples are my favorite fruit."
matches = re.findall(r'(apples)', text)
# matches: ['apples', 'apples']
# This code finds all occurrences of 'apples' in the text.

# Match a pattern and replace it with a function's result
def replace_with_length(match):
    # Called once per match; the returned string replaces the matched text.
    return str(len(match.group()))

text = "I have a cat, a dog, and a horse."
result = re.sub(r'\ba \w+?\b', replace_with_length, text)
# result: 'I have 5, 5, and 7.'
# This code replaces each 'a [word]' with the length of the whole match (for example, 'a cat' becomes '5').

# Extract the contents of the innermost brackets
text = "foo(bar(baz))blim"
matches = re.findall(r'\(([^()]*)\)', text)
# matches: ['baz']
# This code finds all text within the innermost brackets.

# Find duplicate words
text = "This is is a test test sentence."
dupes = re.findall(r'\b(\w+)\s+\1\b', text)
# dupes: ['is', 'test']
# This code finds all duplicate words in the text.

# Match a pattern except in specific contexts
text = "100 dollars, but not 100 cents"
matches = re.findall(r'100(?!\s+cents)', text)
# matches: ['100']
# This code finds '100' except when it is followed by ' cents'.

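# Match balanced parentheses
# Note: the built-in re module does not support recursion such as (?R);
# this example needs the third-party regex module (pip install regex).
import regex
text = "((()))()()(((())))"
matches = [m.group() for m in regex.finditer(r'\((?:[^()]|(?R))*\)', text)]
# matches: ['((()))', '()', '()', '(((())))']
# This code matches each top-level group of balanced parentheses in the text.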

# Validate a password with certain rules
password = "StrongPass1!"
is_valid = bool(re.match(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$', password))
# is_valid: True
# This code validates the password, which must contain at least one digit, one lowercase letter, one uppercase letter, one special character, and be at least 8 characters long.

# Extract the domain name from a URL
url = "https://www.example.com/path?query#fragment"
domain = re.search(r'https?://([A-Za-z_0-9.-]+).*', url).group(1)
# domain: 'www.example.com'
# This code extracts the domain name from a URL.

# Match a Unicode character
text = "Résumé"
matches = re.findall(r'\w+', text)
# matches: ['Résumé']
# This code finds all words in the text, even if they contain Unicode characters.

# Match repeating words
text = "This is a a test."
repeated_words = re.findall(r'\b(\w+)\s+\1\b', text)
# repeated_words: ['a']
# This code finds all words that are immediately repeated.

# Match words that are palindromes
text = "A man, a plan, a canal, Panama"
palindromes = [word for word in re.findall(r'\b\w+\b', text) if word == word[::-1]]
# palindromes: ['A', 'a', 'a']
# This code finds all palindromes in the text.

# Match words containing 'q' not followed by 'u'
text = "Iraq is a country in the Middle East."
# (?!u) checks the next character without consuming it; [^u] would swallow the following space.
q_not_u_words = re.findall(r'\b\w*q(?!u)\w*\b', text)
# q_not_u_words: ['Iraq']
# This code finds all words in the text that contain 'q' not followed by 'u'.

# Extract all words within double quotes
text = 'He said, "Hello, world!"'
quoted = re.findall(r'"(.*?)"', text)
# quoted: ['Hello, world!']
# This code extracts all text within double quotes.
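
The standard re module has no recursive patterns such as (?R), so balanced parentheses (as in the example above) are sometimes handled with a plain depth counter instead of a regex. The following is a minimal sketch of that alternative, not a regex technique; balanced_groups is a hypothetical helper name, not part of any library.

```python
def balanced_groups(text):
    """Return each top-level balanced parenthesised chunk (hypothetical helper)."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i          # remember where a top-level group opens
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:         # the top-level group just closed
                groups.append(text[start:i + 1])
    return groups

print(balanced_groups("((()))()()(((())))"))
# ['((()))', '()', '()', '(((())))']
print(balanced_groups("foo(bar(baz))blim"))
# ['(bar(baz))']
```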

https://medium.com/@theom/the-ultimate-python-regex-cheat-sheet-f202e99ac21d
