Friday, June 21, 2024

Regex cookbook

Here's how you can correctly match the first word in a string:

import re

text = "subject, adjust, jump, university, major"

# Match the first word in the string.
# re.match already anchors at the start, so neither ^ nor a flag is needed.
match = re.match(r"\w+", text)

if match:
    print(match.group())  # subject
else:
    print("No match found")



To match the last word in a string, you can use regular expressions combined with string manipulation techniques. The approach varies slightly depending on whether you use re.findall, re.search, or another method.

Using re.findall

To find the last word using re.findall, you would typically capture all words and then select the last one:

import re

text = "subject, adjust, jump, university, major"

# Find all words in the string
words = re.findall(r'\w+', text)

# Get the last word (None if the string has no words)
last_word = words[-1] if words else None

print(last_word)  # major
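The re.search route mentioned above can be sketched as follows. This is one possible approach, not the only one: the helper name last_word is hypothetical, and the lookahead assumes that "last word" means the final run of word characters, with only punctuation or whitespace after it.

```python
import re

def last_word(text):
    # \w+ must be followed only by non-word characters up to the end of the string
    match = re.search(r"\w+(?=\W*$)", text)
    return match.group() if match else None

print(last_word("subject, adjust, jump, university, major"))  # major
print(last_word("hello world!"))                              # world
```

Unlike the re.findall version, this does not build a list of every word first.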

Match Words Starting with "gob"

import re

text = "goblin, goblet, gobsmacked, gobble, dog, gob"

# Find words starting with 'gob'
matches = re.findall(r'\bgob\w*', text, flags=re.IGNORECASE)

print(matches)  # ['goblin', 'goblet', 'gobsmacked', 'gobble', 'gob']

Explanation

  • \bgob\w*:
    • \b asserts a word boundary before "gob".
    • gob is the specific prefix we are looking for.
    • \w* matches zero or more word characters following "gob".
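To see what the \b contributes, here is a small comparison with and without it. The sample text is made up for illustration; "hobgoblin" contains "gob" but does not start with it.

```python
import re

# Hypothetical sample text, not from the example above
text = "goblin, hobgoblin, Gobble, gob"

with_boundary = re.findall(r"\bgob\w*", text, flags=re.IGNORECASE)
without_boundary = re.findall(r"gob\w*", text, flags=re.IGNORECASE)

print(with_boundary)     # ['goblin', 'Gobble', 'gob']
print(without_boundary)  # ['goblin', 'goblin', 'Gobble', 'gob']
```

Without the boundary, the tail of "hobgoblin" is picked up as a second "goblin".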


Match Words Ending with "te"

import re

text = "complete, update, bite, great, late, state"

# Find words ending with 'te'
matches = re.findall(r'\b\w*te\b', text.lower())

print(matches)  # ['complete', 'update', 'bite', 'late', 'state']

Explanation

  • \b\w*te\b:
    • \b asserts a word boundary.
    • \w* matches zero or more word characters preceding "te".
    • te is the suffix we are looking for.
    • \b asserts a word boundary to ensure "te" is at the end of the word.



Match Words Containing "uj"

The matches list will contain all words from the text that have the substring "uj" in them. For the provided text no word qualifies ("subject" contains "ubj" and "adjust" contains "ju"), so the output is an empty list:

import re

text = "subject, adjust, jump, university, major"

# Find words containing 'uj' (case-insensitive)
matches = re.findall(r'\b\w*uj\w*\b', text, flags=re.IGNORECASE)

print(matches)  # []

Explanation
  • r'\b\w*uj\w*\b':

    • \b is a word boundary anchor, which ensures that the match occurs at the beginning or end of a word. It's useful if you want to match whole words but is optional if you're just looking for substrings within words.
    • \w* matches any number of word characters (letters, digits, and underscores) before and after the substring uj.
    • uj is the substring you're looking to match within the words.
  • Flags: re.IGNORECASE makes the search case-insensitive.



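Match emails

matches = re.findall(r'\b[\w.-]+@[a-zA-Z-]+\.[a-zA-Z.]{2,6}\b', text)

Return the string after @

import re
text = "Email me at john.doe@example.com or jane_smith123@test.co.uk"
matches = re.findall(r'@(\w+)', text)
print(matches)  # ['example', 'test']

Return the string before @

# \w does not match '.', so for john.doe only the part after the dot is captured
matches = re.findall(r'(\w+)@', text)
print(matches)  # ['doe', 'jane_smith123']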





Match anything that starts with a, is followed by one or more characters, and ends with r

import re
pattern = r"^a.+r$"
text1 = "ar"
text2 = "abr"

print(re.findall(pattern, text1))  # [] (no match)
print(re.findall(pattern, text2))  # ['abr'] (match)

# ^a   the string must start with a

# .+   one or more characters; without the plus sign, exactly one character would have to sit between a and r

# r$   the string must end with r

# Overall: it matches anything that starts with a, followed by one or more characters, and ends with r; for example, it matches abr but not ar
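As a rough side-by-side, the sketch below contrasts .+ with .* on a few inputs. The extra words "amber" and "bar" and the variable names are assumptions added for illustration.

```python
import re

one_or_more = re.compile(r"^a.+r$")   # at least one character between a and r
zero_or_more = re.compile(r"^a.*r$")  # the middle part may be empty

# Map each word to (matches with .+, matches with .*)
results = {
    word: (bool(one_or_more.search(word)), bool(zero_or_more.search(word)))
    for word in ["ar", "abr", "amber", "bar"]
}
print(results)
# {'ar': (False, True), 'abr': (True, True), 'amber': (True, True), 'bar': (False, False)}
```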


Search XXX-XXX-XXXX phone format

import re
text = "ambiorix rodriguez 809-714-2819 809-560-8344 829-561-3454 edad 42"
match = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(match)

['809-714-2819', '809-560-8344', '829-561-3454']
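If the individual parts of each number are needed, named groups are one option. In this sketch the group names (area, exchange, line) and the PHONE constant are hypothetical, and the added \b anchors are an assumption about wanting to skip digit runs that are longer than a phone number.

```python
import re

PHONE = re.compile(r"\b(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})\b")

# The last item has too many leading digits and is skipped thanks to \b
text = "809-560-8344 829-561-3454 1234-567-8901"

area_codes = [m.group("area") for m in PHONE.finditer(text)]
print(area_codes)  # ['809', '829']
```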


Thursday, June 20, 2024

Python Regular expressions full guide

Regular expressions (regex) are a powerful tool for manipulating and analyzing text. In Python, we use the re module to work with regex.

import re  # No need to pip install; it's in the standard library

Patterns

Basic Characters

  • a: Exact match. re.search('a', 'apple')
  • .: Matches any character (except newline). re.search('.', 'apple')
  • \d: Matches any digit (0-9). re.search(r'\d', 'apple2')
  • \D: Matches any non-digit. re.search(r'\D', '1234a')
  • \s: Matches any whitespace. re.search(r'\s', 'apple pie')
  • \S: Matches any non-whitespace. re.search(r'\S', ' apple')
  • \w: Matches any alphanumeric character and underscore (a-z, A-Z, 0-9, _). re.search(r'\w', '@apple!')
  • \W: Matches any non-alphanumeric character (anything \w does not match). re.search(r'\W', 'apple@')

Special Characters

  • \t: Tab. re.search('\t', 'apple\t')
  • \n: Newline. re.search('\n', 'apple\npie')
  • \r: Carriage Return. re.search('\r', 'apple\r\n')
  • \\: Backslash. re.search('\\\\', 'apple\\')

Quantifiers

  • *: Zero or more of the previous item. re.search('a*', 'aaapple')
  • +: One or more of the previous item. re.search('a+', 'aaapple')
  • ?: Zero or one of the previous item. re.search('a?', 'aaapple')
  • {n}: Exactly n of the previous item. re.search('a{2}', 'aaapple')
  • {n,}: n or more of the previous item. re.search('a{2,}', 'aaapple')
  • {,m}: Up to m of the previous item. re.search('a{,2}', 'aaapple')
  • {n,m}: Between n and m of the previous item. re.search('a{2,3}', 'aaaapple')
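A quick way to compare the quantifiers is to run each one against the same string and look at what it actually consumes. This is only a small sketch; the dictionary name matched and the extra 'banana' case are added for illustration.

```python
import re

patterns = ["a*", "a+", "a?", "a{2}", "a{2,}", "a{,2}"]

# What each quantifier grabs at the start of 'aaapple'
matched = {p: re.search(p, "aaapple").group() for p in patterns}
print(matched)
# {'a*': 'aaa', 'a+': 'aaa', 'a?': 'a', 'a{2}': 'aa', 'a{2,}': 'aaa', 'a{,2}': 'aa'}

# 'a*' may also match nothing at all: here it succeeds with an empty match at position 0
empty_match = re.search("a*", "banana").group()
print(repr(empty_match))  # ''
```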

Groups and Ranges

  • [abc]: Matches any of the enclosed characters. re.search('[abc]', 'apple')
  • [^abc]: Matches any character not enclosed. re.search('[^abc]', 'apple')
  • (abc): Defines a group. re.search('(abc)', 'abcapple')
  • (a|b): Matches either a or b. re.search('(a|p)', 'apple')

Anchors

  • ^abc: Matches pattern abc at the start of a string. re.search('^abc', 'abcapple')
  • abc$: Matches pattern abc at the end of a string. re.search('abc$', 'appleabc')
  • \babc: Word boundary (matches abc at the start of a word). re.search('\\babc', 'abc apple')
  • abc\b: Word boundary (matches abc at the end of a word). re.search('abc\\b', 'appleabc pie')
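The doubled backslash in the \b examples matters because, in an ordinary Python string, "\b" is the backspace character. A raw string is the usual alternative. The variable names below are made up for the sketch.

```python
import re

plain = "\bcat"   # "\b" here is the backspace character (one character)
raw = r"\bcat"    # backslash followed by "b": the regex word-boundary anchor

print(len(plain), len(raw))              # 4 5
print(re.search(plain, "a cat"))         # None
print(re.search(raw, "a cat").group())   # cat
```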

Flags

  • re.I or re.IGNORECASE: Makes matching case insensitive. re.search('a', 'APPLE', re.I)
  • re.M or re.MULTILINE: Makes ^ and $ match start and end of each line. re.search('^a', 'apple\nbanana', re.M)
  • re.S or re.DOTALL: Makes . match any character, including newlines. re.search('a.p', 'a\np', re.S)
  • re.X or re.VERBOSE: Allows multiline regular expressions and ignores whitespace and comments in the pattern. re.search("""a # this is a comment\nb""", 'ab', re.X)
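Flags can be combined with |. The sketch below shows the effect of re.M, re.I and re.S on one sample text, which is invented for the example.

```python
import re

text = "apple\nBanana\navocado"

first_only = re.findall(r"^a\w+", text)                # ^ matches only at the very start
each_line = re.findall(r"^a\w+", text, re.M)           # ^ matches at the start of every line
any_case = re.findall(r"^[ab]\w+", text, re.M | re.I)  # flags combined with |

dot_default = re.search(r"e.B", text)                  # . does not match the newline
dot_all = re.search(r"e.B", text, re.S)                # with re.S it does

print(first_only)   # ['apple']
print(each_line)    # ['apple', 'avocado']
print(any_case)     # ['apple', 'Banana', 'avocado']
print(dot_default)  # None
print(repr(dot_all.group()))  # 'e\nB'
```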

Back References

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string.

  • \1: Matches the contents of group 1. re.search('(a)b\\1', 'aba')
  • \2: Matches the contents of group 2. re.search('(a)(b)\\2', 'abb')
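A slightly larger sketch of backreferences follows: finding doubled letters, the named form (?P=name), and using \1 and \2 in a re.sub replacement. The sample strings are assumptions chosen for illustration.

```python
import re

# Letters that appear twice in a row; findall returns the captured group
doubled = re.findall(r"(\w)\1", "bookkeeper")
print(doubled)  # ['o', 'k', 'e']

# The same idea with a named group and a named backreference
named = re.search(r"(?P<ch>\w)(?P=ch)", "apple").group()
print(named)  # pp

# Backreferences also work in the replacement string of re.sub
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)  # world hello
```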

Lookahead and Lookbehind

Lookahead and lookbehind assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the current string position.

  • a(?=b): Positive lookahead: Matches 'a' only if 'a' is followed by 'b'. re.search('a(?=b)', 'ab')
  • a(?!b): Negative lookahead: Matches 'a' only if 'a' is not followed by 'b'. re.search('a(?!b)', 'ac')
  • (?<=b)a: Positive lookbehind: Matches 'a' only if 'a' is preceded by 'b'. re.search('(?<=b)a', 'ba')
  • (?<!b)a: Negative lookbehind: Matches 'a' only if 'a' is not preceded by 'b'. re.search('(?<!b)a', 'ca')
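For a slightly more realistic use, the sketch below pulls values out of a made-up price string. Lookarounds check the surrounding text without including it in the match.

```python
import re

# Hypothetical sample text
text = "price: $30, discount: 5, total: $25"

# Positive lookbehind: digits that come right after a dollar sign
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that are not preceded by a dollar sign
other_numbers = re.findall(r"(?<!\$)\b\d+", text)

# Positive lookahead: words that are followed by a colon
labels = re.findall(r"\w+(?=:)", text)

print(dollar_amounts)  # ['30', '25']
print(other_numbers)   # ['5']
print(labels)          # ['price', 'discount', 'total']
```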

Python’s re module

Python’s re module provides several functions to work with regex. Here are the most used beyond re.search():

re.match()

This function checks for a match only at the beginning of the string.

print(re.match('abc', 'abcdef'))  # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc')) # <re.Match object; span=(0, 3), match='abc'>
print(re.match('abc', 'abcdefabc').group()) # abc
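To make the difference from re.search() concrete, here is a short comparison that also includes re.fullmatch(), which requires the whole string to match. The input strings are chosen only for illustration.

```python
import re

text = "xabcdef"

at_start = re.match("abc", text)           # 'abc' is not at position 0
anywhere = re.search("abc", text)          # found further into the string
whole_no = re.fullmatch("abc", "abcdef")   # extra characters after the match
whole_yes = re.fullmatch("abc", "abc")

print(at_start)          # None
print(anywhere.span())   # (1, 4)
print(whole_no)          # None
print(whole_yes.group()) # abc
```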

re.findall()

This function returns all non-overlapping matches of pattern in string, as a list of strings.

print(re.findall('abc', 'abcdefabc'))  # ['abc', 'abc']
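One detail worth knowing: when the pattern contains capturing groups, re.findall() returns the groups instead of the whole match. A small sketch with made-up addresses:

```python
import re

text = "info@example.com, support@example.net"

no_group = re.findall(r"\w+@\w+", text)        # whole matches
one_group = re.findall(r"(\w+)@\w+", text)     # only the captured part
two_groups = re.findall(r"(\w+)@(\w+)", text)  # a tuple per match

print(no_group)    # ['info@example', 'support@example']
print(one_group)   # ['info', 'support']
print(two_groups)  # [('info', 'example'), ('support', 'example')]
```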

re.sub()

This function replaces all occurrences of the pattern in string with repl, unless the optional count argument limits the number of replacements.

print(re.sub('abc', '123', 'abcdefabc'))  # 123def123
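The number of replacements can be limited with the count argument, and re.subn() also reports how many replacements were made. A minimal sketch:

```python
import re

# Replace only the first occurrence
first_only = re.sub("abc", "123", "abcdefabc", count=1)
print(first_only)  # 123defabc

# re.subn returns the new string together with the number of replacements
result, replacements = re.subn("abc", "123", "abcdefabc")
print(result, replacements)  # 123def123 2
```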

re.split()

This function splits the source string by the occurrences of the pattern.

print(re.split(r'\d+', 'apple123banana45cherry6'))  # ['apple', 'banana', 'cherry', '']
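A few variations on the same input, as a sketch: limiting the number of splits with maxsplit, keeping the delimiters by putting the pattern in a capturing group, and dropping the trailing empty string.

```python
import re

text = "apple123banana45cherry6"

limited = re.split(r"\d+", text, maxsplit=1)   # split at the first delimiter only
with_delims = re.split(r"(\d+)", text)         # a capturing group keeps the delimiters
no_empty = [part for part in re.split(r"\d+", text) if part]  # filter out empty strings

print(limited)      # ['apple', 'banana45cherry6']
print(with_delims)  # ['apple', '123', 'banana', '45', 'cherry', '6', '']
print(no_empty)     # ['apple', 'banana', 'cherry']
```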

Popular Examples

import re

# Find all substrings that match a pattern
text = "Hello, my name is John Doe. I live in New York."
matches = re.findall(r'\b\w{4}\b', text)
# matches: ['name', 'John', 'live', 'York']
# This code finds all 4-letter words in the text.

# Split a string by multiple delimiters
text = "apple;banana-orange:peach"
result = re.split(r'[;:-]', text)
# result: ['apple', 'banana', 'orange', 'peach']
# This code splits the text by either a semicolon, a dash, or a colon.

# Replace substrings that match a pattern
text = "I have 3 cats, 4 dogs, and 5 fishes."
result = re.sub(r'\d', 'many', text)
# result: 'I have many cats, many dogs, and many fishes.'
# This code replaces all digits in the text with the word 'many'.

# Check if a string starts with a pattern
text = "Hello, world!"
result = bool(re.match(r'^Hello', text))
# result: True
# This code checks if the text starts with 'Hello'.

# Extract email addresses from a string
text = "Contact us at info@example.com or support@example.net."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
# emails: ['info@example.com', 'support@example.net']
# This code extracts all email addresses from the text.

# Find all dates in YYYY-MM-DD format
text = "I was born on 2000-01-01. I graduated on 2020-05-15."
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
# dates: ['2000-01-01', '2020-05-15']
# This code extracts all dates in YYYY-MM-DD format from the text.

# Capture groups in a match
text = "The event will be held on 2023-07-10 at 18:00."
match = re.search(r'(\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2})', text)
date, time = match.groups()
# date: '2023-07-10', time: '18:00'
# This code extracts the date and time from the text.

# Match a pattern multiple times
text = "I love apples, apples are my favorite fruit."
matches = re.findall(r'(apples)', text)
# matches: ['apples', 'apples']
# This code finds all occurrences of 'apples' in the text.

# Match a pattern and replace it with a function's result
def replace_with_length(match):
    # group(1) is the word that follows 'a '
    return str(len(match.group(1)))

text = "I have a cat, a dog, and a horse."
result = re.sub(r'\ba (\w+)\b', replace_with_length, text)
# result: 'I have 3, 3, and 5.'
# This code replaces all 'a [word]' with the length of '[word]'.

# Match nested brackets correctly
text = "foo(bar(baz))blim"
matches = re.findall(r'\(([^()]*)\)', text)
# matches: ['baz']
# This code finds all text within the innermost brackets.

# Find duplicate words
text = "This is is a test test sentence."
dupes = re.findall(r'\b(\w+)\s+\1\b', text)
# dupes: ['is', 'test']
# This code finds all duplicate words in the text.

# Match a pattern except in specific contexts
text = "100 dollars, but not 100 cents"
matches = re.findall(r'100(?!\s+cents)', text)
# matches: ['100']
# This code finds '100' except when it is followed by ' cents'.

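# Match balanced parentheses
# The built-in re module does not support recursive patterns such as (?R);
# this example assumes the third-party regex module (pip install regex).
import regex
text = "((()))()()(((())))"
matches = [m.group() for m in regex.finditer(r'\((?:[^()]|(?R))*\)', text)]
# matches: ['((()))', '()', '()', '(((())))']
# This code matches balanced parentheses in the text.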

# Validate a password with certain rules
password = "StrongPass1!"
is_valid = bool(re.match(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$', password))
# is_valid: True
# This code validates the password, which must contain at least one digit, one lowercase letter, one uppercase letter, one special character, and be at least 8 characters long.

# Extract the domain name from a URL
url = "https://www.example.com/path?query#fragment"
domain = re.search(r'https?://([A-Za-z_0-9.-]+).*', url).group(1)
# domain: 'www.example.com'
# This code extracts the domain name from a URL.

# Match a Unicode character
text = "Résumé"
matches = re.findall(r'\w+', text)
# matches: ['Résumé']
# This code finds all words in the text, even if they contain Unicode characters.

# Match repeating words
text = "This is a a test."
repeated_words = re.findall(r'\b(\w+)\s+\1\b', text)
# repeated_words: ['a']
# This code finds all words that are immediately repeated.

# Match words that are palindromes
text = "A man, a plan, a canal, Panama"
palindromes = [word for word in re.findall(r'\b\w+\b', text) if word == word[::-1]]
# palindromes: ['A', 'a', 'a']
# This code finds all words that read the same backwards (the comparison is case-sensitive, so 'Panama' does not count).

# Match words containing 'q' not followed by 'u'
text = "Iraq is a country in the Middle East."
q_not_u_words = re.findall(r'\b\w*q(?!u)\w*\b', text)
# q_not_u_words: ['Iraq']
# This code finds all words in the text that contain 'q' not followed by 'u'.

# Extract all words within double quotes
text = 'He said, "Hello, world!"'
quoted = re.findall(r'"(.*?)"', text)
# quoted: ['Hello, world!']

# This code extracts all words within double quotes from the text. 

https://medium.com/@theom/the-ultimate-python-regex-cheat-sheet-f202e99ac21d