MCS 260 Fall 2020
Emily Dumas
.
— matches any character except newline\s
— matches any whitespace character\d
— matches a decimal digit+
— previous item must repeat 1 or more times*
— previous item must repeat 0 or more times?
— previous item must repeat 0 or 1 times{n}
— previous item must appear n times(...)
— treat part of a pattern as a unit and capture its match into a group[...]
— match any one of a set of charactersA|B
— match either pattern A
or pattern B
.^
— match the beginning of the string.$
— match the end of the string or the end of the line.re.search(pattern,string)
— does string
contain a match to the pattern
? Return a match object or None
.re.finditer(pattern,string)
— Return an iterable containing all non-overlapping matches as match objects.re.findall(pattern,string)
— return a list of all non-overlapping matches as strings.Find all of the phone numbers in a string that are written in the format 319-555-1012
, and split each one into area code (e.g. 319
), exchange (e.g. 555
), and line number (e.g. 1012
).
Give a list of characters and to match any one of them.
[abc]
matches any of the characters a,b,c
.
[^abc]
matches any character except a,b,c
.
[A-Za-z]
matches any alphabet letter.
[0-9a-fA-F]
matches any hex digit.
A|B
matches either pattern A
or pattern B
.
Use this inside parentheses to limit how much of the pattern is considered to be part of A
or B
, e.g.
[Hh](ello|i),? my name is (.*).
Let's make a program to find function definitions in a Python source file and print the function names.
What is the size of a file if we open and write one of these words to it?
U+1F60A
)Note: The last item in the list above has an emoji which doesn't render correctly in the PDF slides.
As the OS sees it, a file is a sequence of bytes. To write text, we need to decide how to represent code points as bytes.
A scheme to do this is an encoding. Encodings can also specify which code points are allowed.
The default encoding in Python is usually UTF-8, though officially this is platform-dependent.
In UTF-8, the first 128 code points are stored as a single byte. Others become two, three, or four bytes.
Opening a file with "b"
in its mode string will make it a binary file. E.g. "rb"
reads a binary file, "wb"
writes to one.
Reading from a binary file gives a bytes
object, a sequence of ints in the range 0 to 255.
We can decode bytes into a string with the method .decode()
, and can encode a string as bytes with .encode()
. Each takes optional encoding parameter.
print
are lacking parentheses. Otherwise, the code should work.re
module is good as a reference, but may not be ideal to learn from.