Before the lab
Background
Discussion
- Time
- 5 minutes
- Any questions about A5?
Regular Expressions
- Time
- 15 minutes
- Activity
- Demo / Group discussion
To get familiar with regular expressions, we follow the street address Case Study.
Try the following evaluations in a Python REPL.
>>> '100 NORTH MAIN ROAD'.replace('ROAD', 'RD.')
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD', 'RD.')
# oops
>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')
# ugh, that code
>>> import re
>>> re.sub('ROAD$', 'RD.', s)
# what dark magic is this?
Regular expressions are a domain-specific language that allows us to
specify complicated string operations. In practice, the simple $ we
used above is not enough.
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
# New regex feature \b.
>>> re.sub('\\bROAD$', 'RD.', s)
# Raw strings reduce \ overload
>>> re.sub(r'\bROAD$', 'RD.', s)
# Our new regex is too "narrow"
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
>>> re.sub(r'\bROAD\b', 'RD.', s)
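For reference, the session above can be collected into a short script that records what each call returns. This is only a sketch: the helper name abbreviate_road is an assumption introduced here for illustration and is not part of the case study.

```python
import re

# Plain str.replace rewrites every occurrence, including the one inside BROAD.
s = '100 NORTH BROAD ROAD'
naive = s.replace('ROAD', 'RD.')                # '100 NORTH BRD. RD.'

# Anchoring with $ only rewrites ROAD at the very end of the string...
anchored = re.sub('ROAD$', 'RD.', s)            # '100 NORTH BROAD RD.'

# ...but it still mangles a street whose name merely ends in ROAD.
mangled = re.sub('ROAD$', 'RD.', '100 BROAD')   # '100 BRD.'


def abbreviate_road(address):
    """Replace ROAD with RD. only where it is a whole word (hypothetical helper)."""
    return re.sub(r'\bROAD\b', 'RD.', address)


print(abbreviate_road('100 BROAD'))               # 100 BROAD
print(abbreviate_road('100 BROAD ROAD APT. 3'))   # 100 BROAD RD. APT. 3
```

The word-boundary version handles both earlier problem cases: it leaves BROAD alone and does not require ROAD to be the last word.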
In the next part we will need to use a few fancier features.
import re
rex = re.compile(r'([^0-9]+)')
for match in rex.findall('113abba999bjorn78910101benny888331dancing34234queen'):
    print(match)
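The pattern [^0-9]+ matches a run of one or more characters that are not digits, and because the pattern contains a single group, findall returns what that group captured. A small sketch that contrasts it with the complementary class [0-9]+ (the variable names are chosen here for illustration):

```python
import re

text = '113abba999bjorn78910101benny888331dancing34234queen'

# [^0-9]+ : one or more characters that are NOT digits.
words = re.findall(r'([^0-9]+)', text)

# [0-9]+ : one or more digits, the complementary character class.
numbers = re.findall(r'[0-9]+', text)

print(words)    # ['abba', 'bjorn', 'benny', 'dancing', 'queen']
print(numbers)  # ['113', '999', '78910101', '888331', '34234']
```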
Stripping Quotes
- Time
- 25 minutes
- Activity
- Individual
parse_csv.py:
def split_csv(string):
    return [row.split(",") for row in string.splitlines()]
from parse_csv import split_csv
test_string_1 = """OPEID,INSTNM,TUITIONFEE_OUT
02503400,Amridge University,6900
00100700,Central Alabama Community College,7770
01218200,Chattahoochee Valley Community College,7830
00101500,Enterprise State Community College,7770
00106000,James H Faulkner State Community College,7770
00101700,Gadsden State Community College,5976
00101800,George C Wallace State Community College-Dothan,7710
"""
table1 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
['02503400', 'Amridge University', '6900'],
['00100700', 'Central Alabama Community College', '7770'],
['01218200', 'Chattahoochee Valley Community College', '7830'],
['00101500', 'Enterprise State Community College', '7770'],
['00106000', 'James H Faulkner State Community College', '7770'],
['00101700', 'Gadsden State Community College', '5976'],
['00101800', 'George C Wallace State Community College-Dothan', '7710']]
def test_split_1():
    assert split_csv(test_string_1) == table1
In general, entries of CSV files can be quoted, but the quotes are not
considered part of the content. In particular, a correct version of
split_csv should pass the following test.
test_string_2 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College-Dothan",7710
'''
def test_split_2():
    assert split_csv(test_string_2) == table1
Ours doesn't yet, so let's try to fix that using regular expressions
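To see concretely what goes wrong, the sketch below repeats the one-line split_csv so that it runs on its own, and applies it to a shortened two-column input made up for this illustration. Splitting on commas leaves the quote characters inside the entry.

```python
# Same one-line definition as in parse_csv.py, repeated so the sketch runs alone.
def split_csv(string):
    return [row.split(",") for row in string.splitlines()]


# Shortened, made-up input with one quoted entry.
rows = split_csv('OPEID,INSTNM\n02503400,"Amridge University"\n')

print(rows[1])   # ['02503400', '"Amridge University"']
```

The second entry still carries its surrounding quotes, so it does not compare equal to the unquoted string in table1.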
- Fill in the regex in strip_quotes so that it passes the following test:
def test_strip_quotes():
    assert strip_quotes('"hello"') == 'hello'
    assert strip_quotes('hello') == 'hello'
- Here is a skeleton for strip_quotes:
import re

def strip_quotes(string):
    strip_regex = re.compile( )
    search = strip_regex.search(string)
    if search:
        return search.group(1)
    else:
        return None
- You'll want to refer to regular expression features.
- The use of group(1) in the skeleton means your regex solution should have exactly one set of (…), wrapped around a regex matching the non-quoted part.
- You can say something is optional by using …?, and allow any number of repetitions with …*.
- A character not in a given set can be matched with [^…].
- Once you have a working strip_quotes, use it in parse_csv in order to make the test above pass.
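The features named in the hints can be tried out on their own before combining them. The sketch below uses unrelated, made-up strings, so it illustrates the syntax without being a solution for strip_quotes.

```python
import re

# "?" makes the preceding item optional: both spellings match.
optional = [bool(re.fullmatch(r'colou?r', w)) for w in ('color', 'colour')]

# "*" allows any number of repetitions, including zero.
repeated = re.search(r'ab*', 'abbbc').group(0)

# [^…] matches one character that is not in the set; here, anything but ';'.
# The parentheses form group 1, which is what group(1) returns.
match = re.search(r'=([^;]*);', 'key=value;rest')
captured = match.group(1)

print(optional, repeated, captured)   # [True, True] abbb value
```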
Handling quoted commas
- Time
- 30 minutes
- Activity
- Individual
It turns out one of the main reasons for supporting quotes is to handle quoted commas.
The function split_row_3 is intended to split rows with exactly 3 columns.
def test_split_row_3():
    assert split_row_3('00101800,"George C Wallace State Community College, Dothan",7710') == \
        ['00101800', 'George C Wallace State Community College, Dothan', '7710']
Read the discussion on verbose regular expressions.
Complete the definition of split_row_3. You'll want to figure out a regular expression that matches either a quoted or an unquoted column, and then repeat that 3 times. "Or" in regular expressions is implemented with |. You will want to use [^…] once for each case: in one case for excluding " and in the other for excluding a comma (,).
def split_row_3(string):
    split_regex = re.compile(
        r'''^     # start
        (" "| )   # column
        ,
        (" "| )   # column
        ,
        (" "| )   # column
        $''', re.VERBOSE)
    search = split_regex.search(string)
    if search:
        return [strip_quotes(col) for col in search.groups()]
    else:
        return None
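As a warm-up for the skeleton above, here is a minimal sketch of a verbose regular expression that uses | inside groups. The pattern and the names pair_regex and split_pair are made up for this illustration; it parses two fields separated by a hyphen, not CSV columns.

```python
import re

# Hypothetical pattern: two fields separated by '-', where each field is
# either a run of digits or a run of lowercase letters.
pair_regex = re.compile(
    r'''^              # start of the string
    ([0-9]+ | [a-z]+)  # first field: digits OR lowercase letters
    -                  # literal separator
    ([0-9]+ | [a-z]+)  # second field: same two alternatives
    $''', re.VERBOSE)


def split_pair(string):
    """Return the two fields as a list, or None if the string does not match."""
    search = pair_regex.search(string)
    return list(search.groups()) if search else None


print(split_pair('abc-123'))   # ['abc', '123']
print(split_pair('abc_123'))   # None
```

With re.VERBOSE, whitespace outside a character class and everything after # on a line are ignored, which is why the spaces around | above do not change the pattern. A literal space would have to be escaped or placed inside [ ].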
- Use your split_row_3 function in split_csv to pass the following test:
test_string_3 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College, Dothan",7710
'''
table2 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
['02503400', 'Amridge University', '6900'],
['00100700', 'Central Alabama Community College', '7770'],
['01218200', 'Chattahoochee Valley Community College', '7830'],
['00101500', 'Enterprise State Community College', '7770'],
['00106000', 'James H Faulkner State Community College', '7770'],
['00101700', 'Gadsden State Community College', '5976'],
['00101800', 'George C Wallace State Community College, Dothan', '7710']]
def test_split_3():
    '''Check handling of quoted commas'''
    assert split_csv(test_string_3) == table2
Parsing more columns
- Time
- 20 minutes
- Activity
- Individual
Use your column matching regex, along with the findall method, to
match any number of columns. Call your new function split_row.
def test_split_row():
    assert split_row('00101800,"George C Wallace State Community College, Dothan",7710,",,,"') == \
        ['00101800', 'George C Wallace State Community College, Dothan', '7710', ',,,']
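What findall returns depends on how many groups the pattern contains, which matters when reusing a column regex that has parentheses. The sketch below shows the three cases on made-up strings; it is not a solution for split_row.

```python
import re

# No groups: findall returns each whole match.
tokens = re.findall(r'[0-9]+|[a-z]+', 'abc123def')

# One group: findall returns only what the group captured.
values = re.findall(r'=([0-9]+)', 'a=1, b=22')

# Several groups: findall returns one tuple per match.
pairs = re.findall(r'([a-z]+)=([0-9]+)', 'a=1, b=22')

print(tokens)   # ['abc', '123', 'def']
print(values)   # ['1', '22']
print(pairs)    # [('a', '1'), ('b', '22')]
```

Note that findall also reports empty matches, so a pattern that is able to match the empty string may produce empty entries in the result.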
Use your new function in place of split_row_3
so that the following test (and all previous tests) passes.
test_string_4=\
'''OPEID,INSTNM,PCIP52,TUITIONFEE_OUT
00103800,Snead State Community College,0.0811,7830
00573400,H Councill Trenholm State Community College,0.0338,7524
00573300,"Bevill, State, Community College",0.0451,7800
00884300,Alaska Bible College,0,9300
00107100,Arizona Western College,0.0425,9530
00107200,"Cochise County Community College, District",0.0169,6000
'''
table3=[
['OPEID', 'INSTNM', 'PCIP52', 'TUITIONFEE_OUT'],
['00103800', 'Snead State Community College', '0.0811', '7830'],
['00573400', 'H Councill Trenholm State Community College', '0.0338', '7524'],
['00573300', 'Bevill, State, Community College', '0.0451', '7800'],
['00884300', 'Alaska Bible College', '0', '9300'],
['00107100', 'Arizona Western College', '0.0425', '9530'],
['00107200', 'Cochise County Community College, District', '0.0169', '6000']]
def test_split_4():
    assert split_csv(test_string_4) == table3