UNB CS, David Bremner, CS2613: Lab 16

Before the lab

Background


Discussion

Time
5 minutes

Regular Expressions

Time
15 minutes
Activity
Demo / Group discussion

To get familiar with regular expressions, we follow the street address Case Study.

Try the following evaluations in a Python REPL.

>>> '100 NORTH MAIN ROAD'.replace('ROAD', 'RD.')
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD', 'RD.')
# oops
>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')
# ugh, that code
>>> import re
>>> re.sub('ROAD$', 'RD.', s)
# what dark magic is this?

Regular expressions are a domain-specific language that allows us to specify complicated string operations. In practice, the simple $ anchor we used above is not enough.

>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
# New regex feature \b.
>>> re.sub('\\bROAD$', 'RD.', s)
# Raw strings reduce \ overload
>>> re.sub(r'\bROAD$', 'RD.', s)
# Our new regex is too "narrow"
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
>>> re.sub(r'\bROAD\b', 'RD.', s)
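
The evaluations above can be gathered into one small script. This is only a sketch: the helper name abbreviate_road and the list of sample addresses are made up for illustration and are not part of the case study.

```python
import re

def abbreviate_road(address):
    """Replace the whole word ROAD with RD., wherever it occurs."""
    # \b matches only at a word boundary, so the ROAD inside BROAD is untouched.
    return re.sub(r'\bROAD\b', 'RD.', address)

# Sample addresses taken from the REPL session above.
samples = [
    '100 NORTH MAIN ROAD',
    '100 NORTH BROAD ROAD',
    '100 BROAD',
    '100 BROAD ROAD APT. 3',
]

for address in samples:
    print(address, '->', abbreviate_road(address))
```

Compare with the `$` version: `$` only matches at the end of the string, so it misses the ROAD in the last sample, while `\b` on both sides finds the word anywhere.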

In the next part we will need to use a few fancier features.

import re
rex=re.compile(r'([^0-9]+)')
for match in rex.findall('113abba999bjorn78910101benny888331dancing34234queen'):
    print(match)
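
What findall returns depends on how many groups the pattern has, which matters once a regex has several parenthesized parts. The following sketch uses a shortened version of the string above; the three patterns are illustrative choices, not the ones needed later in the lab.

```python
import re

text = '113abba999bjorn78910101benny'

# No groups: findall returns the whole text of each match.
whole = re.findall(r'[^0-9]+', text)
print(whole)   # ['abba', 'bjorn', 'benny']

# One group: findall returns only what the group captured.
one_group = re.findall(r'[0-9]+([^0-9]+)', text)
print(one_group)   # ['abba', 'bjorn', 'benny']

# Two groups: findall returns one tuple per match, one entry per group.
two_groups = re.findall(r'([0-9]+)([^0-9]+)', text)
print(two_groups)  # [('113', 'abba'), ('999', 'bjorn'), ('78910101', 'benny')]
```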

Stripping Quotes

Time
25 minutes
Activity
Individual

parse_csv.py:

def split_csv(string):
    return [ row.split(",") for row in string.splitlines() ]

Tests (in a separate file):

from parse_csv import split_csv

test_string_1 = """OPEID,INSTNM,TUITIONFEE_OUT
02503400,Amridge University,6900
00100700,Central Alabama Community College,7770
01218200,Chattahoochee Valley Community College,7830
00101500,Enterprise State Community College,7770
00106000,James H Faulkner State Community College,7770
00101700,Gadsden State Community College,5976
00101800,George C Wallace State Community College-Dothan,7710
"""

table1 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
          ['02503400', 'Amridge University', '6900'],
          ['00100700', 'Central Alabama Community College', '7770'],
          ['01218200', 'Chattahoochee Valley Community College', '7830'],
          ['00101500', 'Enterprise State Community College', '7770'],
          ['00106000', 'James H Faulkner State Community College', '7770'],
          ['00101700', 'Gadsden State Community College', '5976'],
          ['00101800', 'George C Wallace State Community College-Dothan', '7710']]

def test_split_1():
    assert split_csv(test_string_1) == table1

In general, entries in CSV files can be quoted, but the quotes are not considered part of the content. In particular, a correct version of split_csv should pass the following test.

test_string_2 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College-Dothan",7710
'''

def test_split_2():
    assert split_csv(test_string_2) == table1
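
To see why a plain split on commas is not enough for this test, it helps to look at what it produces for a single quoted row. A minimal sketch, using one row from test_string_2 (the variable names row and fields are just for illustration):

```python
# One data row from test_string_2, with a quoted second column.
row = '02503400,"Amridge University",6900'

# Splitting on commas keeps the quote characters as part of the entry.
fields = row.split(",")
print(fields)   # ['02503400', '"Amridge University"', '6900']

# The quoted entry therefore differs from the expected table entry.
print(fields[1] == 'Amridge University')   # False
```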

Our split_csv doesn't pass this test yet, so let's fix that using regular expressions. Start by completing the regular expression in strip_quotes below.

def test_strip_quotes():
    assert strip_quotes('"hello"') == 'hello'
    assert strip_quotes('hello') == 'hello'

def strip_quotes(string):
    # Exercise: fill in a regex whose first group captures the text without surrounding quotes.
    strip_regex = re.compile(               )
    search = strip_regex.search(string)
    if search:
        return search.group(1)
    else:
        return None
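
The search / group(1) pattern used in strip_quotes can be seen in isolation on a different problem. The sketch below deliberately does not solve the exercise; apartment_number is a hypothetical helper that pulls the apartment number out of an address like the ones in the first part.

```python
import re

def apartment_number(address):
    """Return the digits after 'APT. ' at the end of address, or None."""
    # The parentheses form group 1; the rest of the pattern must match
    # but is not part of what we return.
    apt_regex = re.compile(r'APT\. ([0-9]+)$')
    search = apt_regex.search(address)
    if search:
        return search.group(1)   # group(0) would be the whole match, e.g. 'APT. 3'
    else:
        return None              # search returns None when nothing matches

print(apartment_number('100 BROAD ROAD APT. 3'))   # 3
print(apartment_number('100 BROAD ROAD'))          # None
```

Note that test_strip_quotes expects strip_quotes('hello') to return 'hello', so unlike this example your regex has to match unquoted strings as well.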

Handling quoted commas

Time
30 minutes
Activity
Individual

It turns out one of the main reasons for supporting quotes is to handle quoted commas. The function split_row_3 is intended to split rows with exactly 3 columns.

def test_split_row_3():
    assert split_row_3('00101800,"George C Wallace State Community College, Dothan",7710') == \
                ['00101800', 'George C Wallace State Community College, Dothan', '7710']

def split_row_3(string):
    split_regex=re.compile(
        r'''^   # start
        ("     "|     )     # column
        ,
        ("     "|     )     # column
        ,
        ("     "|     )     # column
        $''', re.VERBOSE)
    search = split_regex.search(string)
    if search:
        return [ strip_quotes(col) for col in search.groups() ]
    else:
        return None

test_string_3 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College, Dothan",7710
'''

table2 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
          ['02503400', 'Amridge University', '6900'],
          ['00100700', 'Central Alabama Community College', '7770'],
          ['01218200', 'Chattahoochee Valley Community College', '7830'],
          ['00101500', 'Enterprise State Community College', '7770'],
          ['00106000', 'James H Faulkner State Community College', '7770'],
          ['00101700', 'Gadsden State Community College', '5976'],
          ['00101800', 'George C Wallace State Community College, Dothan', '7710']]

def test_split_3():
    '''Check handling of quoted commas'''
    assert split_csv(test_string_3) == table2
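
The skeleton of split_row_3 relies on two features: the re.VERBOSE flag and alternation with |. Here is a rough sketch of both on an unrelated pattern, so as not to give away the exercise. The name address_regex and the pattern itself are assumptions made up for this illustration.

```python
import re

address_regex = re.compile(
    r'''^               # start
    ([0-9]+)            # street number
    [ ]                 # a literal space: bare spaces are ignored in VERBOSE mode
    (.+)                # street name
    [ ]
    (ROAD|RD\.)         # suffix: either of two alternatives
    $''', re.VERBOSE)

for address in ['100 NORTH BROAD ROAD', '100 MAIN RD.', '100 BROAD']:
    search = address_regex.search(address)
    # groups() returns one string per parenthesized group, in order.
    print(address, '->', search.groups() if search else None)
```

In VERBOSE mode, whitespace in the pattern is ignored and # starts a comment, which is what makes the column-by-column layout possible. A space that should really be matched has to be written as [ ] or escaped with a backslash.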

Parsing more columns

Time
20 minutes
Activity
Individual

Use your column-matching regex, along with the findall method, to match any number of columns. Call your new function split_row.

def test_split_row():
    assert split_row('00101800,"George C Wallace State Community College, Dothan",7710,",,,"') == \
                ['00101800', 'George C Wallace State Community College, Dothan', '7710',',,,']
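
One thing to watch for when using findall here: if a pattern is able to match the empty string, findall will also report empty matches. The sketch below shows this on a toy string; the patterns are illustrative only and are not the column regex.

```python
import re

text = 'a1b22'

# * allows zero digits, so findall also returns empty matches
# at positions where no digit is found.
with_star = re.findall(r'[0-9]*', text)
print(with_star)

# + requires at least one digit, so every match is non-empty.
with_plus = re.findall(r'[0-9]+', text)
print(with_plus)   # ['1', '22']

# One possible workaround (an assumption, not the only option):
# filter out the empty strings afterwards.
non_empty = [m for m in with_star if m]
print(non_empty)   # ['1', '22']
```

If your split_row returns more columns than expected, stray empty matches like these are a likely cause.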

Use your new function in place of split_row_3 so that the following test (and all previous tests) pass:

test_string_4=\
'''OPEID,INSTNM,PCIP52,TUITIONFEE_OUT
00103800,Snead State Community College,0.0811,7830
00573400,H Councill Trenholm State Community College,0.0338,7524
00573300,"Bevill, State, Community College",0.0451,7800
00884300,Alaska Bible College,0,9300
00107100,Arizona Western College,0.0425,9530
00107200,"Cochise County Community College, District",0.0169,6000
'''

table3=[
    ['OPEID', 'INSTNM', 'PCIP52', 'TUITIONFEE_OUT'],
    ['00103800', 'Snead State Community College', '0.0811', '7830'],
    ['00573400', 'H Councill Trenholm State Community College', '0.0338', '7524'],
    ['00573300', 'Bevill, State, Community College', '0.0451', '7800'],
    ['00884300', 'Alaska Bible College', '0', '9300'],
    ['00107100', 'Arizona Western College', '0.0425', '9530'],
    ['00107200', 'Cochise County Community College, District', '0.0169', '6000']]

def test_split_4():
    assert split_csv(test_string_4) == table3