CS2613 Lab 18

Background

Splitting strings

Time
20 minutes
Activity
Small Groups

Use the split method, the splitlines method, and a list comprehension to implement the function split_csv (spoiler: this is a bad CSV parser, missing many special cases).

from parse_csv import split_csv

test_string_1 = """OPEID,INSTNM,TUITIONFEE_OUT
02503400,Amridge University,6900
00100700,Central Alabama Community College,7770
01218200,Chattahoochee Valley Community College,7830
00101500,Enterprise State Community College,7770
00106000,James H Faulkner State Community College,7770
00101700,Gadsden State Community College,5976
00101800,George C Wallace State Community College-Dothan,7710
"""

table1 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
          ['02503400', 'Amridge University', '6900'],
          ['00100700', 'Central Alabama Community College', '7770'],
          ['01218200', 'Chattahoochee Valley Community College', '7830'],
          ['00101500', 'Enterprise State Community College', '7770'],
          ['00106000', 'James H Faulkner State Community College', '7770'],
          ['00101700', 'Gadsden State Community College', '5976'],
          ['00101800', 'George C Wallace State Community College-Dothan', '7710']]

def test_split_1():
    assert split_csv(test_string_1) == table1
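
As a quick reminder of the three building blocks named above, the following sketch runs each of them on a small made-up string. The names sample, rows, header, and lengths are illustrative only, and a semicolon is used as the separator, so this is not the lab solution. It also shows why splitlines is convenient for text that ends with a newline.

```python
# Illustrative sample text, not part of the lab data.
sample = "name;age\nada;36\n"

# Splitting on '\n' leaves an empty string after the final newline ...
by_newline = sample.split("\n")       # ['name;age', 'ada;36', '']

# ... while splitlines does not.
rows = sample.splitlines()            # ['name;age', 'ada;36']

# split with an explicit separator breaks one row into fields.
header = rows[0].split(";")           # ['name', 'age']

# A list comprehension builds a new list from each element of another.
lengths = [len(row) for row in rows]  # [8, 6]
```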

Stripping Quotes

Time
30 minutes
Activity
Small Groups

In general, entries of CSV files can have quotes, but these are not considered part of the content. In particular, a correct version of split_csv should pass the following test.

test_string_2 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College-Dothan",7710
'''

def test_split_2():
    assert split_csv(test_string_2) == table1
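
To see why a parser built only on split is likely to fail this test, it may help to look at one row in isolation. This is a small sketch, assuming the simple split-based approach from the previous part; row and fields are illustrative names.

```python
row = '02503400,"Amridge University",6900'
fields = row.split(",")

# The quote characters are kept as part of the middle field,
# so it does not compare equal to 'Amridge University'.
print(fields)  # ['02503400', '"Amridge University"', '6900']
```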

Our split_csv doesn't pass this test yet, so let's try to fix that using regular expressions.

import re

def test_strip_quotes():
    assert strip_quotes('"hello"') == 'hello'
    assert strip_quotes('hello') == 'hello'

def strip_quotes(string):
    strip_regex = re.compile(               )
    search = strip_regex.search(string)
    if search:

    else:
        return None
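
The skeleton above depends on three pieces of the re module: re.compile, the search method of a compiled pattern, and capturing groups on the resulting match object. As a hedged sketch of how these fit together, here is the same shape applied to a different, made-up problem: removing an optional leading and trailing square bracket. The function name strip_brackets and its pattern are assumptions for illustration, not the intended answer to the exercise.

```python
import re

def strip_brackets(string):
    """Strip an optional leading '[' and an optional trailing ']'."""
    # ^ and $ anchor the match to the whole string; \[? and \]? make each
    # bracket optional; the parentheses capture the bracket-free text between.
    bracket_regex = re.compile(r'^\[?([^\[\]]*)\]?$')
    search = bracket_regex.search(string)
    if search:
        # group(1) is the text matched by the first pair of parentheses.
        return search.group(1)
    else:
        return None
```

With this sketch, strip_brackets('[hello]') and strip_brackets('hello') both return 'hello', while a string with a bracket in the middle, such as 'a[b', does not match and gives None.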

Handling quoted commas

Time
30 minutes
Activity
Small Groups

It turns out one of the main reasons for supporting quotes is to handle quoted commas. The function split_row_3 is intended to split rows with exactly 3 columns.
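
The problem with quoted commas can be seen by applying plain split to the row used in the test for this part. This is only a sketch of the failure; row and pieces are illustrative names.

```python
row = '00101800,"George C Wallace State Community College, Dothan",7710'
pieces = row.split(",")

# The comma inside the quotes is treated as a separator, giving 4 pieces
# instead of the 3 columns the row actually has.
print(len(pieces))  # 4

# The quoted field is cut in two:
#   pieces[1] == '"George C Wallace State Community College'
#   pieces[2] == ' Dothan"'
```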

def test_split_row_3():
    assert split_row_3('00101800,"George C Wallace State Community College, Dothan",7710') == \
                ['00101800', 'George C Wallace State Community College, Dothan', '7710']

def split_row_3(string):
    split_regex = re.compile(
        r'''^   # start






        $''', re.VERBOSE)
    search = split_regex.search(string)
    if search:
        return [strip_quotes(col) for col in search.groups()]
    else:
        return None

test_string_3 = '''OPEID,INSTNM,TUITIONFEE_OUT
02503400,"Amridge University",6900
00100700,"Central Alabama Community College",7770
01218200,"Chattahoochee Valley Community College",7830
00101500,"Enterprise State Community College",7770
00106000,"James H Faulkner State Community College",7770
00101700,"Gadsden State Community College",5976
00101800,"George C Wallace State Community College, Dothan",7710
'''

table2 = [['OPEID', 'INSTNM', 'TUITIONFEE_OUT'],
          ['02503400', 'Amridge University', '6900'],
          ['00100700', 'Central Alabama Community College', '7770'],
          ['01218200', 'Chattahoochee Valley Community College', '7830'],
          ['00101500', 'Enterprise State Community College', '7770'],
          ['00106000', 'James H Faulkner State Community College', '7770'],
          ['00101700', 'Gadsden State Community College', '5976'],
          ['00101800', 'George C Wallace State Community College, Dothan', '7710']]

def test_split_3():
    '''Check handling of quoted commas'''
    assert split_csv(test_string_3) == table2
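
The split_row_3 skeleton uses re.VERBOSE and search.groups(). As a rough illustration of both on an unrelated, made-up input, the sketch below splits a date string into three parts. In verbose mode, whitespace in the pattern is ignored and # starts a comment, unless either is escaped or inside a character class. The names date_regex and parts, and the date itself, are assumptions for this example only.

```python
import re

date_regex = re.compile(
    r'''^           # start of string
        (\d{4})     # group 1: four-digit year
        -           # literal separator
        (\d{2})     # group 2: two-digit month
        -           # literal separator
        (\d{2})     # group 3: two-digit day
        $           # end of string
    ''', re.VERBOSE)

search = date_regex.search('2024-03-15')

# groups() returns a tuple with one string per capturing group, in order.
parts = list(search.groups())   # ['2024', '03', '15']
```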

Parsing more columns

Time
20 minutes
Activity
Small Groups

Use your column-matching regex, along with the findall method, to match any number of columns. Call your new function split_row.

def test_split_row():
    assert split_row('00101800,"George C Wallace State Community College, Dothan",7710,",,,"') == \
                ['00101800', 'George C Wallace State Community College, Dothan', '7710',',,,']
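
What findall returns depends on how many capturing groups the pattern has, which matters when reusing a column regex. The following sketch shows the three cases on made-up, space-separated input rather than CSV; the patterns and the names token_regex, tokens, values, and pairs are illustrative assumptions.

```python
import re

# No groups: findall returns the text of each whole match.
# Here a token is either a double-quoted phrase or a run of non-spaces.
token_regex = re.compile(r'"[^"]*"|\S+')
tokens = token_regex.findall('say "hello world" twice')
# ['say', '"hello world"', 'twice']

# One group: findall returns just that group for each match.
values = re.findall(r'=(\d+)', 'a=1 b=22')
# ['1', '22']

# Several groups: findall returns one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1 b=22')
# [('a', '1'), ('b', '22')]
```

Note that when a pattern has several groups and only some of them take part in a match, findall fills in the others with empty strings.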

Use your new function in place of split_row_3 so that the following test (and all previous tests) pass.

test_string_4=\
'''OPEID,INSTNM,PCIP52,TUITIONFEE_OUT
00103800,Snead State Community College,0.0811,7830
00573400,H Councill Trenholm State Community College,0.0338,7524
00573300,"Bevill, State, Community College",0.0451,7800
00884300,Alaska Bible College,0,9300
00107100,Arizona Western College,0.0425,9530
00107200,"Cochise County Community College, District",0.0169,6000
'''

table3=[
    ['OPEID', 'INSTNM', 'PCIP52', 'TUITIONFEE_OUT'],
    ['00103800', 'Snead State Community College', '0.0811', '7830'],
    ['00573400', 'H Councill Trenholm State Community College', '0.0338', '7524'],
    ['00573300', 'Bevill, State, Community College', '0.0451', '7800'],
    ['00884300', 'Alaska Bible College', '0', '9300'],
    ['00107100', 'Arizona Western College', '0.0425', '9530'],
    ['00107200', 'Cochise County Community College, District', '0.0169', '6000']]

def test_split_4():
    assert split_csv(test_string_4) == table3
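
As an optional cross-check, the standard library csv module can parse the same kind of input, and its output can be compared against your own split_csv. This is a sketch on a shortened sample taken from the test data above; the names sample and reference are illustrative, and it is not a replacement for the exercise.

```python
import csv

sample = '''OPEID,INSTNM,PCIP52,TUITIONFEE_OUT
00573300,"Bevill, State, Community College",0.0451,7800
00884300,Alaska Bible College,0,9300
'''

# csv.reader accepts any iterable of lines and yields one list of
# strings per row, handling quotes and quoted commas.
reference = list(csv.reader(sample.splitlines()))
```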