A document from MCS 275 Spring 2022, instructor Emily Dumas. You can also get the notebook file.

MCS 275 Spring 2022 Worksheet 9 Solutions

  • Course instructor: Emily Dumas
  • Solutions prepared by: Jennifer Vaccaro, Johnny Joyce

Topics

The main topics of this worksheet are:

  • Tree traversals
  • Set and defaultdict
  • CSV and JSON

Resources

These things might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.

The most useful files from the sample code repository are:

1. Accidentally quadratic

Here is a function that takes two strings and returns the set of characters that appear in both strings.

In [118]:
def common_chars(s1,s2):
    """Return a set of all characters that are present in both
    strings `s1` and `s2`."""
    common = set()
    for c1 in s1:
        if c1 in s2:
            common.add(c1)
    return common

It works. Here's a simple example:

In [49]:
common_chars("mathematics","computer science")
Out[49]:
{'c', 'e', 'i', 'm', 's', 't'}

However, this function is actually needlessly slow. Here's an example that generates two strings that each have 50,000 characters, runs common_chars on them, and prints the total time.

In [50]:
import random
import time

s1 = ''.join([ random.choice(["edfghijklmnopqrstuvwxyzzzzzzzzzzzzzzzz"]) for _ in range(50000) ])
s2 = ''.join([ random.choice(["abcedfghijklmnopqrstuvw"]) for _ in range(50000) ]) + 'z'

t_start = time.time()
both = common_chars(s1,s2)
t_end = time.time()

print("Common characters:")
print(both)
print("\nRunning time: {:.2f} seconds".format(t_end-t_start))
Common characters:
{'w', 'n', 't', 'f', 'r', 'm', 'z', 'q', 'p', 'e', 'i', 'u', 'l', 'o', 'k', 'd', 'g', 'h', 'v', 's', 'j'}

Running time: 15.94 seconds

If you try this yourself, you might get a slightly different time, but it will probably take more than 10 seconds.

First, what is going on here? It should be possible to compare millions of characters for equality per second, and there are only 100,000 characters you need to look at, right?

Second, can you fix it? (It is possible to make this function shorter, clearer, and so that it takes less than 0.05 seconds.)

Solution

The problem with this function is that it has an implicit nested for loop that performs len(s1)*len(s2) equality checks. The expression

c1 in s2

is equivalent to the return value of this function:

def c1_in_s2():
    for c2 in s2:
        if c1==c2:
            return True
    return False

In the worst case, this function performs len(s2) checks, and it runs for each character of s1. If s1 and s2 are each of length 50,000, then this becomes 2,500,000,000 equality checks.

However, most strings don't have that many distinct characters, so it would be faster to:

  • Find all of the distinct characters in s1 (and make a set out of them)
  • Find all of the distinct characters in s2 (and make a set out of them)
  • Check which characters lie in both of these sets

The time it takes to do this would be roughly proportional to len(s1) + len(s2) + n1*n2 where n1 is the number of distinct characters in s1 and n2 is the number of distinct characters in s2. In most cases n1 and n2 will be bounded by a fixed constant (like 26, if the strings only contain lower case letters of the alphabet), so the main contribution to the running time is proportional to the lengths of s1 and s2 individually, rather than their product.

Here is an alternative common_chars function that uses this strategy:

In [1]:
def common_chars(s1,s2):
    """Return a set of all characters that are present in both
    strings `s1` and `s2`."""
    # By first turning s1 and s2 into sets, we have fewer characters to compare.
    # Then we can return the intersection
    return set(s1) & set(s2)

# Another syntax option would be 'return set(s1).intersection(s2)'

Here is a timing study, showing it handles strings of length 50,000 with ease:

In [2]:
import random
import time

s1 = ''.join([ random.choice(["edfghijklmnopqrstuvwxyzzzzzzzzzzzzzzzz"]) for _ in range(50000) ])
s2 = ''.join([ random.choice(["abcedfghijklmnopqrstuvw"]) for _ in range(50000) ]) + 'z'

t_start = time.time()
both = common_chars(s1,s2)
t_end = time.time()

print("Common characters:")
print(both)
print("\nRunning time: {:.2f} seconds".format(t_end-t_start))
Common characters:
{'z', 'k', 'e', 'r', 'n', 'd', 'l', 'q', 's', 'f', 't', 'p', 'o', 'g', 'j', 'm', 'i', 'h', 'u', 'w', 'v'}

Running time: 0.02 seconds

2. Save and load tree

Recall that we have two classes for building trees:

  • Node, for generic binary trees
  • BST, a subclass for binary search trees

Consider the following binary tree, made of Node objects:

          [True] 
          /    \
   ["Alice"]  [False]
    /     \
 [5.8]   [5.5]

If we wanted to save the tree to a file, JSON would be a natural choice as it allows nesting of data structures. The tree above might be saved as:

{
    "class": "Node",
    "tree": {
        "key": true,
        "left": {
            "key": "Alice",
            "left": {
                "key": 5.8,
                "left": null,
                "right": null
            },
            "right": {
                "key": 5.5,
                "left": null,
                "right": null
            }
        },
        "right": {
            "key": false,
            "left": null,
            "right": null
        }
    }
}

Here, the top-level JSON object has just two keys: "class" indicates what kind of tree it is (BST or Node), and "tree" maps to a hierarchy of objects that represent the nodes of the tree.

The same general approach could be applied to binary search trees, too. This BST:

          [6] 

    [5]         [14]

 [4]         [10]  [16]

Could be saved as:

{
    "class": "BST",
    "tree": {
        "key": 6,
        "left": {
            "key": 5,
            "left": {
                "key": 4,
                "left": null,
                "right": null
            },
            "right": null
        },
        "right": {
            "key": 14,
            "left": {
                "key": 10,
                "left": null,
                "right": null
            },
            "right": {
                "key": 16,
                "left": null,
                "right": null
            }
        }
    }
}

Add a method save(fp) to class Node in trees.py that will save a tree as a JSON object, writing it to an open file object fp.

Then, add a function to the module trees.py called load(fp) that takes an open file object, reads a JSON object from it, and then builds and returns the corresponding tree. The return type should be either Node or BST depending on what is found in the JSON file.

Suggestions / hints:

  • At first, ignore the BST case and build a version that works for Node objects
  • A key step is converting a tree to a collection of nested dictionaries. I suggest making a separate method of Node called as_dict_tree() that handles this naturally recursive operation. Then you can handle the top-level object creation (with "class" and "tree" keys) in the method save itself.
  • A dictionary like {"Node": Node, "BST": BST} can be used to map names of classes to actual classes. That allows you to build a Node or BST object depending on a string, e.g.
In [ ]:
node_classes = {"Node": Node, "BST": BST}

k = "Node"
A = node_classes[k](key=5)
# Now A is a Node object

k = "BST"
B = node_classes[k](key=11)
# Now B is a BST object

When you're done, the following code should build a BST, save it to a file, and load it back again.

In [ ]:
from trees import load, BST
from treevis import treeprint

T = BST()
T.insert(8)
T.insert(12)
T.insert(2)
T.insert(3)

with open("tree.json","w",encoding="UTF-8") as fp:
    T.save(fp)
    
with open("tree.json","r",encoding="UTF-8") as fp:
    W = load(fp)
    
print("Tree that was saved:")
treevis.treeprint(T)

print("Tree that was loaded:")
treevis.treeprint(W)

Solution

Inside class Node: (note that this implicitly defines the function for class BST too)

In [ ]:
def as_dict_tree(self):
        '''Returns representation of tree as a dictionary object so it can be saved'''
        
        # Initialize left and right keys as None so they can be overwritten if needed
        D = {"key": self.key, "left": None, "right": None}

        if self.left != None:
            D["left"] = self.left.as_dict_tree()

        if self.right != None:
            D["right"] = self.right.as_dict_tree()

        return D

Also inside class Node (be sure to import json):

In [ ]:
def save(self, fp):
        '''Saves dictionary representation of tree as JSON file as `fp`.'''
        tree = self.as_dict_tree()
        json.dump({"class": self.__class__.__name__, "tree": tree}, file)

In another file, define the load function (along with a recursive helper function called dict_tree_to_nodes to turn dictionaries into Nodes or BSTs)

In [ ]:
def dict_tree_to_nodes(D, node_type):
    '''Converts a dict object into a Node or BST.
    Argument `node_type' can be either `Node` or `BST` 
    (acting as a reference to the class itself, not a string).'''

    if D["left"] != None: # Recursive call on left side if needed
        left_node = dict_tree_to_nodes(D["left"], node_type)
    else:
        left_node = None
    
    if D["right"] != None: # Recursive call on right side if needed
        right_node = dict_tree_to_nodes(D["right"], node_type)
    else:
        right_node = None

    # Call the `node_type` class to initialize a node with left and right children
    return node_type(D["key"], left = left_node, right = right_node)


def load(fp):
    '''Loads a tree from a JSON file `fp`'''
    T = json.load(fp)
    if T["class"] == "Node":
        node_type = trees.Node
    else:
        node_type = trees.BST
    return dict_tree_to_nodes(T["tree"], node_type)

Running the example test code, we see that the tree is preserved after saving and loading (this block of code may need some small modifications depending on how you organized your .py files)

In [5]:
import trees
import treevis
import json

T = trees.BST(8)
T.insert(12)
T.insert(2)
T.insert(3)

with open("tree.json","w",encoding="UTF-8") as fp:
    T.save(fp)
    
with open("tree.json","r",encoding="UTF-8") as fp:
    W = load(fp)
    
print("Tree that was saved:")
treevis.treeprint(T)

print("Tree that was loaded:")
treevis.treeprint(W)
Tree that was saved:
      [8]

  [2]     [12]

 .  [3]  .   .

Tree that was loaded:
      [8]

  [2]     [12]

 .  [3]  .   .

3. CSV merge

Write a program to merge any number of CSV files, so that each row in any of the input files becomes a row in the output file. All of the input CSV files will have header rows. If the CSV files have the same columns, this is of course easy. But you should also handle the general case, where some columns may exist in multiple files, and others may be unique to a single file. The output file should contain one column for each distinct column name that appears in any of the input files.

Arrange it so your program csvmerge.py accepts all the input filenames as command line arguments. The last command line argument is the name of the output file that should be created.

For example, you might use a command like

python3 csvmerge.py a.csv b.csv c.csv out.csv

with a.csv containing:

name,age,favorite
Melissa,52,vanilla
Jonah,24,strawberry

and b.csv containing:

name,major
Josefina,falconry
David,phrenology

and c.csv containing:

age,major
5,bubbles
11,chess

In which case the program should create out.csv containing:

name,age,favorite,major
Melissa,52,vanilla,
Jonah,24,strawberry,
Josefina,,,falconry
David,,,phrenology
,5,,bubbles
,11,,chess

Solution

In [ ]:
import csv
import sys

# This code keeps me from accidentally emptying out this text file, 
# needs at least 2 command line args
if len(sys.argv) < 3:
    raise Exception("Usage: {} INPUTFILES OUTPUTFILE".format(sys.argv[0]))

# Iterate through the input files, adding the column names to a set.
# At this point we don't look at any of the input data---only the header row.
columns = set()
for csv_fn in sys.argv[1:-1]:
    with open(csv_fn,newline="") as csv_file:
        reader = csv.DictReader(csv_file)
        for col in reader.fieldnames:
            columns.add(col)
            
# Note: because we use a set to store the column names, the columns may appear in a different 
# order in the output file than they do in any input file. If order were important, we'd need
# to use a data structure other than a set to store the column names.

# Now, open the output file, and write the data from each input file into the output file.
# Use the combined input fieldnames. This is the second time each input file is opened.
out_fn = sys.argv[-1]
with open(out_fn,"w") as out_file:
    writer = csv.DictWriter(out_file, fieldnames=list(columns))  # fieldnames must be a list
    writer.writeheader()
    # Iterate through our input files again, and write each row of them to the output file.
    for csv_fn in sys.argv[1:-1]:
        with open(csv_fn,newline="") as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                writer.writerow(row)  # row may not have all the keys that `writer` expects
                                      # but that's ok; DictWriter allows missing keys and
                                      # fills in an empty output field for them.

Extra for self study: Two line tree

This is an elaboration of a comment from lecture. It doesn't ask you to do much except experiment with an interesting construction.

The following one-line function lets you create hierarchies of dictionaries in Python using defaultdict:

In [124]:
# Adapted from https://gist.github.com/hrldcpr/2012250 by Harold Cooper
from collections import defaultdict

def autohierarchy():
    return defaultdict(autohierarchy)

Here's an example of how you can use it. Basically, you just attempt to access keys to create them! You can even access a key of a nested dictionary, and all the surrounding dictionaries will be created.

In [120]:
tasks = autohierarchy()
print("Adding stuff")
tasks["home"]["maintenance"]["demagnetize rain gutters"]
tasks["home"]["maintenance"]["replace missing front door"]
tasks["home"]["cleaning"]["load dishwasher"]
tasks["home"]["cleaning"]["run dishwasher"]
tasks["home"]["cleaning"]["empty dishwasher"]
tasks["school"]["mcs 275"]["project 3"]
print("Ok, added the stuff")
Adding stuff
Ok, added the stuff

Now if you print one of these objects, it looks ugly. But if you convert it to JSON, it looks nice.

In [121]:
import json

print("Ugly, but has all the stuff we intended:")
print(tasks)
print()
print("Nicer:")
print(json.dumps(tasks, indent=4))
Ugly, but has all the stuff we intended:
defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'home': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'maintenance': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'demagnetize rain gutters': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'replace missing front door': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})}), 'cleaning': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'load dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'run dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'empty dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})})}), 'school': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'mcs 275': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'project 3': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})})})})

Nicer:
{
    "home": {
        "maintenance": {
            "demagnetize rain gutters": {},
            "replace missing front door": {}
        },
        "cleaning": {
            "load dishwasher": {},
            "run dishwasher": {},
            "empty dishwasher": {}
        }
    },
    "school": {
        "mcs 275": {
            "project 3": {}
        }
    }
}

Can you figure out how and why this works?