A document from MCS 275 Spring 2021, instructor David Dumas. You can also get the notebook file.

Worksheet 9

MCS 275 Spring 2021 - David Dumas

Topics

The main topics of this worksheet are:

  • Tree traversals
  • Set and defaultdict
  • CSV and JSON

The main references for these topics are:

The most useful files from the sample code repository are:

Instructions

  • Problem 1 is handled differently than the others:
    • Tuesday discussion students: Problem 1 will be presented as an example at the start of discussion
    • Thursday discussion students: Please complete Problem 1 before discussion and bring your solution
  • For the other problems:
    • Work on these problems in discussion.

1. BST copier

Write a function bst_copy(T) that takes a BST object T (with integer keys) and returns a copy of it, created from scratch by instantiating a new BST object and inserting the keys from T.

In [58]:
import treeutil
import treevis

T = treeutil.random_bst(6)
T2 = bst_copy(T)
print("The original:")
treevis.treeprint(T)
print("The copy:")
treevis.treeprint(T2)
print("Are they the same object? (Expect False.)")
print(T==T2)  # Since we didn't define an __eq__ method, this will return False
              # unless T and T2 are different names for a single object
The original:
          <6>          

    <5>         <14>   

 <4>         <10>  <16>

The copy:
          <6>          

    <5>         <14>   

 <4>         <10>  <16>

Are they the same object? (Expect False.)
False

2. Accidentally quadratic

Here is a function that takes two strings and returns the set of characters that appear in both strings.

In [118]:
def common_chars(s1,s2):
    """Return a set of all characters that are present in both
    strings `s1` and `s2`."""
    common = set()
    for c1 in s1:
        if c1 in s2:
            common.add(c1)
    return common

It works. Here's a simple example:

In [49]:
common_chars("mathematics","computer science")
Out[49]:
{'c', 'e', 'i', 'm', 's', 't'}

However, this function is actually quite slow. Here's an example that generates two strings that each have 50,000 characters, runs common_chars on them, and prints the total time.

In [50]:
import random
import time

s1 = ''.join([ random.choice(["edfghijklmnopqrstuvwxyzzzzzzzzzzzzzzzz"]) for _ in range(50000) ])
s2 = ''.join([ random.choice(["abcedfghijklmnopqrstuvw"]) for _ in range(50000) ]) + 'z'

t_start = time.time()
both = common_chars(s1,s2)
t_end = time.time()

print("Common characters:")
print(both)
print("\nRunning time: {:.2f} seconds".format(t_end-t_start))
Common characters:
{'w', 'n', 't', 'f', 'r', 'm', 'z', 'q', 'p', 'e', 'i', 'u', 'l', 'o', 'k', 'd', 'g', 'h', 'v', 's', 'j'}

Running time: 15.94 seconds

If you try this yourself, you might get a slightly different time, but it will probably take more than 10 seconds.

First, what is going on here? It should be possible to compare millions of characters for equality per second, and there are only 100,000 characters you need to look at, right?

Second, can you fix it? (It is possible to make this function shorter, clearer, and so that it takes less than 0.05 seconds.)

3. Nobel prize winners data

Download the JSON file of data about Nobel laureates from

http://api.nobelprize.org/v1/laureate.json

Write a Python program that reads this file and uses defaultdict to make and print a histogram of the institutional affiliations that appear in each prize category (e.g. in chemistry, how many nobel laureates have their institutional affiliation as "University of California"? in peace, what institutional affiliation is listed for the largest number of laureates?)

Poking around in the JSON data to figure out where the relevant information is stored will be the first step. I suggest loading it into an object in the REPL and then exploring the list of keys, taking the value associated to one of those keys and listing its keys, etc..

4. CSV merge

Write a program to merge any number of CSV files, so that each row in any of the input files becomes a row in the output file. All of the input CSV files will have header rows. If the CSV files have the same columns, this is of course easy. But you should also handle the general case, where some columns may exist in multiple files, and others may be unique to a single file. The output file should contain one column for each distinct column name that appears in any of the input files.

Arrange it so your program csvmerge.py accepts all the input filenames as command line arguments. The last command line argument is the name of the output file that should be created.

For example, you might use a command like

python3 csvmerge.py a.csv b.csv c.csv out.csv

with a.csv containing:

name,age,favorite
Melissa,52,vanilla
Jonah,24,strawberry

and b.csv containing:

name,major
Josefina,falconry
David,phrenology

and c.csv containing:

age,major
5,bubbles
11,chess

In which case the program should create out.csv containing:

name,age,favorite,major
Melissa,52,vanilla,
Jonah,24,strawberry,
Josefina,,,falconry
David,,,phrenology
,5,,bubbles
,11,,chess

5. USPS facilities data conversion

Download and unzip this CSV file listing facilities (mostly post offices) owned by the USPS in the 50 states:

Take a look at the file to get a sense of how it is structured. (You may need to look back at it later to determine which column headings are relevant to the tasks below.)

Now, write a program that uses the csv module to read this file and process it into a hierarchy of summary data that is written to a JSON file (using the json module). The output should have the following hierarchy:

  • At the top level, there is one key for each 2-letter state abbreviation (e.g. "IL")
    • The value associated to a state abbreviation is an object whose keys are the names of postal districts that have facilities in that state (e.g. "Central Illinois")
      • The value associated to a postal district is an object whose keys and values are as follows:
        • key "total facilities" with value the number of facilities in that state and postal district
        • key "post offices" with value the number of facilities in that state and postal district that have facility type (column FDB Facility Type (All)) equal to Post Office.

For example, the output file might begin

{ "IL": { "Central Illinois": { "total facilities": 165, "post offices": 144 }, ...

6. Kafkaesque and Tolstoyesque

What words appear in both War and Peace by Leo Tolstoy and The Metamorphosis by Franz Kafka?

Answer this for the english translations of these novels available as UTF-8 plain text from Project Gutenberg:

(You should look at the files in a browser to see the format; there is a header and a footer, with the actual text in between.)

Counting words is tricky, and there are a lot of edge cases, so let's impose these rules:

  • A word means a maximal block of consecutive characters that are alphabetic
    • A character c is considered alphabetic if c.isalpha() returns True. The method isalpha of class str is documented here
  • Words are considered the same if they differ only in capitalization
  • Chapter headings and roman numerals count as words (to save you the trouble of devising code to recognize and exclude these)

These files contain accented characters and various exotic punctuation (e.g. proper start and end quotation marks, not just the symmetric ones on your keyboard). As a result, to find words and ignore punctuation it will be easiest to use a positive test (.isalpha decides what is a letter) instead of a negative one (e.g. a hard-coded list of punctuation characters to convert to spaces before splitting).

Having finished the last exercise of the worksheet, you can now wake from these troubled dreams to find yourself transformed into a more skilled programmer.

Extra for self study: One line tree

This is an elaboration of a comment from lecture. It doesn't ask you to do much except experiment with an interesting construction. I learned of this from Harold Cooper.

The following one-line function lets you create hierarchies of dictionaries in Python using defaultdict:

In [124]:
# Adapted from https://gist.github.com/hrldcpr/2012250 by Harold Cooper
from collections import defaultdict

def autohierarchy():
    return defaultdict(autohierarchy)

Here's an example of how you can use it. Basically, you just attempt to access keys to create them! You can even access a key of a nested dictionary, and all the surrounding dictionaries will be created.

In [120]:
tasks = autohierarchy()
print("Adding stuff")
tasks["home"]["maintenance"]["demagnetize rain gutters"]
tasks["home"]["maintenance"]["replace missing front door"]
tasks["home"]["cleaning"]["load dishwasher"]
tasks["home"]["cleaning"]["run dishwasher"]
tasks["home"]["cleaning"]["empty dishwasher"]
tasks["school"]["mcs 275"]["project 3"]
print("Ok, added the stuff")
Adding stuff
Ok, added the stuff

Now if you print one of these objects, it looks ugly. But if you convert it to JSON, it looks nice.

In [121]:
import json

print("Ugly, but has all the stuff we intended:")
print(tasks)
print()
print("Nicer:")
print(json.dumps(tasks, indent=4))
Ugly, but has all the stuff we intended:
defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'home': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'maintenance': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'demagnetize rain gutters': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'replace missing front door': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})}), 'cleaning': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'load dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'run dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {}), 'empty dishwasher': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})})}), 'school': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'mcs 275': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {'project 3': defaultdict(<function autohierarchy at 0x7f7edc493e50>, {})})})})

Nicer:
{
    "home": {
        "maintenance": {
            "demagnetize rain gutters": {},
            "replace missing front door": {}
        },
        "cleaning": {
            "load dishwasher": {},
            "run dishwasher": {},
            "empty dishwasher": {}
        }
    },
    "school": {
        "mcs 275": {
            "project 3": {}
        }
    }
}

Can you figure out how and why this works?

How might you write a function to convert one of these autohierarchy objects to a collection of nested dicts so that they print nicely without using the json module?

Revision history

  • 2021-03-09 - Initial publication