The main topics of this worksheet are:
The main references for these topics are:
The most useful files from the sample code repository are:
Write a function bst_copy(T)
that takes a BST
object T
(with integer keys) and returns a copy of it, created from scratch by instantiating a new BST
object and inserting the keys from T
.
import treeutil
import treevis
T = treeutil.random_bst(6)
T2 = bst_copy(T)
print("The original:")
treevis.treeprint(T)
print("The copy:")
treevis.treeprint(T2)
print("Are they the same object? (Expect False.)")
print(T==T2) # Since we didn't define an __eq__ method, this will return False
# unless T and T2 are different names for a single object
Here is a function that takes two strings and returns the set of characters that appear in both strings.
def common_chars(s1,s2):
"""Return a set of all characters that are present in both
strings `s1` and `s2`."""
common = set()
for c1 in s1:
if c1 in s2:
common.add(c1)
return common
It works. Here's a simple example:
common_chars("mathematics","computer science")
However, this function is actually quite slow. Here's an example that generates two strings that each have 50,000
characters, runs common_chars
on them, and prints the total time.
import random
import time
s1 = ''.join([ random.choice(["edfghijklmnopqrstuvwxyzzzzzzzzzzzzzzzz"]) for _ in range(50000) ])
s2 = ''.join([ random.choice(["abcedfghijklmnopqrstuvw"]) for _ in range(50000) ]) + 'z'
t_start = time.time()
both = common_chars(s1,s2)
t_end = time.time()
print("Common characters:")
print(both)
print("\nRunning time: {:.2f} seconds".format(t_end-t_start))
If you try this yourself, you might get a slightly different time, but it will probably take more than 10 seconds.
First, what is going on here? It should be possible to compare millions of characters for equality per second, and there are only 100,000 characters you need to look at, right?
Second, can you fix it? (It is possible to make this function shorter, clearer, and so that it takes less than 0.05 seconds.)
Download the JSON file of data about Nobel laureates from
http://api.nobelprize.org/v1/laureate.json
Write a Python program that reads this file and uses defaultdict
to make and print a histogram of the institutional affiliations that appear in each prize category (e.g. in chemistry, how many nobel laureates have their institutional affiliation as "University of California"? in peace, what institutional affiliation is listed for the largest number of laureates?)
Poking around in the JSON data to figure out where the relevant information is stored will be the first step. I suggest loading it into an object in the REPL and then exploring the list of keys, taking the value associated to one of those keys and listing its keys, etc..
Write a program to merge any number of CSV files, so that each row in any of the input files becomes a row in the output file. All of the input CSV files will have header rows. If the CSV files have the same columns, this is of course easy. But you should also handle the general case, where some columns may exist in multiple files, and others may be unique to a single file. The output file should contain one column for each distinct column name that appears in any of the input files.
Arrange it so your program csvmerge.py
accepts all the input filenames as command line arguments. The last command line argument is the name of the output file that should be created.
For example, you might use a command like
python3 csvmerge.py a.csv b.csv c.csv out.csv
with a.csv
containing:
name,age,favorite
Melissa,52,vanilla
Jonah,24,strawberry
and b.csv
containing:
name,major
Josefina,falconry
David,phrenology
and c.csv
containing:
age,major
5,bubbles
11,chess
In which case the program should create out.csv
containing:
name,age,favorite,major
Melissa,52,vanilla,
Jonah,24,strawberry,
Josefina,,,falconry
David,,,phrenology
,5,,bubbles
,11,,chess
Download and unzip this CSV file listing facilities (mostly post offices) owned by the USPS in the 50 states:
Take a look at the file to get a sense of how it is structured. (You may need to look back at it later to determine which column headings are relevant to the tasks below.)
Now, write a program that uses the csv
module to read this file and process it into a hierarchy of summary data that is written to a JSON file (using the json
module). The output should have the following hierarchy:
FDB Facility Type (All)
) equal to Post Office
.For example, the output file might begin
{ "IL": { "Central Illinois": { "total facilities": 165, "post offices": 144 }, ...
What words appear in both War and Peace by Leo Tolstoy and The Metamorphosis by Franz Kafka?
Answer this for the english translations of these novels available as UTF-8 plain text from Project Gutenberg:
(You should look at the files in a browser to see the format; there is a header and a footer, with the actual text in between.)
Counting words is tricky, and there are a lot of edge cases, so let's impose these rules:
c
is considered alphabetic if c.isalpha()
returns True
. The method isalpha
of class str
is documented hereThese files contain accented characters and various exotic punctuation (e.g. proper start and end quotation marks, not just the symmetric ones on your keyboard). As a result, to find words and ignore punctuation it will be easiest to use a positive test (.isalpha
decides what is a letter) instead of a negative one (e.g. a hard-coded list of punctuation characters to convert to spaces before splitting).
Having finished the last exercise of the worksheet, you can now wake from these troubled dreams to find yourself transformed into a more skilled programmer.
This is an elaboration of a comment from lecture. It doesn't ask you to do much except experiment with an interesting construction. I learned of this from Harold Cooper.
The following one-line function lets you create hierarchies of dictionaries in Python using defaultdict:
# Adapted from https://gist.github.com/hrldcpr/2012250 by Harold Cooper
from collections import defaultdict
def autohierarchy():
return defaultdict(autohierarchy)
Here's an example of how you can use it. Basically, you just attempt to access keys to create them! You can even access a key of a nested dictionary, and all the surrounding dictionaries will be created.
tasks = autohierarchy()
print("Adding stuff")
tasks["home"]["maintenance"]["demagnetize rain gutters"]
tasks["home"]["maintenance"]["replace missing front door"]
tasks["home"]["cleaning"]["load dishwasher"]
tasks["home"]["cleaning"]["run dishwasher"]
tasks["home"]["cleaning"]["empty dishwasher"]
tasks["school"]["mcs 275"]["project 3"]
print("Ok, added the stuff")
Now if you print one of these objects, it looks ugly. But if you convert it to JSON, it looks nice.
import json
print("Ugly, but has all the stuff we intended:")
print(tasks)
print()
print("Nicer:")
print(json.dumps(tasks, indent=4))
Can you figure out how and why this works?
How might you write a function to convert one of these autohierarchy
objects to a collection of nested dict
s so that they print nicely without using the json
module?