Write a function bst_copy(T) that takes a BST object T (with integer keys) and returns a copy of it, created from scratch by instantiating a new BST object and inserting the keys from T.
# MCS 275 Worksheet 9 Problem 1
# Jennifer Vaccaro
# I wrote this solution in collaboration with the Tuesday discussion section (thanks!)
import bst
import binary
def copy_subtree(T, x):
    """If x is not None, copy the key from x into a new node and add it
    to the tree T.  Then recurse, copying the subtrees descending from
    x's children."""
    if x is None:
        return
    # Create a new node with the same key as x.
    new_x = binary.Node(x.key)
    # Inserting the new node into T assigns the parent/children
    T.insert(new_x)
    # Then copy the subtrees from x.left and x.right
    copy_subtree(T, x.left)
    copy_subtree(T, x.right)
def bst_copy(T):
    """Returns a new tree with the same attributes/nodes as T"""
    # Create a new empty tree
    new_T = bst.BST()
    # Call the recursive helper function to copy each of the nodes
    copy_subtree(new_T, T.root)
    return new_T
import treeutil
import treevis
T = treeutil.random_bst(6)
T2 = bst_copy(T)
print("The original:")
treevis.treeprint(T)
print("The copy:")
treevis.treeprint(T2)
print("Are they the same object? (Expect False.)")
print(T==T2) # Since we didn't define an __eq__ method, this will return False
# unless T and T2 are different names for a single object
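An aside on efficiency: bst_copy rebuilds the copy by repeated insertion, so every node costs a walk down from the root (roughly proportional to n log n steps for a balanced tree with n nodes). A direct structural copy visits each node exactly once. Here is a minimal sketch of that alternative, assuming binary.Node exposes key, left, and right attributes as the solution above does; if the Node class also tracks parent links, those would need to be set as well:

def clone_subtree(x):
    """Return a new subtree mirroring the one rooted at x."""
    if x is None:
        return None
    new_x = binary.Node(x.key)            # same Node constructor as above
    new_x.left = clone_subtree(x.left)    # copy the left subtree
    new_x.right = clone_subtree(x.right)  # copy the right subtree
    return new_x

def bst_copy_direct(T):
    """Copy T without calling insert; visits each node once."""
    new_T = bst.BST()
    new_T.root = clone_subtree(T.root)
    return new_T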
Here is a function that takes two strings and returns the set of characters that appear in both strings.
def common_chars(s1,s2):
    """Return a set of all characters that are present in both
    strings `s1` and `s2`."""
    common = set()
    for c1 in s1:
        if c1 in s2:
            common.add(c1)
    return common
The problem with this function is that it has an implicit nested for loop that performs up to len(s1)*len(s2) equality checks. The expression c1 in s2 is equivalent to the return value of this function:
def c1_in_s2():
    for c2 in s2:
        if c1 == c2:
            return True
    return False
In the worst case, this function performs len(s2) checks, and it runs once for each character of s1. If s1 and s2 are each of length 50,000, then this becomes 2,500,000,000 equality checks.
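To make the count concrete, here is a small instrumented version (the function name and test strings are made up for illustration) that tallies the equality checks in a worst case where no characters match:

def common_chars_counting(s1, s2):
    """Naive common_chars, instrumented to count equality checks."""
    checks = 0
    common = set()
    for c1 in s1:
        for c2 in s2:
            checks += 1
            if c1 == c2:
                common.add(c1)
                break  # mirrors the early return in c1_in_s2
    return common, checks

_, n = common_chars_counting("ab"*100, "cd"*100)
print(n)  # no matches, so this prints 200*200 = 40000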
However, most strings don't have that many distinct characters, so it would be faster to:
- find the distinct characters in s1 (and make a set out of them)
- find the distinct characters in s2 (and make a set out of them)
- compare those two sets
The time it takes to do this would be roughly proportional to len(s1) + len(s2) + n1*n2, where n1 is the number of distinct characters in s1 and n2 is the number of distinct characters in s2. In most cases n1 and n2 will be bounded by a fixed constant (like 26, if the strings only contain lower case letters of the alphabet), so the main contribution to the running time is proportional to the lengths of s1 and s2 individually, rather than their product.
Here is an alternative common_chars function that uses this strategy:
# MCS 275 Worksheet 9 Problem 2
# Jennifer Vaccaro
# I wrote this solution myself, in accordance with the syllabus
def common_chars(s1,s2):
    """Return a set of all characters that are present in both
    strings `s1` and `s2`."""
    # By first turning s1 and s2 into sets, we have fewer characters
    # to compare.  Then we can return the intersection.
    return set(s1) & set(s2)

# Another syntax option would be 'return set(s1).intersection(s2)'
It works. Here's a simple example:
common_chars("mathematics","computer science")
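The value returned is {'c', 'e', 'i', 'm', 's', 't'} (sets are unordered, so the display order may vary).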
Here is a timing study, showing it handles strings of length 50,000 with ease:
import random
import time
# random.choice of a string selects a single random character from it
s1 = ''.join([ random.choice("edfghijklmnopqrstuvwxyzzzzzzzzzzzzzzzz") for _ in range(50000) ])
s2 = ''.join([ random.choice("abcedfghijklmnopqrstuvw") for _ in range(50000) ]) + 'z'
t_start = time.time()
both = common_chars(s1,s2)
t_end = time.time()
print("Common characters:")
print(both)
print("\nRunning time: {:.2f} seconds".format(t_end-t_start))
Download the JSON file of data about Nobel laureates from
http://api.nobelprize.org/v1/laureate.json
Write a Python program that reads this file and uses defaultdict to make and print a histogram of the institutional affiliations that appear in each prize category (e.g. in chemistry, how many Nobel laureates have their institutional affiliation as "University of California"? in peace, which institutional affiliation is listed for the largest number of laureates?).
Poking around in the JSON data to figure out where the relevant information is stored will be the first step. I suggest loading it into an object in the REPL and then exploring the list of keys, taking the value associated to one of those keys and listing its keys, etc.
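For example, a REPL session along these lines maps out the structure (outputs abbreviated; the key names shown below are the ones the solution relies on):

>>> import json
>>> with open("laureate.json") as f:
...     data = json.load(f)
...
>>> list(data.keys())
['laureates']
>>> data["laureates"][0].keys()        # one laureate record
dict_keys([...])                       # includes 'prizes', among others
>>> data["laureates"][0]["prizes"][0].keys()
dict_keys([...])                       # includes 'category' and 'affiliations'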
# MCS 275 Worksheet 9 Problem 3
# Jennifer Vaccaro and Emily Dumas
"""Creates a histogram of the schools affiliated with the Nobel Prize in Chemistry"""
import json
from collections import defaultdict
# Create an empty defaultdict with default value zero
# (which is the value returned by function int)
hist = defaultdict(int)
# Open the file object. This syntax will automatically
# close the file for you once the indentation block ends
with open("laureate.json","r") as f:
    data = json.load(f)

# Iterate through the laureates
for l in data["laureates"]:
    # Iterate through the prizes
    for p in l["prizes"]:
        if p["category"] != "chemistry":
            # Skip if the prize category is not chemistry
            continue
        # Iterate through the affiliations
        for a in p["affiliations"]:
            # Skip if the affiliation is not a dictionary
            if not isinstance(a,dict):
                continue
            # Add the school/institution name to the histogram.
            # Because we are using defaultdict, if we attempt to access
            # a key that doesn't exist, it will be created and given the
            # value zero.
            hist[a["name"]] += 1
# Now, hist maps institution names to prize counts in chemistry
# We could just print the whole thing, but we can get a list in descending order
# using dict.items() to get key-value pairs and then sorting by value.
print("#PRIZES INSTITUTION")
for institution, count in sorted(hist.items(),key=lambda x:-x[1]):
    print("{:>7d} {}".format(count,institution))
# the argument key=lambda x:-x[1] in the call to `sorted` means sort by
# second element of the tuple, in descending order. The tuples are
# (institution,prize_count) so this means the most prizes appear first.
Write a program to merge any number of CSV files, so that each row in any of the input files becomes a row in the output file. All of the input CSV files will have header rows. If the CSV files have the same columns, this is of course easy. But you should also handle the general case, where some columns may exist in multiple files, and others may be unique to a single file. The output file should contain one column for each distinct column name that appears in any of the input files.
Arrange it so that your program csvmerge.py accepts all the input filenames as command line arguments. The last command line argument is the name of the output file that should be created.
For example, you might use a command like
python3 csvmerge.py a.csv b.csv c.csv out.csv
with a.csv containing:
name,age,favorite
Melissa,52,vanilla
Jonah,24,strawberry
and b.csv containing:
name,major
Josefina,falconry
David,phrenology
and c.csv containing:
age,major
5,bubbles
11,chess
In which case the program should create out.csv containing:
name,age,favorite,major
Melissa,52,vanilla,
Jonah,24,strawberry,
Josefina,,,falconry
David,,,phrenology
,5,,bubbles
,11,,chess
# MCS 275 Worksheet 9 Problem 4
# Jennifer Vaccaro and Emily Dumas
"""csvmerge reads a set of input CSV files and writes their combined
data into a single output CSV file"""
import csv
import sys
# Require at least two command line arguments (at least one input file
# plus the output file); otherwise we might clobber a file by mistake.
if len(sys.argv) < 3:
    raise Exception("Usage: {} INPUTFILES OUTPUTFILE".format(sys.argv[0]))
# Iterate through the input files, adding the column names to a set
# At this point we don't look at any of the input data---only the
# header row.
columns = set()
for csv_fn in sys.argv[1:-1]:
    with open(csv_fn,newline="") as csv_file:
        reader = csv.DictReader(csv_file)
        for col in reader.fieldnames:
            columns.add(col)
# Note: because we use a set to store the column names, the columns
# may appear in a different order in the output file than they do in
# any input file. If order were important, we'd need to use a data
# structure other than a set to store the column names (see the
# sketch after this program).
# Now, open the output file and write the data from each input file into
# it, using the combined fieldnames. This is the second time each input
# file is opened.
out_fn = sys.argv[-1]
with open(out_fn,"w",newline="") as out_file:
    writer = csv.DictWriter(out_file, fieldnames=list(columns))  # fieldnames must be a list
    writer.writeheader()
    # Iterate through the input files again, and write each of their
    # rows to the output file.
    for csv_fn in sys.argv[1:-1]:
        with open(csv_fn,newline="") as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                # row may not have all the keys that `writer` expects,
                # but that's OK; DictWriter fills in an empty output
                # field for any missing key.
                writer.writerow(row)
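If preserving the first-seen column order mattered, one option (a sketch, not part of the worksheet solution) is to use a dict as an ordered set, since dictionaries preserve insertion order in Python 3.7+. This would replace the column-collecting loop above:

columns = {}  # a dict used as an insertion-ordered set
for csv_fn in sys.argv[1:-1]:
    with open(csv_fn, newline="") as csv_file:
        for col in csv.DictReader(csv_file).fieldnames:
            columns.setdefault(col)  # adds the key only if it is new
fieldnames = list(columns)  # column names in first-seen order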
Download and unzip this CSV file listing facilities (mostly post offices) owned by the USPS in the 50 states:
Take a look at the file to get a sense of how it is structured. (You may need to look back at it later to determine which column headings are relevant to the tasks below.)
Now, write a program that uses the csv module to read this file and process it into a hierarchy of summary data that is written to a JSON file (using the json module). The output should have the following hierarchy:
- At the top level, keys are state abbreviations.
- Within each state, keys are district names.
- Within each district, "total facilities" is the number of facilities in that district, and "post offices" is the number of those whose facility type (the column FDB Facility Type (All)) is equal to Post Office.
For example, the output file might begin
{ "IL": { "Central Illinois": { "total facilities": 165, "post offices": 144 }, ...
# MCS 275 Worksheet 9 Problem 5
# Emily Dumas
"""Read CSV of USPS facilities and compile summary statistics
by state and district in JSON output file."""
import json
import csv
import collections
infn = "usps_facilities.csv"
outfn = "usps_facility_summary.json"
def district_summary_factory():
    """Default value for a district's summary data"""
    return { "total facilities":0, "post offices": 0 }

def state_summary_factory():
    """Default value for a state's summary data"""
    return collections.defaultdict(district_summary_factory)
summary = collections.defaultdict(state_summary_factory)
# After all of this setup, we now have an object for storing
# the summary data where we never need to create keys. A line
# like
# summary["IL"]["Central Illinois"]["post offices"] += 1
# will create everything needed and increment the count.
with open(infn,newline="") as infile:
    rdr = csv.DictReader(infile)
    for facility in rdr:
        # Put key elements of the facility data
        # into conveniently named variables
        state = facility["ST"]
        district = facility["District"]
        is_po = facility["FDB Facility Type (All)"]=="Post Office"
        # Record what we want to know about this facility
        summary[state][district]["total facilities"] += 1
        if is_po:
            summary[state][district]["post offices"] += 1
with open(outfn,"w") as outfile:
    json.dump(summary,outfile)
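To see why the factory setup means we never have to create keys explicitly, here is a standalone demonstration of the same nested-defaultdict pattern (separate from the solution above):

import collections

summary = collections.defaultdict(
    lambda: collections.defaultdict(
        lambda: {"total facilities": 0, "post offices": 0}
    )
)
# No keys exist yet; this single line creates both levels and increments.
summary["IL"]["Central Illinois"]["post offices"] += 1
print(summary["IL"]["Central Illinois"])
# prints: {'total facilities': 0, 'post offices': 1}

Note also that json.dump serializes the nested defaultdicts without any conversion, since defaultdict is a subclass of dict.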
The resulting output JSON file is very long; here's the part of it for Florida and Georgia (with extra whitespace and indenting added for readability):
...
"FL": {
    "Suncoast": {
        "total facilities": 187,
        "post offices": 162
    },
    "Gulf Atlantic": {
        "total facilities": 104,
        "post offices": 80
    },
    "South Florida": {
        "total facilities": 102,
        "post offices": 90
    }
},
"GA": {
    "Atlanta": {
        "total facilities": 143,
        "post offices": 114
    },
    "Gulf Atlantic": {
        "total facilities": 104,
        "post offices": 86
    },
    "Tennessee": {
        "total facilities": 9,
        "post offices": 9
    }
},
...
(There's no typo there; some of the post offices in Georgia are in a district that USPS calls "Tennessee". That district also contains all of the post offices in the state of Tennessee.)
What words appear in both War and Peace by Leo Tolstoy and The Metamorphosis by Franz Kafka?
Answer this for the English translations of these novels available as UTF-8 plain text from Project Gutenberg:
(You should look at the files in a browser to see the format; there is a header and a footer, with the actual text in between.)
Counting words is tricky, and there are a lot of edge cases, so let's impose these rules:
- A word is a maximal run of alphabetic characters. A character c is considered alphabetic if c.isalpha() returns True. (The method isalpha of class str is documented here.)
- Words are compared case-insensitively, i.e. everything is converted to lower case.
These files contain accented characters and various exotic punctuation (e.g. proper start and end quotation marks, not just the symmetric ones on your keyboard). As a result, to find words and ignore punctuation it will be easiest to use a positive test (.isalpha decides what is a letter) instead of a negative one (e.g. a hard-coded list of punctuation characters to convert to spaces before splitting).
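For instance, .isalpha() accepts accented letters but rejects typographic punctuation, as a quick REPL check shows:

>>> "é".isalpha()
True
>>> "’".isalpha()   # right single quotation mark: punctuation, not a letter
False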
Having finished the last exercise of the worksheet, you can now wake from these troubled dreams to find yourself transformed into a more skilled programmer.
# MCS 275 Worksheet 9 Problem 6
# Jennifer Vaccaro and Emily Dumas
"""Finds the shared words between 'War and Peace' by Leo Tolstoy
and 'Metamorphosis' by Franz Kafka"""
# Set the names of the text files
tolstoy_fn = "tolstoy.txt"
kafka_fn = "kafka.txt"
def lower_words(s):
    """Split the string s into lower case alphabetic words by
    converting all other characters to spaces, then splitting"""
    new_s = ""
    for c in s:
        if c.isalpha():
            new_s += c.lower()
        else:
            new_s += " "
    return new_s.split()
def ebook_word_set(fobj):
    """Return a set containing the distinct words in a Project
    Gutenberg ebook text file. Only counts words between the
    start and end markers that these files include."""
    words = set()
    watching = False  # Indicates whether we're in the main text yet.
    for line in fobj:
        if line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK"):
            watching = True  # We hit the start marker, so start counting words
            continue         # on the *next* line.
        if not watching:
            continue
        if line.startswith("End of the Project Gutenberg EBook"):
            break  # We hit the end marker; stop.
        # We are between the start and end markers.
        # Add all words on this line to the set of words seen so far
        words.update(lower_words(line))
    return words
with open(tolstoy_fn,"r",encoding="utf8") as f:
    tolstoy_words = ebook_word_set(f)

with open(kafka_fn,"r",encoding="utf8") as f:
    kafka_words = ebook_word_set(f)
# The intersection of these sets is the set of shared words
shared_words = tolstoy_words & kafka_words
print("Total words in common:",len(shared_words))
lmax = max( [len(w) for w in shared_words] )
print("\nLongest shared word(s):")
print(",".join(sorted([w for w in shared_words if len(w)==lmax])))
print("\nAll shared words in alphabetical order:")
print(",".join(sorted(shared_words)))
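As a quick check of the lower_words helper, anything non-alphabetic acts as a separator (note that this splits contractions and possessives apart):

>>> lower_words("Gregor’s troubled dreams!")
['gregor', 's', 'troubled', 'dreams']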