This worksheet focuses on urllib and Beautiful Soup.
These things might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.
Use Beautiful Soup to write a script that takes an HTML file and writes equivalent HTML that is more nicely indented to an output file, using the title of the HTML document to generate the output filename (converting spaces to underscores). That is, if the document has title "Reasons not to taunt a polar bear", then the output filename would be
Reasons_not_to_taunt_a_polar_bear.html
(Recall that Beautiful Soup has a method to generate nicely indented HTML for any tag or BeautifulSoup object.)
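The filename rule described above can be sketched in plain Python, without Beautiful Soup. The helper name `title_to_filename` is made up for this illustration, and stripping leading and trailing whitespace from the title is an extra assumption that the problem statement does not require.

```python
def title_to_filename(title):
    """Turn a document title into an output filename.

    Hypothetical helper: spaces become underscores and ".html" is appended.
    Stripping surrounding whitespace first is an assumption, not a requirement.
    """
    return title.strip().replace(" ", "_") + ".html"


example = title_to_filename("Reasons not to taunt a polar bear")
print(example)
```

With the title from the problem statement, this produces the filename shown above.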
Also, if there is no <title> tag in the input HTML file, the script should print an error message and exit without writing any output file.
The input HTML filename should be expected as the first command line argument.
For example, if in.html contains
<!doctype html><html><head></head><body>
<h1>MCS 275 HTML file</h1></body></html>
Then running
python3 prettify.py in.html
should print a message
ERROR: This HTML file has no <title>.
and exit.
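One way to detect the missing title is to ask Beautiful Soup for the tag and compare the result to `None`. The sketch below works on an HTML string rather than a file so that it is self-contained; the function name `get_title` is hypothetical, and searching the whole document (rather than only the head section) is a simplifying assumption.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the text of the first <title> tag, or None if there is none.

    Hypothetical helper for illustration; it searches the entire document.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    if tag is None:
        return None
    return tag.get_text().strip()


no_title = "<!doctype html><html><head></head><body><h1>MCS 275 HTML file</h1></body></html>"
with_title = ("<!doctype html><html><head><title>Abdominal surgery for beginners</title>"
              "</head><body><h1>MCS 275 HTML file</h1></body></html>")

print(get_title(no_title))
print(get_title(with_title))
```

A script following this pattern would print the error message and exit whenever the helper returns `None`.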
However, if in2.html contains
<!doctype html><html><head>
<title>Abdominal surgery for beginners</title>
</head><body><h1>MCS 275 HTML file</h1></body>
</html>
Then running
python3 prettify.py in2.html
should not display anything in the terminal, but should create a new file with name
Abdominal_surgery_for_beginners.html
and content
<!DOCTYPE html>
<html>
 <head>
  <title>Abdominal surgery for beginners</title>
 </head>
 <body>
  <h1>
   MCS 275 HTML file
  </h1>
 </body>
</html>
# Based on the 2023 worksheet solutions
from bs4 import BeautifulSoup
import sys

# Create a beautiful soup from the filename in the first command line arg
with open(sys.argv[1], "r") as infile:
    soup = BeautifulSoup(infile, "html.parser")

# Check whether the soup has a title in the head section
if soup.head is None or soup.head.title is None:
    print("ERROR: This HTML file has no <title>.")
    sys.exit(1)

# Build the output filename from the title text (spaces become underscores)
outfn = soup.head.title.string.replace(" ", "_") + ".html"

# Write out the prettified soup to the filename derived from the title
with open(outfn, "wt") as outfile:
    outfile.write(soup.prettify())
Consider this web page for a graduate complex analysis class that was taught at UIC in 2016:
https://www.dumas.io/teaching/2016/spring/math535/
One section of the page lists weekly homework. Each homework assignment has a number, a title, and a list of problems from various sections of the textbook. Write a scraper that downloads this course web site's HTML, parses it with Beautiful Soup, and creates one dictionary for each homework assignment having the following format
{
    "number": 10,
    "title": "Harmonic functions",
    "problems": "Sec 4.6.2(p166): 1,2\nSec 4.6.4(p171): 1,2,3,4"
}
It should then put these dictionaries into a list and save the list to a JSON file called math535spring2016homework.json.
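The saving step can be tried out on its own before any scraping is done. This sketch writes a one-element list containing the example dictionary from the problem statement and reads it back. The temporary directory and the filename `example_homework.json` are assumptions made so that the sketch does not overwrite any real output file.

```python
import json
import os
import tempfile

# A one-element list using the example dictionary from the problem statement
hw_data = [
    {
        "number": 10,
        "title": "Harmonic functions",
        "problems": "Sec 4.6.2(p166): 1,2\nSec 4.6.4(p171): 1,2,3,4",
    }
]

# Write to a scratch location (assumed here; the real script uses a fixed name)
path = os.path.join(tempfile.mkdtemp(), "example_homework.json")
with open(path, "wt") as outfp:
    json.dump(hw_data, outfp)

# Read it back to confirm the list of dictionaries survives the round trip
with open(path, "rt") as infp:
    loaded = json.load(infp)

print(loaded == hw_data)
```

Note that the newline inside the "problems" string is stored in the JSON file as an escape sequence and is restored to a real newline character when the file is loaded.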
Note: If you finish this problem early, you might find it fun to watch this animation of the UIC logo distortion that appears on the Math 535 course web page, and see if you can figure out what's going on.
# It's good practice to save the html locally during development.
# Here's a short script that saves the html as 'math535.html'
from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("https://www.dumas.io/teaching/2016/spring/math535/") as response:
    soup = BeautifulSoup(response, "html.parser")

with open("math535.html", "wt") as fout:
    fout.write(soup.prettify())
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json

# First, create a beautiful soup, either from the url or a local copy

## --- This section assumes you have it saved locally ------------
with open("math535.html", "rt") as infile:
    soup = BeautifulSoup(infile, "html.parser")
## -------------------------------------------------------

# --- Uncomment this section instead if you need to fetch it from the web ------
# with urlopen("https://www.dumas.io/teaching/2016/spring/math535/") as response:
#     soup = BeautifulSoup(response,"html.parser")
# --------------------------------------------------------

# We want to make a list of dictionaries, so start with an empty list
hw_data = []

# The relevant section is in an unordered list inside the "homework" div.
hw_ul_tag = soup.find("div", id="homework").ul

# Iterate through each bullet item in the homeworks list
for hw in hw_ul_tag.find_all("li"):
    # Not every 535 homework assignment fits the expected format.
    # If there's an issue parsing, just skip that assignment and continue.
    # A sweeping try/except is not always recommended, but neither
    # is parsing html.
    # Placeholder so the error message below works even if the heading can't be read
    heading = "(no heading found)"
    try:
        # The problems are inside the contents, on lines without other tags.
        problem_lines = []
        for prob in hw.contents:
            # Convert to string and strip out starting/ending white space
            prob = str(prob).strip()
            # If the content line has a tag or is whitespace, then skip
            if "<" in prob or prob == "":
                continue
            # Otherwise, keep it as one line of the problem list
            problem_lines.append(prob)
        # Join with newlines (no leading newline), matching the requested format
        problems = "\n".join(problem_lines)

        # The assignment number and title are all inside the "b" tag
        heading = hw.b.string.strip()
        words = heading.split()
        number = int(words[1])
        title = " ".join(words[7:])

        # Create a dictionary with the fields we collected
        d = {"number": number, "title": title, "problems": problems}

        # Append the dictionary to the list of dictionaries
        hw_data.append(d)
    except Exception as e:
        # Skip the homework assignments that don't have the expected format,
        # but print the error message.
        print("Skipping this row: ", heading)
        print("Reason: ", e)
        continue

# Dump out the list-of-dictionaries into a json file.
outfn = "math535spring2016homework.json"
with open(outfn, "wt") as outfp:
    json.dump(hw_data, outfp)
print("\nWrote data on {} assignments to {}".format(len(hw_data), outfn))
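The solution above decides whether a piece of content is plain text by checking for "<" in its string form. An alternative, sketched below, is to test whether each child is a `NavigableString`, which is the class Beautiful Soup uses for text nodes. The HTML snippet here is made up for illustration and is not copied from the Math 535 page, so the real markup may differ.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up list item, loosely modeled on the format described in the problem
html = """<li><b>Homework 10</b><br/>
Sec 4.6.2(p166): 1,2<br/>
Sec 4.6.4(p171): 1,2,3,4</li>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the non-empty text nodes that are direct children of the <li>
lines = []
for child in soup.li.children:
    if isinstance(child, NavigableString):
        text = str(child).strip()
        if text:
            lines.append(text)

problems = "\n".join(lines)
print(problems)
```

One possible advantage of this approach is that a problem line containing a literal "<" character in its text would not be skipped by mistake.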
Here is a link to an HTML file:
https://www.dumas.io/teaching/2024/spring/mcs275/data/capture.html
If you open it in a browser, you won't see anything. The document contains nothing but <span> tags, and no text. Some of the <span> tags are nested inside other <span> tags. How deeply are they nested?
Every <span> tag in this file has an id attribute. There is exactly one <span> that has greater depth in the DOM tree than any other. What is its id attribute?
Write a Python script to load the HTML file with Beautiful Soup and traverse the DOM to answer these questions.
# It's good practice to save the html locally during development.
# Here's a short script that saves the html as 'capture.html'
from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("https://www.dumas.io/teaching/2024/spring/mcs275/data/capture.html") as response:
    soup = BeautifulSoup(response, "html.parser")

with open("capture.html", "wt") as fout:
    fout.write(soup.prettify())
from urllib.request import urlopen
from bs4 import BeautifulSoup

def span_tag_depth(tag):
    """Recursive function for recursing through the span tree and counting the maximum depth.
    Returns the depth."""
    # Track the largest depth found among the children
    max_span_depth = 0
    # Iterate through the child span tags WITHOUT RECURSING
    # i.e. only immediate children, not deeper descendants
    for t in tag.find_all("span", recursive=False):
        depth = span_tag_depth(t)
        # If the child's depth is the deepest so far, then replace.
        if depth > max_span_depth:
            max_span_depth = depth
    # Pass up the maximum depth
    return 1 + max_span_depth

def span_tag_depth_id(tag):
    """Recursive function for recursing through the span tree and counting the maximum depth.
    Returns the depth and the leaf's span id."""
    # Set the default depth
    max_span_depth = 0
    # Use the current tag's id as the default id (None if it has no id).
    # Then, we will "pass up" the leaf id from the longest branch
    max_span_id = tag.get("id")
    # Iterate through the child span tags WITHOUT RECURSING
    # i.e. only immediate children, not deeper descendants
    for t in tag.find_all("span", recursive=False):
        # Recurse through t's children for the max branch length and leaf id
        t_depth, t_id = span_tag_depth_id(t)
        # If t has the deepest depth so far, replace the max depth/id.
        if t_depth > max_span_depth:
            max_span_depth = t_depth
            max_span_id = t_id  # leaf id
    # Return the augmented max_depth and the id of the leaf.
    return 1 + max_span_depth, max_span_id

# Create a beautiful soup, either from the url or a local copy

# --- Comment this out for the final version ------------
with open("capture.html", "rt") as infile:
    soup = BeautifulSoup(infile, "html.parser")
# -------------------------------------------------------

# # --- Comment this out during development ---------------
# with urlopen("https://www.dumas.io/teaching/2024/spring/mcs275/data/capture.html") as response:
#     soup = BeautifulSoup(response,"html.parser")
# # -------------------------------------------------------

print("Maximum depth:", span_tag_depth(soup.span))

depth, span_id = span_tag_depth_id(soup.span)
print("Maximum depth:", depth, "Leaf id:", span_id)
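As a cross-check on the recursive solution, the same questions can be approached without writing a recursive function: list every span with `find_all`, and measure each one's depth by counting its span ancestors with `find_parents`. This is a different technique from the one used above, shown on a small made-up document rather than on capture.html. The helper name `span_depth` is hypothetical.

```python
from bs4 import BeautifulSoup

# Small made-up document: "d" is nested inside "c", which is inside "a"
html = '<span id="a"><span id="b"></span><span id="c"><span id="d"></span></span></span>'
soup = BeautifulSoup(html, "html.parser")


def span_depth(tag):
    """Depth of a span, counting the span itself plus its span ancestors."""
    return 1 + len(tag.find_parents("span"))


# Pick the span whose depth is largest
deepest = max(soup.find_all("span"), key=span_depth)
print("Maximum depth:", span_depth(deepest), "Leaf id:", deepest["id"])
```

This version revisits the ancestors of every span, so it is likely to do more work than the recursive traversal on a large document, but it is short enough to be a convenient sanity check.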
In case you complete everything above with time to spare, I suggest using the remaining lab time to work on Project 4. Having your TA nearby to answer any questions that come up will probably be helpful!