MCS 275 Spring 2024 Worksheet 14

Course instructor: Emily Dumas

Topics

This worksheet focuses on urllib and Beautiful Soup.

Resources

These things might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.

- Lecture 29 - HTML and CSS
- Lecture 37 - Working with APIs and HTML
- Lecture 38 - Using beautiful soup
- Other online resources:
  - w3schools HTML tutorial
  - Beautiful Soup documentation
- Course sample code:
  - samplecode/http_requests_and_html_parsing
- Downey's book, Think Python
- MCS 260 course materials from Fall 2021:
  - Slides, homework, worksheets, and projects
  - MCS 260 Sample programs

1. HTML prettifier and renaming utility

Use Beautiful Soup to write a script that takes an HTML file and writes equivalent HTML that is more nicely indented to an output file, using the title of the HTML document to generate the output filename (converting spaces to underscores). That is, if the document has title "Reasons not to taunt a polar bear", then the output filename would be

Reasons_not_to_taunt_a_polar_bear.html

(Recall that Beautiful Soup has a method to generate nicely indented HTML for any tag or BeautifulSoup object.)
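As a reminder of what that method looks like in use, here is a minimal sketch. The HTML string and the variable names are made up for illustration, and the built-in "html.parser" backend is an assumption; any parser supported by Beautiful Soup should behave similarly.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only to demonstrate the call
html = "<!doctype html><html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Parse with the standard library parser (an assumption; lxml would also work)
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag or text node per line,
# indented according to nesting depth
pretty = soup.prettify()
print(pretty)
```

The same method is available on individual tags, so `soup.body.prettify()` would give an indented version of just the body.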

Also, if there is no <title> tag in the input HTML file, the script should print an error message and exit without writing any output file.

The input HTML filename should be expected as the first command line argument.

For example, if in.html contains

<!doctype html><html><head></head><body>
<h1>MCS 275 HTML file</h1></body></html>

Then running

python3 prettify.py in.html

should print a message

ERROR: This HTML file has no <title>.

and exit.

However, if in2.html contains

<!doctype html><html><head>
<title>Abdominal surgery for beginners</title>
</head><body><h1>MCS 275 HTML file</h1></body>
</html>

Then running

python3 prettify.py in2.html

should not display anything in the terminal, but should create a new file with name

Abdominal_surgery_for_beginners.html

and content

<!DOCTYPE html>
<html>
 <head>
  <title>Abdominal surgery for beginners</title>
 </head>
 <body>
  <h1>
   MCS 275 HTML file
  </h1>
 </body>
</html>
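One possible way to organize such a script is sketched below. It is not the official solution: the function names `output_filename` and `prettify_file` are hypothetical, the "html.parser" backend is an assumption, and treating an empty `<title></title>` the same as a missing one is a design choice the problem statement does not specify.

```python
from bs4 import BeautifulSoup


def output_filename(soup):
    """Return the output filename derived from <title>, or None if unavailable."""
    # soup.title is None when the document has no <title> tag;
    # .string is None when the tag exists but contains no single text node
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip().replace(" ", "_") + ".html"


def prettify_file(in_path):
    """Write a prettified copy of the HTML file at in_path.

    Returns the output filename, or None if no file was written.
    """
    with open(in_path, encoding="utf-8") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    out_path = output_filename(soup)
    if out_path is None:
        print("ERROR: This HTML file has no <title>.")
        return None
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(soup.prettify())
    return out_path
```

In a complete script, `prettify_file(sys.argv[1])` would be called after importing `sys`, so that the input filename comes from the first command line argument.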

2. Complex analysis homework scraper

Consider this web page for a graduate complex analysis class that was taught at UIC in 2016:

Math 535 Spring 2016

One section of the page lists weekly homework. Each homework assignment has a number, a title, and a list of problems from various sections of the textbook. Write a scraper that downloads this course web site's HTML, parses it with Beautiful Soup, and creates one dictionary for each homework assignment having the following format

{
  "number": 10,
  "title": "Harmonic functions",
  "problems": "Sec 4.6.2(p166): 1,2\nSec 4.6.4(p171): 1,2,3,4"
}

It should then put these dictionaries into a list and save the list to a JSON file called math535spring2016homework.json.
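The download and save steps at either end of this task might look roughly like the sketch below. The function names `fetch_html` and `save_homework` are hypothetical, and the parsing step in the middle is deliberately left out because it depends on the actual structure of the course page.

```python
import json
from urllib.request import urlopen


def fetch_html(url):
    """Download the resource at url and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def save_homework(assignments, filename="math535spring2016homework.json"):
    """Save a list of homework dictionaries to a JSON file."""
    with open(filename, "w", encoding="utf-8") as outfile:
        # indent=2 is optional; it only makes the file easier to read
        json.dump(assignments, outfile, indent=2)
```

The bytes returned by `fetch_html` can be passed directly to the `BeautifulSoup` constructor. Note that the newline inside a "problems" string is written to the file as the escape sequence `\n` and is restored when the file is read back with `json.load`.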

Note: If you finish this problem early, you might find it fun to watch this animation of the UIC logo distortion that appears on the Math 535 course web page, and see if you can figure out what's going on.

3. Capture the tag

Here is a link to an HTML file:

capture.html

If you open it in a browser, you won't see anything. The document contains nothing but  tags, and no text. Some of the  tags are nested inside other  tags. How deeply are they nested?

Every  tag in this file has an id attribute. There is exactly one  that has greater depth in the DOM tree than any other. What is its id attribute?

Write a Python script to load the HTML file with Beautiful Soup and traverse the DOM to answer these questions.
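One way to approach the traversal is a depth-first search that remembers the deepest tag seen so far. The sketch below is only an illustration: the function name `deepest_tag` is hypothetical, depth is counted relative to the node passed in (an assumption about how to define it), and the `div` elements in the demo string are a stand-in, not the contents of capture.html. An explicit stack is used instead of recursion so that very deep nesting cannot hit Python's recursion limit.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def deepest_tag(root):
    """Return (depth, tag) for the deepest tag below root; root has depth 0."""
    best_depth, best_tag = 0, root
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > best_depth:
            best_depth, best_tag = depth, node
        for child in node.children:
            # Skip text nodes and comments; only tags can contain other tags
            if isinstance(child, Tag):
                stack.append((child, depth + 1))
    return best_depth, best_tag


# Hypothetical demo document with made-up ids
demo = '<div id="a"><div id="b"><div id="c"></div></div><div id="d"></div></div>'
soup = BeautifulSoup(demo, "html.parser")
depth, tag = deepest_tag(soup)
print(depth, tag["id"])
```

If several tags share the maximum depth, this version reports whichever one the search reaches first, which is acceptable here because the problem states that the deepest tag is unique.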

Use extra time to work on Project 4

In case you complete everything above with time to spare, I suggest using the remaining lab time to work on Project 4. Having your TA nearby to answer any questions that come up will probably be helpful!

Revision history

2024-04-14 Initial release