Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Import the module (only need to do this once)
from bs4 import BeautifulSoup
Parse a single HTML file into a DOM-like data structure in a variable soup:
(This is one of the slide presentations from MCS 275.)
with open("html-for-scraping/lecture40.html") as fobj:
soup = BeautifulSoup(fobj,"html.parser")
Get the title of that lecture (the string that is the only text node under the title tag)
soup.head.title.string
How many slides were in that lecture?
# each slide is a <section> tag.
len(soup.find_all("section"))
(This count is only approximately right; in reveal.js, nested section tags are used to create slides that appear below others, and that feature is used here. The true slide count would be the number of section tags that don't contain other section tags. How would you find that?)
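One way to do it (a sketch I'm adding here, not something we ran in lecture): keep only the section tags that have no section tags inside them.
# Count only innermost <section> tags, i.e. those with no
# <section> descendants (reuses the soup parsed above)
len([s for s in soup.find_all("section") if s.find("section") is None])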
Let's do the same thing, but for every html file in the html-for-scraping directory (several of the MCS 275 lectures).
import os
DATADIR="html-for-scraping"
for fn in os.listdir(DATADIR):
    if not fn.endswith(".html"):
        continue
    with open(os.path.join(DATADIR,fn)) as fobj:
        soup = BeautifulSoup(fobj,"html.parser")
    print(fn,soup.head.title.string)
Remark: A cleaner way to get all files that end in .html would be to use glob.glob("html-for-scraping/*.html"). But we didn't discuss the glob module, so I used os.listdir.
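For reference, the glob version might look like this (a sketch of the same loop, not code we ran):
import glob
# Let glob do the .html filtering and build the paths for us
for path in glob.glob("html-for-scraping/*.html"):
    with open(path) as fobj:
        soup = BeautifulSoup(fobj,"html.parser")
    print(path,soup.head.title.string)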
The next cell retrieves https://example.com/. Be careful to avoid making frequent automated requests to any web server, and to follow a site's terms of use and robots.txt rules. Here, I've added a 1-second delay to make sure this cell can never make more than 1 request per second.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://example.com/") as response:
soup = BeautifulSoup(response,"html.parser")
Note: If we were going to work with the contents of this page many times, it would be better to download it to a file and then parse the file. That way, there would only be one network request, rather than a new request each time the program is run.
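For example, a download-once version could look like this (a sketch; the local filename example.html is my choice):
from urllib.request import urlopen

# Fetch the page once and save it to a local file
with urlopen("https://example.com/") as response:
    with open("example.html","wb") as outfile:
        outfile.write(response.read())

# Later runs can parse the saved copy with no network request
with open("example.html") as fobj:
    soup = BeautifulSoup(fobj,"html.parser")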
# printing a BeautifulSoup object shows the corresponding
# HTML
soup
# But it's actually a BeautifulSoup object, which has
# many methods and attributes.
type(soup)
# First p in the document
soup.p
soup.find_all("p")[1] # second p in the document
soup.div.p # first p tag inside the first div
# is the first p that appears in a div actually the first p
# in the whole document?
soup.div.p == soup.p
soup = BeautifulSoup("""
<html><head><title>Hello</title>
<body><h1>Hello</h1> This is my document.
<strong>Mine.</strong></body></html>""","html.parser")
soup
print(soup.prettify())
soup.title
type(soup.title)
soup.h1
soup.find_all("h1")
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://dumas.io/") as response:
soup = BeautifulSoup(response,"html.parser")
# the div with id teaching, the first unordered list (UIC teaching)
# inside of the first unordered list (all teaching) in that div.
uic_teaching = soup.find("div",id="teaching").ul.ul
# text list of courses
for x in uic_teaching.find_all("li"):
print(x.text)
for x in uic_teaching.find_all("a"):
print("Link to {} with link text '{}'".format(
x["href"],
x.text
))
soup.find_all("div")[-1]
acktag = soup.find_all("div")[-1].h3
acktag.parent
acktag.parent.parent.name
acktag.parent.name
acktag.parent.attrs
acktag.parent.contents # will return a list
The part we did during Lecture 41
import datetime
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://catalog.uic.edu/ucat/academic-calendar/") as response:
soup = BeautifulSoup(response,"html.parser")
# How many tables are in this document?
len(soup.find_all("table"))
Each table appears to correspond to one semester or one summer session.
Let's iterate over them and look at each table's rows to get key dates for the session. We'll need a function to parse dates in the string format used by the table, e.g.
September 2, M
def parse_datestr(year,datestr):
    """Take a year like "2020" and a date string like
    "January 13, M" and convert it to a Python date object."""
    # Discard the day of week after the ,
    datestr = datestr.split(",")[0]
    # After appending the year, this looks like "January 13 2020",
    # which has format "%B %d %Y"
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()
for t in soup.find_all("table"):
    # look for the preceding h2 to get which semester it is
    table_heading = t.find_previous_sibling("h2")
    if "summer" in table_heading.text.lower():
        # TODO: Handle summer
        continue
    print("--------------------------------------")
    print("SEMESTER:",table_heading.text)
    # extract the year from the table heading
    year = table_heading.text.split()[-1]
    # Loop to examine rows of the semester table
    for r in t.find_all("tr"):
        if r.parent.name == "thead":
            # skip header rows
            continue
        datestr, desc = [ x.text for x in r.find_all("td") ]
        # TODO: Handle ranges of dates. For now, we just
        # skip the row if parsing gives an exception due
        # to the presence of a hyphen.
        try:
            date = parse_datestr(year,datestr)
            print(date,desc)
        except ValueError:
            print("SKIPPING THIS ROW:",datestr,desc)
            continue
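One way to handle the ranged dates (a sketch; I'm guessing the first cell looks something like "January 13-17, M-F", based only on the hyphens mentioned above) is to keep just the start of the range before parsing:
def parse_datestr_range_start(year,datestr):
    """Like parse_datestr, but if the date string is a range,
    return the date at the start of the range."""
    # Discard the day(s) of week after the ,
    datestr = datestr.split(",")[0]
    # Keep only the part before a hyphen, e.g. "January 13-17" -> "January 13"
    datestr = datestr.split("-")[0].strip()
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()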
# Goal: Write a CSV in the format:
# 2019,fall,2019-08-26,Instruction begins.
def parse_datestr(year,datestr):
    """Take a year like "2020" and a date string like
    "January 13, M" and convert it to a Python date object."""
    # Discard the day of week after the ,
    datestr = datestr.split(",")[0]
    # After appending the year, this looks like "January 13 2020",
    # which has format "%B %d %Y"
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()
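A possible continuation toward that goal (a sketch; the output filename, the way I extract the semester name from the heading, and the continued skipping of summer sessions and ranged dates are all my choices, not something we wrote in lecture):
import csv

with open("academic_calendar.csv","w",newline="") as outfile:
    writer = csv.writer(outfile)
    for t in soup.find_all("table"):
        table_heading = t.find_previous_sibling("h2")
        if "summer" in table_heading.text.lower():
            continue  # still skipping summer sessions
        # Assumes a heading like "Fall Semester 2019"
        semester = table_heading.text.split()[0].lower()
        year = table_heading.text.split()[-1]
        for r in t.find_all("tr"):
            if r.parent.name == "thead":
                continue  # skip header rows
            datestr, desc = [ x.text for x in r.find_all("td") ]
            try:
                date = parse_datestr(year,datestr)
            except ValueError:
                continue  # still skipping ranged dates
            # e.g. 2019,fall,2019-08-26,Instruction begins.
            writer.writerow([year,semester,date.isoformat(),desc])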