Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Import the module (only need to do this once)
from bs4 import BeautifulSoup
Parse a single HTML file into a DOM-like data structure in a variable soup:
(This is one of the slide presentations from MCS 275.)
with open("html-for-scraping/lecture40.html") as fobj:
soup = BeautifulSoup(fobj,"html.parser")
Get the title of that lecture (the string that is the only text node under the title tag)
soup.head.title.string
How many slides were in that lecture?
# each slide is a <section> tag.
len(soup.find_all("section"))
(This count is only approximately right; in reveal.js, nested section tags are used to create slides that appear below others, and that feature is used here. The true slide count would be the number of section tags that don't contain other section tags. How would you find that?)
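One way to do it (a sketch I'm adding here, not something we ran in lecture): keep only the section tags that have no section tags inside them.
# Count only innermost <section> tags, i.e. those with no
# <section> descendants (reuses the soup parsed above)
len([s for s in soup.find_all("section") if s.find("section") is None])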
Let's do the same thing, but for every html file in the html-for-scraping directory (several of the MCS 275 lectures).
import os
DATADIR="html-for-scraping"
for fn in os.listdir(DATADIR):
    if not fn.endswith(".html"):
        continue
    with open(os.path.join(DATADIR,fn)) as fobj:
        soup = BeautifulSoup(fobj,"html.parser")
    print(fn,soup.head.title.string)
Remark: A cleaner way to get all files that end in .html would be to use glob.glob("html-for-scraping/*.html"). But we didn't discuss the glob module, so I used os.listdir.
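For reference, the glob version might look like this (a sketch of the same loop, not code we ran):
import glob
# Let glob do the .html filtering and build the paths for us
for path in glob.glob("html-for-scraping/*.html"):
    with open(path) as fobj:
        soup = BeautifulSoup(fobj,"html.parser")
    print(path,soup.head.title.string)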
The next cell retrieves https://example.com/. Be careful to avoid making frequent automated requests to any web server, and to follow a site's terms of use and robots.txt rules. Here, I've added a 1-second delay to make sure this cell can never make more than 1 request per second.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://example.com/") as response:
soup = BeautifulSoup(response,"html.parser")
Note: If we were going to work with the contents of this page many times, it would be better to download it to a file and then parse the file. That way, there would only be one network request, rather than a new request each time the program is run.
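For example, a download-once version could look like this (a sketch; the local filename example.html is my choice):
from urllib.request import urlopen

# Fetch the page once and save it to a local file
with urlopen("https://example.com/") as response:
    with open("example.html","wb") as outfile:
        outfile.write(response.read())

# Later runs can parse the saved copy with no network request
with open("example.html") as fobj:
    soup = BeautifulSoup(fobj,"html.parser")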
# printing a BeautifulSoup object shows the corresponding
# HTML
soup
# But it's actually a BeautifulSoup object, which has
# many methods and attributes.
type(soup)
# First p in the document
soup.p
soup.find_all("p")[1] # second p in the document
soup.div.p # first p tag inside the first div
# is the first p that appears in a div actually the first p
# in the whole document?
soup.div.p == soup.p
soup = BeautifulSoup("""
<html><head><title>Hello</title>
<body><h1>Hello</h1> This is my document.
<strong>Mine.</strong></body></html>""","html.parser")
soup
print(soup.prettify())
soup.title
type(soup.title)
soup.h1
soup.find_all("h1")
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://dumas.io/") as response:
soup = BeautifulSoup(response,"html.parser")
# the div with id teaching, the first unordered list (UIC teaching)
# inside of the first unordered list (all teaching) in that div.
uic_teaching = soup.find("div",id="teaching").ul.ul
# text list of courses
for x in uic_teaching.find_all("li"):
print(x.text)
for x in uic_teaching.find_all("a"):
print("Link to {} with link text '{}'".format(
x["href"],
x.text
))
soup.find_all("div")[-1]
acktag = soup.find_all("div")[-1].h3
acktag.parent
acktag.parent.parent.name
acktag.parent.name
acktag.parent.attrs
acktag.parent.contents # will return a list
The part we did during Lecture 41
import datetime
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
time.sleep(1)
with urlopen("https://catalog.uic.edu/ucat/academic-calendar/") as response:
soup = BeautifulSoup(response,"html.parser")
# How many tables are in this document?
len(soup.find_all("table"))
Each table appears to correspond to one semester or one summer session.
Let's iterate over them and look at each table's rows to get key dates for the session. We'll need a function to parse dates in the string format used by the table, e.g.
September 2, M
def parse_datestr(year,datestr):
    """Take a year like "2020" and a date string like
    "January 13, M" and convert it to a Python date object."""
    # Discard the day of week after the ,
    datestr = datestr.split(",")[0]
    # After appending the year, this looks like "January 13 2020",
    # which has format "%B %d %Y"
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()
for t in soup.find_all("table"):
    # look for the preceding h2 to get which semester it is
    table_heading = t.find_previous_sibling("h2")
    if "summer" in table_heading.text.lower():
        # TODO: Handle summer
        continue
    print("--------------------------------------")
    print("SEMESTER:",table_heading.text)
    # extract the year from the table heading
    year = table_heading.text.split()[-1]
    # Loop to examine rows of the semester table
    for r in t.find_all("tr"):
        if r.parent.name == "thead":
            # skip header rows
            continue
        datestr, desc = [ x.text for x in r.find_all("td") ]
        # TODO: Handle ranges of dates. For now, we just
        # skip the row if parsing gives an exception due
        # to the presence of a hyphen.
        try:
            date = parse_datestr(year,datestr)
            print(date,desc)
        except ValueError:
            print("SKIPPING THIS ROW:",datestr,desc)
            continue
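One way to handle the ranged dates (a sketch; I'm guessing the first cell looks something like "January 13-17, M-F", based only on the hyphens mentioned above) is to keep just the start of the range before parsing:
def parse_datestr_range_start(year,datestr):
    """Like parse_datestr, but if the date string is a range,
    return the date at the start of the range."""
    # Discard the day(s) of week after the ,
    datestr = datestr.split(",")[0]
    # Keep only the part before a hyphen, e.g. "January 13-17" -> "January 13"
    datestr = datestr.split("-")[0].strip()
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()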
# Goal: Write a CSV in the format:
# 2019,fall,2019-08-26,Instruction begins.
def parse_datestr(year,datestr):
    """Take a year like "2020" and a date string like
    "January 13, M" and convert it to a Python date object."""
    # Discard the day of week after the ,
    datestr = datestr.split(",")[0]
    # After appending the year, this looks like "January 13 2020",
    # which has format "%B %d %Y"
    return datetime.datetime.strptime(
        datestr + " " + year,
        "%B %d %Y"
    ).date()
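A possible continuation toward that goal (a sketch; the output filename, the way I extract the semester name from the heading, and the continued skipping of summer sessions and ranged dates are all my choices, not something we wrote in lecture):
import csv

with open("academic_calendar.csv","w",newline="") as outfile:
    writer = csv.writer(outfile)
    for t in soup.find_all("table"):
        table_heading = t.find_previous_sibling("h2")
        if "summer" in table_heading.text.lower():
            continue  # still skipping summer sessions
        # Assumes a heading like "Fall Semester 2019"
        semester = table_heading.text.split()[0].lower()
        year = table_heading.text.split()[-1]
        for r in t.find_all("tr"):
            if r.parent.name == "thead":
                continue  # skip header rows
            datestr, desc = [ x.text for x in r.find_all("td") ]
            try:
                date = parse_datestr(year,datestr)
            except ValueError:
                continue  # still skipping ranged dates
            # e.g. 2019,fall,2019-08-26,Instruction begins.
            writer.writerow([year,semester,date.isoformat(),desc])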