MCS 275 Spring 2024
Emily Dumas
Reminders and announcements:
Install beautifulsoup4 with
python3 -m pip install beautifulsoup4
APIs that directly serve machine-readable, typed data are the best way to bring data from an external service into your programs.
See e.g. apislist.com
Let's write a program that displays the Chicago weather forecast for the current day.
We'll use the National Weather Service API (api.weather.gov) to get Chicago forecast data as JSON.
A nice feature of this API is you don't need to do any authentication. Many public APIs require you to sign up for an API key and rate-limit each such key.
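A minimal sketch of that fetch using only the standard library (the User-Agent contact address is a placeholder, and the field names follow the NWS response format):

import json
from urllib.request import urlopen, Request

# The NWS asks clients to identify themselves via the User-Agent header.
HEADERS = {"User-Agent": "mcs275-example (student@example.edu)"}

def get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(Request(url, headers=HEADERS)) as response:
        return json.load(response)

# The "points" endpoint maps coordinates (here, downtown Chicago)
# to the URL where the forecast for that location lives.
point = get_json("https://api.weather.gov/points/41.8781,-87.6298")
forecast_url = point["properties"]["forecast"]

# Each "period" is about half a day; the first is today or tonight.
period = get_json(forecast_url)["properties"]["periods"][0]
print(period["name"], "-", period["detailedForecast"])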
What if the information you need can be found in a web page (HTML document) but there is no API?
Extracting data from HTML — a language for making human-readable documents — should be considered a last resort.
But of course, sometimes an HTML document is the only form in which the data is available.
Level 0: Treat HTML as a string. Do string things.
Level 1: Treat HTML as a stream of tags, attributes, and text. Have an HTML parser recognize them and tell you what it finds. The standard library module html.parser is good for this.
These approaches handle huge documents efficiently, but make nontrivial data extraction quite complex.
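As a sketch of the Level 1 style, here is an html.parser subclass (the class and its sample input are invented for illustration). The parser calls these methods as it recognizes pieces of the document, in document order:

from html.parser import HTMLParser

class LinkLister(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            print("link:", dict(attrs).get("href"))

    def handle_data(self, data):
        if data.strip():
            print("text:", data.strip())

LinkLister().feed('<p>See <a href="https://example.com">this</a>.</p>')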
Level 2: Use a higher-level HTML data extraction framework like Beautiful Soup, Scrapy, or Selenium.
These frameworks create a data structure that represents the entire document, supporting various kinds of searching, traversal, and extraction.
Note that the whole document needs to fit in memory.
The Document Object Model or DOM is a language-independent model for representing an HTML document as a tree of nodes.
Each node represents part of the document, such as a tag, an attribute, or text appearing inside a tag.
The formal specification has rules for naming, accessing, and modifying parts of a document. JavaScript fully implements this specification.
<html><head><title>My title</title></head><body><h1>A heading</h1>
<a href="https://example.com">Link text</a></body></html>
<p>I <strong>really</strong> like Python.</p>
This package provides a module called bs4 for turning HTML into a DOM-like data structure.
Widely used, e.g. at one point Reddit's backend software used it to select a representative image from a web page when a URL appeared in a post*.
Requires an HTML parser. We'll use html.parser from the standard library, but Beautiful Soup supports others.
* As of 2014. Perhaps they still use it?
Parse HTML file into DOM:
from bs4 import BeautifulSoup
with open("lecture37.html") as fobj:
    soup = BeautifulSoup(fobj, "html.parser")
Parse web page into DOM:
from urllib.request import urlopen
from bs4 import BeautifulSoup
with urlopen("https://example.com/") as response:
    soup = BeautifulSoup(response, "html.parser")
Be careful about the ethics of connecting to web servers from programs.
A program that extracts data from HTML is a scraper.
A program that visits all pages on a site is a spider.
All forms of automated access should:
- identify themselves (e.g. with a User-Agent header),
- respect the site's wishes (e.g. robots.txt and terms of service), and
- limit the rate of requests.
One way to follow these guidelines with the standard library is sketched below.
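A minimal sketch of polite automated access (the User-Agent string and the one-second default delay are illustrative choices):

import time
from urllib.request import urlopen, Request

def polite_get(url, delay=1.0):
    """Fetch a URL with an identifying User-Agent, then pause."""
    req = Request(url,
                  headers={"User-Agent": "mcs275-spider (contact@example.edu)"})
    with urlopen(req) as response:
        body = response.read()
    time.sleep(delay)  # rate limit: at most one request per `delay` seconds
    return body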
Parse string into DOM:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    "<p>The coffee was <strong>strong</strong>.</p>",
    "html.parser"
)
str(soup) # show as HTML
soup.prettify() # prettier HTML
soup.title # first (and only) title tag
soup.p # first p tag
soup.find("p") # first p tag (alternative)
soup.p.strong # first strong tag within the first p tag
soup.find_all("a") # list of all a tags
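For instance, applying a few of these to the sample document from earlier in the lecture:

from bs4 import BeautifulSoup

doc = ('<html><head><title>My title</title></head><body>'
       '<h1>A heading</h1>'
       '<a href="https://example.com">Link text</a></body></html>')
soup = BeautifulSoup(doc, "html.parser")

print(soup.title)          # <title>My title</title>
print(soup.h1.text)        # A heading
print(soup.find_all("a"))  # [<a href="https://example.com">Link text</a>]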
str(tag) # HTML for this tag and everything inside it
tag.name # name of the tag, e.g. "a" or "ul"
tag.attrs # dict of tag's attributes
tag["href"] # get a single attribute
tag.text # All the text nodes inside tag, concatenated
tag.string # If tag has only text inside it, returns that text
# But if it has other tags as well, returns None
tag.parent # enclosing tag
tag.contents # list of the children of this tag
tag.children # iterable of children of this tag
tag.banana # first descendant banana tag (substitute an actual tag name!)
tag.find(...) # first descendant meeting criteria
tag.find_all(...) # descendants meeting criteria
tag.find_next_sibling(...) # next sibling tag meeting criteria
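A quick demonstration of the difference between .text and .string, using the coffee snippet from before:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>The coffee was <strong>strong</strong>.</p>",
                     "html.parser")
tag = soup.p

print(tag.text)            # The coffee was strong.
print(tag.string)          # None (p contains a tag, not just text)
print(soup.strong.string)  # strong
print(tag.contents)        # ['The coffee was ', <strong>strong</strong>, '.']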
Arguments supported by all the find* methods:
tag.find_all(True) # all descendants
tag.find_all("tagname") # descendants by tag name
tag.find_all(href="https://example.com/") # by attribute
tag.find_all(class_="post") # by class
tag.find_all(re.compile("^fig")) # tag name regex match
tag.find_all("a",limit=15) # first 15 a tags
tag.find_all("a",recursive=False) # all a *children*
These also work with find(), find_next_sibling(), ...
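A small demonstration of these search arguments on a made-up fragment:

import re
from bs4 import BeautifulSoup

doc = '''<div class="post"><a href="https://example.com/">home</a>
<figure><figcaption>Fig 1</figcaption></figure></div>'''
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all(class_="post"))                # the div
print(soup.find_all(href="https://example.com/"))  # the a tag
print(soup.find_all(re.compile("^fig")))           # figure, figcaption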
soup.select(SELECTOR) returns a list of tags that match a CSS selector, e.g.
soup.select(".wide") # all tags of class "wide"
# ul tags within divs of class messagebox
soup.select("div.messagebox ul")
There are many CSS selectors and functions we haven't discussed, so this gives a powerful alternative search syntax.
# all third elements of unordered lists
soup.select("ul > li:nth-of-type(3)")