MCS 275 Spring 2023
Emily Dumas
Reminders and announcements:
Install beautifulsoup4 with
python3 -m pip install beautifulsoup4
Recently, we've talked a lot about making HTTP servers in Python (e.g. web applications).
This week we'll switch to talking about Python as an HTTP client, parsing HTML, and extracting data (scraping).
A Uniform Resource Locator or URL specifies the location of a "resource", such as a document, a data file, or a coffee machine.
Basic structure is
protocol://hostname[:port]/path/filename?name=value&name2=value2
Everything after hostname is optional.
Sample URL:
https://www.dumas.io/teaching/2023/spring/mcs275/slides/lecture36.html
protocol: https
hostname: www.dumas.io
path: /teaching/2023/spring/mcs275/slides/
filename: lecture36.html
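Incidentally, the standard library can split a URL into these components for you. A minimal sketch using urllib.parse (note that path here includes the filename):

from urllib.parse import urlparse

u = urlparse("https://www.dumas.io/teaching/2023/spring/mcs275/slides/lecture36.html")
print(u.scheme)   # https
print(u.netloc)   # www.dumas.io
print(u.path)     # /teaching/2023/spring/mcs275/slides/lecture36.html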
The module urllib can retrieve resources from URLs. E.g., it can open a file if you give it a file:// URL.
Most often it is used to make HTTP and HTTPS GET requests, to retrieve web pages from web servers and data from HTTP APIs.
urllib.request.urlopen(url) retrieves the resource and returns a file-like object.
Response consists of a numeric status code, some headers (an associative array), then a body or payload.
E.g., if you GET a web page, the HTML will be in the body.
There are lots of codes; the first digit gives the category: 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error.
Formal definition of the response structure is in RFC 2616.
For example, the response to GET http://example.com/ has a status code (200 if successful), some headers, and an HTML document as its body.
x = urllib.request.urlopen(URL) returns an object that makes available:
x.status: the numeric status code
x.headers: the response headers
x.read(): the body (or use x wherever a file object is expected)

An HTTP response has several parts, the last of which is the body/payload (an array of bytes).
Often, the body is an HTML document.
An HTML document has several parts, one of which is the body (contained in the tag <body>).
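E.g., a minimal sketch that fetches http://example.com/ and inspects these parts (the exact header values you see will depend on the server):

from urllib.request import urlopen

with urlopen("http://example.com/") as x:
    print(x.status)                   # numeric status code, e.g. 200
    print(x.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
    body = x.read()                   # body/payload as a bytes object

print(len(body), "bytes in the body")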
Use the Bored JSON API to get a suggestion of an activity.
import json
from urllib.request import urlopen

with urlopen("https://www.boredapi.com/api/activity") as r:
    # treat payload as file, process as JSON
    data = json.load(r)

print("Maybe you could...", data["activity"])
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
This gives the body as a bytes object (an array of integers in the range 0 to 255). If you want a string, you need to know the encoding. And it might not be HTML! You can check r.headers.get_content_type() or r.headers["content-type"].
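E.g., a minimal sketch of such a check before treating the body as HTML:

from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    # get_content_type() returns e.g. "text/html"
    if r.headers.get_content_type() == "text/html":
        html_bytes = r.read()
    else:
        raise ValueError("expected HTML, got " + r.headers.get_content_type())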
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
    # Determine encoding from Content-Type header
    # (recommended)
    charset = r.headers.get_content_charset()
    html = html_bytes.decode(charset)
The encoding is usually specified in the Content-Type header, but this is not actually required; if it's absent, get_content_charset() returns None.
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
    # Determine encoding, using utf-8 if the
    # server didn't give a Content-Type header
    charset = r.headers.get_content_charset(failobj="utf-8")
    html = html_bytes.decode(charset)
HTML is a language for making documents, meant to be displayed to humans. Avoid having programs read HTML if at all possible.
Web pages often contain data that might be useful to a computer program.
The same data is often available in a structured format meant for consumption by programs, e.g. through an API that returns a JSON object.
What do you do if there is no API, and you need to extract information from an HTML document?
Sigh with exasperation, then...
Level 0: Treat the HTML document as a string and use search operations (str.find or regexes) to locate something you care about, like <title>.
HTML is complicated, and this approach is very error-prone.
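E.g., a minimal level 0 sketch that pulls out the title with a regex. It is fragile by design: it will miss a tag written as <title lang="en">, a title split across comments, and many other legal variations.

import re
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    charset = r.headers.get_content_charset(failobj="utf-8")
    html = r.read().decode(charset)

# Fragile: assumes the tag appears exactly as <title>...</title>
m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
if m:
    print(m.group(1).strip())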
Level 1: Use a parser that knows how to recognize start/end tags, attributes, etc., and tell it what to do when it finds them (e.g. call this function...)
The module html.parser is in the standard library.
This approach is event-based. You specify functions to handle things when they are found, but you don't get an overall picture of the entire document.
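E.g., a minimal event-based sketch with html.parser that grabs the title text (the handle_* method names are part of the HTMLParser API):

from html.parser import HTMLParser

class TitleFinder(HTMLParser):
    """Event-based parser that records the text inside <title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleFinder()
parser.feed("<html><head><title>Hello</title></head></html>")
print(parser.title)  # Hello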
Level 2: Use a higher-level HTML data extraction framework like Beautiful Soup, Scrapy, or Selenium.
These frameworks create a data structure that represents the entire document, supporting various kinds of searching, traversal, and extraction.
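E.g., a minimal Beautiful Soup sketch (assuming beautifulsoup4 is installed, as in the announcement above):

from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("https://example.com/") as r:
    # Build a tree representing the whole document
    soup = BeautifulSoup(r, "html.parser")

print(soup.title.string)       # text of the <title> tag
for a in soup.find_all("a"):   # every <a> tag in the document
    print(a.get("href"))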