MCS 275 Spring 2024
Emily Dumas
Reminders and announcements:
Please install beautifulsoup4 with:
python3 -m pip install beautifulsoup4
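To check the install worked, this one-liner should print the installed version (assuming a standard setup):

python3 -c "import bs4; print(bs4.__version__)"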
We've talked a lot about making HTTP servers in Python (e.g. with Flask).
Next we'll be using Python as an HTTP client.
Why? To exchange data directly (e.g. JSON APIs) or to fetch HTML for parsing and extraction (scraping).
A Uniform Resource Locator or URL specifies the location of a "resource", such as a document, a data file, or a coffee machine.
Basic structure is
protocol://hostname[:port]/path/filename?name=value&name2=value2
Everything after hostname is optional.
Sample URL:
https://www.dumas.io/teaching/2024/spring/mcs275/slides/lecture36.html
- protocol: https
- hostname: www.dumas.io
- path: /teaching/2024/spring/mcs275/slides/
- filename: lecture36.html
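As an aside (not on the slide), the standard library can split a URL into most of these parts; a quick sketch with urllib.parse:

from urllib.parse import urlparse

u = urlparse("https://www.dumas.io/teaching/2024/spring/mcs275/slides/lecture36.html")
print(u.scheme)  # https
print(u.netloc)  # www.dumas.io
print(u.path)    # /teaching/2024/spring/mcs275/slides/lecture36.html (path+filename)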
In a URL like https://example.com/store/shirts.html, the URL path and filename are /store/shirts.html.
That doesn't mean there is a file with that name on the web server.
Things that might happen include:
- The server might be configured to answer that request with the contents of some local file whose name is unrelated, e.g. /home/webserver/static-pages/prodpage-shirts-v8-spring24.html.
- The server might treat a local directory such as /home/ddumas/public-html as its root, and use the path+filename in the URL to decide where to go therein, in this case sending /home/ddumas/public-html/store/shirts.html.
The urllib module can retrieve resources from URLs.
E.g., it can open a file if you give it a file:// URL.
Most often it is used to make HTTP and HTTPS GET requests.
urllib.request.urlopen(url) submits the request and returns a file-like object allowing us to retrieve the response.
Response consists of a numeric status code, some headers (an associative array), then a body or payload.
E.g., if you GET a web page, the HTML will be in the body.
There are lots of codes¹; the first digit gives the category:
- 1xx: informational
- 2xx: success
- 3xx: redirection
- 4xx: client error
- 5xx: server error
Formal definition of the response structure is in RFC 2616.
¹ See also http.cat
Response to GET http://example.com/:
x = urllib.request.urlopen(URL) returns an object that makes available:
- x.status: the numeric status code
- x.headers: the response headers
- x.read(): the body (or use x wherever a file object is expected)
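A minimal sketch of these in use (the printed values are just illustrative):

import urllib.request

with urllib.request.urlopen("http://example.com/") as x:
    print(x.status)                   # numeric status code, e.g. 200
    print(x.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
    body = x.read()                   # the body/payload, as bytes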
An HTTP response has several parts, the last of which is the body/payload (an array of bytes).
Often, the body is an HTML document.
An HTML document has several parts, one of which is the body (contained in the tag <body>).
Use the Bored JSON API to get a suggestion of an activity.
import json
from urllib.request import urlopen

with urlopen("https://www.boredapi.com/api/activity") as r:
    # treat payload as file, process as JSON
    data = json.load(r)

print("Maybe you could...", data["activity"])
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
This gives the body as a bytes object (an array of integers in the range 0 to 255).
If you want a string, you need to know the encoding.
And it might not be HTML! Can check r.headers.get_content_type() or r.headers["content-type"].
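For instance, a small sketch (not from the slides) that bails out unless the server says the body is HTML:

from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    if r.headers.get_content_type() == "text/html":
        html_bytes = r.read()
    else:
        raise ValueError("expected HTML, got " + r.headers.get_content_type())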
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
    # Determine encoding from Content-Type header
    # (recommended)
    charset = r.headers.get_content_charset()

html = html_bytes.decode(charset)
The encoding is usually specified in the Content-Type header, but this is not actually required.
from urllib.request import urlopen

with urlopen("https://example.com/") as r:
    html_bytes = r.read()
    # Determine encoding, using utf-8 if the
    # server didn't give a Content-Type header
    charset = r.headers.get_content_charset(failobj="utf-8")

html = html_bytes.decode(charset)
HTML is a language for making documents, meant to be displayed to humans. Avoid having programs read HTML if at all possible.
Web pages often contain data that might be useful to a computer program.
The same data is often available in a structured format meant for consumption by programs, e.g. through an API that returns a JSON object.
What do you do if there is no API, and you need to extract information from an HTML document?
Sigh with exasperation, then...
Level 0: Treat the HTML document as a string and use search operations (str.find or regexes) to locate something you care about, like <title>.
HTML is complicated, and this approach is very error-prone.
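For example, a fragile sketch in this style, assuming html is a decoded string as in the earlier snippets; it breaks if the tag has attributes, unusual case or spacing, etc.:

start = html.find("<title>")
end = html.find("</title>")
if start != -1 and end != -1:
    # slice out the text between the tags
    print(html[start + len("<title>"):end])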
Level 1: Use a parser that knows how to recognize start/end tags, attributes, etc., and tell it what to do when it finds them (e.g. call this function...)
The module html.parser is in the standard library.
This approach is event-based. You specify functions to handle things when they are found, but you don't get an overall picture of the entire document.
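A minimal sketch of that event-based style, again assuming html is a decoded string; the handler methods fire as the parser walks the document:

from html.parser import HTMLParser

class TitlePrinter(HTMLParser):
    """Print any text that appears inside a <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            print("Title:", data)

TitlePrinter().feed(html)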
Level 2: Use a higher-level HTML data extraction framework like Beautiful Soup, Scrapy, or Selenium.
These frameworks create a data structure that represents the entire document, supporting various kinds of searching, traversal, and extraction.
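E.g., a small sketch with Beautiful Soup (the beautifulsoup4 package from the start of class), again assuming html is a decoded string:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # parse into a document tree
print(soup.title.string)                   # text of the <title> tag
for a in soup.find_all("a"):               # iterate over all <a> tags
    print(a.get("href"))                   # value of the href attribute (or None)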