MCS 275 Spring 2023
        Emily Dumas
    
Reminders and announcements:
beautifulsoup4 with
        
            python3 -m pip install beautifulsoup4
        APIs that directly serve machine-readable, typed data are the best way to bring data from an external service into your programs.
Extracting data from HTML — a language for making human-readable documents — should be considered a last resort.
We discuss what you can do if:
Level 0: Treat HTML as a string. Do string things.
Level 1: Treat HTML as a stream of tags, attributes, and text.  Have a HTML parser recognize them and tell you what it finds.  html.parser is good for this.
These approaches handle huge documents efficiently, but make nontrivial data extraction quite complex.
Level 2: Use a higher-level HTML data extraction framework like Beautiful Soup, Scrapy, or Selenium.
These frameworks create a data structure that represents the entire document, supporting various kinds of searching, traversal, and extraction.
Note that the whole document needs to fit in memory.
The Document Object Model or DOM is a language-independent model for representing a HTML document as a tree of nodes.
Each node represents part of the document, such as a tag, an attribute, or text appearing inside a tag.
The formal specification has rules for for naming, accessing, and modifying parts of a document. JavaScript fully implements this specification.
<html><head><title>My title</title></head><body><h1>A heading</h1>
<a href="https://example.com">Link text</a></body></html> 
            <p>I <strong>really</strong>like Python.</p> 
            This package provides a module called bs4 for turning HTML into a DOM-like data structure.
Widely used, e.g. at one point Reddit's backend software used it to select a representative image from a web page when a URL appeared in a post*.
Requires an HTML parser.  We'll use html.parser from the standard library, but beautiful soup supports others.
* As of 2014. Perhaps they still use it?
Parse HTML file into DOM:
        from bs4 import BeautifulSoup
        with open("lecture37.html") as fobj:
            soup = BeautifulSoup(fobj,"html.parser")
    Parse web page into DOM:
        from urllib.request import urlopen
        from bs4 import BeautifulSoup
        with urlopen("https://example.com/") as response:
            soup = BeautifulSoup(response,"html.parser")
    Be careful about the ethics of connecting to web servers from programs.
A program that extracts data from HTML is a scraper
A program that visits all pages on a site is a spider.
All forms of automated access should:
Parse string into DOM:
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(
            "The coffee was strong.
",
            "html.parser"
        )
    
        str(soup) # show as HTML
        soup.prettify() # prettier HTML
        soup.title # first (and only) title tag
        soup.p  # first p tag
        soup.find("p") # first p tag (alternative)
        soup.p.strong  # first strong tag within the first p tag
        soup.find_all("a") # list of all a tags
    
        str(tag) # HTML for this tag and everything inside it
        tag.name # name of the tag, e.g. "a" or "ul"
        tag.attrs # dict of tag's attributes
        tag["href"] # get a single attribute
        tag.text # All the text nodes inside tag, concatenated
        tag.string # If tag has only text inside it, returns that text
                   # But if it has other tags as well, returns None
        tag.parent # enclosing tag
        tag.contents # list of the children of this tag
        tag.children # iterable of children of this tag
        tag.banana # first descendant banana tag (sub actual tag name!)
        tag.find(...) # first descendant meeting criteria
        tag.find_all(...) # descendants meeting criteria
        tag.find_next_sibling(...) # next sibling tag meeting criteria
        Arguments supported by all the find* methods:
        tag.find_all(True) # all descendants
        tag.find_all("tagname") # descendants by tag name
        tag.find_all(href="https://example.com/") # by attribute
        tag.find_all(class_="post") # by class
        tag.find_all(re.compile("^fig")) # tag name regex match
        tag.find_all("a",limit=15) # first 15 a tags
        tag.find_all("a",recursive=False) # all a *children*
    Also work with find(), find_next_sibling(), ...
soup.select(SELECTOR) returns a list of tags that match a CSS selector, e.g.
        soup.select(".wide") # all tags of class "wide"
        
        # ul tags within divs of class messagebox
        soup.select("div.messagebox ul")
        There are many CSS selectors and functions we haven't discussed, so this gives a powerful alternative search syntax.
        # all third elements of unordered lists
        soup.select("ul > li:nth-of-type(3)")