A document from MCS 275 Spring 2021, instructor Emily Dumas.

MCS 275 Spring 2021 Lecture 44

Live coding generator examples

This version computes everything before returning a list.

In [5]:
def cubes(n):
    """Return the first n cubes of natural numbers"""
    L = []
    for x in range(1,n+1):
        L.append(x**3)
    return L

This version returns immediately and gives a generator, which produces the items as needed for iteration.

In [6]:
def cubesgen(n):
    """Return the first n cubes of natural numbers, lazily"""
    for x in range(1,n+1):
        yield x**3
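
One way to see the laziness directly is to record when the generator body actually runs. The sketch below is illustrative only: cubesgen_logged and the calls list are hypothetical names added for this demonstration and are not part of the lecture code.

```python
calls = []  # hypothetical log of every x the generator body has processed

def cubesgen_logged(n):
    """Like cubesgen, but note each x in `calls` just before yielding its cube."""
    for x in range(1, n + 1):
        calls.append(x)
        yield x**3

g = cubesgen_logged(3)
before = list(calls)   # snapshot taken after the call but before any iteration
first = next(g)        # runs the body only up to the first yield

print(before)          # []
print(first, calls)    # 1 [1]
```

Calling the function runs none of its body; each call to next runs it just far enough to reach the next yield.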
In [7]:
cubes(5)
Out[7]:
[1, 8, 27, 64, 125]
In [8]:
cubesgen(5)
Out[8]:
<generator object cubesgen at 0x7f79c82a5190>
In [9]:
for y in cubes(5):
    print("Here is a cube:",y)
Here is a cube: 1
Here is a cube: 8
Here is a cube: 27
Here is a cube: 64
Here is a cube: 125

Generators are most often used as iterables directly in for loops:

In [10]:
for y in cubesgen(5):  # MOST COMMON
    print("Here is a cube:",y)
Here is a cube: 1
Here is a cube: 8
Here is a cube: 27
Here is a cube: 64
Here is a cube: 125

But you can also save the return value and request items manually:

In [22]:
g = cubesgen(5)   # MUCH LESS COMMON
In [23]:
next(g)  # now we can single-step the generator object until exhausted
Out[23]:
1
In [24]:
next(g)
Out[24]:
8
In [25]:
next(g)
Out[25]:
27
In [26]:
next(g)
Out[26]:
64
In [27]:
next(g)
Out[27]:
125
In [28]:
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-28-e734f8aca5ac> in <module>
----> 1 next(g)

StopIteration: 
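
When single-stepping by hand, the StopIteration at the end can be avoided by giving next a default value, or handled with try/except. A minimal sketch, repeating the cubesgen definition so the block stands alone:

```python
def cubesgen(n):
    """Return the first n cubes of natural numbers, lazily"""
    for x in range(1, n + 1):
        yield x**3

g = cubesgen(2)
print(next(g, None))  # 1
print(next(g, None))  # 8
print(next(g, None))  # None: the default is returned instead of raising

# The other option is to catch the exception explicitly.
h = cubesgen(0)       # yields nothing at all
try:
    next(h)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)      # True
```

A for loop does this bookkeeping automatically, which is one reason it is the more common way to consume a generator.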
In [52]:
def allcubes():
    """Return the cubes of the natural numbers, lazily"""
    x = 1
    while True:
        yield x**3
        x += 1
In [54]:
# Print all the cubes with at most 4 digits
for y in allcubes():
    if len(str(y))>4:
        break # Essential to have a break, as this iterable is infinite!
    print(y)
1
8
27
64
125
216
343
512
729
1000
1331
1728
2197
2744
3375
4096
4913
5832
6859
8000
9261
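
As an alternative to an explicit break, the standard library's itertools module can cut an infinite generator down to a finite one. This is a sketch of that approach, not something from the lecture itself; the condition y < 10_000 is equivalent to "at most 4 digits".

```python
from itertools import islice, takewhile

def allcubes():
    """Return the cubes of the natural numbers, lazily"""
    x = 1
    while True:
        yield x**3
        x += 1

# takewhile stops at the first item that fails the test.
small = list(takewhile(lambda y: y < 10_000, allcubes()))

# islice stops after a fixed number of items.
first_five = list(islice(allcubes(), 5))

print(len(small), small[-1])  # 21 9261
print(first_five)             # [1, 8, 27, 64, 125]
```

Both helpers are lazy themselves, so wrapping them in list is safe only because the result is finite.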
In [58]:
from bs4 import BeautifulSoup
import os



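def links_in_html_doc(fn):
    """Return the destinations of http(s) links in an HTML file"""
    with open(fn) as infile:
        # Name the parser explicitly; otherwise bs4 emits a warning and
        # uses whichever parser happens to be installed.
        soup = BeautifulSoup(infile, "html.parser")
        for atag in soup.find_all("a"):
            # Some <a> tags have no href (e.g. named anchors);
            # atag["href"] would raise KeyError for those.
            url = atag.get("href")
            if url is not None and url.startswith("http"):
                yield url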

def links_in_html_dir(dirname):
    """Return the destinations of http(s) links in all HTML files
    in the directory specified by `dirname`"""
    for fn in os.listdir(dirname):
        if not fn.endswith(".html"):
            continue
        yield from links_in_html_doc(os.path.join(dirname,fn))
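
The yield from statement used in links_in_html_dir hands iteration over to another iterable and yields each of its items in turn. A small sketch with no file access shows the equivalence with an explicit loop; the function names here are hypothetical and chosen only for illustration.

```python
def evens(n):
    """Yield the even numbers below n."""
    for x in range(0, n, 2):
        yield x

def odds(n):
    """Yield the odd numbers below n."""
    for x in range(1, n, 2):
        yield x

def evens_then_odds(n):
    """Delegate to two other generators, one after the other."""
    yield from evens(n)
    yield from odds(n)

def evens_then_odds_explicit(n):
    """Same items as evens_then_odds, written with plain loops."""
    for x in evens(n):
        yield x
    for x in odds(n):
        yield x

print(list(evens_then_odds(6)))           # [0, 2, 4, 1, 3, 5]
print(list(evens_then_odds_explicit(6)))  # [0, 2, 4, 1, 3, 5]
```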
In [ ]:
# Just print all the links in all the HTML files in one directory

# You'll need to set this to a directory containing some HTML files
# In the course sample code repo, this relative path contains the
# HTML slide presentations for several of our lectures.
HTMLDIR = "web/html-for-scraping"

for link in links_in_html_dir(HTMLDIR):
    print(link)
In [ ]:
# Make a histogram showing most common link destinations
# (assume HTMLDIR is set, as in previous cell)

from collections import defaultdict

hist = defaultdict(int)

for link in links_in_html_dir(HTMLDIR):
    hist[link] += 1

for link,count in sorted(hist.items(),key=lambda pair:-pair[1]):
    print("Appears",count,"times:",link)
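
The same tally could also be built with collections.Counter, which accepts any iterable (including a generator) and has a most_common method for sorting by count. This sketch uses a made-up list of links, sample_links, in place of links_in_html_dir(HTMLDIR) so that it runs without any HTML files.

```python
from collections import Counter

# Hypothetical stand-in for the links a directory scan might produce
sample_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.org/",
    "https://example.com/a",
]

# Counter consumes the generator one item at a time.
hist = Counter(link for link in sample_links)

for link, count in hist.most_common():
    print("Appears", count, "times:", link)
```

With real data, Counter(links_in_html_dir(HTMLDIR)) would replace the defaultdict loop above.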

List and generator comprehensions

In [46]:
[ x**3 for x in range(5) ]
Out[46]:
[0, 1, 8, 27, 64]
In [48]:
( x**3 for x in range(5) )
Out[48]:
<generator object <genexpr> at 0x7f79b8b3ed60>
In [50]:
sum( [ x**3 for x in range(10_000_000) ] ) # uses lots of memory
Out[50]:
2499999500000025000000000000
In [51]:
sum( x**3 for x in range(10_000_000) ) # uses very little memory
Out[51]:
2499999500000025000000000000
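
A rough way to check the memory claim is sys.getsizeof. This is only a sketch: getsizeof reports the size of the container object itself (for a list, the array of references, not the integers it points to), so it understates the list's true footprint, and the exact numbers vary by Python version. A smaller n is used to keep it quick.

```python
import sys

n = 100_000
as_list = [x**3 for x in range(n)]   # all n items exist at once
as_gen = (x**3 for x in range(n))    # no items computed yet

list_size = sys.getsizeof(as_list)   # grows with n
gen_size = sys.getsizeof(as_gen)     # small, and does not depend on n

print(list_size > gen_size)          # True
print(sum(as_list) == sum(as_gen))   # True: same total either way
```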

Generator comprehensions (called generator expressions in the Python documentation) are especially useful with aggregating functions such as any and all, which can stop early. For example, any( GENERATOR_COMPREHENSION ) evaluates to True as soon as the generator yields its first truthy value; later values are never computed. In contrast, any( [ LIST_COMPREHENSION ] ) always builds the entire list before searching for the first truthy value.
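
The early exit can be observed by counting how many times a test function is called. In this sketch, is_big and the checked list are hypothetical helpers introduced only to make the difference visible.

```python
checked = []  # records every value the test function was asked about

def is_big(x):
    """Return True if x > 3, logging each call."""
    checked.append(x)
    return x > 3

# Generator version: stops as soon as is_big(4) returns True.
found_lazy = any(is_big(x) for x in range(10))
lazy_count = len(checked)

checked.clear()

# List version: is_big runs on all ten values before any() sees the list.
found_eager = any([is_big(x) for x in range(10)])
eager_count = len(checked)

print(found_lazy, lazy_count)    # True 5
print(found_eager, eager_count)  # True 10
```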