Excuse requests must be sent to TA before deadline
Python 3 and editor working?
If not, tell me immediately
Worksheet 2 available, Quiz 2 will be posted soon
Storage units
We've discussed the bit (b), a binary digit (0 or 1).
A byte (B) is a sequence of 8 bits, equivalently, an 8-digit binary number or a 2-digit hex number. It can represent an integer between 0=$\texttt{0x00}$ and 255=$\texttt{0xff}$.
A word is a longer sequence of bits of a length fixed by the hardware or operating system. Today, a word usually means 16 bits = 2 bytes.
Computers store information as sequences of bytes.
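The range of a byte can be checked directly in Python. This is a small sketch; the variable names are chosen only for this illustration.

```python
# A byte has 8 bits, so it can take 2**8 distinct values
bits_per_byte = 8
num_values = 2**bits_per_byte
print(num_values)                  # 256

# The largest byte value, written in hex and in binary
largest_hex = 0xff
largest_bin = 0b11111111
print(largest_hex, largest_bin)    # 255 255
```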
Counting bytes to measure the size of data often leads to large numbers.
Coarser units based on SI prefixes:
kilobyte (KB) = 1,000 bytes
megabyte (MB) = 1,000,000 bytes
gigabyte (GB) = 1,000,000,000 bytes
Based on powers of 2 (IEC system), useful in CS:
kibibyte (KiB) = $2^{10}$ bytes = 1024 bytes
mebibyte (MiB) = 1024 KiB = 1,048,576 bytes
gibibyte (GiB) = 1024 MiB = 1,073,741,824 bytes
Unfortunate current reality:
Occasionally, SI abbreviations are used for IEC units; in Windows, "GB" means GiB.
Very often, IEC units are read aloud using SI names; e.g. write 16GiB and read aloud as "16 gigabytes"
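The difference between SI and IEC units can be made concrete with a little arithmetic. In this sketch, the names $\texttt{KB}$, $\texttt{KiB}$, etc. are ordinary variables defined here (not Python built-ins), and the 500 GB drive is a made-up example.

```python
# SI units: powers of 10
KB = 10**3
MB = 10**6
GB = 10**9

# IEC units: powers of 2
KiB = 2**10
MiB = 2**20
GiB = 2**30

# A hypothetical drive advertised as "500 GB" in SI units,
# expressed in GiB (what Windows would label "GB")
drive_bytes = 500 * GB
drive_gib = drive_bytes / GiB
print(round(drive_gib, 1))   # about 465.7
```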
Unicode
Basic problem: How to turn written language into a sequence of bytes?
Unicode (1991) splits this into two steps:
Enumerate characters[1] of most[2] written languages; these are code points
Specify a way of encoding each code point as a sequence of bytes (not discussed today)
[1] There are also code points for many non-character entities, such as an indicator of whether the language is read left-to-right or right-to-left.
[2] Coverage is not perfect and the standard is regularly revised, adding new code points. Unicode 13.0 was released in March 2020.
Every code point has a number (a nonnegative integer between 0 and 0x10ffff=1,114,111).
Code point numbers are always written $\texttt{U+}$ followed by hexadecimal digits.
$\texttt{U+41}$
A
$\texttt{U+109}$
ĉ
$\texttt{U+1f612}$
😒
The first 128 code points, U+0 to U+7F, include all the printable characters on an "en-us" keyboard, numbered exactly as in the older ASCII code (1969).
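Python can convert between characters and code point numbers. The built-ins $\texttt{ord()}$ and $\texttt{chr()}$ are not covered in the text above; this is just a sketch of how the examples could be checked.

```python
# ord() returns the code point number of a one-character str
a_number = ord("A")
print(hex(a_number))             # 0x41, i.e. U+41

# chr() goes the other way: code point number -> one-character str
c_hat = chr(0x109)               # the character at U+109
face = chr(0x1f612)              # the character at U+1f612
print(len(c_hat), len(face))     # each is a single code point: 1 1
```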
Strings
In Python 3, a str is a sequence of code points.
A string literal is a way of writing a str in code.
Several syntaxes are supported:
'Hello world' # single quotes
"Hello world" # double quotes
# multi-line string with triple single quote
'''This is a string
that contains line breaks'''
# multi-line string with triple double quote
"""François: How is MCS 260?
Binali: It's going ok, I guess.
François: [shrugs]"""
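The choice of quotes affects only how the literal is written, not the resulting str. A brief sketch, with variable names chosen just for this illustration:

```python
# Single and double quotes produce the same str
single = 'Hello world'
double = "Hello world"
print(single == double)          # True

# A triple-quoted literal keeps its line breaks
multi = '''This is a string
that contains line breaks'''
lines = multi.splitlines()       # split at the line breaks
print(len(lines))                # 2
```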
Escape sequences
The $\texttt{\\}$ character has special meaning; it begins an escape sequence, such as:
$\texttt{\\n}$ - the newline character
$\texttt{\\'}$ - a single quote
$\texttt{\\"}$ - a double quote
$\texttt{\\\\}$ - a backslash
$\texttt{\\u0107}$ - Code point $\texttt{U+107}$
$\texttt{\\U0001f612}$ - Code point $\texttt{U+1f612}$
>>> print("I \"like\":\n\u0050\u0079\u0074\u0068\u006f\u006e")
I "like":
Python
>>>
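Each escape sequence stands for a single code point, even though it takes several characters to type. One way to see this is with the built-in $\texttt{len()}$, which counts the code points in a str. A sketch:

```python
# The two escaped quotes count as one character each
quote = "I \"like\""
print(len(quote))                # 8

# A newline and a backslash are each one character
newline = "\n"
backslash = "\\"
print(len(newline), len(backslash))   # 1 1

# Code point escapes spell out ordinary characters too
word = "\u0050\u0079\u0074\u0068\u006f\u006e"
print(word == "Python")          # True

# The 8-digit form is needed for code points above U+FFFF
face = "\U0001f612"
print(len(face))                 # 1
```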
Operations on strings
Most arithmetic operations forbid str operands.
$\texttt{+}$ is allowed between two strings. It concatenates the strings (meaning joins them).
$\texttt{*}$ is allowed with a string and an int. It concatenates $n$ copies of the string, where $n$ is the int argument.
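A short sketch of both operators, along with one of the forbidden combinations. The example strings are arbitrary.

```python
# + joins strings end to end
greeting = "Hello" + " " + "world"
print(greeting)                  # Hello world

# * repeats a string a given number of times
rule = "-=" * 5
print(rule)                      # -=-=-=-=-=

# + between a str and an int is not allowed
try:
    "MCS " + 260
except TypeError:
    print("cannot add str and int")
```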
The built-in $\texttt{len()}$ can be applied to a string to find the length of the string (a nonnegative int):
>>> len("MCS 260")
7
A single character from a string $\texttt{s}$ can be extracted using $\texttt{s[i]}$ where $\texttt{i}$ is the $0$-based index. So $0$=first character, $1$=second, etc.
>>> s = "lorem ipsum"
>>> s[2]
'r'
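Because indices start at $0$, the last character sits at index $\texttt{len(s)-1}$, and $\texttt{len(s)}$ itself is out of range. A minimal sketch using the same string:

```python
s = "lorem ipsum"

first = s[0]                # index 0 is the first character
last = s[len(s) - 1]        # the largest valid index is len(s) - 1
print(first, last)          # l m

# Index len(s) is one past the end
try:
    s[len(s)]
except IndexError:
    print("index out of range")
```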
We'll say much more about indexing next time.
int
When converting from a string, $\texttt{int()}$ defaults to base $10$. But it supports other bases as well. The base is given as the second argument of the function.
>>> int("1001",2)
9
>>> int("3e",16)
62
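Notice that integer literal prefixes like $\texttt{0b}$, $\texttt{0x}$, etc. are not needed here: the $\texttt{int()}$ function works with just digits. A prefix is accepted only if it matches the given base, so a prefixed string is rejected in the default base $10$.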
However, if a base of $0$ is specified, then this signals that the string should be read as a Python literal, i.e. the base is determined by its prefix.
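A sketch of base $0$ in action, reusing the values from the examples above. The variable names are chosen only for this illustration.

```python
# With base 0, the prefix decides the base, as in a Python literal
from_bin = int("0b1001", 0)
from_hex = int("0x3e", 0)
from_dec = int("62", 0)          # no prefix means decimal
print(from_bin, from_hex, from_dec)   # 9 62 62

# In the default base 10, a prefixed string is an error
try:
    int("0x3e")
except ValueError:
    print("base 10 rejects the 0x prefix")
```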
Notice $\texttt{a << b}$ is equivalent to $\texttt{a * 2**b}$.
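This equivalence is easy to spot-check. The sketch below tries one pair by hand and then a small range of values; the ranges are arbitrary.

```python
a = 5
b = 3
shifted = a << b
multiplied = a * 2**b
print(shifted, multiplied)       # 40 40

# Check the same identity over a small range of values
always_equal = all(
    (x << n) == x * 2**n
    for x in range(-8, 9)
    for n in range(6)
)
print(always_equal)              # True
```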
Bitwise AND compares corresponding bits, and the output bit is $1$ if both input bits are $1$:
>>> 9 & 5 # 9 = 0b1001, 5 = 0b0101
1
9:   1 0 0 1
5:   0 1 0 1
AND: 0 0 0 1
Bitwise OR is similar, but the output bit is $1$ if at least one of the input bits is $1$.
>>> 9 | 5 # 9 = 0b1001, 5 = 0b0101
13
9:   1 0 0 1
5:   0 1 0 1
OR:  1 1 0 1
Bitwise XOR makes the output bit $1$ if exactly one of the input bits is $1$.
>>> 9 ^ 5 # 9 = 0b1001, 5 = 0b0101
12
9:   1 0 0 1
5:   0 1 0 1
XOR: 1 1 0 0
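The three results can be compared side by side by printing each one as a 4-digit binary number. This sketch uses the built-in $\texttt{format()}$ with the format code $\texttt{04b}$ (binary, padded with zeros to 4 digits), which is not covered above.

```python
a = 9   # 0b1001
b = 5   # 0b0101

results = {
    "AND": a & b,
    "OR": a | b,
    "XOR": a ^ b,
}

# Show each result in binary and in decimal
for name, value in results.items():
    print(name, format(value, "04b"), value)
```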
Logic gates
Circuits that perform logic operations on bits, logic gates, are fundamental building blocks of computers.
Thus the Python operators $\texttt{<<}$, $\texttt{>>}$, $\texttt{&}$, $\texttt{|}$, $\texttt{^}$ are especially low-level operations.
This chip (or integrated circuit / IC) contains four AND gates built from about $50$ transistors. The processor in an iPhone 11 has about $8,\!500,\!000,\!000$ transistors.
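A single AND gate takes two input bits and produces one output bit. As a rough software model (the function $\texttt{and\_gate}$ is a hypothetical name for this sketch, not a description of how the chip is built), its truth table can be generated with the $\texttt{&}$ operator:

```python
def and_gate(x, y):
    """Hypothetical model of a 2-input AND gate acting on single bits."""
    return x & y

# Try every combination of input bits
truth_table = [(x, y, and_gate(x, y)) for x in (0, 1) for y in (0, 1)]

for x, y, out in truth_table:
    print(x, y, out)
```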