Lecture 43=n-1

pandas

MCS 275 Spring 2021
Emily Dumas

Lecture 43: pandas

Course bulletins:

  • Complete your course evaluations
  • Project 4 is due 6pm CDT Friday April 30.
  • The project 4 autograder is now open.
  • Install pandas with
    
                python3 -m pip install pandas
            
  • Today's pandas intro notebook

pandas

Pandas is a module for working with tabular data, i.e. data in a 2D array where each column has a name and fixed data type (e.g. name "year", type integer).

    id   name       atomic mass
    14   Silicon    28.086
    78   Platinum   195.084

In pandas, every row has an identifier called its index, which should normally be unique. It can be just a 0-based integer position (the default), but other types are allowed (e.g. date/time).
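As a small illustration of columns with names and types, and of the index, here is a minimal sketch that builds the element table above by hand. The variable names (`elements`, `by_id`) are chosen for this example only.

```python
import pandas as pd

# The two-row table of elements from above, built by hand.
elements = pd.DataFrame(
    {
        "id": [14, 78],
        "name": ["Silicon", "Platinum"],
        "atomic mass": [28.086, 195.084],
    }
)

print(elements.dtypes)  # each column has one fixed data type
print(elements.index)   # the default index: 0, 1

# Use the "id" column as the index instead of 0, 1, ...
by_id = elements.set_index("id")
print(by_id.loc[78, "name"])  # Platinum
```

After `set_index`, rows are looked up by the values of `id` rather than by position.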

TLDR

Pandas provides a data structure to properly represent the contents of a CSV or spreadsheet in Python.

or

pandas : csv :: bs4 : html.parser
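To make the CSV connection concrete, here is a minimal sketch that reads CSV text into a DataFrame. The CSV content is a stand-in for a file on disk, and `io.StringIO` is used only so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """id,name,atomic mass
14,Silicon,28.086
78,Platinum,195.084
"""

# pd.read_csv accepts a filename or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 3): two rows, three columns
print(list(df.columns))  # ['id', 'name', 'atomic mass']
```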

pandas features

Excellent file format support: CSV, TSV, JSON, XLS, XLSX, SQL, SAS, HTML tables, ...

Searching, filtering, and transformation, with an interface similar to that of numpy.

Excellent interoperability with numpy and matplotlib.

Wide user base, active development. (4 releases since MCS 275 started!)
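A minimal sketch of the filtering, transformation, and numpy interoperability mentioned above. The scores and names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical quiz and exam scores, for illustration only.
scores = pd.DataFrame(
    {"quiz": [7, 10, 4, 9], "exam": [81.0, 93.5, 62.0, 88.0]},
    index=["ana", "ben", "cy", "dee"],
)

# Filtering: a comparison gives a boolean Series, used as a mask.
high = scores[scores["quiz"] >= 9]
print(list(high.index))  # ['ben', 'dee']

# Transformation: arithmetic applies to a whole column at once.
scores["exam_frac"] = scores["exam"] / 100

# numpy interoperability: numpy functions accept columns, and
# to_numpy() converts a column to an ndarray.
print(np.sqrt(scores["quiz"]).loc["cy"])  # 2.0
arr = scores["exam"].to_numpy()
print(type(arr).__name__)  # ndarray
```

The boolean-mask style of indexing is the same idea as in numpy, applied to labeled rows.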

Do I need this?

There are lots of storage formats for tabular data you might consider. Most have Python I/O modules.

Of these, only SQL provides the kind of advanced searching, filtering, etc., that pandas offers. However, SQL syntax is comparatively heavy and not tightly integrated with the Python language. (But SQL is great!)

Another option is to just use a spreadsheet program. Scripts/notebooks offer better formalization, documentation, and reproducibility of analysis, though.
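To illustrate the comparison with SQL, here is a sketch that runs the same filter and sort twice. Using the standard library `sqlite3` module with an in-memory database is an assumption of this example, as are the table and column names.

```python
import sqlite3
import pandas as pd

# Small data set for illustration only.
rows = [("Silicon", 14, 28.086), ("Platinum", 78, 195.084), ("Iron", 26, 55.845)]

# SQL version: the query is a string written in a separate language.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE elements (name TEXT, id INTEGER, mass REAL)")
con.executemany("INSERT INTO elements VALUES (?, ?, ?)", rows)
cursor = con.execute("SELECT name FROM elements WHERE mass > 50 ORDER BY mass")
sql_names = [r[0] for r in cursor]
con.close()

# pandas version: the same filter and sort, written as Python expressions.
df = pd.DataFrame(rows, columns=["name", "id", "mass"])
pd_names = list(df[df["mass"] > 50].sort_values("mass")["name"])

print(sql_names)  # ['Iron', 'Platinum']
print(pd_names)   # ['Iron', 'Platinum']
```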

In defense of SQL

Pandas is for data analysis, usually by one person, working with data that can fit in the memory of one machine. It does not specify a storage format or provide for concurrent access.

For persistent data that an application program will access in a predictable way, you should probably use SQL.

For exploration, visualization, cleaning, and transformation of a small dataset (a few GB max), pandas is an excellent choice.

Template


            import numpy as np
            import pandas as pd
            
            # and optionally
            import matplotlib.pyplot as plt
        

Core pandas concepts

  • index - The identifier (label) of a row, normally unique.
  • pd.Series - A single column of tabular data; behaves like a blend of a numpy array (typed, fast) and a dictionary (index of arbitrary type).
  • pd.DataFrame - A table with named, typed columns and ordered, indexed rows. Equivalent to a collection of series that all share the same index.
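The relationship between these three concepts can be sketched as follows. The element symbols used as index labels are chosen for this example only.

```python
import pandas as pd

# A Series: typed values plus an index (here, strings).
mass = pd.Series({"Si": 28.086, "Pt": 195.084}, name="atomic mass")
print(mass["Pt"])  # dictionary-style lookup by index label
print(mass.dtype)  # float64
doubled = mass * 2  # array-style arithmetic on every entry

number = pd.Series({"Si": 14, "Pt": 78}, name="id")

# A DataFrame built from Series that share the same index.
df = pd.DataFrame({"id": number, "atomic mass": mass})
print(df.loc["Pt", "id"])       # 78
print(type(df["id"]).__name__)  # Series: each column is a Series
```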

Special data types


        pd.Timestamp   # like datetime.datetime but can autoparse
        pd.Timedelta   # like datetime.timedelta
        pd.Categorical # enumerated type (fixed set of possible values)
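A short sketch of how these three types behave. The dates and grade values are examples only.

```python
import pandas as pd

# Timestamp parses many common date/time string formats.
t = pd.Timestamp("2021-04-30 18:00")
print(t.year, t.month, t.day, t.hour)  # 2021 4 30 18

# Subtracting two Timestamps gives a Timedelta.
dt = t - pd.Timestamp("2021-04-28 18:00")
print(dt.days)  # 2

# Adding a Timedelta to a Timestamp gives a Timestamp.
print(t + pd.Timedelta(hours=6))  # 2021-05-01 00:00:00

# Categorical stores values drawn from a fixed set of categories.
grades = pd.Categorical(["A", "C", "A", "B"], categories=["A", "B", "C", "D"])
print(list(grades.categories))  # ['A', 'B', 'C', 'D']
print(grades.codes.tolist())    # [0, 2, 0, 1]
```

Internally, a Categorical keeps one small integer code per entry, which is what `codes` shows.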
    

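        import numpy as np
        import pandas as pd
        
        # Read entire CSV file into a dataframe.
        # index_col=0 assumes the first column of the file holds student
        # names, so that rows can be looked up by name below. Without it,
        # the index is 0, 1, 2, ... and the name lookups raise KeyError.
        df = pd.read_csv("mcs275gradebook.csv", index_col=0)

        # Access elements
        df["Quiz 11"] # one column (a pd.Series)
        df["Quiz 11"]["Emily Dumas"] # entry of that column in one row
        df.loc["Emily Dumas"] # one row, selected by index label
        df.iloc[2] # third row, selected by position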
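The access patterns above look rows up by name, which works when the names are the index. Here is a self-contained sketch of the same patterns. The gradebook contents and the column name `Name` are made up, since the real file is not shown here.

```python
import io
import pandas as pd

# Hypothetical gradebook contents; names and scores are made up.
csv_text = """Name,Quiz 10,Quiz 11
Student A,8,9
Student B,10,7
Student C,6,10
"""

# index_col="Name" makes the names the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(df["Quiz 11"]["Student B"])      # 7: column first, then row label
print(df.loc["Student B", "Quiz 11"])  # 7: row label and column in one step
print(df.loc["Student C"]["Quiz 10"])  # 6: row first, then column
print(df.iloc[2].name)                 # Student C: third row by position
```

Passing both the row label and the column name to `.loc` does the lookup in a single step, which is usually preferable to chaining two sets of brackets.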
    

Revision history

  • 2021-04-28 Notebook link
  • 2021-04-28 Initial publication