Pandas Practice Lab¶

The excitement begins!¶

As a reminder:

pd == pandas and df == dataframe

pandas functions¶

The five key pandas functions that allow you to solve “the vast majority of your data-manipulation challenges” are:

df.query() -- select observations by their values
df.sort_values() -- Reorder the rows
df.loc[] -- Pick variables by their column names
df.assign() -- Create new variables with functions of existing variables
df.describe() -- Collapse many values down to a single summary

Bonus: df.groupby() -- we can use this in combination with any of the five to change the scope of each function from operating on the entire dataset to operating on it group by group

A note on df.groupby(): if you want to get multiple results for various groups (e.g. income categories), you could use df.groupby(by='column'). This all happens in the background and results in a grouped object; sometimes, this means you need to reorganize your functions.

For more information, check out the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

# import libraries
import pandas as pd
import os
os.getcwd()

'/workspaces/MLI/Part_II'

Instructions¶

For this assignment you can use any or all of the datasets I provided (described below) or choose two of your own!

For each of the two datasets you select, produce four parts:

Question
Psuedocode that answers the question
Single line transactions using pandas functions
Pipe the individual lines together and create a function

You can work with your groups for coding tips/advise or work through similar programming issues, but everyone must have their own set of questions and results.

Data¶

Any dataset in the class repo works or:

nf2008_fg: http://users.stat.ufl.edu/~winner/data/nfl2008_fga.csv

red_wine_quality: https://data.world/uci/wine-quality

Example¶

Lets walk through a example. Below is a dataset on students admitted to a graduate program in Data Science (I actually don’t remember where this data came from).

Question: Are there differences in the admitted number of students on average by GPA and school rank?¶

An easy way to handle this is to identify function calls starting at the end of the question and working backwards.

Pseudocode:¶

Group by school rank
Calculate GPA Average
Count total admitted
Calculate % Admitted
Arrange in descending order by Mean GPA

# Loading and preparing the data
admit_2 = pd.read_csv("/workspaces/MLI/data/LogReg.csv")

print(admit_2.info())   # .info() outputs the structure of the dataframe object. It includes the columns, the data types in each, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   admit   400 non-null    int64  
 1   gre     400 non-null    int64  
 2   gpa     400 non-null    float64
 3   rank    400 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 12.6 KB
None

Next, we want to factorize rank. This will allow better data analysis because there are only a few options for it. In R, these are called “levels”. In Python, these are “categorical” variables. Below, we’ll run through how to convert specific columns to categoricals.

 #change rank to a categorical variable
admit_2['rank'] = admit_2['rank'].astype("category")   # here, we assign the newly-assigned vars

print(admit_2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   admit   400 non-null    int64   
 1   gre     400 non-null    int64   
 2   gpa     400 non-null    float64 
 3   rank    400 non-null    category
dtypes: category(1), float64(1), int64(2)
memory usage: 10.1 KB
None

Now, for the next portion: single-line calculations using the pseudocode as a plan¶

# look at the data:
admit_2.head()    # this shows a small subset of the data; .head(x) shows x number of entries

Group by rank and assign the grouped values to a new dataframe

x1 = admit_2.groupby(['rank'])   # this is done in the background

Next, calculate our additional variables (sum of admitted students, sum of students, and the mean GPA of the rankings in each category of rank. For this, we will be constructing a new dataframe.

There is one issue: you can’t use .assign() on a grouped dataframe. There are two possible solutions for this:

Make a call to an earlier iteration of the dataframe
Initialize an empty dataframe and assign the new variables there.

First, I’ll show you what happens in the first instance. You’ll want to use this IF you want to keep all the previous columns.

x2 = admit_2.assign(admitted= x1.admit.sum(), 
                    tot_students=x1.admit.count(),
                    meanGPA=x1.gpa.mean())

print(x2)

     admit  gre   gpa  rank  admitted  tot_students   meanGPA
0        0  380  3.61     3       NaN           NaN       NaN
1        1  660  3.67     3      33.0          61.0  3.453115
2        1  800  4.00     1      54.0         151.0  3.361656
3        1  640  3.19     4      28.0         121.0  3.432893
4        0  520  2.93     4      12.0          67.0  3.318358
..     ...  ...   ...   ...       ...           ...       ...
395      0  620  4.00     2       NaN           NaN       NaN
396      0  560  3.04     3       NaN           NaN       NaN
397      0  460  2.63     2       NaN           NaN       NaN
398      0  700  3.65     2       NaN           NaN       NaN
399      0  600  3.89     3       NaN           NaN       NaN

[400 rows x 7 columns]

Let’s try the other way.

temp_df = pd.DataFrame()
temp_df = temp_df.assign(admitted= x1.admit.sum(), 
                    tot_students=x1['rank'].count(), 
                    meanGPA=x1.gpa.mean())
temp_df.head()

The second method worked much better. But let’s say we used the first one -- how could we fix the issue?

This was messier than we would’ve preferred. As we can see, we’ve assigned new variables to x2, but it also uses all the other variables from x1. Now, we’ll want to select only the values we want.

There are a few ways we could do this:

Use slicing to select your columns and rows
Use slicing to select your columns and then remove all the NaNs

x2 = x2[['admitted', 'tot_students', 'meanGPA']]   # selecting columns
x2.head(10)

Now, we see the values, but we only want rows 1-4, which correspond to the rank factors we found earlier. To do this, we can either filter out NaNs or select those specific rows. First, let’s try slicing.

x3 = x2[1:5]
x3.head(10)

Now, let’s try removing the NaNs instead:

x4 = x2.dropna()   # dropna() is a simplified function of .drop(df.isna())
x4.head()

Both work!!

Now, we want to calculate the % of students from various ranks that were admitted. We’ll create a new variable to store the information.

x3 = x3.assign(ave_admit = lambda x: (x['admitted'] / x['tot_students']))
x3.head()

Now, let’s arrange the rows by descending GPA.

x3 = x3.sort_values(by='meanGPA', ascending=False)
x3.head()

Now, we pipe everything together and make a function

Function construction¶

def avg_admission(x):
    x1 = x.groupby(['rank'])
    avg = pd.DataFrame()
    avg = avg.assign(admitted= x1['admit'].sum(), 
                    tot_students=x1['rank'].count(),
                    meanGPA=x1['gpa'].mean()).assign(ave_admit = lambda temp: (temp['admitted'] / temp['tot_students'])).sort_values(by='meanGPA', ascending=False)
    return avg
# move right to see the .sort_values()

As you saw above, some functions were added to each other (specifically, the .assign().assign().sort_values) all happened sequentially. This is called chaining. You could separate for clarity, but this is also a pretty efficient use of space in a program.

print(avg_admission(admit_2))

      admitted  tot_students   meanGPA  ave_admit
rank                                             
1           33            61  3.453115   0.540984
3           28           121  3.432893   0.231405
2           54           151  3.361656   0.357616
4           12            67  3.318358   0.179104

Note: above, I used .dropna() as a simplified version of the work I did earlier. Since all other values are NaNs, using .dropna() removes all other values. Likewise, .loc[:, [columns]] selected only the columns that I was working with and all rows that those columns cover.

For further clarification, don’t forget to leave comments on your code, but you also don’t need to go in as much depth exploring different methods like I did! This was just to give you further examples for your personal reference.

Getting Back Up to Coding Speed

Fun with functions and pandas!

Data Preparation and ML Pipeline

Data Preparation for a Machine Learning