Answer all four sets of questions
Q0. Please answer the following questions in your own words.
- What makes a model “linear”? “Linear” in what?
- How do you interpret the coefficient for a dummy/one-hot-encoded variable? (This is a trick question, and the trick involves how you handle the intercept of the model.) There’s further explanation at the end of this document, if needed.
- Can linear regression be used for classification? Explain why, or why not.
- If you have high accuracy on the training set but low accuracy on the test set, what might be the problem? It’s also possible you’ll see a pattern in the residuals.
- Review this page: Non-Linear with Linear. What are two ways to incorporate nonlinear relationships between your target/response/dependent/outcome variable and your features/control/independent variables?
- What is the interpretation of the intercept? A slope coefficient for a variable? The coefficient for a dummy/one-hot-encoded variable?
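The dummy/intercept question above can be seen in a toy sketch. This is a minimal illustration on made-up data (the borough names and prices are hypothetical, not the assignment's dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price by borough
df = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Queens", "Queens"],
    "price": [100.0, 110.0, 150.0, 170.0],
})

# No intercept, keep every dummy column: each coefficient is that group's mean
X_all = pd.get_dummies(df["borough"])
no_intercept = LinearRegression(fit_intercept=False).fit(X_all, df["price"])

# With an intercept, drop one dummy to avoid the dummy variable trap:
# the intercept is the dropped (baseline) group's mean, and each remaining
# coefficient is that group's difference from the baseline
X_drop = pd.get_dummies(df["borough"], drop_first=True)
with_intercept = LinearRegression().fit(X_drop, df["price"])
```

Here `no_intercept.coef_` recovers the group means (105 and 160), while `with_intercept` gives an intercept of 105 (the Bronx mean) and a single coefficient of 55 (Queens minus Bronx).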
Q1. Load the cleaned data q1_clean: https://

The data include:

- `Price`: The price per night
- `Review Scores Rating`: The average rating for the property
- `Neighborhood `: The borough of NYC. Note the trailing space in the name, or rename the variable.
- `Property Type`: The kind of dwelling
- `Room Type`: The kind of space being rented
1. Compute the average prices and scores by `Neighborhood`; which borough is the most expensive on average? Create a kernel density plot of price and log price, grouping by `Neighborhood`.
2. Regress price on `Neighborhood` by creating the appropriate dummy/one-hot-encoded variables, without an intercept in the linear model. Compare the coefficients in the regression to the table from part 1. What pattern do you see? What are the coefficients in a regression of a continuous variable on one categorical variable?
3. Repeat part 2, but leave an intercept in the linear model. How do you have to handle the creation of the dummies differently? What is the intercept? Interpret the coefficients. How can you get the coefficients in part 2 from these new coefficients?
4. Split the sample 80/20 into a training and a test set. Run a regression of `Price` on `Review Scores Rating` and `Neighborhood`. What is the R² and RMSE on the test set? What is the coefficient on `Review Scores Rating`? What is the most expensive kind of property you can rent?
5. Run a regression of `Price` on `Review Scores Rating`, `Neighborhood`, and `Property Type`. What is the R² and RMSE on the test set? What is the coefficient on `Review Scores Rating`? What is the most expensive kind of property you can rent?
6. What does the coefficient on `Review Scores Rating` mean if it changes from part 4 to part 5? Hint: Think about how multiple linear regression works.
7. (Optional) We’ve included `Neighborhood` and `Property Type` separately in the model. How do you interact them, so you can have “a bedroom in Queens” or “a townhouse in Manhattan”? Split the sample 80/20 into a training and a test set and run a regression including that kind of “property type × neighborhood” dummy, plus `Review Scores Rating`. How do the slope coefficient for `Review Scores Rating`, the R², and the RMSE change? Do they increase significantly compared to part 5? Are the coefficients in this regression just the sum of the coefficients for `Neighborhood` and `Property Type` from part 5? What is the most expensive kind of property you can rent?
Q2. This question is a case study for linear models. The data are about car prices. In particular, they include:
- `Price`, `Color`, `Seating_Capacity`
- `Body_Type`: crossover, hatchback, muv, sedan, suv
- `Make`, `Make_Year`: The brand of car and year produced
- `Mileage_Run`: The number of miles on the odometer
- `Fuel_Type`: Diesel or gasoline/petrol
- `Transmission`, `Transmission_Type`: speeds and automatic/manual
1. Load `cars_hw.csv`. These data were really dirty, and I’ve already cleaned them a significant amount in terms of missing values and other issues, but some issues remain (e.g. outliers, badly skewed variables that require a log or arcsinh transformation). Note this is different from normalizing: there is a text below that explains further. Clean the data however you think is most appropriate.
2. Summarize the `Price` variable and create a kernel density plot. Use `.groupby()` and `.describe()` to summarize prices by brand (`Make`). Make a grouped kernel density plot by `Make`. Which car brands are the most expensive? What do prices look like in general?
3. Split the data into an 80% training set and a 20% testing set.
4. Make a model where you regress price on the numeric variables alone; what is the R² and RMSE on the training set and test set? Make a second model where, for the categorical variables, you regress price on one-hot-encoded regressors/features alone (you can use `pd.get_dummies()`; be careful of the dummy variable trap); what is the R² and RMSE on the test set? Which model performs better on the test set? Make a third model that combines all the regressors from the previous two; what is the R² and RMSE on the test set? Does the joint model perform better or worse, and by how much?
5. Use the `PolynomialFeatures` function from `sklearn` to expand the set of numerical variables you’re using in the regression. As you increase the degree of the expansion, how do the R² and RMSE change? At what point does R² go negative on the test set? For your best model with expanded features, what is the R² and RMSE? How does it compare to your best model from part 4?
6. For your best model so far, determine the predicted values for the test data and plot them against the true values. Do the predicted values and true values roughly line up along the diagonal, or not? Compute the residuals/errors for the test data and create a kernel density plot. Do the residuals look roughly bell-shaped around zero? Evaluate the strengths and weaknesses of your model.
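The `PolynomialFeatures` loop in part 5 can be sketched on synthetic data (the real numeric features come from `cars_hw.csv`; the feature values and loop bounds here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic numeric features standing in for the car data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

scores = {}
for degree in (1, 2, 3):
    # Expand the numeric features to all polynomial terms up to `degree`
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    pred = model.predict(poly.transform(X_test))
    scores[degree] = (r2_score(y_test, pred),
                      mean_squared_error(y_test, pred) ** 0.5)
```

Because the synthetic target has a quadratic term, the degree-2 expansion improves test-set R² and RMSE over degree 1; at high enough degrees on real data, overfitting can push test-set R² negative.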
Q3.
- Find a dataset on a topic you’re interested in. Some easy options are data.gov, kaggle.com, and data.world.
- Clean the data and do some exploratory data analysis on key variables that interest you. Pick a particular target/outcome variable and features/predictors.
- Split the sample into an ~80% training set and a ~20% test set.
- Run a few regressions of your target/outcome variable on a variety of features/predictors. Compute the RMSE on the test set.
- Which model performed the best, and why?
- What did you learn?
```python
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)

# Do not use fit_intercept=False if you have removed one dummy
# column after dummy encoding (to avoid the dummy variable trap)
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
```
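To get the test-set R² and RMSE asked for throughout, the starter code above can be extended as follows. This is a self-contained sketch on synthetic `X` and `Y`; substitute your own design matrix and target:

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and Y (your data will differ)
rng = np.random.default_rng(40)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)

# Test-set fit metrics
r2 = r2_score(Y_test, predicted)
rmse = mean_squared_error(Y_test, predicted) ** 0.5
```

Taking the square root of `mean_squared_error` gives the RMSE directly, which keeps the metric in the same units as the target.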