{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Regression Kaggle competition\n",
"aka Alex's recommendations and best practices for model tuning. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I made this notebook as a warmup for Kaggle-style data science. \n",
"Before you get into Kaggle-style model tuning there are more important questions that a data scientist should ask:\n",
"1. How did I get this data? Is there some bias or systematic error in this dataset that may mislead me? Can I get more or better data, for example by finding or scraping another related data source? The quality of the data is the most important thing: your insights can only be as good as your data, so this is usually the most important step in real life. \n",
"2. What am I trying to achieve? What is the business-related metric I should be using? \n",
"3. Once you have decided on the dataset and the metric, you get into Kaggle-style model tuning. This is what this notebook is about. \n",
"Note also that a few competitions allow you to combine external datasets. We will study this topic later. \n",
"\n",
"This notebook requires the data files from the Advanced Regression Kaggle competition:\n",
"https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n",
"\n",
"\n",
"For Kaggle-style model tuning, here are some recommended readings:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/apapiu/regularized-linear-models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Let's start by importing some of our standard tools\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import skew\n",
"from scipy.stats import pearsonr\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Let's read the data\n",
"train = pd.read_csv(\"train.csv\")\n",
"test = pd.read_csv(\"test.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1460, 81)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The first thing you always do is LOOK at the data. Let's start with shapes\n",
"train.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(1459, 80)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, there are 1460 labeled examples in the training \n",
"set and 1459 in the test set. Let's look more carefully."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n",
"0 1461 20 RH 80.0 11622 Pave NaN Reg \n",
"1 1462 20 RL 81.0 14267 Pave NaN IR1 \n",
"2 1463 60 RL 74.0 13830 Pave NaN IR1 \n",
"3 1464 60 RL 78.0 9978 Pave NaN IR1 \n",
"4 1465 120 RL 43.0 5005 Pave NaN IR1 \n",
"\n",
" LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence \\\n",
"0 Lvl AllPub ... 120 0 NaN MnPrv \n",
"1 Lvl AllPub ... 0 0 NaN NaN \n",
"2 Lvl AllPub ... 0 0 NaN MnPrv \n",
"3 Lvl AllPub ... 0 0 NaN NaN \n",
"4 HLS AllPub ... 144 0 NaN NaN \n",
"\n",
" MiscFeature MiscVal MoSold YrSold SaleType SaleCondition \n",
"0 NaN 0 6 2010 WD Normal \n",
"1 Gar2 12500 6 2010 WD Normal \n",
"2 NaN 0 3 2010 WD Normal \n",
"3 NaN 0 6 2010 WD Normal \n",
"4 NaN 0 1 2010 WD Normal \n",
"\n",
"[5 rows x 80 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.head() #You should always do that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row is a house, and we have features like MSSubClass, MSZoning, LotFrontage (that one is easy to understand) and many others. It is important to understand what your features are. For this competition the file data_description.txt explains, for example, that MSSubClass is the type of dwelling involved in the sale, e.g. code 20 is a 1-story house built in 1946 or newer, while code 30 is a 1-story house built in 1945 or older, etc. Similarly, MSZoning being 'RL' means 'Residential Low Density' while 'RH' means 'Residential High Density', etc. \n",
"\n",
"We need to feed these features into models, so we have to go from categorical features to vectors of numbers. There are many ways to encode categorical features and we will talk about this in class. The standard, easy first method is called one-hot encoding (aka dummy variables): you replace one categorical feature with multiple columns that are 0 or 1. For MSZoning we will be adding columns meaning MSZoning_Is_RH, MSZoning_Is_RL, etc. \n",
"\n",
"You can write manual code to do this, but pandas gives you the method get_dummies which is convenient. "
]
},
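{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what one-hot encoding does, here is `pd.get_dummies` applied to a tiny made-up frame. The `toy` dataframe and its values are invented for illustration and are not read from the competition files. The categorical column should be replaced by one indicator column per value, while the numerical column is left alone. Depending on the pandas version, the indicators may be shown as 0/1 or as True/False.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from train.csv or test.csv\n",
"toy = pd.DataFrame({'MSZoning': ['RH', 'RL', 'RL'],\n",
"                    'LotArea': [11622, 14267, 13830]})\n",
"\n",
"# Object (string) columns are encoded; numerical columns pass through unchanged\n",
"toy_dum = pd.get_dummies(toy)\n",
"print(toy_dum)\n",
"```\n",
"\n",
"Running the same call on the whole dataframe, as in the next cell, applies this to every string-valued column at once."
]
},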
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"testdum = pd.get_dummies(test)  # OK, let's try to run it to see what it does. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1459, 271)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testdum.shape #And then we look"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The test dataframe had 1459 examples and we still have 1459 rows, so that is good news. \n",
"The 80 features have been replaced by 271 features. \n",
"Each categorical feature has been expanded into one column per value it takes, \n",
"while the numerical features have been left alone. Let's look more carefully.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Id MSSubClass LotFrontage LotArea OverallQual OverallCond \\\n",
"0 1461 20 80.0 11622 5 6 \n",
"1 1462 20 81.0 14267 6 6 \n",
"2 1463 60 74.0 13830 5 5 \n",
"3 1464 60 78.0 9978 6 6 \n",
"4 1465 120 43.0 5005 8 5 \n",
"\n",
" YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... \\\n",
"0 1961 1961 0.0 468.0 ... \n",
"1 1958 1958 108.0 923.0 ... \n",
"2 1997 1998 0.0 791.0 ... \n",
"3 1998 1998 20.0 602.0 ... \n",
"4 1992 1992 0.0 263.0 ... \n",
"\n",
" SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD \\\n",
"0 0 0 0 1 \n",
"1 0 0 0 1 \n",
"2 0 0 0 1 \n",
"3 0 0 0 1 \n",
"4 0 0 0 1 \n",
"\n",
" SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" SaleCondition_Family SaleCondition_Normal SaleCondition_Partial \n",
"0 0 1 0 \n",
"1 0 1 0 \n",
"2 0 1 0 \n",
"3 0 1 0 \n",
"4 0 1 0 \n",
"\n",
"[5 rows x 271 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testdum.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You see that the column MSZoning is gone but the numerical columns remain. The one-hot columns have been appended at the end, so you need to go and check that they are there under the names you expect."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Id',\n",
" 'MSSubClass',\n",
" 'LotFrontage',\n",
" 'LotArea',\n",
" 'OverallQual',\n",
" 'OverallCond',\n",
" 'YearBuilt',\n",
" 'YearRemodAdd',\n",
" 'MasVnrArea',\n",
" 'BsmtFinSF1',\n",
" 'BsmtFinSF2',\n",
" 'BsmtUnfSF',\n",
" 'TotalBsmtSF',\n",
" '1stFlrSF',\n",
" '2ndFlrSF',\n",
" 'LowQualFinSF',\n",
" 'GrLivArea',\n",
" 'BsmtFullBath',\n",
" 'BsmtHalfBath',\n",
" 'FullBath',\n",
" 'HalfBath',\n",
" 'BedroomAbvGr',\n",
" 'KitchenAbvGr',\n",
" 'TotRmsAbvGrd',\n",
" 'Fireplaces',\n",
" 'GarageYrBlt',\n",
" 'GarageCars',\n",
" 'GarageArea',\n",
" 'WoodDeckSF',\n",
" 'OpenPorchSF',\n",
" 'EnclosedPorch',\n",
" '3SsnPorch',\n",
" 'ScreenPorch',\n",
" 'PoolArea',\n",
" 'MiscVal',\n",
" 'MoSold',\n",
" 'YrSold',\n",
" 'MSZoning_C (all)',\n",
" 'MSZoning_FV',\n",
" 'MSZoning_RH',\n",
" 'MSZoning_RL',\n",
" 'MSZoning_RM',\n",
" 'Street_Grvl',\n",
" 'Street_Pave',\n",
" 'Alley_Grvl',\n",
" 'Alley_Pave',\n",
" 'LotShape_IR1',\n",
" 'LotShape_IR2',\n",
" 'LotShape_IR3',\n",
" 'LotShape_Reg',\n",
" 'LandContour_Bnk',\n",
" 'LandContour_HLS',\n",
" 'LandContour_Low',\n",
" 'LandContour_Lvl',\n",
" 'Utilities_AllPub',\n",
" 'LotConfig_Corner',\n",
" 'LotConfig_CulDSac',\n",
" 'LotConfig_FR2',\n",
" 'LotConfig_FR3',\n",
" 'LotConfig_Inside',\n",
" 'LandSlope_Gtl',\n",
" 'LandSlope_Mod',\n",
" 'LandSlope_Sev',\n",
" 'Neighborhood_Blmngtn',\n",
" 'Neighborhood_Blueste',\n",
" 'Neighborhood_BrDale',\n",
" 'Neighborhood_BrkSide',\n",
" 'Neighborhood_ClearCr',\n",
" 'Neighborhood_CollgCr',\n",
" 'Neighborhood_Crawfor',\n",
" 'Neighborhood_Edwards',\n",
" 'Neighborhood_Gilbert',\n",
" 'Neighborhood_IDOTRR',\n",
" 'Neighborhood_MeadowV',\n",
" 'Neighborhood_Mitchel',\n",
" 'Neighborhood_NAmes',\n",
" 'Neighborhood_NPkVill',\n",
" 'Neighborhood_NWAmes',\n",
" 'Neighborhood_NoRidge',\n",
" 'Neighborhood_NridgHt',\n",
" 'Neighborhood_OldTown',\n",
" 'Neighborhood_SWISU',\n",
" 'Neighborhood_Sawyer',\n",
" 'Neighborhood_SawyerW',\n",
" 'Neighborhood_Somerst',\n",
" 'Neighborhood_StoneBr',\n",
" 'Neighborhood_Timber',\n",
" 'Neighborhood_Veenker',\n",
" 'Condition1_Artery',\n",
" 'Condition1_Feedr',\n",
" 'Condition1_Norm',\n",
" 'Condition1_PosA',\n",
" 'Condition1_PosN',\n",
" 'Condition1_RRAe',\n",
" 'Condition1_RRAn',\n",
" 'Condition1_RRNe',\n",
" 'Condition1_RRNn',\n",
" 'Condition2_Artery',\n",
" 'Condition2_Feedr',\n",
" 'Condition2_Norm',\n",
" 'Condition2_PosA',\n",
" 'Condition2_PosN',\n",
" 'BldgType_1Fam',\n",
" 'BldgType_2fmCon',\n",
" 'BldgType_Duplex',\n",
" 'BldgType_Twnhs',\n",
" 'BldgType_TwnhsE',\n",
" 'HouseStyle_1.5Fin',\n",
" 'HouseStyle_1.5Unf',\n",
" 'HouseStyle_1Story',\n",
" 'HouseStyle_2.5Unf',\n",
" 'HouseStyle_2Story',\n",
" 'HouseStyle_SFoyer',\n",
" 'HouseStyle_SLvl',\n",
" 'RoofStyle_Flat',\n",
" 'RoofStyle_Gable',\n",
" 'RoofStyle_Gambrel',\n",
" 'RoofStyle_Hip',\n",
" 'RoofStyle_Mansard',\n",
" 'RoofStyle_Shed',\n",
" 'RoofMatl_CompShg',\n",
" 'RoofMatl_Tar&Grv',\n",
" 'RoofMatl_WdShake',\n",
" 'RoofMatl_WdShngl',\n",
" 'Exterior1st_AsbShng',\n",
" 'Exterior1st_AsphShn',\n",
" 'Exterior1st_BrkComm',\n",
" 'Exterior1st_BrkFace',\n",
" 'Exterior1st_CBlock',\n",
" 'Exterior1st_CemntBd',\n",
" 'Exterior1st_HdBoard',\n",
" 'Exterior1st_MetalSd',\n",
" 'Exterior1st_Plywood',\n",
" 'Exterior1st_Stucco',\n",
" 'Exterior1st_VinylSd',\n",
" 'Exterior1st_Wd Sdng',\n",
" 'Exterior1st_WdShing',\n",
" 'Exterior2nd_AsbShng',\n",
" 'Exterior2nd_AsphShn',\n",
" 'Exterior2nd_Brk Cmn',\n",
" 'Exterior2nd_BrkFace',\n",
" 'Exterior2nd_CBlock',\n",
" 'Exterior2nd_CmentBd',\n",
" 'Exterior2nd_HdBoard',\n",
" 'Exterior2nd_ImStucc',\n",
" 'Exterior2nd_MetalSd',\n",
" 'Exterior2nd_Plywood',\n",
" 'Exterior2nd_Stone',\n",
" 'Exterior2nd_Stucco',\n",
" 'Exterior2nd_VinylSd',\n",
" 'Exterior2nd_Wd Sdng',\n",
" 'Exterior2nd_Wd Shng',\n",
" 'MasVnrType_BrkCmn',\n",
" 'MasVnrType_BrkFace',\n",
" 'MasVnrType_None',\n",
" 'MasVnrType_Stone',\n",
" 'ExterQual_Ex',\n",
" 'ExterQual_Fa',\n",
" 'ExterQual_Gd',\n",
" 'ExterQual_TA',\n",
" 'ExterCond_Ex',\n",
" 'ExterCond_Fa',\n",
" 'ExterCond_Gd',\n",
" 'ExterCond_Po',\n",
" 'ExterCond_TA',\n",
" 'Foundation_BrkTil',\n",
" 'Foundation_CBlock',\n",
" 'Foundation_PConc',\n",
" 'Foundation_Slab',\n",
" 'Foundation_Stone',\n",
" 'Foundation_Wood',\n",
" 'BsmtQual_Ex',\n",
" 'BsmtQual_Fa',\n",
" 'BsmtQual_Gd',\n",
" 'BsmtQual_TA',\n",
" 'BsmtCond_Fa',\n",
" 'BsmtCond_Gd',\n",
" 'BsmtCond_Po',\n",
" 'BsmtCond_TA',\n",
" 'BsmtExposure_Av',\n",
" 'BsmtExposure_Gd',\n",
" 'BsmtExposure_Mn',\n",
" 'BsmtExposure_No',\n",
" 'BsmtFinType1_ALQ',\n",
" 'BsmtFinType1_BLQ',\n",
" 'BsmtFinType1_GLQ',\n",
" 'BsmtFinType1_LwQ',\n",
" 'BsmtFinType1_Rec',\n",
" 'BsmtFinType1_Unf',\n",
" 'BsmtFinType2_ALQ',\n",
" 'BsmtFinType2_BLQ',\n",
" 'BsmtFinType2_GLQ',\n",
" 'BsmtFinType2_LwQ',\n",
" 'BsmtFinType2_Rec',\n",
" 'BsmtFinType2_Unf',\n",
" 'Heating_GasA',\n",
" 'Heating_GasW',\n",
" 'Heating_Grav',\n",
" 'Heating_Wall',\n",
" 'HeatingQC_Ex',\n",
" 'HeatingQC_Fa',\n",
" 'HeatingQC_Gd',\n",
" 'HeatingQC_Po',\n",
" 'HeatingQC_TA',\n",
" 'CentralAir_N',\n",
" 'CentralAir_Y',\n",
" 'Electrical_FuseA',\n",
" 'Electrical_FuseF',\n",
" 'Electrical_FuseP',\n",
" 'Electrical_SBrkr',\n",
" 'KitchenQual_Ex',\n",
" 'KitchenQual_Fa',\n",
" 'KitchenQual_Gd',\n",
" 'KitchenQual_TA',\n",
" 'Functional_Maj1',\n",
" 'Functional_Maj2',\n",
" 'Functional_Min1',\n",
" 'Functional_Min2',\n",
" 'Functional_Mod',\n",
" 'Functional_Sev',\n",
" 'Functional_Typ',\n",
" 'FireplaceQu_Ex',\n",
" 'FireplaceQu_Fa',\n",
" 'FireplaceQu_Gd',\n",
" 'FireplaceQu_Po',\n",
" 'FireplaceQu_TA',\n",
" 'GarageType_2Types',\n",
" 'GarageType_Attchd',\n",
" 'GarageType_Basment',\n",
" 'GarageType_BuiltIn',\n",
" 'GarageType_CarPort',\n",
" 'GarageType_Detchd',\n",
" 'GarageFinish_Fin',\n",
" 'GarageFinish_RFn',\n",
" 'GarageFinish_Unf',\n",
" 'GarageQual_Fa',\n",
" 'GarageQual_Gd',\n",
" 'GarageQual_Po',\n",
" 'GarageQual_TA',\n",
" 'GarageCond_Ex',\n",
" 'GarageCond_Fa',\n",
" 'GarageCond_Gd',\n",
" 'GarageCond_Po',\n",
" 'GarageCond_TA',\n",
" 'PavedDrive_N',\n",
" 'PavedDrive_P',\n",
" 'PavedDrive_Y',\n",
" 'PoolQC_Ex',\n",
" 'PoolQC_Gd',\n",
" 'Fence_GdPrv',\n",
" 'Fence_GdWo',\n",
" 'Fence_MnPrv',\n",
" 'Fence_MnWw',\n",
" 'MiscFeature_Gar2',\n",
" 'MiscFeature_Othr',\n",
" 'MiscFeature_Shed',\n",
" 'SaleType_COD',\n",
" 'SaleType_CWD',\n",
" 'SaleType_Con',\n",
" 'SaleType_ConLD',\n",
" 'SaleType_ConLI',\n",
" 'SaleType_ConLw',\n",
" 'SaleType_New',\n",
" 'SaleType_Oth',\n",
" 'SaleType_WD',\n",
" 'SaleCondition_Abnorml',\n",
" 'SaleCondition_AdjLand',\n",
" 'SaleCondition_Alloca',\n",
" 'SaleCondition_Family',\n",
" 'SaleCondition_Normal',\n",
" 'SaleCondition_Partial']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(testdum.columns.values) #This actually lists all the columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you scroll down in the feature names you will find 'MSZoning_RH' and 'MSZoning_RL', \n",
"as we were expecting. This is good news. You should check that the row which had MSZoning equal to RH indeed has MSZoning_RH=1 and the other options =0."
]
},
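{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to carry out this check, sketched on a made-up frame: select the rows whose original category is RH and look at the indicator columns for that feature. The `toy` data below is hypothetical, and `dtype=int` is passed only so that the indicators print as 0/1 regardless of the pandas version. With the notebook's data the same idea would use `test` and `testdum`. Note that by default `get_dummies` gives all zeros to a row whose category is missing, so the row sums are not guaranteed to be 1 on real data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from train.csv or test.csv\n",
"toy = pd.DataFrame({'MSZoning': ['RH', 'RL', 'RL']})\n",
"toy_dum = pd.get_dummies(toy, dtype=int)\n",
"\n",
"# Rows that were RH in the original frame\n",
"rh_rows = toy['MSZoning'] == 'RH'\n",
"print(toy_dum.loc[rh_rows, ['MSZoning_RH', 'MSZoning_RL']])\n",
"\n",
"# With no missing values, each row has exactly one indicator set\n",
"print((toy_dum.sum(axis=1) == 1).all())\n",
"```"
]
},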
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is one important mistake we have made here. Do you see it? \n",
"\n",
"The feature MSSubClass has remained numerical in testdum. \n",
"This is because the get_dummies method did not understand that the numbers 20, 30, etc. are codes of a categorical variable rather than quantities. You should be one-hot encoding that column too. Find how to do that. "
]
},
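{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach to the exercise above, sketched on a made-up frame (the `toy` data is hypothetical; only the column names come from the dataset). Two options that should both work: cast the codes to strings so that `get_dummies` treats them as categories, or name the column explicitly through the `columns` argument.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from train.csv or test.csv\n",
"toy = pd.DataFrame({'MSSubClass': [20, 60, 120],\n",
"                    'LotArea': [11622, 13830, 5005]})\n",
"\n",
"# Option 1: cast the codes to strings before encoding\n",
"as_str = pd.get_dummies(toy.astype({'MSSubClass': str}))\n",
"\n",
"# Option 2: tell get_dummies which columns to encode\n",
"by_name = pd.get_dummies(toy, columns=['MSSubClass'])\n",
"\n",
"print(list(as_str.columns))\n",
"print(list(by_name.columns))\n",
"```\n",
"\n",
"In both cases MSSubClass should be replaced by one indicator column per code, while LotArea stays numerical."
]
},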
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to look at what the statistics of the features look like."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEK1JREFUeJzt3V+MXGd9xvHvUycNUQgiacLIta06ldxKDi6BrkIkULUl\nghiCcC6qyChQR0rlmyCBaonaRWrFhaW0UhCqaC4sQBjxJ7IEUayEtnJMRqhSwSSQ4DiJG0McxZYT\nCxCF5SLtpr9e7AmdmPXurHfHu/vO9yON5j3vec+c9+ezevb4zJnZVBWSpHb9znJPQJI0Wga9JDXO\noJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXGXLPcEAK655prauHHjck9jyf3617/miiuu\nWO5pjJQ1tmMc6mytxscff/ynVXXtfONWRNBv3LiRxx57bLmnseT6/T6Tk5PLPY2RssZ2jEOdrdWY\n5IVhxnnpRpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGrciPhm7Wm3c/fCc\n63dtmebOecZciJP33LrkrympXZ7RS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMZ5e+UqNN9tnaPk\nrZ3S6uMZvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxQwV9kpNJjiZ5IsljXd/V\nSQ4lea57vmpg/J4kJ5IcT3LLqCYvSZrfQs7o/7yqbqiqiW55N3C4qjYBh7tlkmwGtgPXA1uB+5Ks\nWcI5S5IWYDGXbrYB+7v2fuC2gf77q+qVqnoeOAHcuIj9SJIWYdigL+CRJI8n2dn19arqTNd+Ceh1\n7XXAiwPbnur6JEnLYNgvNXt3VZ1O8hbgUJJnB1dWVSWphey4+4WxE6DX69Hv9xey+Yqwa8v0nOt7\nl88/ZrU59zhNTU2tymO3EONQI4xHneNQ42yGCvqqOt09n03yADOXYl5OsraqziRZC5zthp8GNgxs\nvr7rO/c19wH7ACYmJmpycvKCi1gu8/3h711bprn3aFtfEHryjsnXLff7fVbjsVuIcagRxqPOcahx\nNvNeuklyRZIrX2sD7wOeAg4CO7phO4AHu/ZBYHuSy5JcB2wCjiz1xCVJwxnmdLMHPJDktfFfq6p/\nTfJ94ECSu4AXgNsBqupYkgPA08A0cHdVvTqS2UuS5jVv0FfVT4C3zdL/M+Dm82yzF9i76NlJkhbN\nT8ZKUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEG\nvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcQa9JDVu6KBPsibJD5M81C1fneRQkue656sGxu5JciLJ8SS3jGLikqTh\nLOSM/uPAMwPLu4HDVbUJONwtk2QzsB24HtgK3JdkzdJMV5K0UEMFfZL1wK3A5we6twH7u/Z+4LaB\n/vur6pWqeh44Ady4NNOVJC3UJUOO+yzwSeDKgb5eVZ3p2i8Bva69DvjuwLhTXd/rJNkJ7ATo9Xr0\n+/3hZ71C7NoyPef63uXzj1ltzj1OU1NTq/LYLcQ41AjjUec41DibeYM+yQeBs1X1eJLJ2cZUVSWp\nhey4qvYB+wAmJiZqcnLWl17R7tz98Jzrd22Z5t6jw/4uXR1O3jH5uuV+v89qPHYLMQ41wnjUOQ41\nzmaYFHoX8KEkHwDeALwpyVeAl5OsraozSdYCZ7vxp4ENA9uv7/okSctg3mv0VbWnqtZX1UZm3mT9\ndlV9BDgI7OiG7QAe7NoHge1JLktyHbAJOLLkM5ckDWUx1xXuAQ4kuQt4AbgdoKqOJTkAPA1MA3dX\n1auLnqkk6YIsKOirqg/0u/bPgJvPM24vsHeR
c5MkLQE/GStJjTPoJalxBr0kNc6gl6TGGfSS1DiD\nXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+gl\nqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGzRv0Sd6Q5EiS\nJ5McS/Lprv/qJIeSPNc9XzWwzZ4kJ5IcT3LLKAuQJM1tmDP6V4D3VNXbgBuArUluAnYDh6tqE3C4\nWybJZmA7cD2wFbgvyZpRTF6SNL95g75mTHWLl3aPArYB+7v+/cBtXXsbcH9VvVJVzwMngBuXdNaS\npKENdY0+yZokTwBngUNV9T2gV1VnuiEvAb2uvQ54cWDzU12fJGkZXDLMoKp6FbghyZuBB5K89Zz1\nlaQWsuMkO4GdAL1ej36/v5DNV4RdW6bnXN+7fP4xq825x2lqampVHruFGIcaYTzqHIcaZzNU0L+m\nqn6R5FFmrr2/nGRtVZ1JspaZs32A08CGgc3Wd33nvtY+YB/AxMRETU5OXsD0l9edux+ec/2uLdPc\ne3RB/8Qr3sk7Jl+33O/3WY3HbiHGoUYYjzrHocbZDHPXzbXdmTxJLgfeCzwLHAR2dMN2AA927YPA\n9iSXJbkO2AQcWeqJS5KGM8zp5lpgf3fnzO8AB6rqoST/ARxIchfwAnA7QFUdS3IAeBqYBu7uLv1I\nkpbBvEFfVT8C3j5L/8+Am8+zzV5g76JnJ0laND8ZK0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn\n0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9\nJDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMbNG/RJNiR5NMnTSY4l\n+XjXf3WSQ0me656vGthmT5ITSY4nuWWUBUiS5jbMGf00sKuqNgM3AXcn2QzsBg5X1SbgcLdMt247\ncD2wFbgvyZpRTF6SNL95g76qzlTVD7r2r4BngHXANmB/N2w/cFvX3gbcX1WvVNXzwAngxqWeuCRp\nOAu6Rp9kI/B24HtAr6rOdKteAnpdex3w4sBmp7o+SdIyuGTYgUneCHwD+ERV/TLJb9ZVVSWphew4\nyU5gJ0Cv16Pf7y9k8xVh15bpOdf3Lp9/zGpz7nGamppalcduIcahRhiPOsehxtkMFfRJLmUm5L9a\nVd/sul9OsraqziRZC5zt+k8DGwY2X9/1vU5V7QP2AUxMTNTk5OSFVbCM7tz98Jzrd22Z5t6jQ/8u\nXRVO3jH5uuV+v89qPHYLMQ41wnjUOQ41zmaYu24CfAF4pqo+M7DqILCja+8AHhzo357ksiTXAZuA\nI0s3ZUnSQgxzuvku4KPA0SRPdH1/C9wDHEhyF/ACcDtAVR1LcgB4mpk7du6uqleXfOaSpKHMG/RV\n9e9AzrP65vNssxfYu4h5aYXaeM7lql1bpue9hLUUTt5z68j3IbXKT8ZKUuMMeklqnEEvSY0z6CWp\ncQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn\n0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9\nJDVu3qBP8sUkZ5M8NdB3dZJDSZ7rnq8aWLcnyYkkx5PcMqqJS5KGc8kQY74EfA748kDfbuBwVd2T\nZHe3/DdJNgPbgeuB3wceSfJHVfXq0k779TbufniULy9Jq9q8Z/RV9R3g5+d0bwP2d+39wG0D/fdX\n1StV9TxwArhxieYqSboAw5zRz6ZXVWe69ktAr2uvA747MO5U1/dbkuwEdgL0ej36/f4FTgV2bZm+\n4G1HqXf5
yp3bUrlYNS7m52OxpqamlnX/F8s41DkONc7mQoP+N6qqktQFbLcP2AcwMTFRk5OTFzyH\nO1fopZtdW6a59+ii/4lXtItV48k7Jke+j/Pp9/ss5udztRiHOsehxtlc6F03LydZC9A9n+36TwMb\nBsat7/okScvkQoP+ILCja+8AHhzo357ksiTXAZuAI4uboiRpMeb9P3eSrwOTwDVJTgF/D9wDHEhy\nF/ACcDtAVR1LcgB4GpgG7h71HTeSpLnNG/RV9eHzrLr5POP3AnsXMylJ0tLxk7GS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGtf2d+iqGcv5V8S+tPWKZdu3tBQ8o5ek\nxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqc\nQS9JjTPoJalxBr0kNc4/PCLN4+jp/+LOZfjDJyfvufWi71NtGtkZfZKtSY4nOZFk96j2I0ma20iC\nPska4J+B9wObgQ8n2TyKfUmS5jaqM/obgRNV9ZOq+m/gfmDbiPYlSZrDqK7RrwNeHFg+BbxzRPuS\nmnSx/yD6ri3Ty/JexMW0Emu8GO/FpKqW/kWTvwC2VtVfdcsfBd5ZVR8bGLMT2Nkt/jFwfMknsvyu\nAX663JMYMWtsxzjU2VqNf1BV1843aFRn9KeBDQPL67u+36iqfcC+Ee1/RUjyWFVNLPc8Rska2zEO\ndY5DjbMZ1TX67wObklyX5HeB7cDBEe1LkjSHkZzRV9V0ko8B/wasAb5YVcdGsS9J0txG9oGpqvoW\n8K1Rvf4q0fSlqY41tmMc6hyHGn/LSN6MlSStHH7XjSQ1zqBfgCRfTHI2yVMDfVcnOZTkue75qoF1\ne7qvgDie5JaB/j9NcrRb909JcrFrOZ8kG5I8muTpJMeSfLzrb6bOJG9IciTJk12Nn+76m6nxNUnW\nJPlhkoe65RZrPNnN74kkj3V9zdW5KFXlY8gH8GfAO4CnBvr+EdjdtXcD/9C1NwNPApcB1wE/BtZ0\n644ANwEB/gV4/3LXNlDPWuAdXftK4D+7Wpqps5vPG7v2pcD3unk2U+NArX8NfA14qMWf125+J4Fr\nzulrrs7FPDyjX4Cq+g7w83O6twH7u/Z+4LaB/vur6pWqeh44AdyYZC3wpqr6bs38dH15YJtlV1Vn\nquoHXftXwDPMfNK5mTprxlS3eGn3KBqqESDJeuBW4PMD3U3VOIdxqXMoBv3i9arqTNd+Ceh17dm+\nBmJd9zg1S/+Kk2Qj8HZmznibqrO7pPEEcBY4VFXN1Qh8Fvgk8L8Dfa3VCDO/pB9J8nj3iXtos84L\n5vfRL6GqqiRN3MaU5I3AN4BPVNUvBy9XtlBnVb0K3JDkzcADSd56zvpVXWOSDwJnq+rxJJOzjVnt\nNQ54d1WdTvIW4FCSZwdXNlTnBfOMfvFe7v7bR/d8tus/39dAnO7a5/avGEkuZSbkv1pV3+y6m6sT\noKp+ATwKbKWtGt8FfCjJSWa+PfY9Sb5CWzUCUFWnu+ezwAPMfHtuc3UuhkG/eAeBHV17B/DgQP/2\nJJcluQ7YBBzp/jv5yyQ3de/q/+XANsuum9MXgGeq6jMDq5qpM8m13Zk8SS4H3gs8S0M1VtWeqlpf\nVRuZ+QqSb1fVR2ioRoAkVyS58rU28D7gKRqrc9GW+93g1fQAvg6cAf6HmWt4dwG/BxwGngMeAa4e\nGP8pZt7VP87AO/jABDM/jD8GPkf3wbWV8ADezcw1zx8BT3SPD7RUJ/AnwA+7Gp8C/q7rb6bGc+qd\n5P/vummqRuAPmbmL5kngGPCpFutc7MNPxkpS47x0I0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn\n0EtS4wx6SWrc/wHWDU9mYLroOAAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train.GrLivArea.hist()  # This plots the histogram of the feature GrLivArea (above-grade living area in square feet)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most homes are between 1000 and 3000 square feet. This is reasonable. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAF0NJREFUeJzt3WGMlPdh5/HvL5DY2NsCPudGHKCDF8gVDopTVjRtTtFu\naWpSR8EvKhfLjXDlE5WOukmPqoK+ifoCyS/qqlF9PgmFXKlwvUdJLFAcp6HUqzRSMQmOe2vscKYB\nbDYY0sQmXR8it/R3L/YhHnMLM7MzD7P8+X0ktM/85/+f5zc7498+fnZmVraJiIhyva/fASIiol4p\n+oiIwqXoIyIKl6KPiChcij4ionAp+oiIwqXoIyIKl6KPiChcij4ionBz+x0A4M477/SyZctmvP6d\nd97h9ttv712gHkmuziRXZ5KrMyXmOnLkyL/Y/mDLibb7/m/16tXuxvPPP9/V+rokV2eSqzPJ1ZkS\ncwHfcRsdm1M3ERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBSuraKX9AeSjkp6WdLTkm6VdIek\nA5Jeq74ubJq/TdJxScck3Vtf/IiIaKVl0UtaDPw+MGj7Q8AcYAOwFThoewVwsLqMpJXV9XcD64An\nJc2pJ35ERLTS7qmbucA8SXOB24AfAOuBXdX1u4D7q+31wIjti7ZPAMeBNb2LHBERnWj5EQi2xyX9\nKfA6cAH4hu1vSGrYPlNNexNoVNuLgUNNN3G6GosCLNv6bG23vWXVJA9f5fZPPnZfbfuNKJ2m3kV7\njQlT596/DPwW8DbwN8Be4AnbC5rmvWV7oaQngEO2d1fjO4HnbO+94nY3AZsAGo3G6pGRkRnfiYmJ\nCQYGBma8vi4l5hobP9/jNO9qzIOzF6a/btXi+bXtt5USH8c6JVdnusk1PDx8xPZgq3ntfKjZrwEn\nbP8QQNJXgF8BzkpaZPuMpEXAuWr+OLC0af2Sauw9bO8AdgAMDg56aGiojSjTGx0dpZv1dSkx19WO\nuHthy6pJHh+b/il58qGh2vbbSomPY52SqzPXI1c75+hfBz4q6TZJAtYCrwL7gY3VnI3Avmp7P7BB\n0i2SlgMrgMO9jR0REe1q5xz9C5L2Ai8Ck8B3mToSHwD2SHoEOAU8UM0/KmkP8Eo1f7PtSzXlj4iI\nFtr6PHrbnwc+f8XwRaaO7qebvx3Y3l20iIjohbwzNiKicCn6iIjCpegjIgqXoo+IKFyKPiKicCn6\niIjCpegjIgqXoo+IKFyKPiKicCn6iIjCpegjIgqXoo+IKFyKPiKicCn6iIjCpegjIgqXoo+IKFyK\nPiKicC2LXtJdkl5q+vcTSZ+TdIekA5Jeq74ubFqzTdJxScck3VvvXYiIiGtpWfS2j9m+x/Y9wGrg\n/wDPAFuBg7ZXAAery0haCWwA7gbWAU9KmlNT/oiIaKHTUzdrgX+2fQpYD+yqxncB91fb64ER2xdt\nnwCOA2t6ETYiIjrXadFvAJ6uthu2z1TbbwKNansx8EbTmtPVWERE9IFstzdR+gDwA+Bu22clvW17\nQdP1b9leKOkJ4JDt3dX4TuA523uvuL1NwCaARqOxemRkZMZ3YmJigoGBgRmvr0uJucbGz/c4zbsa\n8+DshemvW7V4fm37baXEx7FOydWZbnINDw8fsT3Yat7cDm7zk8CLts9Wl89KWmT7jKRFwLlqfBxY\n2rRuSTX2HrZ3ADsABgcHPTQ01EGU9xodHaWb9XUpMdfDW5/tbZgmW1ZN8vjY9E/Jkw8N1bbfVkp8\nHOuUXJ25Hrk6OXXzIO+etgHYD2ystjcC+5rGN0i6RdJyYAVwuNugERExM20d0Uu6HfgE8LtNw48B\neyQ9ApwCHgCwfVTSHuAVYBLYbPtST1NHRETb2ip62+8A/+6KsR8x9Sqc6eZvB7Z3nS4iIrqWd8ZG\nRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8R
UbgUfURE4VL0ERGFS9FHRBQuRR8RUbgUfURE4VL0\nERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBSuraKXtEDSXknfk/SqpF+WdIekA5Jeq74ubJq/\nTdJxScck3Vtf/IiIaKXdI/ovAF+3/QvAh4FXga3AQdsrgIPVZSStBDYAdwPrgCclzel18IiIaE/L\nopc0H/g4sBPA9k9tvw2sB3ZV03YB91fb64ER2xdtnwCOA2t6HTwiItoj29eeIN0D7ABeYepo/gjw\nWWDc9oJqjoC3bC+Q9ARwyPbu6rqdwHO2915xu5uATQCNRmP1yMjIjO/ExMQEAwMDM15flxJzjY2f\n73GadzXmwdkL01+3avH82vbbSomPY52SqzPd5BoeHj5ie7DVvLlt3NZc4BeBR22/IOkLVKdpLrNt\nSdf+iXEF2zuY+gHC4OCgh4aGOln+HqOjo3Szvi4l5np467O9DdNky6pJHh+b/il58qGh2vbbSomP\nY52SqzPXI1c75+hPA6dtv1Bd3stU8Z+VtAig+nquun4cWNq0fkk1FhERfdCy6G2/Cbwh6a5qaC1T\np3H2AxursY3Avmp7P7BB0i2SlgMrgMM9TR0REW1r59QNwKPAU5I+AHwf+B2mfkjskfQIcAp4AMD2\nUUl7mPphMAlstn2p58kjIqItbRW97ZeA6U74r73K/O3A9i5yRUREj+SdsRERhUvRR0QULkUfEVG4\nFH1EROFS9BERhUvRR0QULkUfEVG4FH1EROFS9BERhUvRR0QULkUfEVG4FH1EROFS9BERhUvRR0QU\nLkUfEVG4FH1EROHaKnpJJyWNSXpJ0neqsTskHZD0WvV1YdP8bZKOSzom6d66wkdERGudHNEP277H\n9uW/NLUVOGh7BXCwuoyklcAG4G5gHfCkpDk9zBwRER3o5tTNemBXtb0LuL9pfMT2RdsngOPAmi72\nExERXWi36A38naQjkjZVYw3bZ6rtN4FGtb0YeKNp7elqLCIi+kC2W0+SFtsel/TvgQPAo8B+2wua\n5rxle6GkJ4BDtndX4zuB52zvveI2NwGbABqNxuqRkZEZ34mJiQkGBgZmvL4uJeYaGz/f4zTvasyD\nsxemv27V4vm17beVEh/HOiVXZ7rJNTw8fKTpdPpVzW3nxmyPV1/PSXqGqVMxZyUtsn1G0iLgXDV9\nHFjatHxJNXblbe4AdgAMDg56aGionSjTGh0dpZv1dSkx18Nbn+1tmCZbVk3y+Nj0T8mTDw3Vtt9W\nSnwc65RcnbkeuVqeupF0u6Sfu7wN/DrwMrAf2FhN2wjsq7b3Axsk3SJpObACONzr4BER0Z52jugb\nwDOSLs//a9tfl/RtYI+kR4BTwAMAto9K2gO8AkwCm21fqiV9RES01LLobX8f+PA04z8C1l5lzXZg\ne9fpIiKia3lnbERE4VL0ERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8RUbgUfURE4VL0\nERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8RUbgUfURE4doueklzJH1X0lery3dIOiDp\nterrwqa52yQdl3RM0r11BI+IiPZ0ckT/WeDVpstbgYO2VwAHq8tIWglsAO4G1gFPSprTm7gREdGp\ntope0hLgPuCLTcPrgV3V9i7g/qbxEdsXbZ8AjgNrehM3IiI61e4R/Z8DfwT8W9NYw/aZavtNoFFt\nLwbeaJp3uhqLiIg+kO1rT5A+BfyG7f8iaQj4Q9ufkvS27QVN896yvVDSE8Ah27ur8Z3Ac7b3XnG7\nm4BNAI1GY/XIyMiM78TExAQDAwMzXl+XEnONjZ/vcZp3NebB2QvTX7dq8fza9ttKiY9jnZKrM93k\nGh4ePmJ7sNW8uW3c1seAT0v6DeBW4Ocl7QbOSlpk+4ykRcC5av44sLRp/ZJq7D1s7wB2AAwODnpo\naKiNKNMb
HR2lm/V1KTHXw1uf7W2YJltWTfL42PRPyZMPDdW231ZKfBzrlFyduR65Wp66sb3N9hLb\ny5j6Jevf2/5tYD+wsZq2EdhXbe8HNki6RdJyYAVwuOfJIyKiLe0c0V/NY8AeSY8Ap4AHAGwflbQH\neAWYBDbbvtR10oiImJGOit72KDBabf8IWHuVeduB7V1mi4iIHsg7YyMiCpeij4goXIo+IqJwKfqI\niMKl6CMiCpeij4goXDevo48+Wdblu1O3rJqs9R2uETG75Ig+IqJwKfqIiMKl6CMiCpeij4goXIo+\nIqJwKfqIiMKl6CMiCpeij4goXIo+IqJwKfqIiMK1LHpJt0o6LOmfJB2V9CfV+B2SDkh6rfq6sGnN\nNknHJR2TdG+ddyAiIq6tnSP6i8Cv2v4wcA+wTtJHga3AQdsrgIPVZSStZOqPiN8NrAOelDSnjvAR\nEdFay6L3lInq4vurfwbWA7uq8V3A/dX2emDE9kXbJ4DjwJqepo6IiLa1dY5e0hxJLwHngAO2XwAa\nts9UU94EGtX2YuCNpuWnq7GIiOgD2W5/srQAeAZ4FPiW7QVN171le6GkJ4BDtndX4zuB52zvveK2\nNgGbABqNxuqRkZEZ34mJiQkGBgZmvL4udeUaGz/f1frGPDh7oUdheuhauVYtnn99wzS52Z5f3Uqu\nznSTa3h4+IjtwVbzOvo8ettvS3qeqXPvZyUtsn1G0iKmjvYBxoGlTcuWVGNX3tYOYAfA4OCgh4aG\nOonyHqOjo3Szvi515er2s+S3rJrk8bHZ96cIrpXr5END1zdMk5vt+dWt5OrM9cjVzqtuPlgdySNp\nHvAJ4HvAfmBjNW0jsK/a3g9skHSLpOXACuBwr4NHRER72jmsWwTsql458z5gj+2vSvpHYI+kR4BT\nwAMAto9K2gO8AkwCm21fqid+RES00rLobf8v4CPTjP8IWHuVNduB7V2ni4iIruWdsRERhUvRR0QU\nLkUfEVG4FH1EROFS9BERhUvRR0QULkUfEVG4FH1EROFS9BERhUvRR0QULkUfEVG4FH1EROFS9BER\nhUvRR0QULkUfEVG4FH1EROFS9BERhWv5F6YkLQX+CmgABnbY/oKkO4D/CSwDTgIP2H6rWrMNeAS4\nBPy+7b+tJX3EdTA2fr7rP8g+Eycfu++67zPK1M4R/SSwxfZK4KPAZkkrga3AQdsrgIPVZarrNgB3\nA+uAJ6u/NxsREX3Qsuhtn7H9YrX9r8CrwGJgPbCrmrYLuL/aXg+M2L5o+wRwHFjT6+AREdEe2W5/\nsrQM+CbwIeB12wuqcQFv2V4g6QngkO3d1XU7geds773itjYBmwAajcbqkZGRGd+JiYkJBgYGZry+\nLnXlGhs/39X6xjw4e6FHYXroWrlWLZ5/fcM0Offj8335frW6zzfb875bJeYaHh4+Ynuw1byW5+gv\nkzQAfBn4nO2fTHX7FNuW1P5PjKk1O4AdAIODgx4aGupk+XuMjo7Szfq61JWr2/PFW1ZN8vhY2w/9\ndXOtXCcfGrq+YZr8xVP7+vL9anWfb7bnfbdu5lxtvepG0vuZKvmnbH+lGj4raVF1/SLgXDU+Dixt\nWr6kGouIiD5oWfTVaZmdwKu2/6zpqv3Axmp7I7CvaXyDpFskLQdWAId7FzkiIjrRzv+Pfgz4DDAm\n6aVq7I+Bx4A9kh4BTgEPANg+KmkP8ApTr9jZbPtSz5NHRERbWha97W8BusrVa6+yZjuwvYtcERHR\nI3lnbERE4VL0ERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FH\nRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8RUbh2/pTglySdk/Ry09gdkg5Ieq36urDpum2Sjks6\nJuneuoJHRER72jmi/0tg3RVjW4GDtlcAB6vLSFoJbADurtY8KWlOz9JGRE
THWha97W8CP75ieD2w\nq9reBdzfND5i+6LtE8BxYE2PskZExAzM9Bx9w/aZavtNoFFtLwbeaJp3uhqLiIg+ke3Wk6RlwFdt\nf6i6/LbtBU3Xv2V7oaQngEO2d1fjO4HnbO+d5jY3AZsAGo3G6pGRkRnfiYmJCQYGBma8vi515Rob\nP9/V+sY8OHuhR2F66Fq5Vi2ef33DNDn34/N9+X61us832/O+WyXmGh4ePmJ7sNW8uTO6dTgraZHt\nM5IWAeeq8XFgadO8JdXY/8f2DmAHwODgoIeGhmYYBUZHR+lmfV3qyvXw1me7Wr9l1SSPj830oa/P\ntXKdfGjo+oZp8hdP7evL96vVfb7ZnvfduplzzfTUzX5gY7W9EdjXNL5B0i2SlgMrgMPdRYyIiG60\nPEyR9DQwBNwp6TTweeAxYI+kR4BTwAMAto9K2gO8AkwCm21fqil7RES0oWXR237wKletvcr87cD2\nbkJFRETv5J2xERGFS9FHRBQuRR8RUbgUfURE4VL0ERGFS9FHRBQuRR8RUbgUfURE4WbfB57cQJa1\n+MyZLasmu/5cmoiIbuWIPiKicCn6iIjCpegjIgqXoo+IKFyKPiKicCn6iIjC5eWVEbNUP1++e/Kx\n+2q53eiPHNFHRBSutqKXtE7SMUnHJW2taz8REXFttRS9pDnAfwM+CawEHpS0so59RUTEtdV1jn4N\ncNz29wEkjQDrmfqj4T03Nn4+HzUQ0UOtfj9wLd387iC/G6hHXaduFgNvNF0+XY1FRMR1Jtu9v1Hp\nN4F1tv9zdfkzwC/Z/r2mOZuATdXFu4BjXezyTuBfulhfl+TqTHJ1Jrk6U2Ku/2j7g60m1XXqZhxY\n2nR5STX2M7Z3ADt6sTNJ37E92Ivb6qXk6kxydSa5OnMz56rr1M23gRWSlkv6ALAB2F/TviIi4hpq\nOaK3PSnp94C/BeYAX7J9tI59RUTEtdX2zljbXwO+VtftX6Enp4BqkFydSa7OJFdnbtpctfwyNiIi\nZo98BEJEROFu6KKfrR+zIOlLks5JernfWS6TtFTS85JekXRU0mf7nQlA0q2SDkv6pyrXn/Q7UzNJ\ncyR9V9JX+53lMkknJY1JeknSd/qd5zJJCyTtlfQ9Sa9K+uVZkOmu6vt0+d9PJH2u37kAJP1B9Zx/\nWdLTkm6tbV836qmb6mMW/jfwCabekPVt4EHbtbz7thOSPg5MAH9l+0P9zgMgaRGwyPaLkn4OOALc\n3+/vlyQBt9uekPR+4FvAZ20f6meuyyT9V2AQ+Hnbn+p3HpgqemDQ9qx6TbikXcA/2P5i9Wq722y/\n3e9cl1WdMc7Ue3pO9TnLYqae6yttX5C0B/ia7b+sY3838hH9zz5mwfZPgcsfs9B3tr8J/LjfOZrZ\nPmP7xWr7X4FXmQXvVvaUieri+6t/s+LoQ9IS4D7gi/3OMttJmg98HNgJYPuns6nkK2uBf+53yTeZ\nC8yTNBe4DfhBXTu6kYs+H7MwQ5KWAR8BXuhvkinV6ZGXgHPAAduzIhfw58AfAf/W7yBXMPB3ko5U\n7zCfDZYDPwT+R3Wq64uSbu93qCtsAJ7udwgA2+PAnwKvA2eA87a/Udf+buSijxmQNAB8Gfic7Z/0\nOw+A7Uu272HqHdRrJPX9dJekTwHnbB/pd5Zp/Kfq+/VJYHN1qrDf5gK/CPx32x8B3gFm0+/NPgB8\nGvibfmcBkLSQqTMQy4H/ANwu6bfr2t+NXPQtP2Yh3qs6B/5l4CnbX+l3nitV/6v/PLCu31mAjwGf\nrs6HjwC/Kml3fyNNqY4GsX0OeIap05j9dho43fR/Y3uZKv7Z4pPAi7bP9jtI5deAE7Z/aPv/Al8B\nfqWund3IRZ+PWehA9UvPncCrtv+s33kuk/RBSQuq7XlM/XL9e/1NBba32V5iexlTz62/t13bEVe7\nJN1e/TKd6tTIrwN9f3WX7TeBNyTdVQ
2tpaaPJZ+hB5klp20qrwMflXRb9d/mWqZ+b1aLG/Zvxs7m\nj1mQ9DQwBNwp6TTweds7+5uKjwGfAcaq8+EAf1y9g7mfFgG7qldEvA/YY3vWvJRxFmoAz0x1A3OB\nv7b99f5G+plHgaeqA6/vA7/T5zzAz34gfgL43X5nucz2C5L2Ai8Ck8B3qfEdsjfsyysjIqI9N/Kp\nm4iIaEOKPiKicCn6iIjCpegjIgqXoo+IKFyKPiKicCn6iIjCpegjIgr3/wBc5L+9cYDtswAAAABJ\nRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#Lets investigate how many bedrooms there are.\n",
"train.BedroomAbvGr.hist() #This is the number of Bedrooms above grade (i.e. not basement bedrooms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmm, this is odd. Why is this gap there. Lets look at the actual values of the feature."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3 804\n",
"2 358\n",
"4 213\n",
"1 50\n",
"5 21\n",
"6 7\n",
"0 6\n",
"8 1\n",
"Name: BedroomAbvGr, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.BedroomAbvGr.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, it was the way the histogram bins integers then, lets not worry about it.\n",
"There is an 8 bedroom house, that seems odd. Lets find it to look if this is an outlier or error. \n",
"\n",
"I want to select the row of the train dataframe that has BedroomAbvGr=8. I can do it with one line of pandas slicing"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n",
"635 636 190 RH 60.0 10896 Pave Pave Reg \n",
"\n",
" LandContour Utilities LotConfig LandSlope Neighborhood Condition1 \\\n",
"635 Bnk AllPub Inside Gtl SWISU Feedr \n",
"\n",
" Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt \\\n",
"635 Norm 2fmCon 2.5Fin 6 7 1914 \n",
"\n",
" YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType \\\n",
"635 1995 Hip CompShg VinylSd VinylSd None \n",
"\n",
" MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure \\\n",
"635 0.0 Fa TA CBlock TA Fa No \n",
"\n",
" BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF \\\n",
"635 LwQ 256 Unf 0 1184 1440 \n",
"\n",
" Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF \\\n",
"635 GasA Ex Y FuseA 1440 1440 515 \n",
"\n",
" GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr \\\n",
"635 3395 0 0 2 0 8 \n",
"\n",
" KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces \\\n",
"635 2 Fa 14 Typ 0 \n",
"\n",
" FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea \\\n",
"635 NaN NaN NaN NaN 0 0 \n",
"\n",
" GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch \\\n",
"635 NaN NaN N 0 110 0 \n",
"\n",
" 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal \\\n",
"635 0 0 0 NaN NaN NaN 0 \n",
"\n",
" MoSold YrSold SaleType SaleCondition SalePrice \n",
"635 3 2007 WD Abnorml 200000 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Uh, it does not show me all the columns. I can fix that in many ways.\n",
"#We'll change the number of max_columns:\n",
"pd.set_option('display.max_columns',300) \n",
"train[ train.BedroomAbvGr==8] #ok lets see now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the data description we read that MSsubClass=190 means this is a '2 family conversion, all styles' home.\n",
"The Neighborhood is SWISU (South and West of Iowa State University).\n",
"House style is 2.5Fin (Two and one-half story house) built in 1914.\n",
"Has TotRmsAbvGrd = 14 (thats a lot of rooms), only 2 full bathrooms and 2 Kitchens which are unfortunately of quality 'Fa' (Fair). \n",
"\n",
"Lets check how big it is overall."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"635 3395\n",
"Name: GrLivArea, dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[ train.BedroomAbvGr==8].GrLivArea #total square feet except basement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So my conclusion here is that this is a pretty large old house. \n",
"Probably 8 bedrooms is not an error, so lets leave it alone. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data cleaning and feature normalization. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are trying to predict the sales price. Lets look at its statistics. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFdpJREFUeJzt3H2MVOd9xfHvKdjYZVxeYne0AhSIhFxBaeywIrYcRbux\nEuPYMv6jstZyI5w62kolVqJSVdBIbfMHqpsqVVJRt1mtnVI5MaEkrlEcJyLU2yZtbWJsEgw2ZWOw\nzApD4vql60akuL/+MY/hslmYWfYOc+fJ+UijufPc5957dlnO3r3zoojAzMzy9SudDmBmZu3lojcz\ny5yL3swscy56M7PMuejNzDLnojczy5yL3swscy56M7PMuejNzDI3s9MBAK688sq46qqrmD17dqej\nNPXWW285Z8m6JatzlqtbckJ1s+7Zs+enEXFV04kR0fHbypUr44knnohu4Jzl65aszlmubskZUd2s\nwNPRQsf60o2ZWeZc9GZmmXPRm5llzkVvZpY5F72ZWeZc9GZmmXPRm5llzkVvZpY5F72ZWeYq8REI\n3Wrxhsc6ctwj993SkeOaWXfyGb2ZWeZc9GZmmWta9JKulrS3cHtT0qclzZe0U9KhdD+vsM1GSaOS\nDkq6qb1fgpmZnU/Too+IgxFxTURcA6wE/gd4BNgA7IqIpcCu9BhJy4ABYDmwGrhf0ow25Tczsyam\neunmRuDHEfESsAbYksa3ALen5TXA1og4GRGHgVFgVRlhzcxs6tT4SOMWJ0sPAs9ExGZJr0fE3DQu\n4LWImCtpM/BkRDyU1j0APB4R2yfsaxAYBKjX6yuHh4ep1WrlfFVtND4+fjrnvrE3OpJhxYI5TecU\nc1Zdt2R1znJ1S06obtb+/v49EdHbbF7LL6+UdClwG7Bx4rqICEmt/8ZobDMEDAH09vZGrVajr69v\nKrvoiJGRkdM57+7Uyyvv6ms6p5iz6rolq3OWq1tyQndlncxULt3cTONs/nh6fFxSD0C6P5HGx4BF\nhe0WpjEzM+uAqRT9ncDDhcc7gLVpeS3waGF8QNIsSUuApcDu6QY1M7ML09KlG0mzgQ8Dv1cYvg/Y\nJuke4CXgDoCI2C9pG3AAOAWsi4i3S01tZmYta6noI+It4F0Txl6l8SqcyeZvAjZNO52ZmU2b3xlr\nZpY5F72ZWeZc9GZmmXPRm5llzkVvZpY5F72ZWeZc9GZmmXPRm5llzkVvZpY5F72ZWeZc9GZmmXPR\nm5llzkVvZpY5F72ZWeZc9GZmmXPRm5llzkVvZpY5F72ZWeZc9GZmmWup6CXNlbRd0guSnpd0vaT5\nknZKOpTu5xXmb5Q0KumgpJvaF9/MzJpp9Yz+i8C3I+I3gPcCzwMbgF0RsRTYlR4jaRkwACwHVgP3\nS5pRdnAzM2tN06KXNAf4IPAAQET8PCJeB9YAW9K0LcDtaXkNsDUiTkbEYWAUWFV2cDMza00rZ/RL\ngJ8AX5b0rKRhSbOBekQcS3NeAeppeQHwcmH7o2nMzMw6QBFx/glSL/AkcENEPCXpi8CbwL0RMbcw\n77WImCdpM/BkRDyUxh8AHo+I7RP2OwgMAtTr9ZXDw8PUarUyv7a2GB8fP51z39gbHcmwYsGcpnOK\nOauuW7I6Z7m6JSdUN2t/f/+eiOhtNm9mC/s6ChyNiKfS4+00rscfl9QTEcck9QAn0voxYFFh+4Vp\n7CwRMQQMAfT29katVqOvr6+FOJ01MjJyOufdGx7rSIYjd/U1nVPMWXXdktU5y9UtOaG7sk6m6aWb\niHgFeFnS1WnoRuAAsANYm8bWAo+m5R3AgKRZkpYAS4HdpaY2M7OWtXJGD3Av8BVJlwIvAh+n8Uti\nm6R7gJeAOwAiYr+kbTR+GZwC1kXE26UnNzOzlrRU9BGxF5jsOtCN55i/Cdg0jVxmZlYSvzPWzCxz\nLnozs8y56M3MMueiNzPLnIvezCxzLnozs8y5
6M3MMueiNzPLnIvezCxzLnozs8y56M3MMueiNzPL\nnIvezCxzLnozs8y56M3MMueiNzPLnIvezCxzLnozs8y56M3MMueiNzPLXEtFL+mIpH2S9kp6Oo3N\nl7RT0qF0P68wf6OkUUkHJd3UrvBmZtbcVM7o+yPimojoTY83ALsiYimwKz1G0jJgAFgOrAbulzSj\nxMxmZjYF07l0swbYkpa3ALcXxrdGxMmIOAyMAqumcRwzM5uGVos+gO9K2iNpMI3VI+JYWn4FqKfl\nBcDLhW2PpjEzM+sARUTzSdKCiBiT9OvATuBeYEdEzC3MeS0i5knaDDwZEQ+l8QeAxyNi+4R9DgKD\nAPV6feXw8DC1Wq20L6xdxsfHT+fcN/ZGRzKsWDCn6ZxizqrrlqzOWa5uyQnVzdrf37+ncDn9nGa2\nsrOIGEv3JyQ9QuNSzHFJPRFxTFIPcCJNHwMWFTZfmMYm7nMIGALo7e2NWq1GX19fK3E6amRk5HTO\nuzc81pEMR+7qazqnmLPquiWrc5arW3JCd2WdTNNLN5JmS7rinWXgI8BzwA5gbZq2Fng0Le8ABiTN\nkrQEWArsLju4mZm1ppUz+jrwiKR35n81Ir4t6QfANkn3AC8BdwBExH5J24ADwClgXUS83Zb0ZmbW\nVNOij4gXgfdOMv4qcOM5ttkEbJp2OjMzmza/M9bMLHMuejOzzLnozcwy56I3M8uci97MLHMuejOz\nzLnozcwy56I3M8uci97MLHMuejOzzLnozcwy56I3M8uci97MLHMuejOzzLnozcwy56I3M8uci97M\nLHMuejOzzLnozcwy56I3M8tcy0UvaYakZyV9Mz2eL2mnpEPpfl5h7kZJo5IOSrqpHcHNzKw1Uzmj\n/xTwfOHxBmBXRCwFdqXHSFoGDADLgdXA/ZJmlBPXzMymqqWil7QQuAUYLgyvAbak5S3A7YXxrRFx\nMiIOA6PAqnLimpnZVCkimk+StgN/DlwB/GFE3Crp9YiYm9YLeC0i5kraDDwZEQ+ldQ8Aj0fE9gn7\nHAQGAer1+srh4WFqtVqZX1tbjI+Pn865b+yNjmRYsWBO0znFnFXXLVmds1zdkhOqm7W/v39PRPQ2\nmzez2QRJtwInImKPpL7J5kRESGr+G+PsbYaAIYDe3t6o1Wr09U26+0oZGRk5nfPuDY91JMORu/qa\nzinmrLpuyeqc5eqWnNBdWSfTtOiBG4DbJH0UuAz4NUkPAccl9UTEMUk9wIk0fwxYVNh+YRozM7MO\naHqNPiI2RsTCiFhM40nWf46I3wF2AGvTtLXAo2l5BzAgaZakJcBSYHfpyc3MrCWtnNGfy33ANkn3\nAC8BdwBExH5J24ADwClgXUS8Pe2kZmZ2QaZU9BExAoyk5VeBG88xbxOwaZrZzMysBH5nrJlZ5lz0\nZmaZm841euuQxS28rHP9ilNtefnnkftuKX2fZtZePqM3M8uci97MLHMuejOzzLnozcwy56I3M8uc\ni97MLHMuejOzzLnozcwy56I3M8uci97MLHMuejOzzLnozcwy56I3M8uci97MLHMuejOzzLnozcwy\n17ToJV0mabekH0raL+mzaXy+pJ2SDqX7eYVtNkoalXRQ0k3t/ALMzOz8WjmjPwl8KCLeC1wDrJZ0\nHbAB2BURS4Fd6TGSlgEDwHJgNXC/pBntCG9mZs01LfpoGE8PL0m3ANYAW9L4FuD2tLwG2BoRJyPi\nMDAKrCo1tZmZtayla/SSZkjaC5wAdkbEU0A9Io6lKa8A9bS8AHi5sPnRNGZmZh2giGh9sjQXeAS4\nF/h+RMwtrHstIuZJ2gw8GREPpfEHgMcjYvuEfQ0CgwD1en3l8PAwtVpt2l9Qu42Pj5/OuW/sjQ6n\nObf65XD8Z+Xvd8WCOaXvs/g9rTLnLFe35ITqZu3v798TEb3N5s2cyk4j4nVJT9C49n5cUk9EHJPU\nQ+NsH2AM
WFTYbGEam7ivIWAIoLe3N2q1Gn19fVOJ0xEjIyOnc9694bHOhjmP9StO8fl9U/rnbcmR\nu/pK32fxe1plzlmubskJ3ZV1Mq286uaqdCaPpMuBDwMvADuAtWnaWuDRtLwDGJA0S9ISYCmwu+zg\nZmbWmlZO+XqALemVM78CbIuIb0r6D2CbpHuAl4A7ACJiv6RtwAHgFLAuIt5uT3wzM2umadFHxI+A\naycZfxW48RzbbAI2TTudmZlNm98Za2aWORe9mVnmXPRmZplz0ZuZZc5Fb2aWORe9mVnmXPRmZplz\n0ZuZZc5Fb2aWORe9mVnmXPRmZplz0ZuZZc5Fb2aWORe9mVnmXPRmZplz0ZuZZc5Fb2aWORe9mVnm\nXPRmZplz0ZuZZa5p0UtaJOkJSQck7Zf0qTQ+X9JOSYfS/bzCNhsljUo6KOmmdn4BZmZ2fq2c0Z8C\n1kfEMuA6YJ2kZcAGYFdELAV2pcekdQPAcmA1cL+kGe0Ib2ZmzTUt+og4FhHPpOX/Bp4HFgBrgC1p\n2hbg9rS8BtgaEScj4jAwCqwqO7iZmbVmStfoJS0GrgWeAuoRcSytegWop+UFwMuFzY6mMTMz6wBF\nRGsTpRrwL8CmiPiGpNcjYm5h/WsRMU/SZuDJiHgojT8APB4R2yfsbxAYBKjX6yuHh4ep1WrlfFVt\nND4+fjrnvrE3Opzm3OqXw/Gflb/fFQvmlL7P4ve0ypyzXN2SE6qbtb+/f09E9DabN7OVnUm6BPg6\n8JWI+EYaPi6pJyKOSeoBTqTxMWBRYfOFaewsETEEDAH09vZGrVajr6+vlTgdNTIycjrn3Rse62yY\n81i/4hSf39fSP++UHLmrr/R9Fr+nVeac5eqWnNBdWSfTyqtuBDwAPB8Rf1VYtQNYm5bXAo8Wxgck\nzZK0BFgK7C4vspmZTUUrp3w3AB8D9knam8b+GLgP2CbpHuAl4A6AiNgvaRtwgMYrdtZFxNulJzcz\ns5Y0LfqI+D6gc6y+8RzbbAI2TSOXmZmVxO+MNTPLnIvezCxzLnozs8y56M3MMueiNzPLnIvezCxz\nLnozs8y56M3MMueiNzPLXPmfetUBiy/ih4utX3Gq0h9mZmY2kc/ozcwy56I3M8uci97MLHNZXKO3\ni6cdz4e08rzHkftuKf24Zr8sfEZvZpY5F72ZWeZc9GZmmXPRm5llzkVvZpa5pkUv6UFJJyQ9Vxib\nL2mnpEPpfl5h3UZJo5IOSrqpXcHNzKw1rZzR/z2wesLYBmBXRCwFdqXHSFoGDADL0zb3S5pRWloz\nM5uypkUfEf8K/NeE4TXAlrS8Bbi9ML41Ik5GxGFgFFhVUlYzM7sAF3qNvh4Rx9LyK0A9LS8AXi7M\nO5rGzMysQ6b9ztiICEkx1e0kDQKDAPV6nfHxcUZGRi4ow/oVpy5ouwtRv/ziHu9CdUtOaC3rhf5s\nlGk6P6MXk3OWr5uyTuZCi/64pJ6IOCapBziRxseARYV5C9PYL4iIIWAIoLe3N2q1Gn19fRcU5mJ+\nbPD6Faf4/L7qf3JEt+SE1rIeuavv4oQ5j5GRkQv+Gb2YnLN83ZR1Mhd66WYHsDYtrwUeLYwPSJol\naQmwFNg9vYhmZjYdTU/5JD0M9AFXSjoK/ClwH7BN0j3AS8AdABGxX9I24ABwClgXEW+3KbuZmbWg\nadFHxJ3nWHXjOeZvAjZNJ5SZmZXH74w1M8uci97MLHMuejOzzLnozcwy56I3M8uci97MLHMuejOz\nzLnozcwy56I3M8tcd3zqlf3SW3wRP7huoiP33dKxY5uVwWf0ZmaZc9GbmWXORW9mljkXvZlZ5lz0\nZmaZc9GbmWXORW9mljkXvZlZ5vyGKbMm3nmz1voVp7j7Ir5xy2/UsrL4jN7MLHNtK3pJqyUdlDQq\naUO7jmNmZufXlqKXNAP4G+BmYBlwp6Rl7TiWmZmdX7vO6FcBoxHxYkT8HN
gKrGnTsczM7Dza9WTs\nAuDlwuOjwPvbdCyzLF3oJ3Ze7CeNL1S35IT2Zr0YT7orIsrfqfTbwOqI+ER6/DHg/RHxycKcQWAw\nPbwaeBX4aelhynclzlm2bsnqnOXqlpxQ3azvjoirmk1q1xn9GLCo8HhhGjstIoaAoXceS3o6Inrb\nlKc0zlm+bsnqnOXqlpzQXVkn065r9D8AlkpaIulSYADY0aZjmZnZebTljD4iTkn6JPAdYAbwYETs\nb8exzMzs/Nr2ztiI+BbwrSlsMtR8SiU4Z/m6JatzlqtbckJ3Zf0FbXky1szMqsMfgWBmlruI6OgN\nWA0cBEaBDW08zoPACeC5wth8YCdwKN3PK6zbmDIdBG4qjK8E9qV1f82Zv4pmAV9L408BiwvbrE3H\nOASsbZJzEfAEcADYD3yqilmBy4DdwA9Tzs9WMWdh/gzgWeCbFc95JB1jL/B0VbMCc4HtwAvA88D1\nVctJ42Xbewu3N4FPVy3nxbh17MCF/3w/Bt4DXEqjNJa16VgfBN7H2UX/OdIvF2AD8BdpeVnKMgtY\nkjLOSOt2A9cBAh4Hbk7jvw/8XVoeAL5W+E/6Yrqfl5bnnSdnD/C+tHwF8J8pT6Wypn3W0vIl6Yf8\nuqrlLOT9A+CrnCn6quY8Alw5YaxyWYEtwCfS8qU0ir9yOSd0zSvAu6ucs123Thf99cB3Co83Ahvb\neLzFnF30B4GetNwDHJwsB41XD12f5rxQGL8T+FJxTlqeSePNFSrOSeu+BNw5hcyPAh+uclbgV4Fn\naLz7uXI5abyPYxfwIc4UfeVypjlH+MWir1RWYA5wmHRWW9WcE7J9BPi3quds163T1+gn+6iEBRfx\n+PWIOJaWXwHqTXItSMsTx8/aJiJOAW8A7zrPvpqStBi4lsbZcuWySpohaS+NS2I7I6KSOYEvAH8E\n/F9hrIo5AQL4rqQ96d3jVcy6BPgJ8GVJz0oaljS7gjmLBoCH03KVc7ZFp4u+MqLxazc6neMdkmrA\n14FPR8SbxXVVyRoRb0fENTTOmFdJ+s0J6zueU9KtwImI2HOuOVXIWfCB9D29GVgn6YPFlRXJOpPG\nZdC/jYhrgbdoXAI5rSI5AUhv2rwN+MeJ66qUs506XfRNPyqhzY5L6gFI9yea5BpLyxPHz9pG0kwa\nf96+ep59nZOkS2iU/Fci4htVzgoQEa/TeAJ5dQVz3gDcJukIjU9R/ZCkhyqYE4CIGEv3J4BHaHwS\nbNWyHgWOpr/goPGk7PsqmPMdNwPPRMTx9LiqOdunU9eMCte0XqTxp+A7T8Yub+PxFnP2Nfq/5Own\nZT6Xlpdz9pMyL3LuJ2U+msbXcfaTMtvS8nwa1zPnpdthYP55Mgr4B+ALE8YrlRW4Cpibli8Hvgfc\nWrWcEzL3ceYafeVyArOBKwrL/07jl2cVs34PuDot/1nKWLmcaZutwMer+n/pYtw6WvTpG/JRGq8s\n+THwmTYe52HgGPC/NM5I7qFxLW0XjZc/fbf4DwF8JmU6SHqGPY33As+ldZs58zKry2j8aTiafije\nU9jmd9P4aPEH7hw5P0DjT8kfceZlYR+tWlbgt2i8XPFH6Rh/ksYrlXNC5j7OFH3lctJ49dkPOfOS\n1c9UOOs1wNPp3/+faJRZFXPOpnGGPacwVrmc7b75nbFmZpnr9DV6MzNrMxe9mVnmXPRmZplz0ZuZ\nZc5Fb2aWORe9mVnmXPRmZplz0ZuZZe7/AV2W/cdEyj5hAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train.SalePrice.hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a very skewed distibution. It has a high dynamic range. \n",
"Alex recommends that when you are predicting \n",
"a variable with high dynamic range, \n",
"it is better to predict its logarithm instead. \n",
"Engineers discovered that and introduced the Decibel (dB) scale.\n",
"See also this https://www.youtube.com/watch?v=_p-WyPg1sbU for fun. \n",
"\n",
"Its best to use log1p i.e. logarithm of 1 plus the quantity (for non-negative features). the 1+ ensures that features that are 0 do not make the logarithm explode. \n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Some data cleaning \n",
"\n",
"all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],\n",
" test.loc[:,'MSSubClass':'SaleCondition']))\n",
"\n",
"#log transform the target:\n",
"train[\"SalePrice\"] = np.log1p(train[\"SalePrice\"])\n",
"\n",
"#log transform skewed numeric features:\n",
"numeric_feats = all_data.dtypes[all_data.dtypes != \"object\"].index\n",
"\n",
"skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness\n",
"skewed_feats = skewed_feats[skewed_feats > 0.75]\n",
"skewed_feats = skewed_feats.index #Lets mark which features are very skewed.\n",
"\n",
"all_data[skewed_feats] = np.log1p(all_data[skewed_feats])\n",
"\n",
"\n",
"all_data = pd.get_dummies(all_data)\n",
"all_data = all_data.fillna(all_data.mean())\n",
"X_train = all_data[:train.shape[0]]\n",
"X_test = all_data[train.shape[0]:]\n",
"y = train.SalePrice"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"