Basic Feature Engineering Techniques for ML
In this article I want to describe some basic data transformation and analysis techniques for preparing data for modelling. For demonstration I use the Medical Cost Personal Dataset from Kaggle.
Medical Cost Personal Datasets Insurance Forecast by using Linear Regression
import pandas as pd
from matplotlib import pyplot as plt
import os
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from scipy.stats import ttest_ind
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
os.chdir('/python/insurence/')
df = pd.read_csv("insurance.csv")
df.head(5)
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Numerical
- age: age of the primary beneficiary
- bmi: body mass index, an objective measure of body weight relative to height (kg/m²); values of 18.5 to 24.9 are considered ideal
- charges: individual medical costs billed by health insurance
- children: number of children covered by health insurance / number of dependents

Categorical
- sex: insurance contractor gender (female, male)
- smoker: smoking or not
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
df.describe()
| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Charges distribution
Let's explore the distributions of the numerical features, starting with charges. Our goal is to bring each feature close to a normal distribution before using this data for modelling.
#Charges distribution (sns.distplot is deprecated in newer seaborn; sns.histplot(df['charges'], kde=True) is the modern equivalent)
sns.distplot(df['charges'])
We can see that the charges distribution is asymmetrical rather than normal. The task of the data scientist during feature preparation is to bring the data to a well-behaved, close-to-normal shape. There are several techniques for that:
- Imputation
- Handling Outliers
- Binning
- Log Transform
- Feature Encoding
- Grouping Operations
- Scaling
We will apply some of them, depending on the behaviour of each feature sample.
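Two techniques from this list, binning and feature encoding, are not demonstrated later in the article, so here is a minimal illustrative sketch of both on a copy of the data (the bin edges, labels and column names are arbitrary choices of mine, not part of the original notebook):

#minimal sketch of binning and feature encoding (on a copy, so the rest of the analysis is unaffected)
df_fe = df.copy()
#binning: bucket bmi into coarse categories with arbitrary edges
df_fe['bmi_bin'] = pd.cut(df_fe['bmi'], bins=[0, 18.5, 25, 30, 100],
                          labels=['underweight', 'normal', 'overweight', 'obese'])
#feature encoding: LabelEncoder for a binary column, one-hot encoding for a multi-level one
df_fe['smoker_code'] = LabelEncoder().fit_transform(df_fe['smoker'])
df_fe = pd.get_dummies(df_fe, columns=['region'])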
Outliers in charges
Let's check how many outliers there are in the charges distribution.
sns.boxplot(x=df['charges'])
There are a lot of outliers on the right side. There are several approaches for handling them. I'll try deleting the outliers and replacing them with the median value, and compare the results.
#Set feature
feature = 'charges'
Imputation and Handling outliers
Below I calculate the lower and upper limit values for a sample. Values outside this range (lower limit, upper limit) are considered outliers.
#calculate lower and upper limit values for a sample
def boundary_values(feature):
    feature_q25, feature_q75 = np.percentile(df[feature], 25), np.percentile(df[feature], 75)
    feature_IQR = feature_q75 - feature_q25  #interquartile range (IQR)
    threshold = feature_IQR * 1.5
    feature_lower, feature_upper = feature_q25 - threshold, feature_q75 + threshold
    print("Lower limit of " + feature + " distribution: " + str(feature_lower))
    print("Upper limit of " + feature + " distribution: " + str(feature_upper))
    return feature_lower, feature_upper
#create two new DataFrames with outliers handled:
#df_median - outliers replaced by the sample median
#df_del    - rows containing outliers deleted
def manage_outliers(df, feature_lower, feature_upper):
    df_median = df.copy()
    df_del = df.copy()
    #median of the non-outlier values only
    median = df.loc[(df[feature] < feature_upper) &
                    (df[feature] > feature_lower), feature].median()
    #replace outlier values of the feature with the median
    df_median.loc[df_median[feature] > feature_upper, feature] = np.nan
    df_median.loc[df_median[feature] < feature_lower, feature] = np.nan
    df_median[feature] = df_median[feature].fillna(median)
    #delete rows where the feature value is an outlier
    df_del.loc[df_del[feature] > feature_upper, feature] = np.nan
    df_del.loc[df_del[feature] < feature_lower, feature] = np.nan
    df_del.dropna(subset=[feature], inplace=True)
    return df_median, df_del
#calculate limits
x, y = boundary_values(feature)
#build the samples with handled outliers
df_median, df_del = manage_outliers(df, x, y)
Lower limit of charges distribution: -13109.1508975
Upper limit of charges distribution: 34489.350562499996
df_median['charges'].mean()
9770.084335798923
#compare basic statistics of the three samples
agg_funcs = ['mean', 'max', 'min', 'std', 'count']
df_agg = pd.DataFrame({
    'df': df['charges'].agg(agg_funcs),
    'df_median': df_median['charges'].agg(agg_funcs),
    'df_del': df_del['charges'].agg(agg_funcs),
})
df_agg
| | df | df_median | df_del |
|---|---|---|---|
| mean | 13270.422265 | 9770.084336 | 9927.753402 |
| max | 63770.428010 | 34472.841000 | 34472.841000 |
| min | 1121.873900 | 1121.873900 | 1121.873900 |
| std | 12110.011237 | 6870.056585 | 7241.158309 |
| count | 1338.000000 | 1338.000000 | 1199.000000 |
#distributions of charges after outlier handling
fig, ax = plt.subplots(2,2, figsize=(15, 5))
sns.distplot(ax = ax[0,0], x = df['charges'])
sns.distplot(ax = ax[1,0], x = df_median['charges'])
sns.distplot(ax = ax[1,1], x = df_del['charges'])
plt.show()
#collect skew and kurtosis for the three samples side by side
df_shape = df.agg(['skew', 'kurtosis']).transpose()
df_shape.rename(columns={'skew': 'skew_df', 'kurtosis': 'kurtosis_df'}, inplace=True)
shape_median = df_median.agg(['skew', 'kurtosis']).transpose()
df_shape['skew_median'] = shape_median['skew']
df_shape['kurtosis_median'] = shape_median['kurtosis']
shape_del = df_del.agg(['skew', 'kurtosis']).transpose()
df_shape['skew_del'] = shape_del['skew']
df_shape['kurtosis_del'] = shape_del['kurtosis']
df_shape
| | skew_df | kurtosis_df | skew_median | kurtosis_median | skew_del | kurtosis_del |
|---|---|---|---|---|---|---|
| age | 0.055673 | -1.245088 | 0.055673 | -1.245088 | 0.067588 | -1.255101 |
| bmi | 0.284047 | -0.050732 | 0.284047 | -0.050732 | 0.366750 | 0.011529 |
| children | 0.938380 | 0.202454 | 0.938380 | 0.202454 | 0.987108 | 0.318218 |
| charges | 1.515880 | 1.606299 | 1.304122 | 1.565420 | 1.178483 | 1.022970 |
Let's look at the charges row. There are some positive changes: skew and kurtosis decreased, but not significantly, so charges still needs additional improvement.
There are a few data transformation methods for correcting non-normality. We will try two of them (square root and log) and choose the better one for this dataset.
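Before going method by method, here is a small helper of my own (hypothetical, not part of the original notebook) that compares candidate transforms of a positive-valued column by how far their skew and kurtosis are from 0:

#hypothetical helper: compare candidate transforms of a positive-valued column
def compare_transforms(series):
    candidates = {
        'identity': series,
        'sqrt': np.sqrt(series),
        'log': np.log(series),
    }
    return pd.DataFrame({name: {'skew': s.skew(), 'kurtosis': s.kurtosis()}
                         for name, s in candidates.items()}).transpose()
compare_transforms(df['charges'])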
Square Root transformation
The square root method is typically used when the data is moderately skewed.
#add a square-root-transformed charges column to each sample
df.insert(len(df.columns), 'charges_Sqrt', np.sqrt(df['charges']))
df_median.insert(len(df_median.columns), 'charges_Sqrt', np.sqrt(df_median['charges']))
df_del.insert(len(df_del.columns), 'charges_Sqrt', np.sqrt(df_del['charges']))
fig, ax = plt.subplots(2,2, figsize=(15, 5))
sns.distplot(ax = ax[0,0], x = df['charges_Sqrt'])
sns.distplot(ax = ax[1,0], x = df_median['charges_Sqrt'])
sns.distplot(ax = ax[1,1], x = df_del['charges_Sqrt'])
plt.show()
#skew and kurtosis after the square root transformation
df_shape_sqrt = df.agg(['skew', 'kurtosis']).transpose()
df_shape_sqrt.rename(columns={'skew': 'skew_df', 'kurtosis': 'kurtosis_df'}, inplace=True)
shape_median = df_median.agg(['skew', 'kurtosis']).transpose()
df_shape_sqrt['skew_median'] = shape_median['skew']
df_shape_sqrt['kurtosis_median'] = shape_median['kurtosis']
shape_del = df_del.agg(['skew', 'kurtosis']).transpose()
df_shape_sqrt['skew_del'] = shape_del['skew']
df_shape_sqrt['kurtosis_del'] = shape_del['kurtosis']
df_shape_sqrt
| | skew_df | kurtosis_df | skew_median | kurtosis_median | skew_del | kurtosis_del |
|---|---|---|---|---|---|---|
| age | 0.055673 | -1.245088 | 0.055673 | -1.245088 | 0.067588 | -1.255101 |
| bmi | 0.284047 | -0.050732 | 0.284047 | -0.050732 | 0.366750 | 0.011529 |
| children | 0.938380 | 0.202454 | 0.938380 | 0.202454 | 0.987108 | 0.318218 |
| charges | 1.515880 | 1.606299 | 1.304122 | 1.565420 | 1.178483 | 1.022970 |
| charges_Sqrt | 0.795863 | -0.073061 | 0.479111 | -0.089718 | 0.440043 | -0.399537 |
Log transformation
#log transform
df.insert(len(df.columns), 'charges_log', np.log(df['charges']))
df_median.insert(len(df_median.columns), 'charges_log', np.log(df_median['charges']))
df_del.insert(len(df_del.columns), 'charges_log', np.log(df_del['charges']))
#skew and kurtosis after the log transformation
df_shape_log = df.agg(['skew', 'kurtosis']).transpose()
df_shape_log.rename(columns={'skew': 'skew_df', 'kurtosis': 'kurtosis_df'}, inplace=True)
shape_median = df_median.agg(['skew', 'kurtosis']).transpose()
df_shape_log['skew_median'] = shape_median['skew']
df_shape_log['kurtosis_median'] = shape_median['kurtosis']
shape_del = df_del.agg(['skew', 'kurtosis']).transpose()
df_shape_log['skew_del'] = shape_del['skew']
df_shape_log['kurtosis_del'] = shape_del['kurtosis']
df_shape_log
| | skew_df | kurtosis_df | skew_median | kurtosis_median | skew_del | kurtosis_del |
|---|---|---|---|---|---|---|
| age | 0.055673 | -1.245088 | 0.055673 | -1.245088 | 0.067588 | -1.255101 |
| bmi | 0.284047 | -0.050732 | 0.284047 | -0.050732 | 0.366750 | 0.011529 |
| children | 0.938380 | 0.202454 | 0.938380 | 0.202454 | 0.987108 | 0.318218 |
| charges | 1.515880 | 1.606299 | 1.304122 | 1.565420 | 1.178483 | 1.022970 |
| charges_Sqrt | 0.795863 | -0.073061 | 0.479111 | -0.089718 | 0.440043 | -0.399537 |
| charges_log | -0.090098 | -0.636667 | -0.393856 | -0.319703 | -0.328473 | -0.609327 |
fig, ax = plt.subplots(2,2, figsize=(15, 5))
sns.distplot(ax = ax[0,0], x = df['charges_log'])
sns.distplot(ax = ax[1,0], x = df_median['charges_log'])
sns.distplot(ax = ax[1,1], x= df_del['charges_log'])
plt.show()
In this table we can compare the skew-kurtosis pairs for the three DataFrames: unmodified, with outliers replaced by the median, and with outliers deleted.
All three pairs in the charges row look non-normal, because both values in each pair are far from 0.
After the square root transformation the best pair belongs to df_median, with the lowest skew and kurtosis at the same time.
With the log transformation, the values closest to a normal distribution are in the initial DataFrame, and df_del also shows a good result in reducing skewness.
The log transformation works well with asymmetrical data: if we compare the shapes on the graphs, the initial DataFrame is the most symmetrical.
Interim conclusions: the distribution is still not perfectly normal, but the transformations above give results that are good enough to keep working with the data. For modelling it makes sense to use either log-transformed charges or square-root-transformed charges with outliers replaced by the median. Deleting outliers only partly helps, mainly with the kurtosis.
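To show how a transformed target would actually be used, here is a minimal modelling sketch, assuming a scikit-learn linear regression (as in the dataset subtitle); the feature choice is an arbitrary illustration, and predictions are mapped back to the original scale with np.exp:

#minimal sketch: linear regression on log-transformed charges (illustrative only)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df[['age', 'bmi', 'children']]  #arbitrary numeric features for illustration
y = df['charges_log']               #log-transformed target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred_charges = np.exp(model.predict(X_test))  #predictions back on the original scale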
Addition
The Box-Cox transformation is a technique for transforming non-normal data into a normal shape. It attempts to bring a set of data to a normal distribution by finding the value of λ under which the transformed data is as close to normal as possible.
skewed_box_cox, lmda = stats.boxcox(df['charges'])
df['boxcox'] = skewed_box_cox  #store the transformed values; lmda is the fitted λ
sns.distplot(df['boxcox'])
df['boxcox'].skew()
-0.008734097133920404
df['boxcox'].kurtosis()
-0.6502935539475279
Box-Cox gives good results and, like the log transformation, can be used for charges.
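One practical note: if a model is trained on the Box-Cox-transformed target, its predictions have to be mapped back to the original scale. A minimal sketch using scipy.special.inv_boxcox and the λ fitted above:

#map Box-Cox-transformed values back to the original scale with the fitted λ
from scipy.special import inv_boxcox
restored = inv_boxcox(df['boxcox'], lmda)
np.allclose(restored, df['charges'])  #True - the original charges are recovered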
BMI Distribution
sns.distplot(df['bmi'])
At first glance, the distribution looks normal.
Shapiro Normality test
There is one more test that allows us to check the normality of a distribution: the Shapiro-Wilk test. The scipy library can be used for it.
from scipy import stats
#shapiro returns (test statistic, p-value); the p-value is at index 1
p_value = stats.shapiro(df['bmi'])[1]
if p_value <= 0.05:
    print("Distribution is non-normal")
else:
    print("Distribution is normal")
Distribution is non-normal
df.agg(['skew', 'kurtosis'])['bmi'].transpose()
skew 0.284047 kurtosis -0.050732 Name: bmi, dtype: float64
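As a cross-check, one can also run D'Agostino's K² test, which is built directly on skew and kurtosis (a minimal sketch; 0.05 is the conventional significance threshold):

#cross-check normality with D'Agostino's K² test (based on skew and kurtosis)
stat, p = stats.normaltest(df['bmi'])
print("K2 statistic: " + str(stat) + ", p-value: " + str(p))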
Interim conclusions: the Shapiro test formally rejects normality (with almost 1,400 observations it is sensitive to even small deviations), but the skew and kurtosis of bmi are close to 0, so the distribution can be treated as approximately normal.
Age Distribution
As we can see on the plot below, the density is fairly uniform across ages, except around age 20. Let's take a deeper look.
sns.distplot(df['age'])
Outliers for Age
sns.boxplot(x=df['age'])
df.agg(['skew', 'kurtosis'])['age'].transpose()
skew 0.055673 kurtosis -1.245088 Name: age, dtype: float64
df.describe()['age']
count 1338.000000
mean 39.207025
std 14.049960
min 18.000000
25% 27.000000
50% 39.000000
75% 51.000000
max 64.000000
Name: age, dtype: float64
df.describe()
| | age | bmi | children | charges | charges_Sqrt | charges_log |
|---|---|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 | 104.833605 | 9.098659 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 | 47.770734 | 0.919527 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 | 33.494386 | 7.022756 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 | 68.849739 | 8.463853 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 | 96.860893 | 9.146552 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 | 128.995729 | 9.719558 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 | 252.528074 | 11.063045 |
#calculate outlier limits for age
feature = 'age'
x,y = boundary_values(feature)
Lower limit of age distribution: -9.0
Upper limit of age distribution: 87.0
So we see that there are no outliers in this distribution. Let's look at the "left side" counts by ages:
df.groupby(['age'])['age'].count().head(10)
age
18 69
19 68
20 29
21 28
22 28
23 28
24 28
25 28
26 28
27 28
Name: age, dtype: int64
As we can see from the histogram and the last output, there is nearly twice as much data for ages 18 and 19 as for the other ages, and this should be corrected. I want to find the median count per age value and reduce the 18 and 19 year-old groups to this median.
#median count per age value
n = int(df['age'].value_counts().median())
df_18 = df[df['age'] == 18]
df_19 = df[df['age'] == 19]
#keep only the first n rows for ages 18 and 19, drop the rest
df = df.drop(df_18.iloc[n:].index)
df = df.drop(df_19.iloc[n:].index)
df.describe()['age']
count 1255.000000
mean 40.576892
std 13.422954
min 18.000000
25% 29.000000
50% 41.000000
75% 52.000000
max 64.000000
Name: age, dtype: float64
sns.distplot(df['age'])
df.agg(['skew', 'kurtosis'])['age'].transpose()
skew 0.004377
kurtosis -1.195052
Name: age, dtype: float64
We have reduced skewness and kurtosis a little bit.
Interim conclusions: it makes sense to leave this distribution as it is, because it covers all ages more or less equally and does not need to be made more normal.
Conclusions
- In this article I described the most typical, frequently used and effective transformation approaches for obtaining a normal distribution. These transformations matter for the modelling that follows: some models are sensitive to the shape of the data, and the data scientist has to invest effort in data preparation.
- As a result we can see that the log transformation is the most universal and effective technique: it solves most of the skewness and kurtosis problems. The Box-Cox transformation is also effective and flexible.
- It can happen that data looks non-normal while having no outliers or very high kurtosis. In such a situation it makes sense to analyse the data locally and adjust it manually, for example by deleting data points or replacing them with median/mean/max/min/random values.