import pandas as pd
import numpy as np
import warnings
%matplotlib inline
warnings.filterwarnings("ignore")
data_file = "/Users/princendhlovu/Downloads/dataset-of-10s.csv"
RawData = pd.read_csv(data_file)
RawData.head(5)
# drop the track, artist and uri columns
myData = RawData.drop(columns=['track','artist','uri'])
myData.head(5)
myData.describe()
# create a data description table
data_des = pd.DataFrame()
data_des['Features'] = myData.columns
data_des['Descriptions']= ['How suitable a track is for dancing ',
'A perceptual measure of intensity and activity',
'The estimated overall key of the track',
'The overall loudness of a track in decibels',
'The modality (major or minor) of a track',
'The presence of spoken words in a track',
'Whether the track is acoustic',
'Predicts whether a track contains no vocals',
'The presence of an audience in the recording',
'Musical positiveness conveyed by a track',
'Beats per minute',
'The duration of the track in milliseconds',
'An estimated overall time signature of a track',
'Timestamp of the start of the third section of the track',
'The number of sections the particular track has',
'The target variable for the track']
data_des['Scales']= ['ratio','ratio','ordinal','ratio','nominal','ratio','ratio','ratio','ratio',
'ratio','ratio','ratio','ratio','ratio','ratio','nominal']
data_des['Discrete/Continuous'] = ['Continuous','Continuous','Discrete','Continuous','Discrete',
'Continuous','Continuous','Continuous','Continuous','Continuous',
'Continuous','Discrete','Discrete','Continuous','Discrete',
'Discrete']
data_des['Range'] = ['0.062200-0.981000','0.000251-0.999000','0:C, 1:C#, 2:D, 3:Eb, 4:E, 5:F etc','-46.655000--0.149000','0 (Minor) and 1 (Major)',
'0.022500-0.956000','0-0.996000','0-0.995000','0.016700-0.982000','0-0.976000',
'39.369000-210.977000','29853-1734201','0-5','0-213.154990','2-88',
'0:flop, 1:hit']
data_des
# find data type
print(myData.info())
There are no missing values, so we are going to check for duplicates.
#Find the duplicate instances
index = myData.duplicated()
# find the number of duplicates
len(myData[index])
There are 139 duplicate rows. We drop them to improve data quality, since they may have been introduced by human error.
myData = myData.drop_duplicates()
idx = myData.duplicated()
len(myData[idx])
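As a small illustration of the duplicate check above, the sketch below runs `duplicated()` and `drop_duplicates()` on a made-up frame. The column values are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (made-up values): rows 0 and 2 are identical, as are rows 1 and 3
toy = pd.DataFrame({
    "tempo": [120.0, 95.5, 120.0, 95.5, 140.2],
    "key": [1, 7, 1, 7, 4],
})

# duplicated() flags every repeat of a row after its first occurrence
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()
n_remaining = len(deduped)

print(n_duplicates, n_remaining)
```

Running `duplicated()` again on the result is a cheap sanity check that nothing was missed, which is what the cell above does.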
We want to make the columns listed in to_bin categorical, so that we can group songs by their values in these columns. We bin each column into integer levels from 1 to 10, which makes the columns easier to cross with one another.
to_bin = ['danceability','energy','speechiness','acousticness','instrumentalness','liveness','valence']
for col in to_bin:
    myData[col] = np.digitize(myData[col],bins=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
myData.describe()
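To make the binning concrete, here is a minimal sketch of how `np.digitize` assigns levels with the same bin edges. The sample values are made up. Note that a value of exactly 1.0 would land in an 11th level; the ranges listed in the description table suggest none of the binned columns reaches 1.0.

```python
import numpy as np

# Same bin edges as in the cell above
edges = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

# Made-up sample values, including both ends of the [0, 1] range
sample = np.array([0.0, 0.05, 0.55, 0.95, 1.0])

# With the default right=False, a value x gets level i when edges[i-1] <= x < edges[i]
levels = np.digitize(sample, bins=edges)
print(levels.tolist())
```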
Normalizing features in the array to_norm
from sklearn.preprocessing import StandardScaler
to_norm = ['loudness','tempo','duration_ms','chorus_hit','sections']
def normalize(df):
    # min-max scale each column, then shift so the values lie in [-0.5, 0.5]
    result = df.copy()
    for feature in df.columns:
        max_val = df[feature].max()
        min_val = df[feature].min()
        result[feature] = (df[feature] - min_val)/(max_val - min_val) - 0.5
    return result
X = myData.copy()
X = X.drop(columns='target')
X[to_norm] = normalize(myData[to_norm]).astype(np.float32)
# np.int was removed in NumPy 1.24; the built-in int is equivalent here
y = myData.target.astype(int)
X.head(10)
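The normalization used above is a min-max scaling shifted by 0.5, so every scaled column spans [-0.5, 0.5]. The sketch below is a vectorised variant of the same idea on a made-up frame; the toy values are assumptions for illustration only.

```python
import pandas as pd

def normalize_centered(df):
    """Min-max scale every column, then shift so values lie in [-0.5, 0.5]."""
    lo, hi = df.min(), df.max()
    return (df - lo) / (hi - lo) - 0.5

# Made-up values, not taken from the dataset
toy = pd.DataFrame({
    "tempo": [60.0, 120.0, 180.0],
    "loudness": [-40.0, -10.0, 0.0],
})

scaled = normalize_centered(toy)
print(scaled)
```

A column whose minimum equals its maximum would divide by zero here, just as in the loop version, so constant columns need separate handling.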
X['danceability'].unique()
X['energy'].unique()
X['acousticness'].unique()
X['instrumentalness'].unique()
X['valence'].unique()
X['speechiness'].unique()
X['liveness'].unique()
y.unique()
key, energy, valence: key is the estimated overall key of the track, which often has to do with how upbeat or energetic the track is. Valence describes musical positiveness, and positive songs often correlate with key as well.
danceability, liveness: Liveness detects the presence of an audience in a track. Having a crowd increases the chances of making someone want to dance.
speechiness, acousticness, instrumentalness: Speechiness detects the presence of spoken words, acousticness determines whether the track is acoustic, and instrumentalness predicts whether the track contains no vocals. These three features all describe the vocals and overall sound of the track, and thus should be crossed.
time signature, energy: time signature gives the number of beats in each bar (tempo, not time signature, is the beats-per-minute measure), and we expect it to relate to how much energy the track has.
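A feature cross should give every distinct combination of values its own category. The sketch below shows one way to do that by joining each row's values into a string token and mapping tokens to integer codes. The helper names `cross_rows` and `encode` are hypothetical, and the rows are made up. It also shows why adding the integer levels together is not a substitute: different combinations can share the same total.

```python
def cross_rows(rows):
    """Join each row's values into one string token, e.g. (1, 4, 7) -> '1_4_7'."""
    return ["_".join(str(v) for v in row) for row in rows]

def encode(tokens):
    """Map each distinct token to an integer code, assigned in sorted order."""
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    return [vocab[t] for t in tokens], vocab

# Made-up rows of three binned features
rows = [(1, 4, 7), (7, 4, 1), (2, 5, 5), (1, 4, 7)]

tokens = cross_rows(rows)
codes, vocab = encode(tokens)

# Summing the levels collapses all four rows onto the same value
sums = [sum(r) for r in rows]

print(tokens, codes, sums)
```

Assigning codes in sorted order mirrors what a label encoder does, so the integer codes can be fed straight into an embedding layer.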
For this dataset, we are trying to predict whether a song will be a hit or a flop. It is in the artist's best interest to have this prediction in advance so that they can allocate marketing resources properly. If a song is going to flop, they may discard it or spend less money marketing it, whereas if it is going to be a hit, more financial resources should be on hand to market it so that it generates more revenue. In our model we want to minimize the number of False Positives, where we predict a hit for a song that goes on to flop, causing the artist to lose a lot of money marketing a song that won't top the charts. We can afford False Negatives because a mispredicted track can still find its way up the Billboard charts, by which point we would have noticed its potential and mobilised marketing resources to increase its reach. Our evaluation criterion is therefore precision, since we cannot live with False Positives. It is given by: $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
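As a quick worked example, precision counts how many predicted hits really were hits, TP / (TP + FP), while recall, TP / (TP + FN), counts how many real hits were found. The sketch below computes both by hand on made-up labels; the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 1 = hit, 0 = flop
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]

# 2 true positives, 1 false positive, 2 false negatives
precision_toy, recall_toy = precision_recall(y_true_toy, y_pred_toy)
print(precision_toy, recall_toy)
```

Only the false positive lowers precision here; the two missed hits affect recall alone, which matches the cost argument made above.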
#count the frequencies of classes
y.value_counts()
From the target counts we note an almost even balance of hits and flops (1s and 0s). We use scikit-learn's train_test_split to divide the dataset into 80% training and 20% testing, stratifying on the target column. Stratification ensures that both classes appear in the training and test sets in the same proportions as in the full dataset, so that neither class is favoured by the split.
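The effect of stratification is easiest to see on an imbalanced toy target. The sketch below uses made-up data with a 60/40 class ratio, and the `random_state` value is an assumption added only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 60 of class 0 and 40 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

train_rate = y_tr.mean()  # share of class 1 in the training set
test_rate = y_te.mean()   # share of class 1 in the test set
print(len(y_tr), len(y_te), train_rate, test_rate)
```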
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,test_size = 0.2)
X_train = pd.DataFrame(X_train)
X_train.columns = X.columns
X_test = pd.DataFrame(X_test)
X_test.columns = X.columns
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import concatenate
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.layers import Embedding, Flatten, Concatenate
from tensorflow.keras.models import Model
from sklearn.preprocessing import LabelEncoder
from functools import reduce
# possible crossing options:
# 'key','time_signature','danceability',
# 'energy','speechiness','acousticness',
# 'instrumentalness','liveness','valence'
cross_columns = [['key','time_signature','valence'],
['danceability', 'energy','instrumentalness'],
['speechiness','acousticness','liveness'],
# ['time_signature','energy']
]
# save categorical features
categorical_headers = ['key','time_signature']+to_bin
# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    # 1. cross the columns by joining each row's values into one string, so that
    #    every distinct combination gets its own category (summing the integer
    #    levels would merge different combinations that share the same total)
    X_crossed_train = ['_'.join(str(v) for v in row) for row in X_train[cols_list].values]
    X_crossed_test = ['_'.join(str(v) for v in row) for row in X_test[cols_list].values]
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    # 2. encode as integers
    enc.fit(np.hstack((np.array(X_crossed_train),np.array(X_crossed_test))))
    # 3. Save into dataframe with new name
    X_train[cross_col_name] = enc.transform(X_crossed_train)
    X_test[cross_col_name] = enc.transform(X_crossed_test)
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name)
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='wide_concat')
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
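The embedding layers in the model above treat each integer code "as if it were one hot encoded". The NumPy sketch below illustrates that equivalence: looking up a row of the weight matrix gives the same result as multiplying a one-hot vector by that matrix. The weights are random stand-ins rather than trained values, and the category count of 11 is an assumption chosen to match a column binned into levels 1 to 10.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 11                        # assumed: integer codes 0..10
embed_dim = int(np.sqrt(n_categories))   # same sizing rule as the model: floor(sqrt(N))

# Random stand-in for a trained embedding matrix, one row per category
weights = rng.normal(size=(n_categories, embed_dim))

# A small batch of integer category codes
codes = np.array([1, 10, 4, 1])

# An embedding lookup is plain row indexing...
looked_up = weights[codes]

# ...which matches one-hot encoding followed by a matrix product
one_hot = np.eye(n_categories)[codes]
via_matmul = one_hot @ weights

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

The lookup avoids materialising the one-hot matrix, which matters for crossed columns with many categories.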
%%time
model.compile(optimizer='sgd',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
from sklearn import metrics as mt
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
y_pred_0 = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
#false positive and true positive rates using roc
fpr_0, tpr_0, thresholds_0 = mt.roc_curve(y_test, y_pred_0)
#area under the curve
auc_0 = mt.auc(fpr_0, tpr_0)
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
# possible crossing options:
# 'key','time_signature','danceability',
# 'energy','speechiness','acousticness',
# 'instrumentalness','liveness','valence'
cross_columns = [['danceability','energy','valence'],
['key', 'danceability','liveness'],
['speechiness','acousticness','instrumentalness'],
['time_signature','energy']
]
# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    # 1. cross the columns by joining each row's values into one string, so that
    #    every distinct combination gets its own category (summing the integer
    #    levels would merge different combinations that share the same total)
    X_crossed_train = ['_'.join(str(v) for v in row) for row in X_train[cols_list].values]
    X_crossed_test = ['_'.join(str(v) for v in row) for row in X_test[cols_list].values]
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    # 2. encode as integers
    enc.fit(np.hstack((np.array(X_crossed_train),np.array(X_crossed_test))))
    # 3. Save into dataframe with new name
    X_train[cross_col_name] = enc.transform(X_crossed_train)
    X_test[cross_col_name] = enc.transform(X_crossed_test)
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name)
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='wide_concat')
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
%%time
model.compile(optimizer='sgd',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
yhat_best = yhat
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
y_pred_1 = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
#false positive and true positive rates using roc
fpr_1, tpr_1, thresholds_1 = mt.roc_curve(y_test, y_pred_1)
#area under the curve
auc_1 = mt.auc(fpr_1, tpr_1)
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
model10_hist_accur = history.history['precision']
model10_val_accur = history.history['val_precision']
model10_hist_loss = history.history['loss']
model10_val_loss = history.history['val_loss']
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='wide_concat')
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
%%time
model.compile(optimizer='adagrad',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
y_pred_2 = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
#false positive and true positive rates using roc
fpr_2, tpr_2, thresholds_2 = mt.roc_curve(y_test, y_pred_2)
#area under the curve
auc_2 = mt.auc(fpr_2, tpr_2)
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='wide_concat')
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
deep_branch = Dense(units=5,activation='relu', name='deep4')(deep_branch)
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
%%time
model.compile(optimizer='sgd',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
y_pred_3 = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
#false positive and true positive rates using roc
fpr_3, tpr_3, thresholds_3 = mt.roc_curve(y_test, y_pred_3)
#area under the curve
auc_3 = mt.auc(fpr_3, tpr_3)
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='wide_concat')
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=75,activation='relu', name='deep0')(deep_branch)
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
deep_branch = Dense(units=5,activation='relu', name='deep4')(deep_branch)
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
%%time
model.compile(optimizer='sgd',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
from sklearn import metrics
y_pred_4 = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
#false positive and true positive rates using roc
fpr_4, tpr_4, thresholds_4 = metrics.roc_curve(y_test, y_pred_4)
#area under the curve
auc_4 = metrics.auc(fpr_4, tpr_4)
plt.figure(figsize=(12,12))
#plot halfway line
plt.plot([0,1], [0,1], 'k--')
#plot for model 0 ROC
plt.plot(fpr_0, tpr_0, label='Model 0 (area = {:.3f})'.format(auc_0))
#plot for model 1 ROC
plt.plot(fpr_1, tpr_1, label='Model 1 (area = {:.3f})'.format(auc_1))
#plot for model 2 ROC
plt.plot(fpr_2, tpr_2, label='Model 2 (area = {:.3f})'.format(auc_2))
#plot for model 3 ROC
plt.plot(fpr_3, tpr_3, label='Model 3 (area = {:.3f})'.format(auc_3))
#plot for model 4 ROC
plt.plot(fpr_4, tpr_4, label='Model 4 (area = {:.3f})'.format(auc_4))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('All Wide and Deep ROC curves')
plt.legend(loc='best')
plt.show()
From the ROC curves above we note that model 1 performed better than the other models, so we are going to compare it with the standard multi-layer perceptron from the scikit-learn library.
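When comparing models by AUC it helps to remember what the number means: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes AUC this way in plain Python on made-up scores; the function name `pairwise_auc` is hypothetical. It also shows that rounding scores to hard 0/1 labels before computing AUC discards ranking information and generally gives a different value.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true_toy = [0, 0, 0, 1, 1, 1]
scores_toy = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# AUC from the continuous scores
auc_from_scores = pairwise_auc(y_true_toy, scores_toy)

# AUC after thresholding at 0.5, which creates many ties
hard_labels = [1 if s >= 0.5 else 0 for s in scores_toy]
auc_from_labels = pairwise_auc(y_true_toy, hard_labels)

print(auc_from_scores, auc_from_labels)
```

For a like-for-like comparison, both models should be scored from predicted probabilities rather than from thresholded labels.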
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn import metrics
# each feature is listed once; repeating a name would duplicate its column
data_features = ['key','time_signature','valence',
                 'danceability', 'energy','instrumentalness',
                 'speechiness','acousticness','liveness'
                ]
mlp = MLPClassifier(hidden_layer_sizes=(50,),
learning_rate_init=0.01,
random_state=1,
activation='relu')
mlp.fit(X_train[data_features], y_train)
yhat_mlp = mlp.predict(X_test[data_features])
print("MLP Accuracy Score: ", accuracy_score(y_test, yhat_mlp))
print("MLP Precision Score: ",precision_score(y_test,yhat_mlp))
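#false positive and true positive rates using roc
# (yhat_mlp holds hard 0/1 labels, so this curve has a single operating point;
#  mlp.predict_proba(X_test[data_features])[:,1] would trace the full curve)
fpr_sk, tpr_sk, thresholds_sk = metrics.roc_curve(y_test, yhat_mlp)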
#area under the curve
auc_sk = metrics.auc(fpr_sk, tpr_sk)
We note that the MLP has a higher precision score (0.7692) than our best-performing Wide and Deep network, which has a precision score of 0.6824.
plt.figure(figsize=(10,12))
#plot halfway line
plt.plot([0,1], [0,1], 'k--')
#plot for Wide and Deep ROC
plt.plot(fpr_4, tpr_4, label='Wide and Deep (area = {:.3f})'.format(auc_4))
#plot for MLP ROC
plt.plot(fpr_sk, tpr_sk, label='MLP (area = {:.3f})'.format(auc_sk))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Wide and Deep vs MLP ROC curve')
plt.legend(loc='best')
plt.show()
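The MLP curve above is drawn from the output of `mlp.predict`, which returns hard 0/1 labels, so the curve has only one operating point between its two corners. The sketch below illustrates, on made-up scores, how the AUC can change when the same predictions are thresholded first. It uses a hypothetical `roc_auc` helper based on the rank interpretation of AUC (the probability that a random positive is scored above a random negative). With scikit-learn, the scores would presumably come from `mlp.predict_proba(X_test[data_features])[:, 1]`.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# illustrative values only (not model output)
y_true_demo = [0, 0, 1, 1, 0, 1]
proba_demo = [0.10, 0.40, 0.35, 0.80, 0.55, 0.70]
# thresholding at 0.5 throws away the ranking within each predicted class
hard_demo = [1 if p >= 0.5 else 0 for p in proba_demo]

auc_from_proba = roc_auc(y_true_demo, proba_demo)
auc_from_labels = roc_auc(y_true_demo, hard_demo)
print(auc_from_proba, auc_from_labels)
```

Because the Wide and Deep curve is built from continuous outputs, comparing the two models on continuous scores would put them on an equal footing.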
We can conclude that our Wide and Deep neural network performed slightly better than the multilayer perceptron (MLP) from scikit-learn: its ROC curve lies closer to the top-left corner, with an AUC of 0.815 compared to 0.808 for the MLP. Next we carry out a McNemar test to compare the two models.
from statsmodels.stats.contingency_tables import mcnemar
# run the McNemar test on each model's confusion matrix (true labels vs. predictions)
result = mcnemar(mt.confusion_matrix(y_test,yhat_mlp), exact=False, correction=True)
result2 = mcnemar(mt.confusion_matrix(y_test,yhat_best), exact=False, correction=True)
# summarize the finding
print('statistic=%.3f, p-value=%.25f' % (result.statistic, result.pvalue))
print('statistic=%.3f, p-value=%.25f' % (result2.statistic, result2.pvalue))
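Since the p-value is less than 0.05, we reject the null hypothesis in each test. Note, however, that each test above is run on a single model's confusion matrix, so it compares that model's false positives with its false negatives rather than comparing the two models with each other. A larger or smaller p-value also does not tell us which model performs better. To compare the models directly, the test needs a contingency table that counts the test samples on which the two models disagree.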
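A paired McNemar comparison can be sketched as follows. This is a minimal illustration on made-up labels, not the notebook's result: `mcnemar_paired` is a hypothetical helper that counts the samples where exactly one model is correct and applies the chi-square form with continuity correction (one degree of freedom), which is the same formula that `exact=False, correction=True` selects in statsmodels.

```python
import math

def mcnemar_paired(y_true, pred_a, pred_b):
    """McNemar test on paired predictions from two models.
    Returns (b, c, statistic, p_value)."""
    # b: model A correct and model B wrong; c: model A wrong and model B correct
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        # the models never disagree, so there is no evidence of a difference
        return b, c, 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return b, c, statistic, p_value

# illustrative labels only (not taken from the dataset)
y_true_demo = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pred_a_demo = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
pred_b_demo = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
b, c, statistic, p_value = mcnemar_paired(y_true_demo, pred_a_demo, pred_b_demo)
print(b, c, statistic, p_value)
```

With the notebook's variables, the inputs would be flattened versions of `y_test`, `yhat_mlp` and `yhat_best`. The equivalent 2x2 table `[[both correct, b], [c, both wrong]]` can also be passed to `mcnemar` from statsmodels, where `exact=True` is the safer choice when `b + c` is small.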
Here we examine the effect of dropout on the ROC curve compared to our best-performing Wide and Deep model. We also check whether the training and validation loss and precision graphs differ.
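As a reminder of what a dropout layer does, here is a minimal sketch in plain Python (the `dropout` helper is hypothetical and is not the Keras implementation). During training each activation is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate), so the expected sum is unchanged. At inference the values pass through untouched.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations."""
    if not training or rate == 0.0:
        # inference: dropout is a no-op
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # keep each value with probability (1 - rate) and rescale it, else zero it
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
train_out = dropout(activations, rate=0.3, training=True, rng=rng)
infer_out = dropout(activations, rate=0.3, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(dropped_fraction, train_mean)
```

The dropped fraction should be close to the rate and the mean close to the original activation, which is why the same network can be used unchanged at prediction time.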
from keras.layers import Dropout
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = X_train[categorical_headers].to_numpy()
X_test_cat = X_test[categorical_headers].to_numpy()
# and save off the numeric features
X_train_num = X_train.drop(columns=categorical_headers).to_numpy()
X_test_num = X_test.drop(columns=categorical_headers).to_numpy()
# we need to create separate lists for each branch
crossed_outputs = []
# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    crossed_outputs.append(x)
# merging the branches together
wide_branch = concatenate(crossed_outputs, name='wide_concat')
wide_branch = Dense(units=1,activation='relu',name='num_0')(wide_branch)
wide_branch = Dropout(0.1)(wide_branch)
# reset this input branch
all_deep_branch_outputs = []
# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers):
    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(X_train[col].max(),X_test[col].max())+1
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_cat, idx, axis=1)
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)
    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)
# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=15, activation='relu',name='num_1')(input_num)
x_dense = Dropout(0.1)(x_dense)
all_deep_branch_outputs.append(x_dense)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=75,activation='relu', name='deep0')(deep_branch)
deep_branch = Dropout(0.3)(deep_branch)
print('Deep 0 created')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dropout(0.3)(deep_branch)
print('Deep 1 created')
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dropout(0.3)(deep_branch)
print('Deep 2 created')
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
deep_branch = Dropout(0.3)(deep_branch)
print('Deep 3 created')
deep_branch = Dense(units=5,activation='relu', name='deep4')(deep_branch)
deep_branch = Dropout(0.3)(deep_branch)
print('Deep 4 created')
# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
                     name='combined')(final_branch)
model = Model(inputs=[input_crossed,input_cat,input_num],
outputs=final_branch)
# model.summary()
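The comments above describe an embedding as a way to treat integer codes as if they were one-hot encoded. The sketch below illustrates that equivalence with plain Python lists: looking up row `i` of a weight matrix gives the same vector as multiplying a one-hot vector by that matrix. The codes and weights are made up, and the helper names are hypothetical. In Keras the weights are learned during training.

```python
import math

def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embedding_lookup(index, weights):
    """An embedding layer simply returns the row for the given integer code."""
    return weights[index]

def one_hot_times_weights(index, weights):
    """The same result computed as one_hot(index) @ weights."""
    vec = one_hot(index, len(weights))
    n_cols = len(weights[0])
    return [sum(vec[r] * weights[r][c] for r in range(len(weights)))
            for c in range(n_cols)]

# hypothetical categorical column with integer codes between 0 and 11
codes = [0, 5, 11, 3]
N = max(codes) + 1               # number of categories, as in the loops above
embed_dim = int(math.sqrt(N))    # same sizing rule as output_dim=int(np.sqrt(N))

# made-up weight matrix of shape (N, embed_dim)
weights = [[0.1 * r + 0.01 * c for c in range(embed_dim)] for r in range(N)]
looked_up = embedding_lookup(5, weights)
multiplied = one_hot_times_weights(5, weights)
print(N, embed_dim, looked_up, multiplied)
```

The lookup avoids building the one-hot vectors at all, which matters for crossed columns where `N` can be large.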
%%time
model.compile(optimizer='sgd',
loss='mean_squared_error',
metrics=['Precision'])
# lets also add the history variable to see how we are doing
# and lets add a validation set to keep track of our progress
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))
# predicted probabilities for the test set (computed once and reused below)
y_pred_dropout = model.predict([X_test_crossed,X_test_cat,X_test_num]).ravel()
# round the probabilities to hard 0/1 labels
yhat = np.round(y_pred_dropout)
yhat_drop = yhat
print(mt.confusion_matrix(y_test,yhat))
print(mt.precision_score(y_test,yhat))
# false positive and true positive rates from the ROC curve
fpr_dropout, tpr_dropout, thresholds_2 = mt.roc_curve(y_test, y_pred_dropout)
# area under the curve
auc_dropout = mt.auc(fpr_dropout, tpr_dropout)
plt.figure(figsize=(10,12))
#plot halfway line
plt.plot([0,1], [0,1], 'k--')
#plot for Wide and Deep ROC
plt.plot(fpr_4, tpr_4, label='Wide and Deep (area = {:.3f})'.format(auc_4))
#plot for Wide and Deep with Dropout ROC
plt.plot(fpr_dropout, tpr_dropout, label='Wide and Deep with Dropout (area = {:.3f})'.format(auc_dropout))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Wide and Deep vs Wide and Deep with Dropout ROC curve')
plt.legend(loc='best')
plt.show()
Here we note that our Wide and Deep model had a larger AUC without dropout, which suggests that the model was not overfitting.
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(history.history['precision'])
plt.ylabel('Precision %')
plt.title('Training with DropOut')
plt.subplot(2,2,2)
plt.plot(history.history['val_precision'])
plt.title('Validation with DropOut')
plt.subplot(2,2,3)
plt.plot(history.history['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(model10_hist_accur)
plt.ylabel('Precision %')
plt.title('Training without DropOut')
plt.subplot(2,2,2)
plt.plot(model10_val_accur)
plt.title('Validation without DropOut')
plt.subplot(2,2,3)
plt.plot(model10_hist_loss)
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(model10_val_loss)
plt.xlabel('epochs')
The validation precision without dropout is slightly higher than with dropout, but the validation curves for both precision and loss are more consistent when we use dropout than when we do not. This might be because our dataset is small; with more data samples, dropout might help the model generalize and so reduce overfitting. If there were no hardware constraints, we would increase the number of epochs to 30 and observe whether anything changes.
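One way to judge whether more epochs would help is to look at where the validation loss bottoms out and how far it sits from the training loss at the end. The sketch below does this for a dictionary shaped like `history.history`. The `summarize_history` helper is hypothetical and the curves are made up, not the values from our runs.

```python
def summarize_history(history_dict, train_key="loss", val_key="val_loss"):
    """Return the (0-based) epoch with the lowest validation loss
    and the final validation-minus-training loss gap."""
    train = history_dict[train_key]
    val = history_dict[val_key]
    best_epoch = min(range(len(val)), key=lambda i: val[i])
    final_gap = val[-1] - train[-1]
    return best_epoch, final_gap

# made-up curves standing in for `history.history`
demo_history = {
    "loss":     [0.25, 0.21, 0.19, 0.17, 0.15],
    "val_loss": [0.24, 0.22, 0.21, 0.22, 0.23],
}
best_epoch, final_gap = summarize_history(demo_history)
print(best_epoch, final_gap)
```

If the best epoch is the last one and the gap is small, training longer is likely worthwhile. If the validation loss bottomed out early while the gap keeps growing, more epochs would mostly add overfitting.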