Continuing with Kaggle’s bike competition

Continuing with the Kaggle competition on bike rentals, I tried a decision tree classifier to estimate the importance of each feature. I have to admit I am not yet sure what the exact drawbacks of this method are (i.e. what this approach can miss), but at least it points you to the features you should look at more closely.

In scikit-learn this can be done by printing the feature_importances_ attribute of the trained decision tree classifier.

import pandas as pd
import numpy as np
from sklearn import tree

df = pd.read_csv('train_added.csv', index_col=0, parse_dates=[0])
y = df['count']

# delete the result columns so they cannot leak into the features
df = df.drop(columns=['count', 'casual', 'registered'])

# fit the tree on all remaining columns
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df, y)

# print one importance value per feature column
for col, imp in zip(df.columns, clf.feature_importances_):
    print('%s: %f' % (col, imp))

For the Kaggle dataset this prints:

season: 0.060665
holiday: 0.008030
workingday: 0.021309
weather: 0.061653
temp: 0.147404
atemp: 0.154334
humidity: 0.222730
windspeed: 0.211515
hour: 0.112359

We already found out that the hour of the day has an impact on bike rentals, but it seems we should also have a look at windspeed, humidity and temperature. In particular, we should check the correlation between temp and atemp, so that they don’t break our predictor later in case we use a model that assumes independence between variables.
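With more than a handful of columns it can help to sort the importances instead of reading them in column order. The following is only a sketch on synthetic data (the column names signal, noise_a and noise_b are made up, not part of the Kaggle set); it pairs each value of feature_importances_ with its column name and sorts them.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic stand-in data (not the Kaggle set): only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.random(500),
    "noise_a": rng.random(500),
    "noise_b": rng.random(500),
})
y = (X["signal"] > 0.5).astype(int)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort in descending order.
ranked = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

The importances of a fitted tree sum to 1, so the sorted list reads as a share per feature.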

Feature analysis

So let’s have a closer look at the most important variables from the decision tree. At least for humidity and temp we can see a clear trend in bike rentals as the values rise or fall.

Bike rentals according to humidity

Bike rentals according to temperature

Bike rentals according to windspeed

However, I am uncertain about windspeed. One could see slightly non-linear behaviour towards the end, but the problem is that almost all values lie in the interval [0, 30]. Windspeed values higher than 40 are almost never observed.

Windspeed histogram

Thus, the whole right-hand part of the plot should be ignored. This lack of samples also explains the huge differences in bike rental count for windspeeds above 40: there are just not enough samples to give a reliable average value (a statistician would say the observation is not significant).
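One way to make the lack of samples visible is to print the number of rows next to the average for each windspeed bin. This is a minimal sketch on synthetic data; the distribution parameters and the bin width of 10 are assumptions chosen for illustration, only the column names windspeed and count come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the windspeed and count columns (not the Kaggle data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "windspeed": rng.gamma(shape=2.0, scale=6.0, size=5000),
    "count": rng.poisson(lam=180, size=5000),
})

# Bin windspeed in steps of 10 and report, per bin, how many rows back the mean.
bins = np.arange(0, 70, 10)
per_bin = (
    df.groupby(pd.cut(df["windspeed"], bins), observed=False)["count"]
      .agg(["size", "mean"])
)
print(per_bin)
```

A mean that rests on only a few rows should be read with the same caution as the right-hand part of the plot.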

As said, one should also check the correlation between temp and atemp, which I already did. Thus I know that atemp’s behaviour will be about the same as that of temp, so it does not give us much additional information here.

One can use NumPy’s function corrcoef to calculate a correlation matrix for several variables, for example for a subset of the columns of a dataframe. In this case the two desired columns are enough.

print(np.corrcoef(df['temp'], df['atemp']))

This little piece of code gives us the correlation matrix for temp and atemp (in this order, but since we only have two values it wouldn’t matter anyway):

[[ 1.          0.98494811]
 [ 0.98494811  1.        ]]

So the correlation of each column with itself is of course 1. However, temp and atemp also have a correlation of 0.985, which is quite high. Thus, when training a classifier we have to keep in mind that it might only work with independent columns (or at least it might work worse if there are strong dependencies).
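For more than two columns, the corr method of a pandas dataframe gives the same numbers with the column names attached. The sketch below uses synthetic data in which atemp is simply temp plus some noise; that relationship is an assumption for illustration, not a statement about the Kaggle data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: "atemp" is "temp" plus a little noise (not the Kaggle data).
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=1000)
df = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 2, size=1000),
    "humidity": rng.uniform(0, 100, size=1000),
})

# Pairwise Pearson correlations, labelled by column name.
corr = df[["temp", "atemp", "humidity"]].corr()
print(corr.round(3))

# The same number from NumPy for the temp/atemp pair.
r = np.corrcoef(df["temp"], df["atemp"])[0, 1]
```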

With the information just collected, we now have an even better understanding of the data. I’m still wondering whether I can get more information out of the windspeed column. Maybe it is more useful if separated by season? Maybe in winter people dislike harsh wind, but in summer they like it, because it feels cooler? Just an idea I will check later.

Gaussian-Mixture-Model in R

In my last post on the Kaggle bike sharing competition I plotted two distributions of the hours at which bikes were rented. One of them showed two peaks, and roughly it also looks like two Gaussian bell curves.

Aggregation over all working and non-workingdays

In such a case one can try to get further with a Gaussian mixture model. This is a model in which several Gaussian distributions, here two, are combined so that a new probability distribution results.

For this, we first have to transform the data in R so that it is one-dimensional and every value occurs with the correct frequency. So far we had stored that X bikes were rented at 7 o’clock and Y bikes at 8 o’clock; now the hour 7 has to occur X times and the hour 8 has to occur Y times. With such a list of samples one can then feed the functions that fit Gaussian distributions.

Conveniently, in R this takes only one line with rep after loading the data.

t = read.csv('train_altered.csv', header=TRUE)
hours = rep(t$hours, times=t$count)

This stores each hour in a vector multiple times, according to the number of bikes rented in that hour.
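To make the expansion concrete, here is a tiny worked example in Python using np.repeat, NumPy's counterpart to rep with a times vector. The numbers are made up.

```python
import numpy as np

# Toy table: 3 rentals at 7 o'clock, 2 rentals at 8 o'clock.
hour = np.array([7, 8])
count = np.array([3, 2])

# Each hour is repeated as often as bikes were rented in that hour.
hours = np.repeat(hour, count)
print(hours.tolist())  # [7, 7, 7, 8, 8]
```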

With the package mixtools one can then compute a Gaussian mixture model:

library(mixtools)
model = normalmixEM(hours)
plot(model, which=2)

Suboptimal GMM

Unfortunately, one can see that the Gaussian curves do not match the peaks. The reason is that towards the right, beyond the edge of the plot, samples are missing that could “fill up” the green Gaussian curve. Therefore the green curve is far too small and the red one has to slide to the right to compensate for the error in the middle of the plot.

However, one can see that much flatter tails can be achieved by rotating the histogram a little. That is, 4 o’clock should move to the far left edge of the plot and the earlier hours to the far right instead. Then both edges would be flat and a Gaussian model would fit better.

This, too, is quickly done in R.

hours_moved = (hours-4) %% 24

With this, a new model can be trained that fits the actual distribution considerably better.

model = normalmixEM(hours_moved)
plot(model, which=2)

Improved GMM after the shift
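The modulo shift can be checked on a few hand-picked hours. This small Python sketch mirrors the R expression (hours-4) %% 24 and also shows how to undo the shift; the sample hours are arbitrary.

```python
import numpy as np

hours = np.array([0, 3, 4, 12, 23])

# Rotate the axis so that 4 o'clock becomes position 0.
hours_moved = (hours - 4) % 24
print(hours_moved.tolist())  # [20, 23, 0, 8, 19]

# Undo the rotation to get back to clock hours.
hours_back = (hours_moved + 4) % 24
```

Hours before 4 o'clock wrap around to the right end of the axis, which is what flattens both edges of the histogram.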

Note: At the moment I do not yet see whether the GMM can help me in the Kaggle competition or not. For now I can only say with which probability a bike is rented at which hour (on working days, mind you). For Kaggle, however, I have to predict the number of bike rentals.
Still, it is of course good to have tried out GMMs and to have seen what effect shifting the axis can have here.

For reference, here is the same solution in Python with NumPy and scikit-learn:

import pandas as pd
import numpy as np
from sklearn import mixture
 
df = pd.read_csv('train_altered.csv', index_col=0, parse_dates=[0])
 
hours = np.repeat(df['hour'].values, df['count'].values)
hours_moved = (hours - 4) % 24
 
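np.random.seed(1)
# two-component mixture; current scikit-learn provides it as GaussianMixture
g = mixture.GaussianMixture(n_components=2)
# fit expects a 2-D array of shape (n_samples, n_features)
g.fit(hours_moved.reshape(-1, 1))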
 
print(g)
print(np.round(g.means_, 2))
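The fitted means live on the shifted axis, so they have to be moved back by 4 hours before they can be read as clock times. The following sketch shows this on synthetic data; the two peaks at 8 and 17.5 o'clock and their widths are assumptions chosen for illustration, not values taken from the Kaggle data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic rental hours (not the Kaggle data): a morning and an evening peak.
rng = np.random.default_rng(1)
hours = np.concatenate([
    rng.normal(8.0, 1.0, size=4000),
    rng.normal(17.5, 1.5, size=6000),
]) % 24
hours_moved = (hours - 4) % 24

# GaussianMixture expects a 2-D array of shape (n_samples, n_features).
g = GaussianMixture(n_components=2, random_state=1).fit(hours_moved.reshape(-1, 1))

# Undo the 4-hour shift so the component means read as clock times.
peak_hours = np.sort((g.means_.ravel() + 4) % 24)
print(np.round(peak_hours, 2))
```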

My first-impressions approach to the Kaggle “Bike Sharing Demand” contest

In this article I will share my approach to the Kaggle contest named “Bike Sharing Demand”. In my opinion it is quite an easy dataset, which makes it a good one for me to learn with. It is also a very good dataset for visualisations.

The first obvious dependency that came to my mind was the one between temperature and bike count, because of course many people would want to use a bike when it’s warm, but not when it’s cold. I plotted both curves (all values normalized, so they have about the same amplitude) and got this result (the red line is always the bike usage count and the blue line is the temperature):

Bike usage all days
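The kind of normalization used for the plot is not spelled out above. One common choice, assumed here, is min-max scaling, which maps every column to the range [0, 1] so that curves with different units can share one axis. The helper name min_max and the toy values are made up for this sketch.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Toy values standing in for the "temp" and "count" columns.
df = pd.DataFrame({"temp": [5.0, 10.0, 30.0], "count": [20, 100, 420]})
scaled = df.apply(min_max)
print(scaled)
```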

In this plot I noticed two further interesting aspects. One of them should have been obvious: bike demand is lower in the winter months and higher in the summer months. The other one was less trivial, yet important for prediction: bike demand rose steadily between January 2011 and December 2012, similar to a plotted business cycle. So from this distant view, we have a sine curve superimposed on a rising slope (in blue).

Average slope over all months
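A rising slope like this can be estimated by averaging per month and fitting a straight line through the monthly means. The sketch below does so on a synthetic daily series built from a line plus a yearly wave; the series and all its parameters are assumptions, and a straight-line fit is just one simple way to capture the trend.

```python
import numpy as np
import pandas as pd

# Synthetic daily demand for 2011-2012 (not the Kaggle data):
# a rising line plus a yearly seasonal wave.
idx = pd.date_range("2011-01-01", "2012-12-31", freq="D")
t = np.arange(len(idx))
demand = 100 + 0.2 * t - 40 * np.cos(2 * np.pi * t / 365.25)
series = pd.Series(demand, index=idx)

# Average per month, then fit a straight line through the monthly means.
monthly = series.resample("MS").mean()
months = np.arange(len(monthly))
slope, intercept = np.polyfit(months, monthly.to_numpy(), deg=1)
print(f"estimated increase per month: {slope:.1f}")
```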

However, I was confused by the plot, because the lines kept going up and down so densely that the whole plot was filled with colour. It then came to my mind that the data is not given per day but per hour. This means that the bike count probably drops pretty low at night. So I plotted one individual day, which gave me this (here the blue line is the temperature again):

Bike usage on a workday

Here one can see very well that bike demand is high in the morning and in the evening, which looks like the times people travel to and from work. Oh! Then the bike demand could be different on weekends? I checked in a calendar for 2011 whether the 1st of March was a working day. And yes, it was. So I plotted a weekend day.

Bike usage on a Saturday

We can see a totally different situation now.
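The calendar lookup can also be done in code. This short sketch asks pandas for the weekday of the 1st of March 2011; note that it only distinguishes weekdays from weekends and knows nothing about public holidays.

```python
import pandas as pd

day = pd.Timestamp("2011-03-01")
print(day.day_name())  # Tuesday

# Monday is 0, so 5 and 6 are Saturday and Sunday.
is_weekend = day.dayofweek >= 5
```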

If we aggregate over all working days and non-working days, we can see that this clearly is a pattern and not a coincidence. On working days we have two peaks at 7/8 o’clock and at 17/18 o’clock (when people commute to and from work), whereas on non-working days there is only one broad peak between 12 and 17 o’clock (when people usually go on leisure trips).

Aggregation over all working and non-workingdays
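An aggregation like this can be written as a single groupby over the hour and the workingday flag. The sketch below runs on synthetic data whose peak positions are assumptions modelled on the description above; it also assumes that an hour column already exists.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (not the Kaggle data).
rng = np.random.default_rng(0)
n = 4000
hour = rng.integers(0, 24, size=n)
workingday = rng.integers(0, 2, size=n)

# Expected demand: two commuter peaks on working days, one broad bump otherwise.
commute = (30 + 200 * np.exp(-0.5 * (hour - 8) ** 2)
           + 250 * np.exp(-0.5 * ((hour - 17.5) / 1.5) ** 2))
leisure = 30 + 220 * np.exp(-0.5 * ((hour - 14) / 3.0) ** 2)
expected = np.where(workingday == 1, commute, leisure)

df = pd.DataFrame({"hour": hour, "workingday": workingday,
                   "count": rng.poisson(expected)})

# Mean count per hour, one column per value of workingday.
profile = df.groupby(["hour", "workingday"])["count"].mean().unstack("workingday")
print(profile.round(1))
```

Each column of profile is one of the two curves, so calling profile.plot() would draw both in one figure.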

So what did I learn from this? First of all, if you have some class columns (R calls them factors), it is always a good idea to check whether the data differs significantly between the classes. And secondly, you cannot always rely on the features you already have: in this case the time of day together with the factor “workday?” is important information. However, the time of day is not given on its own in the Kaggle dataset. Instead, you have to extract it from the datetime column yourself. In the format in which it is given, you cannot aggregate over the time alone.
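Extracting the hour could look roughly like this with pandas. The column name datetime is taken from the text above; the two rows and their timestamp layout are made up for the sketch, so the real file may need different parsing.

```python
import pandas as pd

# Two toy rows; the timestamp layout is an assumption, the values are made up.
df = pd.DataFrame({
    "datetime": ["2011-03-01 07:00:00", "2011-03-05 14:00:00"],
    "count": [120, 95],
})

# Parse the strings, then derive the hour and the weekday as new columns.
df["datetime"] = pd.to_datetime(df["datetime"])
df["hour"] = df["datetime"].dt.hour
df["weekday"] = df["datetime"].dt.dayofweek  # Monday is 0
print(df)
```

With the hour as its own column, aggregating over the time of day alone becomes a plain groupby.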