Paweł Jankiewicz

Pawel

How much time I spend depends on how sure I am that I can win
it. When you start doing well in a competition you cannot stop - it's
addictive and you must keep it going. You cannot win a competition
without hard work - so expect it to take a few hours a day. It
really depends on what you want to achieve.

The most difficult was the Flight Quest
(http://www.gequest.com/c/flight) with 30+ messy tables and no
structure. They just gave the data and said "play with it". This was
about 300 hours over 2.5 months.

My team won more than $50,000 (plus an amount I cannot reveal that I
won during a private competition for Allstate - I signed an NDA for that).

--

My opinion is that attitude is very important, as is the ability to
keep cool until the end. Even if you complete 99% of a task, the last
1% tends to be 100 times more difficult than the rest.

--

My skills:

- SQL (~5 years) - In the competitions where I took 1st and 2nd places
  I used SQL 90% of the time. I cannot stress enough how important it
  was for me. You must be able to translate every idea into SQL. It
  also gives you the power to quickly iterate on simple solutions.

- Python (~6 years) - Very handy when you must process unstructured
  text data. There is the scikit-learn library that Kagglers are using
  with success for machine learning; I haven't used it much, but it is
  definitely worth learning.

- R (~2 years) - Used mostly for modeling. I connected it to SQL with
  the RODBC library, which I found a very pleasant mix.

To get started, enter a competition that you find the most interesting
and don't stop trying. The learning curve is steep, so at the
beginning you must put a lot of time into it. I found it very helpful
to join a team - you can discuss the problems with someone else, which
is enlightening.

Apart from that, every problem needs a framework/pipeline that lets you
control the abstraction. The ideal situation is when you can
process the raw data -> clean it -> add features -> make models ->
blend models in a few lines of code.
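A minimal sketch of what such a pipeline skeleton could look like in Python. This is only an illustration of the idea, not how my pipelines were actually built; the file name, the target column and the feature logic are all made up:

```python
# A hypothetical pipeline: each stage sits behind one function, so an
# experiment reads as a few lines of code.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def load_raw(path):
    return pd.read_csv(path)

def clean(df):
    # drop exact duplicates and fill missing numeric values with medians
    return df.drop_duplicates().fillna(df.median(numeric_only=True))

def add_features(df):
    # feature engineering would go here; identity for the sketch
    return df

def make_models(df, target):
    X, y = df.drop(columns=[target]), df[target]
    models = [LinearRegression().fit(X, y),
              RandomForestRegressor(n_estimators=200, n_jobs=-1).fit(X, y)]
    return models, X

def blend(models, X):
    # simple average blend of the individual model predictions
    return np.mean([m.predict(X) for m in models], axis=0)

# raw -> clean -> features -> models -> blend:
# df = add_features(clean(load_raw("train.csv")))        # hypothetical file
# models, X = make_models(df, target="response")         # hypothetical column
# predictions = blend(models, X)
```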

--

GBM is definitely a killer algorithm (the R gbm package or the Python
scikit-learn implementation, which is very good). In a private
competition for Allstate it turned out that almost everyone used it :).
I personally like random forests, which can easily be run in parallel.
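For reference, a sketch of fitting both in scikit-learn on synthetic data (hyperparameters here are arbitrary placeholders, not the settings used in any competition); the random forest trains across all cores via n_jobs=-1:

```python
# GBM and a parallel random forest with scikit-learn on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=5000, n_features=20, noise=5, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X, y)

# n_jobs=-1 grows the trees in parallel on all available cores.
rf = RandomForestRegressor(n_estimators=500, n_jobs=-1,
                           random_state=0).fit(X, y)
```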

What I also like to do is combine models sequentially. For example,
create a linear model and then, as a second step, use random forests
(with the residual errors as the response). This is called "stacking".

Apart from that, I plan to learn more about deep learning - it seems
that it could dominate Kaggle competitions soon.

--

Good question :). This is the trickiest thing. If you're familiar with
the concept of cross validation, the second model is fed with the
out-of-fold vectors of residual errors the first model made on each fold.

If you're not familiar with cross validation (which you should be), then
a simplified explanation could be:

1. Cut the data in half A and B
2. Training:
    - train on A and predict B, save predictions as B'
    - train on B and predict A, save predictions as A'
3. Construct a full vector of predictions from A' + B'

After this you have a full vector of predicted responses. What is
important is that the predictions are not biased in any way. You can
calculate the residual errors and model them in the second stage. You
can repeat the procedure many times, switching the algorithms. It
sounds strange but it works.
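A minimal sketch of this two-stage procedure with scikit-learn, using k folds instead of the two halves A and B from the simplified explanation (synthetic data; model choices are placeholders):

```python
# Stacking on out-of-fold residuals: the second model only ever sees
# residuals of predictions made on data the first model was not trained on.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=2000, n_features=15, noise=10, random_state=0)

# Stage 1: out-of-fold predictions (the A'/B' idea generalised to 5 folds).
oof_pred = cross_val_predict(LinearRegression(), X, y, cv=5)
residuals = y - oof_pred

# Stage 2: model the residual errors with random forests.
stage2 = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=0)
stage2.fit(X, residuals)

# Final prediction = stage-1 prediction + predicted residual correction.
stage1 = LinearRegression().fit(X, y)
final_pred = stage1.predict(X) + stage2.predict(X)
```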

--

As for the second part of your question, I use data visualization
constantly. Maybe "visualization" is too strong a word for what I
do. Most of the time I use plot(x, y) type charts. They really help
me catch patterns.
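For illustration, the Python equivalent of that kind of plot(x, y) check could be a plain scatter of a feature against the response (or against the residual errors), on made-up data:

```python
# A simple plot(x, y) style check to eyeball patterns in the data.
import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(1000)
y = 3 * x + np.random.normal(scale=0.3, size=1000)

plt.scatter(x, y, s=5, alpha=0.5)
plt.xlabel("feature")
plt.ylabel("response")
plt.show()
```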

--

I used a dedicated server for my purposes
http://www.hosteurope.de/Server/Root-Server/

--

There are 4 introductory competitions on Kaggle prepared for learning
purposes. They come with nice tutorials. It is really simple to get
started.

--

Maybe it will be disappointing or a cliché to say that 90% of machine
learning is preparing, processing and cleaning the data, but it is
true. It can be less when a data set is already in good shape (on
Kaggle this is sometimes the case). Out of that 90%, 80% of the time is
calculating "sums and averages" (I've heard this somewhere).
Statistical knowledge is not critical for that part.

For the remaining 10%, ML knowledge is a must and statistics can be
handy. To achieve good accuracy you need to know several machine
learning algorithms inside out. Read through the forums of closed
competitions - there are threads in which competitors discuss the
strategies they used. Reading them is how I started.

Now some controversial tips. There were several phases I went through
in these competitions.  

1. At first I was crazy about overfitting and preventing it.  Don't
think about it - if it comes it will strike you in the
forehead. Believe me.

2. Then I was focused on cross validation. Every model I created was
cross validated. Now I rarely do it. If the data is big enough, all you
need is a well selected holdout set (see the sketch after this list).

3. Finally I created super complicated processes for variable
selection (like wrapper selection). It took too much time. The hard
truth is you don't need to select variables (at least the modern
algorithms don't need it). Also, I prefer to think not about selecting
variables but about eliminating unnecessary ones, and only to save some
CPU time.
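The holdout idea from point 2, as a minimal scikit-learn sketch on synthetic data (not how any particular competition was actually validated):

```python
# A single well-selected holdout set instead of full cross validation:
# fit on the training part, score once on the held-out part.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, n_features=20, noise=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("holdout MSE:", mean_squared_error(y_hold, model.predict(X_hold)))
```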

--

1. It depends on the problem. The last 2 competitions I took part in
provided tables with many relations, so it was only logical to use SQL
for the task. Some other competitions have only 1 table of data - it
would make less sense to use SQL there. It is worth mentioning the
sqldf R package, which lets you write SQL queries on R data frames.

2. I also used Python's scikit-learn. To be honest, R is not a great
language to write in. It comes with great packages, but what you get in
exchange is the worst type of scripting language I can think of. The
second thing that rescues the R language is the RStudio IDE.

3. The ML course was great, but as I have said, I didn't implement any
algorithm from A to Z in a production environment. Andrew Ng's course
is very low-level stuff - at least some parts are. My sweet spot is at
a slightly higher level of abstraction. Apart from implementation, the
ML course teaches you about the bias-variance trade-off - for me that
was the most important thing.

--

Of course it depends on the problem. When the algorithm is not
scalable, many times the only choice is to eliminate variables. Maybe
that's why I like decision trees, which are very scalable, so I can get
away with this. I tried PCA/SVD dimension reduction many times and had
no success with it. I should pause here, because I'm writing about
competitions, and by "no success" I mean that it didn't improve
accuracy. If you can lose 1% of accuracy by removing 50% of the
variables, this IS the way to go in real life.

Personally I like the concepts of random projections and feature
hashing for dimension reduction.
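For illustration, both ideas are available in scikit-learn; this is a generic sketch on made-up data, not a recipe from any competition:

```python
# Feature hashing and random projections as cheap dimension reduction.
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.random_projection import SparseRandomProjection

# Feature hashing: map high-cardinality categorical features into a
# fixed number of columns without keeping a dictionary.
hasher = FeatureHasher(n_features=256, input_type="string")
rows = [["user=123", "resource=abc"], ["user=456", "resource=xyz"]]
X_hashed = hasher.transform(rows)

# Random projection: project dense numeric data down to fewer dimensions
# while roughly preserving distances between rows.
X = np.random.rand(1000, 500)
X_small = SparseRandomProjection(n_components=50, random_state=0).fit_transform(X)
```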

--

Apart from that, I can tell you what gave me the most headaches:

1. Testing - it is really painful not to be able to easily change one
part of the code while knowing that nothing else breaks.

2. Version control - many times I've created something that I
couldn't recreate. This is the most frustrating thing ever. So from
day 0 try to create automated processes that are easy to
streamline.

--

Obviously you don't start with the most difficult approach you can
think of. We used stacking because over the course of the competition
we experimented a lot with different settings and this proved to be
the best. For example, I had an intuition about flight delay
prediction: the problem consists of many independent events. Let's say
that there is fog over the airport (which causes a 5 minute delay vs
"no fog") plus a delay provoked by too much traffic (+10 minutes vs
"no traffic"). I guessed right that you could sum up these delays,
because fog and traffic are mostly independent. Thus we used a linear
model, which could catch the additive structure of the problem. By
using linear regression we lost the interactions between variables, so
decision trees came on stage as the second-stage algorithm.

What I find important is the ability to quickly iterate and experiment
as much as you can.

--


It's hard to say what supplemental material I can recommend. Getting
your hands dirty is by far the best way to augment any skill. I can
recommend this course, which is not obsessed with buzzwords. What sold
me was what our professor said about SQL being frowned upon in the era
of "big data", and his not agreeing with that. You can also google for
solutions to other data mining competitions, for example the winning
papers for the Netflix competition and the Heritage Health Prize
milestones. These are all valuable sources of ideas.

--

It took me 1 year. But in 2012 I spent almost all my free time on
these competitions. I got my hands very dirty with practical
problems. This also meant that I neglected Coursera :(.

--

I agree with Johan - decision trees are very good with imbalanced
data. I would go with boosted decision trees (GBM). You can:

1) Change the distribution of the observations to make it more
balanced: 50-50, 40-60, 30-70, etc. Validate the results to check which
is the best option.

2) Give more weight to the positive class (see the sketch after this
list). Then you can recalibrate the predictions if you want to get the
true probabilities, because after the preprocessing they will be skewed.
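A minimal sketch of both options with scikit-learn's GradientBoostingClassifier on synthetic data; the resampling ratio and the weight of 10 are arbitrary placeholders to be tuned by validation:

```python
# Two ways to handle imbalance with boosted trees:
# (1) resample the training set to a chosen ratio,
# (2) give more weight to the positive observations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=20000, weights=[0.95, 0.05],
                           random_state=0)

# (1) Undersample the negatives to roughly a 30-70 positive/negative split.
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
rng = np.random.default_rng(0)
neg_sample = rng.choice(neg, size=len(pos) * 7 // 3, replace=False)
idx = np.concatenate([pos, neg_sample])
clf_resampled = GradientBoostingClassifier().fit(X[idx], y[idx])

# (2) Keep all data but weight positive observations more heavily.
weights = np.where(y == 1, 10.0, 1.0)
clf_weighted = GradientBoostingClassifier().fit(X, y, sample_weight=weights)

# Note: after either trick the predicted probabilities are skewed and need
# recalibration if you want true probabilities.
```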

--

2. There are competitions in which errors are measured in a standard
way: log error, RMSE, etc. The problem appears only if the competition
uses some non-standard metric where off-the-shelf solutions won't do
you any good. I tend not to select features explicitly - I only remove
the non-performing ones (too sparse, too little variance, etc.). Then I
use a scalable algorithm. I do feature engineering on my own (in an
old-school fashion). I consider it my strongest point. I wouldn't do so
well in a "core machine learning" task.

3. I had no success using unsupervised methods (like PCA) in the
competitions. However, I didn't take part in any that would require
them (like the recent Black Box competition). At this point I'm looking
for a way to experiment with deep learning methods - it seems the
"deep" part is all about unsupervised learning.

--

I would use SQL in every competition that involves:

a) more than 2 tables of data

b) predicting the future - for example, whenever there is a date in the
competition and you are asked to predict the unseen future, it is
extremely important to prepare unbiased features in such a way that
you can estimate the model's performance without actually seeing
subsequent observations. SQL is very handy in such situations (see the
sketch below).
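A minimal sketch of the leakage-free feature idea from b), written in pandas rather than SQL (the SQL version would be a self-join or window restricted to earlier dates); the table and column names are made up:

```python
# Build features only from strictly earlier observations, so nothing from
# the future leaks into the row being predicted.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=8, freq="D"),
    "customer": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "sales": [10, 12, 9, 15, 3, 4, 5, 6],
}).sort_values(["customer", "date"])

# shift(1) excludes the current observation, expanding() accumulates the past.
df["past_mean_sales"] = (df.groupby("customer")["sales"]
                           .transform(lambda s: s.shift(1).expanding().mean()))
print(df)
```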

--

Q: Hi, I'm doing the challenge
http://www.kaggle.com/c/amazon-employee-access-challenge for homework.

Q: I built a decision tree on the training data. I didn't expand the
categorical features to binary features, so the code is like: if the
feature is in (a,b,c) go left, if the feature is in (d,e,f) go
right. But if the input is g, which is neither in the left (a,b,c) nor
in the right (d,e,f), how do I deal with this unseen data? Thanks.

If I understand you correctly, you encode the features as binary
ones. Maybe you should think of encoding them with 3 values: -1 for
left, +1 for right, and 0 if the category is unseen. This way it would
be unbiased. It is similar to the way you encode NAs in numerical
variables: when you normalize a vector V as (V - mean(V))/sd(V),
the NAs should be set to 0.
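A tiny Python sketch of both tricks, just to make the suggestion concrete (the category sets are the ones from the question):

```python
# Encode a categorical split as -1 / +1 / 0 so unseen categories stay
# neutral, and the analogous trick for NAs in a standardized numeric vector.
import numpy as np

left, right = {"a", "b", "c"}, {"d", "e", "f"}

def encode(value):
    if value in left:
        return -1
    if value in right:
        return +1
    return 0  # unseen category falls in the neutral middle

print([encode(v) for v in ["a", "e", "g"]])   # [-1, 1, 0]

# Numeric analogue: standardize, then NAs become 0 (i.e. the mean).
v = np.array([1.0, 2.0, np.nan, 4.0])
standardized = (v - np.nanmean(v)) / np.nanstd(v)
standardized[np.isnan(standardized)] = 0.0
```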

--

You should not predict the class but the probability that the
observation is of class 1. The accuracy is 94% because you predict all
observations as 1, which is bad. The competition metric is based on
AUC, which is used in situations where the distribution is skewed and
gives a more truthful representation of the errors. Depending on the
software you're using, you should be able to choose a probability
output instead of the raw class (1 or 0).
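In scikit-learn terms this means submitting predict_proba instead of predict; a generic sketch on synthetic, similarly skewed data:

```python
# Submit probabilities, not hard class labels: AUC is computed from the
# ranking given by the predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.94, 0.06], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of class 1, not the label
print("AUC:", roc_auc_score(y_te, proba))
```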

--

Yes, visualization helps when I really think about it. A good example
of visualization is when I noticed in the FlightQuest competition that
the model makes bigger errors close to the airport. There was a really
visible pattern, so I knew that I had to work on that area.

If there is overfitting, you'll see it :). Generally you will overfit
when using "intelligent" algorithms which are able to learn the data
"by heart". I believe that good data > good algorithm, so the most
preferable solution is to work on good features and use a simple
algorithm (linear regression, for example). If you're unable to create
meaningful features, you must rely on those "intelligent" algorithms,
which have a tendency to overfit, and then you must decide what
regularization you want to use.

--

This bond competition was very interesting. I lost 3rd place on the
last day when two users merged into a team (which is now forbidden
later than 7 days before the end). I was not happy, to say the
least. This is a regression task all the way.

You can get good results with 2 tricks:

1. Make the response stationary. You don't predict absolute values at
T0 but rather predict T0 − T1. How you define the response is sometimes
the most important thing - not only in this competition.

2. You must first convert all the transaction types into the
transaction type of the current transaction. So when you have
transactions 2 (response), 3, 4, 2, 3, they all must become 2, 2, 2, ...
It was rather complicated, but you can simplify it by adding or
removing some constants for buy/sell transactions. If you convert it
well, a simple average of the prices over the last 10 periods is the
best predictor, as far as I remember.
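A minimal sketch of the first trick in pandas; the column names here are only placeholders for the idea of predicting a delta instead of an absolute price:

```python
# Make the response stationary: model the change from the last observed
# price, then add the last observed price back at prediction time.
import pandas as pd

df = pd.DataFrame({
    "trade_price":       [101.2, 101.5, 100.9, 101.1],  # value at T0
    "trade_price_last1": [101.0, 101.2, 101.5, 100.9],  # value at T1
})

# Train on the difference instead of the absolute price.
df["target_delta"] = df["trade_price"] - df["trade_price_last1"]

# At prediction time: predicted_price = model_prediction + trade_price_last1
```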

You must convert any given trade_price to the type that you have to
predict - not necessarily to type 2. The trade prices may already be of
the correct type, so you don't have to make any conversions. If you are
comparing trade prices of different types, you are comparing apples to
oranges in my opinion - that's why you need to align them. This way
trade_price_1 will be even more correlated. Logistic regression returns
a probability between 0 and 1 - I think you should not use it.

--

First of all: the guys in the top 10 are not geniuses. Don't be
intimidated. If I'm not doing great I always think, "What are they
doing that I'm missing?" Sometimes all it comes down to is applying
some tricks: convert the response to something that is easier to model,
transpose the data, find the right algorithm, etc. And there is one
important thing: when you're improving by 0.0001, it is not the right
way. Don't get stuck with a bad approach. When you're improving by 20
places or more, then you know that you're on the right track.

--

Feature engineering is, for me, a matter of experiments. I make a lot
of correlation visualizations. What is particularly hard is using the
response variable in the process of creating the features. I would call
it supervised feature engineering. For example, in time-based problems
you have something that I call "stickiness" of correlation. It mostly
concerns categorical variables. When you compare the responses and how
they change in time, year over year, many times the correlation is
high. I very often use this fact to create meaningful features. It is
mostly trial and error. You create features -> make the model -> when
you see a large increase in performance, that means the features are
working. I seldom try one feature at a time. If I create one, I
replicate the idea to create more and I add them in sets.
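One possible reading of that "stickiness" idea, as a pandas sketch: encode each categorical level with the mean response of the previous year, so the target is never taken from the rows being predicted. The tiny table and column names are made up:

```python
# Supervised feature engineering for a time-based problem: previous-year
# mean response per category as a feature.
import pandas as pd

df = pd.DataFrame({
    "year":     [2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012],
    "category": ["a",  "b",  "a",  "a",  "b",  "b",  "a",  "b"],
    "response": [1.0,  3.0,  2.0,  1.5,  3.5,  2.5,  None, None],
})

# Mean response per (category, year), then shift it forward by one year.
prev = (df.groupby(["category", "year"])["response"].mean()
          .rename("prev_year_mean").reset_index())
prev["year"] += 1

df = df.merge(prev, on=["category", "year"], how="left")
print(df)   # 2012 rows get the 2011 means, 2011 rows get the 2010 means
```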

--

If you're serious about data science you should have more RAM; 32 GB
is a good number. The Amazon data is very sparse - most of those
columns are 0s - so scikit-learn is probably converting the data to a
dense matrix.
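A generic sketch of keeping that kind of data sparse end to end (synthetic Amazon-style categorical columns; many scikit-learn estimators, like logistic regression, accept sparse input directly):

```python
# Keep very sparse one-hot data in a scipy sparse matrix so it fits in RAM.
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: a few high-cardinality categorical columns.
X_cat = np.random.randint(0, 5000, size=(10000, 8))
y = np.random.randint(0, 2, size=10000)

X_sparse = OneHotEncoder(handle_unknown="ignore").fit_transform(X_cat)
print(sparse.issparse(X_sparse), X_sparse.shape)   # True, stays sparse

clf = LogisticRegression(max_iter=1000).fit(X_sparse, y)
```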