==Pawel==
How much time I spend depends on how sure I am that I can win it. When you start doing well in a competition you cannot stop - it's addictive and you must keep it going. You cannot win a competition without hard work, so I expect that will be a few hours a day. It really depends on what you want to achieve. The most difficult was the Flight Quest (http://www.gequest.com/c/flight) with 30+ messy tables and no structure. They just gave us the data and said "play with it". That was about 300 hours over 2.5 months. My team won more than $50,000 (plus I cannot reveal the amount I won during a private competition for Allstate - I signed an NDA for that).

--

My opinion is that attitude is very important, as is the ability to keep cool until the end. Even if you complete a task to 99%, the last 1% tends to be 100 times more difficult than the rest.

--

My skills:

- SQL (~5 years) - In the competitions where I took 1st and 2nd places I used SQL 90% of the time. I cannot stress enough how important it was for me. You must be able to translate every idea into SQL. It also gives you the power to quickly iterate simple solutions.
- Python (~6 years) - Very handy when you must process unstructured text data. There is the scikit-learn library, which Kagglers use with success as a machine learning library. I haven't used it much, but it is definitely worth learning.
- R (~2 years) - Used mostly for modeling. I connected to SQL with the RODBC library, which I found a very pleasant mix.

To get started, enter the competition you find most interesting and don't stop trying. The learning curve is steep, so at the beginning you must put a lot of time into it. I found it very helpful to join a team - you can discuss the problems with someone else, which is enlightening. Apart from that, every problem needs a framework/pipeline that lets you control the abstraction.
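A pipeline in that spirit can be sketched as a chain of small stage functions, so a whole run fits in a few lines. All names and data below are hypothetical, just to show the shape of the idea:

```python
# Hypothetical sketch of a competition pipeline where each stage is a
# small function: raw data -> clean -> add features -> model.
import pandas as pd

def clean(df):
    # e.g. drop duplicate rows, fill missing values
    return df.drop_duplicates().fillna(0)

def add_features(df):
    # e.g. one simple derived feature
    df = df.copy()
    df["ratio"] = df["a"] / (df["b"] + 1)
    return df

def make_model(df):
    # stand-in for model fitting; here just a trivial "prediction"
    return df["ratio"].mean()

raw = pd.DataFrame({"a": [1, 2, 2], "b": [3, 4, 4]})
prediction = make_model(add_features(clean(raw)))
```

The point is not the toy stages but the composition: each step is swappable, and the whole run is one line at the bottom.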
The ideal situation is when you can process the raw data -> clean it -> add features -> make models -> blend models in a few lines of code.

--

GBM is definitely a killer algorithm (the R gbm package or the Python scikit-learn implementation, which is very good). In a private competition for Allstate it turned out that almost everyone used it :). I personally like random forests, which can easily be run in parallel. I also like to combine models sequentially: for example, create a linear model and then, as a second step, use random forests with the residual errors as the response. This is called "stacking". Apart from that, I plan to learn more about deep learning - it seems it could dominate Kaggle competitions soon.

--

Good question :). This is the trickiest thing. If you're familiar with the concept of cross validation: the second model is fed with the out-of-fold vectors of residual errors the first model made on each fold. If you're not familiar with cross validation (which you should be), a simplified explanation is:

1. Cut the data in half, A and B.
2. Training:
   - train on A and predict B, save predictions as B'
   - train on B and predict A, save predictions as A'
3. Construct a full vector of predictions from A' + B'.

After this you have a full vector of predicted responses. What is important is that the predictions are not biased in any way. You can calculate the residual errors and model them in the second stage. You can repeat the procedure many times, switching the algorithms. It sounds strange, but it works.

--

As for the second part of your question, I use data visualization constantly. Maybe "visualization" is too strong a word for what I do. Most of the time I use plot(x, y) type charts. It really helps me catch patterns.

--

I used a dedicated server for my purposes: http://www.hosteurope.de/Server/Root-Server/

--

There are four introductory competitions on Kaggle prepared for learning purposes. They come with nice tutorials. It is really simple to get started.
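The out-of-fold stacking procedure described above can be sketched with scikit-learn. The dataset and the choice of models here are illustrative assumptions (a linear first stage, random forests on the residuals, as in the answer about combining models sequentially):

```python
# Two-stage stacking on out-of-fold predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=0)

# Stage 1: build a full vector of out-of-fold predictions
# (the A'/B' idea above, here with k=2 folds).
oof_pred = np.zeros_like(y)
for train_idx, test_idx in KFold(n_splits=2, shuffle=True,
                                 random_state=0).split(X):
    lin = LinearRegression().fit(X[train_idx], y[train_idx])
    oof_pred[test_idx] = lin.predict(X[test_idx])

# Stage 2: model the residual errors of the first stage.
residuals = y - oof_pred
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X, residuals)

# Final prediction = first-stage model (refit on all data)
#                  + second-stage residual model.
final_pred = LinearRegression().fit(X, y).predict(X) + rf.predict(X)
```

Because the stage-1 predictions are made only on data the model never trained on, the residuals fed to stage 2 are not optimistically biased.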
--

Maybe it will be disappointing, or a cliché, to say that 90% of machine learning is preparing, processing and cleaning the data, but it is true. It can be less when a data set is already in good shape (on Kaggle this is sometimes the case). Of that 90% of the time, 80% is calculating "sums and averages" (I've heard this somewhere). Statistical knowledge is not critical for that part. For the remaining 10%, ML knowledge is a must and statistics can be handy. Then, to achieve good accuracy, you need to know several machine learning algorithms inside out. Read through the forums of closed competitions - there are threads in which competitors discuss the strategies they used. Reading them is how I started.

Now some controversial tips. There were several phases I went through in these competitions:

1. At first I was crazy about overfitting and preventing it. Don't think about it - if it comes, it will strike you in the forehead. Believe me.
2. Then I was focused on cross validation. Every model I created was cross validated. Now I rarely do it. If the data is big enough, all you need is a well-selected holdout set.
3. Finally, I created super complicated processes for variable selection (like wrapper selection). It took too much time. The hard truth is you don't need to select variables (at least the modern algorithms don't need it). I also prefer to think not about selecting variables but about eliminating unnecessary ones, and only to save some CPU time.

--

1. It depends on the problem. The last two competitions I took part in provided tables with many relations, so it was only logical to use SQL. Some other competitions have only one table of data, and it would make less sense to use SQL there. It is worth mentioning the sqldf R package, which lets you write SQL queries on R data frames.
2. I also used Python's scikit-learn. To be honest, R is not a great language to write in. It comes with great packages, but what you get in exchange is the worst kind of scripting language I can think of.
The second thing that rescues the R language is the RStudio IDE.
3. The ML course was great, but as I have said, I haven't implemented any algorithm from A to Z in a production environment. Andrew Ng's course is very low-level stuff - at least some parts are. My sweet spot is at a slightly higher level of abstraction. Apart from implementation, the ML course teaches you about the variance-bias trade-off - for me that was the most important thing.

--

Of course it depends on the problem. When the algorithm is not scalable, many times the only choice is to eliminate variables. Maybe that's why I like decision trees, which are very scalable, so I can get away with this. I tried PCA/SVD dimension reduction many times and had no success with it. I should qualify that: I'm writing about competitions, and by "no success" I mean that it didn't improve accuracy. If you can lose 1% of accuracy while removing 50% of the variables, that IS the way to go in real life. Personally I like the concepts of random projections and feature hashing for dimension reduction.

--

Apart from that, I can tell you what gave me the most headaches:
1. Testing - it is really painful not to be able to easily change one part of the code while knowing that nothing else breaks.
2. Version control - many times I've created something that I couldn't recreate. This is the most frustrating thing ever. So from day 0, try to create automated processes that are easy to streamline.

--

Obviously you don't start with the most difficult approach you can think of. We used stacking because over the course of the competition we experimented a lot with different settings, and this proved to be the best. For example, I had an intuition about flight delay prediction: the problem consists of many independent events. Let's say there is fog over the airport (which causes a 5-minute delay vs "no fog"), plus there can be a delay provoked by too much traffic (+10 minutes vs "no traffic").
I guessed right that you could sum up these delays, because fog and traffic are mostly independent. Thus we used a linear model, which could catch the additive structure of the problem. By using linear regression we lost the interactions between variables, so decision trees came on stage as the second-stage algorithm. What I find important is the ability to iterate quickly and experiment as much as you can.

--

It's hard to say what supplemental material I can recommend. Getting your hands dirty is by far the best way to augment any skill. I can recommend this course, which is not obsessed with buzzwords. What sold me was our professor's remark about SQL being frowned upon in the era of "big data", and his not agreeing with that. You can also google for solutions to other data mining competitions - for example, the winning papers for the Netflix competition and the Heritage Prize milestones. These are all valuable sources of ideas.

--

It took me 1 year. But in 2012 I spent almost all my free time on these competitions. I got my hands very dirty with practical problems. This also meant that I neglected Coursera :(.

--

I agree with Johan - decision trees are very good with imbalanced data. I would go with boosted decision trees (GBM). You can:
1) Change the distribution of the observations to make it more balanced: 50-50, 40-60, 30-70 etc. Validate the results to check which option is best.
2) Give more weight to the positive class.
Then you can recalibrate the predictions if you want to get the true probabilities, because after this preprocessing they will be skewed.

--

2. There are competitions in which errors are measured in a standard way: log error, RMSE etc. The problem appears only if the competition uses some non-standard metric, where off-the-shelf solutions won't do you any good. I tend not to select features explicitly - I only remove the non-performing ones (too sparse, too little variance etc.). Then I use a scalable algorithm. I do feature engineering on my own (in an old-school fashion).
I find it my strongest point. I wouldn't do so well in a "core machine learning" task.
3. I had no success using unsupervised methods (like PCA) in the competitions. However, I didn't take part in any that would require them (like the recent Black Box competition). At this point I'm looking for a way to experiment with deep learning methods - it seems the "deep" part is all about unsupervised learning.

--

I would use SQL in every competition that involves: a) more than 2 tables of data; b) predicting the future - for example, whenever there is a date in the competition and you are asked to predict an unseen future, it is extremely important to prepare unbiased features in such a way that you estimate the model's performance without actually seeing subsequent observations. SQL is very handy in such situations.

--

Q: Hi, I'm doing the challenge http://www.kaggle.com/c/amazon-employee-access-challenge for homework. I built a decision tree on the training data. I didn't expand the categorical features into binary features, so the code is like: if feature in (a,b,c) go left, if feature in (d,e,f) go right. But if the input is g, which is neither in the left set (a,b,c) nor in the right set (d,e,f), how do I deal with this unseen data? Thanks.

If I understand you correctly, you encode the features as binary ones. Maybe you should think of encoding them as 3 values: -1 for left, +1 for right and 0 if the category is unseen. This way it would be unbiased. It is similar to the way you encode NAs in numerical variables: when you normalize a vector V as (V - mean(V))/var(V), the NA should be set to 0.

--

You should not predict the class but the probability that the observation is of class 1. The accuracy is 94% because you predict all observations as 1, which is bad. The competition metric is based on AUC, which is used in situations where the class distribution is skewed and gives a more truthful representation of the errors.
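The difference between raw-class accuracy and probability-based AUC on skewed data can be sketched with scikit-learn (an assumption about the software, which the question doesn't specify; the data is synthetic, with roughly 94% of observations in one class to echo the trap above):

```python
# Accuracy vs AUC on imbalanced binary data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.94], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

hard = clf.predict(X)               # raw 0/1 classes
proba = clf.predict_proba(X)[:, 1]  # probability of class 1 - submit this

acc = accuracy_score(y, hard)   # looks high largely because of the skew
auc = roc_auc_score(y, proba)   # scores the ranking by probability
```

AUC only cares whether positives are ranked above negatives, so it cannot be gamed by predicting the majority class for everything.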
Depending on the software you're using, you should be able to choose a probability output instead of the raw class (1 or 0).

--

Yes, visualization helps, when I really think about it. A good example is when I noticed in the FlightQuest competition that the model makes bigger errors close to the airport. There was a really visible pattern, so I knew that I had to work on this area. If there is overfitting, you'll see it :). Generally you will overfit when using "intelligent" algorithms which are able to learn the data "by heart". I believe that good data > good algorithm, so the most preferable solution is to work on good features and use a simple algorithm (linear regression, for example). If you're unable to create meaningful features, you must rely on those "intelligent" algorithms, which have a tendency to overfit, and then you must decide what regularization you want to use.

--

This bond competition was very interesting. I lost 3rd place on the last day when two users merged into a team (which is now forbidden later than 7 days before the end). I was not happy, to say the least. This is a regression task all the way. You can get good results with 2 tricks:

1. Make the response stationary. You don't predict absolute values at T0 but rather predict T0 - T1. How you define the response is sometimes the most important thing - not only in this competition.
2. You must first convert all the transaction types into the transaction type of the current transaction. So when you have transactions 2 (response), 3, 4, 2, 3, they must all become 2, 2, 2, ... It is rather complicated, but you can simplify it by adding or removing some constants to buy/sell transactions.

If you convert it well, a simple average of the prices for the last 10 periods is the best predictor, as far as I remember. You must convert any given trade_price to the type that you have to predict - not necessarily to type 2. The trade prices may already be in the correct type, in which case you don't have to make any conversions.
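Trick 1 above (a stationary response plus a recent-average baseline) can be illustrated on toy data. The prices are made up and the window size here is 3 rather than the ~10 periods mentioned in the answer:

```python
# Stationary response and rolling-mean baseline for a price series.
import pandas as pd

prices = pd.Series([100.0, 101.0, 100.5, 102.0, 101.5])

# Stationary response: predict the difference T0 - T1
# rather than the absolute value at T0.
response = prices.diff()  # NaN for the first trade

# Baseline predictor: average of the last N (already-converted) prices,
# shifted so each row only sees strictly earlier observations.
baseline = prices.rolling(window=3).mean().shift(1)
```

The `shift(1)` matters: without it, the rolling mean would include the value being predicted, leaking the response into the feature.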
If you are comparing trade prices of different types, you are comparing apples to oranges in my opinion - that's why you need to align them. This way the trade_price_1 will be even more correlated. Logistic regression returns a probability between 0 and 1, so I think you should not use it here.

--

First of all: the guys in the top 10 are not geniuses. Don't be intimidated. If I'm not doing great, I always ask: "what are they doing that I'm missing?". Sometimes all it comes down to is applying some tricks: convert the response to something that is easier to model, transpose the data, find the right algorithm, etc. And there is one important thing: when you're improving by 0.0001, it means that it is not the right way. Don't get stuck with a bad approach. When you're improving by 20 places or more, you know that you're on the right track.

--

Feature engineering is for me a matter of experiments. I do a lot of correlation visualization. What is particularly hard is using the response variable in the process of creating the features - I would call it supervised feature engineering. For example, in time-based problems you have something that I call "stickiness" of correlation. It mostly concerns categorical variables: when you compare the responses and how they change year over year, many times the correlation is high. I very often use this fact to create meaningful features. It is mostly trial and error. You create features -> make the model -> when you see a large increase in performance, that means the features are working. I seldom try one feature at a time: if I create one, I replicate the idea to create more, and I add them in sets.

--

If you're serious about data science, you should have more RAM. 32 GB is a good number. The Amazon data is very sparse - most of those columns are 0s. So scikit-learn is probably converting the data to a dense matrix.
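The sparse-vs-dense memory point can be made concrete with scipy. The dimensions below are made up, but in the spirit of one-hot encoded categorical data like the Amazon set:

```python
# Memory cost of a mostly-zero matrix: dense vs CSR sparse format.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# 1,000 rows x 5,000 one-hot style columns, ~0.1% non-zero
dense = np.zeros((1_000, 5_000))
idx = rng.integers(0, dense.size, size=5_000)
dense.flat[idx] = 1.0

csr = sparse.csr_matrix(dense)  # compressed sparse row format

dense_mb = dense.nbytes / 1e6
sparse_mb = (csr.data.nbytes + csr.indices.nbytes
             + csr.indptr.nbytes) / 1e6
# The dense array stores every zero; CSR stores only the non-zero
# entries plus index bookkeeping, orders of magnitude less here.
```

If an estimator densifies such a matrix internally, memory use explodes by the same factor, which is why this shows up as a RAM problem.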