Oh that's a tough question. Technically, you need to be a strong coder, competent with modern machine learning algorithms, and of course you should know how to do effective visualizations. People generally nominate "Elements of Statistical Learning" as the top book, but personally I'd stay well away from it - I don't find it pragmatic or clear. I rather like http://www.cs.waikato.ac.nz/ml/weka/book.html as a pragmatic machine learning overview. Also, Andrew Ng's Coursera course is wonderful - we've had many Kaggle winners come out of that course. And for a Python-focused machine learning intro, http://www.amazon.com/dp/1420067184 is underrated.

I strongly suggest you become familiar with Git, Bash, SSH, a good text editor like Emacs or Vim, and the basics of networking (Coursera has an excellent course on this too!). O'Reilly has good books on all of the above, although there are plenty of decent online tutorials too. Sign up for safari.oreilly.com and you'll have access to all the tech books you could ever want!

My #1 piece of advice for becoming a top data scientist is to implement the main algorithms yourself. I'd start with a gradient descent approach to matrix factorization (as used in the Netflix prize), an ensemble-of-trees method (like random forests or GBMs), GLMNet, and Restricted Boltzmann Machines. Once you've built your own implementation, you'll be one of the tiny minority who understand how they work well enough to really use them effectively.

Also, enter competitions. It's the only way to get fast, honest feedback on how you're going. If you're shy, you can always enter old closed competitions - you'll see where you would have been on the leaderboard, but no one will know if you do badly!

Python would be my suggestion as the #1 language to learn, although R is also very useful. R In A Nutshell is a terrific book for becoming an R guru, and the ggplot book is a great way to learn to create visualizations.
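To give a feel for the "implement it yourself" advice, here's a minimal sketch of Netflix-prize-style matrix factorization trained by stochastic gradient descent. All the names and hyperparameters (n_factors, lr, reg, n_epochs) are illustrative choices of mine, not anything prescribed above:

```python
import numpy as np

def factorize(R, n_factors=2, lr=0.02, reg=0.02, n_epochs=1000, seed=0):
    """Factorize ratings matrix R ~ U @ V.T using SGD on observed entries only.

    Zeros in R are treated as missing, as in the Netflix-prize setting.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = [(i, j) for i in range(n_users)
                for j in range(n_items) if R[i, j] > 0]
    for _ in range(n_epochs):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]          # prediction error on one rating
            U[i] += lr * (err * V[j] - reg * U[i])  # L2-regularized gradient step
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V

# Toy user x item ratings; 0 means "not rated".
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)
U, V = factorize(R)
pred = U @ V.T  # reconstructs observed ratings and fills in the missing ones
```

Once this works on a toy matrix, the jump to a real competition dataset is mostly about making the inner loop fast.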
-- No need to learn Octave/Matlab - it's a rather dated language - unless you do Andrew Ng's course (which uses Octave for programming assignments).

-- I learnt more by competing in Kaggle competitions than in my 8 years in management consulting and 10 years running data-intensive startups. The best data scientists I know are all competitors. There's no other way to get true feedback on your effectiveness. It's humbling, but if you stick at it, within 6 months of doing 30 minutes a day you'll be better than all your colleagues and friends. (Many data science bloggers claim to be against competitions. I'm not surprised - they have a lot to lose if they compete and turn out not to be so hot after all, and very little to gain.)

-- I use random forests for pretty much everything. Along with GBMs and deep learning, there are few predictive modelling problems they are not good for. I know at least a couple of dozen languages, but nowadays I use 80% C#, 10% R, and 10% F# (along with JS for in-browser stuff and SQL for DBs, of course).

-- Q: What library do you use for deep learning? Theano (http://deeplearning.net/software/theano)? Or did you write your own?

I'm only just getting to deep learning now myself. I'm trying to create what Hinton calls a DREDNET.

-- Q: Have you used any good C#/.NET random forest libraries (open source or commercial)?

I coded my own. I bet you can do it in a day or less. Try it!

-- I don't think logistic regression is a good approach - in general it is easy to overfit, makes many unreasonable assumptions about the data, and is more complex than more modern approaches. Try a random forest, GBM, or deep learning network. They will all handle 16,000 features just fine.

-- Can you sample the data? Try a 10% sample to allow you to do a variable importance analysis. Once you've found the relevant vars, you can increase the sample size. Also, use a smaller % of the data for each tree.

-- I find ML algorithms are not closely linked to stats on the whole.
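The sample-then-select workflow above can be sketched with scikit-learn (my choice here; the thread doesn't name a library, and the synthetic dataset and "top 10 variables" cutoff are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a dataset too big to model all at once.
X, y = make_classification(n_samples=5000, n_features=50,
                           n_informative=5, random_state=0)

# Step 1: fit on a 10% sample just to rank the variables,
# also giving each tree a smaller share of the data (max_samples).
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 10, replace=False)
rf = RandomForestClassifier(n_estimators=100, max_samples=0.5,
                            random_state=0).fit(X[idx], y[idx])

# Step 2: keep only the top-ranked variables, then refit on more data.
top = np.argsort(rf.feature_importances_)[::-1][:10]
rf_full = RandomForestClassifier(n_estimators=100, random_state=0)
rf_full.fit(X[:, top], y)
```

The point is that the cheap first pass only has to rank variables, not produce a good model, so a small sample is usually enough.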
They generally require convex optimization, trees, and simple linear algebra - things I would expect to see taught in a CS course. Traditional stats (t-tests, distributions, etc.) aren't very relevant to modern data science in my experience.

-- I do not see any reason that a random forest should perform better after subsetting. Effectively, a random forest is already doing subsetting for you. Increasing the number of trees will only help if the dataset is too complex for the default number of trees. In general, the way to create an effective predictive model is through careful feature engineering. Look for opportunities to extract additional insight from individual variables, or combinations of variables.

-- I don't know where you got that impression, but it couldn't be further from the truth. Random forests have the best variable importance algorithm I know of. Look at any of Cutler and Breiman's writings for details on the methodology, or just go ahead and use varImpPlot() in R to start playing with it.

-- There is no one best contest. They're all great. Pick one that interests you. I think digit recognition is interesting, personally.

-- I would recommend the original papers on GLMNet, random forests, and GBM. I would also suggest Chris Bishop's books - I found them slow going because I'm a poor mathematician, but they're really well done. Also see my other responses in this thread for more book suggestions.

-- To get the most out of RFs, think carefully about what you put into them. That is to say, feature engineering is the difference between a winning model and an average model. Tuning RF parameters, on the other hand, won't generally make much difference.

-- Q: Does it help to combine regularization with ensemble methods like random forest?

No it doesn't - at least not with RF.
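If you're not in R, a rough stand-in for varImpPlot() is Breiman-style permutation importance, which scikit-learn exposes directly. Everything here (the synthetic regression data, the number of repeats) is just for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# With shuffle=False, the first 3 columns are the informative ones.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       shuffle=False, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permute each column in turn and measure the drop in score -
# the same idea behind random forest variable importance.
imp = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]  # most important first
```

Plotting `imp.importances_mean` in ranked order gets you most of what varImpPlot() shows.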
RF is already immune to overfitting because it uses sub-sampling (assuming you sample appropriate sizes).

-- An example of feature engineering: given a date/time field, create additional columns: isWeekend; DayOfWeek; isHoliday; isPeakHour; ... Or given an image, create features for edges, complexity, etc. Random forests are not able to find features that are this complex. Any programming language can be used for this job.
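The date/time example above might look like this in pandas (my choice of library; the holiday list and peak-hour windows are made-up stand-ins you'd replace with your own):

```python
import pandas as pd

# Hypothetical holiday calendar - swap in a real one for your region.
HOLIDAYS = {pd.Timestamp("2024-01-01"), pd.Timestamp("2024-12-25")}

df = pd.DataFrame({"ts": pd.to_datetime([
    "2024-01-01 08:30", "2024-06-15 14:00", "2024-12-25 18:45"])})

df["DayOfWeek"] = df["ts"].dt.dayofweek              # 0 = Monday
df["isWeekend"] = df["ts"].dt.dayofweek >= 5         # Saturday/Sunday
df["isHoliday"] = df["ts"].dt.normalize().isin(HOLIDAYS)
# Illustrative peak windows: morning 7-9, evening 16-18 (inclusive).
df["isPeakHour"] = (df["ts"].dt.hour.between(7, 9)
                    | df["ts"].dt.hour.between(16, 18))
```

Each derived column hands the forest a pattern it couldn't easily reconstruct from the raw timestamp on its own.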