Jeremy P Howard

Oh that's a tough question. Technically, you need to be a strong coder
and competent with modern machine learning algorithms, and of course
you should know how to do effective visualizations. People generally
nominate "Elements of Statistical Learning" as the top book, but
personally I'd stay well away from it - I don't find it pragmatic or
clear.

I rather like http://www.cs.waikato.ac.nz/ml/weka/book.html as a
pragmatic machine learning overview. Also Andrew Ng's Coursera course
is wonderful. We've had many Kaggle winners come out of that
course. And for a Python-focused machine learning intro,
http://www.amazon.com/dp/1420067184 is underrated.

I strongly suggest you become familiar with Git, Bash, SSH, a good
text editor like Emacs or Vim, and learn the basics of networking
(Coursera has an excellent course on this too!) O'Reilly has good
books on all of the above, although there are plenty of decent online
tutorials too. Sign up for safari.oreilly.com and you'll have access
to all the tech books you could ever want!

My #1 piece of advice for becoming a top data scientist is to implement the main
algorithms yourself. I'd start with a gradient descent approach to
matrix factorization (as used in the netflix prize), an ensemble of
trees method (like random forests or GBMs), GLMNet, and Restricted
Boltzmann Machines. Once you've built your own implementation, you'll
be one of the tiny minority that understand how they work well enough
to really use them effectively. Also, enter competitions. It's the
only way to get fast honest feedback on how you're going. If you're
shy, you can always enter old closed competitions - you'll see where
you would have been on the leaderboard, but no one will know if you go
badly!
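If you want a concrete starting point for the first of those, here's a minimal Python sketch of gradient descent matrix factorization (all names and hyperparameters below are mine, purely for illustration):

```python
import numpy as np

def factorize(ratings, n_factors=10, lr=0.01, reg=0.02, n_epochs=200):
    """Minimal SGD matrix factorization, Netflix-prize style.

    ratings: list of (user, item, rating) triples with integer ids.
    Returns factor matrices U, V such that U[u] @ V[i] approximates
    the rating user u gave item i.
    """
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pu = U[u].copy()                        # snapshot before updating
            err = r - pu @ V[i]                     # prediction error
            U[u] += lr * (err * V[i] - reg * pu)    # gradient step, user factors
            V[i] += lr * (err * pu - reg * V[i])    # gradient step, item factors
    return U, V

# Toy usage: three users, three items.
ratings = [(0, 0, 5), (0, 1, 3), (1, 1, 4), (2, 2, 1)]
U, V = factorize(ratings)
print(U[0] @ V[0])  # should be close to 5 after training
```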

Python would be my suggestion as the #1 language to learn, although R
is also very useful. R In A Nutshell is a terrific book for becoming
an R guru, and the ggplot book is a great way to learn to create
visualizations.

--

No need to learn Octave/Matlab - it's a rather dated language - unless
you do Andrew Ng's course (which uses Octave for programming
assignments).

--

I learnt more by competing in Kaggle competitions than in my 8 years
in management consulting and 10 years running data-intensive
startups. The best data scientists I know are all competitors. There's
no other way to get true feedback on your effectiveness. It's
humbling, but if you stick at it, within 6 months of doing 30 minutes
a day you'll be better than all your colleagues and friends.

(Many data science bloggers claim to be against competitions. I'm not
surprised - they have a lot to lose if they compete and turn out not
to be so hot after all, and very little to gain.)

--

I use random forests for pretty much everything. Along with GBMs and
deep learning, there are few predictive modelling problems they are
not good for. I know at least a couple of dozen languages, but I use
80% C#, 10% R, and 10% F# nowadays (along with JS for in-browser
stuff and SQL for DBs, of course).

--

Q: What library do you use for deep learning? Theano
(http://deeplearning.net/software/theano)? Or did you write your own?

I'm only just getting to deep learning now myself. I'm trying to
create what Hinton calls a DREDNET.

--

Q: Have you used any good C#/.NET Random Forest libraries (open source
or commercial)?

I coded my own. I bet you can do it in a day or less. Try it!
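To give a sense of what "a day or less" looks like, here's a rough from-scratch sketch in Python rather than C# (my own toy version, skipping the efficiency tricks a real implementation needs):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, n_sub, min_leaf=5, rng=None):
    """Grow one tree, trying a random subset of features at each split."""
    if len(y) <= min_leaf or gini(y) == 0.0:
        vals, counts = np.unique(y, return_counts=True)
        return ('leaf', vals[np.argmax(counts)])
    best = None
    for f in rng.choice(X.shape[1], n_sub, replace=False):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                         # no usable split found
        vals, counts = np.unique(y, return_counts=True)
        return ('leaf', vals[np.argmax(counts)])
    _, f, t = best
    left = X[:, f] <= t
    return ('split', f, t,
            build_tree(X[left], y[left], n_sub, min_leaf, rng),
            build_tree(X[~left], y[~left], n_sub, min_leaf, rng))

def predict_tree(node, x):
    while node[0] == 'split':
        _, f, t, l, r = node
        node = l if x[f] <= t else r
    return node[1]

def random_forest(X, y, n_trees=50, seed=0):
    """Bootstrapped rows + random feature subsets per split."""
    rng = np.random.default_rng(seed)
    n_sub = max(1, int(np.sqrt(X.shape[1])))
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), len(y))   # bootstrap sample
        trees.append(build_tree(X[idx], y[idx], n_sub, rng=rng))
    return trees

def predict_forest(trees, x):
    votes = [predict_tree(t, x) for t in trees]
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]

# Toy usage on a synthetic problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = random_forest(X, y)
preds = np.array([predict_forest(trees, x) for x in X])
print((preds == y).mean())   # training accuracy on the toy data
```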

--

I don't think logistic regression is a good approach - in general it
is easy to overfit, makes many unreasonable assumptions about data,
and is more complex than more modern approaches. Try a random forest,
GBM, or deep learning network. They will all handle 16,000 features
just fine.
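As a quick illustration (scikit-learn is my choice of library here, and the dataset is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a wide dataset: 1,000 rows, 16,000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16000))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # signal lives in just 2 columns

# max_features='sqrt' means each split considers only ~126 random
# features, which is what keeps a forest cheap even on very wide data.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', n_jobs=-1)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy; use a held-out set in practice
```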

--

Can you sample the data? Try a 10% sample to allow you to run a
variable importance analysis. Once you've found the relevant vars,
you can increase the sample size. Also, use a smaller % of data for
each tree.
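Sketched with scikit-learn (the library choice and all sizes below are mine, just to show the workflow):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a big dataset (sizes and names are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 200))
y = (X[:, 3] - X[:, 7] > 0).astype(int)

# Step 1: fit on a 10% sample; max_samples gives each tree a smaller slice still.
idx = rng.choice(len(X), size=len(X) // 10, replace=False)
rf = RandomForestClassifier(n_estimators=200, max_samples=0.3, n_jobs=-1)
rf.fit(X[idx], y[idx])

# Step 2: keep only the variables that matter, then refit on more data.
keep = np.argsort(rf.feature_importances_)[::-1][:20]
rf_full = RandomForestClassifier(n_estimators=200, n_jobs=-1).fit(X[:, keep], y)
```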

--

I find ML algorithms are not closely linked to stats on the
whole. They generally require convex optimization, trees, and simple
linear algebra. These are things I would expect to see taught in a CS
course. Traditional stats (t-tests, distributions, etc.) aren't very
relevant to modern data science in my experience.

--

I do not see any reason that a random forest should perform better
after subsetting. Effectively, a random forest is already doing
subsetting for you. Increasing the number of trees will only help if
the dataset is too complex for the default number of trees.

In general, the way to create an effective predictive model is through
careful feature engineering. Look for opportunities to extract
additional insight from individual variables, or combinations of
variables.
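A trivial made-up example of a combination variable (the column names are invented):

```python
import pandas as pd

# Two raw columns often say more in combination than alone.
df = pd.DataFrame({'price': [250_000, 180_000, 420_000],
                   'area_sqm': [100, 90, 120]})
df['price_per_sqm'] = df['price'] / df['area_sqm']  # derived variable
print(df)
```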

--

I don't know where you got that impression, but it couldn't be further
from the truth. Random forests have the best variable importance
algorithm I know. Look at any of Cutler and Breiman's writings for
details on methodology, or just go ahead and use varImpPlot() in R to
start playing with it.
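varImpPlot() comes with R's randomForest package; if you'd rather stay in Python, a rough equivalent using scikit-learn's impurity-based importances (my sketch, not an official recipe) is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

# Ten most important variables, most important at the top.
order = np.argsort(rf.feature_importances_)[::-1][:10]
plt.barh([data.feature_names[i] for i in order][::-1],
         rf.feature_importances_[order][::-1])
plt.xlabel('importance (mean decrease in impurity)')
plt.tight_layout()
plt.show()
```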

--

There is no one best contest. They're all great. Pick one that
interests you. I think digit recognition is interesting, personally.

--

I would recommend the original papers on GLMNet, Random Forest, and
GBM. I would also suggest Chris Bishop's books - I found them slow
going because I'm a poor mathematician, but they're really well
done. Also see my other responses in this thread for more book
suggestions.

--

To get the most out of RFs, think carefully about what you put into
them. That is to say, feature engineering is the difference between a
winning model and an average model. Tuning RF parameters, on the other
hand, won't generally make much difference.

--

Q: Does it help to combine regularization with ensemble methods like
random forest?

No it doesn't - at least not with RF. RF is already immune to
overfitting because it uses sub-sampling (assuming you sample
appropriate sizes).
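You can check this yourself with the out-of-bag estimate, which scores each tree on the rows it didn't see. A small scikit-learn sketch (my illustration, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# oob_score_ is estimated on the rows each tree did NOT see (its
# out-of-bag sample), so it behaves like a built-in validation set.
for n in (50, 200, 500):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0).fit(X, y)
    print(n, round(rf.oob_score_, 3))  # doesn't degrade as trees are added
```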

--

An example of feature engineering: given a date/time field, create
additional columns: isWeekend; DayOfWeek; isHoliday; isPeakHour;
... Or given an image, create features for edges, complexity,
etc... Random forests are not able to find features that are this
complex. Any programming language can be used for this job.
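For the date/time case, a quick sketch (pandas is my choice of tool; the column names follow the list above):

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2013-01-05 08:30', '2013-01-07 17:45', '2013-01-08 02:10'])})

# Expand one date/time column into several model-friendly features.
df['DayOfWeek'] = df['timestamp'].dt.dayofweek
df['isWeekend'] = df['timestamp'].dt.dayofweek >= 5
hour = df['timestamp'].dt.hour
df['isPeakHour'] = hour.isin(range(7, 10)) | hour.isin(range(16, 19))
# isHoliday would come from joining against a table of local holidays.
print(df)
```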