Unfortunately, I haven't done any natural language processing before, so I'm a bit out of my element. However, there are good tutorials online, as well as R packages that can guide you through the rough parts. I thought writing up my explorations might be useful to others who want to get started with this approach. A gist of the code I wrote is available here.
What I did:
1) I took 20K recent hourly oDesk jobs where the freelancer worked at least 5 hours. I calculated the log wage over the course of the contract. Incidentally, oDesk log wages, like real-world log wages, are pretty well approximated by a normal distribution.
2) I used the RTextTools package to create a document-term matrix from the job titles. This is just a matrix of 1s and 0s where the rows are jobs and the columns are relatively frequent words that are not common English words: if the job title contains that word, it gets a 1, otherwise a 0.
3) I fit a linear model using the lasso for regularization (using the glmnet package). I used cross-validation to select the best lambda. A linear model probably isn't ideal for this, but at least it gives nicely interpretable coefficients.
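For readers who want to see the mechanics behind steps 2 and 3, here is a minimal sketch in plain Python. It is not the R code from the gist and does not call RTextTools or glmnet: the stop-word list, the function names, the toy titles and wages, and the fixed lambda are all assumptions made for illustration. The lasso is fit by cyclic coordinate descent on unstandardized 0/1 columns, and lambda is passed in by hand rather than chosen by cross-validation.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption, not the one RTextTools uses).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(title):
    """Lowercase a title and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS}


def document_term_matrix(titles, min_count=2):
    """Binary matrix: one row per title, one column per word that
    appears in at least `min_count` titles."""
    token_sets = [tokenize(t) for t in titles]
    counts = {}
    for tokens in token_sets:
        for w in tokens:
            counts[w] = counts.get(w, 0) + 1
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    rows = [[1.0 if w in tokens else 0.0 for w in vocab] for tokens in token_sets]
    return vocab, rows


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Minimize (1/2n) * sum((y - b0 - X b)^2) + lam * sum(|b|)
    by cyclic coordinate descent. The intercept b0 is not penalized."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Re-center the intercept on the current residuals.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return intercept, beta


# Made-up toy data (not from oDesk), just to exercise the functions.
titles = [
    "Senior Java developer",
    "Java developer for web app",
    "Data entry clerk",
    "Data entry in Excel",
    "Java architect",
    "Excel data entry",
]
hourly_wages = [40.0, 35.0, 5.0, 4.0, 50.0, 6.0]
log_wages = [math.log(w) for w in hourly_wages]

vocab, X = document_term_matrix(titles, min_count=2)
intercept, beta = lasso(X, log_wages, lam=0.1)
nonzero = {w: b for w, b in zip(vocab, beta) if b != 0.0}
```

Here `nonzero` holds the words whose coefficients survived the penalty, which is the same kind of object the coefficient plot summarizes. Raising `lam` zeroes out more words; with a large enough value only the intercept remains.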
So, how does it do? Here is a sample of the coefficients that didn't get set to zero by the lasso, ordered by magnitude (point sizes are scaled by the log number of times that word appears in the 10K training sample):
In terms of out-of-sample prediction, the R-squared was a little over 0.30. I'll have to see how much of an improvement can be obtained from using some of the structured data available, but explaining 30% of the variation just using the titles is higher than I would have expected before fitting the model.
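For concreteness, one common way to compute an out-of-sample R-squared is sketched below in Python. The post does not say which definition was used, so this is an assumption: it takes R-squared as one minus the ratio of squared prediction error to the total variation of the held-out log wages around their own mean. The function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean
    of the held-out values themselves (an assumed convention)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this definition, a model that always predicts the held-out mean scores 0, a perfect model scores 1, and a value a little over 0.30 means the title words account for roughly 30% of the variation in held-out log wages.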