Logistic Regression – A true analytics workhorse!

When I retire one day and look back, I can safely predict that I will have spent a significant portion of my working life (for good or for bad) around the application of one algorithm – logistic regression.  I just hope I don't wake up one day and wish that I had spent less of my lifetime fitting logistic regression models, and more time doing something more real : )  One can safely conjecture that the above two statements hold true for anyone who has done hands-on analytics all their working life.  Do you agree?

Here is another observation – I think every single person is on the receiving end of this algorithm in one way or another.  When you check your mailbox tomorrow and see that glorious piece of junk mail soliciting you with a free credit card offer – it is highly likely that a calibrated logistic regression model identified you as a relatively more probable candidate for opening that junk mail, or maybe even accepting the solicitation.  When you once applied for a loan and the bank approved your application – it is highly likely that a logistic regression told them you are a relatively safe borrower when it comes to defaulting.  I would call logistic regression the absolute workhorse of real applied analytics.  Is the algorithm sexy enough?  Maybe not anymore – younger miners want to be seen working with it about as much as a teenager wants to be driven by their parents to a school dance.  There are sexier classifiers these days – support vector machines, kernel-based methods, random forests, and so on – BUT the reality is that logistic regression will continue to be the workhorse of predictive analytics for a long time to come.

One reason for its popularity is that it produces a very interpretable model.  You can look at the coefficients and tell which way a certain factor influences the prediction.  For example, if you have a model that scores individuals for loan default risk, and age is one of the factors, the sign of the age coefficient tells you whether predicted default risk goes up or down with age – a negative coefficient means older individuals are scored as less likely to default.  Another reason for its popularity is that most commercial data mining packages have robust implementations of the algorithm – and the different stepwise options offer an almost built-in feature selection capability.
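To make that concrete, here is a minimal sketch with simulated data and scikit-learn – the predictor names (age, debt_ratio), the effect sizes, and the data itself are all made up for illustration, not taken from any real scoring model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical default-risk data: "age" and "debt_ratio" are made-up predictors.
n = 2000
age = rng.uniform(20, 70, n)
debt_ratio = rng.uniform(0.0, 1.0, n)

# Simulate the outcome so that default risk falls with age and rises with debt load.
true_log_odds = 1.0 - 0.06 * age + 3.0 * debt_ratio
default = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_log_odds)))

X = np.column_stack([age, debt_ratio])
model = LogisticRegression().fit(X, default)

print("intercept:", model.intercept_[0])
for name, coef in zip(["age", "debt_ratio"], model.coef_[0]):
    # A negative coefficient means the factor lowers the predicted default risk.
    print(f"{name}: coefficient = {coef:+.3f}, odds ratio per unit = {np.exp(coef):.3f}")
```

In this simulated example the age coefficient comes out negative by construction, which is exactly the kind of read-off described above – and exponentiating a coefficient turns it into an odds ratio, often the easier number to explain to a business audience.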

Now onto the mathy stuff behind our favorite analytical workhorse algorithm.  It is very close to regular multiple regression, yet the main difference lies in the indirect way it handles the dependent variable.  Instead of predicting the dependent variable directly (which, in the binomial case, takes the values 0 or 1), it predicts the probability of occurrence of a particular class – for example $P(Y = 1 \mid x)$ or, alternatively, $P(Y = 0 \mid x) = 1 - P(Y = 1 \mid x)$.  Thus the quantity being modeled is a metric measure that lies between 0 and 1.  The restriction to lie between the bounds of 0 and 1 is achieved by assuming a logistic function relationship between the independent and the dependent variable.

Figure: The logistic relationship between the dependent variable and the independent variable

For the binomial case, the linear logistic regression model can be expressed as

$$P(Y = 1 \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{T} x)}}\,, \qquad \text{or equivalently} \qquad \log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} \;=\; \beta_0 + \boldsymbol{\beta}^{T} x$$
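As a quick sanity check of that formula, here is a tiny sketch (the coefficient values below are arbitrary, chosen purely for illustration) showing that however extreme the linear predictor $\beta_0 + \boldsymbol{\beta}^{T} x$ gets, the predicted probability stays strictly between 0 and 1:

```python
import numpy as np

def p_class1(x, beta0, beta):
    """P(Y = 1 | x) under the logistic model above."""
    return 1.0 / (1.0 + np.exp(-(beta0 + np.dot(beta, x))))

# Arbitrary illustrative coefficients, not estimated from any data.
beta0 = -1.5
beta = np.array([0.8, -2.0])

for x in ([0.0, 0.0], [5.0, -3.0], [-10.0, 4.0]):
    print(x, "->", round(p_class1(np.array(x), beta0, beta), 4))  # always in (0, 1)
```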

I love the simplicity of the math behind this algorithm – particularly the calibration process by which you arrive at the intercept $\beta_0$ and the coefficient vector $\boldsymbol{\beta}$.  The estimation procedure is based upon the maximum likelihood method.  Assume a data set for the binomial case, $\{(x_i, y_i)\}_{i=1}^{N}$, where the independent variables are represented by the observation $x_i$, and the dependent variable $y_i$ takes the values 0 and 1 (representing the two classes $Y = 0$ and $Y = 1$).  Assuming that each observation is an independent draw from a Bernoulli distribution, the joint probability (or the likelihood function) can be expressed as

$$L(\beta_0, \boldsymbol{\beta}) \;=\; \prod_{i=1}^{N} p(x_i)^{\,y_i}\,\bigl[1 - p(x_i)\bigr]^{1 - y_i}$$

Basically, the product of the probabilities of the successes and failures, assuming that the observations are independent.  Probability 101.  The optimal values for the intercept $\beta_0$ and the coefficient vector $\boldsymbol{\beta}$ are arrived at by solving the following equations for the maximum likelihood estimators (MLEs):

$$\frac{\partial \log L}{\partial \beta_0} \;=\; \sum_{i=1}^{N} \bigl[y_i - p(x_i)\bigr] \;=\; 0, \qquad \frac{\partial \log L}{\partial \beta_j} \;=\; \sum_{i=1}^{N} x_{ij}\,\bigl[y_i - p(x_i)\bigr] \;=\; 0, \quad j = 1, \dots, d,$$

where

$$p(x_i) \;=\; P(Y = 1 \mid x_i) \;=\; \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{T} x_i)}},$$

with $x_{ij}$ denoting the $j$-th component of the observation $x_i$ and $d$ the number of independent variables.
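For the curious, here is a minimal sketch of that calibration step: a plain Newton-Raphson iteration on the score equations above, run against simulated data.  The function and variable names (fit_logistic_mle, sigmoid) are my own for this post, and any real package implementation adds far more safeguards (regularization, handling of separation, convergence diagnostics) than this toy version does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(X, y, n_iter=25, tol=1e-8):
    """Solve the logistic-regression score equations by Newton-Raphson."""
    Xb = np.column_stack([np.ones(len(y)), X])     # prepend a column of 1s for the intercept
    beta = np.zeros(Xb.shape[1])                   # start at beta_0 = beta = 0
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)                     # p(x_i) at the current estimate
        score = Xb.T @ (y - p)                     # left-hand sides of the MLE equations
        W = p * (1.0 - p)                          # Bernoulli variances p(x_i)(1 - p(x_i))
        info = Xb.T @ (Xb * W[:, None])            # observed information (negative Hessian)
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:             # stop once the update is negligible
            break
    return beta                                    # [intercept, coefficients...]

# Quick check on simulated data: the estimates should land near the true parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
true_beta = np.array([-1.0, 2.0, -0.5])            # [beta_0, beta_1, beta_2], chosen arbitrarily
y = rng.binomial(1, sigmoid(np.column_stack([np.ones(5000), X]) @ true_beta))
print(fit_logistic_mle(X, y))
```

Most statistical packages solve the same equations, typically via iteratively reweighted least squares, which is essentially the Newton step above rewritten as a weighted regression.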

Top 10 signs you have done too much data mining!

1. When ordering fast food, you are less interested in the menu, more in the type of data that might be recorded in that transaction.

2. You notice that beer and diapers are not in the same aisle.

3. There are so many types of beer that they need their own aisle. No scope for association rule mining there.

4. You deliberately respond to a piece of junk mail, just to mess up the performance of someone else's classification model.

5. You try to fit in, lest you become an outlier – which means you would get treated, and might even be dropped from the analysis.

6. In the fall, when leaves are missing, you mistake your backyard oak for a CART tree.

7. At the airport, you notice that your flight is delayed by 4 hours, but you are more interested in plotting the frequency distribution of flight statuses.

8. When a co-passenger yells, “stay away from the median,” your first reaction is: no way, it is the mean that should be avoided, not the median, since the mean can get skewed by outliers.

9. In the liquor aisle, Smirnoff always reminds you of the Kolmogorov-Smirno(v)(ff) test.

10. You realize that a top-ten list is not really a representative sample, as it is biased towards the top.

Birth of my blog!

This is my blog for all things decision sciences! It is an idea I had planned for a long time but never got around to – so here I am, attempting to kick it off. I started my career as an engineer – mostly because that was the only choice open to me at that point in my life – and then in grad school gradually wandered into something I chose of my own free will: the world of math and statistics. Even though my original field of mechanical engineering had a mathematical component, it just wasn't my cup of tea; I couldn't appreciate the basics enough, and couldn't imagine, among other things, analyzing the force along a particular direction on a screw/nut/bolt/pulley/beam for the rest of my life. I wanted to get into something more fundamental, some field of study that is applicable to all disciplines – from the social sciences to finance and economics – and for me that was probability and statistics. I also considered economics, but I was too broke by the end of graduate school to pursue another degree.

That is what I do – I am a data scientist. I love data, and I can spend hours and hours analyzing it, just like a programmer can spend hours and hours coding. I like all types of data – structured and unstructured. Over the years, I have played with large-scale numerical datasets, large-scale textual datasets, signal and imagery data, and all combinations of the above. I started analyzing textual data only over the last decade – and it has been absolutely fascinating. Text, or natural language, has been a fascinating field for me, and if I were to do my Ph.D. again, I would do it in statistical linguistics. Analytics in linguistics exploded so recently that I could watch it happen during my professional career – and I continue to be amazed at some of the fantastic statistical phenomena that are present in language. The field is also blessed with some of the greatest thinkers of modern times, ranging from Chomsky to Pinker.

I have been practicing data science for over 15 years now, and this blog is intended to capture my observations from all those years of practice. My posts will follow no chronological order, or for that matter any order of significance – just whatever comes to mind when I decide to post.

My humble hope is that you enjoy it, and as always, comments are welcome.