Building a Churn Model - an intern's perspective

With “Tech” and “Data” being the buzzwords of the moment, I was all the more excited about my summer internship as a Data Scientist at Viki. With data science coined the “Sexiest Job of the 21st Century” by the Harvard Business Review, I was beyond eager to swim in data and add as much value as I could to the Viki team.

I started off by getting introduced to the technologies: Hadoop (particularly AWS EMR), Hive, Tez, PostgreSQL and Redshift, to name a few. The amount of data collected is immense: every action on the site, from pausing a video to clicking through to another link, is captured by an event being fired. As I started to look through the data, and more specifically the user and subscription data, I realized there was an opportunity: to build a churn model and see how well different factors predict a user’s inclination to churn from, or subscribe to, VikiPass.

The Data

To build a churn model, I first had to collect and look at basic user data: how many minutes of video a user has watched, how many different seasons they have watched, the number of times (sessions) they have been on Viki, the number of days a user has been on Viki, gender, country, the application a user accesses Viki on, and so on. Essentially, both numeric and nominal (categorical) data were included.

Due to the nature of the variables, it made sense to give more weight to users with a greater number of sessions. To do this, a technique inspired by term frequency-inverse document frequency (TF-IDF) was used to reflect the relative importance of each feature. Though TF-IDF is usually used to measure how important a word is within a document, the methodology can be extrapolated to numerical outcomes. For example, in this data, log normalization was applied as follows:

(Video plays / Number of sessions) * log(1 + Number of sessions)
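
As a rough sketch in R (the data frame and column names here are hypothetical, just to illustrate the transformation):

```r
# Illustrative per-user counts; the data frame and column names are hypothetical.
user_features <- data.frame(
  video_plays  = c(120, 8, 45),
  num_sessions = c(30, 2, 9)
)

# TF-IDF-style weighting: average plays per session,
# scaled by the log of (1 + number of sessions).
user_features$plays_weighted <-
  (user_features$video_plays / user_features$num_sessions) *
  log(1 + user_features$num_sessions)
```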

One issue I faced was the lack of data for many users. Due to the nature of Viki, user data, particularly demographic data, was often not available. To combat this, I categorized data fields into buckets. For example, for country data, I kept the top five countries and bucketed the rest into a category ‘other’.
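
A minimal sketch of that bucketing in R (the country values below are made up for illustration):

```r
# Made-up country column; in practice this comes from the user table.
country <- c("US", "US", "KR", "SG", NA, "IN", "BR", "US", "FR", "KR")

# Keep the five most frequent countries and bucket everything else,
# including missing values, into "other".
top_countries  <- names(sort(table(country), decreasing = TRUE))[1:5]
country_bucket <- factor(ifelse(is.na(country) | !(country %in% top_countries),
                                "other", country))
table(country_bucket)
```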

My target variable in building this model was a boolean classification: whether or not someone has churned, denoted as ‘inactive’ or ‘active’. Another issue I faced (again related to the lack of data) was that during any given time period, significantly fewer users unsubscribed than remained active (a positive for any company; however, for data analysis, such an imbalanced data set is always harder to work with!). There was not much I could do about this other than work with the set I had.

After several rounds of data collection from the Redshift cluster, field transformations and clean-up, my dataset was finally ready to use. I split the data into two: a training set (on which I built the model) and a testing set (on which I could check the validity of the model).
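
A sketch of that split in R (the data frame here is a synthetic stand-in for the cleaned Redshift extract, and the 70/30 ratio is just an example):

```r
set.seed(42)

# Synthetic stand-in for the cleaned dataset; columns are illustrative.
churn_data <- data.frame(
  minutes_watched = rexp(1000, rate = 1 / 300),
  num_sessions    = rpois(1000, lambda = 12),
  churned         = factor(sample(c("active", "inactive"), 1000,
                                  replace = TRUE, prob = c(0.9, 0.1)))
)

# 70/30 split into a training set (to fit the model on)
# and a testing set (to check its validity).
train_idx <- sample(seq_len(nrow(churn_data)), size = floor(0.7 * nrow(churn_data)))
train_set <- churn_data[train_idx, ]
test_set  <- churn_data[-train_idx, ]
```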

Data Modeling

I approached two families of methodologies: shrinkage methods and tree-based methods. The aim of shrinkage methods is essentially to shrink the coefficient estimates towards zero in order to reduce variance. All the predictors are included; however, because some estimates can shrink to exactly zero, certain shrinkage methods can also perform variable selection (this depends on which shrinkage method is being used). The methodology relies on a tuning parameter, lambda. The two most popular techniques for shrinking the coefficients towards zero are the lasso and the ridge, both of which I tried.

The ridge method is similar to least squares but involves a bias-variance trade-off. That is, as lambda increases, the flexibility of the ridge regression fit decreases, which yields decreased variance but increased bias in our fit. However, as I realized with this data, flexibility is not always a good thing, as a flexible model can be fooled by high noise (high variability in the data) even though the fitted model will have low bias. The ridge aims to find the coefficient estimates that minimize the following equation:
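
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$

Here the $\beta_j$ are the coefficient estimates and $\lambda \ge 0$ is the tuning parameter controlling how strongly the coefficients are shrunk towards zero.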

The lasso method, another popular shrinkage method, has an advantage over the ridge in that it performs variable selection: some of the coefficient estimates can actually be zero. In this way, sparse models are created. Do note that this does not mean that only a small number of predictors are correlated with the response variable (churned or active), but rather that we can predict the response well using only a small number of predictors. The lasso aims to find the coefficient estimates that minimize the following equation (notice the slight change from the ridge penalty):
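
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|$$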

The third method I tried, random forest, is a tree-based method. Tree-based methods stratify the predictor space into regions, making a series of splits on the most informative variables. Single trees are known to be good for interpretability, but tend to be less accurate than shrinkage methods for prediction. A random forest grows many trees on bootstrapped samples and considers only a random subset of predictors at each split; this de-correlates the trees so that a few heavy predictors do not dominate, and the resulting averaged model is less variable (more reliable).
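
The post does not name the exact packages used; as a minimal sketch, the three models could be fit in R with the widely used glmnet and randomForest packages, continuing from the `train_set` created in the split above:

```r
library(glmnet)        # ridge and lasso, with lambda chosen by cross-validation
library(randomForest)  # tree-based ensemble

# Predictor matrix and target from the training split sketched earlier.
x_train <- model.matrix(churned ~ ., data = train_set)[, -1]
y_train <- train_set$churned

# Ridge (alpha = 0) and lasso (alpha = 1) logistic regressions;
# cv.glmnet picks the tuning parameter lambda by cross-validation.
ridge_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)
lasso_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)

# Random forest on the same training data.
rf_fit <- randomForest(churned ~ ., data = train_set, ntree = 500)
```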

Comparison Methodology

To compare the different methods, I looked at precision vs. recall curves, lift curves and the mean squared error.

Precision and recall, the first metrics used, measure relevance in classification problems. High precision means that most of the results the model returns are relevant (few false positives), and can be summarized as follows:
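
$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$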

High recall means that the algorithm returned most of the relevant results, that is:
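
$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$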

The second technique used is the lift curve. A lift curve allows us to identify those who are most likely to ‘respond’ to a marketing campaign (coupons, reminders, etc.). After building this churn model, our aim would be to launch such a campaign to see whether the model is effective. More specifically: can we prevent churn if we can identify who will unsubscribe? The aim is to skim the cream, targeting a small number of cases while capturing a relatively large portion of the responders. A lift curve is represented as follows:

The y-axis shows the cumulative number of true positives, and the x-axis the cumulative number of total cases (ordered by descending predicted probability). The dark blue line represents the model constructed, and the purple dotted line shows, for comparison, how a campaign would perform if we guessed our respondents purely at random.
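
As a rough sketch of how such a curve can be built in R (the scores below are synthetic stand-ins for the model’s predicted churn probabilities):

```r
set.seed(1)

# Synthetic predicted churn probabilities and ground truth, for illustration only.
scores <- runif(1000)
actual <- rbinom(1000, 1, scores)

ord          <- order(scores, decreasing = TRUE)  # rank users by predicted risk
cum_true_pos <- cumsum(actual[ord])               # y-axis: cumulative true positives
cum_cases    <- seq_along(ord)                    # x-axis: cumulative cases targeted

plot(cum_cases, cum_true_pos, type = "l",
     xlab = "Cumulative cases (by descending probability)",
     ylab = "Cumulative true positives", main = "Lift curve")

# Baseline: what purely random targeting would achieve on average.
abline(a = 0, b = sum(actual) / length(actual), lty = 2)
```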

The final technique I used to compare the different models was the mean squared error (MSE). The MSE measures the difference between the predicted and the true responses: it will be small if the two values are close and large if they differ. Our aim is to minimize the MSE, which, for a fixed dataset, is equivalent to maximizing the R-squared value. The mean squared error is given by the following equation:
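
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2$$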

where f(x_i) is the predicted value and y_i is the true value. The MSE used to fit a model is usually calculated on the training dataset; however, because we are interested in how well the model fits unseen data (the test set, and eventually a current, real-time dataset for which the true values are unknown), we evaluate the MSE on the test set and select the learning method with the smallest test MSE.
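
As a tiny illustration in R (the predictions and labels below are placeholders):

```r
# Placeholder test-set predictions and true 0/1 churn labels.
predicted <- c(0.91, 0.12, 0.45, 0.08)
actual    <- c(1, 0, 1, 0)

# Average squared gap between prediction and truth.
test_mse <- mean((actual - predicted)^2)
test_mse
```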

Results and Future Studies

The model performed decently well on the testing data. Overall, the model predicted 98.9% of cases correctly, that is, whether someone would churn or remain an ‘active’ user. However, of those who actually churned, it only identified 47% (the recall). Of those who were predicted to churn, 92% were predicted correctly (the precision). In other words, the model had a considerably higher precision than recall.
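
For illustration, these kinds of figures fall straight out of a confusion matrix; the counts below are made up and are not Viki’s actual results:

```r
# Toy confusion matrix: rows are the actual class, columns the predicted class.
confusion <- matrix(c(9500, 20,
                      53,   47),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("active", "churned"),
                                    predicted = c("active", "churned")))

accuracy  <- sum(diag(confusion)) / sum(confusion)  # share of all cases classified correctly
recall    <- confusion["churned", "churned"] / sum(confusion["churned", ])  # of actual churners, how many were caught
precision <- confusion["churned", "churned"] / sum(confusion[, "churned"])  # of predicted churners, how many were right
```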

Next, an A/B test would take place to see whether there is value in predicting who will churn or subscribe. For future work, more predictors could be included, and different models, such as support vector machines, or Cartesian products of features to capture interactions between predictors, could be tried and tested.

The Experience

One of the best things about being an intern at Viki is that everyone is interested in seeing you and your project succeed. Every intern is given a live project so that they can contribute to Viki. I had never built a churn prediction model before or led a study from start to end, but being given this responsibility, I learned a huge amount. I refined my skills in SQL and R, discussed the scope and future work with top management, and presented to a team of data scientists from across the Rakuten group for feedback and input. The support and mentorship I was given was immense: I was encouraged to explore the different technologies, methods and ideas that suited me.

Kriti Agarwal - Data Science Intern at Viki, Summer 2015

If you're interested in an internship with us, check out our available internship positions.

