Can you keep the user?

Predicting user churn from platform traffic with Spark ML

Source: Udacity


While every platform offering a service is in a rush to gain more and more users, some of them are losing the ones they already have.

Nowadays it is very simple to open an account anywhere, and it takes just as few steps to cancel it and forget about the platform.

As part of a nanodegree program, Udacity proposed a challenge that takes the first step in this process: predicting which users are more likely to give up on a service.

They created Sparkify, a music streaming platform similar to Spotify or YouTube Music. Given the users' traffic on the platform, the purpose of the challenge is to predict user churn. The motivation is that, once churn is predicted, the platform can take action to keep the users who are prone to cancel, for instance by offering them custom discounts or tailored support.

Defining churn

A churned user is a user who has stopped using an application or, more accurately in our case, has cancelled their account. Therefore, we say that a user has churned when they reach the Cancellation Confirmation page, which is, of course, their last action on the platform.
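
As a minimal sketch of this definition, the churn flag can be derived with one aggregation per user. The article's analysis was done in Spark SQL; here Python's built-in sqlite3 module stands in only so that the example runs without a cluster, and the query sticks to plain SQL. The table name `events`, the columns `userId` and `ts`, and the sample rows are assumptions made for illustration; only the `page` value Cancellation Confirmation comes from the text.

```python
import sqlite3

# Hypothetical miniature event log: (userId, page, ts).
# Column names and rows are illustrative assumptions, not the real dataset.
events = [
    ("u1", "NextSong", 1000),
    ("u1", "Thumbs Up", 2000),
    ("u1", "Cancellation Confirmation", 3000),
    ("u2", "NextSong", 1500),
    ("u2", "Home", 2500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userId TEXT, page TEXT, ts INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)

# A user is flagged as churned if any of their events is the
# Cancellation Confirmation page.
churn_query = """
    SELECT userId,
           MAX(CASE WHEN page = 'Cancellation Confirmation' THEN 1 ELSE 0 END) AS churn_flag
    FROM events
    GROUP BY userId
    ORDER BY userId
"""
churn_flags = dict(conn.execute(churn_query).fetchall())
print(churn_flags)
```

With the sample rows above, `u1` is flagged as churned and `u2` is not.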

The approach

The goal is to take the users' behavior and try to predict which of them may intend to leave in the future. Is it that they are listening to less music on the platform? Or that they are just not coming back often? Is the application boring them, so that they spend only a little time on it?

The approach is to dive deeper into the given variables and decide which ones may help detect the users who are not that into the platform. All metrics, whether created for data analysis or added as features, are computed at user level, since our goal is to put users into two categories: those who will probably continue with the music streaming service and those who will cancel (or have already cancelled) their account. The prediction itself is a binary classification problem. As for the technologies, the main one is Spark together with Python: I used Spark SQL for data analysis and feature engineering, the seaborn library for visualizations and Spark ML for classification. You can see my notebook containing all the code here.

Analysis of the given variables

user traffic schema

User related categorical variables

Neither intuition nor the data suggests that user gender determines churn on a music streaming platform, so I chose not to use it as a feature for the prediction models.

It is common for a user to try both the paid and the free level. The analysis was made on the most recent level, since it better describes a user at the present moment or at the moment of churn. This can be an interesting feature to add, since it looks like a user is more likely to stay on the platform if they have already purchased a paid subscription.

The given dataset contains users from many locations in the United States, so location in its raw form did not seem like a good feature for the model. However, it could be interesting to find a way to aggregate the locations in a future improvement. For example, we could use location as a feature if the states are extracted and then grouped into northern/southern/western/eastern categories, or once the user logs grow larger.

For userAgent the issue was similar: too many categories, given the number of users. I extracted the operating system from it to see whether this gives more insight. It looked like only a few users are on Linux, and those tend to cancel their account more often than users who prefer Mac or Windows. However, the number of Linux users worldwide is low compared to other operating systems, so the pattern rests on very few users. Therefore, it may not be a good idea to rely on this feature, or on other features engineered from this variable, for the moment.
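
One possible way to extract the operating system from a userAgent string is a small ordered list of patterns, sketched below. The function name `extract_os`, the patterns and the extra iOS/Android/Other labels are my own assumptions for illustration; the text only mentions Linux, Mac and Windows, and the notebook may do this differently.

```python
import re

# Ordered rules: the first matching pattern wins. Order matters because
# iPhone/iPad agents also contain "Mac OS X" and Android agents contain "Linux".
# Patterns and labels are illustrative assumptions.
OS_RULES = [
    (r"iPhone|iPad", "iOS"),
    (r"Android", "Android"),
    (r"Windows", "Windows"),
    (r"Macintosh|Mac OS X", "Mac"),
    (r"Linux|X11", "Linux"),
]


def extract_os(user_agent: str) -> str:
    """Map a raw userAgent string to a coarse operating-system label."""
    for pattern, label in OS_RULES:
        if re.search(pattern, user_agent):
            return label
    return "Other"


samples = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0",
]
for agent in samples:
    print(extract_os(agent))
```

Collapsing hundreds of distinct agent strings into a handful of labels is what makes the variable usable for a per-group comparison at all.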

User actions

The columns status (307/404/200), method (PUT/GET) and auth (Cancelled/Logged In) hold the HTTP status code, the HTTP method and the authentication status respectively, and they should not be of interest for our current scope.

The page type describes the actions the user takes on the platform. Analyzing the types of actions may be interesting for future in-depth research. For now, let us notice that the main activity of a user on the platform is listening to music (a good sign). Both intuition and the plot of the number of actions per page type suggest that the number of Thumbs Up/Down may be related to user involvement.

Listening to music related continuous metrics

Distribution of variables for active vs churned users

The pattern is noticeably similar across the actions, plays, songs, artists and length plots, and this is confirmed by the high correlation values below. Therefore, it is enough to add just one variable from this list to the model features.

pairwise Pearson correlation of variables
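
To make the redundancy argument concrete, here is a small sketch that computes the Pearson correlation for every pair of per-user totals and lists the pairs that are almost interchangeable. The numbers in `user_totals` are invented for illustration and the 0.9 cut-off is an assumption, not a value taken from the notebook.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair whose |r| reaches the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical per-user totals, one value per user (not real data).
user_totals = {
    "actions": [120, 340, 80, 560, 210],
    "plays": [100, 290, 60, 480, 170],
    "artists": [70, 180, 45, 300, 120],
}

for a, b, r in redundant_pairs(user_totals):
    print(f"{a} vs {b}: r = {r:.3f}")
```

When every pair in a group is reported, keeping a single representative of that group loses very little information.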

User session related analysis

Distribution of variables for active vs churned users
pairwise Pearson correlation of variables

As with the per-user totals, the pattern is similar for the average counts per user session as well. Therefore, it is enough to add the number of sessions, the average number of votes per session and the average number of songs played per session as features for the model.

Actions timestamp analysis

Regarding the action timestamp, I found it interesting to analyze the session duration and the time between sessions. As the unit of measure, I used minutes.

Distribution of variables for active vs churned users
pairwise Pearson correlation of variables

One interesting finding here is that users who are still active tend to have held their account on the music streaming platform for longer than the ones who cancelled it. This will be added to the features list together with the total time spent and the average time between sessions.
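
The three time-based metrics can be sketched for a single user as follows. This is an illustration under stated assumptions rather than the notebook's code: it assumes each event carries a session identifier and a timestamp in milliseconds, and the helper name `time_features` is hypothetical.

```python
MS_PER_MINUTE = 60_000  # assumes timestamps are expressed in milliseconds


def time_features(events):
    """Compute time-based features for ONE user.

    events: iterable of (session_id, ts_ms) pairs, in any order.
    """
    # First and last timestamp seen in each session.
    bounds = {}
    for session_id, ts in events:
        start, end = bounds.get(session_id, (ts, ts))
        bounds[session_id] = (min(start, ts), max(end, ts))

    sessions = sorted(bounds.values())  # ordered by session start
    first_action = sessions[0][0]
    last_action = max(end for _, end in sessions)

    # Break between the end of one session and the start of the next one.
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(sessions, sessions[1:])]

    return {
        "total_time_spent_in_minutes": sum(e - s for s, e in sessions) / MS_PER_MINUTE,
        "time_active_in_minutes": (last_action - first_action) / MS_PER_MINUTE,
        # A user with a single session has no gap, so the value is 0.
        "avg_time_between_sessions_in_minutes": (
            sum(gaps) / len(gaps) / MS_PER_MINUTE if gaps else 0.0
        ),
    }


# Two sessions: 0-10 min and 30-50 min.
example = time_features([(1, 0), (1, 600_000), (2, 1_800_000), (2, 3_000_000)])
print(example)
```

In this example the user spent 30 minutes in sessions, was active over a span of 50 minutes and took a 20 minute break between the two sessions.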

Feature selection

Following the data analysis and the logic described in the sections above, the selected features, all computed at user level, are listed below:

  • last_level: the most recent level (free/paid); more accurately, the last level a churned user had or the current level an active user has
  • sessions: the total number of sessions a user had on the application
  • total_time_spent_in_minutes: the total time (in minutes) spent on the music streaming platform
  • time_active_in_minutes: the total time (in minutes) that has passed between the first and the last action of a user
  • avg_time_between_sessions_in_minutes: the average time (in minutes) between two sessions; in other words, the average break a user takes from using our platform; if a user had only one session, the value is 0
  • avg_plays_per_session: the average number of songs a user plays in a session
  • avg_votes_per_session: the average number of votes (thumbs up/down) a user gives in a session (user involvement)
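
The session-count features in this list reduce to one aggregation per user. The sketch below again uses Python's built-in sqlite3 as a stand-in for Spark SQL so that it runs anywhere. The table name, the `userId` and `sessionId` columns, the `NextSong` page value used to count plays and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical event log: (userId, sessionId, page). Not real data.
rows = [
    ("u1", 1, "NextSong"), ("u1", 1, "NextSong"), ("u1", 1, "Thumbs Up"),
    ("u1", 2, "NextSong"), ("u1", 2, "Thumbs Down"), ("u1", 2, "Home"),
    ("u2", 7, "NextSong"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userId TEXT, sessionId INTEGER, page TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Totals divided by the number of sessions give the per-session averages.
feature_query = """
    SELECT userId,
           COUNT(DISTINCT sessionId) AS sessions,
           1.0 * SUM(CASE WHEN page = 'NextSong' THEN 1 ELSE 0 END)
               / COUNT(DISTINCT sessionId) AS avg_plays_per_session,
           1.0 * SUM(CASE WHEN page IN ('Thumbs Up', 'Thumbs Down') THEN 1 ELSE 0 END)
               / COUNT(DISTINCT sessionId) AS avg_votes_per_session
    FROM events
    GROUP BY userId
    ORDER BY userId
"""
features = conn.execute(feature_query).fetchall()
for row in features:
    print(row)
```

Dividing the user total by the session count is equivalent to averaging the per-session counts, as long as sessions without any play or vote are still counted as sessions.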

Encoding categorical data and feature scaling

For scaling, I created a CustomScaler, tailored to our data for the moment.

There is only one categorical variable to encode (last_level), and since it can take only two values, I encoded one value as 0 and the other as 1.

For scaling the continuous features, I used z-score normalization: from each value we subtract the feature mean and then divide by the feature's standard deviation. After this, each feature has a mean of 0 and a standard deviation of 1; the shape of its distribution is unchanged, so a skewed feature stays skewed.

The scaler was fit only on the training data and afterwards used to transform both the training and the test data. This is because the test set plays the role of fresh, unseen data, so it is not supposed to be accessible at the training stage[1].
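
The fit-on-train, transform-both discipline can be sketched with a tiny scaler in plain Python. This is not the CustomScaler from the notebook: the class name `ZScoreScaler`, the use of the population standard deviation and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


class ZScoreScaler:
    """Minimal z-score scaler: statistics are learned once, then reused."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Population standard deviation is an assumption of this sketch;
        # a constant column falls back to 1.0 to avoid dividing by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Hypothetical feature rows (two continuous features per user).
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

scaler = ZScoreScaler().fit(train)      # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # test data reuses the training statistics
print(test_scaled)
```

Note that the scaled test values need not have mean 0 or standard deviation 1; that is expected, because the test set never influenced the statistics.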

Model Training

We can call the classes imbalanced, since there are three times as many active users as users who have cancelled their account.

Keeping this in mind, and considering that we want to find as many users as possible who are likely to leave, the objective is to get a good recall score for the positive class (churn_flag=1), that is, the share of users who actually churned that the model manages to flag.

We should check the precision score as well, but we need not focus on it as much, since it shouldn't be much of a problem if we give discounts and offers to active users who are actually not that likely to cancel. "Precision is the estimated probability that a document randomly selected from the pool of retrieved documents is relevant."[2] As the metric to evaluate the Spark models, I used areaUnderPR (area under the precision-recall curve), with the same considerations in mind. "According to Saito and Rehmsmeier, precision-recall plots are more informative than ROC plots when evaluating binary classifiers on imbalanced data."[2]
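
For reference, precision and recall for the churn class can be computed from label/prediction pairs as in the sketch below. The labels and predictions are invented for illustration and are not results from the article.

```python
def precision_recall(labels, predictions, positive=1):
    """Precision and recall for the class marked as `positive`."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    # Precision: of the users flagged as churn, how many really churned.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the users who really churned, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = churned) and model predictions.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(labels, predictions)
print(precision, recall)
```

Here one churned user is missed (lowering recall) and one active user is wrongly flagged (lowering precision), so both scores come out at 0.75.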

The approach here is trial and error: the plan is to train several types of models with different parameters (in the future, maybe with different features as well) until we find the best performing one.

For this first iteration, I tried Logistic Regression, Decision Tree Classifier and Linear Support Vector Machine with their default SparkML parameters and obtained the results below:

It looks like the logistic regression and the support vector machine models tended to predict that almost all users are active and are not going to leave, while the decision tree model was able to detect 5 out of 8 churned users.

Fine tuning

A good way to improve a model is to check several values for its key parameters with the help of grid search. Unfortunately, my current environment is very slow, so until I move to a cluster in the cloud I tried to improve only one model: the logistic regression one. I chose this classifier because its evaluation score was better than the decision tree's and because it trains significantly faster than the support vector machine.

Parameters chosen

Since the evaluation score shows that the logistic regression model is performing well, the first idea that came to mind was to lower the predicted probability threshold. What does this mean? For each user, the model actually predicts the probability of belonging to one class or the other. The default threshold is 0.5, meaning that the model predicts churn if the predicted probability of churn is at least 50% for that specific user.
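
The effect of the threshold is easy to see on a handful of probabilities. The sketch below is plain Python with invented values; the helper name `predict_with_threshold` is hypothetical, and the "at least" comparison follows the wording above (a library may treat exact ties differently).

```python
def predict_with_threshold(churn_probabilities, threshold=0.5):
    """Turn predicted churn probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in churn_probabilities]


# Hypothetical predicted churn probabilities for four users.
probabilities = [0.10, 0.30, 0.45, 0.80]

default_predictions = predict_with_threshold(probabilities)        # threshold 0.5
lowered_predictions = predict_with_threshold(probabilities, 0.25)  # lower threshold
print(default_predictions)
print(lowered_predictions)
```

With the default threshold only the last user is flagged; lowering it to 0.25 also flags the two borderline users, which trades more false alarms for fewer missed churners.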

Next, I thought about the kind of logistic regression to use. The default family is auto, but I assumed binomial would work better for our binary classification problem. I also tried a couple of values for the regularization parameter, 0.01 and 0.1.

Grid Search parameters

After training with all combinations of the parameters discussed, the best model had the expected threshold value (0.25), a regularization parameter of 0.01 and the family auto. As you will see below, lowering the probability threshold was a game changer.
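
Conceptually, a grid search enumerates every combination of the candidate values and keeps the one with the best evaluation score. The sketch below shows only that mechanism in plain Python; in Spark ML the same role is played by ParamGridBuilder together with a cross-validation or train-validation wrapper. The threshold values in the grid are an illustrative assumption (only 0.25 and the default 0.5 appear in the text), and `evaluate` is a caller-supplied placeholder for "train a model with these parameters and return its areaUnderPR".

```python
from itertools import product

# Candidate values per parameter. The threshold list is an assumption
# for illustration; regParam and family values come from the text.
param_grid = {
    "threshold": [0.25, 0.5],
    "regParam": [0.01, 0.1],
    "family": ["auto", "binomial"],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def select_best(candidates, evaluate):
    """Return the candidate with the highest score.

    `evaluate` is supplied by the caller: it should train a model with the
    given parameters and return its evaluation score.
    """
    return max(candidates, key=evaluate)


candidates = expand_grid(param_grid)
print(len(candidates), "combinations to train")
```

The number of models to train is the product of the list lengths, which is why a slow environment quickly limits how large the grid can be.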


After fine tuning the logistic regression model, the evaluation score improved and the model was able to predict churn for 6 out of the 8 users who cancelled their account, while it did not predict churn for any active user. The results are impressive, considering the small size of the dataset.

Churned users in test dataset — predictions versus label
Active users in test dataset — predictions versus label


For a user-centered platform such as a music streaming service, it is crucial to make sure that users enjoy the service and keep using it. The first step in that process is to detect the users who are more likely to leave.

To accomplish this goal using the users' traffic, I started with an analysis of the given variables. For each one, I computed metrics at user level and did some exploratory data analysis to see the patterns with respect to the user churn flag. After going over the values and the plots, I came up with a list of features relevant for the scope; in the process, I dropped some features that were too similar to one another (highly correlated) in order to reduce dimensionality. Before fitting the features to a model, I made sure that each categorical variable was encoded and each continuous one was scaled. For the binary classification problem, I took a trial and error approach, meaning that I tried several types of models with different parameters to find the best performing one. The obstacle here was that my current working environment is very slow, so many models and parameter options could not be tried. However, I was able to get a logistic regression model that predicts the majority of the users who churned in the test dataset.

Next steps

To continue the research, there are some future steps I already planned:

  • first on the list is to run my notebook on the full dataset that Udacity provides on AWS; the dataset I used has only 226 users, with only 8 churned users in the test dataset, so we should not rely on such a small dataset in the future
  • while running the code on the full dataset on an AWS cluster, to fill in a parameter grid for each model, for fine tuning
  • to engineer other features (location, for example) and to check the feature importance after training the models, to see which features better characterize a user likely to cancel
  • to extend the custom scaler so that it can be used in a pipeline
  • to also predict user downgrade (from the paid level to the free level) and upgrade (vice versa)

I hope this article was insightful. If you want to see my detailed code, you can find it here.