Can you keep the user?

Predicting user churn from user traffic with SparkML

Source: Udacity


While all platforms offering a service are in a rush to gain more and more users, some of them are losing the ones they already have.

Defining churn

A churned user is one who has stopped using an application or, more precisely in our case, has canceled their account. We therefore treat a user as churned once they have reached the Cancellation Confirmation page, which is, of course, their last action on the platform.
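
The definition above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark, so that it runs anywhere; the field names `userId`, `page` and `ts` and the sample events are assumptions made for the example, not taken from the dataset schema.

```python
from collections import defaultdict

# Hypothetical event records; the field names are assumptions for illustration.
events = [
    {"userId": "1", "page": "NextSong", "ts": 1000},
    {"userId": "1", "page": "Cancellation Confirmation", "ts": 2000},
    {"userId": "2", "page": "NextSong", "ts": 1500},
    {"userId": "2", "page": "Thumbs Up", "ts": 1600},
]

CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if the user reached the cancellation page, else 0}."""
    labels = defaultdict(int)
    for event in events:
        # The |= also creates the key, so users who never cancel get label 0.
        labels[event["userId"]] |= int(event["page"] == CHURN_PAGE)
    return dict(labels)


labels = label_churn(events)
print(labels)
```

On a Spark DataFrame the same idea would typically be expressed as a per-user aggregation over the page column.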

The approach

The goal is to take users' behavior and try to predict which of them may intend to leave in the future. Are they listening to less music on the platform? Are they coming back less often? Is the application boring them, so that they spend only a little time on it?

Analysis of the given variables

user traffic schema
Distribution of variables for active vs churned users
pairwise Pearson correlation of variables

Feature selection

Following the data analysis and the reasoning described in the section above, the selected features, computed at user level, are listed below:

  • sessions — the total number of sessions a user had on the application
  • total_time_spent_in_minutes — the total time (in minutes) spent on the music streaming platform
  • time_active_in_minutes — the total time (in minutes) that has passed between the first and the last action of a user
  • avg_time_between_sessions_in_minutes — the average time (in minutes) between two sessions; in other words, the average break a user takes from using our platform; if a user had only one session, the value is 0
  • avg_plays_per_session — the average number of songs a user plays in a session
  • avg_votes_per_session — the average number of votes (thumbs up/down) a user gives in a session (user involvement)
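
To make the feature definitions concrete, here is a rough sketch of how they could be computed for a single user. It uses plain Python instead of Spark so that it runs anywhere. The event keys (`sessionId`, `ts` in milliseconds, `page`), the page names, and the choice to measure session time as last event minus first event are all assumptions made for this example; the function expects at least one event.

```python
from collections import defaultdict

MS_PER_MINUTE = 60_000
PLAY_PAGE = "NextSong"                      # assumed page name for a song play
VOTE_PAGES = {"Thumbs Up", "Thumbs Down"}   # assumed page names for votes


def user_features(events):
    """Compute the six user-level features from one user's events."""
    timestamps_by_session = defaultdict(list)
    for e in events:
        timestamps_by_session[e["sessionId"]].append(e["ts"])

    # (start, end) of every session, ordered by start time.
    spans = sorted((min(ts), max(ts)) for ts in timestamps_by_session.values())
    n_sessions = len(spans)

    total_time = sum(end - start for start, end in spans) / MS_PER_MINUTE
    first_action = spans[0][0]
    last_action = max(end for _, end in spans)
    time_active = (last_action - first_action) / MS_PER_MINUTE

    # Break between the end of one session and the start of the next one.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    avg_gap = sum(gaps) / len(gaps) / MS_PER_MINUTE if gaps else 0.0

    plays = sum(e["page"] == PLAY_PAGE for e in events)
    votes = sum(e["page"] in VOTE_PAGES for e in events)

    return {
        "sessions": n_sessions,
        "total_time_spent_in_minutes": total_time,
        "time_active_in_minutes": time_active,
        "avg_time_between_sessions_in_minutes": avg_gap,
        "avg_plays_per_session": plays / n_sessions,
        "avg_votes_per_session": votes / n_sessions,
    }


# Hypothetical user with two 10-minute sessions separated by a 20-minute break.
sample_events = [
    {"sessionId": 1, "ts": 0, "page": "NextSong"},
    {"sessionId": 1, "ts": 600_000, "page": "Thumbs Up"},
    {"sessionId": 2, "ts": 1_800_000, "page": "NextSong"},
    {"sessionId": 2, "ts": 2_400_000, "page": "NextSong"},
]
features = user_features(sample_events)
print(features)
```

For the sample user this gives 2 sessions, 20 minutes spent, 40 minutes active, an average break of 20 minutes, 1.5 plays and 0.5 votes per session.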

Model Training

We can call the classes imbalanced, since we have three times more active users than users who have canceled their account.
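
To show what this imbalance means in practice, the sketch below uses made-up counts with a 3:1 ratio (75 active, 25 churned; these numbers are an assumption, not the real dataset sizes). It computes the accuracy of a naive model that always predicts "active", and one common weighting scheme, inverse class frequency, which is not necessarily the approach used in this project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Hypothetical labels with a 3:1 ratio: 0 = active, 1 = churned.
labels = [0] * 75 + [1] * 25

# A model that always predicts the majority class is right 75% of the time,
# which is why accuracy alone is a misleading metric here.
baseline_accuracy = max(Counter(labels).values()) / len(labels)

weights = balanced_class_weights(labels)
print(baseline_accuracy, weights)
```

With these counts the churned class gets weight 2.0 and the active class about 0.67, so both classes contribute equally to a weighted loss. Metrics such as F1 are usually more informative than accuracy in this situation.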

Grid Search parameters
Churned users in test dataset — predictions versus label
Active users in test dataset — predictions versus label


For a user-centered platform such as a music streaming service, it is crucial to make sure that users enjoy the service and keep using it. The first step in that process is detecting the users who are most likely to leave.

Next steps

To continue the research, I have already planned a few next steps:

  • run the code on the full dataset on an AWS cluster and fill in a parameter grid for each model, for fine-tuning
  • engineer other features, such as location, and check the feature importance after training the models to see which features best characterize a user likely to cancel
  • extend the custom-defined scaler so that it can be used in a pipeline
  • also predict user downgrade (from paid level to free level) and upgrade (vice versa)