At a recent Analytics Hackathon I was able to use R to double our prediction percentage for which customers converted to paid plans. An accurate prediction can generate lead scores so sales can focus their effort on promising opportunities or customers who are at risk. A good predictive model also hints at which features to improve and positive behaviors to encourage.
The problem with existing analytics tools like KissMetrics or Tableau is that you have to build your own models, which is a guessing game. SAS and SPSS can do it, but they cost well over $10,000. I was looking for a better way and found R, which will build optimal models (almost) automatically. While statistics is hard to learn, R makes it relatively easy because the packages are basically plug-and-play. This turns it from a hard scientific problem into more of an engineering/hacker one. Once I learned R, I was able to do it in just a few hours. In this guide I will show you how to build a decision tree, and then use it to predict which customers will convert.
Step 1: Get Started With R
It might be a pain to learn a new language, but it’s worth it. The impressive package library turns a regular hacker into a statistical superman. It also lets you do set calculations, so your code will be short and sweet. There are many guides to the basics of the language. I really enjoyed R for Everyone, which I could quickly flip through on Kindle. I also looked at R courses on Udemy; I thought the pace was too slow, but if you can play content at 2-3X speed it might be a good option.
The most popular IDE is RStudio, and it’s free. Inside RStudio, make a new project to save your work. Also, remember to save your environment regularly, because RStudio only does it when you exit. I like to try out new commands in the console. Once I get a bit of code working, I copy it to an R script so I can step through it and experiment with new algorithms.
The hardest part is learning about the useful packages, but below I’ll discuss some to get started with. You can have RStudio load your packages automatically when it opens your project. Just create a .Rprofile file and paste this in for now, then run it. If any packages need to be installed, you can do so through the Tools menu.
require(ggplot2)
require(boot)
require(ROCR)
require(OptimalCutpoints)
require(caret)
require(plyr)
require(rpart)
require(rattle)
require(reshape2)
Step 2: Load And Format Your Data
Next you’ll want to load data about what features or behaviors your customers are exhibiting, and whether they convert to a paid plan. Here are types of features that we found valuable to analyze:
- Size of their need at signup
- Fully activated by completing the onboarding flow
- Adding more team members
- Investments into the product by configuring it to save time
- Feature usage
You can load data from a database or a CSV file. If you load it from a database, you’ll need to set up an ODBC connection and have the right drivers. I found it easier not to load strings as factors, both because it loads faster and because factors are more difficult to work with. The data I’m using in this post has been modified in order to protect privacy.
require(RODBC)  # odbcConnect() and sqlQuery() come from the RODBC package
db <- odbcConnect("DatabaseName")
f <- sqlQuery(db, "select * from table", stringsAsFactors=FALSE)
# Or load from a CSV file instead:
f <- read.csv("~/table.csv", stringsAsFactors=FALSE)
Here are some examples of ways to make your data easier to build models with. Working with NA (missing) data is really challenging in R and can throw your calculations off. I just strip out any rows with NA values using the complete.cases command. Hopefully you’re left with a decent, representative sample to work with.
f <- f[complete.cases(f),]
Next you’ll want to select the data and features to use in your analysis. You can get an overall picture by looking at all the data, or zoom into a particular point in your funnel and look at conversion from one stage to the next. It’s better to select a small number of variables that account for a large portion of the variance in your data. This will help avoid building a model that overfits your data. If you have a limited supply of data, this will reduce the number of model parameters to learn. You may also want to combine dependent or closely related features into a single metric. If you later choose to do linear regression, it works best with linearly independent features.
f$additionalUsers <- with(f, count_admins + count_nonadmins)
I have anonymized the data set used in this example by modifying the data and replacing the variable names with numbers. I hope you’ll be able to imagine your own variables in each of these formulas.
Step 3: Visualize Your Data
When I get started with a new data set, I often like to visualize it first. If there is a significant difference between paid and free users, then our prediction is likely to be a good one. One way to do this is to see how the averages vary based on whether the account is paid or not. The aggregate function can calculate the means split by whether they are paid.
vars <- c("var1","var2","var3","var4","var5","var6","var7","var8")
tierSeg <- aggregate(f[vars], by=list(f$pdStat), mean)
We can then plot each of these variables in a bar chart. A popular charting package is ggplot2. It lets you specify a data frame, and then you can layer visualizations on top of it. In this case, I’m including the geom_bar for the bar chart, as well as facet_wrap to create one chart for each variable.
colnames(tierSeg)[1] <- "IsPaid"  # rename only the grouping column
tiers.m <- melt(tierSeg, id.vars='IsPaid')
ggplot(data=tiers.m, aes(IsPaid, value)) +
  geom_bar(aes(fill = IsPaid), stat="identity", position="dodge") +
  facet_wrap(~ variable, nrow=2, scales="free_y")
Step 4: Build Your Models
Whether a customer is paid or not is a binary variable, so common model choices include logistic regression or a decision tree. It’s somewhat tricky to interpret the coefficients of logistic regression, especially with dependent input variables. I’m going to choose a decision tree because it gives me a better clue about variable importance and sets easy-to-interpret thresholds. This model will predict the paid column, using several input columns I created regarding feature and behavior usage.
m <- rpart(paid ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8, data=f, method="class")
You can visualize the tree using the fancyRpartPlot function from the rattle package. I prefer it over the standard plot because it’s easier to read. The nodes and cutoff points allow me to understand how different segments perform.
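Plotting the tree is a one-liner. This sketch assumes the model m fitted above and that the rattle package is already loaded:

```r
# Draw the decision tree with labeled splits and colored nodes
# (fancyRpartPlot is provided by the rattle package loaded earlier)
fancyRpartPlot(m)
```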
It’s also useful to see the importance of various variables. Here’s how you can visualize it with a graph using the ggplot2 package. I’m sorting the rows in decreasing variable importance to make it easier to view.
a <- data.frame(importance = m$variable.importance)
# rpart already returns variable.importance sorted in decreasing order,
# so using the rownames as factor levels keeps that order in the plot
a$variable <- factor(rownames(a), levels=rownames(a))
ggplot(data=a, aes(x=variable, y=importance)) +
  geom_bar(stat="identity", fill="lightblue")
Step 5: Predict Your Conversions
Now you can predict the probability that someone will convert to a paid plan. This might be useful to your sales team directly, so they can focus on ones with a high probability of closing.
f$prob <- predict(m, f)[,"TRUE"]
This prediction can also be thought of as a lead score, and your sales team might set a threshold for which ones are worthwhile to reach out to. Different thresholds will have different rates of true positives and false positives. One way to visualize these tradeoffs is to plot an ROC curve. You can choose the performance measures that matter most to you. I care most about positive predictive value (ppv), which is the probability that an account the model classifies as paid will actually be paid. I also care about sensitivity, which is the percentage of actually paid accounts correctly recognized as paid. Here you can see that this model has about a 65% sensitivity and a 70% positive predictive value at a cut point near the middle. It’s a fairly good model, but could probably be improved with more data.
pred <- prediction(f$prob, f$paid)
perf <- performance(pred, measure="sens", x.measure="ppv")
plot(perf, col=rainbow(10))
You can also pick a cut point that optimizes a criterion. Here I’m choosing to maximize Kappa, which maximizes agreement between the predicted paid plans and actual paid plans. Once you have a threshold, use it to classify each account as paid or not.
optimal.cutpoints(X="prob", status="paid", tag.healthy=TRUE,
  methods="MaxKappa", data=f, direction=">",
  control=control.cutpoints())
f$predict <- f$prob > .2
You can calculate the accuracy of your classifier at this cut point using a confusion matrix. It will also show you the positive predictive value and sensitivity. While an accuracy of 92% is quite good, the positive predictive value is lower because there is a higher prevalence of accounts that don’t go paid.
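One way to produce this output is caret's confusionMatrix function. This sketch assumes the f$predict and f$paid columns from the steps above:

```r
# Confusion matrix from caret; "TRUE" (paid) is treated as the positive class.
# Assumes f$predict (classification at the cut point) and f$paid (actual).
confusionMatrix(factor(f$predict, levels=c(FALSE, TRUE)),
                factor(f$paid,    levels=c(FALSE, TRUE)),
                positive="TRUE")
```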
          Reference
Prediction FALSE TRUE
     FALSE  1060   62
     TRUE     73  202

               Accuracy : 0.9207672
                 95% CI : (0.9478534, 0.971242)
    No Information Rate : 0.9442023
    P-Value [Acc > NIR] : 0.006509636
                  Kappa : 0.7303789
 Mcnemar's Test P-Value : 1.000000000
            Sensitivity : 0.6512676056
            Specificity : 0.95876270
         Pos Pred Value : 0.6745454545
         Neg Pred Value : 0.93966728
Step 6: Interpret Your Results
Does the model make sense given what you know about your customers and your product? Are there enough data points that you can be confident in the model? Does it seem like the model could be overfitting the data? If you have doubts, you might want to experiment with different models or variable transformations.
I already knew that the more days that someone uses our product and the more data they send to it, the more likely they are to convert. However, I learned that adding more users is also correlated with conversion. I could conduct an experiment to determine if an increase in users will cause an increase in conversions. For example, we could A/B test offers to add more team members to the account. This is a new and potentially valuable key to increasing our conversion rate. Additionally, the higher accuracy lead scores will help our sales team be more efficient.
If you’ve worked in analytics before, what are your suggestions on how to make even better predictions? I’m always looking to learn from the best.
About The Author
Jason Skowronski is currently a Product Manager at Loggly. He studied machine learning in grad school, and enjoys attending hackathons in the SF bay area. If you want to learn more or would like help, please contact him directly.