Why are my customers churning? Your customer service team is racking up support tickets, the IT department is fixing bugs and shipping fixes overnight, and the marketing department launches weekly promo deals hoping to get your customers to buy. Instead, you get a few visits to your site, and yet customers keep dropping off.


You’re not the only one. Research shows that over 95% of customers silently quit a product without leaving any clues or feedback as to why.


CHURN RATE, SOMETIMES ALSO CALLED ATTRITION RATE, IS THE PERCENTAGE OF CUSTOMERS THAT STOP UTILIZING A SERVICE WITHIN A GIVEN TIME PERIOD. IT IS OFTEN USED TO MEASURE BUSINESSES THAT HAVE A CONTRACTUAL CUSTOMER BASE, ESPECIALLY SUBSCRIBER-BASED SERVICE MODELS. (Statista)


IN THIS POST, YOU’LL LEARN EVERYTHING YOU NEED TO KNOW TO PREDICT CUSTOMER CHURN BASED ON A CHURN RULE, BY BUILDING A MACHINE LEARNING MODEL WITH A SPARK CLASSIFICATION ALGORITHM.

How is customer churn measured?


Churn rate shows you the percentage of existing customers who stop using your service or product in a specified time period.


To determine the percentage of revenue that has churned, take the monthly recurring revenue (MRR) you lost that month (minus any upgrades or additional revenue from existing customers) and divide it by your total MRR at the beginning of the month.


For example, if a chain restaurant had 500 customers at the beginning of the month and only 450 customers at the end of the month, its customer churn rate would be 10%.
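The arithmetic is simple enough to sketch in code. A quick illustration of both formulas (the function names are ours, not from any library):

```python
def customer_churn_rate(customers_start: int, customers_end: int) -> float:
    """Percentage of customers lost over the period."""
    return (customers_start - customers_end) / customers_start * 100

def revenue_churn_rate(mrr_start: float, mrr_lost: float, expansion: float = 0.0) -> float:
    """Percentage of recurring revenue lost, net of upgrades/expansion revenue."""
    return (mrr_lost - expansion) / mrr_start * 100

# The restaurant example: 500 customers at the start of the month, 450 at the end.
print(customer_churn_rate(500, 450))  # 10.0
```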


But that’s only the first level of measuring churn, one that relies on your manual judgment to identify leaking buckets of customers.


What if you’d like to predict who will churn in the future based on an RFM model’s outcome? The traditional process is painful, and it goes like this:


  • Choose a data source to extract information from, such as a Microsoft SQL Server, and determine which tables you will extract from it.
  • Define the churn rules: what will you consider a churned customer?
  • Merge the churn rules with your customer data to generate a churn column in a brand-new table.
  • Train, test, and predict based on that data.
  • Add a model that runs all of the previous elements.
  • Finally, send the output of your model to a CSV file or a business application.


That process would probably take months, if it ever gets to production.


A Gartner study found that 47% of companies struggle to move projects into production because of difficulties with their processes and applications.


And the workflow looks something like this.


An ML model example




One of the easiest ways to build this is with Datagran, by building a data workflow, or what we usually call a pipeline. With Datagran, you can centralize your entire company’s data in a single workspace, process the information inside of it so you can pick and choose what you want to work with, train, test, and predict based on it, add Spark algorithms (to predict churn, for example), and send the results to production via a business app like Slack, RestAPI, Workbook, and more, so you can share them with stakeholders. Additionally, pipelines run on a set frequency, so you don’t have to rebuild models every time you need to find churning customers.


How to predict Churn with Datagran based on RFM analysis


Datagran connects to your databases and data warehouses and processes all types of data far more efficiently than doing it manually. It can identify your most loyal, modest, and most critical customers, find their contact information, send the results to a Slack channel or any other available app like RestAPI, Salesforce, Webhook, and more, and help you turn them into your best customers. Before we begin, a few things need to be highlighted:


  • You must have ample customer information.
  • Specify churn rules: ask yourself what your business counts as churn. If this is not determined, you can extract it from an RFM model, whose results deliver clusters of customers who are Hibernating, or Hibernating + About to Sleep.
  • You will then join customer data and add a churn column to apply the churn rules.
  • Then you will separate train/test/prediction datasets. To predict churn, it’s best to predict on new customers or those in the last quartile.
  • Finally, you will build a classification model.



  1. Connect your data sources to Datagran


If you don’t already have one, sign up for a Datagran account. Then select your workspace (or create a new one), add a project, head over to Integrations, and click on Add Integrations to connect any data source you need from the list available. 


New to Datagran? Watch this video to learn more about Integrations and how to set them up.


If you haven’t connected your data source yet, click on the plus sign to add an integration, and select the one you wish to integrate. Datagran will then ask you to choose the streams of information you want to pull from your data. 


Streams of information are the type of data collected from your store, so in this case, you want to make sure you select streams relevant to your customers. For example, you’ll want to pull information from customers, invoice tickets, frequency of purchase, etc. You will then be asked how often you want it to pull data from your data source—select DAILY as your frequency.


A data pipeline dashboard



  2. Create a Machine Learning data workflow (pipeline) and determine how to measure Churn from the results of an RFM model


Now that Datagran has your data, it’s time to turn it into an ML workflow where you will use a Spark Classification algorithm to measure churn. To do that, you’ll need to make a new Pipeline—or automated workflow—and add the data sources you want to extract information from. In this example, we are using a Microsoft SQL server.


IN THIS USE CASE, YOU WILL EXTRACT THE CHURN RULES FROM AN RFM MODEL. TO LEARN HOW TO BUILD AN RFM MODEL, CLICK HERE.


Drag in your Microsoft SQL server data from the right sidebar, then drag and drop a Custom SQL Operator to determine how you will measure churn by selecting specific variables you will extract from your dataset. Click the Edit button, select the source and copy and paste the sample query below. In this tutorial, you will identify the customers who:


  1. Have spent less than $320.00.
  2. Have made two or fewer purchases.
  3. Have gone 185 days or more without making a purchase.


A data pipeline dashboard


A data pipeline dashboard


A data pipeline dashboard


SELECT
  *,
  IF(Amount_Spent <= 319000
    AND Days_Off >= 185
    AND Invoices <= 2,
    1,
    0) AS churn
FROM
  base
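If you want to prototype the same rule outside SQL first, here is the equivalent labeling step in pandas. The DataFrame below is a made-up stand-in for the `base` table; only the column names come from the query above:

```python
import pandas as pd

# Toy stand-in for the `base` table produced by the RFM step (data is invented).
base = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Amount_Spent": [250000, 500000, 310000],
    "Days_Off": [200, 10, 190],
    "Invoices": [1, 8, 2],
})

# Same rule as the SQL IF(): low spend, long inactivity, few purchases -> churned.
base["churn"] = (
    (base["Amount_Spent"] <= 319000)
    & (base["Days_Off"] >= 185)
    & (base["Invoices"] <= 2)
).astype(int)

print(base["churn"].tolist())  # [1, 0, 1]
```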




3. Run the query and save the table. 


Now go back to the pipeline and run the SQL operator by hovering over it and pressing the play button. Then add a second SQL operator, which will extract your customers’ information. Repeat the same process, but this time copy and paste the query below. Finish by running the query and pressing play to run the element, as shown in step 2.


A data pipeline dashboard


SELECT
  DISTINCT Invoices.CheckID,
  Invoices.CheckNumb,
  CAST(Clients.InvoiceNumber AS INT64) AS InvoiceNumber,
  Clients.BusinessDate,
  EXTRACT(YEAR FROM Clients.BusinessDate) AS YearInvoice,
  EXTRACT(MONTH FROM Clients.BusinessDate) AS MonthInvoice,
  EXTRACT(DAY FROM Clients.BusinessDate) AS DayInvoice,
  Clients.PrimaryPhoneNumber,
  Clients.EmailAddress,
  Clients.AddressLine1,
  Clients.MainPaymentDescription,
  IF(Clients.MainPaymentDescription = 'Postmates'
    OR Clients.MainPaymentDescription = 'UbearEats',
    0,
    IF(Clients.MainPaymentDescription = 'Cash',
      1,
      IF(Clients.MainPaymentDescription = 'MasterCard'
        OR Clients.MainPaymentDescription = 'Visa'
        OR Clients.MainPaymentDescription = 'Amer. Exp.'
        OR Clients.MainPaymentDescription = 'Diners'
        OR Clients.MainPaymentDescription = 'Credit',
        2,
        3))) AS PaymentDescription_ID,
  IF(Clients.MainPaymentDescription = 'Postmates'
    OR Clients.MainPaymentDescription = 'UbearEats',
    1,
    0) AS DigitalChannel,
  IFNULL(DATETIME_DIFF(DATETIME(Clients.ClosedTime),
      DATETIME(Clients.SentToKitchenTime),
      MINUTE),
    263) AS DeliveryTime,
  Clients.City,
  IFNULL(CAST(Items.FKStoreId AS INT64),
    0) AS FKStoreId,
  Items.Item,
  IFNULL(CAST(Items.FKItemId AS INT64),
    0) AS FKItemId,
  Clients.SubTotal,
  Invoices.Promo,
  IF(CAST(Invoices.Promo AS FLOAT64) > 0,
    1,
    0) AS WithPromo,
  Invoices.SaleTotal,
  Items.Price,
  Items.Store,
  CAST(Items.Value AS FLOAT64) AS Value
FROM
  `Adataunickpf__adataunick_dbo_view_pfc_clients` AS Clients
JOIN (
  SELECT
    DISTINCT DOB,
    CheckNumb,
    CheckID,
    IdStore,
    Promo,
    SaleTotal
  FROM
    `Adataunickpf__adataunick_dbo_view_pfc_invoices` ) AS Invoices
ON
  Invoices.IdStore = Clients.IdStore
  AND Invoices.CheckID = Clients.OrderNumber
  AND Invoices.DOB = Clients.BusinessDate
JOIN (
  SELECT
    DISTINCT FKStoreId,
    DateOfBusiness,
    CheckNumber,
    Value,
    Store,
    Price,
    Item,
    FKItemId
  FROM
    `Adataunickpf__adataunick_dbo_view_fac_items` ) AS Items
ON
  Items.FKStoreId = Invoices.IdStore
  AND Items.DateOfBusiness = Invoices.DOB
  AND Items.CheckNumber = Invoices.CheckNumb
WHERE
  CAST(Items.Value AS FLOAT64) >= 2000



  4. Now you will merge the step 2 and step 3 SQL operators so you can begin training your ML model.


To do so, drag and drop a third SQL operator and connect it to the first operator containing sale information and the second one containing customer information. Then copy and paste the query below into the SQL canvas. Run the query, save the results, and press play on the element to run the operator.


A data pipeline dashboard



WITH
base AS (
SELECT
  PrimaryPhoneNumber,
  YearInvoice,
  MonthInvoice,
  DayInvoice,
  FKStoreId,
  FKItemId,
  PaymentDescription_ID,
  DigitalChannel,
  DeliveryTime,
  WithPromo,
  Value
FROM
  `operator_5f9c298e7d5325f3f733e320__sql_output`),
rules AS (
SELECT
  ID,
  churn
FROM
  `operator_5f9c143f9fed4dc29333e33f__sql_output` )
SELECT
  PrimaryPhoneNumber,
  YearInvoice,
  MonthInvoice,
  DayInvoice,
  FKStoreId,
  FKItemId,
  PaymentDescription_ID,
  DigitalChannel,
  DeliveryTime,
  WithPromo,
  Value,
  churn
FROM
  base
LEFT JOIN
  rules
ON
  base.PrimaryPhoneNumber = rules.ID
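Conceptually, this operator is just a left join of customer-level features onto the churn labels. A pandas sketch of the same merge, with made-up phone numbers standing in for real IDs:

```python
import pandas as pd

# Features keyed by phone number (stand-in for the customer-info operator output).
base = pd.DataFrame({
    "PrimaryPhoneNumber": ["555-0001", "555-0002", "555-0003"],
    "DeliveryTime": [120, 300, 263],
})

# Churn labels keyed by the same identifier (stand-in for the churn-rules output).
rules = pd.DataFrame({
    "ID": ["555-0001", "555-0003"],
    "churn": [1, 0],
})

# LEFT JOIN base -> rules ON PrimaryPhoneNumber = ID, as in the SQL.
merged = base.merge(rules, how="left", left_on="PrimaryPhoneNumber", right_on="ID")
print(merged[["PrimaryPhoneNumber", "churn"]])
```

Note that customers with no matching label come back as NaN in the churn column; you would want to fill or drop those rows before training.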



  5. Train your ML model


This step tells the model exactly which information it needs to work with. To learn more about operators, watch this video.


Tip: We suggest using 80% of the data set for training, and 20% of the data set for testing.


To train the model, you need a set of data to train the algorithm (the train dataset). To evaluate the model, you need a set of test data (the test dataset), although this is not mandatory. Finally, the objective is the model’s output: the values you want Datagran to predict. For this, you need a prediction dataset (the predict dataset), which uses 100% of the data.


Hover over the first operator element, which will be named Train, and press the Edit button. The SQL query editor page will pop up, where you can choose which columns to pull from your data. Copy and paste the example query below and replace each variable with your table’s variables, or press the Show Editor button in the top right-hand corner to choose the columns without writing code. Then run the operator.


A data pipeline dashboard


SELECT
  YearInvoice,
  MonthInvoice,
  DayInvoice,
  FKStoreId,
  FKItemId,
  PaymentDescription_ID,
  DigitalChannel,
  DeliveryTime,
  WithPromo,
  Value,
  churn
FROM
  `operator_5f9c2a097d5325f3f733e323__sql_output`
WHERE
  RAND() <= 0.8

 


Second, test the dataset:


Drag and drop a second SQL operator into the canvas. Click the Edit button and name the second SQL element Test. Copy and paste the dummy query into the SQL canvas. Again, this is only an example; you must replace the query’s column variables with your own.


SELECT
  YearInvoice,
  MonthInvoice,
  DayInvoice,
  FKStoreId,
  FKItemId,
  PaymentDescription_ID,
  DigitalChannel,
  DeliveryTime,
  WithPromo,
  Value,
  churn
FROM
  `operator_5f9c2a097d5325f3f733e323__sql_output`
WHERE
  RAND() <= 0.2
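One caveat about the RAND()-based sampling above: the Train and Test queries draw independently, so the same row can land in both sets. A deterministic hash on a stable key keeps the two sets disjoint. A minimal sketch in plain Python (the `split` helper is illustrative, not a Datagran feature):

```python
import zlib

def split(rows, key, train_frac=0.8):
    """Deterministically split rows into train/test by hashing a stable key."""
    train, test = [], []
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % 100
        (train if bucket < train_frac * 100 else test).append(row)
    return train, test

rows = [{"id": i, "churn": i % 2} for i in range(1000)]
train, test = split(rows, key="id")
assert len(train) + len(test) == 1000  # every row lands in exactly one set
assert not {r["id"] for r in train} & {r["id"] for r in test}  # no overlap
```

Because the assignment depends only on the key, re-running the split always reproduces the same partition, which also makes results easier to debug.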




The third step is to run your predictions: 


In this case, you won’t predict from information pulled in the future, but from the customer records you already have. This SQL runs on 100% of the data. What this query reveals is what your customers will most likely do in the future, based on historical data.


Drag and drop a third SQL operator into the canvas. Press the edit button and name it Predict. Copy and paste the dummy query below and replace the column variables with your own.


SELECT
2020 YearInvoice,
11 MonthInvoice,
01 DayInvoice,
1 FKStoreId,
200447 FKItemId,
0 PaymentDescription_ID,
0 DigitalChannel,
120 DeliveryTime,
0 WithPromo,
41900 Value
UNION ALL
SELECT
2020 YearInvoice,
11 MonthInvoice,
03 DayInvoice,
1 FKStoreId,
200073 FKItemId,
2 PaymentDescription_ID,
1 DigitalChannel,
300 DeliveryTime,
0 WithPromo,
40900 Value



Run the SQL operators:


To activate all three SQL elements, press the play button on each one. Simply hover over each element to find it.


  6. Connect the Spark classification algorithm to find the churn rate


Drag and drop the Spark Classification operator into the canvas. Connect all the operators to it and open its settings by hovering over the Spark element and pressing the Edit button. Choose the options from each dropdown menu and save them when you’re done. Go back to the canvas and run the element by hovering over the Spark Classification element and pressing the play button.
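Datagran runs the classification for you inside the Spark operator, and this tutorial doesn’t specify which estimator it uses under the hood. As a rough stand-in, here is what the train-then-predict step looks like with scikit-learn’s RandomForestClassifier on synthetic data (an illustration of the technique only, not Datagran’s implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the Train table: three numeric features playing the
# role of columns like DeliveryTime, WithPromo, and Value, with a toy churn
# label driven by the first feature.
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The Predict set has no churn column -- that is what the model estimates.
X_predict = rng.normal(size=(10, 3))
churn_prob = model.predict_proba(X_predict)[:, 1]
print(churn_prob.round(2))
```

Sorting customers by `churn_prob` is what lets you focus retention efforts on the accounts most likely to leave.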


A data pipeline dashboard



  7. Send the ML model results to production


Drag and drop an action, such as Slack, to send the churn results to a Slack channel. Connect the Spark Classification element to the new action and press the Edit button. Log in to your Slack account, then select the channel and the user you want to send the results to. Next, choose the table and columns you wish to send and add a description for the message. You can save and test it, or simply save it. Once you’re done, go back to the pipeline canvas, press the + sign, and select Play to run your pipeline. The workflow will now run daily: extracting information from your data source; training, testing, and predicting; applying the Spark Classification algorithm; and determining churn. Your Slack channel will automatically receive the model’s results every day for further analysis.




A churn model dashboard



Datagran easily lets you measure churn with machine learning models thanks to its easy-to-use pipeline tool. Send the model’s outcomes instantly to any of our available business apps to act based on the results like never before. Building ML models to predict churn provides the right tools to retain customers even before they leave, putting your business ahead of the competition. Learn more about our pipelines, build more advanced models to solve business problems ranging from operational to consumer issues.