A couple of months ago, we shared a post to help you find your buyer persona with non-traditional data. If you read it, you know we pulled in customer service ticket data from an airline use case, with information such as ticket number, type of ticket, the reason for the call, and so on, along with customer data like name, email address, and phone number, to build a pipeline that would help us understand who the airline's buyer persona was. The thing about finding a buyer persona traditionally is that it has to be derived from qualitative and quantitative data gathered through market studies, which are rather expensive. Companies can now collect other types of data that are as valuable as demographic information, such as the data mentioned earlier, so not using it would be like wearing blinders while walking through a field of gold.


What happens when these models reach execution is sobering. A Gartner study found that nearly 85% of data projects fail due to a lack of knowledge about how to operationalize them; only 4% of companies have shown an ability to do so.


Data scientists find it hard to explain machine learning results to non-technical stakeholders. To be understood and impactful, data often needs to be presented visually in graphs or charts. While these tools are incredibly useful, they are difficult to build manually: pulling information from multiple sources into a reporting tool is frustrating and time-consuming. Not only that, but if we’re talking about finding your buyer persona, we’re also up against graphs that, in a perfect world, would look like this:


Source: KDnuggets



But unfortunately, they look more like this:



Source: Google



Because clustering is unsupervised, no ground truth is available to verify results, which makes assessing their quality harder. Google recommends a method that helps you interpret your results and adjust your clustering. Here is a brief breakdown of the steps:


Step One: Quality of Clustering

Visually check whether the clusters look as expected.


Step Two: Performance of the Similarity Measure

Identify pairs of examples that are known to be more or less similar than other pairs. Then calculate the similarity measure for each pair and verify that the measure is higher for the more similar pairs than for the less similar ones.
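A minimal sketch of that check, using cosine similarity on hypothetical ticket feature vectors (the actual features and similarity measure depend on your pipeline):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ticket feature vectors: two refund tickets (known similar)
# and a refund ticket vs. an itinerary ticket (known less similar).
refund_a = np.array([1.0, 0.9, 0.1])
refund_b = np.array([0.9, 1.0, 0.2])
itinerary = np.array([0.1, 0.2, 1.0])

sim_similar = cosine_similarity(refund_a, refund_b)
sim_dissimilar = cosine_similarity(refund_a, itinerary)

# Sanity check: the measure should rank the known-similar pair higher.
assert sim_similar > sim_dissimilar
```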


Step Three: Optimum Number of Clusters

k-means requires you to decide the number of clusters k beforehand. How do you determine the optimal value? Run the algorithm for increasing values of k and note the sum of cluster magnitudes (the total distance from points to their centroids). As k increases, clusters become smaller and the total distance decreases. Plot this distance against the number of clusters; the "elbow" where the decrease levels off suggests a good value of k.
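Step three can be sketched with scikit-learn's KMeans, whose `inertia_` attribute is the total within-cluster squared distance. The data here is synthetic, and scikit-learn is an assumption; the post's pipeline runs inside Datagran:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the prepared ticket features (hypothetical data):
# three well-separated blobs of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Run k-means for increasing k and record the total within-cluster
# squared distance (inertia).
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inertia keeps dropping as k grows; plotting inertias against k shows an
# elbow (here at k=3, matching the three synthetic blobs) where it levels off.
```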


Thankfully, there’s an easier way, and in this post we will explain how to analyze and navigate the outcome of a buyer persona model.


Using the same use case data from the buyer persona blog post, we work with two datasets:


  • Fresh Desk Data: this dataset contains information from customer service tickets. Most of its characteristics are categorical; the numerical characteristics came from timestamp fields, which were converted to units of hours using the SQL operator to match the required format.


Tip: If using Datagran to build a pipeline, you don’t need to have your data organized and in a warehouse. The data can be in a CSV file or even raw. In other words, any company at any stage of maturity can run a model. One of the operators available in a pipeline is SQL, which lets you process your data so it fits the format you need.

  • Passengers Manifesto: contains information associated with passengers, such as their name, reference contacts, and flight number. This dataset supplies the model’s categorical variables, the only ones being reward plan status, City Pair Connection (very little data), Fare Class, Pax Type, and Flight Number.


Because some of the data had NULL or empty fields, the model was trained only on records that were at least 70% complete (non-null).
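As a rough illustration of those two preparation steps, the timestamp-to-hours conversion and the 70%-complete filter, here is a pandas sketch; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical slice of the Fresh Desk ticket data.
tickets = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4],
    "created_at": pd.to_datetime(
        ["2021-03-01 08:00", "2021-03-01 10:30", None, "2021-03-02 09:15"]),
    "resolved_at": pd.to_datetime(
        ["2021-03-03 08:00", None, None, "2021-03-04 21:15"]),
    "priority": ["Low", "Low", None, "High"],
})

# Timestamp difference converted to units of hours (the SQL-operator step,
# sketched here in pandas instead).
tickets["resolution_hours"] = (
    tickets["resolved_at"] - tickets["created_at"]
).dt.total_seconds() / 3600

# Keep only rows that are at least 70% complete (non-null).
min_non_null = int(0.7 * tickets.shape[1])
tickets_clean = tickets.dropna(thresh=min_non_null)
```

Here the mostly-empty third row is dropped, while the row missing only its resolution fields survives the 70% threshold.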


After running the model, the result included three clusters.
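For readers curious what such a run looks like outside the platform, here is a minimal sketch with pandas and scikit-learn: categorical ticket fields are one-hot encoded and the numeric field scaled before k-means. The column names and values are hypothetical, and the actual Datagran model may differ:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical merged ticket/passenger rows; real pipeline columns differ.
df = pd.DataFrame({
    "status":   ["Open", "Closed", "Closed", "Pending", "Closed", "Open"] * 10,
    "priority": ["Low", "Low", "High", "Low", "Low", "High"] * 10,
    "resolution_hours": [None, 77, 1259, None, 80, None] * 10,
})

# One-hot encode the categorical columns (k-means needs numeric input)
# and scale the numeric one, treating missing resolution times as 0.
X_cat = pd.get_dummies(df[["status", "priority"]])
X_num = StandardScaler().fit_transform(df[["resolution_hours"]].fillna(0))
X = pd.concat(
    [X_cat, pd.DataFrame(X_num, columns=["resolution_hours_scaled"])], axis=1
)

# Fit k-means with the k suggested by the elbow check (three here).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```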


Cluster 1


This cluster has the highest number of records of the three, though it does not exceed cluster 2 by much (3820 → 44.9%). In this cluster we found:


  1. The status of the requests is concentrated in open and pending.
  2. Of the top 5 agents associated with requests, 4 also appear in the top 5 of the other clusters; 6 agents belong exclusively to this cluster.
  3. Priority is mostly low in all clusters, although most of the high-priority requests are concentrated here.
  4. The source is the same for all clusters: feedback widgets and emails.
  5. The type of request is the same for all clusters, reservation and purchase management; refund requests are exclusive to this cluster.
  6. Customer service ticket types are the same for all clusters.
  7. This cluster comprises all of the records that have no resolution status.
  8. The groups are the same in all clusters.
  9. It mostly comprises records with an agent interaction rating of 0.
  10. Customer interaction is the same for all clusters.
  11. The flight numbers with the highest record counts are 5764, 5765, and 5587; 17 flights are exclusive to this cluster.
  12. The LiftStatus variable is mostly “No show,” followed by a large number of “Boarded” records.
  13. Expiration times average 57 hours (2.4 days); only Q3 varies between the clusters.


Summary:

In short, these are the requests of passengers who mostly did not board their flights (possibly because the flights were canceled), or whose requests are open and pending and therefore have no resolution time or resolution status. They have the longest expiration times, their request types include all of the refunds, and they have had little or no agent interaction. The first two flight numbers can be associated with roundtrip flights, so it could be said that there was some kind of problem with those flights. This cluster has 17 exclusive flights.



Cluster 2

This cluster has the second-highest number of records, surpassed only by cluster 1 (3155 → 43.8%). In this cluster we found:


  1. The status of the requests is concentrated in closed.
  2. Of the top 5 agents associated with requests, 2 also appear in the top 5 of cluster 1; 4 agents belong exclusively to this cluster.
  3. Priority is mostly low in all clusters.
  4. The source is the same for all clusters, feedback widgets and emails; the Outbound Email source is also included in this cluster.
  5. The type of request is the same for all clusters, reservation and purchase management; itinerary information requests are exclusive to this cluster.
  6. Customer service ticket types are the same for all clusters.
  7. The resolution status records mostly comprise "Service Level Agreement Violated," followed by "Within Service Level Agreement."
  8. The groups are the same in all clusters.
  9. It mostly comprises records with an agent interaction rating of 1.
  10. Customer interaction is the same for all clusters.
  11. The flight numbers with the highest record counts are 5764, 5586, and 5587; 7 flights are exclusive to this cluster.
  12. The LiftStatus variable is mostly “Boarded,” followed by a large number of “No show” records.
  13. Resolution times average 77 hours (under 100, or 3.2 days).
  14. Expiration times average 56 hours (2.3 days); only Q3 varies between the clusters.


Summary:

In short, these are passengers who mostly boarded their flights. Their cases are closed, with resolution times falling within the first 4 days. Their customer service tickets are split between those that comply with the service level agreement and those that do not, with violations being more common. Their request types include itinerary information, they have mostly had minimal interaction with an agent, and their resolution time is around 3 days. The cluster's top 3 flights include a roundtrip flight, which could suggest some kind of problem with that flight, such as a delay or cancellation. It has 7 exclusive flights.



Cluster 3

This cluster has the fewest records of the three (1031 → 11.3%). In this cluster we found:


  1. The status of the requests is concentrated in closed.
  2. All top 5 agents associated with requests also appear in the top 5 of clusters 1 and 2; 1 agent belongs exclusively to this cluster.
  3. Priority is mostly low in all clusters.
  4. The source is the same for all clusters, feedback widgets and emails.
  5. The type of request is the same for all clusters, reservation and purchase management; complementary services requests are exclusive to this cluster.
  6. Customer service ticket types are the same for all clusters, except there are no congratulations here.
  7. The resolution status records include only “Service level agreement violated.”
  8. The groups are the same in all clusters.
  9. It mostly comprises records with an agent interaction rating of 1 and encompasses the highest interaction value per agent.
  10. Customer interaction is the same for all clusters and includes the records with the greatest interaction.
  11. The flight numbers with the highest record counts are 5793, 5716, and 5733; 9 flights are exclusive to this cluster.
  12. The LiftStatus variable is mostly “Boarded,” followed by “No show” records; this is the only cluster with no checked-in records.
  13. Resolution times average 1259 hours (over 1000, or 52.5 days).
  14. Expiration times average 51 hours (2.1 days); only Q3 varies between the clusters.


Summary:

In short, this is the smallest cluster of the three and comprises passengers who generally boarded their flights. Their tickets are closed, and every resolution status violates the service level agreement. They have had at least one interaction with an agent, trending toward 2 interactions, and their resolution times include the highest values, around 53 days, yet their expiration times are shorter than in the rest of the clusters. It has 9 exclusive flights.



Finding Your Ideal Customer


When running a buyer persona model, clusters with characteristics like the ones listed above are key to drawing conclusions and gaining insights you can expand on.


The first step is to break down each cluster to understand which characteristics stand out, so you can then give each persona a name that represents its behavior. Formulate a few questions you wish to answer based on your findings:


  • Who are your clients?
  • How do they behave?
  • What are they interested in?
  • What kind of challenges do they face?


Based on the cluster results we can gather the following highlights and names for each persona:

Cluster 1: No-show customers

Passengers who mostly did not board flights

Requested refunds

Cluster 2: Buyer Persona

Passengers who mostly boarded their flights

Their customer service tickets were closed in the first 4 days

The type of request includes itinerary information

They had minimal interaction with an agent

Cluster 3: Unsatisfied customers

Passengers who boarded the flights

Their respective customer service tickets are in a closed state, and their entire resolution status violates the service level agreement.

They had at least one interaction with an agent, trending toward 2 interactions, and their resolution times include the highest values, with an average of 53 days.



Now we can dig even deeper. For example, each cluster can be analyzed to gather Lifetime Customer Value or Customer Acquisition Cost. 


LTV: The simplest formula for measuring customer lifetime value is the average order total, multiplied by the average number of purchases per year, multiplied by the average retention time in years.


Use this formula to understand how much income each cluster generates in a specific timeframe. This will help you set accurate future budgets across marketing, sales, and customer service.
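As a quick illustration of the formula with made-up numbers (replace them with your cluster's actual averages):

```python
# Hypothetical per-cluster averages; plug in your own numbers.
avg_order_total = 210.0      # average airfare per order, in dollars
purchases_per_year = 2.5     # average bookings per passenger per year
retention_years = 3.0        # average retention time in years

# LTV = average order total x purchases per year x retention in years.
ltv = avg_order_total * purchases_per_year * retention_years
print(f"Estimated LTV per passenger: ${ltv:,.2f}")
```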


CAC: CAC is calculated by dividing the cost of acquiring customers (marketing expenses) by the number of customers acquired in the period the money was spent. In this case, calculate the cost incurred to service each passenger and determine whether your marketing efforts are paying off or falling short.
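A similarly hypothetical CAC calculation (the spend and customer counts are made up; substitute your own period's figures):

```python
# Hypothetical quarterly figures for one cluster.
marketing_spend = 50_000.0   # dollars spent acquiring customers in the period
customers_acquired = 400     # new customers acquired in the same period

# CAC = acquisition spend / customers acquired in the period.
cac = marketing_spend / customers_acquired
print(f"CAC per customer: ${cac:,.2f}")
```

A common health check is to compare the two per cluster: LTV should comfortably exceed CAC.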


Once you determine your LTV and CAC, take it even further by running additional models like Recommended Product, a model that predicts the products your customers are most likely to buy, and RFM, a segmentation model that predicts who your best and worst customers are. The possibilities are endless, but the premise lies in making more personalized connections with your customers and, in tandem, increasing revenue. With these actions, your teams can approach your best customers with even more personalized offers they are most likely to react to. Examples of this could be:


  • Discounted fares
  • Discounted destinations
  • Reward cash out for exclusive destinations


Additionally, using these results in finance can help your company define investments across multiple departments. For example, cluster results can be used to predict sales efforts in terms of manpower needed and the strategy used.


Marketing efforts can be revisited by launching personalized campaigns that ensure a deeper connection with each customer in order to resolve open disputes, complaints, and dissatisfactions.


In operations, for instance, clusterization can help you establish groups of customer service agents for dissatisfied or satisfied customers based on their performance. 


Ready to run a clustering algorithm? Try this method with your latest model results and share with us how you made your data actionable. 


In conclusion, traditional ways of finding buyer personas are time-consuming, expensive, and full of human bias. With ML you can accurately cluster your customers and then build buyer personas in a way that is more scientific. Although ML can feel complex, interpreting its output and designing a buyer persona from it is not. Work in tandem with your BI or analytics team to build one collaboratively; combining the business and analytical perspectives will produce great results for the organization, and you will start paving your way toward becoming a more data-centric professional.