Cohort analysis involves identifying a group of users and tracking them over time.
The two key steps are:
Identify the cohort - Usually the group of users are identified by a specific time period – for example, students who graduated in 2010. Or customers who made a purchase in on specific day or month.
Trending - Once you identified the cohort group, you want to a track a metric of the cohort over time. For example, what was the average salary of students graduating in 2010 over the next five years.
In Arcadia we can do single cohort analysis using Segment entities.
Navigate to dataset details page and within Segments sub-page, create a new segment.
There are a few things we need to fill out:
- Name for the cohort, which will be referenced when creating visual
- Filters that define the cohort - in the below example we are choosing customers who had orders within the month of Jan 2009 and at least one sales transaction greater than 100
- Lastly, we need a unique identifier for each entity in the cohort. This has to be a column in the dataset. In the below example, we have chosen “Customer Name”.
Create a new visual and configure the time period and metric. In the below example we are going to trend by month the distinct count of customers.
To enable cohort, we need to navigate to the “Segment” panel and choose the cohort we just created.
To get the retention metric we need to use a analytic function (window functions) as explain here:
In our example we used the following:
count(distinct([Customer Name])) / first_value ( count(distinct([Customer Name]))) over (order by from_timestamp([Order Date], 'yyyy-MM')) as '% Active'
To understand how this all works, you can reference the generated SQL. You can look at the Activity Log page
In our example the final SQL looks like below. Notice the join that’s generated back to the main table.
order_date), ‘yyyy-MM’) as
year and month of Order Date, count(distinct((TA_0.
customer_name))) / first_value ( count(distinct((TA_0.
customer_name)))) over (order by from_timestamp((TA_0.
order_date), ‘yyyy-MM’)) as
% Active, count(distinct ((TA_0.
Superstore_LVTA_0 WHERE (((TA_0.
order_date) between ‘2009-01-01’ and ‘2009-01-31’ AND (TA_0.
sales) > 100)) GROUP BY 1) arc_u10
order_datebetween ‘2009-01-01’ and ‘2009-12-31’) GROUP BY 1 LIMIT 5000