$\begingroup$

I’m working with ad server data where I can’t get user-level data — only aggregated reports. The data is aggregated on multiple categorical dimensions (e.g., day × product × medium × source × campaign × format), and metrics like conversion, impression, and cost are sums over those dimensions. My goal is to predict conversion.

I have two main questions:

a) Should I always pull the data at the most granular aggregation level possible? (including all categories like format, source, campaign)

b) If a categorical variable (e.g., format) is irrelevant for the model, is it better to remove that variable by re-aggregating the data at a higher level (dropping that dimension), or keep the original aggregation level but simply exclude the variable from the model?

Thanks for any insights!

$\endgroup$

2 Answers

$\begingroup$

a) Should I always pull the data at the most granular aggregation level possible? (including all categories like format, source, campaign)

Assuming that you are 1) allowed and 2) able to retrieve that data, I'd say that it can't hurt to have a look, or even train a model on the granular data to see if it performs better.

b) If a categorical variable (e.g., format) is irrelevant for the model, is it better to remove that variable by re-aggregating the data at a higher level (dropping that dimension), or keep the original aggregation level but simply exclude the variable from the model?

If you know that this variable isn't relevant at all, remove it by re-aggregating at the higher level, so that your model cannot use that variable to make decisions. If the variable is truly irrelevant, your model might learn to ignore it on its own, but better safe than sorry.
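As a minimal sketch of the re-aggregation step, assuming the report is loaded into a pandas DataFrame: the column names, values, and the helper `drop_dimension` are hypothetical and only illustrate the idea, they are not taken from any real ad server export.

```python
import pandas as pd

# Hypothetical aggregated report; all names and numbers are illustrative.
report = pd.DataFrame({
    "day": ["2024-01-01"] * 4,
    "campaign": ["c1", "c1", "c2", "c2"],
    "format": ["banner", "video", "banner", "video"],
    "impression": [1000, 500, 800, 200],
    "cost": [10.0, 8.0, 6.0, 4.0],
    "conversion": [5, 4, 2, 1],
})

def drop_dimension(df, dimension, metrics):
    """Re-aggregate df without `dimension` by summing the additive metrics."""
    # Every column that is neither a metric nor the dropped dimension is a key.
    keys = [c for c in df.columns if c not in metrics and c != dimension]
    return df.groupby(keys, as_index=False)[metrics].sum()

metrics = ["impression", "cost", "conversion"]
coarse = drop_dimension(report, "format", metrics)
print(coarse)
```

This only works directly for additive metrics such as impressions, cost, and conversions. Ratios like conversion rate should be recomputed from the re-aggregated sums rather than summed or averaged.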

Some additional food for thought

When using aggregated values to build a model that needs to make predictions on individual samples from that aggregation, you should look at it this way: you're facing the same problem, you just happen to work with synthetic data points emerging from your granular data.

In some use cases, discernible patterns might only arise at a certain level of aggregation. Let's say you want to predict the number of sales for a particular product, in a particular colour, at a particular store, on a particular day. You might not sell that specific item often enough to have useful data to draw patterns from. But if you instead predict how much you'll sell of this item, in all colours, at all stores, over the next week, you might be able to get somewhere.

$\endgroup$
$\begingroup$

You can refer to this article on dealing with high-cardinality categorical features:

https://annahava.medium.com/too-many-categories-how-to-deal-with-categorical-features-of-high-cardinality-d4563cfe62d6

Another method I suggest is to encode the categorical variables and then try feature agglomeration, a clustering algorithm that merges overlapping (similar) features.
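One possible reading of this suggestion, sketched with scikit-learn's `FeatureAgglomeration` applied to one-hot encoded columns. The data, the column names, and the choice of `n_clusters=2` are assumptions made for illustration. Whether the merged features actually help the model would need to be checked on real data.

```python
import pandas as pd
from sklearn.cluster import FeatureAgglomeration

# Hypothetical rows in which `source` largely determines `medium`,
# so the encoded columns overlap.
rows = pd.DataFrame({
    "medium": ["cpc", "cpc", "social", "social", "cpc", "social"],
    "source": ["google", "bing", "facebook", "instagram", "google", "facebook"],
})

# One-hot encode both categorical columns into a numeric matrix.
encoded = pd.get_dummies(rows[["medium", "source"]], dtype=float)

# Cluster the encoded columns into 2 groups; each group of columns is
# pooled (mean by default) into a single output feature.
agglo = FeatureAgglomeration(n_clusters=2)
reduced = agglo.fit_transform(encoded.to_numpy())

# Show which encoded column was assigned to which cluster.
print(dict(zip(encoded.columns, agglo.labels_)))
print(reduced.shape)
```

`agglo.labels_` maps each encoded column to its cluster, which makes it possible to inspect which categories were merged before using `reduced` as model input.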

$\endgroup$
