How to split a numeric variable into a binary low-high variable

Question

I have measured the frequency of a certain behaviour in 15 individuals. I would like to create two groups based on the amount of this behaviour that was observed (i.e., a group exhibiting high levels of the behaviour and a group exhibiting low levels of the behaviour).

I want to see whether this new binary variable predicts a dependent variable that I have measured.

It seems like some clustering, but can you elaborate a bit more? What do you mean by "levels of that behavior"? — user88
– user88, Commented May 1, 2011 at 11:35
The levels of the behavior are the frequencies (occurrence of behavior/minute) that were exhibited by the individuals. — Orp
– Orp, Commented May 1, 2011 at 13:46
@Orp Is the behaviour measured on a single variable or multiple variables? — Jeromy Anglim
– Jeromy Anglim, Commented May 1, 2011 at 14:47
@Orp there seem to be a lot of different ways to approach this... some more information is needed to best develop an answer. How are the data distributed? Do you have an a priori reason to think there are 2 groups and only 2 groups? etc... — DQdlM
– DQdlM, Commented May 1, 2011 at 15:20
Thanks Jeromy, I am not sure I understand your question. The behavior scores range from 0 occurrences per minute to 8 per minute on a continuous scale. — Orp
– Orp, Commented May 1, 2011 at 20:08

Glen_b · Accepted Answer · 2023-03-30 06:59:33Z

Assuming you have a single predictor variable that represents frequency of behaviour, I would make the following points.

Should you split a numeric variable into high-low groups?

I quote the following from one of my blog posts on creating clusters, where I use the term "median split" as a prototypical example of converting a numeric variable into a binary high-low variable.

Many researchers have heard the advice not to form median splits (see Howell for a discussion), or other kinds of binary splits for that matter. The same arguments also tend to apply to other forms of abrupt grouping into a small number of categories.

Some arguments FOR running median splits are: 1) it allows you to do an ANOVA or t-test and compare group means; 2) group differences are easier to communicate to a lay audience; 3) it may reflect an important distinction in the underlying continuous variable.

Some arguments AGAINST running median splits are: 1) you can always find an equivalent analysis that respects the continuous nature of the variable (e.g., regression); 2) when creating median splits, you lose a lot of information; 3) the cut-off tends to be relatively arbitrary and it varies between samples; 4) the resulting model based on a median split does not reflect the underlying nature of the variable; 5) in most cases a binary split will have less statistical power; 6) if the purpose is to communicate to a scientific audience, respecting the continuous nature of the variable is a necessary complexity.

From the above you can see that there are generally more reasons in favour of maintaining the continuous version of the variable. The two occasions where splits are tolerable are where they make it easier to communicate findings to a lay audience and where the underlying effect of interest occurs in a stepwise fashion. In the latter case, the presence of a stepwise effect can be tested empirically; a quick look at a scatter plot should give some sense of whether there is a point where the effect changes dramatically. Likewise, decisions based on test scores are often made using pass-fail categories, and there is often a concrete desire to draw inferences about these specific groups.
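The information-loss and power arguments can be illustrated with a small simulation. The sketch below is not based on the data in the question: it assumes a normally distributed predictor, a linear relationship, and an arbitrarily chosen population correlation of 0.5. Under those assumptions, a median split is known to shrink the correlation to roughly 80% of its continuous value, and the simulation should show a drop of about that size.

```python
import math
import random
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000                # large sample, so sampling noise is small
true_r = 0.5             # assumed population correlation (illustrative only)

# Simulated predictor and outcome with correlation close to true_r
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_r * xi + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for xi in x]

# Median split: 1 = "high", 0 = "low"
cut = statistics.median(x)
x_high = [1 if xi > cut else 0 for xi in x]

r_continuous = pearson(x, y)
r_split = pearson(x_high, y)

print(f"correlation using the continuous predictor: {r_continuous:.3f}")
print(f"correlation using the median-split predictor: {r_split:.3f}")
print(f"ratio: {r_split / r_continuous:.3f}")
```

With only 15 individuals, a loss of this size is likely to matter more, because there is little power to spare in the first place.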

Also, check out page 128 of Making Friends with Your Data for further discussion.

In summary, my advice would be to run a correlation or a regression predicting your outcome variable from the continuous version of your predictor. You may or may not want to apply an order-preserving transformation to your predictor, depending on its distribution.
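As a minimal sketch of that advice, the following computes a Pearson correlation and a simple least-squares line using only the Python standard library. The 15 values in `freq` and `outcome` are invented for illustration and are not the asker's data; `fit_line` is a hypothetical helper name. The `log1p` step is just one example of an order-preserving transformation and is optional.

```python
import math
import statistics

def fit_line(x, y):
    """Return (slope, intercept, r) for the least-squares line
    y = intercept + slope * x, where r is the Pearson correlation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: behaviour frequency (occurrences per minute) and an
# outcome score for 15 individuals. Both lists are invented.
freq = [0.0, 0.5, 1.0, 1.2, 1.5, 2.0, 2.4, 3.0, 3.3, 4.0, 4.6, 5.0, 6.1, 7.0, 8.0]
outcome = [12, 15, 11, 16, 18, 17, 21, 19, 24, 22, 27, 25, 30, 29, 34]

slope, intercept, r = fit_line(freq, outcome)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")

# Optional order-preserving transformation; log(1 + x) copes with the zero.
log_freq = [math.log1p(v) for v in freq]
slope_log, intercept_log, r_log = fit_line(log_freq, outcome)
print(f"after log1p: slope={slope_log:.2f}, r={r_log:.2f}")
```

A scatter plot of `freq` against `outcome` is worth inspecting before trusting either fit, since a straight line is itself an assumption.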

Creating two groups based on numeric variable

Putting aside the issues raised above, if you decide that you still want to split your predictor variable into high-low groups, the following are some options:

Use statistical properties of your sample
- Median split
- Above or below the mean
- Take bottom 25% and top 25% and throw out the middle
- Take bottom third and top third and throw out the middle third
Use accepted or externally validated cut-offs
- e.g., medical diagnoses are often based on certain cut-offs on a continuous scale
Use your own understanding of the phenomena to define a cut-off
Examine a histogram or density plot and look for a natural split in the data (as mentioned by @rolando2)
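The sample-based and cut-off-based options above could be sketched as follows. The helper names `split_at` and `extreme_groups` are hypothetical, the `freq` values are invented, and the rule that a value exactly equal to the cut-off counts as "high" is an assumption you may want to change.

```python
import statistics

def split_at(values, cut):
    """Label each value 'high' if it is at or above cut, otherwise 'low'.
    Works for a median, a mean, or an externally chosen cut-off."""
    return ["high" if v >= cut else "low" for v in values]

def extreme_groups(values, k):
    """Label the k smallest values 'low' and the k largest 'high';
    everything in the middle is labelled None (i.e., thrown out).
    Ties at the group boundaries are broken by original position."""
    if 2 * k > len(values):
        raise ValueError("k is too large for two non-overlapping groups")
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for i in order[:k]:
        labels[i] = "low"
    for i in order[len(values) - k:]:
        labels[i] = "high"
    return labels

# Hypothetical behaviour frequencies (occurrences per minute), 15 individuals
freq = [0.0, 0.5, 1.0, 1.2, 1.5, 2.0, 2.4, 3.0, 3.3, 4.0, 4.6, 5.0, 6.1, 7.0, 8.0]

by_median = split_at(freq, statistics.median(freq))   # median split
by_mean = split_at(freq, statistics.fmean(freq))      # above/below the mean
by_cutoff = split_at(freq, 4.0)                       # assumed external cut-off
thirds = extreme_groups(freq, 5)                      # bottom and top third

print(by_median.count("low"), by_median.count("high"))
print(by_mean.count("low"), by_mean.count("high"))
print(thirds)
```

With an odd sample size such as 15, a median split cannot produce equal groups, and the extreme-groups options leave only a handful of individuals per group, which is worth weighing against the points made earlier.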

Jeromy, thanks for taking the time. It is very helpful. — Orp
– Orp, Commented May 2, 2011 at 6:48

rolando2 · Accepted Answer · 2011-05-01 23:17:09Z

Based on the post and the comments to date: If you want to create two groups based on a single variable, you are faced with an arbitrary choice. You can say that below x is "low" and at or above x is "high", but there is not going to be any statistical procedure (certainly not a significance test) that can make that determination for you. In this situation some people would draw a histogram and look for what seems like a "natural" dividing point, which might simply mean one that would be convincing or defensible to one's particular audience. Alternatively, one might divide so as to create two equal-sized groups. There is no right or wrong answer.

But I question the need for dichotomization, for I suspect that whatever methods you plan to apply with two groups could be replaced by other methods, at least as informative, that preserve the original variable. For example, rather than dichotomizing and running a t-test on your dependent variable, why not correlate your independent and dependent variables, or create a scatterplot to show their relationship?

I agree with this answer : ) — Behacad
– Behacad, Commented May 2, 2011 at 2:52
That was very informative, and makes sense. Thanks so much! — Orp
– Orp, Commented May 2, 2011 at 6:35
