## Cut function in R

Sometimes it is useful to **categorize the values of a continuous variable** into the levels of a factor. For that purpose, you can use the R `cut` function. The following block of code shows the **syntax of the function** along with a simplified description of its arguments.

```
cut(num_vector,              # Numeric input vector
    breaks,                  # Number of intervals or vector of break points
    labels = NULL,           # Labels for each group
    include.lowest = FALSE,  # Whether a value equal to the lowest 'break' is included or not
    right = TRUE,            # Whether the intervals are closed on the right (and open on the left) or vice versa
    dig.lab = 3,             # Number of digits of the groups if labels = NULL
    ordered_result = FALSE,  # Whether to return an ordered factor or not
    ...)                     # Additional arguments
```

### Cut in R: the breaks argument

The `breaks` argument allows you to cut the data into bins and hence to categorize it. Consider the following vector:

`x <- -5:5`

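On the one hand, you can set the `breaks` argument to a single integer (2 or greater), which creates as many intervals (levels) as the specified number. All the intervals have the same length, except that the outer limits are moved away by 0.1% of the range so that the minimum and maximum values fall inside them.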

`cut(x, breaks = 2)`

```
(-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0]
(-5.01,0] (0,5.01] (0,5.01] (0,5.01] (0,5.01] (0,5.01]
Levels: (-5.01,0] (0,5.01]
```
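As a quick check of this behavior, the following sketch (the object name `groups` is only illustrative) cuts the same vector into three intervals and counts the observations in each one with `table`.

```r
x <- -5:5

# Split the range of x into 3 intervals of equal length
groups <- cut(x, breaks = 3)

# Number of levels created (one per interval)
nlevels(groups)

# Number of observations that fall in each interval
table(groups)
```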

On the other hand, you can specify the intervals you prefer.

`cut(x, breaks = c(-6, 2, 5))`

```
(-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (2,5] (2,5]
(2,5]
Levels: (-6,2] (2,5]
```
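Note that values outside the range covered by the breaks are not assigned to any interval. A minimal sketch, using a made-up vector `y` and illustrative object names:

```r
y <- c(-10, 0, 10)

# -10 and 10 lie outside (-6, 5], so they are returned as NA
out_of_range <- cut(y, breaks = c(-6, 2, 5))
out_of_range

# Using -Inf and Inf as the outer breaks covers every finite value
covered <- cut(y, breaks = c(-Inf, 2, Inf))
covered
```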

It is worth mentioning that, if the interval limits have decimals, you can modify the number of significant digits used in the labels with the `dig.lab` argument. You can also decide whether the result should be an ordered factor with the `ordered_result` argument.
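The following sketch illustrates both arguments. The object name `ord` and the break value 1.23456 are only illustrative choices.

```r
x <- -5:5

# Return an ordered factor, so that its levels can be compared
ord <- cut(x, breaks = c(-6, 2, 5), ordered_result = TRUE)
is.ordered(ord)

# Compare the group of the first element with the group of the last one
ord[1] < ord[11]

# Use more significant digits in the interval labels
levels(cut(x, breaks = c(-6, 1.23456, 5), dig.lab = 6))
```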

### Cut in R: the labels argument

You can also change the levels of the output factor with the `labels` argument.

```
x <- c(12, 1, 25, 12, 65, 2, 6, 17)
cut(x, breaks = c(0, 3, 12, 15, 20, 80),
    labels = c("First", "Second", "Third", "Fourth", "Fifth"))

# Equivalent to (the result is not named `c`, as that is also a base function)
res <- cut(x, breaks = c(0, 3, 12, 15, 20, 80))
levels(res) <- c("First", "Second", "Third", "Fourth", "Fifth")
```

```
Second First Fifth Second Fifth First Second Fourth
Levels: First Second Third Fourth Fifth
```
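Besides a character vector, `labels` also accepts `FALSE`, in which case `cut` returns the integer code of each interval instead of a factor. A short sketch with the same data (the name `codes` is illustrative):

```r
x <- c(12, 1, 25, 12, 65, 2, 6, 17)

# labels = FALSE returns the position of the interval for each value
codes <- cut(x, breaks = c(0, 3, 12, 15, 20, 80), labels = FALSE)
codes
```

This can be handy when you only need the group index, for example to subset another vector.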

### Include lowest value

The `include.lowest` argument specifies whether a value equal to the lowest break is included in the first interval. By default, it is set to `FALSE`.

```
x <- 15:25
cut(x, breaks = c(15, 20, 25), include.lowest = FALSE)
```

```
<NA> (15,20] (15,20] (15,20] (15,20]
(15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: (15,20] (20,25]
```

In this case, the lowest value (15), which is also the lowest break, is not included in the first interval (the interval is open on the left). As the number 15 doesn’t belong to any of the intervals, it is categorized as `NA`. However, if you set `include.lowest` to `TRUE`, the value will be included, as the first interval will be closed on the left.

`cut(x, breaks = c(15, 20, 25), include.lowest = TRUE)`

```
[15,20] [15,20] [15,20] [15,20] [15,20]
[15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: [15,20] (20,25]
```
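Despite its name, `include.lowest` applies to the highest break when `right = FALSE`, because in that case it is the last interval that would otherwise leave its limit out. A minimal sketch (the object names are illustrative):

```r
x <- 15:25

# With right = FALSE the intervals are [a, b), so the value 25 is left out
open_top <- cut(x, breaks = c(15, 20, 25), right = FALSE)
is.na(open_top[11])

# include.lowest = TRUE closes the last interval on the right
closed_top <- cut(x, breaks = c(15, 20, 25), right = FALSE,
                  include.lowest = TRUE)
levels(closed_top)
```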

### The argument ‘right’

Suppose, for instance, that you want to categorize some data $x$ into the following categories:

- Low, if $x \in [0, 150)$.
- Medium, if $x \in [150, 200)$.
- High, if $x \in [200, \infty)$.

By default, the argument `right` is set to `TRUE`, so the intervals are open on the left and closed on the right, (x, y].

```
x <- c(75, 150, 160, 151, 216, 149)
categories <- cut(x, breaks = c(0, 150, 200, Inf),
                  labels = c("low", "medium", "high"))
data.frame(x, categories)
```

In this scenario, the value 150 is not categorized as intended: it falls in the interval (0, 150], so it is labeled low instead of medium.

```
x categories
75 low
150 low # <-- Categorized as low
160 medium
151 medium
216 high
149 low
```

However, if you set `right = FALSE`, the intervals will be closed on the left and open on the right.

```
categories <- cut(x, breaks = c(0, 150, 200, Inf),
                  labels = c("low", "medium", "high"),
                  right = FALSE)
data.frame(x, categories)
```

Now the data is categorized correctly:

```
x categories
75 low
150 medium # <-- Categorized as medium
160 medium
151 medium
216 high
149 low
```

The `right` and `include.lowest` arguments are easy to get wrong, so we recommend adjusting the values of the `breaks` argument instead whenever possible.

## Example: How to categorize age groups in R?

Consider, for instance, that you want to categorize a numeric vector of ages in the following categories:

- 0-14: Children.
- 15-24: Youth.
- 25-64: Adult.
- 65 and over: Senior.

`age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)`

At first glance, you might think of setting the following breaks, but an error will arise.

```
cut(age, breaks = c(14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))
```

Although you have specified four break values and four labels, the breaks are the limits of the intervals, so you are generating three intervals instead of four (14-24, 24-64 and 64-Inf) and the number of labels doesn’t match. Consequently, in this case you need to add the lowest value to obtain four intervals:

```
cut(age, breaks = c(0, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))
```

```
<NA> Children Senior Children Adult Children Senior Children
Youth Youth Senior Adult Adult
Levels: Children Youth Adult Senior
```

But now the lowest age (0) is categorized as `NA`, as a value equal to the lowest break is not included by default. You can solve this by changing the 0 of the breaks (for example, setting -0.01 instead of 0) or by setting the `include.lowest` argument to `TRUE`.

```
cut(age, breaks = c(-0.01, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))

# Equivalent to:
cut(age, breaks = c(0, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"),
    include.lowest = TRUE)
```

```
Children Children Senior Children Adult Children Senior Children
Youth Youth Senior Adult Adult
Levels: Children Youth Adult Senior
```
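A possible alternative, shown here only as a sketch, is to use the lower limit of each age group as a break and set `right = FALSE`, so that the intervals are [0, 15), [15, 25), [25, 65) and [65, Inf). This matches the category definitions even for non-integer ages (14.5 would be a child). The name `age_groups` is illustrative.

```r
age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)

# Intervals closed on the left: [0, 15), [15, 25), [25, 65), [65, Inf)
age_groups <- cut(age, breaks = c(0, 15, 25, 65, Inf),
                  labels = c("Children", "Youth", "Adult", "Senior"),
                  right = FALSE)

# Count the observations per group, showing NA values if there are any
table(age_groups, useNA = "ifany")
```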

## Example: How to categorize exam grades?

As another example, exam grades can be categorized as fail, if the grade is lower than 5 points out of 10, or pass otherwise. We will generate a simple data set of exam grades to categorize.

```
numeric <- c(6.1, 5.3, 8.9, 5.0, 8.8, 1.9, 6.6, 7.2, 9.4, 4.9,
             7.1, 3.9, 1.0, 9.3, 9.9, 5.9, 5.1, 8.4, 3.2, 10.0)
```

In this example you could implement the function as follows:

```
# A break at 4.9 only works for grades with one decimal (4.95 would be a pass)
categorized_note <- cut(numeric, breaks = c(0, 4.9, 10),
                        labels = c("fail", "pass"))

# Equivalent for grades with one decimal:
# categorized_note <- cut(numeric, breaks = c(0, 5, 10.1),
#                         labels = c("fail", "pass"), right = FALSE)

# You could also set the factor levels afterwards with the levels function
# levels(categorized_note) <- c("fail", "pass")

# Generating the data frame
final_notes <- data.frame(numeric, categorized_note)
head(final_notes)
```

Note that in the equivalent alternative we set `right = FALSE`, because with `TRUE` a 5 would be categorized as fail instead of pass. However, when this argument is set to `FALSE`, the intervals are open on the right, so a 10 would not fall in the last interval; that is why we set the third break to 10.1 instead of 10. The final result is as follows:

```
numeric categorized_note
1 6.1 pass
2 5.3 pass
3 8.9 pass
4 5.0 pass
5 8.8 pass
6 1.9 fail
```
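If the grades could have more than one decimal, breaks such as 4.9 or 10.1 become fragile. One option, sketched below with an illustrative vector `grades`, is to combine `right = FALSE` with `include.lowest = TRUE`, which gives the intervals [0, 5) and [5, 10].

```r
# Illustrative grades, including the edge cases 0, 5 and 10
grades <- c(4.9, 4.95, 5.0, 10.0, 0)

# [0, 5) is fail and [5, 10] is pass
result <- cut(grades, breaks = c(0, 5, 10),
              labels = c("fail", "pass"),
              right = FALSE, include.lowest = TRUE)

data.frame(grades, result)
```

With this setup the breaks are exactly the limits stated in the grading rule, so no artificial values are needed.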