Tell R in Which Order to Read Factors
Factors
Introduction
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they're not actually helpful. Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
Prerequisites
To work with factors, we'll use the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for dealing with categorical variables (and it's an anagram of factors!).
Creating factors
Imagine that you have a variable that records month:
x1 <- c("Dec", "Apr", "Jan", "Mar")
Using a string to record this variable has two problems:
-
There are only twelve possible months, and there's nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
-
It doesn't sort in a useful way:
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid levels:
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Now you can create a factor:
y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
And any values not in the set will be silently converted to NA:
y2 <- factor(x2, levels = month_levels)
y2
#> [1] Dec  Apr  <NA> Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
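Because the conversion is silent, it can help to check for unmatched values before (or after) building the factor. The following is a minimal base R sketch, not part of the original chapter; the names bad_values and n_missing are illustrative only.

```r
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that are not valid levels; factor() would turn these into NA
bad_values <- setdiff(x2, month_levels)
bad_values

# Count how many NAs the conversion actually introduced
y2 <- factor(x2, levels = month_levels)
n_missing <- sum(is.na(y2))
n_missing
```

If bad_values is non-empty, the input contains typos or unexpected categories that deserve a look before any analysis.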
If you want a warning, you can use readr::parse_factor():
y2 <- parse_factor(x2, levels = month_levels)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set    Jam
If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar
Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder():
f1 <- factor(x1, levels = unique(x1))
f1
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
f2 <- x1 %>% factor() %>% fct_inorder()
f2
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
If you ever need to access the set of valid levels directly, you can do so with levels():
levels(f2)
#> [1] "Dec" "Apr" "Jan" "Mar"
Modifying factor order
It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
It is hard to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:
-
f, the factor whose levels you want to modify.
-
x, a numeric vector that you want to use to reorder the levels.
-
Optionally, fun, a function that's used if there are multiple values of x for each value of f. The default value is median.
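The three arguments can be tried out on a small made-up factor before applying them to the survey data. This is a sketch on hypothetical values, not gss_cat; note that in current forcats releases the summary function is passed as .fun, which is the spelling used here.

```r
library(forcats)

# Toy data (hypothetical values): two observations per level
f <- factor(c("a", "b", "c", "a", "b", "c"))
x <- c(1, 5, 8, 20, 6, 9)

# Default: levels are ordered by the median of x within each level
by_median <- levels(fct_reorder(f, x))
by_median

# A different summary function changes the resulting order
by_min <- levels(fct_reorder(f, x, .fun = min))
by_min
```

Applied to the plot above, the same call goes inside aes(), for example aes(tvhours, fct_reorder(relig, tvhours)), so that the religions are drawn in order of average TV hours.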
Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
As you start making more complicated transformations, I'd recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
What if we create a similar plot looking at how average age varies across reported income level?
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
Here, arbitrarily reordering the levels isn't a good idea! That's because rincome already has a principled order that we shouldn't mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.
However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.
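A minimal sketch of fct_relevel() on a made-up factor (the income labels here are hypothetical stand-ins, not the real rincome levels):

```r
library(forcats)

# Toy factor with a principled order and one special level at the end
income <- factor(
  c("Low", "High", "Not applicable", "Medium"),
  levels = c("Low", "Medium", "High", "Not applicable")
)

# Move the special level to the front; the others keep their order
releveled <- fct_relevel(income, "Not applicable")
levels(releveled)
```

For the income plot, the equivalent would be to map the y aesthetic to fct_relevel(rincome, "Not applicable") instead of fct_reorder(rincome, age).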
Why do you think the average age for "Not applicable" is so high?
Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
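To see what fct_reorder2() does to the levels without drawing a plot, here is a small sketch on made-up data (the group, x, and y values are hypothetical):

```r
library(forcats)

# Three "lines", each observed at x = 1 and x = 2
group <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 9, 3, 4)

# Levels are ordered by the y value at the largest x, highest first:
# at x = 2 the values are a = 1, b = 9, c = 4
reordered <- fct_reorder2(group, x, y)
levels(reordered)
```

The legend lists levels in this order, so it matches the vertical order of the line ends at the right edge of the plot.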
Finally, for bar plots, you can use fct_infreq() to order levels by frequency, most common first: this is the simplest type of reordering because it doesn't need any extra variables. Combine it with fct_rev() if you want the levels in increasing frequency.
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()
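A quick check of the level order these two helpers produce, using a made-up factor rather than gss_cat:

```r
library(forcats)

# "z" appears three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# fct_infreq() puts the most frequent level first
most_first <- levels(fct_infreq(f))
most_first

# fct_rev() flips that, giving least frequent first
least_first <- levels(fct_rev(fct_infreq(f)))
least_first
```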
Exercises
-
There are some suspiciously high numbers in tvhours. Is the mean a good summary?
-
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
-
Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
Modifying factor levels
More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take the gss_cat$partyid:
gss_cat %>% count(partyid)
#> # A tibble: 10 x 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # … with 4 more rows
The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 x 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # … with 4 more rows
fct_recode() will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
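Both behaviours can be checked on a small made-up factor. This is a sketch only: the three-level party vector is hypothetical, and the tryCatch() wrapper is just one way to observe that a warning was raised.

```r
library(forcats)

# Toy factor; levels are alphabetical by default
party <- factor(c("Strong republican", "Ind,near rep", "Strong democrat"))

# Only the mentioned level changes; the rest are left as is
recoded <- fct_recode(party, "Republican, strong" = "Strong republican")
levels(recoded)

# Referring to a level that does not exist triggers a warning
warned <- tryCatch(
  {
    fct_recode(party, "Other" = "No such level")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```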
To combine groups, you can assign multiple old levels to the same new level:
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 x 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 Other                   548
#> 2 Republican, strong     2314
#> 3 Republican, weak       3032
#> 4 Independent, near rep  1791
#> 5 Independent            4119
#> 6 Independent, near dem  2499
#> # … with 2 more rows
You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
#> # A tibble: 4 x 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of fct_lump():
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 x 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Other      10637
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over collapsed.
Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
#> # A tibble: 10 x 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Other                     458
#>  6 Jewish                    388
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95
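Newer forcats releases (0.5.0 and later, as far as I know) also offer more specific variants such as fct_lump_n() and fct_lump_min(), which make the lumping rule explicit. A small sketch on a made-up factor, assuming one of those versions is installed:

```r
library(forcats)

# Toy factor: "a" x5, "b" x3, and two singletons
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels, lump the rest into "Other"
top_two <- fct_lump_n(f, n = 2)
levels(top_two)
table(top_two)

# Keep only levels that appear at least 3 times
at_least_three <- fct_lump_min(f, min = 3)
levels(at_least_three)
```

On this toy input both calls happen to keep the same levels; on real data the two rules will usually differ, so pick the one that matches how you would describe the cut-off in words.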
Exercises
-
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
-
How could you collapse rincome into a small set of categories?
Source: https://r4ds.had.co.nz/factors.html