Tell R in Which Order to Read Factors

Factors

Introduction

In R, factors are used to work with categorical variables, variables that take a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they're not actually helpful. Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.

Prerequisites

To work with factors, we'll use the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
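
The code in this section uses functions from several core tidyverse packages without loading them explicitly. A minimal setup, assuming the tidyverse package is already installed, might look like this:

```r
# Attach the core tidyverse (includes forcats, dplyr, ggplot2 and readr).
library(tidyverse)

# gss_cat is a sample data set that ships with forcats;
# a quick look confirms that it is available.
head(gss_cat)
```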

Creating factors

Imagine that you have a variable that records month:

    x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there's nothing saving you from typos:

         x2 <- c("Dec", "Apr", "Jam", "Mar")

  2. It doesn't sort in a useful way:

         sort(x1)
         #> [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid levels:

    month_levels <- c(
      "Jan", "Feb", "Mar", "Apr", "May", "Jun",
      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    )

Now you can create a factor:

    y1 <- factor(x1, levels = month_levels)
    y1
    #> [1] Dec Apr Jan Mar
    #> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

    sort(y1)
    #> [1] Jan Mar Apr Dec
    #> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

And any values not in the set will be silently converted to NA:

    y2 <- factor(x2, levels = month_levels)
    y2
    #> [1] Dec  Apr  <NA> Mar
    #> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

If you want a warning, you can use readr::parse_factor():

    y2 <- parse_factor(x2, levels = month_levels)
    #> Warning: 1 parsing failure.
    #> row col           expected actual
    #>   3  -- value in level set    Jam
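
Another way to catch such values before they turn into NA is to compare the data against the level set yourself. The following is a small sketch using only base R; the name bad_values is made up for illustration:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data but missing from the valid levels.
bad_values <- setdiff(x2, month_levels)
bad_values

# After conversion, the same problem shows up as NA entries.
y2 <- factor(x2, levels = month_levels)
sum(is.na(y2))
```

If bad_values is not empty, you can fix the typos in the data before creating the factor.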

If you omit the levels, they'll be taken from the data in alphabetical order:

    factor(x1)
    #> [1] Dec Apr Jan Mar
    #> Levels: Apr Dec Jan Mar

Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder():

    f1 <- factor(x1, levels = unique(x1))
    f1
    #> [1] Dec Apr Jan Mar
    #> Levels: Dec Apr Jan Mar

    f2 <- x1 %>% factor() %>% fct_inorder()
    f2
    #> [1] Dec Apr Jan Mar
    #> Levels: Dec Apr Jan Mar

If you ever need to access the set of valid levels directly, you can do so with levels():

    levels(f2)
    #> [1] "Dec" "Apr" "Jan" "Mar"

Modifying factor order

It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

    relig_summary <- gss_cat %>%
      group_by(relig) %>%
      summarise(
        age = mean(age, na.rm = TRUE),
        tvhours = mean(tvhours, na.rm = TRUE),
        n = n()
      )
    #> `summarise()` ungrouping output (override with `.groups` argument)

    ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is hard to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

  • f, the factor whose levels you want to change.
  • x, a numeric vector that you want to use to reorder the levels.
  • Optionally, fun, a function that's used if there are multiple values of x for each value of f. The default value is median.
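
As a sketch of how these arguments fit together, the plot can be redrawn with relig reordered by tvhours directly inside aes(). The summary is rebuilt here so the example runs on its own; the name reordered is only used for illustration:

```r
library(dplyr)
library(ggplot2)
library(forcats)

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE), n = n())

# f = relig, x = tvhours: the levels are sorted by the tvhours values.
reordered <- fct_reorder(relig_summary$relig, relig_summary$tvhours)
levels(reordered)

# The same reordering applied inside the plot call.
p <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
p
```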

Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.

As you start making more complicated transformations, I'd recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above with the reordering done in mutate():

    relig_summary %>%
      mutate(relig = fct_reorder(relig, tvhours)) %>%
      ggplot(aes(tvhours, relig)) +
        geom_point()

What if we create a similar plot looking at how average age varies across reported income level?

    rincome_summary <- gss_cat %>%
      group_by(rincome) %>%
      summarise(
        age = mean(age, na.rm = TRUE),
        tvhours = mean(tvhours, na.rm = TRUE),
        n = n()
      )
    #> `summarise()` ungrouping output (override with `.groups` argument)

    ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Here, arbitrarily reordering the levels isn't a good idea! That's because rincome already has a principled order that we shouldn't mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.

However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.
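
A sketch of that relevelling applied to the income plot follows. The summary is rebuilt so the example is self-contained, and the name releveled is only used for illustration:

```r
library(dplyr)
library(ggplot2)
library(forcats)

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(age = mean(age, na.rm = TRUE), n = n())

# Move "Not applicable" to the first position; the other levels keep their order.
releveled <- fct_relevel(rincome_summary$rincome, "Not applicable")
levels(releveled)

# The same relevelling applied inside the plot call.
p <- ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
p
```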

Why do you think the average age for "Not applicable" is so high?

Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

    by_age <- gss_cat %>%
      filter(!is.na(age)) %>%
      count(age, marital) %>%
      group_by(age) %>%
      mutate(prop = n / sum(n))

    ggplot(by_age, aes(age, prop, colour = marital)) +
      geom_line(na.rm = TRUE)

    ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
      geom_line() +
      labs(colour = "marital")
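
To see what fct_reorder2() does to the levels themselves, here is a small sketch on made-up data (the data frame df and its columns g, x and y are invented for illustration). Each group has one y value at x = 1 and one at x = 2, and the levels end up sorted by the y value at the largest x, highest first:

```r
library(forcats)

# Toy data: three groups, each observed at x = 1 and x = 2.
df <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(1, 5, 2, 9, 3, 1)
)

# At x = 2 the y values are a = 5, b = 9, c = 1,
# so the reordered levels follow that ranking from high to low.
levels(fct_reorder2(df$g, df$x, df$y))
```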

Finally, for bar plots, you can use fct_infreq() to order levels by decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. You may want to combine it with fct_rev() to get increasing frequency instead.

    gss_cat %>%
      mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
      ggplot(aes(marital)) +
        geom_bar()
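
The effect of the two functions on the level order is easy to check on a tiny factor. This is a sketch with made-up values:

```r
library(forcats)

# "c" appears three times, "a" twice, "b" once.
f <- factor(c("b", "a", "a", "c", "c", "c"))

# Most frequent level first.
levels(fct_infreq(f))

# Reversed: least frequent level first.
levels(fct_rev(fct_infreq(f)))
```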

Exercises

  1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

  2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

  3. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?

Modifying factor levels

More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take gss_cat$partyid:

    gss_cat %>% count(partyid)
    #> # A tibble: 10 x 2
    #>   partyid                n
    #>   <fct>              <int>
    #> 1 No answer            154
    #> 2 Don't know             1
    #> 3 Other party          393
    #> 4 Strong republican   2314
    #> 5 Not str republican  3032
    #> 6 Ind,near rep        1791
    #> # … with 4 more rows

The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.

    gss_cat %>%
      mutate(partyid = fct_recode(partyid,
        "Republican, strong"    = "Strong republican",
        "Republican, weak"      = "Not str republican",
        "Independent, near rep" = "Ind,near rep",
        "Independent, near dem" = "Ind,near dem",
        "Democrat, weak"        = "Not str democrat",
        "Democrat, strong"      = "Strong democrat"
      )) %>%
      count(partyid)
    #> # A tibble: 10 x 2
    #>   partyid                   n
    #>   <fct>                 <int>
    #> 1 No answer               154
    #> 2 Don't know                1
    #> 3 Other party             393
    #> 4 Republican, strong     2314
    #> 5 Republican, weak       3032
    #> 6 Independent, near rep  1791
    #> # … with 4 more rows

fct_recode() will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
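
Both behaviours can be seen on a small made-up factor. In this sketch the misspelled level "bananna" is deliberate, and tryCatch() is used only to capture the warning text for inspection:

```r
library(forcats)

f <- factor(c("apple", "bear", "banana", "apple"))

# "bear" is not mentioned, so it is left as is.
recoded <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(recoded)

# Referring to a level that doesn't exist produces a warning.
msg <- tryCatch(
  fct_recode(f, fruit = "apple", fruit = "bananna"),
  warning = function(w) conditionMessage(w)
)
msg
```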

To combine groups, you can assign multiple old levels to the same new level:

    gss_cat %>%
      mutate(partyid = fct_recode(partyid,
        "Republican, strong"    = "Strong republican",
        "Republican, weak"      = "Not str republican",
        "Independent, near rep" = "Ind,near rep",
        "Independent, near dem" = "Ind,near dem",
        "Democrat, weak"        = "Not str democrat",
        "Democrat, strong"      = "Strong democrat",
        "Other"                 = "No answer",
        "Other"                 = "Don't know",
        "Other"                 = "Other party"
      )) %>%
      count(partyid)
    #> # A tibble: 8 x 2
    #>   partyid                   n
    #>   <fct>                 <int>
    #> 1 Other                   548
    #> 2 Republican, strong     2314
    #> 3 Republican, weak       3032
    #> 4 Independent, near rep  1791
    #> 5 Independent            4119
    #> 6 Independent, near dem  2499
    #> # … with 2 more rows

You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:

    gss_cat %>%
      mutate(partyid = fct_collapse(partyid,
        other = c("No answer", "Don't know", "Other party"),
        rep = c("Strong republican", "Not str republican"),
        ind = c("Ind,near rep", "Independent", "Ind,near dem"),
        dem = c("Not str democrat", "Strong democrat")
      )) %>%
      count(partyid)
    #> # A tibble: 4 x 2
    #>   partyid     n
    #>   <fct>   <int>
    #> 1 other     548
    #> 2 rep      5346
    #> 3 ind      8409
    #> 4 dem      7180

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of fct_lump():

    gss_cat %>%
      mutate(relig = fct_lump(relig)) %>%
      count(relig)
    #> # A tibble: 2 x 2
    #>   relig          n
    #>   <fct>      <int>
    #> 1 Protestant 10846
    #> 2 Other      10637

The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over collapsed.

Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:

    gss_cat %>%
      mutate(relig = fct_lump(relig, n = 10)) %>%
      count(relig, sort = TRUE) %>%
      print(n = Inf)
    #> # A tibble: 10 x 2
    #>    relig                       n
    #>    <fct>                   <int>
    #>  1 Protestant              10846
    #>  2 Catholic                 5124
    #>  3 None                     3523
    #>  4 Christian                 689
    #>  5 Other                     458
    #>  6 Jewish                    388
    #>  7 Buddhism                  147
    #>  8 Inter-nondenominational   109
    #>  9 Moslem/islam              104
    #> 10 Orthodox-christian         95
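
The effect of n is easier to verify on a small made-up factor. In this sketch, the two most common levels are kept and the rest are lumped; the label "Rare" passed to other_level is an arbitrary choice for illustration:

```r
library(forcats)

# "a" appears five times, "b" three times, "c" and "d" once each.
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most common levels; everything else becomes "Other".
lumped <- fct_lump(f, n = 2)
table(lumped)

# other_level changes the name of the catch-all level.
levels(fct_lump(f, n = 2, other_level = "Rare"))
```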

Exercises

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

  2. How could you collapse rincome into a small set of categories?


Source: https://r4ds.had.co.nz/factors.html
