Sample selection function — select_sample • SampleSelectR

Selects a random sample using a specified method and sample size. Selection can optionally be stratified, and a measure of size (mos) can be supplied when a probability proportional to size (PPS) method is used.

Usage

select_sample(
  frame,
  method,
  n,
  outall = FALSE,
  strata = NULL,
  mos = NULL,
  sort_vars = NULL,
  sort_method = NULL
)

Arguments

frame: A data frame, data.table, or tibble from which to draw the sample. No default.
method: The desired sampling method. Valid options are "srs", "sys_eq", "sys_pps", and "chromy_pps". No default.
n: The sample size to draw from the frame. If strata is NULL, n must be a positive integer. If strata is not NULL, n must be a data.frame, tibble, or data.table containing one column per stratification variable, with the same names and types as in the frame, plus a column named sample_size holding a positive integer sample size for each stratum. No default.
outall: A logical value indicating whether to return the entire frame with a selection indicator or just the sample. Default is FALSE.
strata: A character vector naming the stratification variables on the frame. Default is NULL.
mos: A character string naming the variable on the frame that holds the measure of size. If not NULL, method must be "sys_pps" or "chromy_pps". If NULL, method must be "srs" or "sys_eq". Default is NULL.
sort_vars: A character vector naming the variables used to sort the frame. If not NULL, method cannot be "srs". Default is NULL.
sort_method: A character string defining the method used to implicitly sort the frame. Valid options are "serpentine" and "nest". Must be consistent with sort_vars; i.e., both must be NULL or both must be non-NULL. Default is NULL.
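
The compatibility rules among method, mos, sort_vars, and sort_method can be summarized in a small pre-flight check. The sketch below is illustrative only: check_args is a hypothetical helper, not part of SampleSelectR, and the package's own validation and error messages may differ.

```r
# Hypothetical helper (not part of SampleSelectR) that mirrors the
# argument rules described above and returns any violations as text.
check_args <- function(method, mos = NULL, sort_vars = NULL, sort_method = NULL) {
  valid_methods <- c("srs", "sys_eq", "sys_pps", "chromy_pps")
  pps_methods <- c("sys_pps", "chromy_pps")
  problems <- character(0)

  if (!(method %in% valid_methods)) {
    problems <- c(problems, "method is not one of the valid options")
  }
  # mos and PPS methods must go together
  if (!is.null(mos) && !(method %in% pps_methods)) {
    problems <- c(problems, "mos is only allowed with a PPS method")
  }
  if (is.null(mos) && method %in% pps_methods) {
    problems <- c(problems, "PPS methods require mos")
  }
  # sorting is not allowed for simple random sampling
  if (!is.null(sort_vars) && method == "srs") {
    problems <- c(problems, "sort_vars cannot be used with method = \"srs\"")
  }
  # sort_vars and sort_method must both be NULL or both be non-NULL
  if (is.null(sort_vars) != is.null(sort_method)) {
    problems <- c(problems, "sort_vars and sort_method must be supplied together")
  }
  problems
}

# An empty character vector means no rule was violated.
check_args("sys_pps", mos = "ENRTOT",
           sort_vars = c("SECTOR", "ENRTOT"), sort_method = "nest")

# Supplying mos with an equal-probability method violates one rule.
check_args("srs", mos = "ENRTOT")
```

Returning the list of problems rather than stopping at the first one makes it easier to see every conflicting argument in a single pass.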

Value

A tidytable object containing either the entire frame with a selection indicator or just the sample, depending on the value of outall. Selection probability and sampling weight are also included. For certain sampling methods, summary messages may also be printed to the console.
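
As a rough illustration of how a selection probability and a sampling weight relate, the base R sketch below works through a stratified simple random sample by hand. It assumes the usual design-based convention that the weight is the inverse of the selection probability; whether select_sample computes its output columns exactly this way is an assumption, and all variable names here are hypothetical. The frame sizes are the ones reported in the stratified county example on this page.

```r
# Stratum frame sizes as reported in the county example on this page
frame_size <- c(South = 1422, West = 449, Northeast = 218, Midwest = 1055)
sample_size <- 25  # same sample size in every stratum

# Under SRS within a stratum, every unit has probability n_h / N_h
sel_prob <- sample_size / frame_size

# Assumed convention: weight is the inverse of the selection probability
samp_weight <- 1 / sel_prob

round(sel_prob, 4)
round(samp_weight, 2)

# Weights summed over the sampled units recover the total frame size
sum(samp_weight * sample_size)
```

A quick check of this kind (weights summing to the frame size) is a common way to confirm that a drawn sample's weights are on the expected scale.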

Examples


# SRS of 100 US counties, using geographic region as strata
# n is a data frame containing the strata values and corresponding desired sample size
# Sample size column must be titled 'sample_size'

n_df_srs <- data.frame(
  Region = as.factor(c("Northeast", "Midwest", "South", "West")),
  sample_size = c(25, 25, 25, 25)
)

county_2023 |>
  select_sample(method = "srs", n = n_df_srs, strata = "Region")
#> Stratum: Region = South 
#> --Frame size: 1422
#> --Sample size: 25
#> Stratum: Region = West 
#> --Frame size: 449
#> --Sample size: 25
#> Stratum: Region = Northeast 
#> --Frame size: 218
#> --Sample size: 25
#> Stratum: Region = Midwest 
#> --Frame size: 1055
#> --Sample size: 25
#> # A tidytable: 100 × 27
#>    Region GEOID Name    State Division Pop_Tot Pop_Pct_White_NH Pop_Pct_Black_NH
#>    <fct>  <chr> <chr>   <chr> <fct>      <dbl>            <dbl>            <dbl>
#>  1 South  05043 Drew C… AR    West So…   17143             64.9           26.7  
#>  2 South  05097 Montgo… AR    West So…    8571             89.7            0.840
#>  3 South  12011 Browar… FL    South A… 1946127             31.9           27.6  
#>  4 South  12109 St. Jo… FL    South A…  292243             78.6            4.93 
#>  5 South  13071 Colqui… GA    South A…   45907             54.3           22.8  
#>  6 South  13093 Dooly … GA    South A…   11026             40.9           47.7  
#>  7 South  13127 Glynn … GA    South A…   84987             62.1           24.5  
#>  8 South  13259 Stewar… GA    South A…    4978             24.6           59.7  
#>  9 South  13273 Terrel… GA    South A…    8941             36.4           59.6  
#> 10 South  21131 Leslie… KY    East So…   10261             96.8            0.526
#> # ℹ 90 more rows
#> # ℹ 19 more variables: Pop_Pct_AIAN_NH <dbl>, Pop_Pct_Asian_NH <dbl>,
#> #   Pop_Pct_NHPI_NH <dbl>, Pop_Pct_Other_NH <dbl>, Pop_Pct_Hispanic <dbl>,
#> #   HU_Tot <dbl>, HU_Pct_Occupied <dbl>, HU_Pct_Vacant <dbl>,
#> #   Pop_Pct_0004 <dbl>, Pop_Pct_0509 <dbl>, Pop_Pct_1014 <dbl>,
#> #   Pop_Pct_2544 <dbl>, Pop_Pct_4564 <dbl>, Pop_Pct_6574 <dbl>,
#> #   Pop_Pct_75plus <dbl>, Pop_Pct_1517 <dbl>, Pop_Pct_1824 <dbl>, …


# Systematic sample of 250 US universities. Each unit has an equal probability of being selected.
# Includes a nested sort of enrollment total within sector
# Returns all observations from the original data frame with a selection indicator column

sample_sys_eq <- ipeds |>
  select_sample(
    method = "sys_eq", n = 250, outall = TRUE,
    sort_vars = c("SECTOR", "ENRTOT"), sort_method = "nest"
  )
#> Frame size: 5914
#> Sample size: 250
#> Sampling interval (k): 23.656
#> Random start (r): 10.85918

# For samples taken with outall = TRUE, the sample size can be verified by summing
# the SelectionIndicator column.

sample_sys_eq
#> # A tidytable: 5,914 × 19
#>    UNITID INSTNM      STABBR  FIPS OBEREG ICLEVEL SECTOR LOCALE DEGGRANT HLOFFER
#>     <dbl> <chr>       <chr>  <dbl> <fct>  <fct>   <fct>  <fct>  <fct>    <fct>  
#>  1 180203 Aaniiih Na… MT        30 Rocky… Four o… Publi… Rural… Degree-… Bachel…
#>  2 491297 University… WI        55 Great… Four o… Publi… Rural… Degree-… Bachel…
#>  3 200086 Nueta Hida… ND        38 Plain… Four o… Publi… Rural… Degree-… Bachel…
#>  4 219408 Sisseton W… SD        46 Plain… Four o… Publi… Rural… Degree-… Bachel…
#>  5 260372 Lac Courte… WI        55 Great… Four o… Publi… Rural… Degree-… Master…
#>  6 214607 Pennsylvan… PA        42 Mid E… Four o… Publi… Subur… Degree-… Master…
#>  7 366340 Stone Chil… MT        30 Rocky… Four o… Publi… Rural… Degree-… Bachel…
#>  8 200466 Sitting Bu… ND        38 Plain… Four o… Publi… Rural… Degree-… Master…
#>  9 434584 Ilisagvik … AK         2 Far W… Four o… Publi… Rural… Degree-… Bachel…
#> 10 243188 University… PR        72 Other… Four o… Publi… Rural… Degree-… Bachel…
#> # ℹ 5,904 more rows
#> # ℹ 9 more variables: ENRTOT <dbl>, EFUG <dbl>, EFUG1ST <dbl>, EFUGFT <dbl>,
#> #   EFGRAD <dbl>, EFGRADFT <dbl>, SelectionIndicator <lgl>,
#> #   SelectionProbability <dbl>, SamplingWeight <dbl>
sum(sample_sys_eq$SelectionIndicator)
#> [1] 250
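
The sampling interval and random start reported above can be related to the selected positions with a short base R sketch. It uses a common fractional-interval scheme: k = N / n, a random start r drawn between 0 and k, and selection of the units at positions ceiling(r + j * k). Whether select_sample uses exactly this rule is an assumption; the variable names are hypothetical.

```r
set.seed(2023)  # only to make this sketch reproducible

N <- 5914  # frame size from the example above
n <- 250   # desired sample size

k <- N / n               # sampling interval: 23.656
r <- runif(1, 0, k)      # random start strictly between 0 and k

# Selection points r, r + k, r + 2k, ...; each maps to the row it falls in
positions <- ceiling(r + (0:(n - 1)) * k)

length(positions)        # one position per requested unit
anyDuplicated(positions) # 0, because k > 1 keeps positions distinct
range(positions)         # always within 1..N
```

Because the interval exceeds 1, no row can be selected twice, which is why the equal-probability sample always contains exactly n distinct rows.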


# Systematic PPS sample of 250 US universities. Each unit's probability of selection
# is proportional to its size measure.
# Using enrollment total as MOS
# Includes a nested sort of enrollment total within sector

sample_sys_pps <- ipeds |>
  select_sample(
    method = "sys_pps", n = 250, mos = "ENRTOT",
    sort_vars = c("SECTOR", "ENRTOT"), sort_method = "nest"
  )
#> Frame size: 5914
#> Sample size: 250
#> Sampling interval (k): 78812
#> Random start (r): 10272.68

# For PPS samples, a single sampling unit with a large mos value can be selected more than
# once. This becomes more likely as the desired sample size increases. As a result, the
# final sample may contain fewer rows than the desired sample size. To verify a PPS sample,
# sum the NumberHits column; the total should equal the desired sample size.

sample_sys_pps
#> # A tidytable: 247 × 19
#>    UNITID INSTNM      STABBR  FIPS OBEREG ICLEVEL SECTOR LOCALE DEGGRANT HLOFFER
#>     <dbl> <chr>       <chr>  <dbl> <fct>  <fct>   <fct>  <fct>  <fct>    <fct>  
#>  1 214795 Pennsylvan… PA        42 Mid E… Four o… Publi… Rural… Degree-… Bachel…
#>  2 187596 Navajo Tec… NM        35 South… Four o… Publi… Rural… Degree-… Doctor…
#>  3 228501 Sul Ross S… TX        48 South… Four o… Publi… Town:… Degree-… Master…
#>  4 204705 Ohio State… OH        39 Great… Four o… Publi… Subur… Degree-… Bachel…
#>  5 163338 University… MD        24 Mid E… Four o… Publi… Town:… Degree-… Doctor…
#>  6 487010 The Univer… TN        47 South… Four o… Publi… City:… Degree-… Doctor…
#>  7 219259 Northern S… SD        46 Plain… Four o… Publi… Town:… Degree-… Master…
#>  8 218645 University… SC        45 South… Four o… Publi… Subur… Degree-… Master…
#>  9 207397 Oklahoma S… OK        40 South… Four o… Publi… City:… Degree-… Bachel…
#> 10 155025 Emporia St… KS        20 Plain… Four o… Publi… Town:… Degree-… Doctor…
#> # ℹ 237 more rows
#> # ℹ 9 more variables: ENRTOT <dbl>, EFUG <dbl>, EFUG1ST <dbl>, EFUGFT <dbl>,
#> #   EFGRAD <dbl>, EFGRADFT <dbl>, SamplingWeight <dbl>, NumberHits <int>,
#> #   ExpectedHits <dbl>
sum(sample_sys_pps$NumberHits)
#> [1] 250
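
To see why a unit can be hit more than once, the following base R sketch runs a textbook systematic PPS selection on a tiny made-up frame. The mos values are hypothetical, and the cumulative-size approach shown here is a generic illustration rather than the package's confirmed implementation.

```r
set.seed(42)  # only to make this sketch reproducible

mos <- c(50, 10, 400, 25, 15)  # hypothetical size measures for 5 units
n <- 4                         # desired number of hits

k <- sum(mos) / n              # sampling interval on the size scale: 125
r <- runif(1, 0, k)            # random start strictly between 0 and k
points <- r + (0:(n - 1)) * k  # selection points along the cumulated sizes

# Unit j owns the interval (cum[j - 1], cum[j]] of the cumulated sizes
cum <- cumsum(mos)
unit <- findInterval(points, cum, left.open = TRUE) + 1

hits <- tabulate(unit, nbins = length(mos))  # number of hits per unit
expected_hits <- n * mos / sum(mos)          # expected hits per unit

data.frame(mos, hits, expected_hits)
sum(hits)       # equals n, even though fewer than n distinct units are hit
sum(hits > 0)   # number of rows a sample-only result would contain
```

Unit 3 has a size measure larger than the interval, so it receives several hits. The hits still total n, while the count of distinct selected units is smaller.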


# Sequential (Chromy's method) PPS sample of 500 PUMAs, using geographic region as strata
# Includes a serpentine sort of geographic division then state
# Using population total as MOS, each unit's probability of selection is proportional to its
# size measure.
# Note that there may be a discrepancy between the desired and final sample sizes. The final
# sample size can be verified by totaling NumberHits.

n_df_chr <- data.frame(
  Region = as.factor(c("Northeast", "Midwest", "South", "West")),
  sample_size = c(125, 125, 125, 125)
)

puma_2023 |>
  select_sample(
    method = "chromy_pps", n = n_df_chr, strata = "Region", mos = "Pop_Tot",
    sort_vars = c("Division", "State"), sort_method = "serpentine"
  )
#> Stratum: Region = South 
#> --Frame size: 952
#> --Sample size: 125
#> Stratum: Region = West 
#> --Frame size: 581
#> --Sample size: 125
#> Stratum: Region = Northeast 
#> --Frame size: 423
#> --Sample size: 125
#> Stratum: Region = Midwest 
#> --Frame size: 506
#> --Sample size: 125
#> # A tidytable: 500 × 28
#>    Region GEOID   Name  State Division Pop_Tot Pop_Pct_White_NH Pop_Pct_Black_NH
#>    <fct>  <chr>   <chr> <chr> <fct>      <dbl>            <dbl>            <dbl>
#>  1 South  1000105 Sout… DE    South A…  107205             59.0            23.3 
#>  2 South  1200902 Brev… FL    South A…  124268             82.9             2.04
#>  3 South  1201106 Brow… FL    South A…  186422             29.4            28.1 
#>  4 South  1201113 Brow… FL    South A…  125752             26.9            22.6 
#>  5 South  1201500 Char… FL    South A…  195083             82.4             5.10
#>  6 South  1203104 Duva… FL    South A…  120322             45.6            28.3 
#>  7 South  1203107 Duva… FL    South A…  131852             64.7             9.97
#>  8 South  1205704 Hill… FL    South A…  136079             54.3             6.31
#>  9 South  1205707 Hill… FL    South A…  152043             38.3            18.3 
#> 10 South  1206902 Lake… FL    South A…  136281             73.1             9.06
#> # ℹ 490 more rows
#> # ℹ 20 more variables: Pop_Pct_AIAN_NH <dbl>, Pop_Pct_Asian_NH <dbl>,
#> #   Pop_Pct_NHPI_NH <dbl>, Pop_Pct_Other_NH <dbl>, Pop_Pct_Hispanic <dbl>,
#> #   HU_Tot <dbl>, HU_Pct_Occupied <dbl>, HU_Pct_Vacant <dbl>,
#> #   Pop_Pct_0004 <dbl>, Pop_Pct_0509 <dbl>, Pop_Pct_1014 <dbl>,
#> #   Pop_Pct_2544 <dbl>, Pop_Pct_4564 <dbl>, Pop_Pct_6574 <dbl>,
#> #   Pop_Pct_75plus <dbl>, Pop_Pct_1517 <dbl>, Pop_Pct_1824 <dbl>, …