An overview on dformula

Introduction

dformula allows to easily modify, transform, add and extrapolate using the basic R formula. The operations on the data are the following:

Operation Function
Add new variables add()
Transform existing variables transform()
Rename existing variables rename()
Selection rows and columns select()
Removing row and column remove

The formula is composed of two part:

column_names ∼ new_variables

the right-hand side shows the names of the columns of the data and the left-hand side the transformation or the new variables to insert in the data.

The I() is used in the right-hand side to indicate the type of transformation of the existing variable. In this function, we can insert logical statement, function implemented in R or user build function.

For example:

$$ var\_name_1 + var\_name_2 \sim I(log(var\_name_1)) + I(var\_name_2 == "something") $$

the two variable varname1 and varname2 are transformed in log(varname1) or selected to be equal to $"something"$.

In the same fashion of SQL, we have the from argument, the input data, and the as argument, the new name of the variables, after transformation, selection or addition.



The CRAN version can be loaded

library('dformula')

or the development version from GitHub:

remotes::install_github('serafinialessio/dformula')

The data are available in the package will be used in this overview

data("population_data")
pop_data <- population_data

which describes the Population and Area of world countries

str(pop_data)
## 'data.frame':    159 obs. of  3 variables:
##  $ Country   : chr  "China" "India" "United States" "Indonesia" ...
##  $ Population: num  1.44e+09 1.38e+09 3.31e+08 2.74e+08 2.21e+08 ...
##  $ Area      : num  9388211 2973190 9147420 1811570 771 ...

Adding variables

The add() function inserts new variables starting from the existing columns in the data.

Suppose we want to calculate population density and attach this to the original dataset

new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)
##         Country Population       Area        Var.1
## 1         China 1439323776 9388211.00    153.31183
## 2         India 1380004385 2973190.00    464.14941
## 3 United States  331002651 9147420.00     36.18536
## 4     Indonesia  273523615 1811570.00    150.98705
## 5      Pakistan  220892340     770.88 286545.68804
## 6        Brazil  212559417 8358140.00     25.43143

and give a name to this new variable

new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)
##         Country Population       Area  pop_density
## 1         China 1439323776 9388211.00    153.31183
## 2         India 1380004385 2973190.00    464.14941
## 3 United States  331002651 9147420.00     36.18536
## 4     Indonesia  273523615 1811570.00    150.98705
## 5      Pakistan  220892340     770.88 286545.68804
## 6        Brazil  212559417 8358140.00     25.43143

Multiple variable can be added with a single formula

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)
##         Country Population       Area        Var.1     Var.2
## 1         China 1439323776 9388211.00    153.31183 16.054965
## 2         India 1380004385 2973190.00    464.14941 14.905146
## 3 United States  331002651 9147420.00     36.18536 16.028982
## 4     Indonesia  273523615 1811570.00    150.98705 14.409704
## 5      Pakistan  220892340     770.88 286545.68804  6.647533
## 6        Brazil  212559417 8358140.00     25.43143 15.938746

and with new names

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)), 
               as = c("pop_density", "log_area"))
head(new_pop)
##         Country Population       Area  pop_density  log_area
## 1         China 1439323776 9388211.00    153.31183 16.054965
## 2         India 1380004385 2973190.00    464.14941 14.905146
## 3 United States  331002651 9147420.00     36.18536 16.028982
## 4     Indonesia  273523615 1811570.00    150.98705 14.409704
## 5      Pakistan  220892340     770.88 286545.68804  6.647533
## 6        Brazil  212559417 8358140.00     25.43143 15.938746

If we have one transformation applied to a group of variables, we do not specify the function multiple times

new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)
##         Country Population       Area    Var.1     Var.2
## 1         China 1439323776 9388211.00 21.08744 16.054965
## 2         India 1380004385 2973190.00 21.04535 14.905146
## 3 United States  331002651 9147420.00 19.61764 16.028982
## 4     Indonesia  273523615 1811570.00 19.42690 14.409704
## 5      Pakistan  220892340     770.88 19.21319  6.647533
## 6        Brazil  212559417 8358140.00 19.17473 15.938746

and with new column names

new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
               as = c("log_pop", "log_area"))
head(new_pop)
##         Country Population       Area  log_pop  log_area
## 1         China 1439323776 9388211.00 21.08744 16.054965
## 2         India 1380004385 2973190.00 21.04535 14.905146
## 3 United States  331002651 9147420.00 19.61764 16.028982
## 4     Indonesia  273523615 1811570.00 19.42690 14.409704
## 5      Pakistan  220892340     770.88 19.21319  6.647533
## 6        Brazil  212559417 8358140.00 19.17473 15.938746

Suppose we want to add a numerical id for the countries at the beginning of the dataset, using the position argument

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(new_pop)), 
               position = "left", as = "id")
head(new_pop)
##   id       Country Population       Area
## 1  1         China 1439323776 9388211.00
## 2  2         India 1380004385 2973190.00
## 3  3 United States  331002651 9147420.00
## 4  4     Indonesia  273523615 1811570.00
## 5  5      Pakistan  220892340     770.88
## 6  6        Brazil  212559417 8358140.00

We can also add a constant variable. For example the year of the observation

new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)
##   Var.1 Var.2 Var.3       Country Population       Area
## 1  2020  2020  2020         China 1439323776 9388211.00
## 2  2020  2020  2020         India 1380004385 2973190.00
## 3  2020  2020  2020 United States  331002651 9147420.00
## 4  2020  2020  2020     Indonesia  273523615 1811570.00
## 5  2020  2020  2020      Pakistan  220892340     770.88
## 6  2020  2020  2020        Brazil  212559417 8358140.00

or both

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(new_pop)) + C("2020"), 
               position = "left", as = c("ids", "year"))
head(new_pop)
##   ids year       Country Population       Area
## 1   1 2020         China 1439323776 9388211.00
## 2   2 2020         India 1380004385 2973190.00
## 3   3 2020 United States  331002651 9147420.00
## 4   4 2020     Indonesia  273523615 1811570.00
## 5   5 2020      Pakistan  220892340     770.88
## 6   6 2020        Brazil  212559417 8358140.00

The C() construct add a constant for all the rows

We can be interested in having a dummy variable, i.e. a variable equal to 1 if some event happen or 0 otherwise. For example, we suppose to build a dummy variables with the most populated countries. In this we suppose countries with more than 100 million of people.

new_pop <- add(from = pop_data, formula =  ~ I(Population > 100000000))
head(new_pop)
##         Country Population       Area Var.1
## 1         China 1439323776 9388211.00     1
## 2         India 1380004385 2973190.00     1
## 3 United States  331002651 9147420.00     1
## 4     Indonesia  273523615 1811570.00     1
## 5      Pakistan  220892340     770.88     1
## 6        Brazil  212559417 8358140.00     1

or two variables one with the most populated countries and the other with the biggest extended countries

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)
##         Country Population       Area Var.1 Var.2
## 1         China 1439323776 9388211.00     1     1
## 2         India 1380004385 2973190.00     1     0
## 3 United States  331002651 9147420.00     1     1
## 4     Indonesia  273523615 1811570.00     1     0
## 5      Pakistan  220892340     770.88     1     0
## 6        Brazil  212559417 8358140.00     1     1

or a variable indicating the most populated and the biggest countries togheter

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
##         Country Population       Area Var.1
## 1         China 1439323776 9388211.00     1
## 2         India 1380004385 2973190.00     0
## 3 United States  331002651 9147420.00     1
## 4     Indonesia  273523615 1811570.00     0
## 5      Pakistan  220892340     770.88     0
## 6        Brazil  212559417 8358140.00     1

If we want obtain a boolean vector, as an interrogation, setting to TRUE the argument logic_convert the function will return a boolean vector

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000), 
               logic_convert = FALSE, as = "most_populated")
head(new_pop)
##         Country Population       Area most_populated
## 1         China 1439323776 9388211.00           TRUE
## 2         India 1380004385 2973190.00           TRUE
## 3 United States  331002651 9147420.00           TRUE
## 4     Indonesia  273523615 1811570.00           TRUE
## 5      Pakistan  220892340     770.88           TRUE
## 6        Brazil  212559417 8358140.00           TRUE

Transform variables

The transform() function modifies existing variables in the dataset.

Suppose we want to change the scale on the Population

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(Population/10000))
head(new_pop)
##         Country Population       Area
## 1         China  143932.38 9388211.00
## 2         India  138000.44 2973190.00
## 3 United States   33100.27 9147420.00
## 4     Indonesia   27352.36 1811570.00
## 5      Pakistan   22089.23     770.88
## 6        Brazil   21255.94 8358140.00

or we want a logarithmic transformation, renaming the variable

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(log(Population)), 
                     as = "log_pop")
head(new_pop)
##                                    
## 1         China 21.08744 9388211.00
## 2         India 21.04535 2973190.00
## 3 United States 19.61764 9147420.00
## 4     Indonesia 19.42690 1811570.00
## 5      Pakistan 19.21319     770.88
## 6        Brazil 19.17473 8358140.00

With a single formula multiple variables can be transformed, as showed before.

new_pop <- transform(from = pop_data, 
                     formula =  Population  + Area~ I(log()))
head(new_pop)
##         Country Population      Area
## 1         China   21.08744 16.054965
## 2         India   21.04535 14.905146
## 3 United States   19.61764 16.028982
## 4     Indonesia   19.42690 14.409704
## 5      Pakistan   19.21319  6.647533
## 6        Brazil   19.17473 15.938746

We can also transformed multiple variables with multiple transformations

new_pop <- transform(from = pop_data, 
                     formula =  Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)
##         Country Population      Area
## 1         China          1 16.054965
## 2         India          1 14.905146
## 3 United States          1 16.028982
## 4     Indonesia          1 14.409704
## 5      Pakistan          1  6.647533
## 6        Brazil          1 15.938746

Rename variables

The rename() function may be used to change names of existing variables, for example

new_pop <- rename(from = pop_data, formula =  Population  ~ pop )
head(new_pop)
##         Country        pop       Area
## 1         China 1439323776 9388211.00
## 2         India 1380004385 2973190.00
## 3 United States  331002651 9147420.00
## 4     Indonesia  273523615 1811570.00
## 5      Pakistan  220892340     770.88
## 6        Brazil  212559417 8358140.00

or multiple variables

new_pop <- rename(from = pop_data, formula =  Population  + Area ~ pop + area)
head(new_pop)
##         Country        pop       area
## 1         China 1439323776 9388211.00
## 2         India 1380004385 2973190.00
## 3 United States  331002651 9147420.00
## 4     Indonesia  273523615 1811570.00
## 5      Pakistan  220892340     770.88
## 6        Brazil  212559417 8358140.00

Select variables and rows

In the same fashion of SQL, the select() function first select the rows, given a statement, and then shows the select variables.

The first part of the formula are the columns to select, as the previous functions, and the right-hand side of the formula, the condition part, will select the rows.

Suppose to want to select only the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  . ~ I(Population > 100000000))
head(new_pop)
##         Country Population       Area
## 1         China 1439323776 9388211.00
## 2         India 1380004385 2973190.00
## 3 United States  331002651 9147420.00
## 4     Indonesia  273523615 1811570.00
## 5      Pakistan  220892340     770.88
## 6        Brazil  212559417 8358140.00

you can also add . to returns all variables instead of nothing.

We want only the name of the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  Country ~ I(Population > 100000000))
head(new_pop)
##         Country
## 1         China
## 2         India
## 3 United States
## 4     Indonesia
## 5      Pakistan
## 6        Brazil

We might be interest in only the most populated and biggest countries

new_pop <- select(from = pop_data, 
               formula = . ~ I(Population > 100000000 & Area > 8000000)) 
head(new_pop)
##         Country Population     Area
## 1         China 1439323776  9388211
## 3 United States  331002651  9147420
## 6        Brazil  212559417  8358140
## 9        Russia  145934462 16376870

or both

new_pop <- select(from = pop_data, 
               formula = ~ I(Population > 100000000 | Area > 8000000)) 
head(new_pop)
##         Country Population       Area
## 1         China 1439323776 9388211.00
## 2         India 1380004385 2973190.00
## 3 United States  331002651 9147420.00
## 4     Indonesia  273523615 1811570.00
## 5      Pakistan  220892340     770.88
## 6        Brazil  212559417 8358140.00

by selecting only the names

new_pop <- select(from = pop_data, 
               formula = Country ~ I(Population > 100000000 | Area > 8000000)) 
head(new_pop)
##         Country
## 1         China
## 2         India
## 3 United States
## 4     Indonesia
## 5      Pakistan
## 6        Brazil

Remove variables

The remove() function has the same syntax of select() function, but now the rows and columns will be removed.

new_pop <- remove(from = pop_data, 
                  formula =  Area ~ I(Population > 100000000))
head(new_pop)
##     Country Population
## 15  Vietnam   97338579
## 16    Congo   89561403
## 17   Turkey   84339067
## 18     Iran   83992949
## 19  Germany   83783942
## 20 Thailand   69799978

Handling Missing Values

In all the functions, except for rename, the argument na.remove will remove all the rows with missing values, after adding, transforming or selecting the rows.

The remove function, can be employed to remove all the rows with at least a missing observation,

data("airquality")
dt <- airquality

dt_new <- remove(from = dt,formula = .~., na.remove = TRUE)
head(dt_new)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

If we are interested to focus on the observation with missing values, the na.return = TRUE arguments of select function will return only the incomplete rows after the selection

dt_new <- select(from = dt,formula = ~ I(Temp > 50), na.return = TRUE)
head(dt_new)
##    Ozone Solar.R Wind Temp Month Day
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 10    NA     194  8.6   69     5  10
## 11     7      NA  6.9   74     5  11
## 25    NA      66 16.6   57     5  25
## 26    NA     266 14.9   58     5  26