--- title: "An overview on dformula" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{An overview on dformula} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, fig.width = 6, fig.height = 5, message = FALSE, warning = FALSE) ``` ## Introduction **dformula** allows to easily modify, transform, add and extrapolate using the basic R formula. The operations on the data are the following: | Operation | Function| ------------|------------| | Add new variables | `add()` | | Transform existing variables| `transform()`| | Rename existing variables | `rename()` | | Selection rows and columns | `select()`| | Removing row and column | `remove`| ------------------------------------------ The formula is composed of two part: $$ column\_names \sim new\_variables $$ the right-hand side shows the names of the columns of the data and the left-hand side the transformation or the new variables to insert in the data. The `I()` is used in the right-hand side to indicate the type of transformation of the existing variable. In this function, we can insert logical statement, function implemented in R or user build function. For example: $$ var\_name_1 + var\_name_2 \sim I(log(var\_name_1)) + I(var\_name_2 == "something") $$ the two variable $var_name_1$ and $var_name_2$ are transformed in $log(var_name_1)$ or selected to be equal to $"something"$. In the same fashion of SQL, we have the `from` argument, the input data, and the `as` argument, the new name of the variables, after transformation, selection or addition.

The CRAN version can be loaded ```{r, message = FALSE} library('dformula') ``` or the development version from GitHub: ```{r, message = FALSE, eval=FALSE} remotes::install_github('serafinialessio/dformula') ``` The data are available in the package will be used in this overview ```{r} data("population_data") pop_data <- population_data ``` which describes the **Population** and **Area** of world countries ```{r} str(pop_data) ``` ## `Adding` variables The `add()` function inserts new variables starting from the existing columns in the data. Suppose we want to calculate population density and attach this to the original dataset ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population / Area)) head(new_pop) ``` and give a name to this new variable ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density") head(new_pop) ``` Multiple variable can be added with a single formula ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area))) head(new_pop) ``` and with new names ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)), as = c("pop_density", "log_area")) head(new_pop) ``` If we have one transformation applied to a group of variables, we do not specify the function multiple times ```{r} new_pop <- add(from = pop_data, formula = Population + Area ~ log()) head(new_pop) ``` and with new column names ```{r} new_pop <- add(from = pop_data, formula = Population + Area ~ log(), as = c("log_pop", "log_area")) head(new_pop) ``` Suppose we want to add a numerical **id** for the countries at the beginning of the dataset, using the `position` argument ```{r} new_pop <- add(from = pop_data, formula = ~ I(1:nrow(new_pop)), position = "left", as = "id") head(new_pop) ``` We can also add a constant variable. For example the year of the observation ```{r} new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left") head(new_pop) ``` or both ```{r} new_pop <- add(from = pop_data, formula = ~ I(1:nrow(new_pop)) + C("2020"), position = "left", as = c("ids", "year")) head(new_pop) ``` The `C()` construct add a constant for all the rows We can be interested in having a dummy variable, i.e. a variable equal to $1$ if some event happen or $0$ otherwise. For example, we suppose to build a dummy variables with the most populated countries. In this we suppose countries with more than $100$ million of people. ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000)) head(new_pop) ``` or two variables one with the most populated countries and the other with the biggest extended countries ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000) + I(Area > 8000000)) head(new_pop) ``` or a variable indicating the most populated and the biggest countries togheter ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000 & Area > 8000000)) head(new_pop) ``` If we want obtain a boolean vector, as an interrogation, setting to `TRUE` the argument `logic_convert` the function will return a boolean vector ```{r} new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000), logic_convert = FALSE, as = "most_populated") head(new_pop) ``` ## `Transform` variables The `transform()` function modifies existing variables in the dataset. Suppose we want to change the scale on the **Population** ```{r} new_pop <- transform(from = pop_data, formula = Population ~ I(Population/10000)) head(new_pop) ``` or we want a logarithmic transformation, renaming the variable ```{r} new_pop <- transform(from = pop_data, formula = Population ~ I(log(Population)), as = "log_pop") head(new_pop) ``` With a single formula multiple variables can be transformed, as showed before. ```{r} new_pop <- transform(from = pop_data, formula = Population + Area~ I(log())) head(new_pop) ``` We can also transformed multiple variables with multiple transformations ```{r} new_pop <- transform(from = pop_data, formula = Population + Area ~ I(Population > 100000000) + I(log(Area))) head(new_pop) ``` ## `Rename` variables The `rename()` function may be used to change names of existing variables, for example ```{r} new_pop <- rename(from = pop_data, formula = Population ~ pop ) head(new_pop) ``` or multiple variables ```{r} new_pop <- rename(from = pop_data, formula = Population + Area ~ pop + area) head(new_pop) ``` ## `Select` variables and rows In the same fashion of SQL, the `select()` function first select the rows, given a statement, and then shows the select variables. The first part of the formula are the columns to select, as the previous functions, and the right-hand side of the formula, the condition part, will select the rows. Suppose to want to select only the most populated countries ```{r} new_pop <- select(from = pop_data, formula = . ~ I(Population > 100000000)) head(new_pop) ``` you can also add `.` to returns all variables instead of nothing. We want only the name of the most populated countries ```{r} new_pop <- select(from = pop_data, formula = Country ~ I(Population > 100000000)) head(new_pop) ``` We might be interest in only the most populated and biggest countries ```{r} new_pop <- select(from = pop_data, formula = . ~ I(Population > 100000000 & Area > 8000000)) head(new_pop) ``` or both ```{r} new_pop <- select(from = pop_data, formula = ~ I(Population > 100000000 | Area > 8000000)) head(new_pop) ``` by selecting only the names ```{r} new_pop <- select(from = pop_data, formula = Country ~ I(Population > 100000000 | Area > 8000000)) head(new_pop) ``` ## `Remove` variables The `remove()` function has the same syntax of `select()` function, but now the rows and columns will be removed. ```{r} new_pop <- remove(from = pop_data, formula = Area ~ I(Population > 100000000)) head(new_pop) ``` ## Handling Missing Values In all the functions, except for `rename`, the argument `na.remove` will remove all the rows with missing values, after adding, transforming or selecting the rows. The `remove` function, can be employed to remove all the rows with at least a missing observation, ```{r} data("airquality") dt <- airquality dt_new <- remove(from = dt,formula = .~., na.remove = TRUE) head(dt_new) ``` If we are interested to focus on the observation with missing values, the `na.return = TRUE` arguments of `select` function will return only the incomplete rows after the selection ```{r} dt_new <- select(from = dt,formula = ~ I(Temp > 50), na.return = TRUE) head(dt_new) ```