---
title: "An overview on dformula"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{An overview on dformula}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE,
fig.width = 6, fig.height = 5,
message = FALSE, warning = FALSE)
```
## Introduction
**dformula** makes it easy to add, transform, rename, select and remove variables and rows of a data set using the basic R formula. The operations on the data are the following:

| Operation | Function |
|-----------|----------|
| Add new variables | `add()` |
| Transform existing variables | `transform()` |
| Rename existing variables | `rename()` |
| Select rows and columns | `select()` |
| Remove rows and columns | `remove()` |

The formula is composed of two parts:
$$
column\_names \sim new\_variables
$$
where the left-hand side contains the names of the columns of the data, and the right-hand side contains the transformations or the new variables to insert in the data.
The `I()` function is used in the right-hand side to specify the transformation of an existing variable. Inside `I()` we can write a logical statement, a function implemented in R or a user-defined function.
For example:
$$
var\_name_1 + var\_name_2 \sim I(log(var\_name_1)) + I(var\_name_2 == "something")
$$
The variable $var\_name_1$ is transformed into $log(var\_name_1)$, while $var\_name_2$ is tested for equality with `"something"`.
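To make the two sides concrete, the following sketch uses only base R to split such a formula into its parts. It is not a description of how **dformula** parses formulas internally, which this overview does not cover; the variable names are the placeholders from the example above, and `lhs_vars` and `rhs_terms` are illustrative names.

```r
# Base R only: inspect the two sides of a two-sided formula.
f <- var_name_1 + var_name_2 ~ I(log(var_name_1)) + I(var_name_2 == "something")

# Left-hand side: the column names involved.
lhs_vars <- all.vars(f[[2]])

# Right-hand side: one term per transformation or condition.
rhs_terms <- attr(terms(f), "term.labels")

print(lhs_vars)
print(rhs_terms)
```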
As in SQL, the `from` argument specifies the input data, and the `as` argument gives the new names of the variables after transformation, selection or addition.
The CRAN version can be loaded with
```{r, message = FALSE}
library('dformula')
```
or the development version can be installed from GitHub:
```{r, message = FALSE, eval=FALSE}
remotes::install_github('serafinialessio/dformula')
```
The `population_data` dataset, available in the package, is used throughout this overview
```{r}
data("population_data")
pop_data <- population_data
```
which describes the **Population** and **Area** of world countries
```{r}
str(pop_data)
```
## `Add` variables
The `add()` function inserts new variables starting from the existing columns in the data.
Suppose we want to calculate population density and attach this to the original dataset
```{r}
new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)
```
and give a name to this new variable
```{r}
new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)
```
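For comparison, here is a minimal base R sketch of the same idea: derive a column from two existing ones and append it on the right under an explicit name. The `toy` data frame and its values are made up for illustration and are not the package's `population_data`.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Append the derived column on the right under an explicit name.
toy$pop_density <- toy$Population / toy$Area
print(toy)
```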
Multiple variables can be added with a single formula
```{r}
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)
```
and with new names
```{r}
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)),
as = c("pop_density", "log_area"))
head(new_pop)
```
If the same transformation is applied to a group of variables, we do not need to specify the function multiple times
```{r}
new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)
```
and with new column names
```{r}
new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
as = c("log_pop", "log_area"))
head(new_pop)
```
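Applying one function to several columns corresponds roughly to an `lapply()` over those columns in base R. The sketch below assumes a made-up `toy` data frame and illustrative names (`cols`, `log_pop`, `log_area`); it only mirrors the intent of the formula and is not the package's implementation.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Apply the same function to a group of columns and store the
# results as new, explicitly named columns.
cols <- c("Population", "Area")
toy[c("log_pop", "log_area")] <- lapply(toy[cols], log)
print(toy)
```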
Suppose we want to add a numerical **id** for the countries at the beginning of the dataset, using the `position` argument
```{r}
new_pop <- add(from = pop_data,
               formula = ~ I(1:nrow(pop_data)),
               position = "left", as = "id")
head(new_pop)
```
We can also add a constant variable, for example the year of the observation
```{r}
new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)
```
or both
```{r}
new_pop <- add(from = pop_data,
               formula = ~ I(1:nrow(pop_data)) + C("2020"),
               position = "left", as = c("ids", "year"))
head(new_pop)
```
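In base R terms, placing an id and a constant at the beginning of the data resembles a `cbind()` with the new columns listed first. This is only a sketch on a made-up `toy` data frame; `seq_len(nrow(toy))` is used instead of `1:nrow(toy)` because it also behaves sensibly for a data frame with zero rows.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# New columns listed first end up on the left; the constant is
# recycled to every row.
toy_left <- cbind(id = seq_len(nrow(toy)), year = "2020", toy)
print(toy_left)
```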
The `C()` construct adds a constant for all the rows.

We may be interested in a dummy variable, i.e. a variable equal to $1$ if some event happens and $0$ otherwise.
For example, suppose we want to build a dummy variable that identifies the most populated countries, defined here as the countries with more than $100$ million people.
```{r}
new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000))
head(new_pop)
```
or two variables, one identifying the most populated countries and the other the countries with the largest area
```{r}
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)
```
or a single variable identifying the countries that are both among the most populated and among the biggest
```{r}
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
```
If we want to obtain a boolean vector instead of a $0$/$1$ dummy, setting the argument `logic_convert` to `FALSE` makes the function return `TRUE`/`FALSE` values
```{r}
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000),
logic_convert = FALSE, as = "most_populated")
head(new_pop)
```
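The difference between a $0$/$1$ dummy and a boolean vector can be sketched in base R as follows. The example assumes, as the calls above suggest, that `logic_convert` only controls whether the logical result is converted to integers; the `toy` data frame and the names `dummy_01` and `dummy_lgl` are made up for illustration.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# The comparison itself yields a logical vector ...
dummy_lgl <- toy$Population > 1e8

# ... which can be converted to a 0/1 dummy.
dummy_01 <- as.integer(dummy_lgl)

print(dummy_lgl)
print(dummy_01)
```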
## `Transform` variables
The `transform()` function modifies existing variables in the dataset.
Suppose we want to change the scale of the **Population**
```{r}
new_pop <- transform(from = pop_data,
formula = Population ~ I(Population/10000))
head(new_pop)
```
or we want a logarithmic transformation, renaming the variable
```{r}
new_pop <- transform(from = pop_data,
formula = Population ~ I(log(Population)),
as = "log_pop")
head(new_pop)
```
Multiple variables can be transformed with a single formula, as shown before.
```{r}
new_pop <- transform(from = pop_data,
                     formula = Population + Area ~ I(log()))
head(new_pop)
```
We can also transform multiple variables with multiple transformations
```{r}
new_pop <- transform(from = pop_data,
formula = Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)
```
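Base R also provides a function named `transform()`, with a different interface, so once **dformula** is attached the unqualified name refers to whichever version was attached last. The sketch below calls `base::transform()` explicitly on a made-up `toy` data frame to show the in-place idea; it is a rough analogue, not the package's implementation.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Overwrite existing columns in place; the column names stay the same.
toy_scaled <- base::transform(
  toy,
  Population = Population / 10000,
  Area       = log(Area)
)
print(toy_scaled)
```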
## `Rename` variables
The `rename()` function may be used to change names of existing variables, for example
```{r}
new_pop <- rename(from = pop_data, formula = Population ~ pop)
head(new_pop)
```
or multiple variables
```{r}
new_pop <- rename(from = pop_data, formula = Population + Area ~ pop + area)
head(new_pop)
```
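The same old-to-new mapping can be written in base R by matching the old names against `names()`. This is a sketch on a made-up `toy` data frame; `old` and `new` are illustrative names.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Map old names to new names position by position.
old <- c("Population", "Area")
new <- c("pop", "area")
names(toy)[match(old, names(toy))] <- new
print(names(toy))
```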
## `Select` variables and rows
As in SQL, the `select()` function first selects the rows that satisfy a condition and then returns the selected variables.
The left-hand side of the formula contains the columns to select, as in the previous functions, and the right-hand side contains the condition used to select the rows.
Suppose we want to select only the most populated countries
```{r}
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000))
head(new_pop)
```
The `.` on the left-hand side returns all the variables; it can be used in place of an empty left-hand side.
Suppose we want only the names of the most populated countries
```{r}
new_pop <- select(from = pop_data,
formula = Country ~ I(Population > 100000000))
head(new_pop)
```
We might be interested in only the countries that are both among the most populated and among the biggest
```{r}
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
```
or in the countries that satisfy at least one of the two conditions
```{r}
new_pop <- select(from = pop_data,
formula = ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)
```
or the same selection, returning only the names
```{r}
new_pop <- select(from = pop_data,
formula = Country ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)
```
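Row conditions combined with `&` and `|`, together with a column selection, correspond to ordinary logical indexing in base R. The sketch below uses a made-up `toy` data frame and illustrative names (`both`, `either_names`); `drop = FALSE` keeps the result a data frame when a single column is selected.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Rows satisfying both conditions, all columns.
both <- toy[toy$Population > 1e8 & toy$Area > 8e6, ]

# Rows satisfying at least one condition, only the Country column.
either_names <- toy[toy$Population > 1e8 | toy$Area > 8e6,
                    "Country", drop = FALSE]

print(both)
print(either_names)
```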
## `Remove` variables
The `remove()` function has the same syntax as the `select()` function, but the specified rows and columns are removed instead of returned.
```{r}
new_pop <- remove(from = pop_data,
formula = Area ~ I(Population > 100000000))
head(new_pop)
```
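Under the reading that the rows matching the condition and the columns named on the left-hand side are both dropped, a base R analogue negates the condition and excludes the column. This is a sketch on a made-up `toy` data frame, and that reading of the semantics is an assumption based on the description above. Note also that base R has its own `remove()` (a synonym of `rm()`), so the unqualified name depends on which package was attached last.

```r
# Hypothetical toy data, not the package's population_data.
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Population = c(2e8, 5e7, 1.5e8),
  Area       = c(9e6, 3e5, 1e6)
)

# Keep the rows that do NOT satisfy the condition and every
# column except Area.
kept <- toy[!(toy$Population > 1e8),
            setdiff(names(toy), "Area"), drop = FALSE]
print(kept)
```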
## Handling Missing Values
In all the functions except `rename()`, setting the argument `na.remove = TRUE` removes all the rows with missing values after adding, transforming or selecting.
The `remove()` function can be employed to remove all the rows with at least one missing observation,
```{r}
data("airquality")
dt <- airquality
dt_new <- remove(from = dt, formula = . ~ ., na.remove = TRUE)
head(dt_new)
```
To focus instead on the observations with missing values, the `na.return = TRUE` argument of the `select()` function returns only the incomplete rows after the selection
```{r}
dt_new <- select(from = dt, formula = ~ I(Temp > 50), na.return = TRUE)
head(dt_new)
```
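Both ideas, dropping incomplete rows and keeping only the incomplete ones after a selection, can be sketched in base R with `complete.cases()` on the same `airquality` data. This is an analogue for comparison, not the package's implementation; `complete_rows`, `selected` and `incomplete_rows` are illustrative names.

```r
data("airquality")

# Drop every row with at least one missing value.
complete_rows <- airquality[complete.cases(airquality), ]

# Select rows first, then keep only those with a missing value.
selected <- airquality[airquality$Temp > 50, ]
incomplete_rows <- selected[!complete.cases(selected), ]

print(head(complete_rows))
print(head(incomplete_rows))
```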