The basic R language is classified as a general purpose language instead of a DSL. It has more functionality of a common-known computer language including mathematical operations and variables assignments.
Here are a few typical examples from R.
word <- "Hello, R"
number <- 28
multiply <- number * 2
quotient <- number / 2
logic <- multiply < quotient
for (i in 1:quotient){
if (i %% 2 == 0){
cat("Even number", i, "\n")
}
}
## Even number 2
## Even number 4
## Even number 6
## Even number 8
## Even number 10
## Even number 12
## Even number 14
is_even <- function(x){
if (is.numeric(x)){
return (x %% 2 == 0)
}
else{
cat("Object is not numeric.")
}
}
is_even(24)
## [1] TRUE
is_even(c(0, 21, 38, 29, 2.5, 4.1))
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
is_even("Hello")
## Object is not numeric.
Compared to the basic R language, the DSLs have limited functionality. Most of the DSLs do not have abilities of mathematical computing and rely on the base structures provided by their parent languages, such as the loops.
A Domain-Specific Language (DSL) is a programming language developed for a specific type of problem with a highly concentrated functionality. As a DSL does not have comprehensive functionality as its parent GPL, it can be designed as less complex and more accurate than the General-Purpose Languages (GPL), such as C, Python and Rust.
Professional people who are not programmers benefit from the straightforwardness and efficient automated functions of a DSL. The readability of DSLs greatly reduce the learning process for non-programmers. DSLs can provide a more literary expression such as the following example using the R Project for Statistical Computing https://www.r-project.org/
# Base-R-style
# df[which(x), ... , drop = FALSE]
# Dplyr-style
# filter(df, x)
Such readability eases the process of finding mistakes in projects compared to debug through the command lines. In 2018, Johann Thor Mogensen Ingibergsso and his team conducted an empirical study of readability of a DSL called ViSaL. The study contains experiments aiming to answer three research questions, including checking the connection between textual descriptions to program fragments, examining the debugging performance by two-way bug-finding test and figuring out the design advantages of ViSaL over C++ in bug-finding. All of these research questions receive quantitative marks and qualitative comments from the participants. Even though both the quantitative and qualitative results are not evident enough to address the readability advantages of ViSaL (DSL) over C++ (GPL), there are still certain improvements observed during the experiments while the targeted DSL is introduced as a completely unknown language versus C++. Ingibergsson et al. (2018)
mutate()
, select()
, filter()
and group_by()
are some of the commonly used ones in dplyr. Beyond these, the pipe
structure from magrittr connects the codes/words as a sentence, which are more easily interpratable by a human.Here is an example for dplyr
We would like to have some filters on the selected data, and then get some grouped statistics.
data <- select(mtcars, c(mpg, hp, cyl))
data <- filter(data, mpg >= 16 & hp >=100)
data <- group_by(data, cyl)
summarise(data, mean(mpg), mean(hp))
What if we use the pipe
(from magrittr
) structure provided by dplyr
?
mtcars %>%
select(mpg, hp, cyl) %>%
filter(mpg >= 16 & hp >= 100) %>%
group_by(cyl) %>%
summarise(mean(mpg), mean(hp))
Obviously, according to the pipe
structure, the dataset mtcars
is the main object to be manipulated in the code chunk. Then the pipe
symbol connects the select
, filter
, group_by
and summarise
operations sequentially.
These common functions in dplyr have been powerful to deal with dataframes and tibbles. With the pipe
structure, the readability of code is further improved. Rather than assigning multiple variables that is typical in programming scripts, the pipe
provides a more handscript style which neatly shows the logic in a series of operations.
Scenarios that require such a DSL (some similar packages or DSL: Distributional)
Part of the idea and the motivation comes from the case of flood prediction by Vincenzo Coia. In this project, the main work is to utilize the flow maximums for years and fit distributions of annual maximum to predict floods. In complicated cases with two peaks per year, a maximise
function for the values of each single distribution will be needed. In more complicated cases, several distributions will be mixed by their weights and a new mixture distribution comes up. Traditionally, these processes require a large amount of work and the possibility of having mistaken codes is increasing as well. Without functions, people will have to derive from theories and code up the formulas. Hence, there is a need to manipulate the distributions in an elegant and efficient way, while the flexibility and the precision should be kept.
distributional is an existing DSL which provides the functions to manipulate the statistic distributions and vectorize the context. It wraps the base-R
functions such as q
, p
, d
and r
. Some of the statistics such as mean
, variance
and median
can be computed with this package. distributional expands the interaction between users and their predictive models by making distributions compatible with the model outputs and visualizing the victorized distribution through ggplot2
.
However, distributional is still not well-modularized and it has limited compatibility with types of distributions. While most of the common-used distributions can be found in its functions, complex ones such as some parametric distributions or some mixed distributions may be difficult being realized or utilized through Distributional.
Here are a few examples of distributional which have potential improvements.
transformed()
distribution, which loses information of the applied operations.library(distributional)
dist_1 <- dist_normal(mu = 1, sigma = 3)
dist_2 <- dist_gamma(shape = 2, rate = 0.5)
dist_3 <- c(dist_1, dist_2)
dist_4 <- dist_poisson(1)
dist_5 <- dist_poisson(1) + 2
dist_6 <- c(0.5*dist_3, dist_4, dist_5)
dist_1
## <distribution[1]>
## [1] N(1, 9)
dist_2
## <distribution[1]>
## [1] Gamma(2, 0.5)
dist_3
## <distribution[2]>
## [1] N(1, 9) Gamma(2, 0.5)
dist_4
## <distribution[1]>
## [1] Pois(1)
dist_5
## <distribution[1]>
## [1] t(Pois(1))
dist_6
## <distribution[4]>
## [1] N(0.5, 2.2) t(Gamma(2, 0.5)) Pois(1) t(Pois(1))
# Note: inconsistant output format from cdf() etc.
# Note: Distributional relies on ggdist to make plots.
mean()
, quantile()
and cdf()
can accept a vector of input distributions, the actual outputs are sometimes inconsistent. Here are the examples.distributional::cdf(dist_1, 0.1)
## [1] 0.3820886
distributional::cdf(dist_1, c(0.1, 1, 4))
## [[1]]
## [1] 0.3820886 0.5000000 0.8413447
For single input (0.1
in example), the returned output will be a single number, however the output of a vector (c(0.1, 1, 4)
) becomes a list. This would be an issue when the users define their functions with the outputs from these vectorized ones. Users may have to deliberate the format for different input types, which cost more source and running time.
distplyr is introduced to provide a grammar for manipulating the probability distributions under a clean and user-friendly framework. It is expected to handle the distributions even beyond the normal ones, including more parametric distributions. distplyr also offers users with an easy access to a vast space of realistic probability distributions.
dst_*
to categorize different base distributions. distplyr integrated statistic metrics as the parameters into the complex functions. Users will be able to specify the metric types including survival
, quantile
and cdf
in the basic functions such as plot
.graft
, slice
and mix
functions. With invert
, shift
, multiply
, flip
and maximize
functions, users can even execute more operations on the single or mixture distributions.slice()
and graft()
operations.pipe
utility is kept and the “verb” can be embedded in the pipe
workflow as a series of manipulation on distributions.dst_*()
family of functions in distionary and provide the basic objects to be manipulate under distplyr.library(distplyr)
dst_1 <- dst_norm(mean = 1, variance = 3)
dst_2 <- dst_gamma(shape = 2, rate = 0.5)
dst_3 <- dst_pois(1)
shift
,multiply
, flip
, invert
and maximise
are basic operations that can be executed in distplyr. shift
,multiply
, flip
lead to the linear transformation on the input distribution. invert
returns the distributions of \(1/x\) when the input is the one of random variable \(x\). maximise
selects the distribution with the maximum value from independent draws of component distributions in a collection.
shift
, multiply
. We can change the start value of the random variable in a distribution. For example, to make the poisson variable that starts from 1 instead of 0. The output distribution keeps as a poisson one while a transformed factor shift: 1
is attached. multiply
can be also applied to a distribution object and the scale
will be stored.(dst_shift <- dst_3 + 1)
## shift dst
##
## components :
## $distribution
## pois parametric dst
##
## name :
## [1] "pois"
##
## $shift
## [1] 1
(dst_shift_multi <- dst_shift * 2)
## scale dst
##
## components :
## $distribution
## shift dst
##
## components :
## $distribution
## pois parametric dst
##
## name :
## [1] "pois"
##
## $shift
## [1] 1
##
##
## $scale
## [1] 2
(dst_flip <- flip(dst_3))
## negative dst
##
## distribution :
## pois parametric dst
##
## name :
## [1] "pois"
plot(dst_3, "cdf", xlim=c(0, 8))
plot(dst_shift, "cdf", add = TRUE, col = "blue")
plot(dst_shift_multi, "cdf", add = TRUE, col = "red")
flip
.(dst_flip <- flip(dst_3))
## negative dst
##
## distribution :
## pois parametric dst
##
## name :
## [1] "pois"
plot(dst_flip, xlim = c(-5, 5), col = "blue")
plot(dst_3, add = TRUE, col = "red")
invert
.# dst_gamma(shape = 2, rate = 0.5)
(dst_invert <- invert(dst_2))
## inverse dst
##
## distribution :
## gamma parametric dst
##
## name :
## [1] "gamma"
#plot(dst_invert, "survival")
#eval_survival(dst_invert, at = 2)
maximise
.plot(dst_1, "survival", xlim = c(0, 6))
plot(dst_3, "survival", add = TRUE, col = "red")
(dst_max = maximise(dst_1, dst_3))
## max dst
##
## components :
## $distributions
## $distributions[[1]]
## norm parametric dst
##
## name :
## [1] "norm"
##
## $distributions[[2]]
## pois parametric dst
##
## name :
## [1] "pois"
##
##
## $draws
## [1] 1 1
plot(dst_max, "survival")
slice
function removes probability to the left or the right of some breakpoint, conditioning the random variable to be the leftover part. There is an option to include the breakpoint probability in case the probability is non-zero.(dst_slice_l <- slice_left(dst_3, breakpoint = 2))
## slice_left dst
##
## distribution :
## pois parametric dst
##
## name :
## [1] "pois"
(dst_slice_r <- slice_right(dst_1, breakpoint = 2))
## slice_right dst
##
## distribution :
## norm parametric dst
##
## name :
## [1] "norm"
plot(dst_slice_l, "survival", xlim=c(-5, 7), col = "blue")
plot(dst_slice_r, "survival", add = TRUE, col = "red")
graft
function replaces the tail of a base distribution with a selected distribution. It will slice
the unneeded part besides the breakpoint and connect the new distribution to the removed side of breakpoint. The grafted distribution (the new one) will be normalized according to the replaced one (the old one). The output distribution of graft
is classified as “Graft distribution object,” which is a special type of mixture distribution.(dst_graft_l <- graft_left(dst_3, dst_1, breakpoint = 2))
## Mixture Distribution
##
## Components:
## # A tibble: 2 x 4
## distributions probs breakpoint include
## <named list> <dbl> <dbl> <lgl>
## 1 <slc_rght> 0.920 2 FALSE
## 2 <slic_lft> 0.0803 2 FALSE
plot(dst_graft_l, "survival", xlim = c(-1, 5))
mix
function creates a mixture distribution according to the weights of the sub-distributions. The output is classified as “mix” object. The component distributions and the type of random variable are stored in the object as well. Such mixture distribution has been automatic normalized according to the weights.(dst_mix <- mix(dst_slice_r, dst_3, weights = c(1, 3)))
## Mixture Distribution
##
## Components:
## # A tibble: 2 x 2
## distributions probs
## <named list> <dbl>
## 1 <slc_rght> 0.25
## 2 <pois> 0.75
plot(dst_mix, "survival")
Here is an application of distplyr in the case of flood prediction.
library(distplyr)
library(broom)
library(MASS)
The dms
data records the largest peak associated with snowmelt, atomospheric rivers (AR’s), and the overall maximum flow for each year.
head(dms)
With the package distionary, We can fit normal distributions on the three peaks and get a mix model between snomelt peaks and atomospheric rivers’ peaks. A gev distribution can be obtained based on the snowmelt data as well.
ar_norm <- dst_norm(mean = mean(dms$ar), variance = var(dms$ar))
sn_norm <- dst_norm(mean = mean(dms$snow), variance = var(dms$snow))
max_norm <- dst_norm(mean = mean(dms$max), variance = var(dms$max))
ar_sn_mix <- mix(ar_norm, sn_norm)
sn_gev <- fit_dst_gev(dms$snow)
Here is the survival plot of the maximized distribution over AR and SN peaks with the single component dists.
plot(ar_norm, "survival")
plot(sn_norm, "survival", add=TRUE, col="blue")
plot(maximise(ar_norm, sn_norm), "survival", add=TRUE, col="red")
We can also get the maximized distribution over two models of SN peaks, which are normal model and gev model. A mixture distribution can be obtained as well.
sn_norm_gev_max <- maximise(sn_norm, sn_gev)
sn_norm_max <- maximise(sn_norm)
sn_norm_gev_mix <- mix(sn_norm, sn_gev)
Here we get a survival plot of these distributions.
plot(sn_norm, "chf", col="red")
plot(sn_gev, "chf", col="blue", add=TRUE)
plot(sn_norm_gev_max, "chf", col="black", add=TRUE)
plot(sn_norm_gev_mix, "chf", col="green", add=TRUE)
ggplot(ams, aes(year)) +
geom_point(aes(y = flow), alpha = 0.25) +
theme_minimal() +
scale_y_log10(label_flow)
## Index flood
index_model <- loess(
log(flow) ~ year, data = ams, degree = 1, span = 0.3
)
index_floods <- broom::augment(index_model) %>%
transmute(year, index_flood = exp(.fitted))
ams_with_index <- ams %>%
left_join(index_floods, by = "year") %>%
mutate(ratio = flow / index_flood)
ggplot(ams_with_index, aes(year)) +
geom_point(aes(y = flow), alpha = 0.1) +
geom_line(aes(y = index_flood), colour = "blue") +
scale_y_log10(label_flow) +
ggtitle("Index Flood (geometric mean)") +
theme_minimal()
ggplot(ams_with_index, aes(year, ratio)) +
geom_point(alpha = 0.1) +
ggtitle("Dimensionless data") +
theme_minimal()
## Ratio GEV distribution
ratio_gev <- fit_dst_gev(ams_with_index$ratio)
Make a new column of distributions for each year by multiplying the ratio_gev
by the index_flood
estimate.
ams_models <- index_floods %>%
mutate(dist = map(index_flood, `*`, ratio_gev))
head(ams_models)
Here is an “enframe” workflow within the ams_models
tibble for evaluating \(95\%\) and \(5\%\) quantiles
ams_quantiles <- ams_models %>%
mutate(quantiles = map(dist, enframe_quantile, at = c(0.05, 0.95)))
Now we get the distributions for every year and the quantiles of these individual distributions.
head(ams_quantiles)
ams_quantiles$quantiles[1]
## [[1]]
## # A tibble: 2 x 2
## .arg quantile
## <dbl> <dbl>
## 1 0.05 75.7
## 2 0.95 166.
ams_quantiles %>% unnest(quantiles)
As shown in the examples, distplyr provides a compact and clean syntax for manipulating distributions. The base package distionary ensures the reliability of distribution objects used in distplyr. Such advantages would be more notable when the composed distributions become more complicated. The users will be able to reduce a large amount of time on developing the complex models and the code style makes the debug process accessible to non-programmers.
In the bundle of distplyr and distionary, we provide the verb and the noun to compose the language syntax. Now we show the potential expansibility and the promising application of the package bundle, and promote them as a practical domain-specific language to all analysts.