Intro to the DSLs distplyr & distionary

Introduction to a non-DSL (GPL)

GPL R

The basic R language is classified as a general-purpose language rather than a DSL. It has the full functionality of a commonly known computer language, including mathematical operations and variable assignment.

Here are a few typical examples from R.

  • Variable assignments
word <- "Hello, R"
number <- 28
multiply <- number * 2
quotient <- number / 2
logic <- multiply < quotient
  • Loops
for (i in 1:quotient){
  if (i %% 2 == 0){
    cat("Even number", i, "\n")
  }
}
## Even number 2 
## Even number 4 
## Even number 6 
## Even number 8 
## Even number 10 
## Even number 12 
## Even number 14
  • Define functions
is_even <- function(x) {
  if (is.numeric(x)) {
    return(x %% 2 == 0)
  } else {
    cat("Object is not numeric.")
  }
}

is_even(24)
## [1] TRUE
is_even(c(0, 21, 38, 29, 2.5, 4.1))
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
is_even("Hello")
## Object is not numeric.

Compared with the basic R language, DSLs have limited functionality. Most DSLs do not have general mathematical computing abilities and rely on the base structures provided by their parent languages, such as loops.
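
As a small illustration (using dplyr, the DSL introduced in detail below): dplyr supplies no looping construct of its own, so repeating a dplyr operation falls back on base R's for loop.

library(dplyr)
# Count the cars in mtcars within each cylinder class by repeating a
# dplyr filter() call inside a base-R for loop.
for (n_cyl in c(4, 6, 8)) {
  cat(n_cyl, "cylinders:", nrow(filter(mtcars, cyl == n_cyl)), "cars\n")
}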

Introduction to DSLs

A Domain-Specific Language (DSL) is a programming language developed for a specific type of problem, with highly concentrated functionality. Because a DSL does not need the comprehensive functionality of its parent GPL, it can be designed to be less complex and more precise than General-Purpose Languages (GPLs) such as C, Python and Rust.

Professionals who are not programmers benefit from the straightforwardness and efficient automated functions of a DSL. The readability of DSLs greatly reduces the learning curve for non-programmers. DSLs can provide a more literary expression, as in the following example using the R Project for Statistical Computing (https://www.r-project.org/):

library(dplyr)
# Base-R-style: keep the rows of a data frame that satisfy a condition
mtcars[which(mtcars$mpg > 25), , drop = FALSE]
# dplyr-style: the same subset, reading almost like a sentence
filter(mtcars, mpg > 25)

Such readability eases the process of finding mistakes in a project, compared to debugging through command lines. In 2018, Johann Thor Mogensen Ingibergsson and his team conducted an empirical study of the readability of a DSL called ViSaL. The study contains experiments aimed at three research questions: checking the connection between textual descriptions and program fragments, examining debugging performance with a two-way bug-finding test, and identifying the design advantages of ViSaL over C++ in bug finding. All of these research questions received quantitative marks and qualitative comments from the participants. Even though neither the quantitative nor the qualitative results were strong enough to establish the readability advantage of ViSaL (a DSL) over C++ (a GPL), certain improvements were still observed during the experiments, even though the DSL was introduced to participants as a completely unknown language, unlike C++ (Ingibergsson et al. 2018).

Applications and Examples of DSL/GPL

dplyr

  • dplyr is a grammar of data manipulation composed of a set of verbs. Functions such as mutate(), select(), filter() and group_by() are some of the most commonly used ones in dplyr. Beyond these, the pipe structure from magrittr connects the code/words into a sentence, which is more easily interpretable by a human.

Here is an example of dplyr.

We would like to filter the selected data and then compute some grouped statistics.

data <- select(mtcars, c(mpg, hp, cyl))
data <- filter(data, mpg >= 16 & hp >=100)
data <- group_by(data, cyl)
summarise(data, mean(mpg), mean(hp))

What if we use the pipe (from magrittr) structure provided by dplyr?

mtcars %>% 
  select(mpg, hp, cyl) %>%
  filter(mpg >= 16 & hp >= 100) %>%
  group_by(cyl) %>%
  summarise(mean(mpg), mean(hp))

According to the pipe structure, the dataset mtcars is obviously the main object being manipulated in the code chunk. The pipe symbol then connects the select, filter, group_by and summarise operations sequentially.

These common functions in dplyr are powerful for dealing with data frames and tibbles. With the pipe structure, the readability of the code is further improved. Rather than assigning multiple intermediate variables, as is typical in programming scripts, the pipe provides a more narrative style that neatly shows the logic of a series of operations.

Intro to distplyr

Background

The Need

Scenarios that require such a DSL (a similar package/DSL: distributional)

Flood prediction

Part of the idea and motivation comes from the flood-prediction case by Vincenzo Coia. In that project, the main work is to use the annual flow maxima and fit distributions of the annual maximum to predict floods. In complicated cases with two peaks per year, a maximise operation over the values of each single distribution is needed. In more complicated cases, several distributions are mixed by weights, producing a new mixture distribution. Traditionally, these processes require a large amount of work, and the chance of coding mistakes grows as well: without dedicated functions, people have to derive the formulas from theory and code them up by hand. Hence, there is a need to manipulate distributions in an elegant and efficient way, while keeping flexibility and precision.

distributional

distributional is an existing DSL which provides functions to manipulate statistical distributions and to vectorize them. It wraps the base-R q, p, d and r functions, and statistics such as the mean, variance and median can be computed with the package. distributional expands the interaction between users and their predictive models by making distributions compatible with model outputs and by visualizing the vectorized distributions through ggplot2.
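
For instance, a few of the wrapped statistics and evaluations look like the following (a brief illustration only; the exact printed output may differ across distributional versions):

library(distributional)
d <- dist_normal(mu = 1, sigma = 3)
mean(d)           # expectation of the distribution
variance(d)       # variance
quantile(d, 0.5)  # median, wrapping the base-R q-function
cdf(d, 2)         # P(X <= 2), wrapping the base-R p-function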

However, distributional is still not well modularized, and it has limited compatibility with some types of distributions. While most commonly used distributions can be found among its functions, complex ones, such as certain parametric distributions or mixture distributions, may be difficult to realize or utilize through distributional.

Here are a few examples where distributional has room for improvement.

  1. distributional does not provide a feature for creating mixture distributions and will only create a vector of multiple distributions. Mathematical operations on distributions result in a transformed() distribution, which loses information about the applied operations.
library(distributional)
dist_1 <- dist_normal(mu = 1, sigma = 3)
dist_2 <- dist_gamma(shape = 2, rate = 0.5)
dist_3 <- c(dist_1, dist_2)
dist_4 <- dist_poisson(1) 
dist_5 <- dist_poisson(1) + 2
dist_6 <- c(0.5*dist_3, dist_4, dist_5)

dist_1
## <distribution[1]>
## [1] N(1, 9)
dist_2
## <distribution[1]>
## [1] Gamma(2, 0.5)
dist_3
## <distribution[2]>
## [1] N(1, 9)       Gamma(2, 0.5)
dist_4
## <distribution[1]>
## [1] Pois(1)
dist_5
## <distribution[1]>
## [1] t(Pois(1))
dist_6
## <distribution[4]>
## [1] N(0.5, 2.2)      t(Gamma(2, 0.5)) Pois(1)          t(Pois(1))
# Note: inconsistent output format from cdf() etc. (see point 2 below)

# Note: distributional relies on ggdist to make plots.
  2. Meanwhile, though distributional is well vectorized, meaning that functions such as mean(), quantile() and cdf() can accept a vector of input distributions, the actual outputs are sometimes inconsistent. Here are the examples.
distributional::cdf(dist_1, 0.1)
## [1] 0.3820886
distributional::cdf(dist_1, c(0.1, 1, 4))
## [[1]]
## [1] 0.3820886 0.5000000 0.8413447

For a single input (0.1 in the example), the returned output is a single number; however, the output for a vector input (c(0.1, 1, 4)) becomes a list. This is an issue when users define their own functions on top of the outputs of these vectorized ones: they may have to handle the format separately for each input type, which costs extra code and running time.
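
For instance, a user-side workaround might look like the minimal sketch below (the helper name cdf_vector is hypothetical), coercing the output to a plain numeric vector regardless of the input length:

# Hypothetical helper: normalize cdf() output to a plain numeric vector
cdf_vector <- function(dist, at) {
  out <- distributional::cdf(dist, at)
  if (is.list(out)) out <- unlist(out)  # vector input comes back as a list
  out
}

cdf_vector(dist_1, 0.1)           # length-1 numeric
cdf_vector(dist_1, c(0.1, 1, 4))  # length-3 numeric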

Purpose of distplyr

distplyr is introduced to provide a grammar for manipulating probability distributions within a clean and user-friendly framework. It is expected to handle distributions beyond the usual ones, including more parametric distributions. distplyr also offers users easy access to a vast space of realistic probability distributions.

A brief overview of the advantages of distplyr

  • Modular functions for distributions
    • distplyr uses a similar dst_* scheme to categorize the different base distributions, and it integrates statistical metrics as parameters of the higher-level functions. Users are able to specify the metric type, including survival, quantile and cdf, in basic functions such as plot.
  • Expanded compatibility with distributions
    • distplyr works with both continuous and discrete random variables, and even with mixed variables that are partly discrete and partly continuous. This means that even for a discrete distribution, distplyr can still provide continuous fitted tails for analysis.
    • distplyr keeps track of the discrete values in its distribution objects. Users can access these discrete values for evaluating quantiles or for other purposes.
    • The functions in distplyr are mainly “verbs”, and the base distionary package offers a framework of distribution objects, which become the “nouns” in the workflow.
    • distplyr manipulates and creates distributions with the graft, slice and mix functions. With the invert, shift, multiply, flip and maximise functions, users can execute even more operations on single or mixture distributions.
    • Corresponding properties, including distribution names and types, mixture components and mixture weights, are stored in the output objects and can be accessed by users. The original properties of the component distributions, such as mean and variance, do not change after the slice() and graft() operations.

Structure of distplyr

The general framework of distplyr

  • Most of the functions in distplyr are composed from base-R functions, and a few of them use hardcoded parts to complement the base functions. The structure of distplyr is built around the core “verb” functions, with the necessary adaptations for a variety of distributions.
  • distplyr keeps a syntax style similar to dplyr. The pipe utility is kept, and the “verbs” can be embedded in a pipe workflow as a series of manipulations on distributions (see the sketch after this list).
  • The distribution type can be selected as a parameter of the distplyr “verb” functions. These types are connected to the dst_*() family of functions in distionary and provide the basic objects to be manipulated by distplyr.
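
As a rough illustration of that pipe workflow (a sketch only, assuming magrittr or dplyr is attached for %>%, and reusing verbs demonstrated later in this report):

library(distplyr)
library(magrittr)

# Start from a "noun" (a distribution object) and chain "verbs" with the pipe.
dst_norm(mean = 1, variance = 3) %>%
  slice_left(breakpoint = 0) %>%  # drop the probability to the left of 0
  flip() %>%                      # negate the random variable
  plot("cdf")                     # plot the cdf of the resulting distribution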

The theories and the realized functions in distplyr

library(distplyr)
dst_1 <- dst_norm(mean = 1, variance = 3)
dst_2 <- dst_gamma(shape = 2, rate = 0.5)
dst_3 <- dst_pois(1)
  • shift, multiply, flip, invert and maximise are basic operations that can be executed in distplyr. shift, multiply and flip lead to linear transformations of the input distribution. invert returns the distribution of \(1/x\) when the input is the distribution of the random variable \(x\). maximise gives the distribution of the maximum of independent draws from the component distributions in a collection.

    • shift, multiply. We can change the starting value of the random variable in a distribution, for example, to make a Poisson variable start from 1 instead of 0. The output distribution remains a Poisson one, while a transformation factor shift: 1 is attached. multiply can also be applied to a distribution object, and the scale is stored.
(dst_shift <- dst_3 + 1)
## shift dst
## 
##  components :
## $distribution
## pois parametric dst
## 
##  name :
## [1] "pois"
## 
## $shift
## [1] 1
(dst_shift_multi <- dst_shift * 2)
## scale dst
## 
##  components :
## $distribution
## shift dst
## 
##  components :
## $distribution
## pois parametric dst
## 
##  name :
## [1] "pois"
## 
## $shift
## [1] 1
## 
## 
## $scale
## [1] 2
plot(dst_3, "cdf", xlim=c(0, 8))
plot(dst_shift, "cdf", add = TRUE, col = "blue")
plot(dst_shift_multi, "cdf", add = TRUE, col = "red")

  • flip.
(dst_flip <- flip(dst_3))
## negative dst
## 
##  distribution :
## pois parametric dst
## 
##  name :
## [1] "pois"
plot(dst_flip, xlim = c(-5, 5), col = "blue")
plot(dst_3, add = TRUE, col = "red")

  • invert.
# dst_gamma(shape = 2, rate = 0.5)
(dst_invert <- invert(dst_2))
## inverse dst
## 
##  distribution :
## gamma parametric dst
## 
##  name :
## [1] "gamma"
#plot(dst_invert, "survival")
#eval_survival(dst_invert, at = 2)
  • maximise.
plot(dst_1, "survival", xlim = c(0, 6))

plot(dst_3, "survival", add = TRUE, col = "red")

(dst_max <- maximise(dst_1, dst_3))
## max dst
## 
##  components :
## $distributions
## $distributions[[1]]
## norm parametric dst
## 
##  name :
## [1] "norm"
## 
## $distributions[[2]]
## pois parametric dst
## 
##  name :
## [1] "pois"
## 
## 
## $draws
## [1] 1 1
plot(dst_max, "survival")

  • The slice functions remove probability to the left or the right of some breakpoint, conditioning the random variable on the leftover part. There is an option to include the breakpoint in case its probability is non-zero.
(dst_slice_l <- slice_left(dst_3, breakpoint = 2))
## slice_left dst
## 
##  distribution :
## pois parametric dst
## 
##  name :
## [1] "pois"
(dst_slice_r <- slice_right(dst_1, breakpoint = 2))
## slice_right dst
## 
##  distribution :
## norm parametric dst
## 
##  name :
## [1] "norm"
plot(dst_slice_l, "survival", xlim=c(-5, 7), col = "blue")
plot(dst_slice_r, "survival", add = TRUE, col = "red")

  • The graft function replaces the tail of a base distribution with a selected distribution. It slices off the unneeded part beyond the breakpoint and connects the new distribution on the removed side of the breakpoint. The grafted distribution (the new one) is normalized according to the replaced one (the old one). The output of graft is classified as a “graft distribution object”, which is a special type of mixture distribution.
(dst_graft_l <- graft_left(dst_3, dst_1, breakpoint = 2))
## Mixture Distribution
## 
## Components: 
## # A tibble: 2 x 4
##   distributions  probs breakpoint include
##   <named list>   <dbl>      <dbl> <lgl>  
## 1 <slc_rght>    0.920           2 FALSE  
## 2 <slic_lft>    0.0803          2 FALSE
plot(dst_graft_l, "survival", xlim = c(-1, 5))

  • The mix function creates a mixture distribution according to the weights of the sub-distributions. The output is classified as a “mix” object. The component distributions and the type of random variable are stored in the object as well. Such a mixture distribution is automatically normalized according to the weights.
(dst_mix <- mix(dst_slice_r, dst_3, weights = c(1, 3)))
## Mixture Distribution
## 
## Components: 
## # A tibble: 2 x 2
##   distributions probs
##   <named list>  <dbl>
## 1 <slc_rght>     0.25
## 2 <pois>         0.75
plot(dst_mix, "survival")

  • More advanced operations could be considered for distplyr, for example the popular convolution operation. However, the current version of distplyr does not include functions that realize convolutions. There are a few considerations behind delaying or omitting this feature:
    • A function realizing convolution operations would be too complicated for the package. It needs many more resources to be developed and tested; currently it is too big a goal.
    • Running convolution operations demands more computing performance than the other functions in the current distplyr. As R is slow at executing numerical methods, it would be better to code the convolution functions in C++. In distplyr, we try to avoid numerical methods; they only appear in the quantile algorithms, as part of functions such as inverting a cdf.
    • There already exists a package, distr, that computes convolutions (see the sketch below). distr is another object-oriented take on manipulating distributions, but it carries a heavy S4 structure compared with the S3 programming of distplyr. It is also a complicated package for most users and requires a thorough reading of its documentation.
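
For completeness, here is a brief sketch of the distr alternative mentioned above (an illustration only, not part of distplyr): in distr, arithmetic on its S4 distribution objects performs the convolution.

library(distr)

X <- Norm(mean = 0, sd = 1)  # S4 distribution objects from distr
Y <- Unif(Min = 0, Max = 1)
Z <- X + Y                   # distribution of the sum, i.e. the convolution
p(Z)(1)                      # cdf of the convolved distribution evaluated at 1
r(Z)(5)                      # five random draws from it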

Examples

Here is an application of distplyr in the case of flood prediction.

library(distplyr)
library(distionary)
library(broom)
library(MASS)
library(tidyverse)  # dplyr, purrr, tidyr and ggplot2 are used below

The dms data records the largest peak associated with snowmelt, atmospheric rivers (ARs), and the overall maximum flow for each year.

head(dms)

With the package distionary, we can fit normal distributions to the three peak series and obtain a mixture model of the snowmelt peaks and the atmospheric-river peaks. A GEV distribution can be fitted to the snowmelt data as well.

ar_norm <- dst_norm(mean = mean(dms$ar), variance = var(dms$ar))
sn_norm <- dst_norm(mean = mean(dms$snow), variance = var(dms$snow))
max_norm <- dst_norm(mean = mean(dms$max), variance = var(dms$max))
ar_sn_mix <- mix(ar_norm, sn_norm)
sn_gev <- fit_dst_gev(dms$snow)

Here is the survival plot of the maximized distribution over the AR and SN peaks, together with the single-component distributions.

plot(ar_norm, "survival")
plot(sn_norm, "survival", add=TRUE, col="blue")
plot(maximise(ar_norm, sn_norm), "survival", add=TRUE, col="red")

We can also get the maximized distribution over two models of the SN peaks, a normal model and a GEV model. A mixture distribution can be obtained as well.

sn_norm_gev_max <- maximise(sn_norm, sn_gev)
sn_norm_max <- maximise(sn_norm)
sn_norm_gev_mix <- mix(sn_norm, sn_gev)

Here we plot the cumulative hazard functions (“chf”) of these distributions.

plot(sn_norm, "chf", col="red")
plot(sn_gev, "chf", col="blue", add=TRUE)
plot(sn_norm_gev_max, "chf", col="black", add=TRUE)
plot(sn_norm_gev_mix, "chf", col="green", add=TRUE)

## Plot the annual maximum series (ams: year and flow)
ggplot(ams, aes(year)) +
  geom_point(aes(y = flow), alpha = 0.25) +
  theme_minimal() +
  scale_y_log10(labels = label_flow)  # label_flow: a labelling function defined elsewhere

## Index flood
index_model <- loess(
  log(flow) ~ year, data = ams, degree = 1, span = 0.3
)
index_floods <- broom::augment(index_model) %>% 
  transmute(year, index_flood = exp(.fitted))

ams_with_index <- ams %>% 
  left_join(index_floods, by = "year") %>% 
  mutate(ratio = flow / index_flood)

ggplot(ams_with_index, aes(year)) +
  geom_point(aes(y = flow), alpha = 0.1) +
  geom_line(aes(y = index_flood), colour = "blue") +
  scale_y_log10(labels = label_flow) +
  ggtitle("Index Flood (geometric mean)") +
  theme_minimal()

ggplot(ams_with_index, aes(year, ratio)) +
  geom_point(alpha = 0.1) +
  ggtitle("Dimensionless data") +
  theme_minimal()

## Ratio GEV distribution
ratio_gev <- fit_dst_gev(ams_with_index$ratio)

Make a new column of distributions for each year by multiplying the ratio_gev by the index_flood estimate.

ams_models <- index_floods %>% 
  mutate(dist = map(index_flood, `*`, ratio_gev))
head(ams_models)

Here is an “enframe” workflow within the ams_models tibble for evaluating the \(5\%\) and \(95\%\) quantiles.

ams_quantiles <- ams_models %>% 
  mutate(quantiles = map(dist, enframe_quantile, at = c(0.05, 0.95)))

Now we get the distributions for every year and the quantiles of these individual distributions.

head(ams_quantiles)
ams_quantiles$quantiles[1]
## [[1]]
## # A tibble: 2 x 2
##    .arg quantile
##   <dbl>    <dbl>
## 1  0.05     75.7
## 2  0.95    166.
ams_quantiles %>% unnest(quantiles)

Conclusion

As shown in the examples, distplyr provides a compact and clean syntax for manipulating distributions, and the base package distionary ensures the reliability of the distribution objects used by distplyr. These advantages become more notable as the composed distributions grow more complicated. Users will be able to save a large amount of time when developing complex models, and the code style makes the debugging process accessible to non-programmers.

In the bundle of distplyr and distionary, we provide the verbs and the nouns that compose the language's syntax. We have shown the extensibility and the promising applications of the package bundle, and we promote it to all analysts as a practical domain-specific language.

Ingibergsson, Johann Thor Mogensen, Stefan Hanenberg, Joshua Sunshine, and Ulrik Pagh Schultz. 2018. “Experience Report: Studying the Readability of a Domain Specific Language.” Proceedings of the 33rd Annual ACM Symposium on Applied Computing.