Introducing famish 0.2.0
In the release of distplyr, I wrote about needing to modify probability distributions to make them realistic for applications. Once distplyr made that possible along with distionary, I still needed to tune those distributions based on data.
That means estimation. It also means being able to assess how well the fitted distributions actually match the data, especially at the quantiles that matter for decisions.
And because that demand came directly from projects, we needed the code …yesterday. So I wrote a few quick functions that would later become the famish package.
famish and “families”
I chose the name “famish” because it suggests reducing a family of distributions down to either a smaller family, or all the way to a fully resolved distribution.
That naming also points to an important conceptual difference. A distribution has one particular CDF, one particular mean, and so on. A family of distributions contains multiple distributions, indexed by parameters: the Normal family, for example, can be indexed by its mean and standard deviation. The famish package is about taking that family-level view seriously.
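To make the distinction concrete, here is a minimal sketch in base R. It does not use famish or distionary; the names `normal_family` and `mean_zero_family` are hypothetical and exist only for this illustration. A family is modelled as a function from parameters to a distribution, and a distribution is modelled as a small list holding its CDF and mean.

```r
# Hypothetical illustration only: a "family" as a function of its parameters.
normal_family <- function(mean, sd) {
  list(
    cdf  = function(q) pnorm(q, mean = mean, sd = sd),
    mean = mean
  )
}

# Supplying every parameter resolves the family to a single distribution,
# which has one particular CDF and one particular mean.
std_normal <- normal_family(mean = 0, sd = 1)
std_normal$cdf(0)   # 0.5
std_normal$mean     # 0

# Fixing only the mean leaves a smaller family, still indexed by sd.
mean_zero_family <- function(sd) normal_family(mean = 0, sd = sd)
wide <- mean_zero_family(sd = 10)
wide$cdf(0)         # 0.5, because every member has mean (and median) 0
```

Fixing some parameters and leaving others free is the "reducing a family" idea in miniature: each step narrows the set of candidate distributions.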
The first CRAN release of famish is now available as a core probaverse package, and it can fit most of the distribution families currently included in distionary.
How it works
Here is a quick look at the workflow, fitting a couple of distributions to a dataset.
library(distionary)
library(famish)
x <- c(4.0, 2.7, 3.5, 3.2, 7.1, 3.1, 2.5, 5.0, 2.3, 4.5, 3.0, 3.8)
gev <- fit_dst_gev(x)
gev
## Generalised Extreme Value distribution (continuous)
## --Parameters--
##  location     scale     shape
## 3.0658476 0.7426435 0.2699160
lp3 <- fit_dst("lp3", x, method = "lmom-log")
lp3
## Log Pearson Type III distribution (continuous)
## --Parameters--
##   meanlog     sdlog      skew
## 1.2652738 0.3382651 1.0430967
These are ordinary distribution objects made by distionary, so they are immediately ready for downstream work in the broader probaverse.
Next, grab a few quantiles, expressed in terms of return periods (useful for risk modelling).
quantiles <- enframe_return(
  gev, lp3,
  at = c(5, 10, 20, 50, 100),
  arg_name = "return_period",
  fn_prefix = "level"
)
quantiles
## # A tibble: 5 × 3
##   return_period level_gev level_lp3
##           <dbl>     <dbl>     <dbl>
## 1             5      4.44      4.57
## 2            10      5.37      5.58
## 3            20      6.45      6.70
## 4            50      8.20      8.43
## 5           100      9.84      9.95
One way of checking which model has better-estimated quantiles is to calculate the quantile score with the quantile_score() function, for which smaller scores are better. Since our quantiles are in tabular format, we’ll calculate that using a tidyverse workflow.
library(tidyverse)
quantiles |>
  mutate(
    tau = 1 - 1 / return_period,
    gev_score = map2_dbl(
      level_gev, tau, \(l, p) mean(quantile_score(x, l, tau = p))
    ),
    lp3_score = map2_dbl(
      level_lp3, tau, \(l, p) mean(quantile_score(x, l, tau = p))
    )
  ) |>
  select(return_period, contains("score"))
## # A tibble: 5 × 3
##   return_period gev_score lp3_score
##           <dbl>     <dbl>     <dbl>
## 1             5    0.416     0.416
## 2            10    0.309     0.312
## 3            20    0.190     0.182
## 4            50    0.0895    0.0940
## 5           100    0.0611    0.0622
The scores are close at every return period, so the two distributions are roughly tied on this small sample, but this gives a useful starting point for investigating which distribution is better.
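For readers who have not met the quantile score before, the sketch below writes out the standard pinball (check) loss in base R and applies it to the 50-year GEV level. The helper name `pinball_loss` is hypothetical, and treating it as equivalent to `quantile_score()` is my assumption rather than something taken from the famish documentation, although it does appear to agree with the scores printed above when the rounded levels are plugged in.

```r
# Same data as in the example above.
x <- c(4.0, 2.7, 3.5, 3.2, 7.1, 3.1, 2.5, 5.0, 2.3, 4.5, 3.0, 3.8)

# Hypothetical helper: the standard pinball (check) loss.
# Observations above the predicted quantile q are weighted by tau,
# observations below it by 1 - tau.
pinball_loss <- function(y, q, tau) {
  ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y))
}

# A return period T corresponds to the non-exceedance probability 1 - 1 / T.
return_period <- 50
tau <- 1 - 1 / return_period   # 0.98

# Rounded 50-year GEV level taken from the quantiles table.
level_gev <- 8.20

score <- mean(pinball_loss(x, level_gev, tau))
score   # about 0.0895
```

Because every observation in this sample lies below the 50-year level, the score here is just `(1 - tau)` times the average distance from the data to the level. That is also why scores shrink as the return period grows, so they are best compared across models at the same return period rather than across return periods.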
What’s next?
Like the rest of the probaverse project, famish has ambitious goals. Beyond offering more insight into distribution fit, I need a way to create my own distribution families for the task at hand and then estimate them directly. That would also help with another task I keep running into: resolving a distribution in multiple steps. For example, I often need to restrict a family to distributions with a given mean supplied by a machine learning model, and then estimate within the remaining parameter space.
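To show what resolving in multiple steps could look like, here is a rough sketch in base R. It is not famish functionality, and everything in it is an assumption made for illustration: the choice of the Gamma family, the value of `fixed_mean` (standing in for a machine learning prediction), and the use of maximum likelihood for the second step.

```r
# Same data as in the example above.
x <- c(4.0, 2.7, 3.5, 3.2, 7.1, 3.1, 2.5, 5.0, 2.3, 4.5, 3.0, 3.8)

# Step 1: restrict the Gamma family to members with a given mean.
# A Gamma distribution with shape k and rate k / m has mean m, so fixing m
# leaves a one-parameter sub-family indexed by the shape alone.
fixed_mean <- 4   # assumed value, standing in for a model's prediction

# Step 2: estimate the remaining parameter by maximum likelihood.
neg_loglik <- function(shape) {
  -sum(dgamma(x, shape = shape, rate = shape / fixed_mean, log = TRUE))
}
fit <- optimize(neg_loglik, interval = c(0.01, 100))

shape_hat <- fit$minimum
rate_hat  <- shape_hat / fixed_mean

# The resolved distribution has the requested mean by construction.
shape_hat / rate_hat
```

The point of the sketch is the order of operations: the mean is pinned first, and the data only inform whatever freedom is left in the family.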
That is still ahead. But even in its current form, famish already fills an important gap in the probaverse: it connects distributions to data, and makes it easier to judge whether those fitted distributions are actually useful.
If you want to learn more, take a look at the famish home page. The probaverse project is ambitious, and there is a lot more I would like to build, but progress on that work will depend on future funding.