Describe and understand the world through data.

Data collection and data comparison are the foundations of scientific research. *Mathematics* provides the abstract framework to describe patterns we observe in nature, and *Statistics* provides the framework to quantify the uncertainty of these patterns. In statistics, natural patterns are described in the form of probability distributions that follow either a fixed pattern (parametric distributions) or more dynamic patterns (non-parametric distributions).

The `philentropy` package implements fundamental distance and similarity measures to quantify distances between probability density functions as well as traditional information theory measures. In this regard, it aims to provide a framework for comparing natural patterns in a statistical notation.

This project was born out of my passion for statistics and I hope that it will be useful to the people who share it with me.

**I am developing philentropy in my spare time and would be very grateful if you would consider citing the following paper if philentropy was useful for your own research. I plan on maintaining and extending the philentropy functionality and usability in the coming years and require citations to back up these efforts. Many thanks in advance :)**

HG Drost (2018). Philentropy: Information Theory and Distance Quantification with R. Journal of Open Source Software, 3(26), 765. https://doi.org/10.21105/joss.00765

- Introduction to the philentropy package
- Distance and Similarity Measures implemented in philentropy
- Information Theory Metrics implemented in philentropy

The names of all available distance and similarity methods can be listed with `getDistMethods()`:

```
 [1] "euclidean"         "manhattan"         "minkowski"
 [4] "chebyshev"         "sorensen"          "gower"
 [7] "soergel"           "kulczynski_d"      "canberra"
[10] "lorentzian"        "intersection"      "non-intersection"
[13] "wavehedges"        "czekanowski"       "motyka"
[16] "kulczynski_s"      "tanimoto"          "ruzicka"
[19] "inner_product"     "harmonic_mean"     "cosine"
[22] "hassebrook"        "jaccard"           "dice"
[25] "fidelity"          "bhattacharyya"     "hellinger"
[28] "matusita"          "squared_chord"     "squared_euclidean"
[31] "pearson"           "neyman"            "squared_chi"
[34] "prob_symm"         "divergence"        "clark"
[37] "additive_symm"     "kullback-leibler"  "jeffreys"
[40] "k_divergence"      "topsoe"            "jensen-shannon"
[43] "jensen_difference" "taneja"            "kumar-johnson"
[46] "avg"
```

```
# load the package
library(philentropy)
# define a probability density function P
P <- 1:10 / sum(1:10)
# define a probability density function Q
Q <- 20:29 / sum(20:29)
# combine P and Q into a matrix with one distribution per row
x <- rbind(P, Q)
# compute the jensen-shannon distance between
# probability density functions P and Q
distance(x, method = "jensen-shannon")
```

```
jensen-shannon using unit 'log'.
jensen-shannon
0.02628933
```
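
To make explicit what this number measures, here is a minimal base-R sketch of the Jensen-Shannon divergence using the natural logarithm. The helper name `js_divergence` is hypothetical and is not part of `philentropy`; the sketch assumes strictly positive probability vectors, so no `0 * log(0)` terms need special handling.

```r
# Hypothetical helper (not part of philentropy): Jensen-Shannon divergence
# between two strictly positive probability vectors, natural logarithm.
js_divergence <- function(p, q) {
  m <- (p + q) / 2                           # mixture of the two distributions
  kl <- function(a, b) sum(a * log(a / b))   # Kullback-Leibler divergence
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

jsd_nat <- js_divergence(P, Q)
print(jsd_nat)
```

Under these assumptions the result should agree with the `distance()` output above up to printing precision.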

Alternatively, users can also retrieve values from all available distance/similarity metrics using `dist.diversity()`:

```
        euclidean         manhattan
       0.12807130        0.35250464
        minkowski         chebyshev
       0.12807130        0.06345083
         sorensen             gower
       0.17625232        0.03525046
          soergel      kulczynski_d
       0.29968454        0.42792793
         canberra        lorentzian
       2.09927095        0.49712136
     intersection  non-intersection
       0.82374768        0.17625232
       wavehedges       czekanowski
       3.16657887        0.17625232
           motyka      kulczynski_s
       0.58812616        2.33684211
         tanimoto           ruzicka
       0.29968454        0.70031546
    inner_product     harmonic_mean
       0.10612245        0.94948528
           cosine        hassebrook
       0.93427641        0.86613103
          jaccard              dice
       0.13386897        0.07173611
         fidelity     bhattacharyya
       0.97312397        0.03930448
        hellinger          matusita
       0.32787819        0.23184489
    squared_chord squared_euclidean
       0.05375205        0.01640226
          pearson            neyman
       0.16814418        0.36742465
      squared_chi         prob_symm
       0.10102943        0.20205886
       divergence             clark
       1.49843905        0.86557468
    additive_symm  kullback-leibler
       0.53556883        0.13926288
         jeffreys      k_divergence
       0.31761069        0.04216273
           topsoe    jensen-shannon
       0.07585498        0.03792749
jensen_difference            taneja
       0.03792749        0.04147518
    kumar-johnson               avg
       0.62779644        0.20797774
```
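
A few of these values can be cross-checked against their textbook definitions in base R. The sketch below is only a sanity check, not the package's implementation, and the name `by_hand` is made up for illustration. It also compares the `jensen-shannon` entry here (0.03792749) with the value 0.02628933 reported earlier for unit `'log'`: their ratio is close to ln 2, which suggests, as an inference from the numbers rather than from the package documentation, that this table was computed with base-2 logarithms.

```r
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Textbook definitions written out in base R (no philentropy calls)
by_hand <- c(
  euclidean = sqrt(sum((P - Q)^2)),
  manhattan = sum(abs(P - Q)),
  chebyshev = max(abs(P - Q))
)
print(round(by_hand, 8))

# Ratio of the natural-log Jensen-Shannon value to the value in the table,
# printed next to ln(2) for comparison
unit_ratio <- 0.02628933 / 0.03792749
print(c(ratio = unit_ratio, ln2 = log(2)))
```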

```
# install.packages("devtools")
# install the current version of philentropy on your system
library(devtools)
install_github("HajkD/philentropy", build_vignettes = TRUE, dependencies = TRUE)
```

The current status of the package as well as a detailed history of the functionality of each version of `philentropy` can be found in the NEWS section.

- `distance()`: Implements 46 fundamental probability distance (or similarity) measures
- `getDistMethods()`: Get available method names for `distance()`
- `dist.diversity()`: Distance Diversity between Probability Density Functions
- `estimate.probability()`: Estimate Probability Vectors From Count Vectors
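
The examples in this README apply the measures to probability vectors. As a minimal sketch of the idea behind turning counts into probabilities, the following base-R snippet uses the plain relative-frequency estimate. The helper `counts_to_prob` is hypothetical; it illustrates the concept and is not the package's `estimate.probability()` implementation.

```r
# Hypothetical helper: relative-frequency estimate of a probability vector
counts_to_prob <- function(counts) {
  stopifnot(is.numeric(counts), all(counts >= 0), sum(counts) > 0)
  counts / sum(counts)
}

counts <- c(10, 30, 40, 20)   # made-up count vector
probs <- counts_to_prob(counts)

print(probs)
print(sum(probs))             # a probability vector sums to 1
```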

- `H()`: Shannon’s Entropy H(X)
- `JE()`: Joint-Entropy H(X,Y)
- `CE()`: Conditional-Entropy H(X | Y)
- `MI()`: Shannon’s Mutual Information I(X,Y)
- `KL()`: Kullback–Leibler Divergence
- `JSD()`: Jensen-Shannon Divergence
- `gJSD()`: Generalized Jensen-Shannon Divergence
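
The quantities in this list are linked by standard identities. The base-R sketch below works them out for a small, made-up joint distribution of two binary variables. It does not call the package functions, and the helper `entropy` and all variable names are illustrative assumptions.

```r
# Joint distribution of two binary variables X (rows) and Y (columns);
# the numbers are made up for illustration.
joint <- matrix(c(0.4, 0.1,
                  0.2, 0.3), nrow = 2, byrow = TRUE)

# Shannon entropy in bits; zero-probability cells contribute nothing
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

px <- rowSums(joint)             # marginal distribution of X
py <- colSums(joint)             # marginal distribution of Y

h_x  <- entropy(px)              # H(X)
h_y  <- entropy(py)              # H(Y)
h_xy <- entropy(joint)           # H(X,Y)
h_x_given_y <- h_xy - h_y        # H(X | Y) = H(X,Y) - H(Y)
mi   <- h_x + h_y - h_xy         # I(X,Y) = H(X) + H(Y) - H(X,Y)

# Mutual information equals the Kullback-Leibler divergence between the
# joint distribution and the product of its marginals.
kl_mi <- sum(joint * log2(joint / outer(px, py)))

print(c(H_X = h_x, H_Y = h_y, H_XY = h_xy, CE = h_x_given_y, MI = mi, KL = kl_mi))
```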

Studies that have applied the `philentropy` package:

- Single cell census of human kidney organoids shows reproducibility and diminished off-target cells after transplantation. A Subramanian et al. Nature Communications, 2019.
- Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. C Coupé, YM Oh, D Dediu, F Pellegrino. Science Advances, 2019.
- Loss of adaptive capacity in asthmatic patients revealed by biomarker fluctuation dynamics after rhinovirus challenge. A Sinha et al. eLife, 2019.
- Evacuees and Migrants Exhibit Different Migration Systems after the Great East Japan Earthquake and Tsunami. M Hauer, S Holloway, T Oda. 2019.
- Robust comparison of similarity measures in analogy based software effort estimation. P Phannachitta. 11th International Conference on Software, 2017.
- Expression variation analysis for tumor heterogeneity in single-cell RNA-sequencing data. EF Davis-Marcisak, P Orugunta et al. BioRxiv, 2018.
- SEDE-GPS: socio-economic data enrichment based on GPS information. T Sperlea, S Füser, J Boenigk, D Heider. BMC bioinformatics, 2018.
- How the Choice of Distance Measure Influences the Detection of Prior-Data Conflict. K Lek, R Van De Schoot. Entropy, 2019.
- Concept acquisition and improved in-database similarity analysis for medical data. I Wiese, N Sarna, L Wiese, A Tashkandi, U Sax. Distributed and Parallel Databases, 2019.
- Differential variation analysis enables detection of tumor heterogeneity using single-cell RNA-sequencing data. EF Davis-Marcisak, TD Sherman et al. Cancer research, 2019.
- Dynamics of Vaginal and Rectal Microbiota over Several Menstrual Cycles in Female Cynomolgus Macaques. MT Nugeyre, N Tchitchek, C Adapen et al. Frontiers in Cellular and Infection Microbiology, 2019.
- Inferring the quasipotential landscape of microbial ecosystems with topological data analysis. WK Chang, L Kelly. BioRxiv, 2019.
- Shifts in the nasal microbiota of swine in response to different dosing regimens of oxytetracycline administration. KT Mou, HK Allen, DP Alt, J Trachsel et al. Veterinary microbiology, 2019.
- The Patchy Distribution of Restriction–Modification System Genes and the Conservation of Orphan Methyltransferases in Halobacteria. MS Fullmer, M Ouellette, AS Louyakis et al. Genes, 2019.
- Genetic differentiation and intrinsic genomic features explain variation in recombination hotspots among cocoa tree populations. EJ Schwarzkopf, JC Motamayor, OE Cornejo. BioRxiv, 2019.
- Metastable regimes and tipping points of biochemical networks with potential applications in precision medicine. SS Samal, J Krishnan, AH Esfahani et al. Reasoning for Systems Biology and Medicine, 2019.
- Genome-wide characterization and developmental expression profiling of long non-coding RNAs in Sogatella furcifera. ZX Chang, OE Ajayi, DY Guo, QF Wu. Insect science, 2019.
- Loss of adaptive capacity in asthmatics revealed by biomarker fluctuation dynamics upon experimental rhinovirus challenge. A Sinha, R Lutter, B Xu, T Dekker, B Dierdorp et al. BioRxiv, 2019.
- Development of a simulation system for modeling the stock market to study its characteristics. P Mariya. 2018.
- The Tug1 Locus is Essential for Male Fertility. JP Lewandowski, G Dumbović, AR Watson, T Hwang et al. BioRxiv, 2019.
- Microbiotyping the sinonasal microbiome. A Bassiouni, S Paramasivan, A Shiffer et al. BioRxiv, 2019.
- Critical search: A procedure for guided reading in large-scale textual corpora. J Guldi. Journal of Cultural Analytics, 2018.
- A Bibliography of Publications about the R, S, and S-Plus Statistics Programming Languages. NHF Beebe. 2019.
- Improved state change estimation in dynamic functional connectivity using hidden semi-Markov models. H Shappell, BS Caffo, JJ Pekar, MA Lindquist. NeuroImage, 2019.
- A Smart Recommender Based on Hybrid Learning Methods for Personal Well-Being Services. RM Nouh, HH Lee, WJ Lee, JD Lee. Sensors, 2019.
- Cognitive Structural Accuracy. V Frenz. 2019.
- Kidney organoid reproducibility across multiple human iPSC lines and diminished off target cells after transplantation revealed by single cell transcriptomics. A Subramanian, EH Sidhom, M Emani et al. BioRxiv, 2019.
- Multi-classifier majority voting analyses in provenance studies on iron artefacts. G Żabiński et al. Journal of Archaeological Science, 2020.

I would be very happy to learn more about potential improvements to the concepts and functions provided in this package.

Furthermore, if you find bugs or need additional (more flexible) functionality in parts of this package, please let me know:

https://github.com/HajkD/philentropy/issues

or find me on Twitter: HajkDrost