The EHRtemporalVariability
package contains functions to
delineate temporal dataset shifts in Electronic Health Records through
the projection and visualization of dissimilarities among data temporal
batches. This is done through the estimation of data statistical
distributions over time and their projection in non-parametric
statistical manifolds, uncovering the patterns of the data latent
temporal variability. Dataset shifts can be explored and identified
through visual analytics formats such as Data Temporal heatmaps and
Information Geometric Temporal (IGT) plots [1–3]. An additional EHRtemporalVariability
Shiny app can be used to load and explore the package results
towards an improved investigation experience and even to allow the use
of these functions to those users non-experienced in R coding.
If you use EHRtemporalVariability
please cite:
Carlos Sáez, Alba Gutiérrez-Sacristán, Isaac Kohane, Juan M García-Gómez, Paul Avillach. EHRtemporalVariability: delineating temporal data-set shifts in Electronic Health Records. GigaScience, Volume 9, Issue 8, August 2020, giaa079. doi:10.1093/gigascience/giaa079 [4]
Biomedical Data research repositories and property biomedical research databases are becoming bigger both in terms of sample size and collected variables [5, 6]. Two significant reasons behind of this are the widespread adoption of data-sharing initiatives and technological infrastructures, and the continuous and systematic population of those repositories over longer periods of time. However, these two situations can also introduce potential confounding factors in data which may hinder their reuse for research, such as in population research or in statistical and machine learning modeling. Concretely, differences in protocols, populations, or even unexpected biases, either caused by systems or humans, can lead to temporal dataset shifts [7, 8], changes of reference which are reflected in the statistical distributions of data. This temporal variability of data represent a Data Quality (DQ) issue which must be addressed for a reliable data reuse [2, 9].
The EHRtemporalVariability
package has been developed to
help preventing this problem.
The tasks that can be performed with
EHRtemporalVariability
package are the following:
data temporal heatmaps
of absolute and
relative frequencies of variable values over temporal batches.Information Geometric Temporal plot
, which helps
delineating reference changes in data over time, including abrupt and
recurrent changes, conceptually-related time periods (periods with
similar data distributions), but also smooth temporal trends.In the following sections the specific functions that can be used to address each of these tasks are presented.
For more information about the methods please check reference [1] and the Suplemental Material in [2].
EHRtemporalVariability
is provided through CRAN and
GitHub. To install the CRAN version the user must type the following
commands in an R session:
The GitHub version of the package will, in general, provide the
latest updates before these are commited to the CRAN version. In order
to install it, devtools
package - available in CRAN (https://cran.r-project.org/) - is required. To install
devtools
the user must type the following commands in an R
session:
Once devtools
package has been installed the user can
install EHRtemporalVariability
typing the following
commands in an R session:
DataTemporalMap
The DataTemporalMap
object contains the statistical
distributions of data estimated at a specific time period.
## [1] "DataTemporalMap"
## attr(,"package")
## [1] "EHRtemporalVariability"
DataTemporalMap
object is the output of
estimateDataTemporalMap
function. It is used as input for
plotDataTemporalMap
functions.
Note that objects of
this class can be generated automatically by the
estimateDataTemporalMap
function, but its construction and
extension is open towards fostering its use through external methods.
E.g., one may use additional probability distribution estimation
methods, or even contruct compatible DataTemporalMap
object
for other unstructured data shuch as images or free text.
IGTProjection
The IGTProjection
object contains the projected
non-parametric statistical manifold of a DataTemporalMap
object (also included in the object) estimated in a specific number of
dimensions.
## [1] "IGTProjection"
## attr(,"package")
## [1] "EHRtemporalVariability"
IGTProjection
object is the output of
estimateIGTProjection
function. It is used as input for
plotIGTProjection
functions.
Note that objects of
this class are generated automatically by the
estimateIGTProjection
function.
The first step consists on read the CSV file that contains the data
for the analysis. To do it, the user can apply the read.csv
function.
The read.csv
function reads a file in table format and
creates a data frame from it, with cases corresponding to lines and
variables to fields in the file. It is important to define the class of
each column when reading the CSV file.
An example of how to read the CSV file is shown next:
dataset <- read.csv2( "http://github.com/hms-dbmi/EHRtemporalVariability-DataExamples/raw/master/nhdsSubset.csv",
sep = ",",
header = TRUE,
na.strings = "",
colClasses = c( "character", "numeric", "factor",
"numeric" , rep( "factor", 22 ) ) )
head( dataset)
## date age sex newborn race marital disstatus dayscare lengthflag region
## 1 00/01 0 2 1 2 9 1 8 1 1
## 2 00/01 53 1 2 6 9 1 8 1 1
## 3 00/01 49 1 2 2 9 1 6 1 1
## 4 00/01 76 2 2 1 9 3 53 1 1
## 5 00/01 53 2 2 2 9 1 3 1 1
## 6 00/01 38 2 2 2 9 1 6 1 1
## hospbeds hospownership diagcode1 diagcode2 diagcode3 diagcode4 diagcode5
## 1 4 2 V3101 76518 V298- V053- N/A
## 2 4 2 2252- 78039 25000 N/A N/A
## 3 4 2 29181 30391 4019- N/A N/A
## 4 4 2 29532 49390 700-- N/A N/A
## 5 4 2 2967- V08-- N/A N/A N/A
## 6 4 2 30421 30391 N/A N/A N/A
## diagcode6 diagcode7 proccode1 proccode2 proccode3 proccode4 princpayment
## 1 N/A N/A 9955 9921 N/A N/A 8
## 2 N/A N/A 0159 9921 N/A N/A 8
## 3 N/A N/A 9462 N/A N/A N/A 3
## 4 N/A N/A 8622 9423 9439 9394 2
## 5 N/A N/A 9411 9423 9438 N/A 8
## 6 N/A N/A 9468 N/A N/A N/A 3
## secondpayment drg
## 1 NA 388
## 2 NA 1
## 3 NA 435
## 4 6 424
## 5 NA 430
## 6 NA 435
The second step will be transform the date column in ‘Date’ R format.
The EHRtemporalVariability
R package allows the user to do
this transformation applying the formatDate
function.
The formatDate
function transform the column containing
the dates from the given data.frame input, using the following
arguments:
The formatDate
function output is the same data.frame
with the date column transformed.
## [1] "character"
datasetFormatted <- EHRtemporalVariability::formatDate(
input = dataset,
dateColumn = "date",
dateFormat = "%y/%m"
)
head( datasetFormatted )[1:5, 1:5]
## date age sex newborn race
## 1 2000-01-01 0 2 1 2
## 2 2000-01-01 53 1 2 6
## 3 2000-01-01 49 1 2 2
## 4 2000-01-01 76 2 2 1
## 5 2000-01-01 53 2 2 2
## [1] "Date"
Once the CSV file has been readed and the date column transformed, the third step is to transform the ICD9-CM to PheWAS codes if needed (https://phewascatalog.org/) [10].
This step is not mandatory, is an option for those analysis in which ICD9-CM are analyzed and want to be reduced and transformed into PheWAS codes.
The EHRtemporalVariability
R package allows the user to
do the mapping in an automatic way applying the
icd9toPheWAS
function.
The icd9toPheWAS
map to PheWAS codes using as input a
column containing ICD9-CM codes. This function has as input the
following arguments:
The icd9toPheWAS
function output is the ICD9-CM column
transformed into PheWAS codes. In this specific “NHDS” example we will
create a new column with the PheWAS codes that map to the diagcode2
column in the original data.frame.
datasetPheWAS <- icd9toPheWAS(data = datasetFormatted,
icd9ColumnName = "diagcode1",
phecodeDescription = TRUE,
missingValues = "N/A",
statistics = TRUE,
replaceColumn = FALSE)
head( datasetPheWAS[, c( "diagcode1", "diagcode1-phewascode")] )
## diagcode1 diagcode1-phewascode
## 1 V3101 Multiple gestation
## 2 2252- ICD9codeNotMapped
## 3 29181 Alcoholism
## 4 29532 Schizophrenia
## 5 2967- ICD9codeNotMapped
## 6 30421 Substance addiction and disorders
The estimateDataTemporalMap
function estimates a
DataTemporalMap
object from a data.frame containing
individuals in rows and the variables in columns, being one of these
columns the analysis date. This function has as input the following
arguments:
Additionally this function has the following optional arguments:
numeric
,
integer
, character
, or factor
(accordingly to the variable type), and where the name of the list
element must correspond to the column name of its variable. If not
provided it is automatically estimated from data.The estimateDataTemporalMap
function output is a
DataTemporalMap
object or a list of
DataTemporalMap
objects depending on the number of analysis
variables.
probMaps <- estimateDataTemporalMap(data = datasetPheWAS,
dateColumnName = "date",
period = "month")
In the previous specific example, nhds data frame is used as input,
with the date
column previously formated to
Date
format. The estimateDataTemporalMap
function has been applied to the X variables present in the initial data
set. As a result, a list of X DataTemporalMap
objects is
obtained.
## [1] "list"
## [1] "DataTemporalMap"
## attr(,"package")
## [1] "EHRtemporalVariability"
Variable supports can be set manually for all or some of the
variables using the support
parameter. The support of those
variables not present in the support
parameter will be
estimated automatically as when the parameter is not passed.
Additionally, the EHRtemporalVariability
R package
contains a function that allows the user to trim any
DataTemporalMap
object according to a start and end
date.
The trimDataTemporalMap
function needs as input the
following arguments:
DataTemporalMap
object.The trimDataTemporalMap
function output is a new
DataTemporalMap
object.
## [1] "DataTemporalMap"
## attr(,"package")
## [1] "EHRtemporalVariability"
probMapTrimmed <- trimDataTemporalMap(
dataTemporalMap = probMaps[[1]],
startDate = "2005-01-01",
endDate = "2008-12-01"
)
class( probMapTrimmed )
## [1] "DataTemporalMap"
## attr(,"package")
## [1] "EHRtemporalVariability"
The estimateIGTProjection
function estimates a
IGTProjection
object from a DataTemporalMap
object. This function has as input the following arguments:
DataTemporalMap
object.classicalmds
and
nonmetricmds
, with classicalmds
as
default.The estimateIGTProjection
function output is a
IGTProjection
object.
igtProj <- estimateIGTProjection( dataTemporalMap = probMaps[[1]],
dimensions = 2,
startDate = "2000-01-01",
endDate = "2010-12-31")
The estimateIGTProjection
function can be applied to one
DataTemporalMap
object. As a result, an
IGTProjection
object is obtained.
## [1] "IGTProjection"
## attr(,"package")
## [1] "EHRtemporalVariability"
The sapply
function can be used to apply the
estimateIGTProjection
function to the output from
estimateDataTemporalMap
function including more than one
variable, which result is a list of DataTemporalMap
objects, as follows:
EHRtemporalVariability
offers two different options to
visualize the results, heatmaps and Information Geometric Temporal (IGT)
plots. An special focus is made on visualization colors.
The default “Spectral” palette shows a color temperature scheme from blue, through yellow, to red. The four remaining options are better suited for those with colorblindness, including “Viridis”, “Magma”, and their reversed versions “Viridis-reversed” and “Magma-reversed”.
The plotDataTemporalMap
function returns a heatmap or
time series plot. This function has as input the following
arguments:
DataTemporalMap
object.To illustrate the next examples we load the example .Rdata file, which contains the results from analyzing the complete “NHDS” dataset.
githubURL <- "https://github.com/hms-dbmi/EHRtemporalVariability-DataExamples/raw/master/variabilityDemoNHDS.RData"
load(url(githubURL))
The plotIGTProjection
function returns an interactive
Information Geometric Temporal (IGT) plot from an
IGTProjection
object. This function has as input the
following arguments:
IGTProjection
object.plotIGTProjection(
igtProjection = igtProjs[["diagcode1-phewascode"]],
colorPalette = "Spectral",
dimensions = 2)
An IGT plot visualizes the variability among time batches in a data repository in a 2D or 3D plot. Time batches are positioned as points where the distance between them represents the probabilistic distance between their distributions (currently Jensen-Shannon distance, more distances will be supported in the future).
To track the temporal evolution, temporal batches are labeled to
show their date and colored according to their season or period,
according to the analysis period, as follows. If period==“year” the
label is “yy” (2 digit year) and the color is according to year. If
period==“month” the label is “yym” (yy + abbreviatted month) and the
color is according to the season (yearly). If period==“week” the label
is “yymmw” (yym + ISO week number in 1-2 digit) and the color is
according to the season (yearly).
Month | Abbreviaton |
---|---|
January | J |
February | F |
March | M |
April | a |
May | m |
June | j |
July | x |
August | a |
September | S |
October | O |
November | N |
December | D |
The plotIGTProjection
function allows overlying to both
2D and 3D IGT plots an smoothed trajectory of the information evolution
over time, which is calculated with smoothed splines. To plot the
trajectory set the trajectory parameter to
TRUE
.
The EHRtemporalVariability Shiny app allows loading your own .csv file through simple configuration steps but, however, with limited data pre-processing. Consequently, users can export the DataTemporalHeatmaps and IGTplots generated from the R package as an .RData file for their exporation through the interactive Shiny app dashboard. The export is done as follows (note that both DataTemporalHeatmaps and IGTplots must be lists of same size where the names of their items correspond to the variable names):
According to the layout of time batches in IGT projections we define the following four types of temporal changes (quoting text from our previous publication [3]):
Trend: Continuous and smooth change in the probability distributions of time batches over time, along the full-time period, or within a sub-period. Trends can be linear or, more generally, curved. Trends are represented in the IGT projection as a continuous flow of time batches through a time-related direction related.
Abrupt change: A sudden change in probability distributions at a specific time point, leading to a new data inherent concept which is maintained afterwards. Abrupt changes are represented in the IGT projection as a gap between two groups of continuous time batches. Multiple abrupt changes can occur in a data repository, splitting the dataset into multiple clusters of time batches (see the definition of temporal subgroups). A single time batch could be abruptly separated from the rest, generally due to some specific context in its data (e.g., transient states, incomplete batches), in that case, we will talk about an outlier batch.
Temporal subgroups: Conceptually related groups of time periods at which probability distributions are similar within a group, but dissimilar between groups, i.e., forming clusters of time batches. Abrupt changes do generally split data into temporal subgroups. A consecutive time flow between batches at two temporal subgroups would indicate a recurrent behaviour. An outlier batch will not be considered within any subgroup.
Seasonality: Repetition of some change patterns at a specific time period throughout the IGT projection. Seasonality should be represented in the IGT projection as repetitive cycles over the general temporal flow. We could find local seasonality, within a specific time period, or global seasonality, along the full study period. Global seasonality should be maintained even across multiple temporal subgroups, e.g., in a data repository which is partitioned on various temporal subgroups, a global yearly variation should be maintained across the different subgroups.
To ilulstrate this example we will refer to the IGT plot for variable
diagcode1-phewascode
shown in the temporal trajectory visualization section. In
that IGT plot we can find the following changes.
Trend: A main trend through the entire period of study can be found across dimension D1, where there is a continuous flow of time batches. This is possibly associated to smooth population changes over time, e.g, including changes in life expectancy.
Seasonality: There is an evident yearly seasonality through the entire period of study layering across dimension D2, and highlighted by the batches coloring scheme and cycles on the trajectory. This is likely associated to the yearly seasonality of diseases.
Abrupt change: There is an abrupt change which splits the main trend between 2007 and 2008, with an apparent minor transient start in November 2007. This is possibly related to changes in the hospitals providing data to the NHDS as well as yearly ICD-9-CM updates in 2007. Another minor abrupt change shows as well between 2004 and 2005, with possible equivalent causes.
Temporal subgroups: The abrupt changes described before define as well distinct temporal subgroups. One possibility to validate them is using clustering algorithms. As an example, we can apply the DBSCAN clustering algorithm to the IGT projection points, to obtain the following results.
# We set the minimum number of batches in a subgroup as 2
# We set eps based on the knee of the following KNNdistplot, at around 0.023
# kNNdistplot(igtProj@projection, k = 2, all = FALSE)
igtProj = igtProjs[["diagcode1-phewascode"]]
# We select the 2 first dimensions for consistency with the IGT plot examples above
dbscanResults <- dbscan(igtProj@projection[,c(1,2)], eps = 0.023, minPts = 2)
clusterNames <- vector(mode = "character", length = 10)
clusterNames[dbscanResults$cluster == 0] <- "Outlier batches"
clusterNames[! dbscanResults$cluster == 0] <- paste("Temporal subgroup",dbscanResults$cluster[! dbscanResults$cluster == 0])
plotly::plot_ly(x = igtProj@projection[,1], y = igtProj@projection[,2],
color = as.factor(clusterNames),
type = "scatter", mode = "markers",
text = paste0("Date: ",igtProj@dataTemporalMap@dates)) %>%
plotly::config(displaylogo = FALSE)
Note that the sensitivity of the cluster validation can be refined by using IGT projections at 3 or more dimensions, modifying the DBSCAN parameters, or using other clustering algorithms.
EHRtemporalVariability
available
functionsInput Object | EHRtemporalVariability function |
Output Generated |
---|---|---|
data.frame | formatDate |
Given a data.frame object with a column of dates in ‘character’ format, it generates a new data.frame object with the dates transformed into “Date” R format. |
data.frame | icd9toPheWAS |
Given a data.frame object with a column of ICD9-CM codes, it generates a new data.frame object with the ICD9-CM codes transformed into PheWAS codes |
data.frame | estimateDataTemporalMap |
Given a data.frame object containing individuals in rows and the
variables in columns, it generates a DataTemporalMap object
or a list of DataTemporalMap objects depending on the number of analysis
variables |
DataTemporalMap |
trimDataTemporalMap |
Given a DataTemporalMap object, it generates a trimmed
DataTemporalMap object |
DataTemporalMap |
estimateIGTProjection |
Given a DataTemporalMap object, it generates a
IGTProjection object |
DataTemporalMap |
plotDataTemporalMap |
Given a DataTemporalMap object, it generates an
interactive heatmap |
IGTProjection |
plotIGTProjection |
Given an IGTProjection object, it generates an
interactive Information Geometric Temporal (IGT) plot |
IGTProjection |
estimateIGTTrajectory |
Given anIGTProjection object, it estimates a trajectory
of the information temporal evolution in a IGT projection by fitting a
cubic smoothing spline |