An introduction to the isatabr package
Bart-Jan van Rossum
2023-08-25
The isatabr package
The isatabr package was developed as an easy-to-use package for reading, modifying and writing files that follow the Investigation/Study/Assay (ISA) Abstract Model of the metadata framework, using the ISA tab-delimited (TAB) format.
ISA is a metadata framework to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies. Built around the Investigation (the project context), Study (a unit of research) and Assay (analytical measurements) concepts, ISA helps you to provide rich descriptions of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.
The ISA-Tab structure
The ISA-Tab structure is described in full detail on the ISA-Tab website. The description below is mostly taken from there and slightly condensed where appropriate.
ISA-Tab uses three types of file to capture the experimental metadata:
- Investigation file
- Study file
- Assay file (with associated data files)
The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files.
In order to facilitate identification of ISA-Tab component files, specific naming patterns should be followed:
- i_*.txt for identifying the Investigation file, e.g. i_investigation.txt
- s_*.txt for identifying Study file(s), e.g. s_gene_survey.txt
- a_*.txt for identifying Assay file(s), e.g. a_transcription.txt
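As a minimal sketch of how these naming patterns can be checked in R, the example below classifies a few file names by their prefix. The helper classifyISAFile() and the file names are hypothetical and only serve as illustration; this is not how isatabr itself detects the files.

```r
## Hypothetical file names, used only to illustrate the naming patterns.
fileNames <- c("i_investigation.txt", "s_gene_survey.txt",
               "a_transcription.txt", "d_data.txt")

## Classify a single file name by its ISA-Tab prefix (hypothetical helper).
classifyISAFile <- function(fileName) {
  if (grepl("^i_.*\\.txt$", fileName)) {
    "investigation"
  } else if (grepl("^s_.*\\.txt$", fileName)) {
    "study"
  } else if (grepl("^a_.*\\.txt$", fileName)) {
    "assay"
  } else {
    "other"
  }
}

## Apply the classification to all file names.
fileTypes <- vapply(fileNames, classifyISAFile, character(1))
fileTypes
```

Files that do not match any of the three patterns, such as a data file, end up in the "other" category.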
The Investigation file
The Investigation file fulfills four needs:
- to declare key entities, such as factors, protocols, which may be referenced in the other files;
- to track provenance of the terminologies (controlled vocabularies or ontologies) that are used, where applicable;
- to relate each Study file to an Investigation (this only becomes necessary when two or more Study files need to be grouped);
- to relate Assay files to Studies.
An Investigation file is structured as a table with vertical headings along the first column, and corresponding values in the subsequent columns. The following section headings must appear in the Investigation file (in order), and the study block (headings from STUDY to STUDY CONTACTS) can be repeated, one block per study associated with the investigation.
- ONTOLOGY SOURCE REFERENCE
- INVESTIGATION
- INVESTIGATION PUBLICATIONS
- INVESTIGATION CONTACTS
- STUDY
- STUDY DESIGN DESCRIPTORS
- STUDY PUBLICATIONS
- STUDY FACTORS
- STUDY ASSAYS
- STUDY PROTOCOLS
- STUDY CONTACTS
For a full description of all sections see the aforementioned site.
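To make the layout more concrete, the sketch below splits a heavily shortened, made-up investigation file into its sections using base R only. The content of invLines is hypothetical, and real files may contain quoting and other details that this sketch ignores; in practice readISATab() takes care of the parsing.

```r
## Section headings as listed above.
sectionHeadings <- c("ONTOLOGY SOURCE REFERENCE", "INVESTIGATION",
                     "INVESTIGATION PUBLICATIONS", "INVESTIGATION CONTACTS",
                     "STUDY", "STUDY DESIGN DESCRIPTORS", "STUDY PUBLICATIONS",
                     "STUDY FACTORS", "STUDY ASSAYS", "STUDY PROTOCOLS",
                     "STUDY CONTACTS")

## Hypothetical, heavily shortened investigation file with two study blocks.
invLines <- c("ONTOLOGY SOURCE REFERENCE",
              "Term Source Name\tOBI\tEFO",
              "INVESTIGATION",
              "Investigation Identifier\tinv1",
              "STUDY",
              "Study Identifier\tstudy1",
              "STUDY",
              "Study Identifier\tstudy2")

## Assign every line to the most recent section heading.
isHeading <- invLines %in% sectionHeadings
sectionId <- factor(cumsum(isHeading)[!isHeading],
                    levels = seq_len(sum(isHeading)))
sections <- split(invLines[!isHeading], sectionId)
names(sections) <- invLines[isHeading]

## The STUDY heading occurs once per study block.
nStudies <- sum(names(sections) == "STUDY")

## First column holds the vertical heading, subsequent columns the values.
osrFields <- strsplit(sections[["ONTOLOGY SOURCE REFERENCE"]], "\t",
                      fixed = TRUE)[[1]]
```

Here nStudies counts the repeated study blocks, and osrFields shows the vertical heading followed by its values.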
The Study file
The Study file contains contextualizing information for one or more assays, for example: the subjects studied, their source(s), the sampling methodology, their characteristics, and any treatments or manipulations performed to prepare the specimens.
The Assay file
The Assay file represents a portion of the experimental graph (i.e., one part of the overall structure of the workflow); each Assay file must contain assays of the same type, defined by the type of measurement (e.g. gene expression) and the technology employed (e.g. DNA microarray). Assay-related information includes protocols, additional information relating to the execution of those protocols and references to data files (whether raw or derived).
Reading files in the ISA-Tab format
ISA-Tab files can be stored in two different ways, either as separate files in a directory, or as a .zip file containing the files. The example data is included in both ways in the package. Both formats can be read into R using the readISATab() function.
When reading ISA-Tab files from a directory, only the name of the directory where the ISA-Tab files are located needs to be specified.
## Read ISA-Tab files from directory.
isaObject1 <- readISATab(path = file.path(system.file("extdata/Atwell", package = "isatabr")))
When reading zipped files, both the directory where the zip file is located and the name of the file need to be specified.
## Read ISA-Tab files from zipped archive.
isaObject2 <- readISATab(path = file.path(system.file("extdata", package = "isatabr")),
zipfile = "Atwell.zip")
In both cases readISATab() will automatically detect the Investigation, Study and Assay files, assuming the naming conventions described in the previous section are followed. If this is not the case, the function will give an error indicating the problem. The imported ISA-Tab files are stored in an object of the S4 class ISA.
Since the information is almost identical for files read from a directory and from a zipped archive, the following sections show the examples for the files read from a directory only.
Accessing and updating ISA objects
All information from the ISA-Tab files is stored within slots in the ISA object. The table below gives an overview of the different slots and a brief description of the information stored in each slot. For a more exhaustive description see help("ISA").
Note that an investigation may have multiple studies. Therefore, data concerning studies is stored in a list object, where one element in the list corresponds to one study. Likewise, a study may consist of multiple assays, and assay data is stored in a list object, where one element in the list corresponds to one assay.
Slot | Type | Description |
---|---|---|
path | character | path to the ISA-Tab files |
iFileName | character | name of the investigation file |
oSR | data.frame | ONTOLOGY SOURCE REFERENCE section of investigation file |
invest | data.frame | INVESTIGATION section of investigation file |
iPubs | data.frame | INVESTIGATION PUBLICATIONS section of investigation file |
iContacts | data.frame | INVESTIGATION CONTACTS section of investigation file |
study | list of data.frames | STUDY sections of investigation file |
sDD | list of data.frames | STUDY DESIGN DESCRIPTORS sections of investigation file |
sPubs | list of data.frames | STUDY PUBLICATIONS sections of investigation file |
sFacts | list of data.frames | STUDY FACTORS sections of investigation file |
sAssays | list of data.frames | STUDY ASSAYS sections of investigation file |
sProts | list of data.frames | STUDY PROTOCOLS sections of investigation file |
sContacts | list of data.frames | STUDY CONTACTS sections of investigation file |
sFiles | list of data.frames | content of study files |
aFiles | list of data.frames | content of assay files |
All slots have corresponding functions for accessing and modifying information. The names of these access functions are the same as the slots they refer to, e.g. the iFileName slot in an ISA object can be accessed using the iFileName() function. There is one exception to this. To prevent conflicts with the path() function, which already exists in several other packages, the path slot in an ISA object should be accessed using the isaPath() function.
## Access path for isaObjects
isaPath(isaObject1)
#> [1] "/home/runner/work/_temp/Library/isatabr/extdata/Atwell"
isaPath(isaObject2)
#> [1] "/tmp/RtmpFP54n5"
The path for isaObject1 shows the directory from which the files were read. As isaObject2 was read directly from a zipped archive, the files were first extracted into a temporary folder and subsequently read from there. This temporary folder is shown as the path.
The other slots are accessible in a similar way. Some more examples are shown below.
## Access studies.
isaStudies <- study(isaObject1)
## Print study names.
names(isaStudies)
#> [1] "GMI_Atwell_study"
## Access study descriptors.
isaSDD <- sDD(isaObject1)
## Shows study descriptors for study GMI_Atwell_study.
isaSDD$GMI_Atwell_study
#> Study Design Type
#> 1 GWAS of 107 phenotypes in Arabidopsis thaliana inbred lines using ~250k SNPs in 199 accessions
#> Study Design Type Term Accession Number Study Design Type Term Source REF
#> 1 <NA> <NA>
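Because study-level slots are returned as named lists with one element per study, ordinary list tools such as vapply() and Map() can be used to summarise them. The sketch below uses a small hypothetical list, toySDD, as a stand-in for the output of sDD(); the study names and descriptor values are made up for illustration.

```r
## Hypothetical stand-in for the output of sDD(): one data.frame per study.
toySDD <- list(
  studyA = data.frame("Study Design Type" = "GWAS",
                      check.names = FALSE, stringsAsFactors = FALSE),
  studyB = data.frame("Study Design Type" = c("time course", "case control"),
                      check.names = FALSE, stringsAsFactors = FALSE)
)

## Number of design descriptors per study.
nDescriptors <- vapply(toySDD, nrow, integer(1))

## Stack all studies into one data.frame, adding the study name as a column.
allSDD <- do.call(rbind, Map(function(df, studyName) {
  data.frame(study = studyName, df,
             check.names = FALSE, stringsAsFactors = FALSE)
}, toySDD, names(toySDD)))
allSDD
```

The same approach should carry over to the other list-type slots, as they share the one-element-per-study layout.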
It is not only possible to access the different slots in an ISA object; the slots can also be updated. Like the access functions, the update functions have the same name as the slots they refer to. As an example, let's assume an error sneaked into the ONTOLOGY SOURCE REFERENCE section and we want to update one of the source versions.
First have a look at the current content of the ONTOLOGY SOURCE REFERENCE section.
(isaOSR <- oSR(isaObject1))
#> Term Source Name Term Source File Term Source Version
#> 1 OBI http://data.bioontology.org/ontologies/OBI 23
#> 2 EFO http://data.bioontology.org/ontologies/EFO 118
#> 3 UO http://purl.obolibrary.org/obo/UO <NA>
#> 4 NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON 6
#> 5 PO http://data.bioontology.org/ontologies/PO 10
#> 6 GMI http://gwas.gmi.oeaw.ac.at/ <NA>
#> Term Source Description
#> 1 Ontology for Biomedical Investigations
#> 2 Experimental Factor Ontology
#> 3 Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5 Plant Ontology
#> 6 Cataloque of Arabidopsis accessions at GMI
Now we update the version of the OBI ontology source from 23 to 24. Then we store the modified ontology source data.frame back in the ISA object.
## Update version number.
isaOSR[1, "Term Source Version"] <- 24
## Update oSR in ISA object.
oSR(isaObject1) <- isaOSR
## Check the updated oSR.
oSR(isaObject1)
#> Term Source Name Term Source File Term Source Version
#> 1 OBI http://data.bioontology.org/ontologies/OBI 24
#> 2 EFO http://data.bioontology.org/ontologies/EFO 118
#> 3 UO http://purl.obolibrary.org/obo/UO <NA>
#> 4 NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON 6
#> 5 PO http://data.bioontology.org/ontologies/PO 10
#> 6 GMI http://gwas.gmi.oeaw.ac.at/ <NA>
#> Term Source Description
#> 1 Ontology for Biomedical Investigations
#> 2 Experimental Factor Ontology
#> 3 Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5 Plant Ontology
#> 6 Cataloque of Arabidopsis accessions at GMI
In a similar way all slots in an ISA object can be accessed and updated.
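For readers unfamiliar with the pattern, the sketch below shows how an access function and an update function with the same name can be defined for a slot of an S4 class. It uses a toy class, ToyISA, and a hypothetical function name, toyOSR(); it is not the implementation used in isatabr, only an illustration of the general S4 idiom.

```r
library(methods)

## Toy class with a single data.frame slot; NOT the ISA class from isatabr.
setClass("ToyISA", slots = c(oSR = "data.frame"))

## Access function (hypothetical name, chosen to avoid clashes with isatabr).
setGeneric("toyOSR", function(x) standardGeneric("toyOSR"))
setMethod("toyOSR", "ToyISA", function(x) x@oSR)

## Update function, enabling the syntax toyOSR(obj) <- value.
setGeneric("toyOSR<-", function(x, value) standardGeneric("toyOSR<-"))
setMethod("toyOSR<-", "ToyISA", function(x, value) {
  x@oSR <- value
  ## Check that the modified object is still a valid ToyISA object.
  validObject(x)
  x
})

## Create a toy object, then access, modify and update the slot.
toyObj <- new("ToyISA",
              oSR = data.frame("Term Source Name" = "OBI",
                               "Term Source Version" = "23",
                               check.names = FALSE,
                               stringsAsFactors = FALSE))
toyDf <- toyOSR(toyObj)
toyDf[1, "Term Source Version"] <- "24"
toyOSR(toyObj) <- toyDf
toyOSR(toyObj)
```

The three-step workflow (access, modify, update) is the same as the one shown above for oSR().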
Processing assay files
The assay files may contain information about the files used to store the actual data for the assay. Per assay file two types of data files may be referred to: 1) the file(s) containing the raw data, and 2) the file(s) containing derived data.
Looking at the assay tab file in our example data, we see that the Raw Data File column is empty, so no raw data files are available. However, the Derived Data File column shows the file d_data.txt.
## Inspect assay tab.
isaAFile <- aFiles(isaObject1)
head(isaAFile$a_study1.txt)
#> Sample Name Protocol REF Parameter Value[Organism part] Term Source REF Term Accession Number
#> 1 sample1 Phenotyping NA NA NA
#> 2 sample2 Phenotyping NA NA NA
#> 3 sample3 Phenotyping NA NA NA
#> 4 sample4 Phenotyping NA NA NA
#> 5 sample5 Phenotyping NA NA NA
#> 6 sample6 Phenotyping NA NA NA
#> Parameter Value[Trait Definition File] Assay Name Raw Data File Protocol REF
#> 1 tdf.txt assay1020 NA Data transformation
#> 2 tdf.txt assay1 NA Data transformation
#> 3 tdf.txt assay1131 NA Data transformation
#> 4 tdf.txt assay569 NA Data transformation
#> 5 tdf.txt assay293 NA Data transformation
#> 6 tdf.txt assay388 NA Data transformation
#> Derived Data File
#> 1 d_data.txt
#> 2 d_data.txt
#> 3 d_data.txt
#> 4 d_data.txt
#> 5 d_data.txt
#> 6 d_data.txt
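The same check can be done programmatically. The sketch below uses a small hypothetical data.frame, toyAssay, that mimics the two data file columns shown above, and lists the unique, non-missing file names per column. The content is made up; with real data the data.frame would come from aFiles().

```r
## Hypothetical assay table mimicking the data file columns shown above.
toyAssay <- data.frame("Sample Name" = c("sample1", "sample2"),
                       "Raw Data File" = c(NA_character_, NA_character_),
                       "Derived Data File" = c("d_data.txt", "d_data.txt"),
                       check.names = FALSE, stringsAsFactors = FALSE)

## Unique, non-missing file names per data file column.
dataCols <- c("Raw Data File", "Derived Data File")
referencedFiles <- lapply(toyAssay[dataCols], function(fileCol) {
  unique(fileCol[!is.na(fileCol)])
})
referencedFiles
```

An empty character vector for a column indicates that no files of that type are referenced.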
To read the contents of the data files, either raw or derived, referred to in the assay tab file, we can use the processAssay() function. The exact working of this function depends on the technology type of the assay. For most technology types the data files are read as plain .txt files assuming a tab-delimited format. Only mass spectrometry and microarray data files are read differently (see the sections below). As the output above shows, the assay file in the example has a Data Transformation technology and is therefore read as a tab-delimited file.
Before being able to process the assay file, i.e. read the data, we first have to extract the assay tabs using the getAssayTabs() function. This function extracts all the assay files from an ISA object and stores them as assayTab objects. These assayTab objects contain not only the content of the assay tab file, but also extra information, e.g. the technology type.
## Get assay tabs for isaObject1.
aTabObjects <- getAssayTabs(isaObject1)
## Process assay data.
isaDat <- processAssay(isaObject = isaObject1,
aTabObject = aTabObjects$s_study1.txt$a_study1.txt,
type = "derived")
## Display first rows and columns.
head(isaDat[, 1:10])
#> Assay Name LD LDV SD SDV FT10 FT16 FT22 Seed Dormancy Emco5
#> 1 assay152 6.84105 32.6 93.0417 4.97494 74 87 89 NA NA
#> 2 assay279 NA NA NA NA NA NA NA NA NA
#> 3 assay211 NA NA NA NA NA NA NA NA NA
#> 4 assay256 NA NA NA NA NA NA NA NA NA
#> 5 assay907 NA NA NA NA NA NA NA NA NA
#> 6 assay948 NA NA NA NA NA NA NA NA NA
The data is now stored in isaDat and can be used for further analysis within R.
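To illustrate what reading a plain tab-delimited data file amounts to, the sketch below writes a small made-up file to a temporary location and reads it back with read.delim() from base R. This is only an approximation of the idea; the reader and settings that processAssay() uses internally may differ.

```r
## Write a small, hypothetical tab-delimited data file to a temporary location.
toyFile <- tempfile(fileext = ".txt")
writeLines(c("Assay Name\tLD\tSeed Dormancy",
             "assay1\t6.8\tNA",
             "assay2\tNA\t12"),
           toyFile)

## check.names = FALSE keeps column names containing spaces unchanged.
toyDat <- read.delim(toyFile, check.names = FALSE, stringsAsFactors = FALSE)
str(toyDat)
```

Missing values written as NA in the file are read as NA in the resulting data.frame.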
Mass spectrometry assay files
Mass spectrometry data is often stored in Network Common Data Form (NetCDF) files, i.e. in .CDF files. Assays that refer to such files are processed differently from regular assay data. This requires the xcms package, which is available from Bioconductor.
As an example for the processing of mass spectrometry files we will use a subset of the quantitated LC/MS peaks from the spinal cords of 6 wild-type and 6 fatty acid amide hydrolase (FAAH) knockout mice described in Saghatelian et al. (2004). A more extensive version of this data set is available in the faahKO data package on Bioconductor.
## Read ISA-Tab files for faahKO.
isaObject3 <- readISATab(path = file.path(system.file("extdata/faahKO", package = "isatabr")))
After reading the ISA-Tab files, we can now process the mass spectrometry assay data. In this example the raw data is available, so when processing the assay we specify type = "raw". The rest of the code is similar to the previous section.
## Get assay tabs for isaObject3.
aTabObjects3 <- getAssayTabs(isaObject3)
## Process assay data.
isaDat3 <- processAssay(isaObject = isaObject3,
aTabObject = aTabObjects3$s_Proteomic_profiling_of_yeast.txt$a_metabolite.txt,
type = "raw")
## Display output.
isaDat3
#> An "xcmsSet" object with 1 samples
#>
#> Time range: 2506.1-4132.1 seconds (41.8-68.9 minutes)
#> Mass range: 200.1-599.3129 m/z
#> Peaks: 470 (about 470 per sample)
#> Peak Groups: 0
#> Sample classes:
#>
#> Feature detection:
#> o Peak picking performed on MS1.
#> Profile settings: method = bin
#> step = 0.1
#>
#> Memory usage: 0.0804 MB
As the output shows, processing the mass spectrometry data gives an object of class xcmsSet from the xcms package. This object contains all available information from the .CDF file that was read and can be used for further analysis.
Microarray assay files
Microarray data is often stored in Affymetrix Probe Results files. These .CEL files contain information on the probe sets' intensity values, where a probe set represents a gene. Assays that refer to such files are processed differently from regular assay data. This requires the affy package, which is available from Bioconductor.
Processing microarray data is done in much the same way as processing mass spectrometry data, as described in the previous section. The main difference is that the resulting object will in this case be an object of class ExpressionSet, which is used as input in many Bioconductor packages.
Writing files in the ISA-Tab format
After updating an ISA object, it can be written back to a directory using the writeISAtab() function. All content of the ISA object will be written to investigation, study and assay files following the ISA-Tab standard for file specification. By default the files are written to the current working directory, but the directory can be specified using the path argument.
## Write content of ISA object to a temporary directory.
writeISAtab(isaObject = isaObject1,
path = tempdir())
Note that existing files are always overwritten. Therefore, writing files to the directory from which the original files were read will result in the original files being overwritten.
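One way to avoid accidental overwriting is to check the target directory before writing. The sketch below, using base R only, creates a fresh output directory and looks for files that follow the ISA-Tab naming patterns. The helper existingISAFiles() is hypothetical and not part of isatabr; the call to writeISAtab() is left as a comment so the sketch runs without the package.

```r
## List files following the ISA-Tab naming patterns (hypothetical helper).
existingISAFiles <- function(dir) {
  list.files(dir, pattern = "^[isa]_.*\\.txt$")
}

## Create a fresh, empty output directory below the session's temp directory.
outDir <- tempfile(pattern = "isaOut")
dir.create(outDir)

## Only write when no ISA-Tab files are present in the target directory.
if (length(existingISAFiles(outDir)) == 0) {
  message("No ISA-Tab files found in ", outDir, "; safe to write here.")
  ## writeISAtab(isaObject = isaObject1, path = outDir)
} else {
  stop("Directory already contains ISA-Tab files: ", outDir)
}
```

Writing to a new directory per run keeps the original files untouched and makes it easy to compare the written files with the originals.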
Besides writing the full ISA object, it is also possible to write only the investigation file, one or more study files, or one or more assay files.
## Write investigation file.
writeInvestigationFile(isaObject = isaObject1,
path = tempdir())
## Write study file.
writeStudyFiles(isaObject = isaObject1,
studyFilenames = "s_study1.txt",
path = tempdir())
## Write assay file.
writeAssayFiles(isaObject = isaObject1,
assayFilenames = "a_study1.txt",
path = tempdir())