vignettes/challenges/bioplex.Rmd
Abstract
The BioPlex project has created two proteome-scale, cell-line-specific protein-protein interaction (PPI) networks: the first in 293T cells, including 120k interactions among 15k proteins; and the second in HCT116 cells, including 70k interactions between 10k proteins. The BioPlex R package (https://github.com/ccb-hms/BioPlex, submitted to Bioconductor) implements access to the BioPlex PPI networks and related resources from within R. Besides PPI networks for 293T and HCT116 cells, this includes access to CORUM protein complex data, and transcriptome and proteome data for the two cell lines. The goal of this challenge is to introduce the BioPlex data and package to the community, and work together on several analysis and programming challenges around the data including: (a) transcriptomic and proteomic data integration on the BioPlex networks, (b) assessing the impact of alternative splicing on bait proteins of the networks, (c) integration with public databases for disease-associated genes and variants, (d) implementing a GraphFrames backend for efficient representation and analysis of the networks, and (e) designing an R/Shiny graph viewer that allows flexible inspection of experimental data and metadata for nodes and edges of the networks.
A list of GitHub repositories that participate in this challenge.
repository | stargazers | latest.push |
---|---|---|
ccb-hms/BioPlex | 3 | 2021-10-29 16:42:10 |
ccb-hms/BioPlexAnalysis | 0 | 2021-09-20 23:29:08 |
ccb-hms/GraphViewer | 0 | 2021-09-28 17:13:32 |
Background: The BioPlex project has created two proteome-scale, cell-line-specific PPI networks: one for human embryonic kidney 293T cells, and one for human colon cancer HCT116 cells. The BioPlex R package serves transcriptome and proteome data for the two cell lines in dedicated Bioconductor data structures. This includes (i) transcriptome and proteome data for HCT116 from the Cancer Cell Line Encyclopedia (CCLE), and (ii) transcriptome and proteome data profiling differential expression between 293T and HCT116 cells from different sources.
Objective: Can we identify patterns driving (dis-)agreement between the transcriptome and proteome of 293T and HCT116 cells?
Proposed methods: The Transcriptome-Proteome analysis vignette implements a scaffold for integrative analysis of transcriptome and proteome data from both cell lines. This includes demonstration of how to obtain the corresponding datasets and initial results from (i) correlating HCT116 transcript levels and protein levels from CCLE, and (ii) correlating differential gene expression and differential protein abundance when comparing 293T against HCT116 cells.
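As a rough illustration of the correlation step, the following base R sketch computes a Spearman correlation between transcript and protein levels. It does not use the BioPlex package: the gene names and values are simulated stand-ins for matched measurements, and the rank-difference cutoff for flagging discordant genes is an arbitrary assumption.

```r
# Simulated stand-in for matched measurements; NOT real CCLE values.
set.seed(42)
n <- 500
genes <- paste0("GENE", seq_len(n))                       # hypothetical gene symbols
rna <- setNames(rnorm(n, mean = 5, sd = 2), genes)        # e.g. log2 transcript level
prot <- setNames(0.6 * rna + rnorm(n, sd = 1.5), genes)   # e.g. log2 protein level

# Mimic incomplete proteome coverage: drop some proteins and shuffle the order.
prot <- sample(prot[-(1:50)])

# Restrict to genes measured on both levels and align them by name.
common <- intersect(names(rna), names(prot))
res <- cor.test(rna[common], prot[common], method = "spearman", exact = FALSE)
rho <- unname(res$estimate)

# Flag genes whose rank differs strongly between the two levels
# (the cutoff of half the number of genes is an arbitrary choice).
rank.diff <- rank(prot[common]) - rank(rna[common])
discordant <- names(rank.diff)[abs(rank.diff) > length(common) / 2]
```

Aligning by name before correlating matters here because the two data types rarely cover the same genes in the same order.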
Given that HCT116 is a cancer cell line, an important driver of transcriptomic and proteomic expression could be somatic copy number alteration. One approach to this problem would thus be to obtain copy number data for HCT116 from cBioPortal (the cBioPortalData package provides an R/Bioconductor interface for accessing cBioPortal study data, including CCLE). A simplified version of this would be to obtain pre-computed gene sets of high and low copy number in HCT116 from Enrichr and test for their enrichment in genes/proteins that have high or low expression in HCT116 cells (i) individually, and (ii) when compared to 293T cells.
More generally, standard gene set enrichment analysis of, e.g., GO terms and KEGG pathways could be employed to identify themes in the sets of genes showing agreement for (i) transcript levels and protein expression in HCT116 CCLE samples, and (ii) differential gene and protein expression between 293T and HCT116 cells from different sources.
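One simple way to test a gene set for enrichment is an over-representation analysis based on Fisher's exact test. The sketch below uses base R only; the gene universe, the gene set, and the list of highly expressed genes are simulated, and the helper `ora` is a hypothetical name introduced for illustration, not a function of the BioPlex package.

```r
set.seed(1)
universe <- paste0("GENE", 1:2000)      # all genes tested (hypothetical)
gene.set <- sample(universe, 100)       # stand-in for e.g. a copy number gene set

# Simulated "highly expressed" genes, enriched for the gene set by construction.
high.expr <- union(sample(gene.set, 40),
                   sample(setdiff(universe, gene.set), 160))

# Over-representation analysis: is `set` over-represented among `selected`?
ora <- function(selected, set, universe) {
  in.set <- universe %in% intersect(set, universe)
  in.sel <- universe %in% intersect(selected, universe)
  # 2 x 2 contingency table with the overlap in cell [1, 1]
  tab <- table(in.set = factor(in.set, levels = c(TRUE, FALSE)),
               in.sel = factor(in.sel, levels = c(TRUE, FALSE)))
  ft <- fisher.test(tab, alternative = "greater")
  c(overlap = tab[1, 1],
    odds.ratio = unname(ft$estimate),
    p.value = ft$p.value)
}

res <- ora(high.expr, gene.set, universe)
```

When many gene sets are tested this way, the resulting p-values would additionally need correction for multiple testing, for instance with `p.adjust`.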
What a successful result would look like: Pull request to the BioPlexAnalysis GitHub repository that extends the Transcriptome-Proteome analysis vignette. Pull requests will be reviewed and discussed. Accepted contributions will be acknowledged.
Potential follow-up work: Analyze PPI networks for differential wiring in areas of differential expression between 293T and HCT116 cells.
Background: Network propagation of genetic disease associations is an approach for identifying potential drug targets (MacNamara et al., 2020). Such approaches typically leverage publicly available collections of genes and variants associated with human diseases such as OpenTargets. OpenTargets integrates data from expert curated repositories, GWAS catalogues, animal models, and the scientific literature. It has been shown that protein networks from specific functional linkages such as protein complexes are suitable for simple guilt-by-association network propagation approaches (MacNamara et al., 2020). Global protein–protein interaction networks typically require more sophisticated approaches such as HotNet2.
Objective: Application of network propagation methods to CORUM complexes and BioPlex networks using genetic disease association scores from OpenTargets.
Proposed methods: A list of all OpenTargets datasets is available from the Platform Data Downloads page. Example scripts for accessing the data from within R are available here. Following MacNamara et al., 2020, one would first use the scored gene-disease associations from OpenTargets to obtain high-confidence genetic hits (HCGHs). Proxy gene sets of HCGHs can then be defined by identifying genes that (i) share a protein complex with an HCGH, (ii) are annotated to the same KEGG or REACTOME pathway, (iii) are first- or second-degree neighbors in a BioPlex network, or (iv) are found in a network module identified with approaches such as BioNet or HotNet2.
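The complex-based and neighbor-based proxy definitions can be sketched in base R as follows. The edge list, the complexes, and the single HCGH are toy placeholders rather than real BioPlex, CORUM, or OpenTargets data, and the helper `neighbors` is a hypothetical function introduced for illustration.

```r
# Toy edge list; symbols are placeholders, not real BioPlex interactions.
edges <- data.frame(
  bait = c("A", "A", "B", "C", "E", "F"),
  prey = c("B", "C", "D", "E", "F", "G"),
  stringsAsFactors = FALSE
)
# Toy protein complexes (stand-in for CORUM).
complexes <- list(cpx1 = c("A", "H", "I"), cpx2 = c("D", "J"))
hcghs <- c("A")   # hypothetical high-confidence genetic hits

# Direct neighbors of a set of nodes, ignoring edge direction.
neighbors <- function(nodes, edges) {
  nb <- c(edges$prey[edges$bait %in% nodes],
          edges$bait[edges$prey %in% nodes])
  setdiff(unique(nb), nodes)
}

# (iii) first- and second-degree neighbors of the HCGHs
first <- neighbors(hcghs, edges)
second <- setdiff(neighbors(c(hcghs, first), edges), first)

# (i) genes sharing a protein complex with an HCGH
shares <- vapply(complexes, function(m) any(m %in% hcghs), logical(1))
complex.proxies <- setdiff(unique(unlist(complexes[shares], use.names = FALSE)),
                           hcghs)
```

In this toy example, B and C are first-degree neighbors of A, D and E are second-degree neighbors, and H and I are complex-based proxies.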
What a successful result would look like: Pull request to the BioPlexAnalysis GitHub repository outlining the approach in a separate network propagation vignette (`.Rmd` file in the `vignettes` folder). Pull requests will be reviewed and discussed. Accepted contributions will be acknowledged.
Potential follow-up work: Discussion of how to best encapsulate the OpenTargets gene-disease associations in a package with a sparklyr backend.
Background: GraphFrames is a package for Apache Spark that provides a DataFrame-based API for working with graphs. Functionality includes motif finding and common graph algorithms, such as PageRank and breadth-first search.
Objective: Can we refactor the graph backend of the BioPlex data package using `graphframes`?
Proposed methods: The graph backend of the BioPlex data package is currently based on the Bioconductor graph package. The function `bioplex2graph` turns an ordinary `data.frame` storing the BioPlex PPI data into a `graphNEL` object, where bait and prey relationships are represented by directed edges from bait to prey. Data for individual nodes/proteins is stored in the `nodeData` slot, and data for the edges/interactions is stored in the `edgeData` slot. We might also want to store graph-level metadata such as cell line, network version, and PMID of the primary publication associated with a PPI network, among others. The function `bioplex2graph` will need to be adapted to be based on `GraphFrames` instead.
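GraphFrames represents a graph as two tables: a vertex table with an `id` column and an edge table with `src` and `dst` columns. A possible first step of such a refactoring is therefore to split the PPI `data.frame` into these two tables. The base R sketch below illustrates this under several assumptions: the helper name `ppi2tables`, the input column names, the toy values, and the choice to keep graph-level metadata in a plain list are all hypothetical and not part of the BioPlex package.

```r
# Toy PPI table; column names and values are assumptions for illustration.
ppi <- data.frame(
  SymbolA = c("BAIT1", "BAIT1", "BAIT2"),   # bait
  SymbolB = c("PREY1", "PREY2", "BAIT1"),   # prey
  pInt = c(0.99, 0.85, 0.92),               # edge attribute
  stringsAsFactors = FALSE
)

# Split a PPI data.frame into vertex and edge tables in GraphFrames layout.
ppi2tables <- function(ppi, bait.col = "SymbolA", prey.col = "SymbolB",
                       metadata = list()) {
  ids <- unique(c(ppi[[bait.col]], ppi[[prey.col]]))
  # vertex table: one row per protein, identified by the `id` column
  vertices <- data.frame(id = ids,
                         is_bait = ids %in% ppi[[bait.col]],
                         stringsAsFactors = FALSE)
  # edge table: directed from bait (`src`) to prey (`dst`),
  # with all remaining columns carried along as edge attributes
  edge.cols <- setdiff(colnames(ppi), c(bait.col, prey.col))
  edges <- data.frame(src = ppi[[bait.col]],
                      dst = ppi[[prey.col]],
                      ppi[, edge.cols, drop = FALSE],
                      stringsAsFactors = FALSE)
  # graph-level metadata is kept alongside the tables as a plain list
  list(vertices = vertices, edges = edges, metadata = metadata)
}

tabs <- ppi2tables(ppi, metadata = list(cell.line = "293T"))
```

With a Spark connection available, the two tables could then be copied to Spark (for example with `copy_to()` from sparklyr) and combined into a graph, for instance with the `gf_graphframe()` constructor of the graphframes R package. This step is omitted here because it requires a Spark installation.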
What a successful result would look like: Pull request to the BioPlex GitHub repository on a new branch (named graphframes). Pull requests will be reviewed and discussed. Contributions will be acknowledged.
Potential follow-up work: Several functions of the BioPlex package add data via the graph API to the `nodeData` and the `edgeData` of a BioPlex graph (typically obtained via `bioplex2graph`). This includes (i) the `annotatePFAM` function, which adds PFAM protein domain annotations to the nodes, and (ii) the `mapSummarizedExperimentOntoGraph` function, which transfers `assay` data and/or `rowData` annotations from the rows of a `SummarizedExperiment` to the node data of a graph. The implementation of these functions will need to be adapted to make use of the `GraphFrames` API instead.
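With a table-based graph representation, adding node annotations amounts to a left join of the vertex table with an annotation table. The base R sketch below shows the idea on toy data; the helper `annotateVertices`, the `log2fc` column, and the protein IDs are hypothetical and do not reflect the existing BioPlex functions.

```r
# Toy vertex table and node-level annotation (e.g. derived from rowData).
vertices <- data.frame(id = c("P1", "P2", "P3"), stringsAsFactors = FALSE)
node.anno <- data.frame(id = c("P3", "P1"),
                        log2fc = c(-1.2, 0.8),
                        stringsAsFactors = FALSE)

# Left join: keep all vertices in their original order,
# and fill in NA where no annotation is available.
annotateVertices <- function(vertices, anno, by = "id") {
  idx <- match(vertices[[by]], anno[[by]])
  out <- vertices
  for (cl in setdiff(colnames(anno), by)) {
    out[[cl]] <- anno[[cl]][idx]
  }
  out
}

annotated <- annotateVertices(vertices, node.anno)
```

On Spark DataFrames, the same operation would be expressed as a join on the `id` column rather than with `match`.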
Background: A variety of data and annotations are available for the nodes (= proteins) and the edges (= PPIs) of the BioPlex networks. As illustrated in the BioPlex data retrieval vignette, this includes (i) for the nodes: transcript and protein abundance, results of differential expression analysis, disease association scores, etc., and (ii) for the edges: confidence scores, interaction probabilities, etc. On the other hand, graph algorithms such as maximum scoring subnetwork analysis, potentially combined with gene set enrichment analysis, typically produce subnetwork/pathway-sized graphs of interest prompting further interactive exploration.
Objective: Setting up an R/Shiny graph viewer that allows exploration of data and annotations for nodes and edges of the networks.
Proposed methods: The GraphViewer GitHub repository contains a scaffold implementation based on ggnetwork. Alternatives that implement a ggplot2 approach to the visualization of graphs and networks, such as ggraph, exist. The app is supposed to take subnetwork/pathway-sized graphs of interest as input, and is expected to support overlay of node attributes and edge attributes using ggplot2 grammar. The UI should be minimal: one drop-down for the node data attribute to display on the graph, and one drop-down for the edge data attribute to display on the graph. Upload of a serialized graph object by the user should be supported.
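The following sketch outlines what such a minimal two-drop-down UI could look like. It is not the GraphViewer scaffold: the toy node and edge tables, the circular layout, and all helper names are assumptions made for illustration. ggnetwork or ggraph would provide proper layouts, and file upload is omitted. The app is only defined if shiny and ggplot2 are installed.

```r
# Toy subnetwork: node and edge tables with made-up attributes.
nodes <- data.frame(id = c("A", "B", "C", "D"),
                    log2fc = c(1.5, -0.7, 0.2, -1.1),
                    abundance = c(10, 12, 8, 9),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("A", "A", "B"), to = c("B", "C", "D"),
                    pInt = c(0.99, 0.80, 0.90), pNI = c(0.01, 0.10, 0.05),
                    stringsAsFactors = FALSE)

# Place nodes evenly on the unit circle.
circleLayout <- function(nodes) {
  angle <- 2 * pi * (seq_len(nrow(nodes)) - 1) / nrow(nodes)
  cbind(nodes, x = cos(angle), y = sin(angle))
}

# Attach start and end coordinates to each edge.
edgeCoords <- function(edges, layout) {
  i <- match(edges$from, layout$id)
  j <- match(edges$to, layout$id)
  cbind(edges, x = layout$x[i], y = layout$y[i],
        xend = layout$x[j], yend = layout$y[j])
}

# Draw the graph, mapping one node attribute to colour
# and one edge attribute to transparency.
plotGraph <- function(nodes, edges, node.attr, edge.attr) {
  lay <- circleLayout(nodes)
  ec <- edgeCoords(edges, lay)
  lay$node.value <- lay[[node.attr]]
  ec$edge.value <- ec[[edge.attr]]
  ggplot2::ggplot() +
    ggplot2::geom_segment(
      ggplot2::aes(x = x, y = y, xend = xend, yend = yend, alpha = edge.value),
      data = ec) +
    ggplot2::geom_point(ggplot2::aes(x = x, y = y, colour = node.value),
                        data = lay, size = 8) +
    ggplot2::geom_text(ggplot2::aes(x = x, y = y, label = id), data = lay) +
    ggplot2::labs(colour = node.attr, alpha = edge.attr) +
    ggplot2::theme_void()
}

if (requireNamespace("shiny", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::selectInput("node_attr", "Node attribute",
                       setdiff(colnames(nodes), "id")),
    shiny::selectInput("edge_attr", "Edge attribute",
                       setdiff(colnames(edges), c("from", "to"))),
    shiny::plotOutput("graph")
  )
  server <- function(input, output, session) {
    output$graph <- shiny::renderPlot(
      plotGraph(nodes, edges, input$node_attr, input$edge_attr)
    )
  }
  app <- shiny::shinyApp(ui, server)   # launch with shiny::runApp(app)
}
```

Keeping the plotting logic in a plain function such as `plotGraph` makes it testable outside the app and easy to swap for a ggnetwork- or ggraph-based implementation.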
What a successful result would look like: Pull request to the GraphViewer GitHub repository. Pull requests will be reviewed and discussed. Contributions will be acknowledged.
Potential follow-up work: Integration of the graph viewer as a new panel type for iSEE, which takes a panel-based approach to visualizing assay data, column data, and row data of a `SummarizedExperiment`. Extensions for individual panels for visualizing node data and edge data of a graph could be modeled after the `RowDataPlot` and `ColumnDataPlot` classes.