The pypath tutorial collection

The OmniPath webpage (http://omnipathdb.org/) used to host a few tutorials for pypath. However, pypath has been developed extensively over the past years, and recently a number of important points in the interface changed (although we tried to keep compatibility as much as possible). This new, comprehensive tutorial will replace the ones currently on the webpage.

1: Quick start – How do I build OmniPath data with pypath?

pypath provides an easy way to build the OmniPath network as described in our paper. The first time this will take several minutes, because all data will be downloaded from the original providers. The next time pypath will use the data from its cache directory, so the network will build much faster. If you want to load it even faster, you can save it into a pickle dump.

In [1]:
	=== d i s c l a i m e r ===

	All data accessed through this module,
	either as redistributed copy or downloaded using the
	programmatic interfaces included in the present module,
	are free to use at least for academic research or
	education purposes.
	Please be aware of the licenses of all the datasets
	you use in your analysis, and please give appropriate
	credits for the original sources when you publish your
	results. To find out more about data sources please
	look at `pypath.descriptions` or
	http://omnipathdb.org/info and 
	`pypath.data_formats.urls`.

[2019-06-06 12:41:27] [pypath] 
	- session ID: `o9jzg`
	- working directory: `/home/denes/Dokumentumok/norwich2019`
	- logfile: `/home/denes/Dokumentumok/norwich2019/pypath_log/pypath-o9jzg.log`
	- pypath version: 0.8.13

In [ ]:

2: Quick start – I just want a network quickly to play around with pypath

You can find the predefined formats in the pypath.data_formats module. For example, to load one resource from there, let's say Signor:

In [2]:

Or to load all activity flow resources with literature references:

In [3]:
In [4]:

Or to load all activity flow resources, including the ones without literature references:

In [5]:

3: Quick start – How do I build networks from any data with pypath?

Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don't need to worry about merging redundant elements, or about different formats and identifiers. Let's say you have two files with network data:

network1.csv

entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation

network2.sif

EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B

Note: you need to create these files in order to load them.
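
To make this concrete, here is a minimal sketch in plain Python (it does not use pypath). It writes the two example files into a temporary directory, then reads them back and merges the redundant interactions. The helper names and the small Entrez to Gene Symbol table are assumptions made only for this illustration; pypath does the translation with its own mapping tables and stores the result in its own data structures.

```python
import csv
import os
import tempfile

# File contents copied from the two listings above.
NETWORK1_CSV = (
    'entrezA,entrezB,effect\n'
    '1950,1956,inhibition\n'
    '5290,207,stimulation\n'
    '207,2932,inhibition\n'
    '1956,5290,stimulation\n'
)
NETWORK2_SIF = (
    'EGF + EGFR\n'
    'EGFR + PIK3CA\n'
    'EGFR + SOS1\n'
    'PIK3CA + RAC1\n'
    'RAC1 + MAP3K1\n'
    'SOS1 + HRAS\n'
    'HRAS + MAP3K1\n'
    'PIK3CA + AKT1\n'
    'AKT1 - GSK3B\n'
)

# Toy Entrez Gene ID -> Gene Symbol table, only for the IDs in network1.csv.
# This stands in for real ID translation and is an assumption of the sketch.
ENTREZ_TO_SYMBOL = {
    '1950': 'EGF',
    '1956': 'EGFR',
    '5290': 'PIK3CA',
    '207': 'AKT1',
    '2932': 'GSK3B',
}


def write_example_files(directory):
    """Write both example files into `directory`; return {filename: path}."""
    contents = {'network1.csv': NETWORK1_CSV, 'network2.sif': NETWORK2_SIF}
    paths = {}
    for filename, content in contents.items():
        paths[filename] = os.path.join(directory, filename)
        with open(paths[filename], 'w') as fp:
            fp.write(content)
    return paths


def read_csv_edges(path):
    """Yield (source, target, sign) from the CSV file, as Gene Symbols."""
    with open(path, newline='') as fp:
        for row in csv.DictReader(fp):
            sign = '+' if row['effect'] == 'stimulation' else '-'
            yield (ENTREZ_TO_SYMBOL[row['entrezA']],
                   ENTREZ_TO_SYMBOL[row['entrezB']], sign)


def read_sif_edges(path):
    """Yield (source, target, sign) from the 'A + B' / 'A - B' lines."""
    with open(path) as fp:
        for line in fp:
            fields = line.split()
            if len(fields) == 3:
                source, sign, target = fields
                yield source, target, sign


def merge_edges(edges_by_resource):
    """Collapse duplicates: {(source, target): {sign: set of resources}}."""
    merged = {}
    for resource, edges in edges_by_resource.items():
        for source, target, sign in edges:
            by_sign = merged.setdefault((source, target), {})
            by_sign.setdefault(sign, set()).add(resource)
    return merged


workdir = tempfile.mkdtemp()
paths = write_example_files(workdir)
merged = merge_edges({
    'network1': read_csv_edges(paths['network1.csv']),
    'network2': read_sif_edges(paths['network2.sif']),
})
print(len(merged), 'unique interactions')
print(merged[('EGF', 'EGFR')])
```

The 13 interactions in the two files collapse into 9 unique pairs. Note that the two files disagree about the sign of EGF and EGFR; the sketch keeps both claims together with the file each one comes from, rather than silently picking one.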

3a: Defining input formats

In [6]:

3b: Creating PyPath object and loading the 2 test files

In [7]:

4: Plotting the network with igraph

Here we use the network created above (because it is of a reasonable size, unlike the networks we could get from most of the network databases). Igraph has excellent plotting capabilities built on top of the cairo library.

In [8]:
Out[8]:

5: Building networks

For this you will need the PyPath class from the pypath.main module, which takes care of building and querying the network. You also need the pypath.data_formats module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc.). These input settings tell pypath how to download and process the data.

In [20]:

For example data_formats.pathway is a collection of databases which fit into the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:

In [9]:
Out[9]:
{'trip': <pypath.input_formats.ReadSettings at 0x6da2497bc940>,
 'spike': <pypath.input_formats.ReadSettings at 0x6da2497bc9b0>,
 'signalink3': <pypath.input_formats.ReadSettings at 0x6da2497bc9e8>,
 'guide2pharma': <pypath.input_formats.ReadSettings at 0x6da2497bca20>,
 'ca1': <pypath.input_formats.ReadSettings at 0x6da2497bca58>,
 'arn': <pypath.input_formats.ReadSettings at 0x6da2497bcac8>,
 'nrf2': <pypath.input_formats.ReadSettings at 0x6da2497bcb00>,
 'macrophage': <pypath.input_formats.ReadSettings at 0x6da2497bca90>,
 'death': <pypath.input_formats.ReadSettings at 0x6da2497bcb38>,
 'pdz': <pypath.input_formats.ReadSettings at 0x6da2497bcb70>,
 'signor': <pypath.input_formats.ReadSettings at 0x6da2497bcba8>,
 'adhesome': <pypath.input_formats.ReadSettings at 0x6da2497bcbe0>,
 'hpmr': <pypath.input_formats.ReadSettings at 0x6da2497c0908>,
 'cellphonedb': <pypath.input_formats.ReadSettings at 0x6da2497c09e8>,
 'ramilowski2015': <pypath.input_formats.ReadSettings at 0x6da2497c0ac8>}

You can pass such a dictionary to the init_network method of the PyPath object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default in ~/.pypath/cache in your home directory. For this reason, loading a resource for the first time might take long, but the next time it will be faster because the data is fetched from the cache. First create a pypath.main.PyPath object, then build the network:

In [10]:

You can add more resource sets in a similar way:

In [23]:

To load a single resource, simply create a one-element dict:

In [24]:

5a: Which network datasets are pre-defined in pypath?

You can find all the pre-defined datasets in the pypath.data_formats module. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; since then, however, we have added a great variety of other kinds of resource definitions. Here we give an overview of these.

  • data_formats.pathway: activity flow networks with literature references
  • data_formats.activity_flow: synonym for pathway
  • data_formats.pathway_noref: activity flow networks without literature references
  • data_formats.pathway_all: all activity flow data
  • data_formats.ptm: enzyme-substrate interaction networks with literature references
  • data_formats.enzyme_substrate: synonym for ptm
  • data_formats.ptm_noref: enzyme-substrate networks without literature references
  • data_formats.ptm_all: all enzyme-substrate data
  • data_formats.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)
  • data_formats.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)
  • data_formats.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath
  • data_formats.transcription: transcriptional regulation only from the DoRothEA data
  • data_formats.mirna_target: miRNA-mRNA interactions from literature curated resources
  • data_formats.tf_mirna: transcriptional regulation of miRNA from literature curated resources
  • data_formats.lncrna_protein: lncRNA-protein interactions from literature curated datasets
  • data_formats.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources
  • data_formats.pathwaycommons: the PathwayCommons database
  • data_formats.reaction: process description databases; not guaranteed to work at this moment
  • data_formats.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment
  • data_formats.small_molecule_protein: signaling interactions between small molecules and proteins

To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:

In [17]:
Out[17]:
dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'hpmr', 'cellphonedb', 'ramilowski2015'])
In [19]:
Out[19]:
['TRIP',
 'SPIKE',
 'SignaLink3',
 'Guide2Pharma',
 'CA1',
 'ARN',
 'NRF2ome',
 'Macrophage',
 'DeathDomain',
 'PDZBase',
 'Signor',
 'Adhesome',
 'HPMR',
 'CellPhoneDB',
 'Ramilowski2015']

6: How to access the network

Once you have built a network, you can use it for various purposes and write your own scripts for further processing or analysis. The network is represented by an igraph object (igraph.org):

In [25]:
Out[25]:
<igraph.Graph at 0x6ee60f2c7318>

Number of edges and nodes:

In [12]:
Out[12]:
(22101, 5184)

You can access the edge and vertex sequences in the es and vs attributes; you can iterate over these or index them by integers. The edge and vertex attributes are accessible by string keys. For example, to get the sources of edge 0:

In [15]:
Out[15]:
{'SPIKE', 'SignaLink3'}

7: Directions and signs

By default the igraph object is undirected but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can create it manually:

In [40]:

You find the directed network in the pa.dgraph attribute:

In [41]:
Out[41]:
<igraph.Graph at 0x6ee649d04318>

Now let's take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select a random edge:

In [54]:

The Direction object is in the dirs edge attribute:

In [55]:

It has a method to print its content in a human-readable way:

In [56]:
Directions and signs of interaction between Q13489 and Q13546

	Q13489 ===> Q13546 :: SPIKE, SignaLink3
	Q13489 <=== Q13546 :: SignaLink3
	Q13489 =+=> Q13546 :: SPIKE

From this we see that the databases SPIKE and SignaLink3 agree that protein Q13489 has an effect on Q13546, and SPIKE in addition tells us this effect is stimulatory; SignaLink3 also reports the opposite direction. In your scripts you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions either straight or reverse:

In [57]:
Out[57]:
('Q13489', 'Q13546')
In [58]:
Out[58]:
('Q13546', 'Q13489')

It can tell you if one of these directions is supported by any of the network resources:

In [59]:
Out[59]:
True

Or it can return those resources:

In [60]:
Out[60]:
{'SPIKE', 'SignaLink3'}

The opposite direction is supported by only one resource:

In [61]:
Out[61]:
{'SignaLink3'}

The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.

In [62]:
Out[62]:
[True, False]

Or you can ask whether it is inhibition:

In [63]:
Out[63]:
False

Or if the interaction is directed at all:

In [64]:
Out[64]:
True

Sometimes resources don't agree: for example, one says an interaction is an inhibition while according to others it is a stimulation; or one says A affects B and another resource says the other way around. Here we preserve all this potentially contradictory information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method to get a consensus direction and sign, which returns the attributes most resources agree on:

In [65]:
Out[65]:
[['Q13489', 'Q13546', 'directed', 'positive']]
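
The idea of a consensus can be illustrated with a small majority vote in plain Python. This is only a sketch under our own assumptions, not the code pypath runs: the function name, the data layout and the handling of ties (all tied directions are returned, a tied sign is reported as unknown) are made up for this example. The input reproduces the resources printed for the example edge above.

```python
straight = ('Q13489', 'Q13546')
reverse = ('Q13546', 'Q13489')

# Resources supporting each direction, as printed for the example edge.
directions = {
    straight: {'SPIKE', 'SignaLink3'},
    reverse: {'SignaLink3'},
}

# Resources supporting each sign, per direction.
signs = {
    straight: {'positive': {'SPIKE'}, 'negative': set()},
    reverse: {'positive': set(), 'negative': set()},
}


def consensus_edges(directions, signs):
    """Return the direction(s) and sign supported by the most resources."""
    best = max(len(resources) for resources in directions.values())
    if best == 0:
        # No resource supports any direction.
        return []
    result = []
    for (source, target), resources in directions.items():
        if len(resources) < best:
            continue
        n_positive = len(signs[(source, target)]['positive'])
        n_negative = len(signs[(source, target)]['negative'])
        if n_positive > n_negative:
            sign = 'positive'
        elif n_negative > n_positive:
            sign = 'negative'
        else:
            sign = 'unknown'
        result.append([source, target, 'directed', sign])
    return result


consensus = consensus_edges(directions, signs)
print(consensus)
```

With two resources for the straight direction against one for the reverse, and one positive sign against no negative ones, the vote gives the same direction and sign as the output above.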

8: Accessing nodes in the network

In igraph the vertices are numbered, but this numbering can change during certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.

In [66]:
Out[66]:
['P63000', 'O00161', 'Q9GZU1', 'Q96H20', 'Q9NWB7']
In [67]:
Out[67]:
['RAC1', 'SNAP23', 'MCOLN1', 'SNF8', 'IFT57']

The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:

In [68]:
Out[68]:
igraph.Vertex

Similarly genesymbol or gs for Gene Symbols:

In [36]:
Out[36]:
igraph.Vertex

Each of these has a "plural" version:

In [69]:
Out[69]:
3

And a generic method where you can mix UniProts and Gene Symbols:

In [70]:
Out[70]:
2

9: Querying relationships with or without causality

Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which others are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:

In [71]:
Out[71]:
<pypath.main._NamedVertexSeq at 0x6ee604b0a8c8>

It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:

In [72]:
Out[72]:
['NTRK1', 'SRC', 'GAB1', 'PTPN11', 'NRAS']
In [73]:
Out[73]:
['P04629', 'P12931', 'Q13480', 'Q06124', 'P01111']

Note that the names of these methods can be a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:

In [74]:
Out[74]:
['MTOR', 'AKT1']
In [75]:
Out[75]:
True

There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to a list of Gene Symbols, .up() to UniProts and .ids() to vertex IDs; by default it yields igraph.Vertex objects:

In [76]:
Out[76]:
[0, 32, 38, 50, 69]
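
To make the meaning of these method names explicit, the sketch below reimplements a few of them as plain Python functions over the small signed network from network2.sif in section 3. The functions only borrow the names of the pypath methods; their signatures and return types (sorted lists of Gene Symbols) are assumptions of this illustration, not the pypath API.

```python
# Signed, directed edges of the toy network from network2.sif.
EDGES = [
    ('EGF', 'EGFR', '+'),
    ('EGFR', 'PIK3CA', '+'),
    ('EGFR', 'SOS1', '+'),
    ('PIK3CA', 'RAC1', '+'),
    ('RAC1', 'MAP3K1', '+'),
    ('SOS1', 'HRAS', '+'),
    ('HRAS', 'MAP3K1', '+'),
    ('PIK3CA', 'AKT1', '+'),
    ('AKT1', 'GSK3B', '-'),
]


def stimulated_by(node):
    """Positive regulators of `node` (sources of '+' edges pointing to it)."""
    return sorted(s for s, t, sign in EDGES if t == node and sign == '+')


def stimulates(node):
    """Nodes positively regulated by `node` (targets of its '+' edges)."""
    return sorted(t for s, t, sign in EDGES if s == node and sign == '+')


def inhibited_by(node):
    """Negative regulators of `node`."""
    return sorted(s for s, t, sign in EDGES if t == node and sign == '-')


def inhibits(node):
    """Nodes negatively regulated by `node`."""
    return sorted(t for s, t, sign in EDGES if s == node and sign == '-')


def affected_by(node):
    """All regulators of `node`, regardless of the sign."""
    return sorted(s for s, t, sign in EDGES if t == node)


def neighbors(node):
    """All interaction partners, regardless of direction and sign."""
    partners = {t for s, t, sign in EDGES if s == node}
    partners |= {s for s, t, sign in EDGES if t == node}
    return sorted(partners)


print(stimulated_by('PIK3CA'))
print(stimulates('PIK3CA'))
print(neighbors('EGFR'))
```

In this toy network, stimulated_by looks upstream of the query protein and stimulates looks downstream of it, which is the distinction the paragraph above describes.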

Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):

In [77]:
['ATG3', 'GABARAP', 'ATG5', 'GABARAPL2', 'ATG12', 'ATG7', 'CFLAR', 'MAP1LC3B', 'MAP1LC3A', 'TP63']
In [78]:
['ATG3', 'GABARAP', 'ATG5', 'GABARAPL2', 'ATG12', 'ATG7', 'CFLAR', 'MAP1LC3B', 'MAP1LC3A', 'TP63', 'TRPV1', 'CLTC', 'FNBP1', 'NBR1', 'BNIP3L', 'ATG13', 'SQSTM1', 'RB1CC1', 'FYCO1', 'ATG4B', 'ULK1', 'ULK2', 'DVL2', 'OPTN', 'IFIH1', 'BCL2L1', 'ATF4', 'TP73', 'WDFY3', 'CAPN2', 'FADD', 'CAPN1', 'ATG10', 'DDX58', 'DDIT3', 'MAVS', 'ATG16L1', 'ATG16L2', 'TECPR1', 'PPHLN1', 'COX5B', 'UBA5', 'NEK9', 'ATG4A', 'BNIP3', 'NIPSNAP2', 'EP300', 'FOXO1', 'HSF1', 'TAX1BP3', 'ITCH', 'RIPK1', 'FAS', 'NFKB1', 'PRKCB', 'RIPK2', 'TRAF2', 'AR', 'CASP8', 'AKT1', 'MAP3K14', 'CASP10', 'PRKACA', 'MAP1B', 'EGR1', 'MAPK8', 'KEAP1', 'ZKSCAN3', 'TFEB', 'P27791', 'TBC1D5', 'E2F1', 'MAP1A', 'RAB3GAP1', 'HNRNPAB', 'FBXW7', 'ATM', 'TP53', 'MDM2', 'RPS6KB1', 'CDK2', 'IKBKB', 'ATG9A', 'BECN1']
In [79]:
Out[79]:
1735
In [80]:
Out[80]:
5344
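
How quickly a neighborhood grows can be seen even on a tiny example. The following sketch runs a breadth-first search over the undirected version of the toy network from network2.sif. It is a generic illustration written for this tutorial, with a hypothetical neighborhood function; it is not how pypath or igraph implement it.

```python
from collections import deque

# Undirected interactions of the toy network from network2.sif.
PAIRS = [
    ('EGF', 'EGFR'), ('EGFR', 'PIK3CA'), ('EGFR', 'SOS1'),
    ('PIK3CA', 'RAC1'), ('RAC1', 'MAP3K1'), ('SOS1', 'HRAS'),
    ('HRAS', 'MAP3K1'), ('PIK3CA', 'AKT1'), ('AKT1', 'GSK3B'),
]

# Build an adjacency map: node -> set of direct neighbors.
adjacency = {}
for a, b in PAIRS:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def neighborhood(adjacency, start, steps):
    """Nodes reachable from `start` in at most `steps` edges (without `start`)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == steps:
            # Do not expand beyond the requested number of steps.
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    seen.discard(start)
    return seen


sizes = [len(neighborhood(adjacency, 'EGF', steps)) for steps in (1, 2, 3, 4)]
print(sizes)
```

Starting from EGF, the neighborhood covers the whole 9-node toy network after 4 steps. In a network with thousands of nodes the same effect shows up as the jump between the two numbers above.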

10: Accessing edges by identifiers

Just like nodes, edges can also be accessed by identifiers such as Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.

In [81]:
Out[81]:
igraph.Edge
In [82]:
Out[82]:
igraph.Edge
In [83]:
Out[83]:
NoneType
In [84]:
Directions and signs of interaction between P00533 and P01133

	P00533 <=== P01133 :: SPIKE, HPMR, SignaLink3
	P00533 <=+= P01133 :: SPIKE, SignaLink3

11: Literature references

Select a random edge and in the references attribute you find a list of references:

In [86]:
Out[86]:
[<pypath.refs.Reference at 0x6ee605f6dd98>,
 <pypath.refs.Reference at 0x6ee605f6dd68>]

Each reference has a PubMed ID:

In [132]:
Out[132]:
'17580304'
In [133]:

The references can also be listed by the database they come from; the same publication may be reported by more than one database:

In [87]:
Out[87]:
{'NRF2ome': {<pypath.refs.Reference at 0x6ee605f6dd98>},
 'ELM': {<pypath.refs.Reference at 0x6ee5fdc8cd98>,
  <pypath.refs.Reference at 0x6ee605f6dd68>}}
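
Overlaps between databases are easy to compute once references compare equal by their PubMed ID. The sketch below uses a toy Reference class written for this example; whether the real pypath.refs.Reference compares this way is an assumption here. The database names and the second PubMed ID are placeholders, not real data.

```python
class Reference:
    """Toy stand-in for a literature reference, identified by its PubMed ID."""

    def __init__(self, pmid):
        self.pmid = str(pmid)

    def __eq__(self, other):
        return isinstance(other, Reference) and self.pmid == other.pmid

    def __hash__(self):
        # Equal PubMed IDs hash equally, so duplicates collapse in sets.
        return hash(self.pmid)

    def __repr__(self):
        return 'Reference(%s)' % self.pmid


# Placeholder databases; '00000000' is a placeholder, not a real PubMed ID.
refs_by_source = {
    'database_1': {Reference('17580304')},
    'database_2': {Reference('17580304'), Reference('00000000')},
}

# All distinct references of the edge, and those shared by every database.
all_refs = set().union(*refs_by_source.values())
shared_refs = set.intersection(*refs_by_source.values())

print(len(all_refs), 'distinct references')
print(shared_refs)
```

Because the toy class defines both equality and hashing on the PubMed ID, two separately created objects for the same publication count as one reference.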

12: Translating identifiers

The pypath.mapping module is for ID translation. Most of the time you can simply call the map_name method:

In [22]:
Out[22]:
{'EGFR'}
In [89]:
Out[89]:
{'O75385'}

A number of mapping tables are predefined and loaded automatically. However, pypath does not chain two translation steps automatically if no direct translation table is available. For example, you can translate Entrez to Gene Symbol this way:

In [90]:
Out[90]:
{'ULK1'}

By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous and you want to retrieve only one ID. The map_name0 function returns a single string; in case of ambiguity it returns a random element of the resulting set:

In [23]:
Out[23]:
'Q9BY60'
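
The behaviour described in this section can be mimicked with a few dictionaries. The sketch below is a toy model, not the pypath.mapping implementation: the functions only borrow the names map_name and map_name0, map_names is a helper assumed for this example, the tables hold a handful of entries for illustration, and GENE_X with its two IDs is a made-up ambiguous entry. Returning None for an unknown name is also an assumption of the sketch.

```python
# Toy translation tables keyed by (source ID type, target ID type).
TABLES = {
    ('uniprot', 'genesymbol'): {'P00533': {'EGFR'}, 'O75385': {'ULK1'}},
    ('entrez', 'uniprot'): {'8408': {'O75385'}},
    # Made-up ambiguous entry, only to show why a set is returned.
    ('genesymbol', 'uniprot'): {'GENE_X': {'X00001', 'X00002'}},
}


def map_name(name, id_type, target_id_type):
    """Translate one ID; return a set because the mapping can be ambiguous."""
    table = TABLES.get((id_type, target_id_type), {})
    return set(table.get(name, ()))


def map_names(names, id_type, target_id_type):
    """Translate several IDs and return the union of the results."""
    result = set()
    for name in names:
        result |= map_name(name, id_type, target_id_type)
    return result


def map_name0(name, id_type, target_id_type):
    """Return a single arbitrary element of the result, or None if empty."""
    names = map_name(name, id_type, target_id_type)
    return next(iter(names)) if names else None


# There is no direct Entrez -> Gene Symbol table in this toy setup ...
direct = map_name('8408', 'entrez', 'genesymbol')

# ... so the translation is done in two explicit steps via UniProt.
two_step = map_names(
    map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

print(direct, two_step)
print(map_name0('GENE_X', 'genesymbol', 'uniprot'))
```

The last call may print either of the two made-up IDs, which is why map_name0 is only safe when you know the translation is unambiguous.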

13: Enzyme-substrate interactions

The pypath.ptm module builds a database of enzyme-substrate interactions.

In [91]:

Here you get a dictionary with pairs of UniProt IDs as keys and lists of special objects representing enzyme-substrate interactions as values:

In [92]:
Domain-motif interaction:
  Domain in protein Q13177-1:
	Name: unknown
	Range: 0-0
	3D structures: 
  PTM: phosphorylation in protein P01236-1
    Motif: Motif in protein P01236-1:
	Name: unknown
	ELM: unknown
	Range: 200-214
	Regex: unknown
	Instance: LHCLRRDSHKIDNYL


    Residue: Residue S-207 in protein P01236-1
  Data sources: PhosphoSite, phosphoELM, MIMP
  References: 11943200
  3D structures: 

Alternatively the enzyme-substrate interactions can be assigned to network edges:

In [93]:
In [97]:
Domain-motif interaction:
  Domain in protein P17612-1:
	Name: unknown
	Range: 0-0
	3D structures: 
  PTM: phosphorylation in protein P17600-1
    Motif: unknown

    Residue: Residue S-9 in protein P17600-1
  Data sources: MIMP
  References: 
  3D structures: 

14: Complexes

The pypath.complex module builds a comprehensive database of protein complexes:

In [160]:

This complex is supported by 4 resources and has stoichiometry data: it contains 2 copies of each of its 2 components:

In [166]:
Out[166]:
{'CORUM', 'Compleat', 'PDB', 'Signor'}
In [167]:
Out[167]:
{'Q13564': 2, 'Q8TBC4': 2}

15: Annotations

The pypath.annot module provides various annotations about the function and location of the proteins.

In [33]:

OmniPath contains annotations from 27 resources. These provide various information about the characteristics of the proteins, e.g. their localization or function. The AnnotationTable object loads all annotations by default; optionally, you can limit this to certain resources. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:

In [44]:

The AnnotationTable object provides methods to query all resources together, or build a boolean array out of them. To see all annotations of one protein:

In [34]:
Out[34]:
[SignalinkPathway(pathway='WNT'),
 SignalinkPathway(pathway='TNF/Apoptosis'),
 SignalinkPathway(pathway='RTK'),
 SignalinkPathway(pathway='IIP'),
 KeggPathway(pathway='Choline metabolism in cancer'),
 KeggPathway(pathway='Non-small cell lung cancer'),
 KeggPathway(pathway='HIF-1 signaling pathway'),
 KeggPathway(pathway='Breast cancer'),
 KeggPathway(pathway='Human papillomavirus infection'),
 KeggPathway(pathway='ErbB signaling pathway'),
 KeggPathway(pathway='Estrogen signaling pathway'),
 KeggPathway(pathway='Adherens junction'),
 KeggPathway(pathway='Phospholipase D signaling pathway'),
 KeggPathway(pathway='Relaxin signaling pathway'),
 KeggPathway(pathway='Epithelial cell signaling in Helicobacter pylori infection'),
 KeggPathway(pathway='Glioma'),
 KeggPathway(pathway='Hepatocellular carcinoma'),
 KeggPathway(pathway='Prostate cancer'),
 KeggPathway(pathway='Endometrial cancer'),
 KeggPathway(pathway='Human cytomegalovirus infection'),
 KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
 KeggPathway(pathway='Focal adhesion'),
 KeggPathway(pathway='Regulation of actin cytoskeleton'),
 KeggPathway(pathway='Cushing syndrome'),
 KeggPathway(pathway='Endocrine resistance'),
 KeggPathway(pathway='Central carbon metabolism in cancer'),
 KeggPathway(pathway='Parathyroid hormone synthesis, secretion and action'),
 KeggPathway(pathway='Gap junction'),
 KeggPathway(pathway='GnRH signaling pathway'),
 KeggPathway(pathway='Hepatitis C'),
 KeggPathway(pathway='Pancreatic cancer'),
 KeggPathway(pathway='FoxO signaling pathway'),
 KeggPathway(pathway='Endocytosis'),
 KeggPathway(pathway='Oxytocin signaling pathway'),
 KeggPathway(pathway='Bladder cancer'),
 KeggPathway(pathway='Pathways in cancer'),
 KeggPathway(pathway='Melanoma'),
 KeggPathway(pathway='Proteoglycans in cancer'),
 KeggPathway(pathway='Calcium signaling pathway'),
 KeggPathway(pathway='Colorectal cancer'),
 NetpathPathway(pathway='Prolactin'),
 NetpathPathway(pathway='Leptin'),
 NetpathPathway(pathway='Androgen receptor (AR)'),
 NetpathPathway(pathway='Receptor activator of nuclear factor kappa-B ligand (RANKL)'),
 NetpathPathway(pathway='Tumor necrosis factor (TNF) alpha'),
 NetpathPathway(pathway='Gastrin'),
 NetpathPathway(pathway='Alpha6 Beta4 Integrin'),
 NetpathPathway(pathway='Advanced glycation end-products (AGE/RAGE)'),
 NetpathPathway(pathway='Epidermal growth factor receptor (EGFR)'),
 NetpathPathway(pathway='Follicle-stimulating hormone (FSH)'),
 SignorPathway(pathway='Glioblastoma Multiforme'),
 SignorPathway(pathway='PI3K/AKT'),
 SignorPathway(pathway='EGFR')]
In [46]:
In [47]:
Out[47]:
SignaLink3 SignaLink3__ SignaLink3__BCR SignaLink3__GPCR SignaLink3__HH SignaLink3__HIPPO SignaLink3__Hedgehog(core) SignaLink3__Hedgehog(non-core) SignaLink3__Hippo SignaLink3__IIP ... CORUM_Funcat__vacuole or lysosome CORUM_Funcat__vascular organs CORUM_Funcat__vesicle formation CORUM_Funcat__vesicle fusion CORUM_Funcat__vesicle recycling CORUM_Funcat__vesicular transport (Golgi network, etc.) CORUM_Funcat__vessels CORUM_Funcat__visual transduction CORUM_Funcat__water homeostasis HPMR_complex
A0A024RBG1 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6H9 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6I0 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6I1 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6I4 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6I9 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6J1 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6J6 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6J9 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6K0 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6K2 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6K4 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6K5 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6K6 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6N1 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6N2 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6N3 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6N4 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6P5 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6Q5 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6R0 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6R2 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6S0 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6S2 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6S4 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6S5 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6S6 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6T6 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6T7 False False False False False False False False False False ... False False False False False False False False False False
A0A075B6T8 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Q9Y6U7 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6V0 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6V7 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6W3 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6W5 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6W6 True False False False False False False False False False ... False False False False False False False False False False
Q9Y6W8 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X0 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X1 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X2 True False False False False False False False False True ... False False False False False False False False False False
Q9Y6X3 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X4 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X5 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X6 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X8 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6X9 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Y0 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Y1 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Y8 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Y9 True False False False False False False False False False ... False False False False False False False False False False
Q9Y6Z2 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Z4 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Z5 False False False False False False False False False False ... False False False False False False False False False False
Q9Y6Z7 False False False False False False False False False False ... False False False False False False False False False False
Q9YNA8 False False False False False False False False False False ... False False False False False False False False False False
S4R3P1 False False False False False False False False False False ... False False False False False False False False False False
S4R3Y5 False False False False False False False False False False ... False False False False False False False False False False
U3KPV4 False False False False False False False False False False ... False False False False False False False False False False
W5XKT8 False False False False False False False False False False ... False False False False False False False False False False
W6CW81 False False False False False False False False False False ... False False False False False False False False False False

34397 rows × 1618 columns
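
The layout of this boolean table can be reproduced on a small scale in plain Python. The sketch below is an illustration under our own assumptions, not the AnnotationTable code: it follows the column naming visible above (one column per resource, plus one `resource__value` column per annotation value), while the function name, the input layout and the protein names PROT_A and PROT_B are placeholders.

```python
# Toy annotations: resource -> protein -> set of annotation values.
annotations = {
    'SignaLink3': {'PROT_A': {'WNT', 'RTK'}, 'PROT_B': {'RTK'}},
    'Signor': {'PROT_A': {'EGFR'}},
}


def to_bool_table(annotations):
    """Return (columns, table) where table maps protein -> list of booleans."""
    members = {}  # column name -> set of proteins having that annotation
    for resource, by_protein in annotations.items():
        # One column for being annotated in the resource at all ...
        members[resource] = set(by_protein)
        # ... and one column for each value within the resource.
        for protein, values in by_protein.items():
            for value in values:
                column = '%s__%s' % (resource, value)
                members.setdefault(column, set()).add(protein)
    columns = sorted(members)
    proteins = sorted(set().union(*members.values()))
    table = {
        protein: [protein in members[column] for column in columns]
        for protein in proteins
    }
    return columns, table


columns, table = to_bool_table(annotations)
print(columns)
for protein, row in table.items():
    print(protein, row)
```

Each row is one protein and each column one annotation, so the table is mostly False for the same reason as the large one above: any given protein carries only a few of the possible annotations.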

The AnnotationTable object contains the resource specific annotation objects:

In [48]:
Out[48]:
{'CellPhoneDB': <pypath.annot.CellPhoneDB at 0x6da1f8add470>,
 'Locate': <pypath.annot.Locate at 0x6da2470594e0>,
 'GO_Intercell': <pypath.annot.GOIntercell at 0x6da1f82a0400>,
 'Adhesome': <pypath.annot.Adhesome at 0x6da1f82a0080>,
 'NetPath': <pypath.annot.NetpathPathways at 0x6da1f82a0f28>,
 'ComPPI': <pypath.annot.Comppi at 0x6da1f8932eb8>,
 'Signor': <pypath.annot.SignorPathways at 0x6da23b1a74a8>,
 'Ramilowski2015': <pypath.annot.Ramilowski2015 at 0x6da1f84ec588>,
 'Ramilowski_location': <pypath.annot.Ramilowski2015Location at 0x6da23baaf438>,
 'Exocarta': <pypath.annot.Exocarta at 0x6da224b4b2b0>,
 'Matrisome': <pypath.annot.Matrisome at 0x6da1f8026ac8>,
 'Integrins': <pypath.annot.Integrins at 0x6da1f8031978>,
 'CSPA': <pypath.annot.CellSurfaceProteinAtlas at 0x6da1f8031630>,
 'HPA': <pypath.annot.HumanProteinAtlas at 0x6da1f3b9ef28>,
 'Surfaceome': <pypath.annot.Surfaceome at 0x6da1f89f25c0>,
 'HPMR': <pypath.annot.HumanPlasmaMembraneReceptome at 0x6da1e7e58d30>,
 'Zhong2015': <pypath.annot.Zhong2015 at 0x6da1f8019b38>,
 'Membranome': <pypath.annot.Membranome at 0x6da1e7e695f8>,
 'KEGG': <pypath.annot.KeggPathways at 0x6da23a689e48>,
 'Kirouac2010': <pypath.annot.Kirouac2010 at 0x6da1e7e69898>,
 'Guide2Pharma': <pypath.annot.GuideToPharmacology at 0x6da1d6a302b0>,
 'SignaLink3': <pypath.annot.SignalinkPathways at 0x6da1d6592320>,
 'OPM': <pypath.annot.Opm at 0x6da1d6614f60>,
 'Vesiclepedia': <pypath.annot.Vesiclepedia at 0x6da1d6614dd8>,
 'TopDB': <pypath.annot.Topdb at 0x6da1d4d159b0>,
 'HGNC': <pypath.annot.Hgnc at 0x6da1d64db860>,
 'MatrixDB': <pypath.annot.Matrixdb at 0x6da1d659c390>,
 'CORUM_GO': <pypath.annot.CorumGO at 0x6da1d5c03a90>,
 'CellPhoneDB_complex': <pypath.annot.CellPhoneDBComplex at 0x6da1d3314828>,
 'CORUM_Funcat': <pypath.annot.CorumFuncat at 0x6da1d3351cf8>,
 'HPMR_complex': <pypath.annot.HpmrComplex at 0x6da1d32ee2e8>}

For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values:

In [49]:
In [50]:
Out[50]:
('mainclass', 'subclass', 'subsubclass')
In [52]:
Out[52]:
{'Collagens',
 'ECM Glycoproteins',
 'ECM Regulators',
 'ECM-affiliated Proteins',
 'Proteoglycans',
 'Secreted Factors',
 'n/a'}
In [53]:
Out[53]:
{'A2A2Y8',
 'A2A352',
 'A2AAS7',
 'A6NCT7',
 'A6NDR9',
 'A6NEQ6',
 'A6NMZ7',
 'A6PVD9',
 'A8MWQ5',
 'A8MXH5',
 'A8TX70',
 'B1AKJ1',
 'B1AKJ3',
 'B4DZ39',
 'B7ZBI4',
 'B7ZBI5',
 'C9JBL3',
 'C9JH44',
 'C9JMN2',
 'C9JNG9',
 'C9JPW4',
 'C9JTN9',
 Complex Collagen type I homotrimer: COMPLEX:P02452,
 Complex Collagen type I trimer: COMPLEX:P02452-P08123,
 Complex Collagen type II trimer: COMPLEX:P02458,
 Complex Collagen type XI trimer variant 1: COMPLEX:P02458-P12107-P13942,
 Complex: COMPLEX:P02458-P20908-P25067-P29400,
 Complex: COMPLEX:P02458-P25067-P29400,
 Complex Collagen type III trimer: COMPLEX:P02461,
 Complex: COMPLEX:P02462,
 Complex Collagen type IV trimer variant 1: COMPLEX:P02462-P08572,
 Complex Collagen type XI trimer variant 2: COMPLEX:P05997-P12107,
 Complex Collagen type XI trimer variant 3: COMPLEX:P05997-P12107-P20908,
 Complex Collagen type V trimer variant 1: COMPLEX:P05997-P20908,
 Complex Collagen type V trimer variant 2: COMPLEX:P05997-P20908-P25940,
 Complex: COMPLEX:P08572,
 Complex: COMPLEX:P12109-P12110,
 Complex Collagen type VI trimer: COMPLEX:P12109-P12110-P12111,
 Complex Collagen type IX trimer: COMPLEX:P20849-Q14050-Q14055,
 Complex Collagen type V trimer variant 3: COMPLEX:P20908,
 Complex: COMPLEX:P20908-P25067,
 Complex Collagen type VIII trimer variant 3: COMPLEX:P25067,
 Complex Collagen type VIII trimer variant 1: COMPLEX:P25067-P27658,
 Complex: COMPLEX:P25067-P29400,
 Complex Collagen type VIII trimer variant 2: COMPLEX:P27658,
 Complex Collagen type IV trimer variant 3: COMPLEX:P29400-P53420-Q01955,
 Complex Collagen type IV trimer variant 2: COMPLEX:P29400-Q14031,
 Complex Collagen type XV trimer: COMPLEX:P39059,
 Complex Collagen type XVIII trimer: COMPLEX:P39060,
 Complex: COMPLEX:P53420,
 Complex: COMPLEX:Q01955,
 Complex Collagen type VII trimer: COMPLEX:Q02388,
 Complex Collagen type X trimer: COMPLEX:Q03692,
 Complex Collagen type XIV trimer: COMPLEX:Q05707,
 Complex Collagen type XVI trimer: COMPLEX:Q07092,
 Complex Collagen type XIX trimer: COMPLEX:Q14993,
 Complex Collagen type XXIV trimer: COMPLEX:Q17RW2,
 Complex Collagen type XXVIII trimer: COMPLEX:Q2UY09,
 Complex Collagen type XIII trimer: COMPLEX:Q5TAT6,
 Complex Collagen type XXIII trimer: COMPLEX:Q86Y22,
 Complex Collagen type XXVII trimer: COMPLEX:Q8IZC6,
 Complex Collagen type XXII trimer: COMPLEX:Q8NFW1,
 Complex Collagen type XXVI trimer: COMPLEX:Q96A83,
 Complex Collagen type XXI trimer: COMPLEX:Q96P44,
 Complex Collagen type XII trimer: COMPLEX:Q99715,
 Complex Collagen type XXV trimer, variant 2: COMPLEX:Q9BXS0,
 Complex Collagen type XX trimer: COMPLEX:Q9P218,
 Complex Collagen type XVII trimer: COMPLEX:Q9UMD9,
 'D6R8Y2',
 'D6RGG3',
 'E7ENL6',
 'E7ENY8',
 'E7ES46',
 'E7ES47',
 'E7ES49',
 'E7ES50',
 'E7ES51',
 'E7ES55',
 'E7ES56',
 'E7EX21',
 'E9PAL5',
 'E9PCV6',
 'E9PEG9',
 'E9PNK8',
 'E9PNV9',
 'E9PP49',
 'F5GZK2',
 'F5H3Q5',
 'F5H5K0',
 'F5H851',
 'F8W6Y7',
 'F8W8G8',
 'F8WDM8',
 'G5E987',
 'H0Y393',
 'H0Y3B3',
 'H0Y3B5',
 'H0Y3M9',
 'H0Y409',
 'H0Y420',
 'H0Y4C9',
 'H0Y4P7',
 'H0Y5N9',
 'H0Y935',
 'H0Y940',
 'H0Y991',
 'H0Y998',
 'H0Y9H0',
 'H0Y9R8',
 'H0Y9T2',
 'H0YA33',
 'H0YAE1',
 'H0YAX7',
 'H0YBB2',
 'H0YCZ7',
 'H0YD40',
 'H0YDH6',
 'H0YHM5',
 'H0YHM9',
 'H0YIS1',
 'H7BXM4',
 'H7BXV5',
 'H7BY82',
 'H7BY97',
 'H7BYT9',
 'H7BZB6',
 'H7BZL8',
 'H7BZU0',
 'H7C0M5',
 'H7C381',
 'H7C3F0',
 'H7C3P2',
 'H7C435',
 'H7C457',
 'I3L392',
 'I3L3H7',
 'J3KNM7',
 'J3QT75',
 'J3QT83',
 'P02452',
 'P02458',
 'P02461',
 'P02462',
 'P05997',
 'P08123',
 'P08572',
 'P12107',
 'P12109',
 'P12110',
 'P12111',
 'P13942',
 'P20849',
 'P20908',
 'P25067',
 'P25940',
 'P27658',
 'P29400',
 'P39059',
 'P39060',
 'P53420',
 'Q01955',
 'Q02388',
 'Q03692',
 'Q05707',
 'Q07092',
 'Q14031',
 'Q14050',
 'Q14055',
 'Q14993',
 'Q17RW2',
 'Q2UY09',
 'Q4G0W3',
 'Q4VXW1',
 'Q4VXY6',
 'Q5JVU1',
 'Q5QPC7',
 'Q5QPC8',
 'Q5T1U7',
 'Q5TAT6',
 'Q86Y22',
 'Q8IZC6',
 'Q8NFW1',
 'Q96A83',
 'Q96P44',
 'Q99715',
 'Q9BXS0',
 'Q9P218',
 'Q9UMD9'}
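
To make the query logic above concrete, here is a minimal, self-contained sketch that does not use pypath at all. It assumes each annotation resource can be pictured as a mapping from UniProt IDs to sets of records with named fields. The names `MatrisomeRecord`, `field_values` and `select_ids`, the `TOY...` identifiers and the field values in the toy data are invented for illustration; only the three field names are taken from the output above.

```python
from collections import namedtuple

# Hypothetical record type; the field names mirror the output of the
# Matrisome annotation shown above.
MatrisomeRecord = namedtuple(
    'MatrisomeRecord',
    ['mainclass', 'subclass', 'subsubclass'],
)

# Toy data: UniProt ID -> set of records. The class assignments and the
# TOY identifiers are invented for illustration only.
toy_annot = {
    'P02452': {MatrisomeRecord('Core matrisome', 'Collagens', 'n/a')},
    'P02458': {MatrisomeRecord('Core matrisome', 'Collagens', 'n/a')},
    'TOY001': {MatrisomeRecord('Core matrisome', 'Proteoglycans', 'n/a')},
    'TOY002': {MatrisomeRecord('Matrisome-associated', 'Secreted Factors', 'n/a')},
}


def field_values(annot, field):
    """Collect every value that occurs in one field."""
    return {getattr(rec, field) for recs in annot.values() for rec in recs}


def select_ids(annot, **criteria):
    """Return IDs having at least one record that matches all criteria.

    A criterion is either a single value or a collection of accepted values.
    """
    def matches(rec):
        for field, wanted in criteria.items():
            value = getattr(rec, field)
            if isinstance(wanted, (set, frozenset, list, tuple)):
                if value not in wanted:
                    return False
            elif value != wanted:
                return False
        return True

    return {
        uniprot
        for uniprot, recs in annot.items()
        if any(matches(rec) for rec in recs)
    }


fields = MatrisomeRecord._fields
subclasses = field_values(toy_annot, 'subclass')
collagens = select_ids(toy_annot, subclass='Collagens')
core_subset = select_ids(
    toy_annot,
    mainclass='Core matrisome',
    subclass={'Collagens', 'Proteoglycans'},
)
```

Requiring that all criteria match within a single record keeps a combination such as `mainclass` plus `subclass` meaningful when a protein carries more than one record.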

16: Gene Ontology

pypath.go is an almost standalone module for managing the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. the terms and their relationships; the latter represents the assignment of terms to gene products. Both provide many versatile methods for querying.

In [54]:
In [55]:
Out[55]:
<pypath.go.GeneOntology at 0x6da24311be48>
In [56]:
Out[56]:
<pypath.go.GOAnnotation at 0x6da24311bbe0>
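
The division of labour between the two objects can be pictured with a small, pypath-independent sketch: one mapping holds the term relationships, another holds the term-to-protein assignments. All identifiers below (`TOY:...`, `P1` to `P3`) and the helper names `invert`, `all_descendants` and `annotated_with` are hypothetical; the real classes store far more detail.

```python
from collections import defaultdict, deque

# Toy ontology: child term -> set of direct parents.
# The IDs are invented placeholders, not real GO accessions.
toy_parents = {
    'TOY:0002': {'TOY:0001'},
    'TOY:0003': {'TOY:0001'},
    'TOY:0004': {'TOY:0002', 'TOY:0003'},
    'TOY:0005': {'TOY:0004'},
}

# Toy annotation: term -> set of gene products (placeholder IDs).
toy_annotation = {
    'TOY:0002': {'P1'},
    'TOY:0003': {'P3'},
    'TOY:0005': {'P2'},
}


def invert(parents):
    """Turn a child -> parents mapping into parent -> children."""
    children = defaultdict(set)
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].add(child)
    return children


def all_descendants(term, children):
    """Breadth-first walk collecting every term below `term`."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def annotated_with(term, children, annotation):
    """Gene products annotated with `term` or any of its descendants."""
    terms = {term} | all_descendants(term, children)
    return set().union(*(annotation.get(t, set()) for t in terms))


toy_children = invert(toy_parents)
below_0002 = all_descendants('TOY:0002', toy_children)
products_0002 = annotated_with('TOY:0002', toy_children, toy_annotation)
```

The `seen` set matters because a term can be reached through more than one parent, as `TOY:0004` is here.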

Among many others, the most versatile method is select, which selects the annotated gene products by expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.

In [58]:
['O15400', 'P62258', 'P0DP23', 'Q8N335', 'P05771', 'Q9Y6J6', 'P57796']
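
How an expression with AND, OR, NOT and parentheses can be turned into set operations is sketched below. This is not pypath's parser; it is a minimal recursive-descent evaluator under the assumption that NOT binds tighter than AND, and AND tighter than OR. The function name `select_by_expression`, the term names and the one-letter protein IDs are all made up for the example.

```python
import re

# Split on parentheses and on the three upper-case operators.
_TOKEN = re.compile(r'(\(|\)|\bAND\b|\bOR\b|\bNOT\b)')
_RESERVED = {'(', ')', 'AND', 'OR', 'NOT'}


def tokenize(expr):
    """Break the expression into operators, parentheses and term names."""
    return [tok.strip() for tok in _TOKEN.split(expr) if tok.strip()]


def select_by_expression(expr, term_to_proteins, universe):
    """Evaluate a boolean expression of terms as set operations."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError('unexpected end of expression')
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == 'OR':
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == 'AND':
            take()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == 'NOT':
            take()
            # Negation is the complement within all annotated proteins.
            return universe - parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == '(':
            result = parse_or()
            if take() != ')':
                raise ValueError('expected closing parenthesis')
            return result
        if tok in _RESERVED:
            raise ValueError('unexpected token: %s' % tok)
        return set(term_to_proteins.get(tok, set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError('unexpected token: %s' % tokens[pos])
    return result


# Toy annotation with invented protein IDs.
toy_terms = {
    'cell surface': {'A', 'B', 'C'},
    'extracellular region': {'B', 'C', 'D'},
    'nucleus': {'C', 'E'},
}
toy_universe = {'A', 'B', 'C', 'D', 'E'}

surface_not_nuclear = select_by_expression(
    'cell surface AND NOT nucleus', toy_terms, toy_universe,
)
grouped = select_by_expression(
    '(cell surface OR nucleus) AND extracellular region',
    toy_terms, toy_universe,
)
```

Note that NOT needs a universe to subtract from; in this sketch it is the set of all proteins present in the toy annotation.
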
In [69]:
Out[69]:
{'GO:0001507',
 'GO:0001527',
 'GO:0003351',
 'GO:0003355',
 'GO:0005201',
 'GO:0005576',
 'GO:0005577',
 'GO:0005582',
 'GO:0005583',
 'GO:0005584',
 'GO:0005585',
 'GO:0005586',
 'GO:0005587',
 'GO:0005588',
 'GO:0005590',
 'GO:0005591',
 'GO:0005592',
 'GO:0005595',
 'GO:0005596',
 'GO:0005599',
 'GO:0005601',
 'GO:0005602',
 'GO:0005604',
 'GO:0005606',
 'GO:0005607',
 'GO:0005608',
 'GO:0005609',
 'GO:0005610',
 'GO:0005611',
 'GO:0005612',
 'GO:0005614',
 'GO:0005615',
 'GO:0005616',
 'GO:0006858',
 'GO:0006859',
 'GO:0006860',
 'GO:0009519',
 'GO:0010367',
 'GO:0016914',
 'GO:0016942',
 'GO:0020003',
 'GO:0020004',
 'GO:0020005',
 'GO:0020006',
 'GO:0030020',
 'GO:0030021',
 'GO:0030023',
 'GO:0030197',
 'GO:0030345',
 'GO:0030934',
 'GO:0030935',
 'GO:0030938',
 'GO:0031012',
 'GO:0031395',
 'GO:0032311',
 'GO:0032579',
 'GO:0033165',
 'GO:0033166',
 'GO:0034358',
 'GO:0034359',
 'GO:0034360',
 'GO:0034361',
 'GO:0034362',
 'GO:0034363',
 'GO:0034364',
 'GO:0034365',
 'GO:0034366',
 'GO:0034385',
 'GO:0035182',
 'GO:0035183',
 'GO:0035323',
 'GO:0035324',
 'GO:0035581',
 'GO:0035582',
 'GO:0035583',
 'GO:0036117',
 'GO:0038098',
 'GO:0038101',
 'GO:0038105',
 'GO:0042567',
 'GO:0042568',
 'GO:0042571',
 'GO:0042627',
 'GO:0043083',
 'GO:0043230',
 'GO:0043245',
 'GO:0043256',
 'GO:0043257',
 'GO:0043258',
 'GO:0043259',
 'GO:0043260',
 'GO:0043261',
 'GO:0043263',
 'GO:0043264',
 'GO:0043509',
 'GO:0043510',
 'GO:0043511',
 'GO:0043512',
 'GO:0043513',
 'GO:0043514',
 'GO:0043655',
 'GO:0044420',
 'GO:0044421',
 'GO:0045171',
 'GO:0045172',
 'GO:0048046',
 'GO:0048180',
 'GO:0048183',
 'GO:0055039',
 'GO:0060102',
 'GO:0060103',
 'GO:0060104',
 'GO:0060105',
 'GO:0060106',
 'GO:0060107',
 'GO:0060108',
 'GO:0060109',
 'GO:0060110',
 'GO:0060111',
 'GO:0060287',
 'GO:0061696',
 'GO:0061701',
 'GO:0061800',
 'GO:0061801',
 'GO:0062023',
 'GO:0062039',
 'GO:0062040',
 'GO:0065010',
 'GO:0070062',
 'GO:0070289',
 'GO:0070505',
 'GO:0070645',
 'GO:0070701',
 'GO:0070702',
 'GO:0070703',
 'GO:0070743',
 'GO:0070744',
 'GO:0070745',
 'GO:0071736',
 'GO:0071739',
 'GO:0071743',
 'GO:0071746',
 'GO:0071748',
 'GO:0071749',
 'GO:0071750',
 'GO:0071751',
 'GO:0071752',
 'GO:0071754',
 'GO:0071756',
 'GO:0071757',
 'GO:0071914',
 'GO:0071953',
 'GO:0072534',
 'GO:0072562',
 'GO:0072563',
 'GO:0085026',
 'GO:0085036',
 'GO:0085040',
 'GO:0090658',
 'GO:0090660',
 'GO:0090733',
 'GO:0097058',
 'GO:0097059',
 'GO:0097189',
 'GO:0097311',
 'GO:0097312',
 'GO:0097313',
 'GO:0097579',
 'GO:0097619',
 'GO:0097691',
 'GO:0098549',
 'GO:0098595',
 'GO:0098642',
 'GO:0098643',
 'GO:0098644',
 'GO:0098645',
 'GO:0098646',
 'GO:0098648',
 'GO:0098651',
 'GO:0098652',
 'GO:0098774',
 'GO:0098875',
 'GO:0098965',
 'GO:0098966',
 'GO:0099126',
 'GO:0099535',
 'GO:0099544',
 'GO:0120197',
 'GO:0150043',
 'GO:1900115',
 'GO:1900116',
 'GO:1903561',
 'GO:1990318',
 'GO:1990323',
 'GO:1990324',
 'GO:1990325',
 'GO:1990326',
 'GO:1990338',
 'GO:1990339',
 'GO:1990340',
 'GO:1990341',
 'GO:1990377',
 'GO:1990562',
 'GO:1990563',
 'GO:1990742',
 'GO:1990971',
 'GO:1990972'}

17: Protein complexes

The pypath.complex module builds a non-redundant list of complexes from 10 original resources. Complexes are unique considering their set of components, and optionally carry stoichiometry information.

In [71]:
In [72]:
Out[72]:
<pypath.complex.ComplexAggregator at 0x6da2347f92b0>

To retrieve all complexes containing a specific protein, here MTOR:

In [92]:
Out[92]:
{Complex: COMPLEX:P23443-P42345,
 Complex: COMPLEX:P42345,
 Complex: COMPLEX:P42345-P62942,
 Complex: COMPLEX:P42345-P83436-Q14746-Q8WTW3-Q96JB2-Q96MW5-Q9H9E3-Q9UP83-Q9Y2V7,
 Complex: COMPLEX:P42345-Q00688,
 Complex: COMPLEX:P42345-Q02790,
 Complex: COMPLEX:P42345-Q13451,
 Complex: COMPLEX:P42345-Q13535-Q92616-Q9UIA9,
 Complex: COMPLEX:P42345-Q13535-Q96QU8,
 Complex: COMPLEX:P42345-Q13535-Q96QU8-Q9UIA9,
 Complex: COMPLEX:P42345-Q13541-Q15382-Q8N122-Q9BVC4,
 Complex: COMPLEX:P42345-Q13541-Q8N122-Q9BVC4,
 Complex mTORC2 complex: COMPLEX:P42345-Q6R327-Q9BPZ7-Q9BVC4,
 Complex mTOR complex (MTOR, RICTOR, MLST8): COMPLEX:P42345-Q6R327-Q9BVC4,
 Complex mTOR complex (MTOR, RAPTOR): COMPLEX:P42345-Q8N122,
 Complex mTORC1: COMPLEX:P42345-Q8N122-Q8TB45-Q96B36-Q9BVC4,
 Complex mTOR complex (MTOR, RAPTOR, MLST8): COMPLEX:P42345-Q8N122-Q9BVC4,
 Complex: COMPLEX:P42345-Q96B36-Q9BVC4,
 Complex: COMPLEX:P42345-Q9BVC4,
 Complex: COMPLEX:P42345-Q9BVC4-Q9UJ68}
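
The idea of a non-redundant list keyed by the set of components, and of looking up complexes by a member protein, can be sketched without pypath as follows. The resource names `ResourceA` and `ResourceB`, the stoichiometries and the helper names `aggregate` and `complexes_with` are assumptions made for this example; only the UniProt IDs and the `COMPLEX:` label format are taken from the output above.

```python
# (resource name, components) pairs. Components map UniProt IDs to
# stoichiometry; resource names and stoichiometries are placeholders.
toy_records = [
    ('ResourceA', {'P42345': 1, 'Q8N122': 1}),
    ('ResourceB', {'Q8N122': 1, 'P42345': 1}),  # same complex, other source
    ('ResourceB', {'P42345': 1, 'Q9BVC4': 1}),
    ('ResourceA', {'P12109': 1, 'P12110': 1}),
]


def complex_label(components):
    """Canonical label: unique component IDs, alphabetically sorted."""
    return 'COMPLEX:' + '-'.join(sorted(components))


def aggregate(records):
    """Merge records that share the same set of components."""
    merged = {}
    for resource, components in records:
        entry = merged.setdefault(
            complex_label(components),
            {'components': dict(components), 'sources': set()},
        )
        entry['sources'].add(resource)
    return merged


def complexes_with(protein, merged):
    """Labels of all complexes that contain the given protein."""
    return {
        label
        for label, entry in merged.items()
        if protein in entry['components']
    }


toy_complexes = aggregate(toy_records)
mtor_complexes = complexes_with('P42345', toy_complexes)
```

Because the label is built from the sorted IDs, the order in which a resource lists the components does not matter, and the first two records collapse into one entry with two sources.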

Take a closer look at one complex object. The hash of the complex object is equivalent to the hash of its string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:

In [100]:
In [101]:
Out[101]:
{'Q13451': 2, 'P42345': 2}
In [102]:
Out[102]:
{'PDB'}
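
Why a plain string can retrieve an entry whose key is an object is shown by the following minimal sketch. `ToyComplex` is a hypothetical stand-in, not the pypath.intera.Complex class; it only assumes that hashing and equality are both delegated to the string representation. The component counts and the source are copied from the output above.

```python
class ToyComplex:
    """Hypothetical stand-in for a complex object, for illustration only."""

    def __init__(self, components, sources=None):
        # components: UniProt ID -> stoichiometry
        self.components = dict(components)
        self.sources = set(sources or ())

    def __str__(self):
        # Unique IDs, alphabetically sorted, as in the labels above.
        return 'COMPLEX:' + '-'.join(sorted(self.components))

    def __repr__(self):
        return 'Complex: %s' % self

    def __hash__(self):
        # Same hash as the string representation.
        return hash(str(self))

    def __eq__(self, other):
        # Equal to another complex or to a string with the same label.
        return str(self) == str(other)


cplex = ToyComplex({'Q13451': 2, 'P42345': 2}, sources={'PDB'})

# The dict key is the object itself ...
toy_dict = {cplex: cplex}

# ... yet a string with the same label finds the same entry.
found = toy_dict['COMPLEX:P42345-Q13451']
```

Both methods are needed: a dict first compares hashes and then falls back to `__eq__`, so matching only one of the two would not make the string lookup work.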

18: Saving datasets as pickles

The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than is convenient, e.g. a few minutes. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps.
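
The general pattern, independent of pypath, is sketched here with the standard library only. `ToyDataset` and its method names are hypothetical and merely illustrate the save and load round trip; consult the individual pypath classes for the methods they actually offer.

```python
import os
import pickle
import tempfile


class ToyDataset:
    """Hypothetical data integration object, for illustration only."""

    def __init__(self, pickle_file=None):
        self.records = {}
        if pickle_file and os.path.exists(pickle_file):
            # Fast path: restore the processed contents from disk.
            self.load_from_pickle(pickle_file)
        else:
            # Slow path: stand-in for download and processing.
            self.build()

    def build(self):
        self.records = {'P42345': {'MTOR'}}

    def save_to_pickle(self, pickle_file):
        with open(pickle_file, 'wb') as fp:
            pickle.dump(self.records, fp, protocol=pickle.HIGHEST_PROTOCOL)

    def load_from_pickle(self, pickle_file):
        with open(pickle_file, 'rb') as fp:
            self.records = pickle.load(fp)


# Round trip through a temporary directory.
pickle_path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.pickle')

original = ToyDataset()
original.save_to_pickle(pickle_path)

restored = ToyDataset(pickle_file=pickle_path)
```

Only the processed contents are pickled in this sketch, not the whole object, which keeps the dump loadable even if the class later gains new attributes. Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.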

In [ ]: