The pypath book§

Contents

  • 1  Introduction

  • 2  Build, load and save databases

    • 2.1  The OmniPath app

    • 2.2  Built-in database definitions

    • 2.3  Networks

      • 2.3.1  Strictly literature curated network

      • 2.3.2  The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions

      • 2.3.3  Transcriptional regulation network from DoRothEA and other resources

      • 2.3.4  Literature curated miRNA post-transcriptional regulation network

      • 2.3.5  Transcriptional regulation of miRNA

      • 2.3.6  lncRNA-mRNA interactions

      • 2.3.7  Small molecule-protein interactions

    • 2.4  Enzyme-substrate relationships

    • 2.5  Protein complexes

    • 2.6  Annotations

    • 2.7  Inter-cellular communication roles

  • 3  Data directly from the original resources

  • 4  Download management

    • 4.1  Cache management and customization

    • 4.2  Download failures

      • 4.2.1  Corrupted cache content

      • 4.2.2  Network communication issues: look into the curl debug log

      • 4.2.3  Timeouts

      • 4.2.4  Access and inspect the Curl object

      • 4.2.5  Is it failing only for you?

      • 4.2.6  Read the log

      • 4.2.7  TLS (SSL, HTTPS) errors

  • 5  Resources

    • 5.1  Licenses

    • 5.2  Resource information

    • 5.3  Resource definitions for a certain database or dataset

  • 6  Building networks

    • 6.1  Which network datasets are pre-defined in pypath?

    • 6.2  The Network object

    • 6.3  Network in pandas.DataFrame

  • 7  Translating identifiers

    • 7.1  Pre-defined ID translation tables

  • 8  Homology translation

    • 8.1  Homology translation tables as dictionaries

    • 8.2  Homology translation data frames

  • 9  Taxonomy

    • 9.1  Translating to NCBI Taxonomy, scientific names and common names

    • 9.2  Organism from UniProt ID

  • 10  UniProt

    • 10.1  The UniProt input module

      • 10.1.1  All UniProt IDs for one organism

      • 10.1.2  UniProt ID format validation

      • 10.1.3  UniProt ID validation

      • 10.1.4  Single UniProt protein datasheet

      • 10.1.5  History of UniProt records

      • 10.1.6  UniProt legacy API

      • 10.1.7  Processed UniProt annotations

    • 10.2  The UniProt utils module

      • 10.2.1  Datasheets

      • 10.2.2  Tables

    • 10.3  Sanitizing UniProt IDs

  • 11  Enzyme-substrate interactions

    • 11.1  Enzyme-substrate objects

    • 11.2  Enzyme-substrate data frame

  • 12  Protein sequences

  • 13  Annotations

    • 13.1  Load a single annotation resource

    • 13.2  Load the full annotations database by the database manager

    • 13.3  Load only selected annotations

    • 13.4  Data frames of annotations

  • 14  Inter-cellular signaling roles

    • 14.1  Build an intercellular communication network

    • 14.2  Quantitative overview of intercell annotations

    • 14.3  Intercell database as data frame

    • 14.4  Browse intercell categories

  • 15  Gene Ontology

  • 16  Protein complexes

    • 16.1  Protein complex objects

    • 16.2  Protein complex data frame

  • 17  Saving datasets as pickles

  • 18  Log messages and sessions

    • 18.1  Basic info about the session

    • 18.2  Read the log file

    • 18.3  Logging to the console

    • 18.4  Disable logging

    • 18.5  Write to the log

      • 18.5.1  Sending a single message

      • 18.5.2  Connect a module or class to the pypath logger

  • 19  BEL export

  • 20  CellPhoneDB export

  • 21  The legacy igraph-based network object

    • 21.1  I just want a network quickly and play around with pypath

    • 21.2  How do I build networks from any data with pypath?

      • 21.2.1  Defining input formats

      • 21.2.2  Creating PyPath object and loading the 2 test files

    • 21.3  Structure of the legacy network object

      • 21.3.1  Directions and signs

      • 21.3.2  Accessing nodes in the network

    • 21.4  Querying relationships with or without causality

    • 21.5  Accessing edges by identifiers

    • 21.6  Literature references

    • 21.7  Plotting the network with igraph

Introduction§

OmniPath consists of five main database segments: network (interactions), enzyme-substrate interactions (enz_sub or ptms), protein complexes (complexes), molecular entity annotations (annotations) and intercellular communication roles (intercell). All of these are available through the web service at https://omnipathdb.org/ and the R/Bioconductor package OmnipathR; the network and some of the annotations are also available through the Cytoscape app. However, only pypath can build these databases directly from the original sources, with various options for customization, and provide a rich and versatile Python API for each database. This book attempts to be a guided tour around pypath, but almost all objects, modules and APIs presented here have many more methods, options and features than we have a chance to cover. If you think pypath might have something useful for you that is not covered here, don’t hesitate to ask us on GitHub.

This document has been run with the following pypath version:

[1]:
import pypath
pypath.__version__

executed in 0ms, finished 14:11:06 2022-12-03

[1]:
'0.14.32'

Build, load and save databases§

We provide a high-level interface in the module pypath.omnipath.app. This is the easiest way to build, manage and access the OmniPath databases, hence this is what we present in the Quick start section. In further sections we show the lower-level modules in more detail.

The OmniPath app§

pypath.omnipath is an application which contains a database manager at omnipath.db. This manager is empty by default. It builds and loads the databases on demand.

[2]:
from pypath import omnipath

omnipath.db

executed in 1.34s, finished 14:11:27 2022-12-03

[2]:
<pypath.omnipath.app.DatabaseManager at 0x602fb851cd90>
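
The on-demand behaviour described above can be pictured with a small sketch. This is not pypath's implementation: the class `LazyDatabaseManager` and the builder function are hypothetical, and only illustrate the idea that a manager starts empty, builds a database the first time it is requested, and reuses it afterwards.

```python
class LazyDatabaseManager:
    """Hypothetical stand-in that builds each database on first request."""

    def __init__(self, builders):
        # Map of dataset name -> zero-argument function that builds it.
        self._builders = dict(builders)
        # Databases built so far; empty until something is requested.
        self._loaded = {}

    @property
    def datasets(self):
        return list(self._builders)

    def get_db(self, name):
        # Build only if this database has not been requested before.
        if name not in self._loaded:
            self._loaded[name] = self._builders[name]()
        return self._loaded[name]


build_calls = []


def build_toy_network():
    # Record every build so repeated requests can be checked.
    build_calls.append('curated')
    return {'nodes': 3, 'interactions': 2}


manager = LazyDatabaseManager({'curated': build_toy_network})
first = manager.get_db('curated')
second = manager.get_db('curated')
print(len(build_calls), first is second)
```

The second request returns the very same object without calling the builder again.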

Built-in database definitions§

The databases presented below are pre-defined in pypath. You can also list them by:

[3]:
from pypath import omnipath
omnipath.db.datasets

executed in 0ms, finished 14:11:32 2022-12-03

[3]:
['omnipath',
 'curated',
 'complex',
 'annotations',
 'intercell',
 'tf_target',
 'dorothea',
 'small_molecule',
 'tf_mirna',
 'mirna_mrna',
 'lncrna_mrna',
 'enz_sub']

Networks§

OmniPath offers multiple built-in network datasets: the OmniPath PPI network, the stricter literature curated PPI network, the special ligand-receptor PPI network and various other PPI datasets, the transcriptional regulation network from DoRothEA and other resources, the miRNA post-transcriptional regulation network, and the transcriptional regulation network for miRNAs.

Strictly literature curated network§

[4]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu

executed in 16.83s, finished 13:17:13 2022-12-02

[4]:
<Network: 7980 nodes, 35551 interactions>

The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions§

[5]:
from pypath import omnipath
op = omnipath.db.get_db('omnipath')
op

executed in 1m, finished 13:18:55 2022-12-02

[5]:
<Network: 18558 nodes, 94358 interactions>

Transcriptional regulation network from DoRothEA and other resources§

Note: according to the default settings, DoRothEA confidence levels A-D and all original resources will be loaded. To load only DoRothEA, use the key "dorothea" instead of "tf_target".

[6]:
from pypath import omnipath
tft = omnipath.db.get_db('tf_target')
tft

executed in 2m 12.72s, finished 13:21:54 2022-12-02

[6]:
<Network: 18986 nodes, 326708 interactions>

Literature curated miRNA post-transcriptional regulation network§

[1]:
from pypath import omnipath
mi = omnipath.db.get_db('mirna_mrna')
mi

executed in 2.28s, finished 13:31:55 2022-12-02

[1]:
<Network: 1264 nodes, 3288 interactions>

Transcriptional regulation of miRNA§

[4]:
from pypath import omnipath
tmi = omnipath.db.get_db('tf_mirna')
tmi

executed in 0ms, finished 13:32:41 2022-12-02

[4]:
<Network: 1032 nodes, 4960 interactions>

lncRNA-mRNA interactions§

[6]:
from pypath import omnipath
lnc = omnipath.db.get_db('lncrna_mrna')
lnc

executed in 0ms, finished 13:33:03 2022-12-02

[6]:
<Network: 243 nodes, 217 interactions>

Small molecule-protein interactions§

These interactions are either ligand-receptor connections, enzyme inhibitions, allosteric regulations or enzyme-metabolite interactions. Currently it is a small, experimental dataset, but it will be substantially extended in the future.

[1]:
from pypath import omnipath
smol = omnipath.db.get_db('small_molecule')
smol

executed in 7.94s, finished 13:57:17 2022-12-02

[1]:
<Network: 1980 nodes, 3147 interactions>

Enzyme-substrate relationships§

[7]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')
es

executed in 6.14s, finished 13:33:26 2022-12-02

[7]:
<Enzyme-substrate database: 41426 relationships>

Protein complexes§

[8]:
from pypath import omnipath
co = omnipath.db.get_db('complex')
co

executed in 0ms, finished 13:33:31 2022-12-02

[8]:
<Complex database: 28173 complexes>

Annotations§

The annotations database is huge: building or even loading it takes a long time and requires a lot of memory.

[9]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an

executed in 2m 43.60s, finished 13:36:28 2022-12-02

[9]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>

Inter-cellular communication roles§

This database is quick to build, but it requires the annotations database, which is really heavy.

[10]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

executed in 23.34s, finished 13:37:12 2022-12-02

[10]:
<Intercell annotations: 301527 records about 48570 entities>

Data directly from the original resources§

The pypath.inputs module contains clients for more than 150 molecular biology and biomedical resources, and overall almost 500 functions that download data directly from these resources. Maintaining such a large number of clients is troublesome, hence at any time some of them are broken; you can check their status in our daily status report.

Each submodule of pypath.inputs is named after its corresponding resource, all lowercase, e.g. “depod” (DEPOD) or “cytosig” (CytoSig). Within these modules each function name starts with the name of the resource and ends with the kind of data it retrieves. For example, pypath.inputs.signor.signor_interactions downloads interactions from SIGNOR. The postfixes “_interactions”, “_enz_sub”, “_complexes” and “_annotations” mark functions that retrieve records intended for the respective databases. However, the records at this stage are not fully processed yet. Some functions have different postfixes, e.g. “_raw” means the data is close to the format provided by the resource itself; “_mapping” means it is intended for a translation table.

The purpose of the input functions is to 1) handle the download; 2) read the raw data; 3) extract the relevant parts; 4) do the resource-specific part of the processing, i.e. bring the data to a state where it is suitable for further processing by the generic database classes. The output of these functions is not standardized in any way, though you may observe recurring patterns. The input functions typically return lists or dictionaries. These are designed case by case, with the aims of selecting the relevant fields and giving straightforward, accessible Python data structures for processing within or outside of pypath.
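
The naming convention described above can be made concrete with a short sketch. The helper `describe_input_function` is hypothetical and not part of pypath; it only splits a function name into the resource prefix and the postfix meanings listed in the text, and it assumes the resource name itself contains no underscore.

```python
# Postfixes named in the text above, mapped to what they indicate.
POSTFIXES = {
    '_interactions': 'network database',
    '_enz_sub': 'enzyme-substrate database',
    '_complexes': 'complexes database',
    '_annotations': 'annotations database',
    '_raw': 'close to the original format',
    '_mapping': 'translation table',
}


def describe_input_function(name):
    """Return (resource, meaning of the postfix) for an input function name."""
    # Assumption: the resource name is everything before the first underscore.
    resource = name.split('_', 1)[0]
    for postfix, meaning in POSTFIXES.items():
        if name.endswith(postfix):
            return resource, meaning
    return resource, 'other'


for function_name in (
    'signor_interactions',
    'signor_pathway_annotations',
    'signor_protein_families',
):
    print(function_name, describe_input_function(function_name))
```

Names that match none of the listed postfixes, such as signor_protein_families, fall into the 'other' group.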

We use SIGNOR as an example because this resource provides data for almost all OmniPath databases. The signor_complexes function returns a dict with pypath.internals.intera.Complex objects as values, ready to be added to the OmniPath complexes database (built by pypath.core.complex.ComplexAggregator).

[2]:
from pypath.inputs import signor
signor.signor_complexes()

executed in 0ms, finished 15:24:43 2022-12-03

[2]:
{'COMPLEX:P23511_P25208_Q13952': Complex NFY: COMPLEX:P23511_P25208_Q13952,
 'COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4': Complex mTORC2: COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4,
 'COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4': Complex mTORC1: COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4,
 'COMPLEX:P63208_Q13616_Q9Y297': Complex SCF-betaTRCP: COMPLEX:P63208_Q13616_Q9Y297,
 'COMPLEX:Q09472_Q92793': Complex CBP/p300: COMPLEX:Q09472_Q92793,
 'COMPLEX:Q09472_Q92793_Q92831': Complex P300/PCAF: COMPLEX:Q09472_Q92793_Q92831,
 'COMPLEX:Q13485_Q15796': Complex SMAD2/SMAD4: COMPLEX:Q13485_Q15796,
 'COMPLEX:P84022_Q13485': Complex SMAD3/SMAD4: COMPLEX:P84022_Q13485,
 'COMPLEX:P05412_Q13485': Complex SMAD4/JUN: COMPLEX:P05412_Q13485,
 'COMPLEX:Q15796_Q9HAU4': Complex SMAD2/SMURF2: COMPLEX:Q15796_Q9HAU4,
 'COMPLEX:O15105_Q01094_Q13547': Complex SMAD7/HDAC1/E2F-1: COMPLEX:O15105_Q01094_Q13547,
 'COMPLEX:P19838_Q04206': Complex NfKb-p65/p50: COMPLEX:P19838_Q04206,
 'COMPLEX:O14920_O15111': Complex IK
Output truncated: showing 1000 of 17699 characters

The signor_interactions function returns a list of named tuples that represent the most important properties of SIGNOR interaction records in a human-readable way, ready to be processed by the pypath.core.network.Network object.

[5]:
signor.signor_interactions()[:10]

executed in 0ms, finished 14:11:52 2022-12-03

[5]:
[SignorInteraction(source='O15530', target='O15530', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='unknown', mechanism='phosphorylation', ncbi_tax_id='9606', pubmeds='10455013', direct=True, ptm_type='phosphorylation', ptm_residue='Ser396', ptm_motif='SSSSSSHsLSASDTG'),
 SignorInteraction(source='Q9NQ66', target='CHEBI:18035', source_isoform=None, target_isoform=None, source_type='protein', target_type='smallmolecule', effect='up-regulates quantity', mechanism='', ncbi_tax_id='-1', pubmeds='23880553', direct=True, ptm_type='', ptm_residue='Small molecule catalysis', ptm_motif=''),
 SignorInteraction(source='P62136', target='O15169', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='down-regulates activity', mechanism='dephosphorylation', ncbi_tax_id='9606', pubmeds='17318175', direct=True, ptm_type='dephosphorylation', ptm_residue='Ser77', ptm_motif='YEPEGSAsPTPPYLK'),
 SignorInteraction(sou
Output truncated: showing 1000 of 3285 characters

Note that the records above also contain enzyme-PTM data, hence the signor.signor_enzyme_substrate function only converts them to an intermediate format that is easier for pypath.core.enz_sub.EnzymeSubstrateAggregator to process.

[4]:
signor.signor_enzyme_substrate()[:2]

executed in 0ms, finished 13:58:20 2022-12-02

[4]:
[{'typ': 'phosphorylation',
  'resnum': 396,
  'instance': 'SSSSSSHSLSASDTG',
  'substrate': 'O15530',
  'start': 389,
  'end': 403,
  'kinase': 'O15530',
  'resaa': 'S',
  'motif': 'SSSSSSHSLSASDTG',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'10455013'}},
 {'typ': 'dephosphorylation',
  'resnum': 77,
  'instance': 'YEPEGSASPTPPYLK',
  'substrate': 'O15169',
  'start': 70,
  'end': 84,
  'kinase': 'P62136',
  'resaa': 'S',
  'motif': 'YEPEGSASPTPPYLK',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'17318175'}}]
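
The fields of these intermediate records relate to each other in a simple way, which the sketch below checks on the two records shown above. It assumes that `start` and `end` are 1-based, inclusive coordinates of the sequence window stored in `instance`; this is inferred from the two records only and is not confirmed by the pypath documentation. The helper names are hypothetical.

```python
# The two records from the output above, reduced to the relevant fields.
records = [
    {
        'typ': 'phosphorylation',
        'resnum': 396,
        'instance': 'SSSSSSHSLSASDTG',
        'substrate': 'O15530',
        'start': 389,
        'end': 403,
        'kinase': 'O15530',
        'resaa': 'S',
    },
    {
        'typ': 'dephosphorylation',
        'resnum': 77,
        'instance': 'YEPEGSASPTPPYLK',
        'substrate': 'O15169',
        'start': 70,
        'end': 84,
        'kinase': 'P62136',
        'resaa': 'S',
    },
]


def residue_in_window(record):
    """Return the amino acid at `resnum` within the sequence window."""
    offset = record['resnum'] - record['start']
    return record['instance'][offset]


def window_is_consistent(record):
    """Check the window length and the modified residue against `resaa`."""
    length_ok = record['end'] - record['start'] + 1 == len(record['instance'])
    return length_ok and residue_in_window(record) == record['resaa']


checks = [window_is_consistent(record) for record in records]
print(checks)
```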

Finally, SIGNOR also assigns proteins to pathways. This information is intended for the OmniPath annotations database, and is retrieved by the signor.signor_pathway_annotations function. This function returns a dict of sets, which is typical for “_annotations” functions. This format requires practically no further processing.

[5]:
signor.signor_pathway_annotations()['O14733']

executed in 1.48s, finished 13:58:28 2022-12-02

[5]:
{SignorPathway(pathway='TNF alpha'),
 SignorPathway(pathway='Toll like receptors')}
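
Because the dict-of-sets format is plain Python, it is easy to reshape. The sketch below inverts it to list the proteins of each pathway. `SignorPathway` is redefined locally as a stand-in for the record type seen in the output, and `proteins_by_pathway` is a hypothetical helper, not a pypath function.

```python
import collections

# Local stand-in for the record type seen in the output above.
SignorPathway = collections.namedtuple('SignorPathway', ['pathway'])

# Same shape as the function output: UniProt ID -> set of annotation records.
annotations = {
    'O14733': {
        SignorPathway(pathway='TNF alpha'),
        SignorPathway(pathway='Toll like receptors'),
    },
}


def proteins_by_pathway(annot):
    """Invert {protein: {records}} into {pathway: {proteins}}."""
    inverted = collections.defaultdict(set)
    for uniprot, records in annot.items():
        for record in records:
            inverted[record.pathway].add(uniprot)
    return dict(inverted)


by_pathway = proteins_by_pathway(annotations)
print(sorted(by_pathway))
```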

We haven’t mentioned all functions in the inputs.signor module. The rest of the functions retrieve additional information needed by the four functions above, and are of limited direct use for users. For example, signor_protein_families returns a dict with the internal IDs and members of protein families; this data is used to process the interactions and complexes, but is not too interesting on its own.

[6]:
signor.signor_protein_families()['SIGNOR-PF2']

executed in 0ms, finished 13:58:53 2022-12-02

[6]:
['Q9HBW0', 'Q92633', 'Q9UBY5']

Download management§

Cache management and customization§

By default, pypath.omnipath.app saves the databases as pickle dumps under the ~/.pypath/pickles/ directory, and after the first build it loads them from there. The very first build of each database might take quite a long time (more than 90 minutes in the case of the OmniPath network or annotations) because of the large number of downloads. Subsequent builds will be much faster because pypath stores all the downloaded data in a local cache and downloads it again only upon request from the user. Loading the databases from pickle dumps takes only seconds. However, if you want to build with different settings, remember to set a different cache (pickle) file name, otherwise the previously saved database will be loaded.
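
The build-once, load-from-pickle pattern can be sketched with the standard library alone. The function `load_or_build`, the toy builder and the file name are hypothetical and do not reflect how pypath names or stores its pickles; the example only shows why a build with different settings needs its own file name.

```python
import os
import pickle
import tempfile


def load_or_build(pickle_path, build):
    """Load an object from `pickle_path`, or build and save it first."""
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as fp:
            return pickle.load(fp)
    obj = build()
    with open(pickle_path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj


build_log = []


def slow_build():
    # Stands in for a long build with many downloads.
    build_log.append('built')
    return {'interactions': [('A', 'B'), ('B', 'C')]}


with tempfile.TemporaryDirectory() as tmpdir:
    # A distinct file name per set of build options keeps variants apart.
    path = os.path.join(tmpdir, 'toy_network_default.pickle')
    first = load_or_build(path, slow_build)
    second = load_or_build(path, slow_build)

print(len(build_log), first == second)
```

The second call reads the pickle and never runs the builder, which is exactly why a stale file would hide a change in settings.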

Download failures§

Issuing hundreds of requests to dozens of servers sooner or later comes with failures. These might happen just by accident, especially on slow networks, so it is always recommended to try again.

Corrupted cache content§

Sometimes a truncated or corrupted file remains in the cache. In this case you can use the context managers in pypath.share.curl to control the cache. For example, if the download of the DEPOD database failed and keeps failing due to a corrupted file, use the cache_delete_on context:

[7]:
from pypath.share import curl
from pypath.inputs import depod

with curl.cache_delete_on():
    # Use a new name so the imported `depod` module is not overwritten.
    depod_es = depod.depod_enzyme_substrate()

executed in 5.61s, finished 13:59:07 2022-12-02

The cache_off context forces download even if a cache item is available; the cache_print_on context prints paths to the accessed cache files to the terminal, though the paths can always be found in the log; the dry_run_on context sets up the pypath.share.curl.Curl object and stops just before the actual download.
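
For readers unfamiliar with this kind of context, the sketch below shows the general mechanism: an option is changed on entry and restored on exit, even if an error occurs. Everything here (`OPTIONS`, `option_set`, `fetch`) is hypothetical and only imitates the behaviour described for cache_off; it is not the code of pypath.share.curl.

```python
import contextlib

# Hypothetical option store, standing in for a cache switch.
OPTIONS = {'cache': True}


@contextlib.contextmanager
def option_set(name, value):
    """Temporarily set an option, restoring the old value on exit."""
    previous = OPTIONS[name]
    OPTIONS[name] = value
    try:
        yield
    finally:
        OPTIONS[name] = previous


def fetch(url, store):
    """Toy download: reuse `store` only while the cache option is on."""
    if OPTIONS['cache'] and url in store:
        return 'from cache'
    store[url] = 'content'
    return 'downloaded'


store = {}
url = 'http://example.org/data'
results = [fetch(url, store), fetch(url, store)]
with option_set('cache', False):
    # Inside the context the cached item is ignored.
    results.append(fetch(url, store))
results.append(fetch(url, store))
print(results)
```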

Network communication issues: look into the curl debug log§

Downloads might also fail due to TLS or HTTP errors, wrong headers or parameters, and many other reasons. In this case the full debug output from curl might be very useful. The debug_on context writes the curl debug output into the log file:

[8]:
from pypath.share import curl
from pypath.inputs import depod

with curl.debug_on():
    depod_es = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:12 2022-12-02

Timeouts§

From the log we can find out if the download fails due to a timeout. In this case, the timeout parameters can be altered by a settings context. Apart from curl_timeout, the timeout for the completion of the download, there is curl_connect_timeout (the timeout for establishing a connection to the server) and curl_extended_timeout, which is used for servers that are known to be exceptionally slow. Another parameter, curl_retries, is the number of attempts before giving up. By default it’s 3, and that should be more than enough.

[9]:
from pypath.share import settings
from pypath.inputs import depod

with settings.context(curl_timeout = 360):
    depod_es = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:17 2022-12-02
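
What a retry count means in practice can be shown with a minimal sketch. The function `download_with_retries` and the fake download are hypothetical; pypath's own retry logic lives inside its Curl object and may differ in details such as which errors trigger a new attempt.

```python
def download_with_retries(attempt_download, retries=3):
    """Call `attempt_download` until it succeeds or `retries` attempts fail."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return attempt_download(), attempt
        except TimeoutError as error:
            # Remember the failure and try again if attempts remain.
            last_error = error
    raise last_error


# A fake download that times out twice, then succeeds.
outcomes = iter([TimeoutError('slow'), TimeoutError('slow'), 'data'])


def flaky_download():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result, attempts_used = download_with_retries(flaky_download, retries=3)
print(result, attempts_used)
```

With three attempts the download above just succeeds; with fewer, the last timeout would be raised to the caller.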

Access and inspect the Curl object§

Often the Curl object is created in a function from the pypath.inputs module, deep in a call stack, hence accessing it for investigation is difficult. Using the preserve_on context, the last Curl instance is kept under the pypath.share.curl.LASTCURL attribute:

[10]:
from pypath.share import curl
from pypath.inputs import depod

with curl.preserve_on():
    depod_es = depod.depod_enzyme_substrate()

depod_curl = curl.LASTCURL
depod_curl

executed in 0ms, finished 13:59:24 2022-12-02

[10]:
<pypath.share.curl.Curl at 0x6947386dc8b0>
[11]:
depod_curl.url, depod_curl.req_headers, depod_curl.fileobj, depod_curl.status

executed in 0ms, finished 13:59:28 2022-12-02

[11]:
('http://depod.bioss.uni-freiburg.de/download/DEPOD_201405_human_phosphatase-substrate.mitab',
 [],
 <_io.TextIOWrapper name='/home/denes/.pypath/cache/6a711369ecf9dcff8c5ed88996685b54-DEPOD_201405_human_phosphatase-substrate.mitab' mode='r' encoding='iso-8859-1'>,
 0)

Is it failing only for you?§

This is the one you should check first: we run almost all downloads in pypath daily, and you can always check in the report whether a particular function ran successfully last night on our server. If it also fails in our daily build, it can still be a transient error that disappears within a few days, or it can be a permanent error. In the latter case, we first try to fix the issue in pypath (maybe the behaviour or the address of the third party server has changed). If we have no way to fix it, we start hosting the data on our own server and make pypath download it from there.

Read the log§

We mentioned the pypath log several times above. Here is how to access it; see more details in the section about logging:

[12]:
import pypath
pypath.log()

executed in 0ms, finished 13:59:34 2022-12-02

[2022-12-02 14:57:09] Welcome!
[2022-12-02 14:57:09] Logger started, logging into `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`.
[2022-12-02 14:57:09] Session `s3e92` started.
[2022-12-02 14:57:09] [pypath]
        - session ID: `s3e92`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`
        - pypath version: 0.14.30
[2022-12-02 14:57:09] [curl] Creating Curl object to retrieve data from `https://www.ensembl.org/info/about/species.html`
[2022-12-02 14:57:09] [curl] Cache file path: `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`
[2022-12-02 14:57:09] [curl] Cache file found, no need for download.
[2022-12-02 14:57:09] [curl] Opening plain text file `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`.
[2022-12-02 14:57:09] [curl] Contents of `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html` has been read and the file has been closed.
[2022-1
Output truncated: showing 1000 of 112963 characters
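
Since the log is long, it often helps to filter it by the module label in square brackets. The sketch below does this for a few lines copied from the output above. The regular expression is an assumption based only on the line format visible here, and `messages_from` is a hypothetical helper rather than a pypath function.

```python
import re

# Assumed line format: "[timestamp] [label] message", the label being optional.
LOG_LINE = re.compile(
    r'^\[(?P<time>[^\]]+)\] (?:\[(?P<label>\w+)\] )?(?P<message>.*)$'
)

sample = """\
[2022-12-02 14:57:09] Welcome!
[2022-12-02 14:57:09] Session `s3e92` started.
[2022-12-02 14:57:09] [curl] Creating Curl object to retrieve data from `https://www.ensembl.org/info/about/species.html`
[2022-12-02 14:57:09] [curl] Cache file found, no need for download.
"""


def messages_from(log_text, label):
    """Collect the messages of all log lines carrying the given label."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group('label') == label:
            found.append(match.group('message'))
    return found


curl_messages = messages_from(sample, 'curl')
for message in curl_messages:
    print(message)
```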

TLS (SSL, HTTPS) errors§

Failed to verify certificate; invalid, expired, self-signed or missing certificates: these might be the most common reasons why people open issues for our software. TLS is a method for encrypted communication, typically over HTTP. The server has a certificate and uses it to sign and encrypt the data before sending it to the client. The client trusts the server certificate because it is signed by another certificate. That one is signed by yet another, and so on, until we reach a so-called root certificate that is known and trusted by the client. The number of root certificates used globally is so small that every single computer stores them locally and updates them from time to time from trusted sources, such as the provider of the operating system, web browser or programming language.

Having an up-to-date certificate store and correctly configured TLS clients on your computer is your (or your system admin’s) duty; here we can only give a generic procedure to address these issues. In 97% of the cases the issue is in your computer, but sometimes the server might be responsible. If you experience a TLS issue:

  • Check the status of the server: initiate a scan at a free TLS checking service, such as SSL Labs: look for any issue with the certificate chain, such as missing or expired certificates, old or too new ciphers not supported by your client, etc.

  • Identify the server that your client failed to establish a TLS connection to (in case of pypath, look into the log)

  • Identify your software that contains the TLS client: in case of pypath, it uses pycurl, a Python module built on libcurl

  • Identify the provider of the client software: it can be PyPI, Anaconda, your operating system, etc.

  • Find out which certificate store that software uses: most of them use the store from your operating system, but for example Java or Mozilla Firefox come with their own certificates

  • Check if the certificate store is up-to-date, update if necessary

  • Alternatively, identify the missing root certificate and add it manually to the store; you can also add a non-root certificate if the server has a serious issue and the chain cannot be followed up to a valid root certificate
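
As a starting point for the certificate store steps above, Python's standard ssl module can report where it looks for CA certificates. Note the limitation: this shows the store used by Python's own TLS client, while pypath downloads through pycurl and libcurl, which may be configured to use a different store. Treat the output as a hint, not as proof of what libcurl uses.

```python
import ssl

# Where Python's own ssl module looks for CA certificates by default.
paths = ssl.get_default_verify_paths()
print('CA file:', paths.cafile or paths.openssl_cafile)
print('CA directory:', paths.capath or paths.openssl_capath)

# How many CA certificates the default context has loaded so far.
# Certificates in a CA directory are loaded on demand, so this may be 0.
context = ssl.create_default_context()
stats = context.cert_store_stats()
print('Loaded CA certificates:', stats['x509_ca'])
```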

Please open TLS-related issues for our software only if you:

  • Experience a server-side issue with omnipathdb.org

  • Have a strong reason to think that the cause is in the code written by us, or that it can be easily fixed within our code

Resources§

[2]:
from pypath import resources
rc = resources.get_controller()
rc

executed in 0ms, finished 14:27:45 2022-12-03

[2]:
<pypath.resources.controller.ResourceController at 0x6cc25e25dcf0>

Licenses§

SIGNOR is licensed under CC BY-SA, which allows commercial (for-profit) use:

[3]:
rc.license('SIGNOR'), rc.license('SIGNOR').commercial

executed in 0ms, finished 14:27:47 2022-12-03

[3]:
(<License CC BY-SA 4.0>, True)
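
A typical use of the commercial attribute is to select only the resources that may be used for profit. The sketch below shows the idea with a hypothetical, simplified `License` class and a made-up resource called ExampleDB; it does not use pypath's own license objects.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class License:
    """Hypothetical, simplified license record."""
    name: str
    commercial: bool


licenses = {
    # From the output above: SIGNOR allows commercial use.
    'SIGNOR': License('CC BY-SA 4.0', commercial=True),
    # 'ExampleDB' is a made-up resource with a non-commercial license.
    'ExampleDB': License('CC BY-NC 4.0', commercial=False),
}


def commercially_usable(resource_licenses):
    """Return the names of resources whose license allows commercial use."""
    return sorted(
        name
        for name, lic in resource_licenses.items()
        if lic.commercial
    )


usable = commercially_usable(licenses)
print(usable)
```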

Resource information§

[4]:
rc['MatrixDB']

executed in 0ms, finished 14:27:49 2022-12-03

[4]:
{'yearUsedRelease': 2015,
 'releases': [2009, 2011, 2015],
 'urls': {'articles': ['http://bioinformatics.oxfordjournals.org/content/25/5/690.long',
   'http://nar.oxfordjournals.org/content/43/D1/D321.long',
   'http://nar.oxfordjournals.org/content/39/suppl_1/D235.long'],
  'webpages': ['http://matrixdb.univ-lyon1.fr/'],
  'omictools': ['http://omictools.com/matrixdb-tool']},
 'pubmeds': [19147664, 20852260, 25378329],
 'taxons': ['mammalia'],
 'annot': ['experiment'],
 'recommend': ['small, literature curated interaction resource; many interactions for',
  'receptors and extracellular proteins'],
 'descriptions': ['Protein data were imported from the UniProtKB/Swiss-Prot database (Bairoch et',
  'al., 2005) and identified by UniProtKB/SwissProt accession numbers. In order to',
  'list all the partners of a protein, interactions are associated by default to the',
  'accession number of the human protein. The actual source species used in experiments is',
  'indicated in the page repor
Output truncated: showing 1000 of 4479 characters

Resource definitions for a certain database or dataset§

Note: This does not work yet for all databases and datasets, but likely in the near future this will be the preferred method to access resource definitions.

[197]:
rc.collect_enzyme_substrate()

executed in 0ms, finished 20:08:29 2022-12-02

[197]:
[<EnzymeSubstrateResource: phosphoELM>,
 <EnzymeSubstrateResource: dbPTM>,
 <EnzymeSubstrateResource: SIGNOR>,
 <EnzymeSubstrateResource: HPRD>,
 <EnzymeSubstrateResource: Li2012>,
 <EnzymeSubstrateResource: DEPOD>,
 <EnzymeSubstrateResource: PhosphoSite>,
 <EnzymeSubstrateResource: PhosphoNetworks>,
 <EnzymeSubstrateResource: MIMP>,
 <EnzymeSubstrateResource: ProtMapper>,
 <EnzymeSubstrateResource: KEA>]

The resource definitions carry all information necessary to load the resource, for example:

[202]:
phosphoelm = rc.collect_enzyme_substrate()[0]
phosphoelm.input_method, phosphoelm.id_type_enzyme

executed in 0ms, finished 20:09:51 2022-12-02

[202]:
('phosphoelm.phosphoelm_enzyme_substrate', 'uniprot')
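
An input_method string like the one above names a module and a function. A minimal sketch of how such a dotted name can be turned into a callable follows. It assumes the string is resolved relative to a parent package (for pypath that would presumably be pypath.inputs); the `resolve` helper is hypothetical, and the exact mechanism in pypath may differ. The demonstration uses the standard library so it runs without pypath.

```python
import importlib


def resolve(dotted_name, package=None):
    """Turn 'module.function' into the object it names."""
    module_name, attr_name = dotted_name.rsplit('.', 1)
    if package:
        # Resolve the module relative to a parent package.
        module_name = f'{package}.{module_name}'
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Absolute name: the `dumps` function of the `json` module.
dumps = resolve('json.dumps')
print(dumps([1, 2]))

# Name relative to a package, analogous to 'phosphoelm.phosphoelm_enzyme_substrate'.
encoder_class = resolve('encoder.JSONEncoder', package='json')
print(encoder_class.__name__)
```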

Building networks§

For this you will need the Network class from the pypath.core.network module, which takes care of building and querying the network. You also need the pypath.resources.network module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc.). These input settings tell pypath how to download and process the data.

[13]:
from pypath.core import network
from pypath.resources import network as netres

executed in 0ms, finished 13:59:49 2022-12-02

For example, netres.pathway is a collection of databases that fit the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:

[14]:
netres.pathway

executed in 0ms, finished 13:59:52 2022-12-02

[14]:
{'trip': <NetworkResource: TRIP (post_translational, activity_flow)>,
 'spike': <NetworkResource: SPIKE (post_translational, activity_flow)>,
 'signalink3': <NetworkResource: SignaLink3 (post_translational, activity_flow)>,
 'guide2pharma': <NetworkResource: Guide2Pharma (post_translational, activity_flow)>,
 'ca1': <NetworkResource: CA1 (post_translational, activity_flow)>,
 'arn': <NetworkResource: ARN (post_translational, activity_flow)>,
 'nrf2': <NetworkResource: NRF2ome (post_translational, activity_flow)>,
 'macrophage': <NetworkResource: Macrophage (post_translational, activity_flow)>,
 'death': <NetworkResource: DeathDomain (post_translational, activity_flow)>,
 'pdz': <NetworkResource: PDZBase (post_translational, activity_flow)>,
 'signor': <NetworkResource: SIGNOR (post_translational, activity_flow)>,
 'adhesome': <NetworkResource: Adhesome (post_translational, activity_flow)>,
 'icellnet': <NetworkResource: ICELLNET (post_translational, activity_flow)>,
 'celltalkdb': <Net
Output truncated: showing 1000 of 1864 characters

You can pass such a dictionary to the load method of the network.Network object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your home directory. For this reason, loading a resource for the first time might take long, but next time it will be faster, as the data will be fetched from the cache. First create a pypath.core.network.Network object, then build the network:

[15]:
n = network.Network()
n.load(netres.pathway)

executed in 32.90s, finished 14:00:36 2022-12-02

[16]:
n

executed in 0ms, finished 14:02:23 2022-12-02

[16]:
<Network: 6833 nodes, 25607 interactions>

You can add more resource sets in a similar way:

[18]:
n.load(netres.enzyme_substrate)

executed in 30.04s, finished 14:04:29 2022-12-02

[19]:
n

executed in 0ms, finished 14:05:38 2022-12-02

[19]:
<Network: 7979 nodes, 35550 interactions>

To load a single resource, simply pass the NetworkResource directly:

[20]:
n.load(netres.interaction['matrixdb'])

executed in 0ms, finished 14:05:42 2022-12-02

[21]:
n

executed in 0ms, finished 14:05:44 2022-12-02

[21]:
<Network: 8002 nodes, 35748 interactions>
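
The node and interaction counts above grow by less than the size of each added resource, because interactions already present are merged rather than duplicated. The sketch below illustrates this with toy data. The resource names, protein labels and the `merge_resources` helper are all made up, and the real Network object keeps track of far more attributes than the set of supporting resources.

```python
import collections


def merge_resources(resources):
    """Merge {resource: [(source, target), ...]} into one edge table."""
    edges = collections.defaultdict(set)
    for resource, interactions in resources.items():
        for source, target in interactions:
            # The same pair from several resources becomes a single edge.
            edges[(source, target)].add(resource)
    return dict(edges)


# Toy input with made-up resource names and protein labels.
toy_resources = {
    'ResourceA': [('P1', 'P2'), ('P2', 'P3')],
    'ResourceB': [('P1', 'P2'), ('P3', 'P4')],
}

merged = merge_resources(toy_resources)
nodes = {node for edge in merged for node in edge}
print(len(nodes), 'nodes,', len(merged), 'interactions')
```

Four input records yield three interactions here, with the shared one annotated by both resources.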

Which network datasets are pre-defined in pypath?§

You can find all the pre-defined datasets in the pypath.resources.network module. Currently this module is a wrapper around an older module, pypath.resources.data_formats, where the actual definitions are written. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; however, since then we have added a great variety of other kinds of resource definitions. Here we give an overview of these.

  • pypath.resources.network.pathway: activity flow networks with literature references

  • pypath.resources.network.activity_flow: synonym for pathway

  • pypath.resources.network.pathway_noref: activity flow networks without literature references

  • pypath.resources.network.pathway_all: all activity flow data

  • pypath.resources.network.ptm: enzyme-substrate interaction networks with literature references

  • pypath.resources.network.enzyme_substrate: synonym for ptm

  • pypath.resources.network.ptm_noref: enzyme-substrate networks without literature references

  • pypath.resources.network.ptm_all: all enzyme-substrate data

  • pypath.resources.network.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)

  • pypath.resources.network.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)

  • pypath.resources.network.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath

  • pypath.resources.network.transcription: transcriptional regulation only from the DoRothEA data

  • pypath.resources.network.mirna_target: miRNA-mRNA interactions from literature curated resources

  • pypath.resources.network.tf_mirna: transcriptional regulation of miRNA from literature curated resources

  • pypath.resources.network.lncrna_protein: lncRNA-protein interactions from literature curated datasets

  • pypath.resources.network.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources

  • pypath.resources.network.pathwaycommons: the PathwayCommons database

  • pypath.resources.network.reaction: process description databases; not guaranteed to work at this moment

  • pypath.resources.network.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment

  • pypath.resources.network.small_molecule_protein: signaling interactions between small molecules and proteins

To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:

[22]:
netres.pathway.keys()

executed in 0ms, finished 14:05:57 2022-12-02

[22]:
dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'icellnet', 'celltalkdb', 'cellchatdb', 'connectomedb', 'talklr', 'cellinker', 'scconnect', 'hpmr', 'cellphonedb', 'ramilowski2015', 'lrdb', 'baccin2019'])
[23]:
[resource.name for resource in netres.pathway.values()]

executed in 0ms, finished 14:06:00 2022-12-02

[23]:
['TRIP',
 'SPIKE',
 'SignaLink3',
 'Guide2Pharma',
 'CA1',
 'ARN',
 'NRF2ome',
 'Macrophage',
 'DeathDomain',
 'PDZBase',
 'SIGNOR',
 'Adhesome',
 'ICELLNET',
 'CellTalkDB',
 'CellChatDB',
 'connectomeDB2020',
 'talklr',
 'Cellinker',
 'scConnect',
 'HPMR',
 'CellPhoneDB',
 'Ramilowski2015',
 'LRdb',
 'Baccin2019']

The resource definitions above carry all the information about how to load the resource: which function to call, how to process the identifiers, references, directions, and all other attributes from the input. E.g. which column from SPIKE corresponds to the source node? Which identifier type is used? It is the second column (index 1, counting from zero), and it has gene symbols in it:

[24]:
netres.pathway['spike'].networkinput.id_col_a, netres.pathway['spike'].networkinput.id_type_a

executed in 0ms, finished 14:06:07 2022-12-02

[24]:
(1, 'genesymbol')

The Network object§

Once you have built a network, you can use it for various purposes and write your own scripts for further processing or analysis. Below we create a Network object and populate it with the pathway dataset.

Warning: it is recommended to access databases through the manager. Running the code below takes a long time and does not save or reload the database; it builds a fresh copy each time.

[2]:
from pypath.core import network
from pypath.resources import network as netres

n = network.Network()
n.load(netres.pathway)
n

executed in 36.07s, finished 14:15:48 2022-12-02

[2]:
<Network: 6833 nodes, 25607 interactions>

Almost all data is stored in Network.interactions, a dict with node pairs as keys and interactions as values:

[3]:
n.interactions

executed in 0ms, finished 14:17:02 2022-12-02

[3]:
{(<Entity: TRPC1>,
  <Entity: KCNMA1>): <Interaction: TRPC1 ============= KCNMA1 [Evidences: TRIP (2 references)]>,
 (<Entity: TRPC1>,
  <Entity: PPP3CA>): <Interaction: TRPC1 ============= PPP3CA [Evidences: TRIP (1 references)]>,
 (<Entity: CALM2>,
  <Entity: TRPC1>): <Interaction: CALM2 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM3>,
  <Entity: TRPC1>): <Interaction: CALM3 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM1>,
  <Entity: TRPC1>): <Interaction: CALM1 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CASP1>,
  <Entity: TRPC1>): <Interaction: CASP1 ============= TRPC1 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CASP4>): <Interaction: TRPC1 ============= CASP4 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CACNA1C>): <Interaction: TRPC1 ============= CACNA1C [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CAV1>): <Interaction: TRPC1 <=(+)======== CAV1 [Ev
Output truncated: showing 1000 of 118492 characters

The dict under Network.nodes is kept in sync with the interactions, and facilitates node access. Keys are primary identifiers (for proteins UniProt IDs by default), values are Entity objects:

[26]:
n.nodes

executed in 0ms, finished 14:06:21 2022-12-02

[26]:
{'P48995': <Entity: TRPC1>,
 'Q12791': <Entity: KCNMA1>,
 'Q08209': <Entity: PPP3CA>,
 'P0DP24': <Entity: CALM2>,
 'P0DP25': <Entity: CALM3>,
 'P0DP23': <Entity: CALM1>,
 'P29466': <Entity: CASP1>,
 'P49662': <Entity: CASP4>,
 'Q13936': <Entity: CACNA1C>,
 'Q03135': <Entity: CAV1>,
 'P56539': <Entity: CAV3>,
 'Q14247': <Entity: CTTN>,
 'P14416': <Entity: DRD2>,
 'P11532': <Entity: DMD>,
 'P11362': <Entity: FGFR1>,
 'Q02790': <Entity: FKBP4>,
 'Q86YM7': <Entity: HOMER1>,
 'Q9NSC5': <Entity: HOMER3>,
 'Q99750': <Entity: MDFI>,
 'Q14571': <Entity: ITPR2>,
 'Q14573': <Entity: ITPR3>,
 'P29966': <Entity: MARCKS>,
 'Q13255': <Entity: GRM1>,
 'P20591': <Entity: MX1>,
 'P62166': <Entity: NCS1>,
 'Q96D31': <Entity: ORAI1>,
 'Q96SN7': <Entity: ORAI2>,
 'Q9BRQ5': <Entity: ORAI3>,
 'P11171': <Entity: EPB41>,
 'P61586': <Entity: RHOA>,
 'Q9Y225': <Entity: RNF24>,
 'P21817': <Entity: RYR1>,
 'P16615': <Entity: ATP2A2>,
 'Q93084': <Entity: ATP2A3>,
 'P60880': <Entity: SNAP25>,
 'Q13586': <Entity: STI
Output truncated: showing 1000 of 30573 characters

An interaction between a pair of entities can be accessed:

[27]:
n.interaction('EGF', 'EGFR')

executed in 0ms, finished 14:06:27 2022-12-02

[27]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>

Similarly, individual nodes can be looked up:

[28]:
n.entity('EGFR')

executed in 0ms, finished 14:06:29 2022-12-02

[28]:
<Entity: EGFR>

Labels (gene symbols for proteins by default), identifiers (such as UniProt IDs) and Entity objects can be used to refer to nodes. Each node carries some basic information:

[29]:
egfr = n.entity('EGFR')
egfr.identifier, egfr.label, egfr.entity_type, egfr.id_type, egfr.taxon

executed in 0ms, finished 14:06:32 2022-12-02

[29]:
('P00533', 'EGFR', 'protein', 'uniprot', 9606)

Interactions feature a number of methods to access various information, such as their types, direction, effect, resources, references, etc. The very same methods are also available for the whole network. Below we only show a few examples of these methods.

[30]:
ia = n.interaction('EGF', 'EGFR')
ia

executed in 0ms, finished 14:06:34 2022-12-02

[30]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>
[31]:
ia.get_resource_names()

executed in 0ms, finished 14:06:47 2022-12-02

[31]:
{'Baccin2019',
 'CellTalkDB',
 'HPMR',
 'ICELLNET',
 'LRdb',
 'SIGNOR',
 'SPIKE',
 'SignaLink3',
 'connectomeDB2020'}
[32]:
ia.get_references()

executed in 0ms, finished 14:06:50 2022-12-02

[32]:
{<Reference: 10085134>,
 <Reference: 10209155>,
 <Reference: 10788520>,
 <Reference: 12093292>,
 <Reference: 12297050>,
 <Reference: 12620237>,
 <Reference: 12648462>,
 <Reference: 15620700>,
 <Reference: 16274239>,
 <Reference: 17145710>,
 <Reference: 19531499>,
 <Reference: 20458382>,
 <Reference: 21071413>,
 <Reference: 23331499>,
 <Reference: 3494473>,
 <Reference: 6289330>,
 <Reference: 8639530>}

This is a valid direction for this interaction:

[33]:
ia.get_direction(('EGF', 'EGFR'))

executed in 0ms, finished 14:06:53 2022-12-02

[33]:
True

The opposite direction is not supported by any of the resources:

[34]:
ia.get_direction(('EGFR', 'EGF'))

executed in 0ms, finished 14:06:55 2022-12-02

[34]:
False

However, some resources provide no direction information; these are classified as “undirected”:

ia.get_direction('undirected')

We can check which resources are those exactly:

[35]:
ia.get_direction('undirected', sources = True)

executed in 0ms, finished 14:07:23 2022-12-02

[35]:
{'HPMR', 'SPIKE'}

Effect signs (stimulation, inhibition) are available in a similar way. The first of the two Boolean values means stimulation (activation), the second one inhibition.

[36]:
ia.get_sign(('EGF', 'EGFR'))

executed in 0ms, finished 14:07:25 2022-12-02

[36]:
[True, False]

Which resources support the effect signs:

[37]:
ia.get_sign(('EGF', 'EGFR'), sources = True)

executed in 0ms, finished 14:07:28 2022-12-02

[37]:
[{'SIGNOR', 'SPIKE', 'SignaLink3'}, set()]
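The pair of Booleans returned by get_sign can be turned into a readable label. The helper below is only a minimal sketch: describe_sign is a hypothetical name, not part of pypath, and it relies on nothing but the [stimulation, inhibition] order stated above.

```python
def describe_sign(sign):
    """Turn a [stimulation, inhibition] pair into a readable label.

    Hypothetical helper, not part of pypath; it assumes only the order
    of the two Boolean values described in the text.
    """

    stimulation, inhibition = sign

    if stimulation and inhibition:
        return 'stimulation and inhibition'

    if stimulation:
        return 'stimulation'

    if inhibition:
        return 'inhibition'

    return 'unknown effect'


# The value shown above for the EGF -> EGFR direction
print(describe_sign([True, False]))
```

With a real interaction, the same helper could be called as describe_sign(ia.get_sign(('EGF', 'EGFR'))).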

Many methods start with get_..., such as:

[38]:
ia.get_interaction_types()

executed in 0ms, finished 14:07:30 2022-12-02

[38]:
{'post_translational'}

Others are called ..._by_...; these combine two get_... methods:

[39]:
ia.references_by_resource()

executed in 0ms, finished 14:07:32 2022-12-02

[39]:
{'ICELLNET': {<Reference: 8639530>},
 'SIGNOR': {<Reference: 12297050>, <Reference: 12648462>},
 'SignaLink3': {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 19531499>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 'Baccin2019': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'LRdb': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'SPIKE': {<Reference: 12297050>,
  <Reference: 17145710>,
  <Reference: 20458382>,
  <Reference: 3494473>},
 'CellTalkDB': {<Reference: 12093292>},
 'connectomeDB2020': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'HPMR': {<Reference: 6289330>}}
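A dict of sets like the one above is easy to turn around, for example to see which resources share a given reference. The sketch below works on plain Python data: the PubMed IDs are copied from the output above as strings (in pypath they are Reference objects), and invert_dict_of_sets is a hypothetical helper name, not a pypath method.

```python
from collections import defaultdict

# A subset of the output above, with PubMed IDs as plain strings
# instead of pypath Reference objects.
refs_by_resource = {
    'ICELLNET': {'8639530'},
    'SIGNOR': {'12297050', '12648462'},
    'SPIKE': {'12297050', '17145710', '20458382', '3494473'},
    'CellTalkDB': {'12093292'},
    'HPMR': {'6289330'},
}


def invert_dict_of_sets(dict_of_sets):
    """Invert a key -> set of values dict into value -> set of keys."""

    inverted = defaultdict(set)

    for key, values in dict_of_sets.items():
        for value in values:
            inverted[value].add(key)

    return dict(inverted)


resources_of_ref = invert_dict_of_sets(refs_by_resource)

# Resources that cite PubMed ID 12297050 in this subset
print(sorted(resources_of_ref['12297050']))
```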

All these methods accept the same filtering parameters. If you are interested only in certain resources, you can restrict the query to those. For example, the two resources below provide no interaction with positive sign:

[40]:
ia.get_interactions_positive(resources = {'ICELLNET', 'HPMR'})

executed in 0ms, finished 14:07:39 2022-12-02

[40]:
()

While some other resources do:

[41]:
ia.get_interactions_positive(resources = {'SignaLink3'})

executed in 0ms, finished 14:07:42 2022-12-02

[41]:
((<Entity: EGF>, <Entity: EGFR>),)

Or see the references that do or do not provide effect sign:

[42]:
ia.get_references(effect = True), ia.get_references(effect = False)

executed in 0ms, finished 14:07:44 2022-12-02

[42]:
({<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 12297050>,
  <Reference: 12648462>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 12648462>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 17145710>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>,
  <Reference: 3494473>,
  <Reference: 6289330>,
  <Reference: 8639530>})

Network in pandas.DataFrame§

Contents of a pypath.core.network.Network object can be exported to a pandas.DataFrame:

[1]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu.make_df()
cu.df

executed in 23.41s, finished 15:24:19 2022-12-03

[1]:
       id_a    id_b    type_a   type_b   directed  effect  type                dmodel              sources    references
0      P48995  Q12791  protein  protein  False     0       post_translational  {activity_flow}     {TRIP}     NaN
1      P48995  Q08209  protein  protein  False     0       post_translational  {activity_flow}     {TRIP}     NaN
2      P0DP23  P48995  protein  protein  True      -1      post_translational  {activity_flow}     {TRIP}     NaN
3      P0DP25  P48995  protein  protein  True      -1      post_translational  {activity_flow}     {TRIP}     NaN
4      P0DP24  P48995  protein  protein  True      -1      post_translational  {activity_flow}     {TRIP}     NaN
...    ...     ...     ...      ...      ...       ...     ...                 ...                 ...        ...
44033  Q14289  Q9ULZ3  protein  protein  True      0       post_translational  {enzyme_substrate}  {iPTMnet}  NaN
44034  P54646  Q9Y2I7  protein  protein  True      0       post_translational  {enzyme_substrate}  {iPTMnet}  NaN
44035  Q9BXM7  Q9Y2N7  protein  protein  True      0       post_translational  {enzyme_substrate}  {iPTMnet}  NaN
44036  P49137  Q9Y385  protein  protein  True      0       post_translational  {enzyme_substrate}  {iPTMnet}  NaN
44037  Q9UHC7  P04637  protein  protein  True      0       post_translational  {enzyme_substrate}  {iPTMnet}  NaN

44038 rows × 10 columns

Translating identifiers§

The pypath.utils.mapping module handles ID translation. Most of the time you can simply call the map_name function:

[4]:
from pypath.utils import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')

executed in 0ms, finished 14:17:27 2022-12-02

[4]:
{'EGFR'}

By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous, and you want to retrieve only one ID. The map_name0 function returns a single string; in case of ambiguity, it returns a random element from the resulting set:

[5]:
mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')

executed in 0ms, finished 14:17:31 2022-12-02

[5]:
'Q9BY60'
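Because map_name0 picks an arbitrary element when the mapping is ambiguous, the result may differ from run to run. If you need a reproducible choice, one option is to call map_name and select from the set yourself. The helper below is a minimal sketch of this idea: pick_one is a hypothetical name, not a pypath function, and the three calmodulin IDs are borrowed from the node listing earlier in this chapter merely as an example of a set with several elements.

```python
def pick_one(ids):
    """Return one ID from a set reproducibly, or None for an empty set.

    Hypothetical helper: it takes the lexicographically smallest ID, so
    the same set always yields the same answer.
    """

    return min(ids) if ids else None


print(pick_one({'Q9BY60'}))
print(pick_one({'P0DP24', 'P0DP23', 'P0DP25'}))
print(pick_one(set()))
```

With pypath loaded, this could be used as pick_one(mapping.map_name('GABARAPL3', 'genesymbol', 'uniprot')).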

Molecules have a large variety of identifiers, but in pypath two identifier types are special:

  • The primary identifier defines the molecule category, e.g. if UniProt is the primary identifier for proteins, then a protein is anything that has a UniProt ID

  • The label is a human readable identifier; for proteins it’s the gene symbol

The primary ID and label types are configured for each molecule type (protein, miRNA, drug, etc) in the module settings. The mapping module provides shortcuts to translate between these identifiers: label and id_from_label.

[6]:
mapping.label('O75385')

executed in 0ms, finished 14:17:33 2022-12-02

[6]:
'ULK1'
[7]:
mapping.id_from_label('ULK1')

executed in 0ms, finished 14:17:35 2022-12-02

[7]:
{'O75385'}
[8]:
mapping.id_from_label0('ULK1')

executed in 0ms, finished 14:17:37 2022-12-02

[8]:
'O75385'

Multiple IDs can be translated in one call; however, the result is a single set, so it’s not possible to know with certainty which output corresponds to which input.

[9]:
mapping.map_names(['ULK1', 'EGFR', 'SMAD2'], 'genesymbol', 'uniprot')

executed in 0ms, finished 14:17:40 2022-12-02

[9]:
{'O75385', 'P00533', 'Q15796'}
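When the correspondence between inputs and outputs matters, a simple workaround is to translate the names one by one and keep the results in a dict. The sketch below is self-contained and uses a toy table with the values shown in this section; toy_map_name and map_each are hypothetical names, not pypath functions, and toy_map_name only stands in for mapping.map_name.

```python
# Toy stand-in for a translation table; the entries are taken from the
# outputs shown in this section.
TOY_GENESYMBOL_TO_UNIPROT = {
    'ULK1': {'O75385'},
    'EGFR': {'P00533'},
    'SMAD2': {'Q15796'},
}


def toy_map_name(name, table=TOY_GENESYMBOL_TO_UNIPROT):
    """Hypothetical stand-in for mapping.map_name: returns a set of IDs."""

    return set(table.get(name, ()))


def map_each(names, map_one=toy_map_name):
    """Translate names one by one, keeping the input -> output relation."""

    return {name: map_one(name) for name in names}


translated = map_each(['ULK1', 'EGFR', 'SMAD2', 'NOT_A_GENE'])

print(translated['EGFR'])
print(translated['NOT_A_GENE'])
```

With pypath, the stand-in can be replaced by the real function, e.g. map_each(names, lambda name: mapping.map_name(name, 'genesymbol', 'uniprot')). Names that can not be translated end up with an empty set, which makes them easy to spot.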

The default organism is defined in the settings module; it is human by default. Translating for other organisms requires the ncbi_tax_id argument. Most functions in pypath also accept common or Latin names, but map_name accepts only numeric taxon IDs for efficiency. Let’s translate a mouse identifier:

[10]:
mapping.map_name('Smad2', 'genesymbol', 'uniprot', ncbi_tax_id = 10090)

executed in 0ms, finished 14:17:44 2022-12-02

[10]:
{'Q62432'}

If no direct translation table is available between two ID types, pypath will try to translate by an intermediate ID type.

[11]:
mapping.map_name('8408', 'entrez', 'genesymbol')

executed in 0ms, finished 14:17:46 2022-12-02

[11]:
{'ULK1'}

Behind the scenes the chain_map function is called:

[12]:
m = mapping.get_mapper()
m.chain_map('8408', id_type = 'entrez', target_id_type = 'genesymbol', by_id_type = 'uniprot')

executed in 0ms, finished 14:17:47 2022-12-02

[12]:
{'ULK1'}

And the procedure corresponds to the following:

[13]:
mapping.map_names(
    mapping.map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

executed in 0ms, finished 14:17:49 2022-12-02

[13]:
{'ULK1'}

Pre-defined ID translation tables§

A number of mapping tables are pre-defined; these load automatically on demand and are removed from memory if not used for some time (5 minutes by default). New mapping tables are saved directly into pickle files in the cache for a quick reload. Tables are either organism specific (hence loaded for each organism one-by-one), or non-organism specific, such as drug IDs (pypath uses integer 0 in this case in place of the numeric NCBI Taxonomy ID). The identifier translation data is retrieved from the following sources:

  • UniProt legacy API (main UniProt API until autumn 2022): internals.input_formats.UniprotMapping

  • UniProt uploadlists API (also outdated, replaced by the new UniProt API): internals.input_formats.UniprotListMapping

  • Ensembl Biomart: internals.input_formats.BiomartMapping and internals.input_formats.ArrayMapping (for microarray probes)

  • Protein Ontology Consortium: internals.input_formats.ProMapping

  • UniChem: internals.input_formats.UnichemMapping

  • Arbitrary files: internals.input_formats.FileMapping (this class is used to process data from miRBase, some files from the UniProt FTP site, and also user defined, custom cases)
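The on-demand loading and expiry of mapping tables mentioned before the list can be pictured with a small cache that remembers when each table was last used. This is only an illustration of the idea under our own assumptions, not pypath's actual implementation: ExpiringTables, its arguments and the loader function are all hypothetical.

```python
import time


class ExpiringTables:
    """Load tables on demand and drop those unused for `lifetime` seconds.

    Illustration only; this is not how pypath's Mapper is implemented.
    """

    def __init__(self, loader, lifetime=300, clock=time.monotonic):

        self.loader = loader        # callable: key -> table
        self.lifetime = lifetime    # seconds; 300 s = 5 minutes
        self.clock = clock          # replaceable, e.g. for testing
        self.tables = {}            # key -> table
        self.last_used = {}         # key -> time of last access

    def get(self, key):
        """Return the table for `key`, loading it if necessary."""

        self.cleanup()

        if key not in self.tables:
            self.tables[key] = self.loader(key)

        self.last_used[key] = self.clock()

        return self.tables[key]

    def cleanup(self):
        """Remove every table that has not been used within `lifetime`."""

        now = self.clock()
        expired = [
            key
            for key, used in self.last_used.items()
            if now - used > self.lifetime
        ]

        for key in expired:
            del self.tables[key]
            del self.last_used[key]


# Usage with a fake clock and a loader that records its calls
loads = []
fake_time = [0.0]


def load_table(key):
    loads.append(key)
    return {'8408': {'O75385'}}


cache = ExpiringTables(load_table, lifetime=300, clock=lambda: fake_time[0])
key = ('entrez', 'uniprot', 9606)

cache.get(key)        # first access: the table is loaded
cache.get(key)        # second access: served from memory
fake_time[0] = 301.0  # more than five minutes later
cache.get(key)        # the table expired in the meantime: loaded again

print(len(loads))
```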

Some of the classes above are instantiated in internals.maps, but most of the instances are created on the fly when loading a mapping table in utils.mapping.MapReader. This latter class is responsible for taking a table definition and loading a utils.mapping.MappingTable instance. The whole process is managed by utils.mapping.Mapper; this is the object to which all ID translation queries are dispatched. It has a method to list the defined ID translation tables:

[14]:
m = mapping.get_mapper()
m.mapping_tables()

executed in 0ms, finished 14:17:53 2022-12-02

[14]:
[MappingTableDefinition(id_type_a='embl', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(embl)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(PREFERRED)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol-syn', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(ALTERNATIVE)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='entrez', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(geneid)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='hgnc', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(HGNC)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='refseqp', id_type_b='uniprot', resource='uniprot', input_cl
Output truncated: showing 1000 of 22169 characters

Pypath uses synonyms to refer to ID types: these are intended to be short, clear and lowercase for ease of use. Most of the synonyms are defined in internals.input_formats, in the AC_QUERY, AC_MAPPING, BIOMART_MAPPING, PRO_MAPPING and ARRAY_MAPPING dictionaries. UniChem ID types are used exactly as provided by UniChem. To list all available ID types (in the output below, pypath is the synonym used in pypath and original is the name in the original resource):

[15]:
m = mapping.get_mapper()
m.id_types()

executed in 0ms, finished 14:17:58 2022-12-02

[15]:
{IdType(pypath='MedChemExpress', original='MedChemExpress'),
 IdType(pypath='actor', original='actor'),
 IdType(pypath='affy', original='affy'),
 IdType(pypath='affymetrix', original='affymetrix'),
 IdType(pypath='agilent', original='agilent'),
 IdType(pypath='alzforum', original='Alzforum_mut'),
 IdType(pypath='araport', original='Araport'),
 IdType(pypath='atlas', original='atlas'),
 IdType(pypath='bindingdb', original='bindingdb'),
 IdType(pypath='brenda', original='brenda'),
 IdType(pypath='carotenoiddb', original='carotenoiddb'),
 IdType(pypath='cgnc', original='CGNC'),
 IdType(pypath='chebi', original='chebi'),
 IdType(pypath='chembl', original='chembl'),
 IdType(pypath='chemicalbook', original='chemicalbook'),
 IdType(pypath='clinicaltrials', original='clinicaltrials'),
 IdType(pypath='codelink', original='codelink'),
 IdType(pypath='comptox', original='comptox'),
 IdType(pypath='dailymed', original='dailymed'),
 IdType(pypath='dailymed_old', original='dailymed_old'),
 IdType(py
Output truncated: showing 1000 of 6649 characters

Homology translation§

The utils.homology module handles translation of data between organisms by orthologous gene pairs. Its most important function is translate. The source organism is human by default; the target must be provided. Below we use mouse (NCBI Taxonomy 10090):

[4]:
from pypath.utils import homology
homology.translate('P00533', target = 10090)

executed in 0ms, finished 15:36:30 2022-12-03

[4]:
{'Q01279'}

ID translation and homology translation are integrated, hence not only UniProt IDs can be translated:

[17]:
homology.translate('EGFR', target = 10090, id_type = 'genesymbol')

executed in 0ms, finished 14:18:05 2022-12-02

[17]:
{'Egfr'}

This module uses data from NCBI HomoloGene and Ensembl. The latter covers more organisms, and accepts some parameters (high confidence, one-to-one vs. one-to-many mapping). These parameters can be controlled by the settings module, or passed to the functions above and below. The settings below fail to find any ortholog of our example protein, presumably because both homologene and ensembl are set to False, which leaves no data source to query:

[18]:
homology.translate('P00533', target = 10090, homologene = False, ensembl = False, ensembl_hc = False, ensembl_types = 'one2one')

executed in 0ms, finished 14:18:07 2022-12-02

[18]:
set()

Homology translation tables as dictionaries§

The translation tables are available as dicts of sets, these are convenient for use outside of pypath:

[19]:
human_mouse_genesymbols = homology.get_dict(target = 'mouse', id_type = 'genesymbol')
human_mouse_genesymbols['EGFR']

executed in 1ms, finished 14:18:09 2022-12-02

[19]:
{'Egfr'}
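Since these tables are plain dicts of sets, translating a whole gene list needs only a few lines of ordinary Python. The sketch below uses a toy dict in the same shape: the EGFR pair is the one shown above, the KRTAP4-16 orthologs appear in the data frame in the next subsection, and translate_genes is a hypothetical helper name.

```python
# Toy dict of sets in the shape returned by homology.get_dict
human_mouse = {
    'EGFR': {'Egfr'},
    'KRTAP4-16': {'Gm40460', 'Gm45618', 'Gm4559'},
}


def translate_genes(genes, table):
    """Translate a list of genes; return (translated, untranslated)."""

    translated = set()
    untranslated = []

    for gene in genes:

        orthologs = table.get(gene)

        if orthologs:
            # one-to-many pairs contribute all of their orthologs
            translated |= orthologs
        else:
            untranslated.append(gene)

    return translated, untranslated


mouse_genes, missing = translate_genes(
    ['EGFR', 'KRTAP4-16', 'NO_ORTHOLOG'],
    human_mouse,
)

print(mouse_genes)
print(missing)
```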

Homology translation data frames§

Similarly, pandas.DataFrames are available:

[5]:
human_mouse_genesymbols = homology.get_df(target = 'mouse', id_type = 'genesymbol')
human_mouse_genesymbols

executed in 2.08s, finished 15:36:36 2022-12-03

[5]:
source target
0 MICOS13 Micos13
1 FAT4 Fat4
2 RARS2 Rars2
4 ZFP36L2 Zfp36l2
5 LAMC1 Lamc1
... ... ...
24843 KRTAP4-16 Gm40460
24844 KRTAP4-16 Gm45618
24845 KRTAP4-16 Gm4559
24846 IGKV1OR2-108 Igkv20-101-2
24847 FPGT-TNNI3K Tnni3k

22266 rows × 2 columns

Taxonomy§

Organisms matter everywhere in pypath: in the input, the output and the processing parts alike. For this reason we created a utility module to deal with translation of organism identifiers. We prefer NCBI Taxonomy IDs as the primary organism identifier. These are simple numbers, 9606 is human, 10090 is mouse, etc. Many databases use common English names or Latin (scientific) names. Some databases use custom codes, such as hsapiens in Ensembl (first letter of genus name + species name, without space, all lowercase), or hsa in miRBase and KEGG (first letter of genus name, first two letters of species name). The pypath.utils.taxonomy module features some convenient functions for handling all these names.
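The two code styles described above follow simple rules that can be written down in a few lines. The functions below are a sketch of these rules only; their names are hypothetical, they are not part of pypath, and real databases may deviate from the rule for some organisms, so the taxonomy module remains the safer choice in practice.

```python
def ensembl_style_code(latin_name):
    """First letter of the genus + species name, lowercase ('hsapiens')."""

    genus, species = latin_name.split()[:2]

    return (genus[0] + species).lower()


def three_letter_code(latin_name):
    """First letter of the genus + first two of the species ('hsa')."""

    genus, species = latin_name.split()[:2]

    return (genus[0] + species[:2]).lower()


for name in ('Homo sapiens', 'Mus musculus', 'Bos taurus'):
    print(name, ensembl_style_code(name), three_letter_code(name))
```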

Translating to NCBI Taxonomy, scientific names and common names§

The most often used is ensure_ncbi_tax_id, which returns the NCBI Taxonomy ID for any comprehensible input:

[21]:
from pypath.utils import taxonomy
taxonomy.ensure_ncbi_tax_id('human'), taxonomy.ensure_ncbi_tax_id('H sapiens'), taxonomy.ensure_ncbi_tax_id('hsapiens'), taxonomy.ensure_ncbi_tax_id(9606), taxonomy.ensure_ncbi_tax_id('Homo sapiens')

executed in 0ms, finished 14:18:22 2022-12-02

[21]:
(9606, 9606, 9606, 9606, 9606)

To access scientific names or common names:

[22]:
taxonomy.ensure_latin_name('cow')

executed in 0ms, finished 14:18:25 2022-12-02

[22]:
'Bos taurus'
[23]:
taxonomy.ensure_common_name('Erithacus rubecula')

executed in 0ms, finished 14:18:27 2022-12-02

[23]:
'European robin'

Organism from UniProt ID§

The uniprot_taxid function returns the taxonomy ID for a SwissProt ID. Unfortunately it does not work for TrEMBL IDs, as that would require keeping too much data in memory.

[24]:
taxonomy.ensure_latin_name(taxonomy.uniprot_taxid('P53104'))

executed in 1.19s, finished 14:18:30 2022-12-02

[24]:
'Saccharomyces cerevisiae'

UniProt§

UniProt is a huge, diverse resource that is essential for pypath as we use it as a reference set for proteomes and it provides ID translation data. Its input module pypath.inputs.uniprot is already more complex than an average input module. It harbors a little database manager that loads and unloads tables on demand, ensuring fast and convenient operation. Further services are available in the pypath.utils.uniprot module.

The UniProt input module§

All UniProt IDs for one organism§

The complete set of UniProt IDs for an organism is considered to be the proteome of the organism, and it is used in many procedures across pypath. All SwissProt IDs, all TrEMBL IDs or both together can be retrieved:

[119]:
from pypath.inputs import uniprot as iuniprot
(
    len(iuniprot.all_uniprots(organism = 10090)),
    len(iuniprot.all_swissprots(organism = 10090)),
    len(iuniprot.all_trembls(organism = 10090)),
)

executed in 3m 33.99s, finished 16:07:43 2022-12-02

[119]:
(86440, 17131, 69300)

UniProt ID format validation§

UniProt defines a format for its accessions; any string can be checked against this template to tell whether it is possibly a valid ID:

[124]:
from pypath.inputs import uniprot as iuniprot
iuniprot.valid_uniprot('A0A8D0H0C2')

executed in 0ms, finished 16:17:41 2022-12-02

[124]:
True
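A format check of this kind can be done with a single regular expression. The pattern below is the one published in the UniProt help pages for accession numbers; whether valid_uniprot uses exactly this expression is an assumption on our side, and looks_like_uniprot is a hypothetical name.

```python
import re

# Accession pattern as published in the UniProt help pages; whether
# pypath uses exactly this expression is an assumption.
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|'
    r'[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot(name):
    """Tell if the whole string matches the accession format."""

    return UNIPROT_AC.fullmatch(name) is not None


for name in ('A0A8D0H0C2', 'P00533', 'EGFR'):
    print(name, looks_like_uniprot(name))
```

Note that passing this check only means the string has the right shape; it does not tell whether the accession exists in UniProt.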

UniProt ID validation§

Other functions check whether an ID indeed exists in UniProt. These functions require loading the list of all UniProt IDs for the organism, hence calling them for the first time might take a few minutes (if a new download is necessary). Subsequent calls will be much faster.

[125]:
from pypath.inputs import uniprot as iuniprot
iuniprot.is_uniprot('P00533')

executed in 0ms, finished 16:17:44 2022-12-02

[125]:
True
[122]:
iuniprot.is_swissprot('P00533')

executed in 0ms, finished 16:14:14 2022-12-02

[122]:
True

If the organism doesn’t match:

[123]:
iuniprot.is_uniprot('P00533', organism = 10090)

executed in 0ms, finished 16:15:07 2022-12-02

[123]:
False

Single UniProt protein datasheet§

Raw contents of protein datasheets can be retrieved. The structure is a Python list with tuples of two elements, the first is the tag of the line, the second is the line content.

[126]:
from pypath.inputs import uniprot as iuniprot
iuniprot.protein_datasheet('P00533')

executed in 0ms, finished 16:18:06 2022-12-02

[126]:
[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
Output truncated: showing 1000 of 58080 characters
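A list of (tag, line) tuples is simple to process further. As a minimal sketch, the code below groups the lines by their tag and collects all accessions from the AC lines; the data is a few lines copied from the output above, while group_by_tag and accessions are hypothetical helper names, not pypath functions.

```python
from collections import defaultdict

# A few lines from the output above
datasheet = [
    ('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
    (
        'AC',
        'P00533; O00688; O00732; P06268; Q14225; '
        'Q68GS5; Q92795; Q9BZS2; Q9GZX1;'
    ),
    ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
    ('OS', 'Homo sapiens (Human).'),
    ('OX', 'NCBI_TaxID=9606;'),
]


def group_by_tag(datasheet):
    """Collect the lines of a datasheet in a dict of lists keyed by tag."""

    grouped = defaultdict(list)

    for tag, line in datasheet:
        grouped[tag].append(line)

    return dict(grouped)


def accessions(datasheet):
    """All accessions from the AC lines, in their original order."""

    acs = []

    for line in group_by_tag(datasheet).get('AC', []):
        acs.extend(ac.strip() for ac in line.split(';') if ac.strip())

    return acs


acs = accessions(datasheet)

print(acs[0], len(acs))
```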

History of UniProt records§

[131]:
from pypath.inputs import uniprot as iuniprot
egfr_history = list(iuniprot.uniprot_history('P00533'))
egfr_history

executed in 0ms, finished 16:21:15 2022-12-02

[131]:
[UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='282', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_03', date='2022-08-03', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='281', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_02', date='2022-05-25', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='280', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_01', date='2022-02-23', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='279', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2021_04', date='2021-09-29', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='278', sequence_version='2', entry_name='EGFR_HUMAN', database='
Output truncated: showing 1000 of 50933 characters
[132]:
iuniprot.uniprot_recent_version('P00533')

executed in 0ms, finished 16:21:57 2022-12-02

[132]:
UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by='')
[133]:
iuniprot.uniprot_history_recent_datasheet('P00533')

executed in 1ms, finished 16:22:33 2022-12-02

[133]:
[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
Output truncated: showing 1000 of 58080 characters

The functions above are able to retrieve the latest datasheet of deleted UniProt records. However, they are slow as several queries are performed to process a single protein.
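Once the history is in a list, questions such as "which entry version was current on a given day" can be answered locally, without further queries. The sketch below uses a reduced stand-in record with only two of the fields shown above and three of the rows from the output; Record and version_on are hypothetical names.

```python
from collections import namedtuple

# Reduced stand-in for UniprotRecordHistory with two of its fields
Record = namedtuple('Record', ['entry_version', 'date'])

# Three rows copied from the output above
history = [
    Record('283', '2022-10-12'),
    Record('282', '2022-08-03'),
    Record('281', '2022-05-25'),
]


def version_on(history, date):
    """Latest record released on or before `date` (ISO format), or None.

    Dates in YYYY-MM-DD format sort correctly as plain strings.
    """

    candidates = [rec for rec in history if rec.date <= date]

    return max(candidates, key=lambda rec: rec.date) if candidates else None


print(version_on(history, '2022-09-01').entry_version)
```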

UniProt legacy API§

UniProt deployed its new API in the autumn of 2022; since then the old API has been available as a legacy option. In pypath this API is well supported. It is accessed by the inputs.uniprot.uniprot_data function, though higher level functions are more convenient for users. A list of fields can be passed to this function. By default it uses only SwissProt. The output is a dict of dicts with fields as top level keys and UniProt IDs as second level keys. The results often contain notes, additional info in parentheses, and prefixes and postfixes for identifiers that are not needed in every situation. Using uniprot_preprocess instead of uniprot_data cleans up some of this clutter.

[135]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_data(field = ('family', 'keywords', 'transmembrane'))

executed in 8.54s, finished 16:32:47 2022-12-02

[135]:
{'family': {'P63120': 'Peptidase A2 family, HERV class-II K(HML-2) subfamily',
  'Q96EC8': 'YIP1 family',
  'Q6ZMS4': 'Krueppel C2H2-type zinc-finger protein family',
  'Q8N8L2': 'Krueppel C2H2-type zinc-finger protein family',
  'Q3MIS6': 'Krueppel C2H2-type zinc-finger protein family',
  'Q86UK7': 'ZNF598 family',
  'Q6P280': 'Krueppel C2H2-type zinc-finger protein family',
  'Q969W1': 'DHHC palmitoyltransferase family',
  'O14978': 'Krueppel C2H2-type zinc-finger protein family',
  'Q15937': 'Krueppel C2H2-type zinc-finger protein family',
  'Q9P2J8': 'Krueppel C2H2-type zinc-finger protein family',
  'Q8IUH4': 'DHHC palmitoyltransferase family, AKR/ZDHHC17 subfamily',
  'Q9Y2D9': 'Krueppel C2H2-type zinc-finger protein family',
  'Q14588': 'Krueppel C2H2-type zinc-finger protein family',
  'Q6XR72': 'Cation diffusion facilitator (CDF) transporter (TC 2.A.4) family, SLC30A subfamily',
  'P58557': 'Endoribonuclease YbeY family',
  'Q9Y5A9': 'YTHDF family, YTHDF2 subfamily',
  'Q8N9L1
Output truncated: showing 1000 of 501768 characters
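
The field-major layout described above (fields first, UniProt IDs second) is often easier to work with after turning it into one record per protein. Below is a minimal sketch of that reshaping in plain Python. It does not call pypath: the 'family' values are copied from the output above, while the 'keywords' strings are made-up placeholders, because the raw format of that field is not shown here. The helper name to_protein_major is hypothetical.

```python
from collections import defaultdict

# Field-major layout, as described above. The 'family' values are copied from
# the output; the 'keywords' strings are placeholders (an assumption), not
# real UniProt content.
by_field = {
    'family': {
        'P63120': 'Peptidase A2 family, HERV class-II K(HML-2) subfamily',
        'Q96EC8': 'YIP1 family',
    },
    'keywords': {
        'P63120': 'placeholder keywords for P63120',
        'Q96EC8': 'placeholder keywords for Q96EC8',
    },
}


def to_protein_major(by_field):
    """Turns {field: {uniprot: value}} into {uniprot: {field: value}}."""

    by_protein = defaultdict(dict)

    for field, values in by_field.items():
        for uniprot_id, value in values.items():
            by_protein[uniprot_id][field] = value

    return dict(by_protein)


by_protein = to_protein_major(by_field)
print(by_protein['Q96EC8']['family'])
```

A protein that is missing from one of the fields simply lacks that key in its record, so no placeholder values have to be invented for it.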

Processed UniProt annotations§

For a few important fields we have dedicated processing functions that aim to make their format cleaner and easier to use. Sometimes even these do an imperfect job, and certain fields are badly truncated or contain residual fragments of the stripped labels.

Note: All the data presented below is part of the OmniPath annotations database; the recommended way to access it is through the database manager.

[136]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_taxonomy()

executed in 1ms, finished 16:40:33 2022-12-02

[136]:
{'P00521': {'Abelson murine leukemia virus'},
 'P03333': {'Abelson murine leukemia virus'},
 'H8ZM73': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'H8ZM71': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'Q9MV51': {'Abies firma', 'Momi fir'},
 'O81086': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24474': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24475': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64404': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64405': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q948Z0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D1': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O22340': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7C9': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q5K3V1': {'Abies homolepis', 'Nikko fir'},
 'P21715': {'Abrothrix jelskii', 'Akodon jelskii', "Jelski's altiplano mouse"},
 'P11140': {'Abru
Output truncated: showing 1000 of 56985 characters
[139]:
iuniprot.uniprot_ncbi_taxids_2()

executed in 0ms, finished 16:42:33 2022-12-02

[139]:
{648330: Taxon(ncbi_id=648330, latin='Aedes albopictus densovirus (isolate Boublik/1994)', english='AalDNV', latin_synonym=None),
 10804: Taxon(ncbi_id=10804, latin='Adeno-associated virus 2', english='AAV-2', latin_synonym=None),
 648242: Taxon(ncbi_id=648242, latin='Adeno-associated virus 2 (isolate Srivastava/1982)', english='AAV-2', latin_synonym=None),
 118452: Taxon(ncbi_id=118452, latin='Abacion magnum', english='Millipede', latin_synonym=None),
 72259: Taxon(ncbi_id=72259, latin='Abaeis nicippe', english='Sleepy orange butterfly', latin_synonym='Eurema nicippe'),
 102642: Taxon(ncbi_id=102642, latin='Abax parallelepipedus', english='Ground beetle', latin_synonym=None),
 392897: Taxon(ncbi_id=392897, latin='Abalistes stellaris', english='Starry triggerfish', latin_synonym='Balistes stellaris'),
 75332: Taxon(ncbi_id=75332, latin='Abbottina rivularis', english='Chinese false gudgeon', latin_synonym='Gobio rivularis'),
 515833: Taxon(ncbi_id=515833, latin='Abdopus aculeatus', engl
Output truncated: showing 1000 of 118050 characters
[140]:
iuniprot.uniprot_locations()

executed in 0ms, finished 16:42:50 2022-12-02

[140]:
{'Q96EC8': {UniprotLocation(location='Golgi apparatus membrane', features=('Multi-pass membrane protein',))},
 'Q6ZMS4': {UniprotLocation(location='Nucleus', features=None)},
 'Q8N8L2': {UniprotLocation(location='Nucleus', features=None)},
 'Q15916': {UniprotLocation(location='Nucleus', features=None)},
 'Q3MIS6': {UniprotLocation(location='Nucleus', features=None)},
 'Q6P280': {UniprotLocation(location='Nucleus', features=None)},
 'Q969W1': {UniprotLocation(location='Endoplasmic reticulum membrane', features=('Multi-pass membrane protein',))},
 'O14978': {UniprotLocation(location='Nucleus', features=None)},
 'Q66K41': {UniprotLocation(location='Nucleus', features=None)},
 'Q15937': {UniprotLocation(location='Nucleus', features=None)},
 'Q9P2J8': {UniprotLocation(location='Nucleus', features=None)},
 'Q8ND82': {UniprotLocation(location='Nucleus', features=None)},
 'Q9NP64': {UniprotLocation(location='Nucleolus', features=None),
  UniprotLocation(location='Nucleus', features=None)},
 'P
Output truncated: showing 1000 of 143466 characters
[141]:
iuniprot.uniprot_keywords()

executed in 0ms, finished 16:43:06 2022-12-02

[141]:
{'P63120': {UniprotKeyword(keyword='Aspartyl protease'),
  UniprotKeyword(keyword='Autocatalytic cleavage'),
  UniprotKeyword(keyword='ERV'),
  UniprotKeyword(keyword='Hydrolase'),
  UniprotKeyword(keyword='Protease'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Ribosomal frameshifting'),
  UniprotKeyword(keyword='Transposable element')},
 'Q96EC8': {UniprotKeyword(keyword='Acetylation'),
  UniprotKeyword(keyword='Alternative splicing'),
  UniprotKeyword(keyword='Golgi apparatus'),
  UniprotKeyword(keyword='Membrane'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Transmembrane'),
  UniprotKeyword(keyword='Transmembrane helix')},
 'Q6ZMS4': {UniprotKeyword(keyword='Metal-binding'),
  UniprotKeyword(keyword='Nucleus'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Repeat'),
  UniprotKeyword(keyword='Zinc'),
  Unipro
Output truncated: showing 1000 of 445111 characters
[142]:
iuniprot.uniprot_families()

executed in 0ms, finished 16:43:22 2022-12-02

[142]:
{'P63120': {UniprotFamily(family='Peptidase A2', subfamily='HERV class-II K(HML-2)')},
 'Q96EC8': {UniprotFamily(family='YIP1', subfamily=None)},
 'Q6ZMS4': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8N8L2': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q3MIS6': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q86UK7': {UniprotFamily(family='ZNF598', subfamily=None)},
 'Q6P280': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q969W1': {UniprotFamily(family='DHHC palmitoyltransferase', subfamily=None)},
 'O14978': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q15937': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q9P2J8': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8IUH4': {UniprotFamily(family='DHHC palmitoyltransferase',
Output truncated: showing 1000 of 77892 characters
[143]:
iuniprot.uniprot_tissues()

executed in 1.12s, finished 16:43:55 2022-12-02

[143]:
{'Q15916': {UniprotTissue(tissue='Brain', level='high'),
  UniprotTissue(tissue='Wide', level='high')},
 'Q969W1': {UniprotTissue(tissue='Wide', level='undefined')},
 'O14978': {UniprotTissue(tissue='Brain', level='undefined'),
  UniprotTissue(tissue='Colon', level='undefined'),
  UniprotTissue(tissue='Heart', level='undefined'),
  UniprotTissue(tissue='Kidney', level='undefined'),
  UniprotTissue(tissue='Leukocyte', level='undefined'),
  UniprotTissue(tissue='Liver', level='undefined'),
  UniprotTissue(tissue='Lung', level='undefined'),
  UniprotTissue(tissue='Ovary', level='undefined'),
  UniprotTissue(tissue='Pancreas', level='undefined'),
  UniprotTissue(tissue='Placenta', level='undefined'),
  UniprotTissue(tissue='Prostate', level='undefined'),
  UniprotTissue(tissue='Skeletal muscle', level='undefined'),
  UniprotTissue(tissue='Small intestine', level='undefined'),
  UniprotTissue(tissue='Spleen', level='undefined'),
  UniprotTissue(tissue='Testis', level='undefined'),
  Uniprot
Output truncated: showing 1000 of 318790 characters
[144]:
iuniprot.uniprot_topology()

executed in 0ms, finished 16:44:13 2022-12-02

[144]:
{'Q96EC8': {UniprotTopology(topology='Cytoplasmic', start=2, end=84),
  UniprotTopology(topology='Cytoplasmic', start=137, end=146),
  UniprotTopology(topology='Cytoplasmic', start=206, end=212),
  UniprotTopology(topology='Lumenal', start=106, end=115),
  UniprotTopology(topology='Lumenal', start=168, end=184),
  UniprotTopology(topology='Lumenal', start=234, end=236),
  UniprotTopology(topology='Transmembrane', start=85, end=105),
  UniprotTopology(topology='Transmembrane', start=116, end=136),
  UniprotTopology(topology='Transmembrane', start=147, end=167),
  UniprotTopology(topology='Transmembrane', start=185, end=205),
  UniprotTopology(topology='Transmembrane', start=213, end=233)},
 'Q969W1': {UniprotTopology(topology='Cytoplasmic', start=1, end=77),
  UniprotTopology(topology='Cytoplasmic', start=138, end=198),
  UniprotTopology(topology='Cytoplasmic', start=288, end=377),
  UniprotTopology(topology='Lumenal', start=99, end=116),
  UniprotTopology(topology='Lumenal', start=220,
Output truncated: showing 1000 of 544230 characters
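
The processed functions above return dicts of sets of named tuples, which can be filtered and aggregated with ordinary Python. As a sketch, the example below counts the transmembrane segments of Q96EC8 and the residues they cover. The UniprotTopology tuple is a local stand-in defined here for illustration (in pypath the records are created by the input module), and the segments are copied from the output above. The helper name segments is hypothetical.

```python
from collections import namedtuple

# Local stand-in for the record type shown in the output above.
UniprotTopology = namedtuple('UniprotTopology', ['topology', 'start', 'end'])

# Lumenal and transmembrane segments of Q96EC8, copied from the output above.
topology = {
    'Q96EC8': {
        UniprotTopology('Lumenal', 106, 115),
        UniprotTopology('Lumenal', 168, 184),
        UniprotTopology('Lumenal', 234, 236),
        UniprotTopology('Transmembrane', 85, 105),
        UniprotTopology('Transmembrane', 116, 136),
        UniprotTopology('Transmembrane', 147, 167),
        UniprotTopology('Transmembrane', 185, 205),
        UniprotTopology('Transmembrane', 213, 233),
    },
}


def segments(topology, uniprot_id, kind):
    """Segments of one kind for one protein, ordered by start position."""

    return sorted(
        (seg for seg in topology.get(uniprot_id, ()) if seg.topology == kind),
        key = lambda seg: seg.start,
    )


tm = segments(topology, 'Q96EC8', 'Transmembrane')
# start and end are treated as inclusive positions
tm_residues = sum(seg.end - seg.start + 1 for seg in tm)
print(len(tm), tm_residues)
```

The same pattern applies to the location, keyword, family and tissue records, since they are all sets of named tuples keyed by UniProt ID.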

The UniProt utils module§

Datasheets§

The pypath.utils.uniprot module is an API around UniProt protein datasheets. It is not suitable for bulk retrieval: it downloads the datasheets one by one, so calling its bulk methods with more than a few dozen or a few hundred proteins might take minutes. To retrieve the full datasheets of one or more proteins, use query:

[153]:
from pypath.utils import uniprot
uniprot.query('P00533', 'O75385', 'Q14457')

executed in 1ms, finished 17:57:18 2022-12-02

[153]:
[<UniProt datasheet P00533 (EGFR)>,
 <UniProt datasheet O75385 (ULK1)>,
 <UniProt datasheet Q14457 (BECN1)>]
[154]:
ulk1 = uniprot.query('O75385')
ulk1

executed in 0ms, finished 17:57:58 2022-12-02

[154]:
<UniProt datasheet O75385 (ULK1)>

Many attributes are available from the datasheet objects; here are just a few examples:

[156]:
ulk1.weight, ulk1.length, ulk1.subcellular_location, ulk1.sequence

executed in 0ms, finished 17:59:18 2022-12-02

[156]:
(112631,
 1050,
 'Cytoplasm, cytosol. Preautophagosomal structure. Note=Under starvation conditions, is localized to puncate structures primarily representing the isolation membrane that sequesters a portion of the cytoplasm resulting in the formation of an autophagosome.',
 'MEPGRGGTETVGKFEFSRKDLIGHGAFAVVFKGRHREKHDLEVAVKCINKKNLAKSQTLLGKEIKILKELKHENIVALYDFQEMANSVYLVMEYCNGGDLADYLHAMRTLSEDTIRLFLQQIAGAMRLLHSKGIIHRDLKPQNILLSNPAGRRANPNSIRVKIADFGFARYLQSNMMAATLCGSPMYMAPEVIMSQHYDGKADLWSIGTIVYQCLTGKAPFQASSPQDLRLFYEKNKTLVPTIPRETSAPLRQLLLALLQRNHKDRMDFDEFFHHPFLDASPSVRKSPPVPVPSYPSSGSGSSSSSSSTSHLASPPSLGEMQQLQKTLASPADTAGFLHSSRDSGGSKDSSCDTDDFVMVPAQFPGDLVAEAPSAKPPPDSLMCSGSSLVASAGLESHGRTPSPSPPCSSSPSPSGRAGPFSSSRCGASVPIPVPTQVQNYQRIERNLQSPTQFQTPRSSAIRRSGSTSPLGFARASPSPPAHAEHGGVLARKMSLGGGRPYTPSPQVGTIPERPGWSGTPSPQGAEMRGGRSPRPGSSAPEHSPRTSGLGCRLHSAPNLSDLHVVRPKLPKPPTDPLGAVFSPPQASPPQPSHGLQSCRNLRGSPKLPDFLQRNPLPPILGSPTKAVPSFDFPKTPSSQNLLALLARQGVVMTPPRNRTLPDLSEVGPFHGQPLGPGLRPGEDPKGPFGRSFSTSRLTDLLLKAAFGTQAPDPGSTESLQEK
Output truncated: showing 1000 of 1329 characters

The collect function collects certain features for a set of proteins.

Warning: This is a really inefficient way of retrieving data from UniProt. If you work with more than a handful of proteins, use pypath.inputs.uniprot.uniprot_data instead.

[158]:
uniprot.collect(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

executed in 0ms, finished 18:02:29 2022-12-02

[158]:
OrderedDict([('ac', ['P00533', 'O75385', 'Q14457']),
             ('weight', [134277, 112631, 51896]),
             ('length', [1210, 1050, 450])])
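
The OrderedDict returned by collect is column-oriented: each key holds a list with one value per protein. If one record per protein is more convenient, the columns can be zipped into rows. The sketch below uses only the standard library and the values copied from the output above; the variable names are illustrative.

```python
from collections import OrderedDict

# Column-oriented result, copied from the output above.
columns = OrderedDict([
    ('ac', ['P00533', 'O75385', 'Q14457']),
    ('weight', [134277, 112631, 51896]),
    ('length', [1210, 1050, 450]),
])

# One dict per protein: zip the columns together, then pair with the names.
rows = [
    dict(zip(columns.keys(), values))
    for values in zip(*columns.values())
]

# Index the rows by accession for direct lookup.
by_ac = {row['ac']: row for row in rows}
print(by_ac['O75385'])
```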

Tables§

UniProt data can be printed to the console in a tabular format:

[159]:
uniprot.print_features(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

executed in 0ms, finished 18:07:18 2022-12-02

╒═══════╤════════╤══════════╤══════════╕
│   No. │ ac     │   weight │   length │
╞═══════╪════════╪══════════╪══════════╡
│     1 │ P00533 │   134277 │     1210 │
├───────┼────────┼──────────┼──────────┤
│     2 │ O75385 │   112631 │     1050 │
├───────┼────────┼──────────┼──────────┤
│     3 │ Q14457 │    51896 │      450 │
╘═══════╧════════╧══════════╧══════════╛

There is a shortcut to print the essential characteristics of proteins as such a table. The info function is really useful when you arrive at a set of proteins at some point in your analysis and want to quickly check what kind of proteins they are. To iterate through multiple groups of proteins, use utils.uniprot.browse. The columns and format of these tables can be customized by kwargs.

[160]:
uniprot.info(['P00533', 'O75385', 'Q14457'])

executed in 0ms, finished 18:09:45 2022-12-02

=====> [3 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ P00533 │ EGFR         │     1210 │   134277 │ Epidermal   │ Receptor     │ 3D-        │ Cell         │
│       │        │              │          │          │ growth      │ tyrosine     │ structure, │ membrane;    │
│       │        │              │          │          │ factor      │ kinase       │ Alternativ │ Single-      │
│       │        │              │          │          │ receptor    │
Output truncated: showing 1000 of 20254 characters

Sanitizing UniProt IDs§

It is important to know that the ID translation module always performs a number of checks when translating to UniProt IDs, unless the uniprot_cleanup parameter is disabled. It translates secondary IDs to primary ones, attempts to map TrEMBL IDs to SwissProt IDs by gene symbol, and removes IDs that belong to other organisms or have an invalid format. To exploit this behaviour, it’s enough to map from UniProt to UniProt:

[162]:
from pypath.utils import mapping
mapping.map_name('Q9UQ28', 'uniprot', 'uniprot')

executed in 0ms, finished 18:20:02 2022-12-02

[162]:
{'O75385'}
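
Of the checks listed above, the format check is the only one that needs no lookup tables, so it can be illustrated in isolation. The sketch below tests identifiers against the accession number pattern published in the UniProt help pages. It is not pypath's own implementation, which may differ, and the function name looks_like_uniprot_ac is hypothetical.

```python
import re

# Accession number pattern from the UniProt help pages.
RE_UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|'
    r'[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot_ac(identifier):
    """True if the whole string has the format of a UniProt accession."""

    return bool(RE_UNIPROT_AC.fullmatch(identifier))


candidates = ['Q9UQ28', 'O75385', 'MIMAT0000069', 'COMPLEX:P00533', 'p00533']
valid = [c for c in candidates if looks_like_uniprot_ac(c)]
print(valid)
```

A format check cannot tell that Q9UQ28 is a secondary accession; resolving it to O75385, as in the cell above, requires the mapping tables that the ID translation module loads.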

Enzyme-substrate interactions§

The database is an instance of the pypath.core.enz_sub.EnzymeSubstrateAggregator class. It is built with the default or current configuration by the core.enz_sub.get_db function.

Warning: it is recommended to access databases through the manager. Running the code below takes a really long time, and it does not save or reload the database: it builds a fresh copy each time.

[25]:
from pypath.core import enz_sub
es = enz_sub.get_db()

executed in 8m 1.81s, finished 14:26:37 2022-12-02

Instead, let’s acquire the database from the manager:

[6]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')

executed in 7.27s, finished 15:37:33 2022-12-03

The database itself is stored as a dictionary (EnzymeSubstrateAggregator.enz_sub) with pairs of proteins as keys and a list of special objects representing enzyme-substrate interactions as values. These can be accessed by pairs of labels, identifiers or Entity objects, e.g. mTOR phosphorylates AKT1:

[27]:
es[('MTOR', 'AKT1')]

executed in 0ms, finished 14:40:55 2022-12-02

[27]:
[<MTOR => Residue AKT1-1:S473:phosphorylation [Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)]>,
 <MTOR => Residue AKT1-1:T450:phosphorylation [Evidences: HPRD, MIMP, PhosphoSite, ProtMapper, phosphoELM (0 references)]>,
 <MTOR => Residue AKT1-1:T308:phosphorylation [Evidences: ProtMapper, Sparser (1 references)]>]

Enzyme-substrate objects§

Let’s take a closer look at one of the enzyme-PTM relationships, represented by pypath.internals.intera.DomainMotif objects. Below some of the attributes are shown:

[28]:
e_ptm = es[('MTOR', 'AKT1')][0]
e_ptm.ptm.protein, e_ptm.ptm.protein.identifier, e_ptm.ptm.isoform, e_ptm.ptm.residue, e_ptm.ptm.residue.name, e_ptm.ptm.residue.number, e_ptm.ptm.typ, e_ptm.domain.protein

executed in 0ms, finished 14:40:57 2022-12-02

[28]:
(<Entity: AKT1>,
 'P31749',
 1,
 <Residue AKT1-1:S473>,
 'S',
 473,
 'phosphorylation',
 <Entity: MTOR>)

The resources and references are available in Evidences objects:

[29]:
e_ptm.evidences

executed in 0ms, finished 14:41:00 2022-12-02

[29]:
<Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)>
[30]:
e_ptm.evidences.get_resource_names()

executed in 0ms, finished 14:41:03 2022-12-02

[30]:
{'KEA', 'MIMP', 'PhosphoSite', 'ProtMapper', 'SIGNOR', 'dbPTM'}
[31]:
e_ptm.evidences.get_references()

executed in 0ms, finished 14:41:04 2022-12-02

[31]:
{<Reference: 14761976>,
 <Reference: 15047712>,
 <Reference: 15364915>,
 <Reference: 15718470>,
 <Reference: 15899889>,
 <Reference: 16221682>,
 <Reference: 17013611>,
 <Reference: 19844585>,
 <Reference: 20333297>,
 <Reference: 20489726>,
 <Reference: 21157483>,
 <Reference: 21592956>,
 <Reference: 23006971>,
 <Reference: 8978681>,
 <Reference: 9736715>}

Enzyme-substrate data frame§

The database object is able to export its contents into a pandas.DataFrame:

[7]:
es.make_df()
es.df

executed in 1.03s, finished 15:37:39 2022-12-03

[7]:
enzyme enzyme_genesymbol substrate substrate_genesymbol isoforms residue_type residue_offset modification sources references curation_effort
0 P31749 AKT1 P63104 YWHAZ 1 S 58 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite;PhosphoSit... HPRD:11956222;KEA:11956222;KEA:12861023;KEA:16... 11
1 P31749 AKT1 P63104 YWHAZ 1 S 184 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite_MIMP;phosp... HPRD:11956222;KEA:11956222;KEA:15071501 3
2 P45983 MAPK8 P63104 YWHAZ 1 S 184 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;Phosph... HPRD:15696159;KEA:11956222;KEA:15071501;KEA:15... 9
3 P06493 CDK1 P11171 EPB41 1 S 712 phosphorylation HPRD_MIMP;MIMP;PhosphoSite_MIMP;ProtMapper;REA... ProtMapper:15525677;dbPTM:15525677;dbPTM:18220... 5
4 P06493 CDK1 P11171 EPB41 1;2;5;7 T 60 phosphorylation MIMP;PhosphoSite;PhosphoSite_MIMP;ProtMapper;R... ProtMapper:15525677;dbPTM:15525677;dbPTM:2171679 3
... ... ... ... ... ... ... ... ... ... ... ...
41421 P29597 TYK2 P51692 STAT5B 1 Y 699 phosphorylation KEA KEA:10830280;KEA:11751923;KEA:12411494 3
41422 Q06418 TYRO3 P19174 PLCG1 1;2 Y 771 phosphorylation KEA KEA:12601080;KEA:15144186;KEA:15592455;KEA:160... 8
41423 Q9H4A3 WNK1 Q8TAX0 OSR1 1 T 185 phosphorylation KEA KEA:18270262 1
41424 Q9H4A3 WNK1 Q96J92 WNK4 1;3 S 335 phosphorylation KEA KEA:15883153 1
41425 Q9NYL2 MAP3K20 Q92903 CDS1 1 T 68 phosphorylation KEA KEA:10973490 1

41426 rows × 11 columns
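
In the data frame above, the references column packs resource and reference identifier pairs into one semicolon-separated string. A minimal sketch of unpacking such a cell with the standard library follows; the string is copied from row 1 of the output, and the function name references_by_resource is hypothetical.

```python
from collections import defaultdict

# The `references` cell of row 1 in the data frame above.
references = 'HPRD:11956222;KEA:11956222;KEA:15071501'


def references_by_resource(cell):
    """Splits 'resource:id;resource:id' into {resource: {id, ...}}."""

    by_resource = defaultdict(set)

    for item in cell.split(';'):
        if not item:
            # an empty cell yields one empty item; skip it
            continue
        resource, _, ref_id = item.partition(':')
        by_resource[resource].add(ref_id)

    return dict(by_resource)


by_resource = references_by_resource(references)
unique_ids = set().union(*by_resource.values())
print(by_resource)
print(len(unique_ids))
```

For this row the cell holds three resource-reference pairs but only two distinct reference identifiers, because HPRD and KEA cite the same one.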

Protein sequences§

The APIs for sequences are very basic because we’ve never really needed them, but the fundamentals are probably there to make a nice, powerful API. Still, I don’t believe pypath will ever be strong in sequences; it’s just not our main topic.

[186]:
from pypath.utils import homology
seqc = homology.SequenceContainer(preload_seq = [9606])
akt1 = seqc.get_seq('P31749')
akt1.get_region(start = 10, end = 19, isoform = 2)

executed in 0ms, finished 19:40:09 2022-12-02

[186]:
(10, 19, 'TFIIRCLQWT')
[187]:
from pypath.utils import seq
human_proteome = seq.swissprot_seq()
human_proteome

executed in 0ms, finished 19:44:52 2022-12-02

[187]:
{'P63120': <pypath.utils.seq.Seq at 0x689900d45cc0>,
 'Q96EC8': <pypath.utils.seq.Seq at 0x689908ea8f70>,
 'Q6ZMS4': <pypath.utils.seq.Seq at 0x689908eaa4a0>,
 'Q8N8L2': <pypath.utils.seq.Seq at 0x6899223538b0>,
 'Q15916': <pypath.utils.seq.Seq at 0x689922353c70>,
 'O60384': <pypath.utils.seq.Seq at 0x689922350730>,
 'Q3MIS6': <pypath.utils.seq.Seq at 0x689922353310>,
 'Q86UK7': <pypath.utils.seq.Seq at 0x689922353760>,
 'Q6P280': <pypath.utils.seq.Seq at 0x689922353190>,
 'Q969W1': <pypath.utils.seq.Seq at 0x689922350d90>,
 'O14978': <pypath.utils.seq.Seq at 0x689922353220>,
 'P61129': <pypath.utils.seq.Seq at 0x689922353370>,
 'Q66K41': <pypath.utils.seq.Seq at 0x6899223534f0>,
 'Q15937': <pypath.utils.seq.Seq at 0x689922350c70>,
 'Q9P2J8': <pypath.utils.seq.Seq at 0x689922351450>,
 'Q8ND82': <pypath.utils.seq.Seq at 0x689922353910>,
 'Q9NP64': <pypath.utils.seq.Seq at 0x6899223502b0>,
 'P98182': <pypath.utils.seq.Seq at 0x689922350280>,
 'Q8IUH4': <pypath.utils.seq.Seq at 0x68992235
Output truncated: showing 1000 of 53045 characters
[191]:
list(human_proteome['P00533'].findall('YGCT'))

executed in 0ms, finished 19:48:41 2022-12-02

[191]:
[SeqLookup(isoform=1, offset=625)]
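
To make the two operations above concrete, here is a plain Python sketch of region extraction and motif search on a string. It is not the pypath implementation. It assumes 1-based, inclusive positions, as used by UniProt; whether SeqLookup.offset follows the same convention is not verified here. The sequence is the first 40 residues of ULK1, copied from the datasheet output earlier in this chapter, and the function names are illustrative.

```python
import re

# First 40 residues of ULK1 (O75385), copied from the datasheet output.
sequence = 'MEPGRGGTETVGKFEFSRKDLIGHGAFAVVFKGRHREKHD'


def get_region(seq, start, end):
    """Residues from `start` to `end`, both 1-based and inclusive."""

    return seq[start - 1:end]


def find_all(seq, motif):
    """1-based start positions of all occurrences, overlapping ones included."""

    # a zero-width lookahead lets consecutive matches overlap
    pattern = re.compile('(?=%s)' % re.escape(motif))

    return [m.start() + 1 for m in pattern.finditer(seq)]


region = get_region(sequence, 10, 19)
hits = find_all(sequence, 'GG')
print(region, hits)
```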

Annotations§

This database provides various annotations about the function, structure, localization and many other properties of proteins and genes. The database is an instance of the pypath.core.annot.AnnotationTable class. It is built with the default or current configuration by the core.annot.get_db function.

Warning: it is recommended to access databases through the manager. Running the code below takes a really long time, and it does not save or reload the database: it builds a fresh copy each time.

[38]:
from pypath.core import annot
an = annot.get_db()
an

executed in 1ms, finished 15:07:08 2022-12-02

[38]:
<Annotation database: 3788067 records about 51636 entities from 78 resources>

Load a single annotation resource§

The annotations database is huge: it takes up 1-2 GB of space on disk and consists of 60-70 resources. However, these resources are not integrated with each other, and each can be loaded individually by its dedicated class in the core.annot module. We can recommend this practice, and it will be supported better in the future. Let’s load one resource:

[8]:
from pypath.core import annot
cpad = annot.Cpad()
cpad

executed in 48.26s, finished 15:38:57 2022-12-03

[8]:
<CPAD annotations: 2308 records about 1358 entities>

The resulting object is derived from the AnnotationBase class. Its data is stored under the annot attribute, in a dict where identifiers are the keys and sets of annotation records are the values. The field names of the records are shown by the get_names method:

[35]:
cpad.get_names()

executed in 0ms, finished 15:06:45 2022-12-02

[35]:
('regulator_type',
 'effect_on_pathway',
 'pathway',
 'effect_on_cancer',
 'effect_on_cancer_outcome',
 'cancer',
 'pathway_category')

For each name we can list the possible values:

[36]:
cpad.get_values('cancer')

executed in 0ms, finished 15:06:47 2022-12-02

[36]:
{'Acute lymphoblastic leukemia (ALL) (precursor T lymphoblastic leukemia)',
 'Acute myeloid leukemia (AML)',
 'Basal cell carcinoma',
 'Bladder cancer',
 'Breast cancer',
 'Cervical cancer',
 'Cholangiocarcinoma',
 'Choriocarcinoma',
 'Chronic lymphocytic leukemia (CLL)',
 'Chronic myeloid leukemia (CML)',
 'Colorectal cancer',
 'Endometrial cancer',
 'Esophageal cancer',
 "Ewing's sarcoma",
 'Gallbladder cancer',
 'Gastric cancer',
 'Glioma',
 'Hepatocellular carcinoma',
 'Hodgkin lymphoma',
 'Infantile hemangioma',
 'Laryngeal cancer',
 'Malignant melanoma',
 'Malignant pleural mesothelioma',
 'Mantle cell lymphoma',
 'Multiple myeloma',
 'Nasopharyngeal cancer',
 'Neuroblastoma',
 'Non-small cell lung cancer',
 'Oral cancer',
 'Osteosarcoma',
 'Ovarian cancer',
 'Pancreatic cancer',
 'Pituitary adenomas',
 'Prostate cancer',
 'Renal cell carcinoma',
 'Small cell lung cancer',
 'Squamous cell carcinoma',
 'Synovial sarcoma',
 'Testicular cancer',
 'Thyroid cancer'}

Based on their annotations the select method filters the annotated molecules. For example, 78 complexes, miRNAs and proteins are annotated as inhibiting colorectal cancer:

[37]:
cpad.select(cancer = 'Colorectal cancer', effect_on_cancer = 'Inhibiting')

executed in 0ms, finished 15:06:50 2022-12-02

[37]:
{'A6NDV4',
 Complex: COMPLEX:O14745,
 Complex: COMPLEX:O14862,
 Complex: COMPLEX:O15169_P25054,
 Complex: COMPLEX:O94813,
 Complex: COMPLEX:O94953,
 Complex: COMPLEX:P00533,
 Complex: COMPLEX:P06733,
 Complex Glucose transporter complex 1: COMPLEX:P11166,
 Complex: COMPLEX:P25054,
 Complex: COMPLEX:P40261,
 Complex: COMPLEX:P49327,
 Complex: COMPLEX:P54687,
 Complex PTEN phosphatase complex: COMPLEX:P60484,
 Complex: COMPLEX:Q01973,
 Complex: COMPLEX:Q12888,
 Complex: COMPLEX:Q13620,
 Complex: COMPLEX:Q96CX2,
 Complex: COMPLEX:Q99558,
 'MIMAT0000069',
 'MIMAT0000089',
 'MIMAT0000093',
 'MIMAT0000262',
 'MIMAT0000274',
 'MIMAT0000422',
 'MIMAT0000427',
 'MIMAT0000437',
 'MIMAT0000449',
 'MIMAT0000455',
 'MIMAT0000460',
 'MIMAT0000461',
 'MIMAT0000617',
 'MIMAT0003266',
 'MIMAT0003320',
 'O14745',
 'O14862',
 'O15169',
 'O75473',
 'O75888',
 'O76041',
 'O94813',
 'O94953',
 'P00533',
 'P06733',
 'P06756',
 'P11166',
 'P13631',
 'P22676',
 'P25054',
 'P25791',
 'P40261',
 'P49327',
 'P546
Output truncated: showing 1000 of 1279 characters
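
The filtering done by select can be pictured on the data structure described above: a dict of identifiers pointing to sets of records. The sketch below is a simplified stand-in, not pypath's implementation. The CpadAnnotation tuple and the select function are defined locally for illustration, and the two records are transcribed from the CPAD data frames shown later in this chapter.

```python
from collections import namedtuple

# Local stand-in for a CPAD record; the field names come from get_names above.
CpadAnnotation = namedtuple('CpadAnnotation', [
    'regulator_type',
    'effect_on_pathway',
    'pathway',
    'effect_on_cancer',
    'effect_on_cancer_outcome',
    'cancer',
    'pathway_category',
])

annot = {
    'Q16181': {
        CpadAnnotation(
            regulator_type = 'protein',
            effect_on_pathway = 'Upregulation',
            pathway = 'Actin cytoskeleton pathway',
            effect_on_cancer = 'Inhibiting',
            effect_on_cancer_outcome = 'inhibit glioma cell migration',
            cancer = 'Glioma',
            pathway_category = 'Regulation of actin cytoskeleton',
        ),
    },
    'Q9UP65': {
        CpadAnnotation(
            regulator_type = 'protein',
            effect_on_pathway = 'Downregulation',
            pathway = 'Akt signaling pathway',
            effect_on_cancer = 'Inhibiting',
            effect_on_cancer_outcome = 'inhibit EGF-induced chemotaxis',
            cancer = 'Breast cancer',
            pathway_category = 'PI3K-Akt signaling pathway',
        ),
    },
}


def select(annot, **criteria):
    """Identifiers having at least one record that matches all criteria."""

    return {
        key
        for key, records in annot.items()
        if any(
            all(getattr(rec, name) == value for name, value in criteria.items())
            for rec in records
        )
    }


glioma_inhibitors = select(annot, cancer = 'Glioma', effect_on_cancer = 'Inhibiting')
print(glioma_inhibitors)
```

Note that all criteria have to be satisfied by the same record: an entity with one record about glioma and another, unrelated record marked as inhibiting would not be selected by this sketch.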

Load the full annotations database by the database manager§

Alternatively, the full annotations database can be accessed in the usual way:

[215]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an
[215]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>

The AnnotationTable object contains the resource specific annotation objects under the annots attribute:

[40]:
an.annots

executed in 0ms, finished 15:07:39 2022-12-02

[40]:
{'CellTypist': <CellTypist annotations: 927 records about 473 entities>,
 'Integrins': <Integrins annotations: 62 records about 62 entities>,
 'CellCellInteractions': <CellCellInteractions annotations: 5544 records about 4960 entities>,
 'PanglaoDB': <PanglaoDB annotations: 8479 records about 4813 entities>,
 'Lambert2018': <Lambert2018 annotations: 3281 records about 3277 entities>,
 'CancerSEA': <CancerSEA annotations: 2515 records about 1992 entities>,
 'Phobius': <Phobius annotations: 35382 records about 35382 entities>,
 'GO_Intercell': <GO_Intercell annotations: 48799 records about 18377 entities>,
 'MatrixDB': <MatrixDB annotations: 18127 records about 15903 entities>,
 'Surfaceome': <Surfaceome annotations: 3558 records about 3558 entities>,
 'Matrisome': <Matrisome annotations: 1514 records about 1514 entities>,
 'HPA_secretome': <HPA_secretome annotations: 3568 records about 3568 entities>,
 'HPMR': <HPMR annotations: 1748 records about 1695 entities>,
 'CPAD': <CPAD annotati
Output truncated: showing 1000 of 5842 characters

For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values, just like for CPAD above. As another example, let’s take a look into the Matrisome database:

[41]:
matrisome = an.annots['Matrisome']

executed in 0ms, finished 15:07:45 2022-12-02

[42]:
matrisome.get_names()

executed in 0ms, finished 15:07:49 2022-12-02

[42]:
('mainclass', 'subclass', 'subsubclass')
[43]:
matrisome.get_values('subclass')

executed in 0ms, finished 15:07:53 2022-12-02

[43]:
{'Collagens',
 'ECM Glycoproteins',
 'ECM Regulators',
 'ECM-affiliated Proteins',
 'Proteoglycans',
 'Secreted Factors',
 'n/a'}
[44]:
matrisome.get_subset(subclass = 'Collagens')

executed in 0ms, finished 15:07:56 2022-12-02

[44]:
{'A6NMZ7',
 'A8TX70',
 'B4DZ39',
 Complex Collagen type I homotrimer: COMPLEX:P02452,
 Complex HT_DM_Cluster278: COMPLEX:P02452_P02462_P08572_P29400_P53420_Q01955_Q02388_Q14031_Q17RW2_Q8NFW1,
 Complex Collagen type I trimer: COMPLEX:P02452_P08123,
 Complex Collagen type II trimer: COMPLEX:P02458,
 Complex Collagen type XI trimer variant 1: COMPLEX:P02458_P12107_P13942,
 Complex: COMPLEX:P02458_P20908_P25067,
 Complex: COMPLEX:P02458_P20908_P25067_P29400,
 Complex: COMPLEX:P02458_P25067_P29400,
 Complex Collagen type III trimer: COMPLEX:P02461,
 Complex: COMPLEX:P02462,
 Complex Collagen type IV trimer variant 1: COMPLEX:P02462_P08572,
 Complex Collagen type XI trimer variant 2: COMPLEX:P05997_P12107,
 Complex Collagen type XI trimer variant 3: COMPLEX:P05997_P12107_P20908,
 Complex Collagen type V trimer variant 1: COMPLEX:P05997_P20908,
 Complex Collagen type V trimer variant 2: COMPLEX:P05997_P20908_P25940,
 Complex: COMPLEX:P08572,
 Complex: COMPLEX:P12109_P12110,
 Complex Collagen
Output truncated: showing 1000 of 3072 characters

Load only selected annotations§

Another option is to load only certain annotation resources into an AnnotationTable object. We refer to the resources by class names. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:

[47]:
pathways = annot.AnnotationTable(
    protein_sources = (
        'SignalinkPathways',
        'KeggPathways',
        'NetpathPathways',
        'SignorPathways',
    ),
    complex_sources = (),
)
pathways

executed in 12.07s, finished 15:09:48 2022-12-02

[47]:
<Annotation database: 28745 records about 6762 entities from 4 resources>

The AnnotationTable object provides methods to query all resources together, or build a boolean array out of them. To see all annotations of one protein:

[48]:
pathways.all_annotations('P00533')

executed in 0ms, finished 15:10:17 2022-12-02

[48]:
[SignalinkPathway(pathway='Receptor tyrosine kinase'),
 SignalinkPathway(pathway='JAK/STAT'),
 KeggPathway(pathway='Proteoglycans in cancer'),
 KeggPathway(pathway='Regulation of actin cytoskeleton'),
 KeggPathway(pathway='Oxytocin signaling pathway'),
 KeggPathway(pathway='Phospholipase D signaling pathway'),
 KeggPathway(pathway='Pathways in cancer'),
 KeggPathway(pathway='Hepatocellular carcinoma'),
 KeggPathway(pathway='Colorectal cancer'),
 KeggPathway(pathway='Melanoma'),
 KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
 KeggPathway(pathway='Human papillomavirus infection'),
 KeggPathway(pathway='Pancreatic cancer'),
 KeggPathway(pathway='Non-small cell lung cancer'),
 KeggPathway(pathway='Central carbon metabolism in cancer'),
 KeggPathway(pathway='Endocytosis'),
 KeggPathway(pathway='Endometrial cancer'),
 KeggPathway(pathway='Choline metabolism in cancer'),
 KeggPathway(pathway='Bladder cancer'),
 KeggPathway(pathway='Parathyroid hormone synthesis, secretion
Output truncated: showing 1000 of 2540 characters
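
The list returned by all_annotations mixes records from several resources, and the record type tells which resource each one came from. A small sketch of grouping such a list by record type follows. The two named tuples are local stand-ins defined for illustration, and the records are the first five from the output above.

```python
from collections import defaultdict, namedtuple

# Local stand-ins for the record types shown in the output above.
SignalinkPathway = namedtuple('SignalinkPathway', ['pathway'])
KeggPathway = namedtuple('KeggPathway', ['pathway'])

# The first five records of EGFR (P00533) from the output above.
annotations = [
    SignalinkPathway(pathway = 'Receptor tyrosine kinase'),
    SignalinkPathway(pathway = 'JAK/STAT'),
    KeggPathway(pathway = 'Proteoglycans in cancer'),
    KeggPathway(pathway = 'Regulation of actin cytoskeleton'),
    KeggPathway(pathway = 'Oxytocin signaling pathway'),
]

# Group the pathway names by the class name of the record.
by_type = defaultdict(list)

for rec in annotations:
    by_type[type(rec).__name__].append(rec.pathway)

by_type = dict(by_type)
print(by_type['SignalinkPathway'])
```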

Data frames of annotations§

Data from annotation objects can be exported to a pandas.DataFrame:

[9]:
cpad.make_df()
cpad.df

executed in 0ms, finished 15:40:14 2022-12-03

[9]:
       uniprot         genesymbol   entity_type  source  label                     value                          record_id
0      Q16181          SEPT7        protein      CPAD    regulator_type            protein                        0
1      Q16181          SEPT7        protein      CPAD    effect_on_pathway         Upregulation                   0
2      Q16181          SEPT7        protein      CPAD    pathway                   Actin cytoskeleton pathway     0
3      Q16181          SEPT7        protein      CPAD    effect_on_cancer          Inhibiting                     0
4      Q16181          SEPT7        protein      CPAD    effect_on_cancer_outcome  inhibit glioma cell migration  0
...    ...             ...          ...          ...     ...                       ...                            ...
14396  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    cancer                    Hepatocellular carcinoma       2306
14397  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    effect_on_pathway         Upregulation                   2307
14398  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    pathway                   ERK signaling pathway          2307
14399  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    effect_on_cancer          Activating                     2307
14400  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    cancer                    Gastric cancer                 2307

14401 rows × 7 columns

The data frame has a long format. It can be converted to the more conventional wide format using standard pandas procedures (in tidyverse you would simply call tidyr::pivot_wider, while in pandas you have to do an unintuitive sequence of 6 calls):

[10]:
index_cols = ['record_id', 'uniprot', 'genesymbol', 'label', 'entity_type']

(
    cpad.df.drop('source', axis=1).
    set_index(index_cols).
    unstack('label').
    droplevel(axis=1, level=0).
    reset_index().
    drop('record_id', axis=1)
)

executed in 0ms, finished 15:40:19 2022-12-03

[10]:
label uniprot genesymbol entity_type cancer effect_on_cancer effect_on_cancer_outcome effect_on_pathway pathway pathway_category regulator_type
0 Q16181 SEPT7 protein Glioma Inhibiting inhibit glioma cell migration Upregulation Actin cytoskeleton pathway Regulation of actin cytoskeleton protein
1 MIMAT0000431 hsa-miR-140 mirna Squamous cell carcinoma Inhibiting suppress tumor cell migration and invasion Upregulation ADAM10 mediated Notch1 signaling pathway Notch signaling pathway mirna
2 MIMAT0005886 hsa-miR-1297 mirna Prostate cancer Inhibiting inhibit proliferation and invasion Upregulation AEG1/Wnt signaling pathway Wnt signaling pathway mirna
3 Q9UP65 PLA2G4C protein Breast cancer Inhibiting inhibit EGF-induced chemotaxis Downregulation Akt signaling pathway PI3K-Akt signaling pathway protein
4 Q92600 CNOT9 protein Breast cancer Inhibiting suppress cell proliferation Downregulation Akt signaling pathway PI3K-Akt signaling pathway protein
... ... ... ... ... ... ... ... ... ... ...
2303 COMPLEX:P16422 COMPLEX:EPCAM complex Prostate cancer Inhibiting NaN Downregulation PI3K-Akt-mTOR signaling pathway NaN NaN
2304 COMPLEX:Q9Y6Y0 COMPLEX:IVNS1ABP complex Prostate cancer Inhibiting NaN Upregulation Akt signaling pathway NaN NaN
2305 COMPLEX:Q96CX2 COMPLEX:KCTD12 complex Colorectal cancer Inhibiting NaN Upregulation ERK signaling pathway NaN NaN
2306 COMPLEX:P30990 COMPLEX:NTS complex Hepatocellular carcinoma Activating NaN Upregulation Wnt/beta-catenin signaling pathway NaN NaN
2307 COMPLEX:P30990 COMPLEX:NTS complex Gastric cancer Activating NaN Upregulation ERK signaling pathway NaN NaN

2308 rows × 10 columns
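To make explicit what the chain of calls above does, here is a minimal sketch of the same long-to-wide reshaping in plain Python, without pandas. The function `long_to_wide` and the toy records are assumptions made for illustration (only a handful of values are copied from the tables above). Each record becomes one row and each label becomes one column, with `None` where a record has no value for a label:

```python
from collections import defaultdict

# Toy long-format records: (record_id, uniprot, label, value).
# A small, incomplete subset of the CPAD table, for illustration only.
long_rows = [
    (0, 'Q16181', 'pathway', 'Actin cytoskeleton pathway'),
    (0, 'Q16181', 'effect_on_cancer', 'Inhibiting'),
    (0, 'Q16181', 'effect_on_pathway', 'Upregulation'),
    (3, 'Q9UP65', 'pathway', 'Akt signaling pathway'),
    (3, 'Q9UP65', 'effect_on_cancer', 'Inhibiting'),
]


def long_to_wide(rows):
    """Collect the label/value pairs of each record into one wide row."""

    by_record = defaultdict(dict)

    for record_id, uniprot, label, value in rows:
        by_record[(record_id, uniprot)][label] = value

    # every label seen anywhere becomes a column
    labels = sorted({label for _, _, label, _ in rows})

    return [
        {'uniprot': uniprot, **{lab: fields.get(lab) for lab in labels}}
        for (record_id, uniprot), fields in sorted(by_record.items())
    ]


wide_rows = long_to_wide(long_rows)

for row in wide_rows:
    print(row)
```

In pandas the `set_index` and `unstack` calls do the grouping and the column creation in one go; the remaining calls only tidy up the index.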

Inter-cellular signaling roles§

pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway annotations from Signor and SignaLink won’t be merged automatically. However, anyone can do this with the pypath.core.annot.CustomAnnotation class. For inter-cellular communication categories, the pypath.core.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidence. The database is an instance of the IntercellAnnotation class, and it is built by the pypath.core.intercell.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a very long time and does not save or reload the database; it builds a fresh copy each time.

[53]:
from pypath.core import intercell
ic = intercell.get_db() # this takes quite some time
                       # unless you load annotations from a pickle cache
ic

executed in 0ms, finished 15:13:03 2022-12-02

[53]:
<Intercell annotations: 310033 records about 43617 entities>
[11]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

executed in 2m 55.47s, finished 15:43:27 2022-12-03

[11]:
<Intercell annotations: 301527 records about 48570 entities>

This object stores its data under the classes attribute. Classes are defined in pypath.core.intercell_annot.annot_combined_classes. In addition, we manually revised the more generic classes and excluded some proteins from them; these are listed in pypath.core.intercell_annot.excludes. Each class has the following properties:

  • name: all lowercase, human understandable name, without repeating the parent class (e.g. WNT receptors will be simply wnt, and the parent class will be receptor)

  • parent: for a specific class the parent is the generic category it belongs to; for generic classes the name and parent are the same

  • resource: the resource the data comes from, or OmniPath for composite classes (combined from multiple resources)

  • scope: specific or generic; e.g. TGF ligand is specific, ligand is generic

  • aspect: locational (e.g. plasma membrane) or functional (e.g. transporter)

Read more about the design of the intercell database in our paper.

[55]:
ic.classes

executed in 0ms, finished 15:16:54 2022-12-02

[55]:
{AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_location'): <AnnotationGroup `transmembrane` from UniProt_location, 5150 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_topology'): <AnnotationGroup `transmembrane` from UniProt_topology, 5760 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_keyword'): <AnnotationGroup `transmembrane` from UniProt_keyword, 7041 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane_predicted', resource='Phobius'): <AnnotationGroup `transmembrane` from Phobius, 6444 elements>,
 AnnotDefKey(name='transmembrane_phobius', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_phobius` from Almen2009, 2072 elements>,
 AnnotDefKey(name='transmembrane_sosui', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_sosui` from Almen2009, 1663 elements>,
 AnnotDefKey(name='trans
Output truncated: showing 1000 of 143945 characters
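The keys in the output above are named tuples of name, parent and resource. The sketch below mimics this layout with a locally defined `AnnotDefKey` and plain sets, to show how such a dictionary can be filtered by any combination of the key fields. Everything here is a stand-in: the tuple is not imported from pypath, `find_classes` is a hypothetical helper, and the member sets are tiny toy examples.

```python
from collections import namedtuple

# Local stand-in for the key type seen in the output above
AnnotDefKey = namedtuple('AnnotDefKey', ['name', 'parent', 'resource'])

# Toy dictionary shaped like `ic.classes`; values are plain sets here
classes = {
    AnnotDefKey('transmembrane', 'transmembrane', 'UniProt_location'):
        {'Q96JP9', 'Q9P126', 'Q13585'},
    AnnotDefKey('transmembrane', 'transmembrane_predicted', 'Phobius'):
        {'PLACEHOLDER'},  # made-up member, only to have a second entry
    AnnotDefKey('gaba', 'receptor', 'HGNC'):
        {'A8MPY1'},
}


def find_classes(classes, **criteria):
    """Return the keys whose fields match all the given criteria."""

    return [
        key
        for key in classes
        if all(
            getattr(key, field) == value
            for field, value in criteria.items()
        )
    ]


print(find_classes(classes, name = 'transmembrane'))
print(find_classes(classes, name = 'gaba', parent = 'receptor'))
```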

An easy way to access the classes is the select method. The AnnotationGroup objects behave as plain Python sets, and besides that, they feature many further attributes and methods.

[56]:
gaba_receptors = ic.select('gaba', parent = 'receptor')
gaba_receptors

executed in 0ms, finished 15:17:00 2022-12-02

[56]:
<AnnotationGroup `gaba` from HGNC, 40 elements>
[57]:
gaba_receptors.members

executed in 0ms, finished 15:17:02 2022-12-02

[57]:
{'A8MPY1',
 Complex GABA-A receptor (GABRA1, GABRB2, GABRD): COMPLEX:O14764_P14867_P47870,
 Complex GABA-A receptor, alpha-4/beta-3/delta: COMPLEX:O14764_P28472_P48169,
 Complex GABA-A receptor, alpha-6/beta-3/delta: COMPLEX:O14764_P28472_Q16445,
 Complex GABA-A receptor, alpha-4/beta-2/delta: COMPLEX:O14764_P47870_P48169,
 Complex GABA-A receptor, alpha-6/beta-2/delta: COMPLEX:O14764_P47870_Q16445,
 Complex GABBR1-GABBR2 complex: COMPLEX:O75899_Q9UBS5,
 Complex: COMPLEX:P14867,
 Complex GABA-A receptor, alpha-1/beta-3/gamma-2: COMPLEX:P14867_P18507_P28472,
 Complex GABA-A receptor (GABRA1, GABRB2, GABRG2): COMPLEX:P14867_P18507_P47870,
 Complex GABA-A receptor, alpha-5/beta-3/gamma-2: COMPLEX:P18507_P28472_P31644,
 Complex GABA-A receptor, alpha-3/beta-3/gamma-2: COMPLEX:P18507_P28472_P34903,
 Complex GABA-A receptor, alpha-2/beta-3/gamma-2: COMPLEX:P18507_P28472_P47869,
 Complex GABA-A receptor, alpha-6/beta-3/gamma-2: COMPLEX:P18507_P28472_Q16445,
 Complex: COMPLEX:P18507_Q8N1C3,
 C
Output truncated: showing 1000 of 1368 characters
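Because the groups behave like sets, the usual set algebra and comprehensions apply. A minimal sketch with plain Python sets follows; the strings only stand in for the members (in pypath the complexes are Complex objects, not strings), and `other_group` is a made-up second group:

```python
# Plain strings standing in for the members of an annotation group
gaba_receptors = {'A8MPY1', 'COMPLEX:O75899_Q9UBS5', 'COMPLEX:P14867'}
other_group = {'A8MPY1', 'PLACEHOLDER'}  # hypothetical second group

# set algebra: members shared by both groups, and members unique to one
shared = gaba_receptors & other_group
only_gaba = gaba_receptors - other_group

# comprehension: separate the complexes from the single proteins
complexes = {m for m in gaba_receptors if m.startswith('COMPLEX:')}
proteins = gaba_receptors - complexes

print(len(gaba_receptors), 'A8MPY1' in gaba_receptors)
print(sorted(proteins), sorted(complexes))
```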

Build an intercellular communication network§

The intercell database can be connected to a Network object to create an intercellular communication network:

[58]:
cu = omnipath.db.get_db('curated')
ic.register_network(cu)

executed in 0ms, finished 15:17:08 2022-12-02

Quantitative overview of intercell annotations§

A data frame with basic statistics is available:

[13]:
ic.counts_df()

executed in 0ms, finished 15:45:17 2022-12-03

[13]:
category parent database scope aspect source consensus_score transmitter receiver secreted plasma_membrane_transmembrane plasma_membrane_peripheral n_uniprot
0 transmembrane transmembrane UniProt_location generic locational resource_specific 6 False False False True False 5150
1 transmembrane transmembrane UniProt_topology generic locational resource_specific 6 False False False True False 5760
2 transmembrane transmembrane UniProt_keyword generic locational resource_specific 1 False False False False False 7041
3 transmembrane transmembrane_predicted Phobius generic locational resource_specific 1 False False False False False 6444
4 transmembrane_phobius transmembrane_predicted Almen2009 generic locational resource_specific 0 False False False True False 2072
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1120 parin_adhesion_regulator intracellular_intercellular_related HGNC specific functional resource_specific 0 True False False False False 5
1121 plakophilin_adhesion_regulator intracellular_intercellular_related HGNC specific functional resource_specific 0 True False False False False 3
1122 actin_regulation_adhesome intracellular_intercellular_related Adhesome specific functional resource_specific 0 True False False False False 22
1123 adhesion_cytoskeleton_adaptor intracellular_intercellular_related Adhesome specific functional resource_specific 0 True False False False False 118
1124 intracellular_intercellular_related intracellular_intercellular_related OmniPath generic functional composite 0 True False False False False 291

1125 rows × 13 columns

Intercell database as data frame§

Just like the other databases, the object can be exported into a pandas.DataFrame:

[14]:
ic.make_df()
ic.df[:10]

executed in 22.72s, finished 15:45:46 2022-12-03

[14]:
category parent database scope aspect source uniprot genesymbol entity_type consensus_score transmitter receiver secreted plasma_membrane_transmembrane plasma_membrane_peripheral
0 transmembrane transmembrane UniProt_location generic locational resource_specific Q96JP9 CDHR1 protein 6 False False False True False
1 transmembrane transmembrane UniProt_location generic locational resource_specific Q9P126 CLEC1B protein 8 False False False True False
2 transmembrane transmembrane UniProt_location generic locational resource_specific Q13585 GPR50 protein 6 False False False True False
3 transmembrane transmembrane UniProt_location generic locational resource_specific Q8N9I0 SYT2 protein 7 False False False False False
4 transmembrane transmembrane UniProt_location generic locational resource_specific O43614 HCRTR2 protein 6 False False False True False
5 transmembrane transmembrane UniProt_location generic locational resource_specific A6NJY1 SLC9B1P1 protein 4 False False False False False
6 transmembrane transmembrane UniProt_location generic locational resource_specific Q5RI15 COX20 protein 5 False False False False False
7 transmembrane transmembrane UniProt_location generic locational resource_specific Q13948 CUX1 protein 5 False False False False False
8 transmembrane transmembrane UniProt_location generic locational resource_specific Q8NGK4 OR52K1 protein 6 False False False False False
9 transmembrane transmembrane UniProt_location generic locational resource_specific Q8IYS2 KIAA2013 protein 7 False False False True False

Browse intercell categories§

Use the select method to access intercell classes:

[72]:
ic.select(definition = 'neurotensin', parent = 'receptor')

executed in 0ms, finished 15:27:15 2022-12-02

[72]:
<AnnotationGroup `neurotensin` from HGNC, 2 elements>

Proteins in each category can be listed with their descriptions from UniProt. Loading the UniProt datasheets for each protein is a slow process; we don’t recommend calling this method on more than a few dozen proteins.

[79]:
ic.show('neurotensin', parent = 'receptor')

executed in 1ms, finished 15:35:58 2022-12-02

=====> [2 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ O95665 │ NTSR2        │      410 │    45385 │ Neurotensi  │ Receptor     │ Cell       │ Cell         │
│       │        │              │          │          │ n receptor  │ for the tr   │ membrane,  │ membrane;    │
│       │        │              │          │          │ type 2      │ idecapepti   │ Disulfide  │ Multi-pass   │
│       │        │              │          │          │             │
Output truncated: showing 1000 of 7598 characters

Gene Ontology§

pypath.utils.go is an almost standalone module for managing the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. terms and their relationships; the latter represents their assignment to gene products. Both provide many versatile methods for querying.

[80]:
from pypath.utils import go
goa = go.GOAnnotation()

executed in 1.26s, finished 15:36:46 2022-12-02

[81]:
goa.ontology # the GeneOntology object

executed in 0ms, finished 15:36:48 2022-12-02

[81]:
<pypath.utils.go.GeneOntology at 0x689946b55570>
[82]:
goa # the GOAnnotation object

executed in 0ms, finished 15:36:50 2022-12-02

[82]:
<pypath.utils.go.GOAnnotation at 0x68991cdc9b40>

Among many others, the most versatile method is select which is able to select the annotated gene products by various expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.

[83]:
query = """(cell surface OR
        external side of plasma membrane OR
        extracellular region) AND
        (regulation of transmembrane transporter activity OR
        channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])

executed in 0ms, finished 15:36:55 2022-12-02

['P21333', 'P80108', 'P62258', 'Q9NRX4', 'P54710', 'Q8NER1', 'P01303']
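To illustrate how such a query can be resolved, here is a small self-contained sketch that evaluates AND, OR, NOT and parentheses over sets of annotated entities. It is not the implementation used in pypath.utils.go: the function names, the toy annotations and the entity IDs `P1` to `P4` are all made up, and NOT is assumed to mean the complement within all annotated entities.

```python
import re

# operators and parentheses; everything between them is a term name
TOKEN = re.compile(r'\s*(\(|\)|\bAND\b|\bOR\b|\bNOT\b)\s*')


def tokenize(expr):

    return [t.strip() for t in TOKEN.split(expr) if t and t.strip()]


def evaluate(expr, annotations):
    """Evaluate a boolean expression of term names to a set of entities."""

    tokens = tokenize(expr)
    universe = set().union(*annotations.values())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == 'OR':
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == 'AND':
            take()
            result = result & parse_not()
        return result

    def parse_not():
        tok = take()
        if tok == 'NOT':
            return universe - parse_not()
        if tok == '(':
            result = parse_or()
            take()  # consume the closing parenthesis
            return result
        return annotations.get(tok, set())

    return parse_or()


# toy annotations: term name -> made-up entity IDs
annotations = {
    'cell surface': {'P1', 'P2'},
    'extracellular region': {'P2', 'P3'},
    'channel regulator activity': {'P2', 'P3', 'P4'},
}

query = '(cell surface OR extracellular region) AND channel regulator activity'
print(sorted(evaluate(query, annotations)))
```

The grammar gives NOT the highest precedence, then AND, then OR, which is the usual convention for boolean expressions.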
[84]:
goa.ontology.get_all_descendants('GO:0005576')

executed in 0ms, finished 15:36:56 2022-12-02

[84]:
{'GO:0001507',
 'GO:0001527',
 'GO:0003351',
 'GO:0003355',
 'GO:0005201',
 'GO:0005576',
 'GO:0005577',
 'GO:0005582',
 'GO:0005583',
 'GO:0005584',
 'GO:0005585',
 'GO:0005586',
 'GO:0005587',
 'GO:0005588',
 'GO:0005590',
 'GO:0005591',
 'GO:0005592',
 'GO:0005595',
 'GO:0005596',
 'GO:0005599',
 'GO:0005601',
 'GO:0005602',
 'GO:0005604',
 'GO:0005606',
 'GO:0005607',
 'GO:0005608',
 'GO:0005609',
 'GO:0005610',
 'GO:0005611',
 'GO:0005612',
 'GO:0005614',
 'GO:0005615',
 'GO:0005616',
 'GO:0006858',
 'GO:0006859',
 'GO:0006860',
 'GO:0009519',
 'GO:0010367',
 'GO:0016914',
 'GO:0016942',
 'GO:0020003',
 'GO:0020004',
 'GO:0020005',
 'GO:0020006',
 'GO:0030020',
 'GO:0030021',
 'GO:0030023',
 'GO:0030197',
 'GO:0030345',
 'GO:0030934',
 'GO:0030935',
 'GO:0030938',
 'GO:0031012',
 'GO:0031395',
 'GO:0032311',
 'GO:0032579',
 'GO:0033165',
 'GO:0033166',
 'GO:0034358',
 'GO:0034359',
 'GO:0034360',
 'GO:0034361',
 'GO:0034362',
 'GO:0034363',
 'GO:0034364',
 'GO:0034365',
 'GO:00343
Output truncated: showing 1000 of 3104 characters
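Collecting all descendants of a term is a plain graph traversal. The sketch below shows the idea on a toy ontology; `all_descendants`, the `children` dictionary and the term IDs `T1` to `T4` are assumptions for illustration, not pypath code or real GO terms. As in the output above, the queried term itself is part of the result:

```python
def all_descendants(children, root):
    """Return the root and every term reachable from it via child links."""

    seen = {root}
    stack = [root]

    while stack:

        term = stack.pop()

        for child in children.get(term, ()):

            # a term can have several parents; visit it only once
            if child not in seen:
                seen.add(child)
                stack.append(child)

    return seen


# toy ontology: parent -> children; T4 has two parents
children = {
    'T1': ['T2', 'T3'],
    'T2': ['T4'],
    'T3': ['T4'],
}

print(sorted(all_descendants(children, 'T1')))
```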

Protein complexes§

The pypath.core.complex module builds a non-redundant list of complexes from about 12 original resources. Complexes are unique with respect to their set of components, and optionally carry stoichiometry information. Homomultimers are also included, hence some complexes consist of only a single kind of protein. The database is an instance of the pypath.core.complex.ComplexAggregator class and is built by the pypath.core.complex.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a very long time and does not save or reload the database; it builds a fresh copy each time.

[90]:
from pypath.core import complex
co = complex.get_db()
co.update_index()
co

executed in 0ms, finished 15:39:31 2022-12-02

[90]:
<Complex database: 28173 complexes>

To retrieve all complexes containing a specific protein, here MTOR:

[91]:
co.proteins['P42345']

executed in 0ms, finished 15:39:42 2022-12-02

[91]:
{Complex: COMPLEX:O00141_O15530_O75879_P23443_P34931_P42345_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_O15530_P07900_P23443_P31749_P31751_P42345_P78527_Q05513_Q05655_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P0CG47_P0CG48_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q96J02_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O75879_P0CG48_P23443_P34931_P42345_P62753_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P36894_P42345_P62942_P68106_Q15427_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P46781_P62753_Q6R327_Q8N122_Q96KQ7_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_P62942_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q15172_Q6R327_Q8IW41_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q6R327_Q70Z35_Q8N122_Q8TCU6_Q9BPZ7
Output truncated: showing 1000 of 5348 characters

Note that some of the complexes have human readable names; these are preferred for printing if available from any of the databases. Otherwise the complexes are labelled by COMPLEX:list-of-components.

Protein complex objects§

Take a closer look at one complex object. The hash of the complex object is equivalent to the string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:

[97]:
cplex = co.complexes['COMPLEX:Q09472_Q92793']
cplex

executed in 0ms, finished 15:41:36 2022-12-02

[97]:
Complex CBP/p300: COMPLEX:Q09472_Q92793
[98]:
cplex.components # stoichiometry

executed in 0ms, finished 15:41:38 2022-12-02

[98]:
{'Q92793': 1, 'Q09472': 1}
[99]:
cplex.sources # resources

executed in 0ms, finished 15:41:39 2022-12-02

[99]:
{'Signor'}
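The labelling scheme described above (unique, alphabetically sorted UniProt IDs after the COMPLEX: prefix) and a protein-to-complexes index similar in spirit to co.proteins can be sketched in a few lines. The helpers `complex_key` and `index_by_protein` are hypothetical and written only for this illustration; the component lists are taken from the examples in this section.

```python
from collections import defaultdict


def complex_key(components):
    """Build a `COMPLEX:...` label from an iterable of UniProt IDs."""

    return 'COMPLEX:' + '_'.join(sorted(set(components)))


def index_by_protein(complexes):
    """Map each UniProt ID to the labels of the complexes containing it."""

    index = defaultdict(set)

    for components in complexes:

        key = complex_key(components)

        for uniprot in set(components):
            index[uniprot].add(key)

    return index


# the order of the components does not matter
cbp_p300 = ['Q92793', 'Q09472']
mtorc1 = ['Q9BVC4', 'P42345', 'Q8N122', 'Q8TB45', 'Q96B36']

index = index_by_protein([cbp_p300, mtorc1])

print(complex_key(cbp_p300))  # COMPLEX:Q09472_Q92793
print(index['P42345'])
```

Since the label depends only on the set of components, the same complex reported by several resources collapses into a single entry.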

Protein complex data frame§

The database can be exported into a pandas.DataFrame:

[18]:
co.make_df()
co.df

executed in 3.40s, finished 15:47:16 2022-12-03

[18]:
name components components_genesymbols stoichiometry sources references identifiers
0 NFY P23511_P25208_Q13952 NFYA_NFYB_NFYC 1:1:1 CORUM;Compleat;PDB;Signor;ComplexPortal;hu.MAP... 15243141;14755292;9372932 Signor:SIGNOR-C1;CORUM:4478;Compleat:HC1449;in...
1 mTORC2 P68104_P85299_Q6R327_Q8TB45_Q9BVC4 DEPTOR_EEF1A1_MLST8_PRR5_RICTOR 0:0:0:0:0 Signor Signor:SIGNOR-C2
2 mTORC1 P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4 AKT1S1_DEPTOR_MLST8_MTOR_RPTOR 0:0:0:0:0 Signor Signor:SIGNOR-C3
3 SCF-betaTRCP P63208_Q13616_Q9Y297 BTRC_CUL1_SKP1 1:1:1 CORUM;Compleat;Signor 9990852 Signor:SIGNOR-C5;CORUM:227;Compleat:HC757
4 CBP/p300 Q09472_Q92793 CREBBP_EP300 0:0 Signor Signor:SIGNOR-C6
... ... ... ... ... ... ... ...
28168 Npnt complex 2 Q5SZK8_Q6UXI9_Q86XX4 FRAS1_FREM2_NPNT 0:0:0 CellChatDB
28169 NRP1_NRP2 O14786_O60462_Q9Y4D7 NRP1_NRP2_PLXND1 0:0:0 CellChatDB
28170 NRP2_PLXNA2 O60462_O75051 NRP2_PLXNA2 0:0 CellChatDB
28171 NRP2_PLXNA4 O60462_Q9HCM2 NRP2_PLXNA4 0:0 CellChatDB
28172 PTCH2_SMO Q99835_Q9Y6C5 PTCH2_SMO 0:0 CellChatDB

28173 rows × 7 columns

Saving datasets as pickles§

The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than convenient, e.g. from a few minutes up to half an hour. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps. In fact, the database manager does this all the time, in a coordinated way – for this reason, the methods below should be used only with good reason, and relying on the database manager is preferred.

[ ]:
from pypath.core import annot, complex

# for `pypath.core.annot.AnnotationTable` objects
# (`a` is an already built instance):
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.core.complex.ComplexAggregator` objects
# (`complexdb` is an already built instance):
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')
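For readers unfamiliar with pickles, the round trip behind such save and load methods looks roughly like the sketch below. It uses only the standard library and a toy dictionary; the helper functions are hypothetical and not the pypath methods, which may store their data differently.

```python
import os
import pickle
import tempfile


def save_to_pickle(obj, path):
    """Serialize any picklable object into a file."""

    with open(path, 'wb') as fp:
        pickle.dump(obj, fp)


def load_from_pickle(path):
    """Restore an object from a pickle file."""

    with open(path, 'rb') as fp:
        return pickle.load(fp)


# toy content standing in for a processed database
database = {'COMPLEX:Q09472_Q92793': {'Q09472': 1, 'Q92793': 1}}

with tempfile.TemporaryDirectory() as tmpdir:

    path = os.path.join(tmpdir, 'toy_database.pickle')
    save_to_pickle(database, path)
    restored = load_from_pickle(path)

print(restored == database)
```

Loading a pickle executes code embedded in the file, so only load pickles that you created yourself or that come from a trusted source.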

Log messages and sessions§

In pypath all modules send messages to a log file, which is named by default after the session ID (a random string of 5 characters). The default path to the log file is ./pypath_log/pypath-xxxxx.log where xxxxx is the session ID.

Warning: The logger of pypath is really verbose and the log files can grow huge: several tens of thousands of lines, a few MB. It is recommended to empty the pypath_log directories from time to time.

Basic info about the session§

The info function prints the most important information about the current session:

[100]:
import pypath
pypath.info()

executed in 0ms, finished 15:41:55 2022-12-02

[2022-12-02 16:41:55] [pypath]
        - session ID: `l0n17`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log`
        - pypath version: 0.14.31

Another function prints a disclaimer about licenses. Until recently this message was printed upon every import. It is still important, but we removed it because in certain situations it can be annoying.

[101]:
pypath.disclaimer()

executed in 0ms, finished 15:41:59 2022-12-02


        === d i s c l a i m e r ===

        All data accessed through this module,
        either as redistributed copy or downloaded using the
        programmatic interfaces included in the present module,
        are free to use at least for academic research or
        education purposes.
        Please be aware of the licenses of all the datasets
        you use in your analysis, and please give appropriate
        credits for the original sources when you publish your
        results. To find out more about data sources please
        look at `pypath/resources/data/resources.json` or
        https://omnipathdb.org/info and
        `pypath.resources.urls.urls`.

Read the log file§

Calling pypath.log opens the log file in the default console application for paginating text files (on GNU systems typically less):

[ ]:
pypath.log()

executed in 0ms, finished 15:42:08 2022-12-02

The logger and the log file are bound to the session (the 5 random characters are the session ID):

[104]:
pypath.session

executed in 0ms, finished 15:42:27 2022-12-02

[104]:
<Session l0n17>

The logger:

[105]:
pypath.session.log

executed in 0ms, finished 15:42:46 2022-12-02

[105]:
Logger [/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log]

The path to the log file:

[106]:
pypath.session.log.fname

executed in 0ms, finished 15:42:49 2022-12-02

[106]:
'/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log'

Logging to the console§

Each log message has a numeric priority level, and messages with a level lower than a threshold are printed to the console. By default only important warnings are dispatched to the console. To log everything to the console, set the threshold to a large number:

[107]:
pypath.session.log.console_level = 10

from pypath.inputs import signor

si = signor.signor_interactions()
pypath.session.log.console_level = -1

executed in 0ms, finished 15:42:56 2022-12-02

[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https
Output truncated: showing 1000 of 1046 characters

Disable logging§

To avoid creation of a log file (and the directory pypath_log) set the environment variable PYPATH_LOG or the builtins.PYPATH_LOG attribute:

[ ]:
# shell:
export PYPATH_LOG="/dev/null"
# then, start Python and use pypath
[108]:
import os
import builtins
builtins.PYPATH_LOG=os.devnull
import pypath

executed in 0ms, finished 15:43:10 2022-12-02

Write to the log§

Sending a single message§

First we change the console level so we can see the log messages. The label is optional. The priority of the message is given by the level. Notice that the second message won’t be printed to the console, as its level is higher than 10:

[109]:
pypath.session.log.console_level = 10
pypath.session.log.msg('Greetings from the pypath tutorial notebook! :)', label = 'book')
pypath.session.log.msg('Not important, not shown on console but printed to the logfile.', level = 11)

executed in 0ms, finished 15:43:13 2022-12-02

[2022-12-02 16:43:13] [book] Greetings from the pypath tutorial notebook! :)
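The behaviour seen above, where every message goes to the file and only some reach the console, can be pictured with a toy logger. `ToyLogger` is a hypothetical stand-in, not the pypath Logger; the default level of 0 and the "level not higher than the threshold" rule are assumptions chosen to be consistent with the example above.

```python
class ToyLogger:
    """Hypothetical stand-in illustrating the console threshold."""

    def __init__(self, console_level = 0):

        self.console_level = console_level
        self.file_lines = []     # stands in for the log file
        self.console_lines = []  # stands in for the console

    def msg(self, message, label = 'toy', level = 0):

        line = '[%s] %s' % (label, message)
        # everything is written to the file
        self.file_lines.append(line)

        # only messages at or below the threshold reach the console
        if level <= self.console_level:
            self.console_lines.append(line)


log = ToyLogger(console_level = 10)
log.msg('Shown on console and written to file', label = 'book')
log.msg('Written to file only', level = 11)

print(log.console_lines)
print(len(log.file_lines))
```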

Connect a module or class to the pypath logger§

The preferred way of connecting to the logger is to make a class inherit from the Logger class. Here the name will be the default label for all messages coming from the instances of this class:

[110]:
from pypath.share import session

class ChildOfLogger(session.Logger):

    def __init__(self):

        session.Logger.__init__(self, name = 'child')

    def say_something(self):

        self._log('Have a nice day! :D')


col = ChildOfLogger()
col.say_something()

executed in 0ms, finished 15:43:17 2022-12-02

[2022-12-02 16:43:17] [child] Have a nice day! :D

Alternatively, a logger can be created anywhere and used from any module or function:

[111]:
from pypath.share import session

_logger = session.Logger(name = 'mylogger')
_log = _logger._log

_log('Message from a stray logger')

executed in 0ms, finished 15:43:20 2022-12-02

[2022-12-02 16:43:20] [mylogger] Message from a stray logger

Finally we just set the console level to a lower value, to avoid flooding the rest of this book with log messages:

[112]:
pypath.session.log.console_level = -1

executed in 0ms, finished 15:43:23 2022-12-02

BEL export§

Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.

Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language to capture relationships between various biological entities, spanning a wide range of levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:

[ ]:
from pypath.legacy import main
from pypath.resources import data_formats
from pypath.omnipath import bel
[ ]:
pa = main.PyPath()
pa.init_network(data_formats.pathway)

You can provide one or more resources to the Bel class. Supported resources currently are pypath.main.PyPath and pypath.ptm.PtmAggregator.

[ ]:
b = bel.Bel(resource = pa)

From the resources we compile a BELGraph object which provides a Python interface for various operations and you can also export the data in BEL format:

[ ]:
b.main()
[ ]:
b.bel_graph
[ ]:
b.bel_graph.summarize()
[ ]:
b.export_relationships('omnipath_pathways.bel')
[ ]:
with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()
[ ]:
print(bel_str[:333])

CellPhoneDB export§

Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.

CellPhoneDB is a statistical method and a database for inferring inter-cellular communication pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a resource for interaction, protein complex and annotation data. Apart from this, pypath is able to export its data in the appropriate format to provide input for the CellPhoneDB Python module. For this you can use the pypath.omnipath.cellphonedb module:

[ ]:
from pypath.omnipath import cellphonedb
from pypath.share import settings

settings.setup(network_expand_complexes = False)

Here you can provide parameters for the network or an already built network. You can also provide the datasets as pickles to make them load really fast; otherwise this step takes quite long.

[ ]:
c = cellphonedb.CellPhoneDB()

You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they are also exported to CSV files. For example, the interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):

[ ]:
c.interaction_dataframe[:10]

The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data from the pypath.intercell module (identical to the http://omnipathdb.org/intercell query of the web service):

[ ]:
c.protein_dataframe[:10]

The legacy igraph-based network object§

Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.

Until about 2019 (before pypath version 0.9) pypath organized all its data structures around an igraph.Graph object (igraph.org). This legacy API is still present in pypath.legacy.main; however, it is not maintained. This section of the book is still here, but will be removed soon, along with the legacy module.

[43]:
from pypath.legacy import main
No module `cairo` available.
Some plotting functionalities won't be accessible.
[ ]:
pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h
                    # to run it for the first time due to the vast
                    # amount of data download.
                    # Once you populated the cache it still takes
                    # approx. 30 min to build the entire OmniPath
                    # as the process consists of quite some data
                    # processing. If you dump it in a pickle, you
                    # can load the network in < 1 min

I just want a network quickly and play around with pypath§

You can find the predefined formats in the pypath.resources.network module. For example, to load one resource from there, let’s say SIGNOR:

[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
pa = main.PyPath()
pa.load_resources({'signor': netres.pathway['signor']})

Or to load all activity flow resources with literature references:

[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway)

Or to load all activity flow resources, including the ones without literature references:

[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway_all)

How do I build networks from any data with pypath?§

Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don’t need to worry about merging redundant elements, or about different formats and identifiers. Let’s say you have two files with network data:

network1.csv

entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation

network2.sif

EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B

Note: you need to create these files in order to load them.
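One way to create the two files is to write them from Python into the current working directory. This is just a convenience sketch using the standard library; the file contents are copied from the listings above.

```python
from pathlib import Path

NETWORK1_CSV = """\
entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
"""

NETWORK2_SIF = """\
EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
"""

# write both files next to the notebook or script
Path('network1.csv').write_text(NETWORK1_CSV)
Path('network2.sif').write_text(NETWORK2_SIF)
```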

Defining input formats§

[ ]:
import pypath
import pypath.internals.input_formats as input_formats

input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)

input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)
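To see what these settings describe, the sketch below reads signed edges from lines of text using the same kind of parameters: a separator, the two ID columns, and a sign tuple interpreted here as (column index, value meaning stimulation, value meaning inhibition). This interpretation is inferred from the two definitions above, and `read_signed_edges` is a hypothetical helper, not part of pypath; the real reader also translates identifiers and merges redundant edges.

```python
def read_signed_edges(
        lines,
        separator,
        id_col_a,
        id_col_b,
        sign,
        header = False,
    ):
    """Parse (source, target, effect) triples; effect is 1, -1 or 0."""

    sign_col, stim_value, inh_value = sign
    effects = {stim_value: 1, inh_value: -1}
    edges = []

    # skip the first line if the file has a header
    for line in (lines[1:] if header else lines):

        fields = line.strip().split(separator)
        effect = effects.get(fields[sign_col], 0)
        edges.append((fields[id_col_a], fields[id_col_b], effect))

    return edges


csv_lines = ['entrezA,entrezB,effect', '1950,1956,inhibition']
sif_lines = ['EGF + EGFR', 'AKT1 - GSK3B']

edges1 = read_signed_edges(
    csv_lines, ',', 0, 1, (2, 'stimulation', 'inhibition'), header = True,
)
edges2 = read_signed_edges(sif_lines, ' ', 0, 2, (1, '+', '-'))

print(edges1)
print(edges2)
```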

Creating PyPath object and loading the 2 test files§

[ ]:
inputs = {
    'egf1': input1,
    'egf2': input2
}

pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)

Structure of the legacy network object§

[ ]:
from pypath.legacy import main as legacy
pa = legacy.PyPath()
[ ]:
pa.graph

Number of edges and nodes:

[ ]:
pa.ecount, pa.vcount

You can access the edge and vertex sequences in the es and vs attributes of the graph; you can iterate over these or index them by integers. The edge and vertex attributes are accessed by string keys. E.g. get the sources of edge 81:

[ ]:
pa.graph.es[81]['sources']

Directions and signs§

By default the igraph object is undirected but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can create it manually:

[ ]:
pa.get_directed()

You find the directed network in the pa.dgraph attribute:

[ ]:
pa.dgraph

Now let’s take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select a random edge:

[ ]:
edge = pa.graph.es[3241]

The Direction object is in the dirs edge attribute:

[ ]:
d = edge['dirs']

It has a method to print its content in a human-readable way:

[ ]:
print(pa.graph.es[3241]['dirs'])

From this we see that the databases phosphoELM and Signor agree that protein P17252 has an effect on Q15139, and Signor in addition tells us this effect is stimulatory. In your scripts, however, you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions straight and reverse:

[ ]:
d.straight
[ ]:
d.reverse

It can tell you if one of these directions is supported by any of the network resources:

[ ]:
d.get_dir(d.straight)

Or it can return those resources:

[ ]:
d.get_dir(d.straight, sources = True)

The opposite direction is not supported by any resource:

[ ]:
d.get_dir(d.reverse, sources = True)
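
The idea behind these queries can be pictured with a small stand-in class. The sketch below is not the pypath Direction class: ToyDirection and its add method are invented for illustration. It only mimics the behaviour described above, where each direction is a (source, target) tuple that carries the set of resources supporting it.

```python
class ToyDirection:
    """Toy stand-in for a per-edge record of directions and their sources."""

    def __init__(self, id_a, id_b):

        # the two possible directions of an edge between id_a and id_b
        self.straight = (id_a, id_b)
        self.reverse = (id_b, id_a)
        self.sources = {self.straight: set(), self.reverse: set()}

    def add(self, direction, resource):
        """Record that `resource` supports `direction`."""

        self.sources[direction].add(resource)

    def get_dir(self, direction, sources = False):
        """Return the supporting resources, or only whether there are any."""

        supporting = self.sources[direction]

        return set(supporting) if sources else bool(supporting)


d = ToyDirection('P17252', 'Q15139')
d.add(d.straight, 'phosphoELM')
d.add(d.straight, 'Signor')

print(d.get_dir(d.straight))
print(sorted(d.get_dir(d.straight, sources = True)))
print(d.get_dir(d.reverse, sources = True))
```

Keeping the resources per direction, rather than a single direction per edge, is what makes it possible to store contradicting evidence side by side.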

The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.

[ ]:
d.get_sign(d.straight)

Or you can ask whether it is inhibition:

[ ]:
d.is_inhibition(d.straight)

Or if the interaction is directed at all:

[ ]:
d.is_directed()

Sometimes resources don’t agree: for example, one says an interaction is an inhibition while according to others it is a stimulation; or one says A affects B and another resource says the other way around. Here we preserve all this potentially contradictory information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method to get a consensus direction and sign, which returns the attributes most resources agree on:

[ ]:
d.consensus_edges()
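
A simple way to picture such a consensus is a majority vote over resources. The sketch below assumes exactly that rule, which may differ from what consensus_edges actually does (for instance in how ties are handled or what it returns); the function consensus and the names 'A', 'B' and 'resource1' to 'resource4' are placeholders.

```python
def consensus(support):
    """
    Pick the best supported alternatives by majority vote.

    `support` maps each alternative (e.g. a direction tuple or a sign)
    to the set of resources asserting it. Returns the list of
    alternatives backed by the most resources, so ties are kept.
    """

    counts = {alt: len(res) for alt, res in support.items() if res}

    if not counts:
        return []

    top = max(counts.values())

    return [alt for alt, n in counts.items() if n == top]


directions = {
    ('A', 'B'): {'resource1', 'resource2', 'resource3'},
    ('B', 'A'): {'resource4'},
}
signs = {
    'stimulation': {'resource1', 'resource2'},
    'inhibition': {'resource3'},
}

print(consensus(directions))
print(consensus(signs))
```

Returning a list rather than a single value keeps ties visible, so the caller still decides how to treat equally supported alternatives.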

Accessing nodes in the network§

In igraph the vertices are numbered, but this numbering can change during certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.

[ ]:
pa.graph.vs['name'][:5]
[ ]:
pa.graph.vs['label'][:5]

The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:

[ ]:
type(pa.up('P00533'))

Similarly genesymbol or gs for Gene Symbols:

[ ]:
type(pa.gs('ESR1'))

Each of these has a “plural” version:

[ ]:
len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))

And a generic method where you can mix UniProts and Gene Symbols:

[ ]:
len(list(pa.proteins(['MTOR', 'P00533'])))

Querying relationships with or without causality§

Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which others are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:

[ ]:
pa.gs_stimulated_by('PIK3CA')

It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:

[ ]:
list(pa.gs_stimulated_by('PIK3CA').gs())[:5]
[ ]:
list(pa.gs_stimulated_by('PIK3CA').up())[:5]

Note, the names of these methods are a bit counterintuitive; for example, gs_stimulates returns the genes stimulated by PIK3CA:

[ ]:
list(pa.gs_stimulates('PIK3CA').gs())[:5]
[ ]:
'PIK3CA' in set(pa.affected_by('AKT1').gs())

There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to Gene Symbols, .up() to UniProts and .ids() to vertex IDs; by default it yields igraph.Vertex objects:

[ ]:
list(pa.neighbors('AKT1').ids())[:5]
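
The logic of this family of methods can be illustrated on the small EGF network from network2.sif. The sketch below is plain Python, not pypath code: the functions regulators, targets and neighbors are hypothetical, and it assumes that "A + B" means A stimulates B and "A - B" means A inhibits B.

```python
# directed, signed edges: (source, target, sign)
# sign is +1 for stimulation and -1 for inhibition
EDGES = [
    ('EGF', 'EGFR', 1),
    ('EGFR', 'PIK3CA', 1),
    ('EGFR', 'SOS1', 1),
    ('PIK3CA', 'RAC1', 1),
    ('RAC1', 'MAP3K1', 1),
    ('SOS1', 'HRAS', 1),
    ('HRAS', 'MAP3K1', 1),
    ('PIK3CA', 'AKT1', 1),
    ('AKT1', 'GSK3B', -1),
]


def regulators(node, sign = None):
    """Sources of the edges pointing to `node`, optionally of one sign only."""

    return {s for s, t, e in EDGES if t == node and sign in (None, e)}


def targets(node, sign = None):
    """Targets of the edges starting from `node`, optionally of one sign only."""

    return {t for s, t, e in EDGES if s == node and sign in (None, e)}


def neighbors(node):
    """All partners of `node`, ignoring direction and sign."""

    return regulators(node) | targets(node)


print(regulators('PIK3CA', sign = 1))   # analogous to stimulated_by
print(targets('PIK3CA', sign = 1))      # analogous to stimulates
print(regulators('GSK3B', sign = -1))   # analogous to inhibited_by
print('PIK3CA' in regulators('AKT1'))   # analogous to affected_by
print(neighbors('AKT1'))
```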

Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):

[ ]:
print(list(pa.neighborhood('ATG3', 1).gs()))
[ ]:
print(list(pa.neighborhood('ATG3', 2).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 3).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 4).gs()))
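
How the neighborhood grows with the number of steps can be seen even on the nine-edge network from network2.sif. The sketch below uses a breadth-first search over an undirected adjacency map; the function neighborhood here is a stand-alone illustration, not the pypath method, and it assumes that the starting node counts as part of its own neighborhood.

```python
from collections import deque

EDGES = [
    ('EGF', 'EGFR'), ('EGFR', 'PIK3CA'), ('EGFR', 'SOS1'),
    ('PIK3CA', 'RAC1'), ('RAC1', 'MAP3K1'), ('SOS1', 'HRAS'),
    ('HRAS', 'MAP3K1'), ('PIK3CA', 'AKT1'), ('AKT1', 'GSK3B'),
]


def neighborhood(start, steps):
    """Nodes reachable from `start` in at most `steps` steps, ignoring direction."""

    # build an undirected adjacency map
    adjacency = {}

    for a, b in EDGES:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # breadth-first search, recording the distance of each node found
    distance = {start: 0}
    queue = deque([start])

    while queue:

        node = queue.popleft()

        # do not expand nodes that are already `steps` away
        if distance[node] == steps:
            continue

        for nb in adjacency.get(node, ()):
            if nb not in distance:
                distance[nb] = distance[node] + 1
                queue.append(nb)

    return set(distance)


for steps in range(1, 5):
    print(steps, len(neighborhood('EGF', steps)))
```

Starting from EGF, four steps are already enough to cover this whole toy network; in a network with thousands of nodes the same effect sets in after only a few steps.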

Accessing edges by identifiers§

Just like nodes, edges can also be accessed by identifiers such as Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.

[ ]:
type(pa.get_edge('EGF', 'EGFR'))
[ ]:
type(pa.get_edge('EGF', 'P00533'))
[ ]:
type(pa.get_edge('EGF', 'AKT1'))
[ ]:
print(pa.get_edge('EGF', 'EGFR')['dirs'])
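
The lookup behaviour shown in these cells (either identifier type accepted, None for a missing edge) can be sketched with two dictionaries. Everything below is a hypothetical illustration rather than pypath's implementation: the names LABELS, EDGES and to_uniprot, the tiny identifier table and the 'example_resource' attribute value are all made up for this example.

```python
# UniProt ID -> Gene Symbol for a handful of human proteins
LABELS = {'P01133': 'EGF', 'P00533': 'EGFR', 'P31749': 'AKT1'}
# the reverse mapping, Gene Symbol -> UniProt ID
NAMES = {label: name for name, label in LABELS.items()}

# undirected edges keyed by the unordered pair of UniProt IDs
EDGES = {
    frozenset(('P01133', 'P00533')): {'sources': {'example_resource'}},
}


def to_uniprot(identifier):
    """Accept a UniProt ID or a Gene Symbol; return the UniProt ID or None."""

    if identifier in LABELS:
        return identifier

    return NAMES.get(identifier)


def get_edge(id_a, id_b):
    """Return the attributes of the edge between the two nodes, or None."""

    key = frozenset((to_uniprot(id_a), to_uniprot(id_b)))

    return EDGES.get(key)


print(get_edge('EGF', 'EGFR'))
print(get_edge('EGF', 'P00533'))
print(get_edge('EGF', 'AKT1'))
```

Using an unordered pair as the key means the order of the two arguments does not matter, which matches the undirected graph described earlier.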

Literature references§

Select a random edge and in the references attribute you find a list of references:

[ ]:
edge = pa.get_edge('MAP1LC3B', 'SQSTM1')
edge['references']

Each reference has a PubMed ID:

[ ]:
edge['references'][0].pmid
[ ]:
edge['references'][0].open()

These 3 references come from 3 different databases, but there must be 2 overlaps between them:

[ ]:
edge['refs_by_source']

Plotting the network with igraph§

Here we use the network created above (because it is of a reasonable size, unlike the networks we could get from most of the network databases). Igraph has excellent plotting abilities built on top of the cairo library.

[ ]:
import igraph
plot = igraph.plot(pa.graph, target = 'egf_network.png',
            edge_width = 0.3, edge_color = '#777777',
            vertex_color = '#97BE73', vertex_frame_width = 0,
            vertex_size = 70.0, vertex_label_size = 15,
            vertex_label_color = '#FFFFFF',
            # due to a bug in either igraph or IPython,
            # vertex labels are not visible on inline plots:
            inline = False, margin = 120)
from IPython.display import Image
Image(filename='egf_network.png')