The pypath book§

Contents

  • 1  Introduction

  • 2  Build, load and save databases

    • 2.1  The OmniPath app

    • 2.2  Built-in database definitions

    • 2.3  Networks

      • 2.3.1  Strictly literature curated network

      • 2.3.2  The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions

      • 2.3.3  Transcriptional regulation network from DoRothEA and other resources

      • 2.3.4  Literature curated miRNA post-transcriptional regulation network

      • 2.3.5  Transcriptional regulation of miRNA

      • 2.3.6  lncRNA-mRNA interactions

      • 2.3.7  Small molecule-protein interactions

    • 2.4  Enzyme-substrate relationships

    • 2.5  Protein complexes

    • 2.6  Annotations

    • 2.7  Inter-cellular communication roles

  • 3  Data directly from the original resources

  • 4  Interesting resources

    • 4.1  RaMP

      • 4.1.1  TL;DR

    • 4.2  HMDB (Human Metabolome Database)

      • 4.2.1  Direct access to HMDB data

      • 4.2.2  Higher level access to HMDB data

      • 4.2.3  ID translation with HMDB

    • 4.3  NCBI E-Utils

  • 5  Download management

    • 5.1  Cache management and customization

    • 5.2  Download failures

      • 5.2.1  Corrupted cache content

      • 5.2.2  Network communication issues: look into the curl debug log

      • 5.2.3  Timeouts

      • 5.2.4  Access and inspect the Curl object

      • 5.2.5  Is it failing only for you?

      • 5.2.6  Read the log

      • 5.2.7  TLS (SSL, HTTPS) errors

  • 6  Resources

    • 6.1  Licenses

      • 6.1.1  Example: build a network for commercial use

    • 6.2  Resource information

    • 6.3  Resource definitions for a certain database or dataset

  • 7  Building networks

    • 7.1  Which network datasets are pre-defined in pypath?

    • 7.2  The Network object

    • 7.3  Network in pandas.DataFrame

    • 7.4  Self interactions (loop edges) in the network

    • 7.5  Molecular complexes in the network

  • 8  Translating identifiers

    • 8.1  Pre-defined ID translation tables

    • 8.2  Direct access to ID translation tables

  • 9  Orthology translation

    • 9.1  Orthology translation tables as dictionaries

    • 9.2  Orthology translation data frames

  • 10  Taxonomy

    • 10.1  Translating to NCBI Taxonomy, scientific names and common names

    • 10.2  Organism from UniProt ID

  • 11  UniProt

    • 11.1  The UniProt input module

      • 11.1.1  All UniProt IDs for one organism

      • 11.1.2  UniProt ID format validation

      • 11.1.3  UniProt ID validation

      • 11.1.4  Single UniProt protein datasheet

      • 11.1.5  History of UniProt records

      • 11.1.6  UniProt REST API

      • 11.1.7  Processed UniProt annotations

    • 11.2  The UniProt utils module

      • 11.2.1  Datasheets

      • 11.2.2  Tables

    • 11.3  Sanitizing UniProt IDs

  • 12  Enzyme-substrate interactions

    • 12.1  Enzyme-substrate objects

    • 12.2  Enzyme-substrate data frame

  • 13  Protein sequences

  • 14  Annotations

    • 14.1  Load a single annotation resource

    • 14.2  Load the full annotations database by the database manager

    • 14.3  Load only selected annotations

    • 14.4  Data frames of annotations

  • 15  Inter-cellular signaling roles

    • 15.1  Build an intercellular communication network

    • 15.2  Quantitative overview of intercell annotations

    • 15.3  Intercell database as data frame

    • 15.4  Browse intercell categories

  • 16  Gene Ontology

  • 17  Protein complexes

    • 17.1  Protein complex objects

    • 17.2  Protein complex data frame

  • 18  Saving datasets as pickles

  • 19  Log messages and sessions

    • 19.1  Basic info about the session

    • 19.2  Read the log file

    • 19.3  Logging to the console

    • 19.4  Disable logging

    • 19.5  Write to the log

      • 19.5.1  Sending a single message

      • 19.5.2  Connect a module or class to the pypath logger

  • 20  BEL export

  • 21  CellPhoneDB export

  • 22  The legacy igraph-based network object

    • 22.1  I just want a network quickly and play around with pypath

    • 22.2  How do I build networks from any data with pypath?

      • 22.2.1  Defining input formats

      • 22.2.2  Creating PyPath object and loading the 2 test files

    • 22.3  Structure of the legacy network object

      • 22.3.1  Directions and signs

      • 22.3.2  Accessing nodes in the network

    • 22.4  Querying relationships with or without causality

    • 22.5  Accessing edges by identifiers

    • 22.6  Literature references

    • 22.7  Plotting the network with igraph

Introduction§

OmniPath consists of five main database segments: network (interactions), enzyme-substrate interactions (enz_sub or ptms), protein complexes (complexes), molecular entity annotations (annotations) and intercellular communication roles (intercell). All of these are accessible through the web service at https://omnipathdb.org/ and the R/Bioconductor package OmnipathR; the network and some of the annotations are also available in the Cytoscape app. However, only pypath is able to build these databases directly from the original sources, with various options for customization, and to provide a rich and versatile Python API for each database. This book attempts to be a guided tour of pypath, but almost all objects, modules and APIs presented here have many more methods, options and features than we have a chance to cover. If you think pypath might offer something useful for you that is not covered here, don’t hesitate to ask us on GitHub.

This document has been run with the following pypath version:

[1]:
import pypath
pypath.__version__

executed in 0ms, finished 16:49:47 2023-03-09

[1]:
'0.14.36'

Build, load and save databases§

We provide a high-level interface in the module pypath.omnipath.app. This is the easiest way to build, manage and access the OmniPath databases, hence this is what we present in the Quick start section. In further sections we show the lower-level modules in more detail.

The OmniPath app§

pypath.omnipath is an application which contains a database manager at omnipath.db. This manager is empty by default. It builds and loads the databases on demand.

[2]:
from pypath import omnipath

omnipath.db

executed in 1.34s, finished 14:11:27 2022-12-03

[2]:
<pypath.omnipath.app.DatabaseManager at 0x602fb851cd90>
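
The on-demand behaviour described above can be pictured with a small stand-alone sketch. The class LazyDatabaseManager and the builder function below are hypothetical and are not pypath's implementation; they only illustrate the pattern of starting empty, building a dataset at the first request and reusing it afterwards.

```python
class LazyDatabaseManager:
    """Hypothetical illustration of an on-demand database manager."""

    def __init__(self, builders):
        # Maps each dataset name to a function that builds the dataset.
        self._builders = dict(builders)
        # Nothing is built at creation time: the manager starts empty.
        self._loaded = {}

    @property
    def datasets(self):
        # Names of the datasets this manager knows how to build.
        return list(self._builders)

    def get_db(self, name):
        # Build the dataset only at the first request, then reuse it.
        if name not in self._loaded:
            self._loaded[name] = self._builders[name]()
        return self._loaded[name]


build_calls = []


def build_toy_network():
    # Stand-in for an expensive build step; records every call.
    build_calls.append('curated')
    return {'nodes': 3, 'interactions': 2}


db = LazyDatabaseManager({'curated': build_toy_network})
first = db.get_db('curated')   # triggers the build
second = db.get_db('curated')  # served from memory, no second build
```

In this sketch the second request returns the very same object, which is why repeated calls to a getter of this kind are cheap once the first build has finished.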

Built-in database definitions§

The databases presented below are pre-defined in pypath. You can also list them by:

[3]:
from pypath import omnipath
omnipath.db.datasets

executed in 0ms, finished 14:11:32 2022-12-03

[3]:
['omnipath',
 'curated',
 'complex',
 'annotations',
 'intercell',
 'tf_target',
 'dorothea',
 'small_molecule',
 'tf_mirna',
 'mirna_mrna',
 'lncrna_mrna',
 'enz_sub']

Networks§

OmniPath offers multiple built-in network datasets: the OmniPath PPI network, the stricter literature curated PPI network, the special ligand-receptor PPI network and various other PPI datasets, the transcriptional regulation network from DoRothEA and other resources, the miRNA post-transcriptional regulation network, and the transcriptional regulation network of miRNAs.

Strictly literature curated network§

[4]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu

executed in 16.83s, finished 13:17:13 2022-12-02

[4]:
<Network: 7980 nodes, 35551 interactions>

The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions§

[5]:
from pypath import omnipath
op = omnipath.db.get_db('omnipath')
op

executed in 1m, finished 13:18:55 2022-12-02

[5]:
<Network: 18558 nodes, 94358 interactions>

Transcriptional regulation network from DoRothEA and other resources§

Note: according to the default settings, DoRothEA confidence levels A-D and all original resources will be loaded. To load only DoRothEA, use the key "dorothea" instead of "tf_target".

[6]:
from pypath import omnipath
tft = omnipath.db.get_db('tf_target')
tft

executed in 2m 12.72s, finished 13:21:54 2022-12-02

[6]:
<Network: 18986 nodes, 326708 interactions>

Literature curated miRNA post-transcriptional regulation network§

[1]:
from pypath import omnipath
mi = omnipath.db.get_db('mirna_mrna')
mi

executed in 2.28s, finished 13:31:55 2022-12-02

[1]:
<Network: 1264 nodes, 3288 interactions>

Transcriptional regulation of miRNA§

[4]:
from pypath import omnipath
tmi = omnipath.db.get_db('tf_mirna')
tmi

executed in 0ms, finished 13:32:41 2022-12-02

[4]:
<Network: 1032 nodes, 4960 interactions>

lncRNA-mRNA interactions§

[6]:
from pypath import omnipath
lnc = omnipath.db.get_db('lncrna_mrna')
lnc

executed in 0ms, finished 13:33:03 2022-12-02

[6]:
<Network: 243 nodes, 217 interactions>

Small molecule-protein interactions§

These interactions are either ligand-receptor connections, enzyme inhibitions, allosteric regulations or enzyme-metabolite interactions. Currently it is a small, experimental dataset, but it will be substantially extended in the future.

[1]:
from pypath import omnipath
smol = omnipath.db.get_db('small_molecule')
smol

executed in 7.94s, finished 13:57:17 2022-12-02

[1]:
<Network: 1980 nodes, 3147 interactions>

Enzyme-substrate relationships§

[7]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')
es

executed in 6.14s, finished 13:33:26 2022-12-02

[7]:
<Enzyme-substrate database: 41426 relationships>

Protein complexes§

[8]:
from pypath import omnipath
co = omnipath.db.get_db('complex')
co

executed in 0ms, finished 13:33:31 2022-12-02

[8]:
<Complex database: 28173 complexes>

Annotations§

The annotations database is huge: building or even loading it takes a long time and requires a considerable amount of memory.

[9]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an

executed in 2m 43.60s, finished 13:36:28 2022-12-02

[9]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>

Inter-cellular communication roles§

This database is quick to build, but it requires the annotations database, which is slow to load and memory-intensive.

[10]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

executed in 23.34s, finished 13:37:12 2022-12-02

[10]:
<Intercell annotations: 301527 records about 48570 entities>

Data directly from the original resources§

The pypath.inputs module contains clients for more than 150 molecular biology and biomedical resources, with almost 500 functions in total that download data directly from these resources. Maintaining such a large number of clients is laborious, hence at any time some of them are broken; you can check their state in our daily status report.

Each submodule of pypath.inputs is named after its corresponding resource, all lowercase, e.g. “depod” (DEPOD) or “cytosig” (CytoSig). Within these modules each function name starts with the name of the resource and ends with the kind of data it retrieves. For example, pypath.inputs.signor.signor_interactions downloads interactions from SIGNOR. The suffixes “_interactions”, “_enz_sub”, “_complexes” and “_annotations” mark functions that retrieve records intended for the respective databases. However, the records at this stage are not fully processed yet. Some functions have other suffixes, e.g. “_raw” means the data is close to the format provided by the resource itself, and “_mapping” means it is intended for an ID translation table.

The purpose of the input functions is to 1) handle the download; 2) read the raw data; 3) extract the relevant parts; 4) do the resource-specific part of the processing, i.e. bring the data to a state where it is suitable for further processing by the generic database classes. The output of these functions is not standardized in any way, though you may observe recurring patterns. The input functions typically return lists or dictionaries. These are designed case by case, with the aim of selecting the relevant fields and providing straightforward, accessible Python data structures for processing within or outside of pypath.

We use SIGNOR as an example because this resource provides data for almost all OmniPath databases. The signor_complexes function returns a dict of pypath.internals.intera.Complex objects keyed by their identifiers, ready to be added to the OmniPath complexes database (built by pypath.core.complex.ComplexAggregator).

[2]:
from pypath.inputs import signor
signor.signor_complexes()

executed in 0ms, finished 15:24:43 2022-12-03

[2]:
{'COMPLEX:P23511_P25208_Q13952': Complex NFY: COMPLEX:P23511_P25208_Q13952,
 'COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4': Complex mTORC2: COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4,
 'COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4': Complex mTORC1: COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4,
 'COMPLEX:P63208_Q13616_Q9Y297': Complex SCF-betaTRCP: COMPLEX:P63208_Q13616_Q9Y297,
 'COMPLEX:Q09472_Q92793': Complex CBP/p300: COMPLEX:Q09472_Q92793,
 'COMPLEX:Q09472_Q92793_Q92831': Complex P300/PCAF: COMPLEX:Q09472_Q92793_Q92831,
 'COMPLEX:Q13485_Q15796': Complex SMAD2/SMAD4: COMPLEX:Q13485_Q15796,
 'COMPLEX:P84022_Q13485': Complex SMAD3/SMAD4: COMPLEX:P84022_Q13485,
 'COMPLEX:P05412_Q13485': Complex SMAD4/JUN: COMPLEX:P05412_Q13485,
 'COMPLEX:Q15796_Q9HAU4': Complex SMAD2/SMURF2: COMPLEX:Q15796_Q9HAU4,
 'COMPLEX:O15105_Q01094_Q13547': Complex SMAD7/HDAC1/E2F-1: COMPLEX:O15105_Q01094_Q13547,
 'COMPLEX:P19838_Q04206': Complex NfKb-p65/p50: COMPLEX:P19838_Q04206,
 'COMPLEX:O14920_O15111': Complex IK
Output truncated: showing 1000 of 17699 characters

The signor_interactions function returns a list of named tuples that represent the most important properties of SIGNOR interaction records in a human-readable way, ready to be processed by the pypath.core.network.Network object.

[5]:
signor.signor_interactions()[:10]

executed in 0ms, finished 14:11:52 2022-12-03

[5]:
[SignorInteraction(source='O15530', target='O15530', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='unknown', mechanism='phosphorylation', ncbi_tax_id='9606', pubmeds='10455013', direct=True, ptm_type='phosphorylation', ptm_residue='Ser396', ptm_motif='SSSSSSHsLSASDTG'),
 SignorInteraction(source='Q9NQ66', target='CHEBI:18035', source_isoform=None, target_isoform=None, source_type='protein', target_type='smallmolecule', effect='up-regulates quantity', mechanism='', ncbi_tax_id='-1', pubmeds='23880553', direct=True, ptm_type='', ptm_residue='Small molecule catalysis', ptm_motif=''),
 SignorInteraction(source='P62136', target='O15169', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='down-regulates activity', mechanism='dephosphorylation', ncbi_tax_id='9606', pubmeds='17318175', direct=True, ptm_type='dephosphorylation', ptm_residue='Ser77', ptm_motif='YEPEGSAsPTPPYLK'),
 SignorInteraction(sou
Output truncated: showing 1000 of 3285 characters
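
Because each record is a named tuple, its fields can be addressed by name when filtering. The sketch below is a minimal illustration that defines a local stand-in for the record type with only a subset of the fields, and copies three records from the output above; with pypath installed, the list would come from signor.signor_interactions() instead.

```python
from collections import namedtuple

# Local stand-in with a subset of the fields shown in the output above.
SignorInteraction = namedtuple(
    'SignorInteraction',
    ['source', 'target', 'target_type', 'effect', 'mechanism'],
)

records = [
    SignorInteraction(
        'O15530', 'O15530', 'protein', 'unknown', 'phosphorylation',
    ),
    SignorInteraction(
        'Q9NQ66', 'CHEBI:18035', 'smallmolecule', 'up-regulates quantity', '',
    ),
    SignorInteraction(
        'P62136', 'O15169', 'protein', 'down-regulates activity',
        'dephosphorylation',
    ),
]

# Keep only the interactions whose target is a protein.
protein_targets = [r for r in records if r.target_type == 'protein']

# Self interactions, i.e. records where source and target are the same.
self_loops = [r for r in records if r.source == r.target]
```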

Note that the records above also contain enzyme-PTM data; the signor.signor_enzyme_substrate function only converts them to an intermediate format that is easier for pypath.core.enz_sub.EnzymeSubstrateAggregator to process.

[4]:
signor.signor_enzyme_substrate()[:2]

executed in 0ms, finished 13:58:20 2022-12-02

[4]:
[{'typ': 'phosphorylation',
  'resnum': 396,
  'instance': 'SSSSSSHSLSASDTG',
  'substrate': 'O15530',
  'start': 389,
  'end': 403,
  'kinase': 'O15530',
  'resaa': 'S',
  'motif': 'SSSSSSHSLSASDTG',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'10455013'}},
 {'typ': 'dephosphorylation',
  'resnum': 77,
  'instance': 'YEPEGSASPTPPYLK',
  'substrate': 'O15169',
  'start': 70,
  'end': 84,
  'kinase': 'P62136',
  'resaa': 'S',
  'motif': 'YEPEGSASPTPPYLK',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'17318175'}}]
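
The fields of these dicts are related to each other: in the two records shown, start and end delimit the sequence window stored in instance, and resnum points to the modified residue within that window. The following sketch checks this on the two records, copied from the output above with a subset of their keys. The helper modified_residue is a hypothetical name introduced here for illustration, and the relationship is inferred from these two records only.

```python
# Two records copied from the output above (subset of the keys).
records = [
    {
        'typ': 'phosphorylation',
        'resnum': 396,
        'instance': 'SSSSSSHSLSASDTG',
        'start': 389,
        'end': 403,
        'resaa': 'S',
    },
    {
        'typ': 'dephosphorylation',
        'resnum': 77,
        'instance': 'YEPEGSASPTPPYLK',
        'start': 70,
        'end': 84,
        'resaa': 'S',
    },
]


def modified_residue(rec):
    """Return the residue of the window located at sequence position resnum."""
    # Sequence coordinates are converted to an offset within the window.
    offset = rec['resnum'] - rec['start']
    return rec['instance'][offset]


# The window length agrees with the start and end coordinates.
lengths_ok = [
    rec['end'] - rec['start'] + 1 == len(rec['instance'])
    for rec in records
]

# The residue found at resnum agrees with the resaa field.
residues_ok = [modified_residue(rec) == rec['resaa'] for rec in records]
```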

Finally, SIGNOR also assigns proteins to pathways. This information is intended for the OmniPath annotations database, and is retrieved by the signor.signor_pathway_annotations function. This function returns a dict of sets, which is typical for the “_annotations” functions. This format requires practically no further processing.

[5]:
signor.signor_pathway_annotations()['O14733']

executed in 1.48s, finished 13:58:28 2022-12-02

[5]:
{SignorPathway(pathway='TNF alpha'),
 SignorPathway(pathway='Toll like receptors')}
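
A dict of sets keyed by protein is easy to turn around when the question is which proteins belong to a pathway. The sketch below uses a local stand-in for the SignorPathway record and a tiny hand-made dict: the entry for O14733 is copied from the output above, while EXAMPLE_PROTEIN is a made-up placeholder key added only to make the inversion visible.

```python
from collections import defaultdict, namedtuple

# Local stand-in for the record type shown in the output above.
SignorPathway = namedtuple('SignorPathway', ['pathway'])

annotations = {
    'O14733': {
        SignorPathway(pathway='TNF alpha'),
        SignorPathway(pathway='Toll like receptors'),
    },
    # Made-up placeholder entry, not a real UniProt ID.
    'EXAMPLE_PROTEIN': {SignorPathway(pathway='TNF alpha')},
}

# Invert the mapping: pathway name -> set of protein identifiers.
members = defaultdict(set)

for protein, records in annotations.items():
    for record in records:
        members[record.pathway].add(protein)
```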

We haven’t mentioned all functions in the inputs.signor module. The rest of the functions retrieve additional information needed by the four functions above, and are of limited direct use for users. For example, signor_protein_families returns a dict with the internal IDs and members of protein families; this data is used to process the interactions and complexes, but is not too interesting on its own.

[6]:
signor.signor_protein_families()['SIGNOR-PF2']

executed in 0ms, finished 13:58:53 2022-12-02

[6]:
['Q9HBW0', 'Q92633', 'Q9UBY5']

Interesting resources§

Here we showcase a few potentially useful features in pypath.inputs.

RaMP§

RaMP is a human metabolite and metabolic network database providing ID translation, annotations and enzymatic reactions of metabolites. Let’s first take a closer look at the full database contents. RaMP is available as a MySQL database; below we list its tables and their column names:

[6]:
from pypath.inputs import ramp
ramp.ramp_list_tables()

executed in 2.20s, finished 16:51:14 2023-03-09

[6]:
{'analyte': ['rampId', 'type'],
 'analytehasontology': ['rampCompoundId', 'rampOntologyId'],
 'analytehaspathway': ['rampId', 'pathwayRampId', 'pathwaySource'],
 'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
 'catalyzed': ['rampCompoundId', 'rampGeneId'],
 'chem_props': ['ramp_id',
  'chem_data_source',
  'chem_source_id',
  'iso_smiles',
  'inchi_key_prefix',
  'inchi_key',
  'inchi',
  'mw',
  'monoisotop_mass',
  'common_name',
  'mol_formula'],
 'db_version': ['ramp_version',
  'load_timestamp',
  'version_notes',
  'met_intersects_json',
  'gene_intersects_json',
  'met_intersects_json_pw_mapped',
  'gene_intersects_json_pw_mapped',
  'db_sql_url'],
 'entity_status_info': ['status_category',
  'entity_source_id',
  'entity_source_name',
  'entity_count'],
 'metabolite_class': ['ramp_id',
  'class_source_id',
  'class_level_name',
  'class_name',
  'source'],
 'ontology': ['rampOntologyId', 'commonName', 'HMDBOntologyType', 'metCount'],
 'pathway': ['pathwayR
Output truncated: showing 1000 of 1368 characters

Using the ramp_raw function, we can access these tables as Python dicts, as pandas.DataFrames, or loaded into an SQLite database. For further inspection, the data frames are the most convenient.

Note: The very first time, retrieving these tables takes quite some time, not only due to the large download, but also because of a performance bottleneck in processing the MySQL dumps. Thanks to caching, subsequent loading of the tables is much faster.

Most of the ID translation data is contained in the source table:

[8]:
tables = ramp.ramp_raw(['analytesynonym', 'chem_props', 'source'])
tables['source']

executed in 4.25s, finished 16:54:17 2023-03-09

[8]:
sourceId rampId IDtype geneOrCompound commonName priorityHMDBStatus dataSource pathwayCount
0 hmdb:HMDB0000001 RAMP_C_000000001 hmdb compound 1-Methylhistidine quantified hmdb 2
1 hmdb:HMDB0000479 RAMP_C_000000001 hmdb compound 3-Methylhistidine quantified hmdb 2
2 chebi:50599 RAMP_C_000000001 chebi compound 1-Methylhistidine quantified hmdb 2
3 chemspider:83153 RAMP_C_000000001 chemspider compound 1-Methylhistidine quantified hmdb 2
4 kegg:C01152 RAMP_C_000000001 kegg compound 1-Methylhistidine quantified hmdb_kegg 2
... ... ... ... ... ... ... ... ...
756552 uniprot:H0YDB7 RAMP_G_000009307 uniprot gene RAB38 NULL wiki 10
756553 uniprot:A0A024R191 RAMP_G_000009307 uniprot gene RAB38 NULL wiki 10
756554 uniprot:H0YEA4 RAMP_G_000009307 uniprot gene RAB38 NULL wiki 10
756555 entrez:23682 RAMP_G_000009307 entrez gene RAB38 NULL wiki 10
756556 gene_symbol:RAB38 RAMP_G_000009307 gene_symbol gene RAB38 NULL wiki 10

756557 rows × 8 columns

Structural and physicochemical info is available in the chem_props table:

[10]:
tables['chem_props']

executed in 0ms, finished 17:00:46 2023-03-09

[10]:
ramp_id chem_data_source chem_source_id iso_smiles inchi_key_prefix inchi_key inchi mw monoisotop_mass common_name mol_formula
0 RAMP_C_000000001 hmdb hmdb:HMDB0000001 [H]OC(=O)[C@@]([H])(N([H])[H])C([H])([H])C1=C(... BRMWTNUJHUMWMS BRMWTNUJHUMWMS-LURJTMIESA-N InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11... 169.181 169.085 1-Methylhistidine C7H11N3O2
1 RAMP_C_000000001 hmdb hmdb:HMDB0000479 [H][C@](N)(CC1=CN=CN1C)C(O)=O JDHILDINMRGULE JDHILDINMRGULE-LURJTMIESA-N InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11... 169.181 169.085 3-Methylhistidine C7H11N3O2
2 RAMP_C_000000001 chebi chebi:27596 Cn1cncc1C[C@H](N)C(O)=O JDHILDINMRGULE JDHILDINMRGULE-LURJTMIESA-N InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11... NULL 169.085 N(pros)-methyl-L-histidine C7H11N3O2
3 RAMP_C_000000001 chebi chebi:50599 Cn1cnc(C[C@H](N)C(O)=O)c1 BRMWTNUJHUMWMS BRMWTNUJHUMWMS-LURJTMIESA-N InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11... NULL 169.085 N(tele)-methyl-L-histidine C7H11N3O2
4 RAMP_C_000000002 hmdb hmdb:HMDB0000002 NCCCN XFNJVJPLKCPIBV XFNJVJPLKCPIBV-UHFFFAOYSA-N InChI=1S/C3H10N2/c4-2-1-3-5/h1-5H2 74.1249 74.0844 1,3-Diaminopropane C3H10N2
... ... ... ... ... ... ... ... ... ... ... ...
275898 RAMP_C_000258279 lipidmaps LIPIDMAPS:LMPK15050003 C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=C(O... UXLMJHNFDRMGPW UXLMJHNFDRMGPW-LJQANCHMSA-N InChI=1S/C24H38O6/c1-4-5-6-7-8-9-10-11-12-13-1... NULL 422.267 2-hydroxy-5-methoxy-3-(2R-acetoxy-pentadecyl)-... C24H38O6
275899 RAMP_C_000258280 lipidmaps LIPIDMAPS:LMPK15050004 C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=CC(... CVZNKLNAHBTINT CVZNKLNAHBTINT-JOCHJYFZSA-N InChI=1S/C24H38O5/c1-4-5-6-7-8-9-10-11-12-13-1... NULL 406.272 5-methoxy-3-(2R-acetoxy-pentadecyl)-1,4-benzoq... C24H38O5
275900 RAMP_C_000226089 lipidmaps LIPIDMAPS:LMPK15050005 C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCC)=CC(=O... JIUGZSYPFREDLG JIUGZSYPFREDLG-HXUWFJFHSA-N InChI=1S/C22H34O5/c1-4-5-6-7-8-9-10-11-12-13-2... NULL 378.241 5-methoxy-3-(2R-acetoxy-tridecyl)-1,4-benzoqui... C22H34O5
275901 RAMP_C_000258283 lipidmaps LIPIDMAPS:LMPK15050008 C1(O)C(=O)C(CCCCCCCCCCCCCCC)=C(O)C(=O)C=1 GXDURRGUXLDZKN GXDURRGUXLDZKN-UHFFFAOYSA-N InChI=1S/C21H34O4/c1-2-3-4-5-6-7-8-9-10-11-12-... NULL 350.246 Suberonone C21H34O4
275902 RAMP_C_000258284 lipidmaps LIPIDMAPS:LMPK15050009 C1(O)C(=O)C(CCCCCCCCCCCCC)=C(O)C(=O)C=1 AMKNOBHCKRZHIO AMKNOBHCKRZHIO-UHFFFAOYSA-N InChI=1S/C19H30O4/c1-2-3-4-5-6-7-8-9-10-11-12-... NULL 322.214 Rapanone C19H30O4

275903 rows × 11 columns

Raw RaMP data can also be accessed as an SQLite database. The advantages here are the high performance and the flexibility of operations. Conversion to pandas and back is really easy, so you can always have the result in a data frame. Below, con is a database connection ready to execute your queries. It is an in-memory database; alternatively, an on-disk database can be used. We use pypath.formats.sqlite to look into the SQLite database.

[11]:
con = ramp.ramp_raw(['source', 'chem_props', 'analytesynonym'], sqlite = True)
con

executed in 10.56s, finished 17:07:00 2023-03-09

[11]:
<sqlite3.Connection at 0x6fa1e9e4e940>

Now that we have loaded these 3 big tables both as data frames and as SQLite tables, let’s see how much memory they use (normally only one of the two forms is needed, which takes about half as much, and the tables should stay in memory only for short periods):

[13]:
from pypath.share import common
common.format_bytes(common.python_memory_usage())

executed in 0ms, finished 17:07:44 2023-03-09

[13]:
'3.7 GB'

Looking into the database, we see the 3 tables loaded, and their column names:

[19]:
from pypath.formats import sqlite
sqlite.list_columns(con)

executed in 0ms, finished 17:13:01 2023-03-09

[19]:
{'source': ['sourceId',
  'rampId',
  'IDtype',
  'geneOrCompound',
  'commonName',
  'priorityHMDBStatus',
  'dataSource',
  'pathwayCount'],
 'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
 'chem_props': ['ramp_id',
  'chem_data_source',
  'chem_source_id',
  'iso_smiles',
  'inchi_key_prefix',
  'inchi_key',
  'inchi',
  'mw',
  'monoisotop_mass',
  'common_name',
  'mol_formula']}

Let’s see how to execute an SQL query and fetch the output into a data frame. This query takes the source table, selects the records with HMDB and ChEBI IDs in two subqueries, and joins the two by rampId, in order to obtain an HMDB ←→ ChEBI mapping table:

[22]:
import pandas as pd

query = (
    'SELECT DISTINCT a.sourceId as hmdb, b.sourceId as chebi '
    'FROM '
    '   (SELECT sourceId, rampId '
    '    FROM source '
    '   WHERE geneOrCompound = "compound" AND IDtype = "hmdb") a '
    'JOIN '
    '   (SELECT sourceId, rampId '
    '    FROM source '
    '   WHERE geneOrCompound = "compound" AND IDtype = "chebi") b '
    'ON a.rampId = b.rampId;'
)
df = pd.read_sql_query(query, con)
df

executed in 1ms, finished 17:18:37 2023-03-09

[22]:
hmdb chebi
0 hmdb:HMDB0000001 chebi:27596
1 hmdb:HMDB0000001 chebi:50599
2 hmdb:HMDB0000479 chebi:27596
3 hmdb:HMDB0000479 chebi:50599
4 hmdb:HMDB00001 chebi:27596
... ... ...
104129 hmdb:HMDB0126033 chebi:25882
104130 hmdb:HMDB0141947 chebi:180150
104131 hmdb:HMDB0128505 chebi:7870
104132 hmdb:HMDB0130984 chebi:8227
104133 hmdb:HMDB0130987 chebi:8630

104134 rows × 2 columns
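
The same join can be tried without downloading RaMP, on a miniature copy of the source table. The sketch below uses only the standard library sqlite3 module; the table contains four compound rows taken from the outputs above plus one gene row, and the connection name toy_con is chosen here to avoid confusion with con. The query is written as a self-join instead of two subqueries, and uses single quotes for the string literals, which is the standard SQL form.

```python
import sqlite3

# In-memory database holding a miniature version of the `source` table.
toy_con = sqlite3.connect(':memory:')
toy_con.execute(
    'CREATE TABLE source '
    '(sourceId TEXT, rampId TEXT, IDtype TEXT, geneOrCompound TEXT)'
)

rows = [
    ('hmdb:HMDB0000001', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('hmdb:HMDB0000479', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('chebi:50599', 'RAMP_C_000000001', 'chebi', 'compound'),
    ('chebi:27596', 'RAMP_C_000000001', 'chebi', 'compound'),
    ('entrez:23682', 'RAMP_G_000009307', 'entrez', 'gene'),
]
toy_con.executemany('INSERT INTO source VALUES (?, ?, ?, ?)', rows)

# Self-join on rampId: HMDB records on one side, ChEBI records on the other.
query = """
    SELECT DISTINCT a.sourceId AS hmdb, b.sourceId AS chebi
    FROM source a
    JOIN source b ON a.rampId = b.rampId
    WHERE a.geneOrCompound = 'compound' AND a.IDtype = 'hmdb'
      AND b.geneOrCompound = 'compound' AND b.IDtype = 'chebi'
    ORDER BY hmdb, chebi
"""

pairs = toy_con.execute(query).fetchall()
toy_con.close()
```

Here pairs is a plain list of tuples; passing the same query and an open connection to pandas.read_sql_query would give a data frame, as in the cell above.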

Such mapping tables can be easily accessed for any pairs of identifiers by the ramp_mapping function. Before that, let’s see the complete list of supported ID types:

[24]:
ramp.ramp_id_types()

executed in 4.45s, finished 17:23:09 2023-03-09

[24]:
{'CAS',
 'EN',
 'LIPIDMAPS',
 'brenda',
 'chebi',
 'chemspider',
 'ensembl',
 'entrez',
 'gene_symbol',
 'hmdb',
 'kegg',
 'kegg_glycan',
 'lipidbank',
 'ncbiprotein',
 'plantfa',
 'pubchem',
 'swisslipids',
 'uniprot',
 'wikidata'}

[31]:
ramp.ramp_mapping('LIPIDMAPS', 'swisslipids')

executed in 4.94s, finished 17:29:17 2023-03-09

[31]:
{'LMFA00000008': {'SLM:000390048'},
 'LMFA01010001': {'SLM:000000510'},
 'LMFA01010002': {'SLM:000000449'},
 'LMFA01010003': {'SLM:000001194'},
 'LMFA01010004': {'SLM:000001195'},
 'LMFA01010005': {'SLM:000389552'},
 'LMFA01010006': {'SLM:000001196'},
 'LMFA01010007': {'SLM:000389947'},
 'LMFA01010008': {'SLM:000000853'},
 'LMFA01010010': {'SLM:000000855'},
 'LMFA01010011': {'SLM:000389946'},
 'LMFA01010012': {'SLM:000000719'},
 'LMFA01010013': {'SLM:000001198'},
 'LMFA01010014': {'SLM:000000825'},
 'LMFA01010015': {'SLM:000001199'},
 'LMFA01010017': {'SLM:000001095'},
 'LMFA01010019': {'SLM:000001205'},
 'LMFA01010020': {'SLM:000000829'},
 'LMFA01010021': {'SLM:000001207'},
 'LMFA01010022': {'SLM:000000827'},
 'LMFA01010023': {'SLM:000001128'},
 'LMFA01010024': {'SLM:000000414'},
 'LMFA01010026': {'SLM:000000539'},
 'LMFA01010027': {'SLM:000000980'},
 'LMFA01010028': {'SLM:000000540'},
 'LMFA01010030': {'SLM:000000543'},
 'LMFA01010032': {'SLM:000000544'},
 'LMFA01010034': {'SLM:00000
Output truncated: showing 1000 of 44684 characters

Above we got a dict of sets; alternatively, a data frame can be returned:

[32]:
ramp.ramp_mapping('LIPIDMAPS', 'swisslipids', return_df = True)

executed in 4.63s, finished 17:30:27 2023-03-09

[32]:
id_type_a id_type_b
0 LMST02030086 SLM:000485328
1 LMST02030087 SLM:000485330
2 LMSP06020013 SLM:000000534
3 LMST02020001 SLM:000001055
4 LMST02020001 SLM:000485315
... ... ...
35218 LMPR0104010007 SLM:000389242
35219 LMPR0104030005 SLM:000390232
35220 LMPR0104030006 SLM:000390227
35221 LMPR01070626 SLM:000000432
35222 LMPR01090015 SLM:000389419

35223 rows × 2 columns
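
The two output formats carry the same information: each key of the dict of sets, paired with each member of its set, gives one row of the two-column table. A minimal sketch of this correspondence in plain Python follows, with three ID pairs copied from the data frame above; the variable names are chosen here for illustration.

```python
from collections import defaultdict

# A small dict of sets; the IDs are copied from the data frame above.
mapping_dict = {
    'LMST02030086': {'SLM:000485328'},
    'LMST02020001': {'SLM:000001055', 'SLM:000485315'},
}

# Long format: one (id_type_a, id_type_b) pair per row.
pairs = sorted(
    (id_a, id_b)
    for id_a, ids_b in mapping_dict.items()
    for id_b in ids_b
)

# And back again: group the pairs by their first element.
rebuilt = defaultdict(set)

for id_a, id_b in pairs:
    rebuilt[id_a].add(id_b)
```

A one-to-many translation such as LMST02020001 appears as a set with two members in the dict, and as two rows in the table.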

RaMP ID translation is also integrated into the higher level APIs in pypath.utils.mapping. Below, we first look into the available ID types and translation tables:

[34]:
from pypath.utils import mapping
m = mapping.get_mapper()
m.id_types()

executed in 0ms, finished 17:38:25 2023-03-09

[34]:
{IdType(pypath='CAS', original='CAS'),
 IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
 IdType(pypath='MedChemExpress', original='MedChemExpress'),
 IdType(pypath='actor', original='actor'),
 IdType(pypath='affy', original='affy'),
 IdType(pypath='affymetrix', original='affymetrix'),
 IdType(pypath='agilent', original='agilent'),
 IdType(pypath='alzforum', original='Alzforum_mut'),
 IdType(pypath='araport', original='Araport'),
 IdType(pypath='atlas', original='atlas'),
 IdType(pypath='bindingdb', original='bindingdb'),
 IdType(pypath='brenda', original='brenda'),
 IdType(pypath='carotenoiddb', original='carotenoiddb'),
 IdType(pypath='cas', original='CAS'),
 IdType(pypath='cas_id', original='CAS'),
 IdType(pypath='cgnc', original='CGNC'),
 IdType(pypath='chebi', original='chebi'),
 IdType(pypath='chembl', original='chembl'),
 IdType(pypath='chemicalbook', original='chemicalbook'),
 IdType(pypath='chemspider', original='chemspider'),
 IdType(pypath='clinicaltrials', original='clinic
Output truncated: showing 1000 of 7422 characters

These ID types come not only from RaMP, but from all the supported resources. In the RaMP mapping table definitions id_type_b is always None, because translation between any two of its ID types is supported:

[35]:
[t for t in m.mapping_tables() if t.resource == 'ramp']

executed in 0ms, finished 17:46:56 2023-03-09

[35]:
[MappingTableDefinition(id_type_a='kegg_glycan', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg_glycan', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='hmdb', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='hmdb', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='wikidata', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='wikidata', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='LIPIDMAPS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='LIPIDMAPS', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='kegg', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='CAS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='CAS', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='chebi
Output truncated: showing 1000 of 3238 characters

TL;DR§

Everything up to this point in this section provides extra insight, but what 99% of users will actually do looks like this:

[36]:
from pypath.utils import mapping
mapping.map_name('131431', 'chebi', 'hmdb')

executed in 0ms, finished 17:53:38 2023-03-09

[36]:
{'HMDB0094709'}

HMDB (Human Metabolome Database)§

Direct access to HMDB data§

The inputs.hmdb module processes metabolite and protein data using lxml.etree and some minimal utilities from formats.xml. The metabolite or protein records are available as lxml.etree.Element objects, or custom fields can be extracted into dicts or data frames. To iterate through the XML elements, each representing a metabolite:

[1]:
from pypath.inputs import hmdb
next(hmdb.iter_metabolites())

executed in 1ms, finished 12:23:11 2023-04-24

[1]:
<Element {http://www.hmdb.ca}metabolite at 0x60b1846262c0>

On the Element objects you can directly use lxml.etree’s methods to extract information. An easier and more flexible way to extract information from these XML records is to define a schema with instructions for lxml. A full schema for HMDB metabolites is available in hmdb.METABOLITES_SCHEMA:

[2]:
hmdb.METABOLITES_SCHEMA

executed in 0ms, finished 12:24:03 2023-04-24

[2]:
{'taxonomy': ('taxonomy',
  {'description': ('description', None),
   'direct_parent': ('direct_parent', None),
   'kingdom': ('kingdom', None),
   'class': ('class', None),
   'sub_class': ('sub_class', None),
   'molecular_framework': ('molecular_framework', None),
   'alternative_parents': ('alternative_parents',
    ('alternative_parent', 'findall'),
    None),
   'substituents': ('substituents', ('substituent', 'findall'), None)}),
 'spectra': ('spectra', ('spectrum', 'findall'), {'spectrum_id', 'type'}),
 'biological_properties': ('biological_properties',
  {'cellular_locations': ('cellular_locations', ('cellular', 'findall'), None),
   'biospecimen_locations': ('biospecimen_locations',
    ('biospecimen', 'findall'),
    None),
   'tissue_locations': ('tissue_locations', ('tissue', 'findall'), None),
   'pathways': ('pathways',
    ('pathway', 'findall'),
    {'kegg_map_id', 'name', 'smpdb_id'})}),
 'experimental_properties': ('experimental_properties',
  ('property', 'findall')
Output truncated: showing 1000 of 4037 characters

The schema for proteins is different:

[3]:
hmdb.PROTEINS_SCHEMA

executed in 0ms, finished 12:24:52 2023-04-24

[3]:
{'gene_properties': ('gene_properties',
  {'chromosome_location': ('chromosome_location', None),
   'locus': ('locus', None),
   'gene_sequence': ('gene_sequence', None)}),
 'protein_properties': ('protein_properties',
  {'residue_number': ('residue_number', None),
   'molecular_weight': ('molecular_weight', None),
   'theoretical_pi': ('theoretical_pi', None),
   'polypeptide_sequence': ('polypeptide_sequence', None),
   'transmembrane_regions': ('transmembrane_regions',
    ('region', 'findall'),
    None),
   'signal_regions': ('signal_regions', ('region', 'findall'), None)}),
 'pfams': ('pfams', ('pfam', 'findall'), {'name', 'pfam_id'}),
 'metabolite_associations': ('metabolite_associations',
  ('metabolite', 'findall'),
  {'accession', 'name'}),
 'go_classifications': ('go_classifications',
  ('go_class', 'findall'),
  {'category', 'description', 'go_id'}),
 'pathways': ('pathways',
  ('pathway', 'findall'),
  {'kegg_map_id', 'name', 'smpdb_id'}),
 'general_references': ('general_
Output truncated: showing 1000 of 2072 characters

By default, hmdb.metabolites_raw and hmdb.proteins_raw use the full schema, but you can pass a smaller dict with only your fields of interest, which greatly reduces the processing time. With the head argument we peek into the first N records of the data:

[4]:
list(hmdb.metabolites_raw(head = 3))

executed in 0ms, finished 12:25:31 2023-04-24

[4]:
[{'taxonomy': {'description': ' belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom.',
   'direct_parent': 'Histidine and derivatives',
   'kingdom': 'Organic compounds',
   'class': 'Carboxylic acids and derivatives',
   'sub_class': 'Amino acids, peptides, and analogues',
   'molecular_framework': 'Aromatic heteromonocyclic compounds',
   'alternative_parents': ['Amino acids',
    'Aralkylamines',
    'Azacyclic compounds',
    'Carbonyl compounds',
    'Carboxylic acids',
    'Heteroaromatic compounds',
    'Hydrocarbon derivatives',
    'Imidazolyl carboxylic acids and derivatives',
    'L-alpha-amino acids',
    'Monoalkylamines',
    'Monocarboxylic acids and derivatives',
    'N-substituted imidazoles',
    'Organic oxides',

Output truncated: showing 1000 of 132354 characters
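To make the relation between the schema and the nested output more tangible, here is a toy interpreter for schema entries of the shapes shown above. It is a sketch under assumptions, not pypath's implementation: the functions extract and _apply are hypothetical, the meaning of each entry shape is inferred from the printed schema and output, and the record is hand-written, without namespace and with placeholder values.

```python
import xml.etree.ElementTree as ET

def extract(elem, spec):
    """Apply one schema entry, e.g. ('kingdom', None), to an element."""
    tag, *steps = spec
    node = elem.find(tag)
    return None if node is None else _apply(node, steps)

def _apply(node, steps):
    if not steps or steps[0] is None:
        # No further instruction: take the text of the element
        return node.text
    step, *rest = steps
    if isinstance(step, dict):
        # Nested schema: one key per sub-field
        return {key: extract(node, sub) for key, sub in step.items()}
    if isinstance(step, set):
        # Set of child tags: collect their text into a dict
        return {tag: node.findtext(tag) for tag in sorted(step)}
    # (child_tag, 'findall'): repeat the remaining steps for each child
    child_tag, method = step
    return [_apply(child, rest) for child in getattr(node, method)(child_tag)]

# Hand-written record with placeholder values (assumption)
RECORD = ET.fromstring('''
<metabolite>
  <taxonomy>
    <kingdom>Organic compounds</kingdom>
    <alternative_parents>
      <alternative_parent>Amino acids</alternative_parent>
      <alternative_parent>Aralkylamines</alternative_parent>
    </alternative_parents>
  </taxonomy>
  <pathways>
    <pathway><name>Example pathway</name><smpdb_id>SMP_EXAMPLE</smpdb_id></pathway>
  </pathways>
</metabolite>
'''.strip())

# A reduced schema in the same notation as the one printed above
SCHEMA = {
    'taxonomy': ('taxonomy', {
        'kingdom': ('kingdom', None),
        'alternative_parents': (
            'alternative_parents',
            ('alternative_parent', 'findall'),
            None,
        ),
    }),
    'pathways': ('pathways', ('pathway', 'findall'), {'name', 'smpdb_id'}),
}

result = {key: extract(RECORD, spec) for key, spec in SCHEMA.items()}
print(result)
```

Under these assumptions, a dict in the schema becomes a dict in the output, a findall step becomes a list, and None stands for the text of the element.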

The returned nested dict corresponds to the schema. Another example with a schema that extracts only the accession and name fields:

[6]:
list(hmdb.metabolites_raw(
    schema = {
        'accession': hmdb.METABOLITES_SCHEMA['accession'],
        'name': hmdb.METABOLITES_SCHEMA['name'],
    },
    head = 20,
))

executed in 0ms, finished 12:25:55 2023-04-24

[6]:
[{'accession': 'HMDB0000001', 'name': '1-Methylhistidine'},
 {'accession': 'HMDB0000002', 'name': '1,3-Diaminopropane'},
 {'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
 {'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'},
 {'accession': 'HMDB0000010', 'name': '2-Methoxyestrone'},
 {'accession': 'HMDB0000011', 'name': '3-Hydroxybutyric acid'},
 {'accession': 'HMDB0000012', 'name': 'Deoxyuridine'},
 {'accession': 'HMDB0000014', 'name': 'Deoxycytidine'},
 {'accession': 'HMDB0000015', 'name': 'Cortexolone'},
 {'accession': 'HMDB0000016', 'name': 'Deoxycorticosterone'},
 {'accession': 'HMDB0000017', 'name': '4-Pyridoxic acid'},
 {'accession': 'HMDB0000019', 'name': 'alpha-Ketoisovaleric acid'},
 {'accession': 'HMDB0000020', 'name': 'p-Hydroxyphenylacetic acid'},
 {'accession': 'HMDB0000021', 'name': 'Iodotyrosine'},
 {'accession': 'HMDB0000022', 'name': '3-Methoxytyramine'},
 {'accession': 'HMDB0000023', 'name': '(S)-3-Hydroxyisobutyric acid'},
 {'accession': 'HMDB00
Output truncated: showing 1000 of 1291 characters

It works in a similar way for proteins:

[7]:
list(hmdb.proteins_raw(
    schema = {
        'name': hmdb.PROTEINS_SCHEMA['name'],
        'genesymbol': hmdb.PROTEINS_SCHEMA['gene_name'],
    },
    head = 20,
))

executed in 0ms, finished 12:29:23 2023-04-24

[7]:
[{'name': "5'-nucleotidase", 'genesymbol': 'NT5E'},
 {'name': 'Deoxycytidylate deaminase', 'genesymbol': 'DCTD'},
 {'name': 'UMP-CMP kinase', 'genesymbol': 'CMPK1'},
 {'name': "Cytosolic 5'-nucleotidase 1B", 'genesymbol': 'NT5C1B'},
 {'name': "Cytosolic 5'-nucleotidase 1A", 'genesymbol': 'NT5C1A'},
 {'name': "5'(3')-deoxyribonucleotidase, cytosolic type",
  'genesymbol': 'NT5C'},
 {'name': 'Deoxycytidine kinase', 'genesymbol': 'DCK'},
 {'name': "5'(3')-deoxyribonucleotidase, mitochondrial", 'genesymbol': 'NT5M'},
 {'name': 'Hydroxymethylglutaryl-CoA lyase, mitochondrial',
  'genesymbol': 'HMGCL'},
 {'name': 'ATP-citrate synthase', 'genesymbol': 'ACLY'},
 {'name': 'Histone acetyltransferase p300', 'genesymbol': 'EP300'},
 {'name': 'Pyruvate dehydrogenase E1 component subunit beta, mitochondrial',
  'genesymbol': 'PDHB'},
 {'name': 'Acetyl-CoA acetyltransferase, cytosolic', 'genesymbol': 'ACAT2'},
 {'name': 'CREB-binding protein', 'genesymbol': 'CREBBP'},
 {'name': 'Diamine acetyltransfe
Output truncated: showing 1000 of 1478 characters

Higher level access to HMDB data§

With the hmdb.metabolites_table and hmdb.proteins_table functions you can process the records into a pandas data frame. These functions accept a list of positional or named arguments using a simple notation (see their documentation). Instead of the simple notation of tuples, hmdb.Field objects can be used to define the fields; the arguments for Field and the tuples or strings passed directly to hmdb.*_table follow the same format. Let’s extract a data frame with SMILES, InChIKeys and HMDB accessions:

[8]:
hmdb.metabolites_table('accession', 'smiles', 'inchikey', head = 10)

executed in 0ms, finished 12:32:01 2023-04-24

[8]:
    accession    smiles                                             inchikey
0   HMDB0000001  CN1C=NC(C[C@H](N)C(O)=O)=C1                        BRMWTNUJHUMWMS-LURJTMIESA-N
1   HMDB0000002  NCCCN                                              XFNJVJPLKCPIBV-UHFFFAOYSA-N
2   HMDB0000005  CCC(=O)C(O)=O                                      TYEYBOSBBBHJIV-UHFFFAOYSA-N
3   HMDB0000008  CC[C@H](O)C(O)=O                                   AFENDNXGAFYKQO-VKHMYHEASA-N
4   HMDB0000010  [H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[...  WHEUWNKSCXYKBU-QPWUGHHJSA-N
5   HMDB0000011  C[C@@H](O)CC(O)=O                                  WHBMMWSBFZVSSR-GSVOUGTGSA-N
6   HMDB0000012  OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O           MXHRCPNRJAMMIM-SHYZEUOFSA-N
7   HMDB0000014  NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1       CKTSBUTUHBMZGZ-SHYZEUOFSA-N
8   HMDB0000015  [H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(...  WHBHBVVOGNECLV-OBQKJFGGSA-N
9   HMDB0000016  [H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H...  ZESRJSPZRDMNHY-YFWFAHHUSA-N
10  HMDB0000017  CC1=NC=C(CO)C(C(O)=O)=C1O                          HXACOUQIXZGNBF-UHFFFAOYSA-N

The above example is simple, as each field has a simple string value. The synonyms field is an array within each record; below we first process it as an array column, i.e. each row contains an array:

[9]:
hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)

executed in 0ms, finished 12:32:13 2023-04-24

[9]:
    accession    name                   synonyms
0   HMDB0000001  1-Methylhistidine      [(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)pro...
1   HMDB0000002  1,3-Diaminopropane     [1,3-Propanediamine, 1,3-Propylenediamine, Pro...
2   HMDB0000005  2-Ketobutyric acid     [2-Ketobutanoic acid, 2-Oxobutyric acid, 3-Met...
3   HMDB0000008  2-Hydroxybutyric acid  [(S)-2-Hydroxybutanoic acid, 2-Hydroxybutyrate...
4   HMDB0000010  2-Methoxyestrone       [2-(8S,9S,13S,14S)-3-Hydroxy-2-methoxy-13-meth...
5   HMDB0000011  3-Hydroxybutyric acid  [(R)-(-)-beta-Hydroxybutyric acid, (R)-3-Hydro...
6   HMDB0000012  Deoxyuridine           [2-Deoxyuridine, dU, 2'-Deoxyuridine, 1-(2-Deo...
7   HMDB0000014  Deoxycytidine          [4-Amino-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymet...
8   HMDB0000015  Cortexolone            [11-Desoxy-17-hydroxycorticosterone, Cortodoxo...
9   HMDB0000016  Deoxycorticosterone    [21-Hydroxy-4-pregnene-3,20-dione, 21-Hydroxyp...
10  HMDB0000017  4-Pyridoxic acid       [2-Methyl-3-hydroxy-4-carboxy-5-hydroxymethylp...

Each element in the column is an array:

[10]:
hmdb_synonyms = hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)
hmdb_synonyms.synonyms[0]

executed in 0ms, finished 12:32:19 2023-04-24

[10]:
['(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
 'Pi-methylhistidine',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
 '1 Methylhistidine',
 '1-Methyl histidine',
 '1-Methyl-histidine',
 '1-Methyl-L-histidine',
 '1-MHis',
 '1-N-Methyl-L-histidine',
 'L-1-Methylhistidine',
 'N1-Methyl-L-histidine',
 '1-Methylhistidine dihydrochloride',
 '1-Methylhistidine']

Using the @ notation, the arrays can be expanded into multiple rows:

[11]:
hmdb.metabolites_table('accession', 'name', ('synonyms', '@'), head = 10)

executed in 0ms, finished 12:32:25 2023-04-24

[11]:
     accession    name               synonyms
0    HMDB0000001  1-Methylhistidine  (2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop...
1    HMDB0000001  1-Methylhistidine  Pi-methylhistidine
2    HMDB0000001  1-Methylhistidine  (2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop...
3    HMDB0000001  1-Methylhistidine  1 Methylhistidine
4    HMDB0000001  1-Methylhistidine  1-Methyl histidine
...  ...          ...                ...
291  HMDB0000017  4-Pyridoxic acid   3-Hydroxy-5-hydroxymethyl-2-methyl-isonicotins...
292  HMDB0000017  4-Pyridoxic acid   4 Pyridoxinic acid
293  HMDB0000017  4-Pyridoxic acid   Pyridoxinecarboxylic acid
294  HMDB0000017  4-Pyridoxic acid   4 Pyridoxylic acid
295  HMDB0000017  4-Pyridoxic acid   4 Pyridoxic acid

296 rows × 3 columns

This already resulted in almost 300 rows: be careful when using @ for multiple columns, as it yields rows in a combinatorial way, and the resulting data frames can easily grow huge. Another notation is *, which means extracting all elements of a dict into multiple columns. Below we apply it to the taxonomy column, which is a dict of multiple fields:

[12]:
hmdb.metabolites_table('accession', 'name', ('taxonomy', '*'), head = 10)

executed in 0ms, finished 12:32:30 2023-04-24

[12]:
accession name taxonomy__alternative_parents taxonomy__class taxonomy__description taxonomy__direct_parent taxonomy__kingdom taxonomy__molecular_framework taxonomy__sub_class taxonomy__substituents
0 HMDB0000001 1-Methylhistidine [Amino acids, Aralkylamines, Azacyclic compoun... Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues [Alpha-amino acid, Amine, Amino acid, Aralkyla...
1 HMDB0000002 1,3-Diaminopropane [Hydrocarbon derivatives, Organopnictogen comp... Organonitrogen compounds belongs to the class of organic compounds kno... Monoalkylamines Organic compounds Aliphatic acyclic compounds Amines [Aliphatic acyclic compound, Hydrocarbon deriv...
2 HMDB0000005 2-Ketobutyric acid [Alpha-hydroxy ketones, Alpha-keto acids and d... Keto acids and derivatives belongs to the class of organic compounds kno... Short-chain keto acids and derivatives Organic compounds Aliphatic acyclic compounds Short-chain keto acids and derivatives [Aliphatic acyclic compound, Alpha-hydroxy ket...
3 HMDB0000008 2-Hydroxybutyric acid [Carbonyl compounds, Carboxylic acids, Fatty a... Hydroxy acids and derivatives belongs to the class of organic compounds kno... Alpha hydroxy acids and derivatives Organic compounds Aliphatic acyclic compounds Alpha hydroxy acids and derivatives [Alcohol, Aliphatic acyclic compound, Alpha-hy...
4 HMDB0000010 2-Methoxyestrone [1-hydroxy-2-unsubstituted benzenoids, 17-oxos... Steroids and steroid derivatives belongs to the class of organic compounds kno... Estrogens and derivatives Organic compounds Aromatic homopolycyclic compounds Estrane steroids [1-hydroxy-2-unsubstituted benzenoid, 17-oxost...
5 HMDB0000011 3-Hydroxybutyric acid [Carbonyl compounds, Carboxylic acids, Fatty a... Hydroxy acids and derivatives belongs to the class of organic compounds kno... Beta hydroxy acids and derivatives Organic compounds Aliphatic acyclic compounds Beta hydroxy acids and derivatives [Alcohol, Aliphatic acyclic compound, Beta-hyd...
6 HMDB0000012 Deoxyuridine [Azacyclic compounds, Heteroaromatic compounds... Pyrimidine nucleosides belongs to the class of organic compounds kno... Pyrimidine 2'-deoxyribonucleosides Organic compounds Aromatic heteromonocyclic compounds Pyrimidine 2'-deoxyribonucleosides [Alcohol, Aromatic heteromonocyclic compound, ...
7 HMDB0000014 Deoxycytidine [Aminopyrimidines and derivatives, Azacyclic c... Pyrimidine nucleosides belongs to the class of organic compounds kno... Pyrimidine 2'-deoxyribonucleosides Organic compounds Aromatic heteromonocyclic compounds Pyrimidine 2'-deoxyribonucleosides [Alcohol, Amine, Aminopyrimidine, Aromatic het...
8 HMDB0000015 Cortexolone [17-hydroxysteroids, 20-oxosteroids, 3-oxo del... Steroids and steroid derivatives belongs to the class of organic compounds kno... 21-hydroxysteroids Organic compounds Aliphatic homopolycyclic compounds Hydroxysteroids [17-hydroxysteroid, 20-oxosteroid, 21-hydroxys...
9 HMDB0000016 Deoxycorticosterone [20-oxosteroids, 3-oxo delta-4-steroids, Alpha... Steroids and steroid derivatives belongs to the class of organic compounds kno... 21-hydroxysteroids Organic compounds Aliphatic homopolycyclic compounds Hydroxysteroids [20-oxosteroid, 21-hydroxysteroid, 3-oxo-delta...
10 HMDB0000017 4-Pyridoxic acid [Aromatic alcohols, Azacyclic compounds, Carbo... Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives [Alcohol, Aromatic alcohol, Aromatic heteromon...
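The column names in the table above combine the parent field and the sub-field with a double underscore. A minimal plain-Python sketch of this kind of expansion follows; the function expand_dict is hypothetical and only mimics the naming seen in the output, it is not how pypath does it internally.

```python
def expand_dict(record, field, sep = '__'):
    """Replace one dict-valued field by one key per sub-field."""
    # Keep all other fields unchanged
    expanded = {k: v for k, v in record.items() if k != field}
    # Add one prefixed key for each element of the dict
    for sub, value in sorted(record[field].items()):
        expanded[f'{field}{sep}{sub}'] = value
    return expanded

record = {
    'accession': 'HMDB0000001',
    'name': '1-Methylhistidine',
    'taxonomy': {
        'kingdom': 'Organic compounds',
        'class': 'Carboxylic acids and derivatives',
    },
}

row = expand_dict(record, 'taxonomy')
print(list(row))
```

Each key of the resulting flat dict corresponds to one column of the data frame.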

We see that taxonomy gave rise to 8 columns. If we also expand the arrays in those columns, we get a data frame of more than 2,000 rows from the first few records alone:

[13]:
hmdb.metabolites_table('accession', 'name', ('taxonomy', '*', '@'), head = 10)

executed in 0ms, finished 12:32:37 2023-04-24

[13]:
accession name taxonomy__alternative_parents taxonomy__class taxonomy__description taxonomy__direct_parent taxonomy__kingdom taxonomy__molecular_framework taxonomy__sub_class taxonomy__substituents
0 HMDB0000001 1-Methylhistidine Amino acids Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues Alpha-amino acid
1 HMDB0000001 1-Methylhistidine Amino acids Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues Amine
2 HMDB0000001 1-Methylhistidine Amino acids Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues Amino acid
3 HMDB0000001 1-Methylhistidine Amino acids Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues Aralkylamine
4 HMDB0000001 1-Methylhistidine Amino acids Carboxylic acids and derivatives belongs to the class of organic compounds kno... Histidine and derivatives Organic compounds Aromatic heteromonocyclic compounds Amino acids, peptides, and analogues Aromatic heteromonocyclic compound
... ... ... ... ... ... ... ... ... ... ...
2235 HMDB0000017 4-Pyridoxic acid Vinylogous acids Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives Organooxygen compound
2236 HMDB0000017 4-Pyridoxic acid Vinylogous acids Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives Organopnictogen compound
2237 HMDB0000017 4-Pyridoxic acid Vinylogous acids Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives Primary alcohol
2238 HMDB0000017 4-Pyridoxic acid Vinylogous acids Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives Pyridine carboxylic acid
2239 HMDB0000017 4-Pyridoxic acid Vinylogous acids Pyridines and derivatives belongs to the class of organic compounds kno... Pyridinecarboxylic acids Organic compounds Aromatic heteromonocyclic compounds Pyridinecarboxylic acids and derivatives Vinylogous acid

2240 rows × 10 columns
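The growth in the number of rows is easy to reproduce without HMDB: expanding several array fields of one record yields one row for every combination of their elements. The sketch below uses only the standard library; the function expand and the shortened record are made up for illustration.

```python
import itertools

def expand(record, array_fields):
    """One row for each combination of the elements of the array fields."""
    scalars = {k: v for k, v in record.items() if k not in array_fields}
    combos = itertools.product(*(record[f] for f in array_fields))
    return [
        {**scalars, **dict(zip(array_fields, combo))}
        for combo in combos
    ]

# A shortened, hand-written record (assumption for illustration)
record = {
    'accession': 'HMDB0000001',
    'alternative_parents': ['Amino acids', 'Aralkylamines', 'Azacyclic compounds'],
    'substituents': ['Alpha-amino acid', 'Amine'],
}

rows = expand(record, ['alternative_parents', 'substituents'])
print(len(rows))  # 3 * 2 = 6 rows from a single record
```

With two arrays of a few dozen elements each, a single record already produces hundreds of rows, which explains the size of the data frame above.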

The hmdb.metabolites_mapping and hmdb.proteins_mapping functions provide data frames or dicts for translation between a pair of identifier types. For example, to translate KEGG Compound IDs to SMILES (the default output is a dict of sets):

[14]:
hmdb.metabolites_mapping('kegg', 'smiles', head = 10)

executed in 0ms, finished 12:33:27 2023-04-24

[14]:
{'C00109': {'CCC(=O)C(O)=O'},
 'C00526': {'OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O'},
 'C00847': {'CC1=NC=C(CO)C(C(O)=O)=C1O'},
 'C00881': {'NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1'},
 'C00986': {'NCCCN'},
 'C01089': {'C[C@@H](O)CC(O)=O'},
 'C01152': {'CN1C=NC(C[C@H](N)C(O)=O)=C1'},
 'C03205': {'[H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
 'C05299': {'[H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[C@@]21[H])C=C(O)C(OC)=C3'},
 'C05488': {'[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
 'C05984': {'CC[C@H](O)C(O)=O'}}

The same data in a data frame:

[15]:
hmdb.metabolites_mapping('kegg', 'smiles', head = 10, return_df = True)

executed in 0ms, finished 12:33:31 2023-04-24

[15]:
    id_a    id_b
0   C01152  CN1C=NC(C[C@H](N)C(O)=O)=C1
1   C00986  NCCCN
2   C00109  CCC(=O)C(O)=O
3   C05984  CC[C@H](O)C(O)=O
4   C05299  [H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[...
5   C01089  C[C@@H](O)CC(O)=O
6   C00526  OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O
7   C00881  NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1
8   C05488  [H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(...
9   C03205  [H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H...
10  C00847  CC1=NC=C(CO)C(C(O)=O)=C1O
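The dict of sets and the two-column table carry the same information. As a small plain-Python sketch, the helpers below (to_rows and swap, both hypothetical names) turn a dict of sets into sorted pairs and reverse its direction; the two entries are copied from the output above.

```python
def to_rows(mapping):
    """Dict of sets to a sorted list of (id_a, id_b) pairs."""
    return sorted((a, b) for a, bs in mapping.items() for b in bs)

def swap(mapping):
    """Reverse the direction of a dict of sets."""
    swapped = {}
    for a, bs in mapping.items():
        for b in bs:
            swapped.setdefault(b, set()).add(a)
    return swapped

kegg_to_smiles = {
    'C00986': {'NCCCN'},
    'C00109': {'CCC(=O)C(O)=O'},
}

print(to_rows(kegg_to_smiles))
print(swap(kegg_to_smiles))
```

Sets are used as values because one identifier may map to several identifiers of the other type.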

ID translation with HMDB§

HMDB is also integrated into the ID translation service. Thanks to the multiple levels of caching, only the first call takes a long time; subsequent calls are pretty fast:

[16]:
from pypath.utils import mapping
mapping.map_name('C01152', 'kegg', 'inchi')

executed in 0ms, finished 12:33:39 2023-04-24

[16]:
{'InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1',
 'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}

The two InChI strings correspond to the two constitutional isomers included in the KEGG ID: 1- and 3-Methylhistidine. A useful feature of HMDB is that it has many synonyms and IUPAC names, making it possible to parse a large variety of metabolite names:

[17]:
mapping.map_name('C01152', 'kegg', 'hmdb_synonym')

executed in 0ms, finished 12:33:41 2023-04-24

[17]:
{'(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoate',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoic acid',
 '1 Methylhistidine',
 '1-MHis',
 '1-Methyl histidine',
 '1-Methyl-L-histidine',
 '1-Methyl-histidine',
 '1-Methylhistidine',
 '1-Methylhistidine dihydrochloride',
 '1-N-Methyl-L-histidine',
 '3-Methyl-L-histidine',
 '3-Methylhistidine',
 '3-Methylhistidine dihydrochloride',
 '3-Methylhistidine hydride',
 '3-N-Methyl-L-histidine',
 'L-1-Methylhistidine',
 'L-3-Methylhistidine',
 'N Tau-methylhistidine',
 'N(Tau)-methylhistidine',
 'N(pros)-Methyl-L-histidine',
 'N-pros-Methyl-L-histidine',
 'N1-Methyl-L-histidine',
 'N3-Methyl-L-histidine',
 'Pi-methylhistidine',
 'Tau-methyl-L-histidine',
 'Tau-methylhistidine'}
[18]:
mapping.map_name('N(pros)-Methyl-L-histidine', 'hmdb_synonym', 'inchi')

executed in 1.81s, finished 12:33:46 2023-04-24

[18]:
{'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}

The name provided by HMDB is typically the best human readable name, hence it can be used as a label in figures or tables:

[19]:
mapping.map_name('HMDB0000001', 'hmdb', 'hmdb_name')

executed in 0ms, finished 12:33:47 2023-04-24

[19]:
{'1-Methylhistidine'}

NCBI E-Utils§

The ESummary endpoint of the NCBI E-Utils API provides metadata about records in NCBI databases. A client to this API endpoint is available in the pypath.inputs.eutils module. The parameter ids can be one integer, or a list of integers or strings:

[3]:
from pypath.inputs import eutils

eutils.esummary(ids = 6063, db = 'geoprofiles')

executed in 0ms, finished 22:43:56 2023-11-14

[3]:
{'uids': ['6063'],
 '6063': {'uid': '6063',
  'gds': '5',
  'gpl': '13',
  'erank': '8eSiQ',
  'evalue': 'joAzE',
  'title': 'Diurnal and circadian-regulated genes (I)',
  'taxon': 'Arabidopsis thaliana',
  'gdstype': 'Expression profiling by array',
  'valtype': 'log ratio',
  'idref': '6063',
  'genename': '',
  'genedesc': '',
  'ugname': 'AT4G11560',
  'ugdesc': 'Bromo-adjacent homology (BAH) domain-containing protein',
  'nucdesc': '9366 Lambda-PRL2 Arabidopsis thaliana cDNA clone 135J10T7, mRNA sequence',
  'entrez_gene_id': '',
  'gbacc': 'T46103',
  'ptacc': '',
  'cloneid': '135J10T7',
  'orf': '',
  'spotid': '',
  'vmin': '-0.395000',
  'vmax': '0.201000',
  'groups': 'A1B3C1',
  'abscall': '',
  'aflag': 20,
  'aoutl': '',
  'rstd': 31,
  'rmean': 50}}

A simple wrapper for PubMed is available in the pypath.inputs.pubmed module:

[2]:
from pypath.inputs import pubmed

pubmed.get_pubmeds('33209674')

executed in 0ms, finished 22:42:02 2023-11-14

[2]:
{'uids': ['33209674'],
 '33209674': {'uid': '33209674',
  'pubdate': '2020 Oct',
  'epubdate': '',
  'source': 'Transl Androl Urol',
  'authors': [{'name': 'Kim H', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Lee SH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Kim DH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Lee JY', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Hong SH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Ha US', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Kim IH', 'authtype': 'Author', 'clusterid': ''}],
  'lastauthor': 'Kim IH',
  'title': 'Gemcitabine maintenance versus observation after first-line chemotherapy in patients with metastatic urothelial carcinoma: a retrospective study.',
  'sorttitle': 'gemcitabine maintenance versus observation after first line chemotherapy in patients with metastatic urothelial carcinoma a retrospective study',
  'volume': '9',
  'issue': '5',
  'pages': '2113-2121',
  'lang': ['eng']
Output truncated: showing 1000 of 2263 characters

One last example, querying the Entrez Gene database:

[4]:
from pypath.inputs import eutils

eutils.esummary(ids = 1956, db = 'gene')

executed in 0ms, finished 22:48:09 2023-11-14

[4]:
{'uids': ['1956'],
 '1956': {'uid': '1956',
  'name': 'EGFR',
  'description': 'epidermal growth factor receptor',
  'status': '',
  'currentid': '',
  'chromosome': '7',
  'geneticsource': 'genomic',
  'maplocation': '7p11.2',
  'otheraliases': 'ERBB, ERBB1, ERRP, HER1, NISBD2, PIG61, mENA',
  'otherdesignations': 'epidermal growth factor receptor|EGFR vIII|avian erythroblastic leukemia viral (v-erb-b) oncogene homolog|cell growth inhibiting protein 40|cell proliferation-inducing protein 61|epidermal growth factor receptor tyrosine kinase domain|erb-b2 receptor tyrosine kinase 1|proto-oncogene c-ErbB-1|receptor tyrosine-protein kinase erbB-1',
  'nomenclaturesymbol': 'EGFR',
  'nomenclaturename': 'epidermal growth factor receptor',
  'nomenclaturestatus': 'Official',
  'mim': ['131550'],
  'genomicinfo': [{'chrloc': '7',
    'chraccver': 'NC_000007.14',
    'chrstart': 55019016,
    'chrstop': 55211627,
    'exoncount': 32}],
  'geneweight': 580393,
  'summary': 'The protein encoded b
Output truncated: showing 1000 of 5417 characters
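For orientation, the sketch below builds a request URL for NCBI's public ESummary endpoint and unpacks a result of the shape shown above. It performs no download. The helper names esummary_url and records are hypothetical, and whether pypath sends exactly these parameters is an assumption; the endpoint address and the db, id and retmode parameters come from NCBI's E-utilities documentation.

```python
from urllib.parse import urlencode

# Public ESummary endpoint of NCBI E-utilities
ESUMMARY_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'

def esummary_url(ids, db):
    """Build an ESummary URL for one ID or a list of IDs."""
    if isinstance(ids, (int, str)):
        ids = [ids]
    params = {
        'db': db,
        'id': ','.join(str(i) for i in ids),
        'retmode': 'json',
    }
    return f'{ESUMMARY_URL}?{urlencode(params)}'

def records(result):
    """List the per-ID records of a result dict like the ones above."""
    return [result[uid] for uid in result['uids']]

print(esummary_url(1956, 'gene'))

# A shortened copy of the Entrez Gene result shown above
result = {'uids': ['1956'], '1956': {'uid': '1956', 'name': 'EGFR'}}
print([rec['name'] for rec in records(result)])
```

The uids key lists the identifiers in the result, and each identifier is also a key pointing to its own record.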

Download management§

Cache management and customization§

pypath.omnipath.app saves the databases to pickle dumps, by default under the ~/.pypath/pickles/ directory, and after the first build loads them from there. The very first build of each database might take quite a long time (90 minutes or more in case of the OmniPath network or annotations) because of the large number of downloads. Subsequent builds will be much faster because pypath stores all downloaded data in a local cache and downloads again only upon request from the user. Loading the databases from pickle dumps takes only seconds. However, if you want to build a database with different settings, make sure to set a different file name for the pickle dump, so that the earlier dump is not loaded instead.
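The build once, then load from a pickle dump behaviour follows a common pattern. Below is a generic sketch of that pattern with the standard library only; it is not pypath's code, and load_or_build, slow_build and the file name are hypothetical.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load an object from a pickle dump, or build and save it first."""
    if os.path.exists(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    obj = build()
    with open(path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj

calls = []

def slow_build():
    # Stand-in for a long database build; records each invocation
    calls.append(1)
    return {'EGFR': {'EGF', 'TGFA'}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'network.pickle')
    first = load_or_build(path, slow_build)   # builds and saves
    second = load_or_build(path, slow_build)  # loads the dump

print(len(calls))
```

The sketch also shows why a different file name is needed for a build with different settings: as long as the dump exists, the build function is never called again.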

Download failures§

Issuing hundreds of requests to dozens of servers sooner or later comes with failures. These might happen just by accident, especially on slow networks, so it is always recommended to try again.

Corrupted cache content§

Sometimes a truncated or corrupted file remains in the cache. In this case you can use the context managers in pypath.share.curl to control the cache. For example, if the download of the DEPOD database failed and keeps failing due to a corrupted file, use the cache_delete_on context:

[7]:
from pypath.share import curl
from pypath.inputs import depod

with curl.cache_delete_on():
    # store the result under a new name to avoid shadowing the depod module
    depod_data = depod.depod_enzyme_substrate()

executed in 5.61s, finished 13:59:07 2022-12-02

The cache_off context forces download even if a cache item is available; the cache_print_on context prints paths to the accessed cache files to the terminal, though the paths can always be found in the log; the dry_run_on context sets up the pypath.share.curl.Curl object and stops just before the actual download.

Network communication issues: look into the curl debug log§

Downloads might also fail due to TLS or HTTP errors, wrong headers or parameters, and many other reasons. In these cases the full debug output from curl might be very useful. The debug_on context writes the curl debug output into the log file:

[8]:
from pypath.share import curl
from pypath.inputs import depod

with curl.debug_on():
    # store the result under a new name to avoid shadowing the depod module
    depod_data = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:12 2022-12-02

Timeouts§

From the log we can find out if the download fails due to a timeout. In this case, the timeout parameters can be altered by a settings context. Apart from curl_timeout, the timeout for the completion of the download, there is curl_connect_timeout (the timeout for establishing a connection to the server) and curl_extended_timeout, which is used for servers that are known to be exceptionally slow. Another parameter, curl_retries, is the number of attempts before giving up. By default it’s 3, and that should be more than enough.

[9]:
from pypath.share import settings
from pypath.inputs import depod

with settings.context(curl_timeout = 360):
    # store the result under a new name to avoid shadowing the depod module
    depod_data = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:17 2022-12-02
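A settings context of this kind changes values only temporarily. The generic sketch below shows the idea with contextlib. It is not pypath's implementation: the SETTINGS dict and the context function are hypothetical stand-ins, and the value of curl_timeout is a placeholder, not pypath's default.

```python
import contextlib

# Hypothetical settings store; the curl_timeout value is a placeholder
SETTINGS = {'curl_timeout': 100, 'curl_retries': 3}

@contextlib.contextmanager
def context(**overrides):
    """Temporarily override settings, restoring them on exit."""
    previous = {key: SETTINGS[key] for key in overrides}
    SETTINGS.update(overrides)
    try:
        yield SETTINGS
    finally:
        # Runs even if the body raises, so the overrides never leak out
        SETTINGS.update(previous)

with context(curl_timeout = 360):
    inside = SETTINGS['curl_timeout']

after = SETTINGS['curl_timeout']
print(inside, after)
```

The longer timeout is in effect only inside the with block, so other downloads keep using the original values.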

Access and inspect the Curl object§

Often the Curl object is created in a function from the pypath.inputs module, deep in a call stack, hence accessing it for investigation is difficult. Using the preserve_on context, the last Curl instance is kept under the pypath.share.curl.LASTCURL attribute:

[10]:
from pypath.share import curl
from pypath.inputs import depod

with curl.preserve_on():
    # store the result under a new name to avoid shadowing the depod module
    depod_data = depod.depod_enzyme_substrate()

depod_curl = curl.LASTCURL
depod_curl

executed in 0ms, finished 13:59:24 2022-12-02

[10]:
<pypath.share.curl.Curl at 0x6947386dc8b0>
[11]:
depod_curl.url, depod_curl.req_headers, depod_curl.fileobj, depod_curl.status

executed in 0ms, finished 13:59:28 2022-12-02

[11]:
('http://depod.bioss.uni-freiburg.de/download/DEPOD_201405_human_phosphatase-substrate.mitab',
 [],
 <_io.TextIOWrapper name='/home/denes/.pypath/cache/6a711369ecf9dcff8c5ed88996685b54-DEPOD_201405_human_phosphatase-substrate.mitab' mode='r' encoding='iso-8859-1'>,
 0)

Is it failing only for you?§

Okay, this is the one you should check first: we run almost all downloads in pypath daily, and you can always check in the report whether a particular function ran successfully last night on our server. If it also fails in our daily build, it can still be a transient error that disappears within a few days, or it can be a permanent error. In the latter case, we first try to fix the issue in pypath (maybe the behaviour or the address of the third party server has changed). If we have no way to fix it, we start hosting the data on our own server and make pypath download it from there.

Read the log§

Above we mentioned the pypath log many times. Here is how to access it; see more details in the section about logging:

[12]:
import pypath
pypath.log()

executed in 0ms, finished 13:59:34 2022-12-02

[2022-12-02 14:57:09] Welcome!
[2022-12-02 14:57:09] Logger started, logging into `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`.
[2022-12-02 14:57:09] Session `s3e92` started.
[2022-12-02 14:57:09] [pypath]
        - session ID: `s3e92`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`
        - pypath version: 0.14.30
[2022-12-02 14:57:09] [curl] Creating Curl object to retrieve data from `https://www.ensembl.org/info/about/species.html`
[2022-12-02 14:57:09] [curl] Cache file path: `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`
[2022-12-02 14:57:09] [curl] Cache file found, no need for download.
[2022-12-02 14:57:09] [curl] Opening plain text file `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`.
[2022-12-02 14:57:09] [curl] Contents of `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html` has been read and the file has been closed.
[2022-1
Output truncated: showing 1000 of 112963 characters

TLS (SSL, HTTPS) errors§

Failed to verify certificate; invalid, expired, self-signed or missing certificates: these might be the most common reasons why people open issues for our software. TLS is a method for encrypted communication, typically of HTTP traffic. The server has a certificate and uses it to sign and encrypt the data before sending it to the client. The client trusts the server certificate because it is signed by another certificate, which is signed by yet another one, and so on, until we reach a so-called root certificate that is known and trusted by the client. The number of root certificates used globally is so small that every single computer stores them locally and updates them from time to time from trusted sources, such as the provider of the operating system, web browser or programming language.

Having an up-to-date certificate store and correctly configured TLS clients on your computer is your (or your system admin’s) duty; here we can give only a generic procedure to address these issues. In 97% of the cases the issue is in your computer, but sometimes the server might be responsible. If you experience a TLS issue:

  • Check the status of the server: initiate a scan at a free TLS checking service, such as SSL Labs, and look for any issue with the certificate chain, such as missing or expired certificates, or ciphers that are too old or too new to be supported by your client

  • Identify the server that your client failed to establish a TLS connection to (in case of pypath, look into the log)

  • Identify your software that contains the TLS client: in case of pypath, it uses pycurl, a Python module built on libcurl

  • Identify the provider of the client software: it can be PyPI, Anaconda, your operating system, etc.

  • Find out which certificate store that software uses: most of them use the store from your operating system, but, for example, Java and Mozilla Firefox come with their own certificates

  • Check if the certificate store is up-to-date, update if necessary

  • Alternatively, identify the missing root certificate and add it manually to the store; you can also add a non-root certificate if the server has a serious issue and the chain cannot be followed to a valid root certificate
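Some of the checks above can be scripted. The sketch below uses only Python's standard `ssl` module to show which certificate store Python itself would use, and to read the expiry date of a server certificate. It is a diagnostic aid under an important assumption: pypath talks to servers through pycurl and libcurl, which may be linked against a different TLS library and certificate store than Python's `ssl` module, so the result is a hint rather than proof. The function names are made up for this illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def describe_default_store():
    """Summarise the certificate store used by Python's own ssl module."""
    paths = ssl.get_default_verify_paths()
    context = ssl.create_default_context()
    return {
        'cafile': paths.cafile,
        'capath': paths.capath,
        'openssl_version': ssl.OPENSSL_VERSION,
        # Certificates from a capath directory are loaded lazily, so a
        # zero here does not prove that the store is empty.
        'loaded_ca_certs': context.cert_store_stats()['x509_ca'],
    }


def certificate_expiry(host, port = 443, timeout = 10):
    """Verify the chain of `host` and return when its certificate expires.

    Needs network access; raises ssl.SSLCertVerificationError if the
    chain cannot be verified against the default store.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout = timeout) as sock:
        with context.wrap_socket(sock, server_hostname = host) as tls:
            not_after = tls.getpeercert()['notAfter']
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz = timezone.utc)


store = describe_default_store()
for key, value in store.items():
    print(f'{key}: {value}')
```

Calling `certificate_expiry('omnipathdb.org')` on the affected machine tells you whether Python's default store can verify that server; a failure here and in pycurl at the same time points to the local certificate store rather than to pypath.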

Please open TLS-related issues for our software only if you

  • experience a server-side issue with omnipathdb.org, or

  • have a strong reason to think that the cause is in the code written by us, or that it can be easily fixed within our code

Resources§

[2]:
from pypath import resources
rc = resources.get_controller()
rc

executed in 0ms, finished 14:27:45 2022-12-03

[2]:
<pypath.resources.controller.ResourceController at 0x6cc25e25dcf0>

Licenses§

SIGNOR is licensed under CC BY-SA, which allows commercial (for-profit) use:

[3]:
rc.license('SIGNOR'), rc.license('SIGNOR').commercial

executed in 0ms, finished 14:27:47 2022-12-03

[3]:
(<License CC BY-SA 4.0>, True)

Example: build a network for commercial use§

For our users, the most important aspect of licenses is whether they allow for-profit use in companies. In the near future we intend to provide a more convenient interface for license options; until then, see the example below.

[4]:
from pypath.core import network
from pypath import resources

co = resources.get_controller()
pw_academic = co.collect_network('pathway')
pw_commercial = co.collect_network('pathway', license_purpose = 'commercial')

len(pw_academic), len(pw_commercial), set(pw_academic.values()) - set(pw_commercial.values())

executed in 0ms, finished 18:45:22 2023-03-10

[4]:
(24,
 19,
 {<NetworkResource: Baccin2019 (post_translational, activity_flow)>,
  <NetworkResource: Cellinker (post_translational, activity_flow)>,
  <NetworkResource: HPMR (post_translational, activity_flow)>,
  <NetworkResource: PDZBase (post_translational, activity_flow)>,
  <NetworkResource: TRIP (post_translational, activity_flow)>})

Above we see that five resources have been disabled by applying the for-profit licensing restriction. The licenses of those five resources:

[5]:
[r.license for r in set(pw_academic.values()) - set(pw_commercial.values())]

executed in 0ms, finished 18:48:02 2023-03-10

[5]:
[<License CC BY-NC-SA 3.0>,
 <License No license>,
 <License CC BY-NC 4.0>,
 <License CC BY-NC 4.0>,
 <License CC BY-NC 4.0>]

The licenses of the resources that allow for-profit use:

[6]:
[r.license for r in pw_commercial.values()]

executed in 0ms, finished 18:50:35 2023-03-10

[6]:
[<License CC BY 4.0>,
 <License CC BY-SA 3.0>,
 <License CC BY-SA 3.0>,
 <License CC BY 4.0>,
 <License CC BY-SA 3.0>,
 <License CC BY-SA 3.0>,
 <License CC BY 4.0>,
 <License NAR Open Access>,
 <License CC BY-SA 4.0>,
 <License CC BY 4.0>,
 <License GPLv3>,
 <License GPLv3>,
 <License GPLv3>,
 <License MIT>,
 <License GPLv3>,
 <License MIT>,
 <License MIT>,
 <License CC BY 4.0>,
 <License GPLv3>]

Taking a closer look at a non-profit license:

[10]:
license = pw_academic['trip'].license
license.purpose, license.purpose.enables('for-profit')

executed in 0ms, finished 18:54:45 2023-03-10

[10]:
(<License purpose: academic>, False)
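The filtering by license purpose shown above can be pictured with a small, self-contained sketch. It is not pypath's implementation: the `ToyLicense` class, the two-level purpose ordering and the `example_nc` and `example_mit` resources are hypothetical. Only the SIGNOR entry (CC BY-SA 4.0, commercial use allowed) is taken from the output above.

```python
from dataclasses import dataclass

# Hypothetical ordering of purposes, from most to least restricted use.
PURPOSE_LEVELS = {'academic': 0, 'for-profit': 1}


@dataclass(frozen = True)
class ToyLicense:
    name: str
    purpose: str  # the most permissive purpose the license allows

    def enables(self, purpose):
        """True if this license allows using the data for `purpose`."""
        return PURPOSE_LEVELS[purpose] <= PURPOSE_LEVELS[self.purpose]


def select_by_purpose(resources, purpose):
    """Keep only the resources whose license enables `purpose`."""
    return {
        key: lic
        for key, lic in resources.items()
        if lic.enables(purpose)
    }


toy_resources = {
    'signor': ToyLicense('CC BY-SA 4.0', 'for-profit'),     # from the text
    'example_nc': ToyLicense('CC BY-NC 4.0', 'academic'),   # hypothetical
    'example_mit': ToyLicense('MIT', 'for-profit'),         # hypothetical
}

academic = select_by_purpose(toy_resources, 'academic')
commercial = select_by_purpose(toy_resources, 'for-profit')
print(sorted(set(academic) - set(commercial)))
```

The set difference at the end mirrors the comparison of `pw_academic` and `pw_commercial` above: whatever is present in the academic selection but absent from the commercial one has been excluded by its license.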

The collected resources can be used directly to build databases, in this case a network database:

[11]:
net_academic = network.Network(pw_academic)
net_commercial = network.Network(pw_commercial)
net_academic, net_commercial

executed in 1m 2.79s, finished 18:57:02 2023-03-10

[11]:
(<Network: 6833 nodes, 25607 interactions>,
 <Network: 6429 nodes, 23288 interactions>)

As we see, the network usable for for-profit purposes is smaller by about 400 nodes and 2,300 interactions, and it might miss even more of the fine-grained details, but it is likely suitable for analysis.

We are not legal experts, but here are some thoughts about licenses. Even if you work for a company, you might download and explore data under any license; the restrictions apply once you start to actually use the resource. Even if some resources restrict commercial use, you can always contact the copyright owners and ask them for permission, or ask your company to pay them a licensing fee, so that you can legally use their product.

Resource information§

[4]:
rc['MatrixDB']

executed in 0ms, finished 14:27:49 2022-12-03

[4]:
{'yearUsedRelease': 2015,
 'releases': [2009, 2011, 2015],
 'urls': {'articles': ['http://bioinformatics.oxfordjournals.org/content/25/5/690.long',
   'http://nar.oxfordjournals.org/content/43/D1/D321.long',
   'http://nar.oxfordjournals.org/content/39/suppl_1/D235.long'],
  'webpages': ['http://matrixdb.univ-lyon1.fr/'],
  'omictools': ['http://omictools.com/matrixdb-tool']},
 'pubmeds': [19147664, 20852260, 25378329],
 'taxons': ['mammalia'],
 'annot': ['experiment'],
 'recommend': ['small, literature curated interaction resource; many interactions for',
  'receptors and extracellular proteins'],
 'descriptions': ['Protein data were imported from the UniProtKB/Swiss-Prot database (Bairoch et',
  'al., 2005) and identified by UniProtKB/SwissProt accession numbers. In order to',
  'list all the partners of a protein, interactions are associated by default to the',
  'accession number of the human protein. The actual source species used in experiments is',
  'indicated in the page repor
Output truncated: showing 1000 of 4479 characters

Resource definitions for a certain database or dataset§

Note: This does not work yet for all databases and datasets, but likely in the near future this will be the preferred method to access resource definitions.

[197]:
rc.collect_enzyme_substrate()

executed in 0ms, finished 20:08:29 2022-12-02

[197]:
[<EnzymeSubstrateResource: phosphoELM>,
 <EnzymeSubstrateResource: dbPTM>,
 <EnzymeSubstrateResource: SIGNOR>,
 <EnzymeSubstrateResource: HPRD>,
 <EnzymeSubstrateResource: Li2012>,
 <EnzymeSubstrateResource: DEPOD>,
 <EnzymeSubstrateResource: PhosphoSite>,
 <EnzymeSubstrateResource: PhosphoNetworks>,
 <EnzymeSubstrateResource: MIMP>,
 <EnzymeSubstrateResource: ProtMapper>,
 <EnzymeSubstrateResource: KEA>]

The resource definitions carry all information necessary to load the resource, for example:

[202]:
phosphoelm = rc.collect_enzyme_substrate()[0]
phosphoelm.input_method, phosphoelm.id_type_enzyme

executed in 0ms, finished 20:09:51 2022-12-02

[202]:
('phosphoelm.phosphoelm_enzyme_substrate', 'uniprot')
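The `input_method` above is a dotted string rather than a function object. One common way to turn such a string into a callable is a dynamic import, sketched below. How pypath resolves these names internally is not shown in this notebook, so the `resolve_method` helper and the idea of a base package are assumptions made for illustration; the demonstration uses the standard library so that it runs without pypath.

```python
import importlib


def resolve_method(dotted_name, package):
    """Resolve a 'module.function' string relative to a base package.

    Hypothetical helper: splits at the last dot, imports the module
    and returns the attribute with the given name.
    """
    module_name, _, function_name = dotted_name.rpartition('.')
    module = importlib.import_module(f'{package}.{module_name}')
    return getattr(module, function_name)


# Demonstrated on the standard library: 'parse.urlsplit' in 'urllib'.
urlsplit = resolve_method('parse.urlsplit', package = 'urllib')
print(urlsplit('https://omnipathdb.org/interactions').netloc)
```

Storing a name instead of a function keeps the resource definitions plain data that can be listed and inspected without importing every input module.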

Building networks§

For this you will need the Network class from the pypath.core.network module, which takes care of building and querying the network. You also need the pypath.resources.network module, where you find a number of predefined input settings organized into larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc.). These input settings tell pypath how to download and process the data.

[13]:
from pypath.core import network
from pypath.resources import network as netres

executed in 0ms, finished 13:59:49 2022-12-02

For example, netres.pathway is a collection of databases that fit the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:

[14]:
netres.pathway

executed in 0ms, finished 13:59:52 2022-12-02

[14]:
{'trip': <NetworkResource: TRIP (post_translational, activity_flow)>,
 'spike': <NetworkResource: SPIKE (post_translational, activity_flow)>,
 'signalink3': <NetworkResource: SignaLink3 (post_translational, activity_flow)>,
 'guide2pharma': <NetworkResource: Guide2Pharma (post_translational, activity_flow)>,
 'ca1': <NetworkResource: CA1 (post_translational, activity_flow)>,
 'arn': <NetworkResource: ARN (post_translational, activity_flow)>,
 'nrf2': <NetworkResource: NRF2ome (post_translational, activity_flow)>,
 'macrophage': <NetworkResource: Macrophage (post_translational, activity_flow)>,
 'death': <NetworkResource: DeathDomain (post_translational, activity_flow)>,
 'pdz': <NetworkResource: PDZBase (post_translational, activity_flow)>,
 'signor': <NetworkResource: SIGNOR (post_translational, activity_flow)>,
 'adhesome': <NetworkResource: Adhesome (post_translational, activity_flow)>,
 'icellnet': <NetworkResource: ICELLNET (post_translational, activity_flow)>,
 'celltalkdb': <Net
Output truncated: showing 1000 of 1864 characters

You can pass such a dictionary to the load method of the network.Network object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your home directory. For this reason, loading a resource for the first time might take long, but the next time it will be faster because the data is fetched from the cache. First create a pypath.core.network.Network object, then build the network:

[15]:
n = network.Network()
n.load(netres.pathway)

executed in 32.90s, finished 14:00:36 2022-12-02

[16]:
n

executed in 0ms, finished 14:02:23 2022-12-02

[16]:
<Network: 6833 nodes, 25607 interactions>
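The caching described above explains why the second load is faster. A minimal sketch of the idea follows. The cache file names in the logs consist of a 32-character hexadecimal digest followed by the file name; the sketch assumes an MD5 digest of the URL, which may well differ from the key pypath really uses, so do not expect identical file names. The `fetch` function and its `download` argument are hypothetical.

```python
import hashlib
import os
import tempfile
from urllib.parse import urlsplit


def cache_file_name(url):
    """Build a '<hex digest>-<file name>' cache key (assumed scheme)."""
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()
    file_name = os.path.basename(urlsplit(url).path) or 'index'
    return f'{digest}-{file_name}'


def fetch(url, cache_dir, download):
    """Return the path of the cached copy, downloading only on a miss."""
    os.makedirs(cache_dir, exist_ok = True)
    path = os.path.join(cache_dir, cache_file_name(url))
    if not os.path.exists(path):
        with open(path, 'wb') as fp:
            fp.write(download(url))
    return path


download_calls = []


def fake_download(url):
    """Stands in for a real HTTP download and records each call."""
    download_calls.append(url)
    return b'<html>species</html>'


example_url = 'https://www.ensembl.org/info/about/species.html'

with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch(example_url, cache_dir, fake_download)
    second = fetch(example_url, cache_dir, fake_download)

print(first == second, len(download_calls))
```

The second `fetch` finds the file written by the first one and never calls the download function, which is the behaviour you observe when a resource is loaded again.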

You can add more resource sets in a similar way:

[18]:
n.load(netres.enzyme_substrate)

executed in 30.04s, finished 14:04:29 2022-12-02

[19]:
n

executed in 0ms, finished 14:05:38 2022-12-02

[19]:
<Network: 7979 nodes, 35550 interactions>

To load one single resource simply pass the NetworkResource directly:

[20]:
n.load(netres.interaction['matrixdb'])

executed in 0ms, finished 14:05:42 2022-12-02

[21]:
n

executed in 0ms, finished 14:05:44 2022-12-02

[21]:
<Network: 8002 nodes, 35748 interactions>

Which network datasets are pre-defined in pypath?§

You can find all the pre-defined datasets in the pypath.resources.network module. This module is currently a wrapper around an older module, pypath.resources.data_formats, where the actual definitions are written. As we mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; since then, however, we have added a great variety of other kinds of resource definitions. Here we give an overview of these.

  • pypath.resources.network.pathway: activity flow networks with literature references

  • pypath.resources.network.activity_flow: synonym for pathway

  • pypath.resources.network.pathway_noref: activity flow networks without literature references

  • pypath.resources.network.pathway_all: all activity flow data

  • pypath.resources.network.ptm: enzyme-substrate interaction networks with literature references

  • pypath.resources.network.enzyme_substrate: synonym for ptm

  • pypath.resources.network.ptm_noref: enzyme-substrate networks without literature references

  • pypath.resources.network.ptm_all: all enzyme-substrate data

  • pypath.resources.network.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)

  • pypath.resources.network.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)

  • pypath.resources.network.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath

  • pypath.resources.network.transcription: transcriptional regulation only from the DoRothEA data

  • pypath.resources.network.mirna_target: miRNA-mRNA interactions from literature curated resources

  • pypath.resources.network.tf_mirna: transcriptional regulation of miRNA from literature curated resources

  • pypath.resources.network.lncrna_protein: lncRNA-protein interactions from literature curated datasets

  • pypath.resources.network.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources

  • pypath.resources.network.pathwaycommons: the PathwayCommons database

  • pypath.resources.network.reaction: process description databases; not guaranteed to work at this moment

  • pypath.resources.network.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment

  • pypath.resources.network.small_molecule_protein: signaling interactions between small molecules and proteins

To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:

[22]:
netres.pathway.keys()

executed in 0ms, finished 14:05:57 2022-12-02

[22]:
dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'icellnet', 'celltalkdb', 'cellchatdb', 'connectomedb', 'talklr', 'cellinker', 'scconnect', 'hpmr', 'cellphonedb', 'ramilowski2015', 'lrdb', 'baccin2019'])
[23]:
[resource.name for resource in netres.pathway.values()]

executed in 0ms, finished 14:06:00 2022-12-02

[23]:
['TRIP',
 'SPIKE',
 'SignaLink3',
 'Guide2Pharma',
 'CA1',
 'ARN',
 'NRF2ome',
 'Macrophage',
 'DeathDomain',
 'PDZBase',
 'SIGNOR',
 'Adhesome',
 'ICELLNET',
 'CellTalkDB',
 'CellChatDB',
 'connectomeDB2020',
 'talklr',
 'Cellinker',
 'scConnect',
 'HPMR',
 'CellPhoneDB',
 'Ramilowski2015',
 'LRdb',
 'Baccin2019']

The resource definitions above carry all the information about how to load the resource: which function to call, and how to process the identifiers, references, directions and all other attributes from the input. For example, which column from SPIKE corresponds to the source node, and which identifier type is used? It is the second column (index 1, counting from zero), and it contains gene symbols:

[24]:
netres.pathway['spike'].networkinput.id_col_a, netres.pathway['spike'].networkinput.id_type_a

executed in 0ms, finished 14:06:07 2022-12-02

[24]:
(1, 'genesymbol')
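To illustrate how such settings drive the processing of an input file, here is a minimal sketch. The `ColumnSpec` class and `parse_line` function are hypothetical and much simpler than pypath's input definitions. Only `id_col_a = 1` and the `'genesymbol'` type come from the SPIKE output above; the partner column and the example row are invented.

```python
from typing import NamedTuple


class ColumnSpec(NamedTuple):
    id_col_a: int   # zero-based column index of the source node
    id_col_b: int   # zero-based column index of the target node
    id_type_a: str  # identifier type found in column A
    id_type_b: str  # identifier type found in column B


def parse_line(line, spec, sep = '\t'):
    """Extract the (identifier, id_type) pairs of both partners."""
    fields = line.rstrip('\n').split(sep)
    return (
        (fields[spec.id_col_a], spec.id_type_a),
        (fields[spec.id_col_b], spec.id_type_b),
    )


# id_col_b = 3 and the row below are made up for this example.
spec = ColumnSpec(
    id_col_a = 1,
    id_col_b = 3,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
)
row = 'record_1\tEGF\tsome_value\tEGFR\n'
print(parse_line(row, spec))
```

Because the identifier type travels together with the identifier, a later step can decide which translation is needed to reach the primary identifier type of the network.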

The Network object§

Once you have built a network, you can use it for various purposes and write your own scripts for further processing or analysis. Below we create a Network object and populate it with the pathway dataset.

Warning: it is recommended to access databases through the database manager. Running the code below takes a long time, and it does not save or reload the database: it builds a fresh copy each time.

[2]:
from pypath.core import network
from pypath.resources import network as netres

n = network.Network()
n.load(netres.pathway)
n

executed in 36.07s, finished 14:15:48 2022-12-02

[2]:
<Network: 6833 nodes, 25607 interactions>

Almost all data is stored in Network.interactions, a dict with pairs of nodes as keys and Interaction objects as values:

[3]:
n.interactions

executed in 0ms, finished 14:17:02 2022-12-02

[3]:
{(<Entity: TRPC1>,
  <Entity: KCNMA1>): <Interaction: TRPC1 ============= KCNMA1 [Evidences: TRIP (2 references)]>,
 (<Entity: TRPC1>,
  <Entity: PPP3CA>): <Interaction: TRPC1 ============= PPP3CA [Evidences: TRIP (1 references)]>,
 (<Entity: CALM2>,
  <Entity: TRPC1>): <Interaction: CALM2 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM3>,
  <Entity: TRPC1>): <Interaction: CALM3 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM1>,
  <Entity: TRPC1>): <Interaction: CALM1 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CASP1>,
  <Entity: TRPC1>): <Interaction: CASP1 ============= TRPC1 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CASP4>): <Interaction: TRPC1 ============= CASP4 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CACNA1C>): <Interaction: TRPC1 ============= CACNA1C [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CAV1>): <Interaction: TRPC1 <=(+)======== CAV1 [Ev
Output truncated: showing 1000 of 118492 characters

The dict under Network.nodes is kept in sync with the interactions, and facilitates node access. Keys are primary identifiers (for proteins UniProt IDs by default), values are Entity objects:

[26]:
n.nodes

executed in 0ms, finished 14:06:21 2022-12-02

[26]:
{'P48995': <Entity: TRPC1>,
 'Q12791': <Entity: KCNMA1>,
 'Q08209': <Entity: PPP3CA>,
 'P0DP24': <Entity: CALM2>,
 'P0DP25': <Entity: CALM3>,
 'P0DP23': <Entity: CALM1>,
 'P29466': <Entity: CASP1>,
 'P49662': <Entity: CASP4>,
 'Q13936': <Entity: CACNA1C>,
 'Q03135': <Entity: CAV1>,
 'P56539': <Entity: CAV3>,
 'Q14247': <Entity: CTTN>,
 'P14416': <Entity: DRD2>,
 'P11532': <Entity: DMD>,
 'P11362': <Entity: FGFR1>,
 'Q02790': <Entity: FKBP4>,
 'Q86YM7': <Entity: HOMER1>,
 'Q9NSC5': <Entity: HOMER3>,
 'Q99750': <Entity: MDFI>,
 'Q14571': <Entity: ITPR2>,
 'Q14573': <Entity: ITPR3>,
 'P29966': <Entity: MARCKS>,
 'Q13255': <Entity: GRM1>,
 'P20591': <Entity: MX1>,
 'P62166': <Entity: NCS1>,
 'Q96D31': <Entity: ORAI1>,
 'Q96SN7': <Entity: ORAI2>,
 'Q9BRQ5': <Entity: ORAI3>,
 'P11171': <Entity: EPB41>,
 'P61586': <Entity: RHOA>,
 'Q9Y225': <Entity: RNF24>,
 'P21817': <Entity: RYR1>,
 'P16615': <Entity: ATP2A2>,
 'Q93084': <Entity: ATP2A3>,
 'P60880': <Entity: SNAP25>,
 'Q13586': <Entity: STI
Output truncated: showing 1000 of 30573 characters
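The relationship between the two dicts can be sketched with a toy class. This is not how pypath implements it; in particular, the real values are Interaction and Entity objects, while the sketch uses sets of resource names and plain labels. In the output above the node pairs appear to be ordered by identifier, and the sketch assumes this to make the key independent of argument order. The identifiers and labels come from the output above; `ExampleDB` is a hypothetical resource.

```python
class ToyNetwork:
    """Interactions keyed by node pairs, with a node index kept in sync."""

    def __init__(self):
        self.interactions = {}  # (id_a, id_b) -> set of resource names
        self.nodes = {}         # identifier -> label

    @staticmethod
    def _key(id_a, id_b):
        # Sorting makes (a, b) and (b, a) refer to the same interaction.
        return tuple(sorted((id_a, id_b)))

    def add(self, id_a, id_b, resource, labels):
        """Add evidence; a repeated pair is merged, not duplicated."""
        key = self._key(id_a, id_b)
        self.interactions.setdefault(key, set()).add(resource)
        for node in key:
            self.nodes[node] = labels[node]

    def remove(self, id_a, id_b):
        """Remove an interaction and any node left without interactions."""
        self.interactions.pop(self._key(id_a, id_b), None)
        in_use = {node for pair in self.interactions for node in pair}
        for node in (id_a, id_b):
            if node not in in_use:
                self.nodes.pop(node, None)


LABELS = {'P48995': 'TRPC1', 'Q12791': 'KCNMA1', 'Q08209': 'PPP3CA'}

net = ToyNetwork()
net.add('P48995', 'Q12791', 'TRIP', LABELS)
net.add('P48995', 'Q08209', 'TRIP', LABELS)
net.add('Q12791', 'P48995', 'ExampleDB', LABELS)  # same pair, reversed
net.remove('P48995', 'Q08209')
print(sorted(net.nodes), net.interactions)
```

Loading a second resource set into an existing network works along the same lines: evidence for an already known pair is merged into the existing entry, and only new pairs and new nodes increase the counts.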

An interaction between a pair of entities can be accessed:

[27]:
n.interaction('EGF', 'EGFR')

executed in 0ms, finished 14:06:27 2022-12-02

[27]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>

Similarly, individual nodes can be looked up:

[28]:
n.entity('EGFR')

executed in 0ms, finished 14:06:29 2022-12-02

[28]:
<Entity: EGFR>

Labels (gene symbols for proteins by default), identifiers (such as UniProt IDs) and Entity objects can be used to refer to nodes. Each node carries some basic information:

[29]:
egfr = n.entity('EGFR')
egfr.identifier, egfr.label, egfr.entity_type, egfr.id_type, egfr.taxon

executed in 0ms, finished 14:06:32 2022-12-02

[29]:
('P00533', 'EGFR', 'protein', 'uniprot', 9606)

Interactions feature a number of methods to access various information, such as their types, direction, effect, resources, references, etc. The very same methods are also available for the whole network. Below we only show a few examples of these methods.

[30]:
ia = n.interaction('EGF', 'EGFR')
ia

executed in 0ms, finished 14:06:34 2022-12-02

[30]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>
[31]:
ia.get_resource_names()

executed in 0ms, finished 14:06:47 2022-12-02

[31]:
{'Baccin2019',
 'CellTalkDB',
 'HPMR',
 'ICELLNET',
 'LRdb',
 'SIGNOR',
 'SPIKE',
 'SignaLink3',
 'connectomeDB2020'}
[32]:
ia.get_references()

executed in 0ms, finished 14:06:50 2022-12-02

[32]:
{<Reference: 10085134>,
 <Reference: 10209155>,
 <Reference: 10788520>,
 <Reference: 12093292>,
 <Reference: 12297050>,
 <Reference: 12620237>,
 <Reference: 12648462>,
 <Reference: 15620700>,
 <Reference: 16274239>,
 <Reference: 17145710>,
 <Reference: 19531499>,
 <Reference: 20458382>,
 <Reference: 21071413>,
 <Reference: 23331499>,
 <Reference: 3494473>,
 <Reference: 6289330>,
 <Reference: 8639530>}

This is a valid direction for this interaction:

[33]:
ia.get_direction(('EGF', 'EGFR'))

executed in 0ms, finished 14:06:53 2022-12-02

[33]:
True

The opposite direction is not supported by any of the resources:

[34]:
ia.get_direction(('EGFR', 'EGF'))

executed in 0ms, finished 14:06:55 2022-12-02

[34]:
False

However, some resources provide no direction information; these are classified as “undirected”:

ia.get_direction('undirected')

We can check which resources are those exactly:

[35]:
ia.get_direction('undirected', sources = True)

executed in 0ms, finished 14:07:23 2022-12-02

[35]:
{'HPMR', 'SPIKE'}

Effect signs (stimulation, inhibition) are available in a similar way. The first of the two Boolean values means stimulation (activation), the second one inhibition.

[36]:
ia.get_sign(('EGF', 'EGFR'))

executed in 0ms, finished 14:07:25 2022-12-02

[36]:
[True, False]

Which resources support the effect signs:

[37]:
ia.get_sign(('EGF', 'EGFR'), sources = True)

executed in 0ms, finished 14:07:28 2022-12-02

[37]:
[{'SIGNOR', 'SPIKE', 'SignaLink3'}, set()]
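One way to picture the direction and sign queries above is a set of resources stored per direction and per sign. The toy class below follows that idea; it is a simplified sketch and not pypath's Interaction class. The resource names are the ones shown in the outputs above, and only those: resources that support the direction without a sign are left out.

```python
class ToyInteraction:
    """Evidence for one pair of nodes, grouped by direction and sign."""

    def __init__(self, a, b):
        self.direction = {(a, b): set(), (b, a): set(), 'undirected': set()}
        self.positive = {(a, b): set(), (b, a): set()}
        self.negative = {(a, b): set(), (b, a): set()}

    def add_evidence(self, resource, direction = 'undirected', sign = None):
        if sign is not None and direction == 'undirected':
            raise ValueError('An effect sign needs a direction.')
        self.direction[direction].add(resource)
        if sign == 'positive':
            self.positive[direction].add(resource)
        elif sign == 'negative':
            self.negative[direction].add(resource)

    def get_direction(self, direction, sources = False):
        """Is the direction supported? With sources: by which resources?"""
        resources = self.direction[direction]
        return set(resources) if sources else bool(resources)

    def get_sign(self, direction, sources = False):
        """[stimulation, inhibition] as Booleans or as sets of resources."""
        signs = (self.positive[direction], self.negative[direction])
        return [set(s) if sources else bool(s) for s in signs]


toy = ToyInteraction('EGF', 'EGFR')

for resource in ('SIGNOR', 'SPIKE', 'SignaLink3'):
    toy.add_evidence(resource, direction = ('EGF', 'EGFR'), sign = 'positive')

for resource in ('HPMR', 'SPIKE'):
    toy.add_evidence(resource)  # no direction information

print(toy.get_direction(('EGF', 'EGFR')), toy.get_sign(('EGF', 'EGFR')))
```

Keeping the resources rather than plain Boolean flags is what makes the `sources = True` variants possible: the Boolean answer is simply whether the corresponding set is non-empty.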

Many methods start by get_..., such as:

[38]:
ia.get_interaction_types()

executed in 0ms, finished 14:07:30 2022-12-02

[38]:
{'post_translational'}

Others are called ..._by_..., these combine two get_... methods:

[39]:
ia.references_by_resource()

executed in 0ms, finished 14:07:32 2022-12-02

[39]:
{'ICELLNET': {<Reference: 8639530>},
 'SIGNOR': {<Reference: 12297050>, <Reference: 12648462>},
 'SignaLink3': {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 19531499>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 'Baccin2019': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'LRdb': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'SPIKE': {<Reference: 12297050>,
  <Reference: 17145710>,
  <Reference: 20458382>,
  <Reference: 3494473>},
 'CellTalkDB': {<Reference: 12093292>},
 'connectomeDB2020': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'HPMR': {<Reference: 6289330>}}

All these methods accept the same filtering parameters. For example, if you are interested only in certain resources, you can restrict the query to those. The two resources below provide no positive sign interaction:

[40]:
ia.get_interactions_positive(resources = {'ICELLNET', 'HPMR'})

executed in 0ms, finished 14:07:39 2022-12-02

[40]:
()

While some other resources do:

[41]:
ia.get_interactions_positive(resources = {'SignaLink3'})

executed in 0ms, finished 14:07:42 2022-12-02

[41]:
((<Entity: EGF>, <Entity: EGFR>),)

Or see the references that do or do not provide effect sign:

[42]:
ia.get_references(effect = True), ia.get_references(effect = False)

executed in 0ms, finished 14:07:44 2022-12-02

[42]:
({<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 12297050>,
  <Reference: 12648462>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 12648462>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 17145710>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>,
  <Reference: 3494473>,
  <Reference: 6289330>,
  <Reference: 8639530>})
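The `..._by_...` grouping and the `resources` filter described above can be sketched in a few lines. The function below is an illustration and not pypath's code: it assumes the evidence is available as plain (resource, PubMed ID) pairs. The pairs listed are a subset of the `references_by_resource` output shown earlier.

```python
from collections import defaultdict

# A subset of the (resource, PubMed ID) pairs from the output above.
EVIDENCES = [
    ('ICELLNET', 8639530),
    ('SIGNOR', 12297050),
    ('SIGNOR', 12648462),
    ('CellTalkDB', 12093292),
    ('HPMR', 6289330),
]


def references_by_resource(evidences, resources = None):
    """Group references by resource, optionally keeping only some resources."""
    grouped = defaultdict(set)
    for resource, pmid in evidences:
        if resources is None or resource in resources:
            grouped[resource].add(pmid)
    return dict(grouped)


def get_references(evidences, resources = None):
    """All references, pooled over the selected resources."""
    grouped = references_by_resource(evidences, resources = resources)
    return set().union(*grouped.values())


print(references_by_resource(EVIDENCES, resources = {'SIGNOR', 'HPMR'}))
print(get_references(EVIDENCES, resources = {'ICELLNET'}))
```

Applying the filter before the grouping, as done here, lets every query share the same filtering step, which is one plausible reason why all methods can accept the same parameters.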

Network in pandas.DataFrame§

Contents of a pypath.core.network.Network object can be exported to a pandas.DataFrame:

[1]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu.make_df()
cu.df

executed in 23.41s, finished 15:24:19 2022-12-03

[1]:
id_a id_b type_a type_b directed effect type dmodel sources references
0 P48995 Q12791 protein protein False 0 post_translational {activity_flow} {TRIP} NaN
1 P48995 Q08209 protein protein False 0 post_translational {activity_flow} {TRIP} NaN
2 P0DP23 P48995 protein protein True -1 post_translational {activity_flow} {TRIP} NaN
3 P0DP25 P48995 protein protein True -1 post_translational {activity_flow} {TRIP} NaN
4 P0DP24 P48995 protein protein True -1 post_translational {activity_flow} {TRIP} NaN
... ... ... ... ... ... ... ... ... ... ...
44033 Q14289 Q9ULZ3 protein protein True 0 post_translational {enzyme_substrate} {iPTMnet} NaN
44034 P54646 Q9Y2I7 protein protein True 0 post_translational {enzyme_substrate} {iPTMnet} NaN
44035 Q9BXM7 Q9Y2N7 protein protein True 0 post_translational {enzyme_substrate} {iPTMnet} NaN
44036 P49137 Q9Y385 protein protein True 0 post_translational {enzyme_substrate} {iPTMnet} NaN
44037 Q9UHC7 P04637 protein protein True 0 post_translational {enzyme_substrate} {iPTMnet} NaN

44038 rows × 10 columns

The pypath.omnipath.export module offers independent and more flexible interfaces for building network data frames. These are also used to build the tables served by the web server.

[12]:
from pypath import omnipath
from pypath.omnipath import export

cu = omnipath.db.get_db('curated')
e = export.Export(cu)
e.make_df(unique_pairs = False)
e.df

executed in 22.65s, finished 19:20:12 2023-03-10

[12]:
source target source_genesymbol target_genesymbol is_directed is_stimulation is_inhibition consensus_direction consensus_stimulation consensus_inhibition sources references
0 P48995 Q12791 TRPC1 KCNMA1 0 0 0 0 0 0 TRIP TRIP:19168436;TRIP:25139746
1 P48995 Q08209 TRPC1 PPP3CA 0 0 0 0 0 0 TRIP TRIP:23228564
2 P0DP23 P48995 CALM1 TRPC1 1 0 1 1 0 1 TRIP TRIP:11290752;TRIP:11983166;TRIP:12601176
3 P0DP25 P48995 CALM3 TRPC1 1 0 1 1 0 1 TRIP TRIP:11290752;TRIP:11983166;TRIP:12601176
4 P0DP24 P48995 CALM2 TRPC1 1 0 1 1 0 1 TRIP TRIP:11290752;TRIP:11983166;TRIP:12601176
... ... ... ... ... ... ... ... ... ... ... ... ...
36729 Q14289 Q9ULZ3 PTK2B PYCARD 1 0 0 0 0 0 iPTMnet iPTMnet:27796369
36730 P54646 Q9Y2I7 PRKAA2 PIKFYVE 1 0 0 0 0 0 iPTMnet iPTMnet:24070423
36731 Q9BXM7 Q9Y2N7 PINK1 HIF3A 1 0 0 0 0 0 iPTMnet iPTMnet:27551449
36732 P49137 Q9Y385 MAPKAPK2 UBE2J1 1 0 0 0 0 0 iPTMnet iPTMnet:24020373
36733 Q9UHC7 P04637 MKRN1 TP53 1 0 0 0 0 0 iPTMnet iPTMnet:19536131

36734 rows × 12 columns
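In the table above, the references column packs resource names and PubMed IDs into a single string such as `TRIP:19168436;TRIP:25139746`. For downstream work it is often handy to unpack these cells. The helper below is a hypothetical convenience function, not part of pypath, and it assumes that the `resource:pmid` items are separated by semicolons as in the rows shown.

```python
def parse_references(cell):
    """Turn 'TRIP:19168436;TRIP:25139746' into {'TRIP': ['19168436', ...]}.

    Missing values (e.g. NaN or an empty string) give an empty dict.
    """
    by_resource = {}
    if not isinstance(cell, str) or not cell:
        return by_resource
    for item in cell.split(';'):
        # Split at the last colon: PubMed IDs never contain one.
        resource, _, pmid = item.rpartition(':')
        by_resource.setdefault(resource, []).append(pmid)
    return by_resource


print(parse_references('TRIP:19168436;TRIP:25139746'))
print(parse_references('iPTMnet:27796369'))
```

With pandas, such a function can be applied to the whole column, for example with `e.df.references.apply(parse_references)`.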

The data frame built for the web service includes even more details. Using the extra_node_attrs and extra_edge_attrs arguments of the Export object, you can fully customise these data frames.

[13]:
e.webservice_interactions_df()
e.df

executed in 21.99s, finished 19:22:51 2023-03-10

[13]:
source target source_genesymbol target_genesymbol is_directed is_stimulation is_inhibition consensus_direction consensus_stimulation consensus_inhibition ... dorothea_tfbs dorothea_coexp dorothea_level type curation_effort extra_attrs ncbi_tax_id_source entity_type_source ncbi_tax_id_target entity_type_target
0 P48995 Q12791 TRPC1 KCNMA1 0 0 0 0 0 0 ... None None post_translational 2 {"TRIP_method":["Co-immunoprecipitation","Co-i... 9606 protein 9606 protein
1 P48995 Q08209 TRPC1 PPP3CA 0 0 0 0 0 0 ... None None post_translational 1 {"TRIP_method":["Co-immunoprecipitation"]} 9606 protein 9606 protein
2 P0DP23 P48995 CALM1 TRPC1 1 0 1 1 0 1 ... None None post_translational 3 {"TRIP_method":["Fluorescence probe labeling",... 9606 protein 9606 protein
3 P0DP25 P48995 CALM3 TRPC1 1 0 1 1 0 1 ... None None post_translational 3 {"TRIP_method":["Fluorescence probe labeling",... 9606 protein 9606 protein
4 P0DP24 P48995 CALM2 TRPC1 1 0 1 1 0 1 ... None None post_translational 3 {"TRIP_method":["Fluorescence probe labeling",... 9606 protein 9606 protein
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36729 Q14289 Q9ULZ3 PTK2B PYCARD 1 0 0 0 0 0 ... None None post_translational 1 {} 9606 protein 9606 protein
36730 P54646 Q9Y2I7 PRKAA2 PIKFYVE 1 0 0 0 0 0 ... None None post_translational 1 {} 9606 protein 9606 protein
36731 Q9BXM7 Q9Y2N7 PINK1 HIF3A 1 0 0 0 0 0 ... None None post_translational 1 {} 9606 protein 9606 protein
36732 P49137 Q9Y385 MAPKAPK2 UBE2J1 1 0 0 0 0 0 ... None None post_translational 1 {} 9606 protein 9606 protein
36733 Q9UHC7 P04637 MKRN1 TP53 1 0 0 0 0 0 ... None None post_translational 1 {} 9606 protein 9606 protein

36734 rows × 34 columns
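The extra_attrs column in the output above looks like JSON-encoded text, for example `{"TRIP_method":["Co-immunoprecipitation"]}`. Assuming that the cells are indeed JSON strings, they can be decoded with the standard `json` module. The helper name is made up for this sketch, and the two example cells are the complete ones visible in the table; the truncated cells are not reproduced.

```python
import json


def values_from_extra_attrs(cell, key):
    """Decode one extra_attrs cell and return the list stored under `key`."""
    attrs = json.loads(cell) if cell else {}
    return attrs.get(key, [])


cell_with_method = '{"TRIP_method":["Co-immunoprecipitation"]}'
empty_cell = '{}'

print(values_from_extra_attrs(cell_with_method, 'TRIP_method'))
print(values_from_extra_attrs(empty_cell, 'TRIP_method'))
```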

Self interactions (loop edges) in the network§

Depending on the downstream application, loops might be beneficial or undesired. By default loops are disabled, but are enabled for OmniPath and the GRN networks among the built-in network databases. The allow_loops parameter can be set at the module level or at the instance level. If set at the module level, it will be valid for all subsequently created instances:

[14]:
from pypath.share import settings
settings.setup(network_allow_loops = True)

executed in 0ms, finished 19:32:52 2023-03-10

If set at the instance level, it will be valid for the instance:

[15]:
from pypath.core import network
n = network.Network(allow_loops = True)

executed in 0ms, finished 19:33:44 2023-03-10

If you want to keep loops only for certain resources, first load the resources where loops should be removed, then load the resources where you wish to keep the loops:

[30]:
from pypath.core import network
from pypath import resources

co = resources.get_controller()
pw = co.collect_network('pathway')
gr = co.collect_network('dorothea', interaction_types = 'transcriptional')

n = network.Network(pw, allow_loops = False)
n.load(gr, allow_loops = True)
n.count_loops()

executed in 2m 24.45s, finished 19:56:41 2023-03-10

[30]:
149
[32]:
n.count_interactions_by_interaction_type()

executed in 16.50s, finished 19:59:10 2023-03-10

[32]:
{'post_translational': 33571, 'transcriptional': 281262}
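The effect of the per-load `allow_loops` option can be illustrated on a plain dict of node pairs. The sketch below is independent of pypath: the `load` and `count_loops` functions only share their names with the methods above, and all records are invented.

```python
def load(interactions, records, allow_loops):
    """Add (source, target, resource) records, optionally skipping loops."""
    for source, target, resource in records:
        if source == target and not allow_loops:
            continue
        interactions.setdefault((source, target), set()).add(resource)


def count_loops(interactions):
    """Number of interactions where both partners are the same node."""
    return sum(1 for source, target in interactions if source == target)


# Invented records: each set contains one self interaction.
pathway_records = [('A', 'B', 'pathway_db'), ('A', 'A', 'pathway_db')]
grn_records = [('TF1', 'A', 'grn_db'), ('TF1', 'TF1', 'grn_db')]

interactions = {}
load(interactions, pathway_records, allow_loops = False)
load(interactions, grn_records, allow_loops = True)

print(count_loops(interactions), len(interactions))
```

Only the loop from the second set survives, which corresponds to the example above where the loops come from the transcriptional resource loaded with `allow_loops = True`.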

Molecular complexes in the network§

Currently pypath supports protein complexes; other kinds of components, such as small molecules and nucleic acids, will be supported soon. Complexes are represented by pypath.internals.intera.Complex objects, and can be network nodes. These objects optionally carry information about the defining resources, references, stoichiometry and custom attributes. Apart from the components and resources, none of these is mandatory. For more information, see the Protein complexes section in this notebook. Here we only show how complexes are included in networks.

The Network object either represents each complex as a node (default behaviour), or expands the complex by creating a node for each of its components and applying all the interactions of the complex to all of its components. The latter method has adverse effects on network topology, and can be enabled by setting network_expand_complexes to True. Only a few resources list interactions of protein complexes, for example SIGNOR, CollecTRI, Guide to Pharmacology and CellPhoneDB. Let’s load such a resource:

[1]:
from pypath.core import network
from pypath.resources import network as netres

n = network.Network(netres.collectri)

executed in 38.12s, finished 20:35:23 2023-03-27

We can retrieve various information about the complexes in the network, e.g. count them:

[2]:
n.count_complexes()

executed in 1.45s, finished 20:37:11 2023-03-27

[2]:
33

Or list them:

[3]:
n.get_complexes()

executed in 1.50s, finished 20:37:34 2023-03-27

[3]:
{<Entity: FOS_JUN>,
 <Entity: FOS_JUNB>,
 <Entity: FOS_JUND>,
 <Entity: JUN>,
 <Entity: FOSL1_JUN>,
 <Entity: FOSL2_JUN>,
 <Entity: JUN_JUNB>,
 <Entity: JUN_JUND>,
 <Entity: FOSB_JUN>,
 <Entity: FOSL1_JUNB>,
 <Entity: FOSL1_JUND>,
 <Entity: FOSL2_JUNB>,
 <Entity: FOSL2_JUND>,
 <Entity: JUNB>,
 <Entity: JUNB_JUND>,
 <Entity: FOSB_JUNB>,
 <Entity: JUND>,
 <Entity: FOSB_JUND>,
 <Entity: NFKB1>,
 <Entity: NFKB1_NFKB2>,
 <Entity: NFKB1_RELB>,
 <Entity: NFKB1_RELA>,
 <Entity: NFKB1_REL>,
 <Entity: NFKB2>,
 <Entity: NFKB2_RELB>,
 <Entity: NFKB2_RELA>,
 <Entity: NFKB2_REL>,
 <Entity: RELB>,
 <Entity: RELA_RELB>,
 <Entity: REL_RELB>,
 <Entity: RELA>,
 <Entity: REL_RELA>,
 <Entity: REL>}

In the network, these are Entity objects, and their identifier attribute is the Complex object:

[4]:
cplex_entity = list(n.get_complexes())[0]
cplex_entity

executed in 1.40s, finished 20:39:53 2023-03-27

[4]:
<Entity: REL_RELA>
[6]:
cplex = cplex_entity.identifier
cplex

executed in 0ms, finished 20:40:32 2023-03-27

[6]:
Complex: COMPLEX:Q04206_Q04864

When creating a data frame, the Complex objects are placed in the identifier cells, which contain UniProt IDs in the case of single proteins. The labels are the gene symbols of the components, separated by underscores by default.

[8]:
from pypath.omnipath import export
from pypath.internals import intera

e = export.Export(n)
e.make_df(unique_pairs = False)
e.df[[isinstance(s, intera.Complex) for s in e.df.source]]

executed in 9.65s, finished 20:44:06 2023-03-27

[8]:
source target source_genesymbol target_genesymbol is_directed is_stimulation is_inhibition consensus_direction consensus_stimulation consensus_inhibition sources references
1 (P17535, P15407) P04040 FOSL1_JUND CAT 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:10022519;CollecTRI:10329043;CollecTR...
2 (P05412, P15408) P04040 FOSL2_JUN CAT 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:10022519;CollecTRI:10329043;CollecTR...
3 (P05412, P15407) P04040 FOSL1_JUN CAT 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:10022519;CollecTRI:10329043;CollecTR...
4 (P05412, P17275) P04040 JUN_JUNB CAT 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:10022519;CollecTRI:10329043;CollecTR...
5 (P17275, P17535) P04040 JUNB_JUND CAT 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:10022519;CollecTRI:10329043;CollecTR...
... ... ... ... ... ... ... ... ... ... ... ... ...
54980 (P17535, P01100) P01270 FOS_JUND PTH 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:9989817
54981 (P17275, P15408) P01270 FOSL2_JUNB PTH 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:9989817
54982 (P05412, P53539) P01270 FOSB_JUN PTH 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:9989817
54983 (P17275, P15407) P01270 FOSL1_JUNB PTH 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:9989817
54984 (P17275) P01270 JUNB PTH 1 1 0 1 1 0 CollecTRI;ExTRI_CollecTRI CollecTRI:9989817

23235 rows × 12 columns

For some reason, pandas shows the Complex objects as tuples. The cells still contain Complex objects, as we can see by accessing one of them directly:

[10]:
e.df[[isinstance(s, intera.Complex) for s in e.df.source]].source.iloc[0]

executed in 0ms, finished 20:45:07 2023-03-27

[10]:
Complex: COMPLEX:P15407_P17535
[12]:
e.webservice_interactions_df()

executed in 41.08s, finished 20:48:51 2023-03-27

[13]:
e.df

executed in 0ms, finished 20:50:14 2023-03-27

[13]:
source target source_genesymbol target_genesymbol is_directed is_stimulation is_inhibition consensus_direction consensus_stimulation consensus_inhibition ... dorothea_tfbs dorothea_coexp dorothea_level type curation_effort extra_attrs ncbi_tax_id_source entity_type_source ncbi_tax_id_target entity_type_target
0 P01106 O14746 MYC TERT 1 1 0 1 1 0 ... None None transcriptional 75 {} 9606 protein 9606 protein
1 (P17535, P15407) P04040 FOSL1_JUND CAT 1 1 0 1 1 0 ... None None transcriptional 14 {} 9606 complex 9606 protein
2 (P05412, P15408) P04040 FOSL2_JUN CAT 1 1 0 1 1 0 ... None None transcriptional 14 {} 9606 complex 9606 protein
3 (P05412, P15407) P04040 FOSL1_JUN CAT 1 1 0 1 1 0 ... None None transcriptional 14 {} 9606 complex 9606 protein
4 (P05412, P17275) P04040 JUN_JUNB CAT 1 1 0 1 1 0 ... None None transcriptional 14 {} 9606 complex 9606 protein
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67945 Q01196 Q13094 RUNX1 LCP2 1 1 0 1 1 0 ... None None transcriptional 1 {} 9606 protein 9606 protein
67946 Q01196 Q6MZQ0 RUNX1 PRR5L 1 1 0 1 1 0 ... None None transcriptional 1 {} 9606 protein 9606 protein
67947 Q15672 P08151 TWIST1 GLI1 1 1 0 1 1 0 ... None None transcriptional 1 {} 9606 protein 9606 protein
67948 P22415 Q5SRE5 USF1 NUP188 1 1 0 1 1 0 ... None None transcriptional 1 {} 9606 protein 9606 protein
67949 Q9UQR1 Q5VYX0 ZNF148 RNLS 1 1 0 1 1 0 ... None None transcriptional 1 {} 9606 protein 9606 protein

67950 rows × 34 columns

When we export to CSV, the Complex objects are converted to the string notation familiar from the OmniPath web service. See for example COMPLEX:P15407_P17535 below, and its human readable label FOSL1_JUND in the gene symbols column:

[15]:
e.df[[ets == 'complex' for ets in e.df.entity_type_source]].to_csv(index = False)[:1000]

executed in 0ms, finished 20:55:26 2023-03-27

[15]:
'source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,omnipath,kinaseextra,ligrecextra,pathwayextra,mirnatarget,dorothea,tf_target,lncrna_mrna,tf_mirna,small_molecule,dorothea_curated,dorothea_chipseq,dorothea_tfbs,dorothea_coexp,dorothea_level,type,curation_effort,extra_attrs,ncbi_tax_id_source,entity_type_source,ncbi_tax_id_target,entity_type_target\nCOMPLEX:P15407_P17535,P04040,FOSL1_JUND,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_CollecTRI,CollecTRI:10022519;CollecTRI:10329043;CollecTRI:12036993;CollecTRI:12538496;CollecTRI:17935786;CollecTRI:7489329;CollecTRI:7651432;CollecTRI:7818486;CollecTRI:8867782;CollecTRI:9030359;CollecTRI:9136992;CollecTRI:9142914;CollecTRI:9168892;CollecTRI:9687385,False,False,False,False,False,False,False,False,False,False,,,,,,transcriptional,14,{},9606,complex,9606,protein\nCOMPLEX:P05412_P15408,P04040,FOSL2_JUN,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_C
Output truncated: showing 1000 of 1004 characters
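When working with the exported CSV outside of pypath, the string notation can be split back into component identifiers. The sketch below assumes the COMPLEX: prefix and the underscore separator seen in the output above; the complex_components helper is hypothetical and is not part of pypath.

```python
def complex_components(complex_id, prefix = 'COMPLEX:', sep = '_'):
    """
    Split a complex identifier string into its component IDs.
    """

    if not complex_id.startswith(prefix):

        raise ValueError('Not a complex identifier: %s' % complex_id)

    return complex_id[len(prefix):].split(sep)


components = complex_components('COMPLEX:P15407_P17535')
print(components)
```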

Translating identifiers§

The pypath.utils.mapping module is for ID translation. Most of the time you can simply call the map_name function:

[1]:
from pypath.utils import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')

executed in 1.38s, finished 12:31:45 2023-03-21

[1]:
{'EGFR'}

By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous, and you want to retrieve only one ID. The map_name0 function always returns a single string; in case of ambiguity, it returns a random element of the resulting set:

[5]:
mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')

executed in 0ms, finished 14:17:31 2022-12-02

[5]:
'Q9BY60'
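The difference between the two return types can be sketched with a toy table. Everything below is a stand-in written for this illustration: the table, the made-up ambiguous entry and both functions are assumptions, and returning None for unknown IDs is also an assumption. Sorting before picking an element is one way to get a reproducible choice where a random one is not acceptable.

```python
# toy translation table; the second entry is made up to show ambiguity
toy_table = {
    'P00533': {'EGFR'},
    'X00001': {'GENE_A', 'GENE_B'},
}


def toy_map_name(name):
    """Return all translations as a set (empty if unknown)."""

    return toy_table.get(name, set())


def toy_map_name0(name):
    """Return a single translation, or None if there is none."""

    result = toy_map_name(name)

    # sorted() makes the choice reproducible in this sketch
    return sorted(result)[0] if result else None


print(toy_map_name('X00001'))
print(toy_map_name0('X00001'))
```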

Molecules have a large variety of identifiers, but in pypath two identifier types are special:

  • The primary identifier defines the molecule category, e.g. if UniProt is the primary identifier for proteins, then a protein is anything that has a UniProt ID

  • The label is a human readable identifier; for proteins it is the gene symbol

The primary ID and label types are configured for each molecule type (protein, miRNA, drug, etc.) in the module settings. The mapping module provides shortcuts to translate between these identifiers: label, id_from_label and id_from_label0 (the latter returns a single ID instead of a set).

[6]:
mapping.label('O75385')

executed in 0ms, finished 14:17:33 2022-12-02

[6]:
'ULK1'
[7]:
mapping.id_from_label('ULK1')

executed in 0ms, finished 14:17:35 2022-12-02

[7]:
{'O75385'}
[8]:
mapping.id_from_label0('ULK1')

executed in 0ms, finished 14:17:37 2022-12-02

[8]:
'O75385'

Multiple IDs can be translated in one call. However, the result is a single set, so it is not possible to tell with certainty which output corresponds to which input.

[9]:
mapping.map_names(['ULK1', 'EGFR', 'SMAD2'], 'genesymbol', 'uniprot')

executed in 0ms, finished 14:17:40 2022-12-02

[9]:
{'O75385', 'P00533', 'Q15796'}
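When the correspondence between inputs and outputs matters, one option is to translate the names one by one and keep the results in a dict. The map_each helper below is a hypothetical convenience function, not part of pypath; the toy table stands in for the real translation so that the sketch runs without any download.

```python
def map_each(names, translate):
    """
    Translate each name separately, keeping track of the input.
    `translate` is any function returning a set of IDs for one name.
    """

    return {name: translate(name) for name in names}


# stand-in for the real ID translation
toy_table = {
    'ULK1': {'O75385'},
    'EGFR': {'P00533'},
    'SMAD2': {'Q15796'},
}

by_input = map_each(
    ['ULK1', 'EGFR', 'SMAD2'],
    lambda name: toy_table.get(name, set()),
)
print(by_input)
```

With pypath loaded, the translate argument could be lambda name: mapping.map_name(name, 'genesymbol', 'uniprot').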

The default organism is defined in the module settings; it is human by default. Translating for other organisms requires the ncbi_tax_id argument. Most functions in pypath also accept common or Latin names, but for efficiency map_name accepts only numeric taxon IDs. Let’s translate a mouse identifier:

[10]:
mapping.map_name('Smad2', 'genesymbol', 'uniprot', ncbi_tax_id = 10090)

executed in 0ms, finished 14:17:44 2022-12-02

[10]:
{'Q62432'}

If no direct translation table is available between two ID types, pypath will try to translate via an intermediate ID type.

[11]:
mapping.map_name('8408', 'entrez', 'genesymbol')

executed in 0ms, finished 14:17:46 2022-12-02

[11]:
{'ULK1'}

Behind the scenes the chain_map function is called:

[12]:
m = mapping.get_mapper()
m.chain_map('8408', id_type = 'entrez', target_id_type = 'genesymbol', by_id_type = 'uniprot')

executed in 0ms, finished 14:17:47 2022-12-02

[12]:
{'ULK1'}

And the procedure corresponds to the following:

[13]:
mapping.map_names(
    mapping.map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

executed in 0ms, finished 14:17:49 2022-12-02

[13]:
{'ULK1'}

Pre-defined ID translation tables§

A number of mapping tables are pre-defined. These load automatically on demand, and are removed from memory if not used for some time (5 minutes by default). New mapping tables are saved directly into pickle files in the cache for a quick reload. Tables are either organism specific (hence loaded for each organism one by one) or not organism specific, such as drug IDs (in this case pypath uses the integer 0 in place of the numeric NCBI Taxonomy ID). The identifier translation data is retrieved from the following sources:

  • UniProt legacy API (main UniProt API until autumn 2022): internals.input_formats.UniprotMapping

  • UniProt uploadlists API (also outdated, replaced by the new UniProt API): internals.input_formats.UniprotListMapping

  • Ensembl Biomart: internals.input_formats.BiomartMapping and internals.input_formats.ArrayMapping (for microarray probes)

  • Protein Ontology Consortium: internals.input_formats.ProMapping

  • UniChem: internals.input_formats.UnichemMapping

  • Arbitrary files: internals.input_formats.FileMapping (this class is used to process data from miRBase, some files from the UniProt FTP site, and also user defined, custom cases)

  • RaMP: internals.input_formats.RampMapping

  • HMDB: internals.input_formats.HmdbMapping
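The life cycle of the tables mentioned at the beginning of this section (load on demand, remove after a period without use) can be sketched as follows. This is only an illustration of the idea under stated assumptions: the ExpiringTables class, its method names and the toy loader are hypothetical, and pypath's Mapper is not implemented this way line by line. The clock is injected so that the example runs without waiting.

```python
import time


class ExpiringTables:
    """
    Loads tables on first use, removes those not used for `lifetime`
    seconds when `cleanup` is called.
    """

    def __init__(self, loader, lifetime = 300, clock = time.monotonic):

        self.loader = loader
        self.lifetime = lifetime
        self.clock = clock
        self.tables = {}
        self.last_used = {}

    def get(self, key):

        if key not in self.tables:

            self.tables[key] = self.loader(key)

        # every access refreshes the timestamp
        self.last_used[key] = self.clock()

        return self.tables[key]

    def cleanup(self):

        now = self.clock()
        expired = [
            key
            for key, used in self.last_used.items()
            if now - used > self.lifetime
        ]

        for key in expired:

            del self.tables[key]
            del self.last_used[key]

        return expired


fake_time = [0.0]
loaded = []


def toy_loader(key):

    loaded.append(key)
    return {'P00533': {'EGFR'}}


cache = ExpiringTables(toy_loader, lifetime = 300, clock = lambda: fake_time[0])
key = ('uniprot', 'genesymbol', 9606)
cache.get(key)            # first use: the table is loaded
fake_time[0] = 301.0      # more than 5 minutes pass without use
removed = cache.cleanup() # the table is removed from memory
cache.get(key)            # next use: the table is loaded again
print(removed, len(loaded))
```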

Some of the classes above are instantiated in internals.maps, but most of the instances are created on the fly when loading a mapping table in utils.mapping.MapReader. This latter class is responsible for taking a table definition and loading a utils.mapping.MappingTable instance. The whole process is managed by utils.mapping.Mapper; this is the object that all ID translation queries are dispatched to. It has a method to list the defined ID translation tables:

[3]:
mapping.mapping_tables()

executed in 0ms, finished 12:32:06 2023-03-21

[3]:
[MappingTableDefinition(id_type_a='embl', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(embl)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(PREFERRED)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol-syn', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(ALTERNATIVE)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='entrez', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(geneid)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='hgnc', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(HGNC)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='refseqp', id_type_b='uniprot', resource='uniprot', input_cl
Output truncated: showing 1000 of 29850 characters

Pypath uses synonyms to refer to ID types: these are intended to be short, clear and lowercase for ease of use. Most of the synonyms are defined in internals.input_formats, in the AC_QUERY, AC_MAPPING, BIOMART_MAPPING, PRO_MAPPING and ARRAY_MAPPING dictionaries. UniChem ID types are used exactly as provided by UniChem. To list all available ID types (in the output below, pypath is the synonym used in pypath and original is the name in the original resource):

[4]:
mapping.id_types()

executed in 0ms, finished 12:32:14 2023-03-21

[4]:
{IdType(pypath='CAS', original='CAS'),
 IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
 IdType(pypath='MedChemExpress', original='MedChemExpress'),
 IdType(pypath='actor', original='actor'),
 IdType(pypath='affy', original='affy'),
 IdType(pypath='affymetrix', original='affymetrix'),
 IdType(pypath='agilent', original='agilent'),
 IdType(pypath='alzforum', original='Alzforum_mut'),
 IdType(pypath='araport', original='Araport'),
 IdType(pypath='atlas', original='atlas'),
 IdType(pypath='bigg', original='bigg'),
 IdType(pypath='bindingdb', original='bindingdb'),
 IdType(pypath='biocyc', original='biocyc'),
 IdType(pypath='brenda', original='brenda'),
 IdType(pypath='carotenoiddb', original='carotenoiddb'),
 IdType(pypath='cas', original='CAS'),
 IdType(pypath='cas', original='cas_registry_number'),
 IdType(pypath='cas_id', original='CAS'),
 IdType(pypath='cgnc', original='CGNC'),
 IdType(pypath='chebi', original='chebi'),
 IdType(pypath='chembl', original='chembl'),
 IdType(pypath='ch
Output truncated: showing 1000 of 8561 characters

Direct access to ID translation tables§

The Mapper (or the mapping module) is able to return ID translation tables as dicts or data frames:

[5]:
tbl = mapping.translation_dict('uniprot', 'genesymbol')
tbl

executed in 0ms, finished 12:33:55 2023-03-21

[5]:
<MappingTable from=uniprot, to=genesymbol, taxon=9606 (20243 IDs)>
[7]:
'P00533' in tbl

executed in 0ms, finished 12:34:16 2023-03-21

[7]:
True
[8]:
tbl['P00533']

executed in 0ms, finished 12:34:25 2023-03-21

[8]:
{'EGFR'}
[9]:
'EGFR' in tbl

executed in 0ms, finished 12:34:33 2023-03-21

[9]:
False
[10]:
list(tbl.items())[:10]

executed in 0ms, finished 12:34:50 2023-03-21

[10]:
[('Q00604', {'NDP'}),
 ('Q9HB19', {'PLEKHA2'}),
 ('Q16718', {'NDUFA5'}),
 ('P55769', {'SNU13'}),
 ('Q92886', {'NEUROG1'}),
 ('Q6T4R5', {'NHS'}),
 ('P80188', {'LCN2'}),
 ('Q86XR2', {'FAM129C'}),
 ('Q5T2W1', {'PDZK1'}),
 ('Q9BSH3', {'NICN1'})]

The same table as data frame:

[12]:
mapping.translation_df('uniprot', 'genesymbol')

executed in 0ms, finished 12:35:18 2023-03-21

[12]:
uniprot genesymbol
0 Q00604 NDP
1 Q9HB19 PLEKHA2
2 Q16718 NDUFA5
3 P55769 SNU13
4 Q92886 NEUROG1
... ... ...
20375 Q96L92 SNX27
20376 Q9UNH6 SNX7
20377 Q5VWJ9 SNX30
20378 Q9BZZ2 SIGLEC1
20379 Q96BD0 SLCO4A1

20380 rows × 2 columns
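The translation dict is directional, which is why 'EGFR' in tbl is False above: the keys are UniProt IDs and the values are sets of gene symbols. Should a lookup in the opposite direction be needed from such a dict, it can be reversed in a few lines. The reverse_table helper below is a hypothetical utility, and the toy table uses two entries from the output above in place of the full table.

```python
from collections import defaultdict


def reverse_table(table):
    """
    Reverse a dict of sets: values become keys and keys become values.
    """

    reversed_table = defaultdict(set)

    for source_id, target_ids in table.items():

        for target_id in target_ids:

            reversed_table[target_id].add(source_id)

    return dict(reversed_table)


toy_tbl = {'P00533': {'EGFR'}, 'Q00604': {'NDP'}}
by_genesymbol = reverse_table(toy_tbl)
print(by_genesymbol['EGFR'])
```

Within pypath, requesting the table with the ID types swapped is probably the more direct route, as the translation works in both directions in the examples earlier.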

Orthology translation§

The utils.orthology module (formerly utils.homology) handles translation of data between organisms by orthologous gene pairs. Its most important function is translate. The source organism is human by default; the target must be provided. Below we use mouse (NCBI Taxonomy 10090):

[2]:
from pypath.utils import orthology
orthology.translate('P00533', target = 10090)

executed in 22.33s, finished 18:03:50 2023-09-28

[2]:
{'Q01279'}

ID translation and orthology translation are integrated, hence not only UniProt IDs can be translated:

[3]:
orthology.translate('EGFR', target = 10090, id_type = 'genesymbol')

executed in 22.08s, finished 18:04:16 2023-09-28

[3]:
{'Egfr'}

This module uses data from the Orthologous Matrix (OMA), NCBI HomoloGene and Ensembl. The latter covers more organisms, and accepts some parameters (high confidence, one-to-one vs. one-to-many mapping). By default only OMA is used, as it is the most comprehensive, up to date and easy to use resource. These parameters can be controlled by the settings module, or passed to the functions above and below, for example:

[8]:
orthology.translate('P00533', target = 10090, oma = False, homologene = False, ensembl = True, ensembl_hc = False, ensembl_types = 'one2one')

executed in 24.52s, finished 18:07:43 2023-09-28

[8]:
{'Q01279'}

Orthology translation tables as dictionaries§

The translation tables are available as dicts of sets, these are convenient for use outside of pypath:

[9]:
human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol')
human_mouse_genesymbols['EGFR']

executed in 0ms, finished 18:08:26 2023-09-28

[9]:
{'Egfr'}

The relationship types and confidence levels can be included using the full_records argument:

[11]:
human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols['EGFR']

executed in 0ms, finished 18:10:13 2023-09-28

[11]:
{OmaOrtholog(id='Egfr', rel_type='1:1', score=12704.5703125)}
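Full records make it possible to filter the orthologs, for example to keep only the one-to-one pairs. The sketch below works on a toy dict with the same shape as the output above. The OmaOrtholog namedtuple defined here is a stand-in with the field names seen in that output, and the one_to_one helper is hypothetical.

```python
from collections import namedtuple

# stand-in for the records returned with full_records = True
OmaOrtholog = namedtuple('OmaOrtholog', ['id', 'rel_type', 'score'])

toy_orthologs = {
    'EGFR': {OmaOrtholog('Egfr', '1:1', 12704.57)},
    'H4C3': {
        OmaOrtholog('H4c1', 'm:n', 1262.05),
        OmaOrtholog('H4c3', 'm:n', 1262.05),
    },
}


def one_to_one(table):
    """
    Keep only the orthologs with a 1:1 relationship; returns a dict
    of sets of plain identifiers.
    """

    result = {}

    for source, orthologs in table.items():

        kept = {o.id for o in orthologs if o.rel_type == '1:1'}

        if kept:

            result[source] = kept

    return result


print(one_to_one(toy_orthologs))
```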

Orthology translation data frames§

Similarly, pandas.DataFrames are available:

[13]:
human_mouse_genesymbols = orthology.get_df(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols

executed in 0ms, finished 18:11:16 2023-09-28

[13]:
source target rel_type score
0 H4C3 H4c1 m:n 1262.050049
1 H4C3 H4c3 m:n 1262.050049
2 H4C3 H4c12 m:n 1262.050049
3 H4C3 H4c11 m:n 1262.050049
4 H4C3 H4c9 m:n 1262.050049
... ... ... ... ...
18446 GDAP2 Gdap2 1:1 5553.779785
18447 ITGA8 Itga8 1:1 10772.969727
18448 SEMA3F Sema3f 1:1 9121.080078
18449 EEPD1 Eepd1 1:1 5874.350098
18450 DRG2 Drg2 1:1 4423.589844

18451 rows × 4 columns

Taxonomy§

Organisms matter everywhere in pypath: in the input, the output and the processing parts alike. For this reason we created a utility module to deal with translation of organism identifiers. We prefer NCBI Taxonomy IDs as the primary organism identifier. These are simple numbers: 9606 is human, 10090 is mouse, etc. Many databases use common English names or Latin (scientific) names. Other databases use custom codes, such as hsapiens in Ensembl (first letter of genus name + species name, without space, all lowercase), or hsa in miRBase and KEGG (first letter of genus name, first two letters of species name). The pypath.utils.taxonomy module features some convenient functions for handling all these names.
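The two code conventions described above are simple enough to write down as functions. This is a sketch of the naming rules only, with hypothetical function names; it is not how pypath translates these codes, and the databases themselves contain exceptions to the rules.

```python
def ensembl_code(latin_name):
    """
    First letter of the genus name + species name, all lowercase.
    """

    genus, species = latin_name.lower().split()[:2]

    return genus[0] + species


def three_letter_code(latin_name):
    """
    First letter of the genus name + first two letters of the
    species name, all lowercase.
    """

    genus, species = latin_name.lower().split()[:2]

    return genus[0] + species[:2]


print(ensembl_code('Homo sapiens'), three_letter_code('Homo sapiens'))
```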

Translating to NCBI Taxonomy, scientific names and common names§

The most often used is ensure_ncbi_tax_id, which returns the NCBI Taxonomy ID for any comprehensible input:

[21]:
from pypath.utils import taxonomy
taxonomy.ensure_ncbi_tax_id('human'), taxonomy.ensure_ncbi_tax_id('H sapiens'), taxonomy.ensure_ncbi_tax_id('hsapiens'), taxonomy.ensure_ncbi_tax_id(9606), taxonomy.ensure_ncbi_tax_id('Homo sapiens')

executed in 0ms, finished 14:18:22 2022-12-02

[21]:
(9606, 9606, 9606, 9606, 9606)

To access scientific names or common names:

[22]:
taxonomy.ensure_latin_name('cow')

executed in 0ms, finished 14:18:25 2022-12-02

[22]:
'Bos taurus'
[23]:
taxonomy.ensure_common_name('Erithacus rubecula')

executed in 0ms, finished 14:18:27 2022-12-02

[23]:
'European robin'

Organism from UniProt ID§

The uniprot_taxid function returns the taxonomy ID for a SwissProt ID. Unfortunately it does not work for TrEMBL IDs, as that would require keeping too much data in memory.

[24]:
taxonomy.ensure_latin_name(taxonomy.uniprot_taxid('P53104'))

executed in 1.19s, finished 14:18:30 2022-12-02

[24]:
'Saccharomyces cerevisiae'

UniProt§

UniProt is a huge, diverse resource that is essential for pypath as we use it as a reference set for proteomes and it provides ID translation data. Its input module pypath.inputs.uniprot is already more complex than an average input module. It harbors a little database manager that loads and unloads tables on demand, ensuring fast and convenient operation. Further services are available in the pypath.utils.uniprot module.

The UniProt input module§

All UniProt IDs for one organism§

The complete set of UniProt IDs for an organism is considered to be the proteome of the organism, and it is used in many procedures across pypath. All SwissProt IDs, all TrEMBL IDs or both together can be retrieved:

[119]:
from pypath.inputs import uniprot as iuniprot
(
    len(iuniprot.all_uniprots(organism = 10090)),
    len(iuniprot.all_swissprots(organism = 10090)),
    len(iuniprot.all_trembls(organism = 10090)),
)

executed in 3m 33.99s, finished 16:07:43 2022-12-02

[119]:
(86440, 17131, 69300)

UniProt ID format validation§

UniProt defines a format for its accessions. Any string can be checked against this template to tell whether it is possibly a valid ID:

[124]:
from pypath.inputs import uniprot as iuniprot
iuniprot.valid_uniprot('A0A8D0H0C2')

executed in 0ms, finished 16:17:41 2022-12-02

[124]:
True
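For reference, a format check of this kind can be written with a regular expression. The pattern below is the one UniProt documents for its accession numbers; whether valid_uniprot uses exactly this pattern is an assumption, and the looks_like_uniprot function is a hypothetical stand-in. Note that a format check says nothing about whether the ID actually exists.

```python
import re

# accession number format as documented by UniProt
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|'
    r'[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot(name):
    """
    Tell whether the whole string matches the UniProt accession format.
    """

    return bool(UNIPROT_AC.fullmatch(name))


print(looks_like_uniprot('A0A8D0H0C2'), looks_like_uniprot('EGFR'))
```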

UniProt ID validation§

Other functions check whether an ID actually exists in UniProt. These functions require loading the list of all UniProt IDs for the organism, hence the first call might take up to a few minutes (if a new download is necessary). Subsequent calls will be much faster.

[125]:
from pypath.inputs import uniprot as iuniprot
iuniprot.is_uniprot('P00533')

executed in 0ms, finished 16:17:44 2022-12-02

[125]:
True
[122]:
iuniprot.is_swissprot('P00533')

executed in 0ms, finished 16:14:14 2022-12-02

[122]:
True

If the organism doesn’t match:

[123]:
iuniprot.is_uniprot('P00533', organism = 10090)

executed in 0ms, finished 16:15:07 2022-12-02

[123]:
False

Single UniProt protein datasheet§

Raw contents of protein datasheets can be retrieved. The structure is a Python list of two-element tuples: the first element is the tag of the line, the second is the line content.

[126]:
from pypath.inputs import uniprot as iuniprot
iuniprot.protein_datasheet('P00533')

executed in 0ms, finished 16:18:06 2022-12-02

[126]:
[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
Output truncated: showing 1000 of 58080 characters
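As the same tag may occur on several lines, it is often convenient to group the lines by tag before extracting anything. The sketch below does so on a few lines copied from the output above; the group_by_tag helper is hypothetical, not a pypath function.

```python
from collections import defaultdict

# a few lines in the format returned by protein_datasheet
datasheet = [
    ('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
    ('AC', 'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
    ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
    ('OX', 'NCBI_TaxID=9606;'),
]


def group_by_tag(lines):
    """
    Collect the contents of the lines under their tags, keeping
    the original order within each tag.
    """

    grouped = defaultdict(list)

    for tag, content in lines:

        grouped[tag].append(content)

    return dict(grouped)


by_tag = group_by_tag(datasheet)

# all accessions of the record, primary and secondary
accessions = [
    ac.strip()
    for line in by_tag['AC']
    for ac in line.split(';')
    if ac.strip()
]
print(accessions)
```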

History of UniProt records§

[131]:
from pypath.inputs import uniprot as iuniprot
egfr_history = list(iuniprot.uniprot_history('P00533'))
egfr_history

executed in 0ms, finished 16:21:15 2022-12-02

[131]:
[UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='282', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_03', date='2022-08-03', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='281', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_02', date='2022-05-25', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='280', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_01', date='2022-02-23', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='279', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2021_04', date='2021-09-29', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='278', sequence_version='2', entry_name='EGFR_HUMAN', database='
Output truncated: showing 1000 of 50933 characters
[132]:
iuniprot.uniprot_recent_version('P00533')

executed in 0ms, finished 16:21:57 2022-12-02

[132]:
UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by='')
[133]:
iuniprot.uniprot_history_recent_datasheet('P00533')

executed in 1ms, finished 16:22:33 2022-12-02

[133]:
[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
Output truncated: showing 1000 of 58080 characters

The functions above are able to retrieve the latest datasheet of deleted UniProt records. However, they are slow as several queries are performed to process a single protein.

UniProt REST API§

UniProt deployed its new API in the autumn of 2022; since then pypath has fully transitioned to it. The API is accessed by the inputs.uniprot.uniprot_data and inputs.uniprot.uniprot_query functions, though for some purposes higher level functions are more convenient for the users. A list of fields can be passed to these functions. By default only SwissProt is queried. The output is a dict of dicts with fields as top level keys and UniProt IDs as second level keys. The results often contain notes, additional info in parentheses, and prefixes and postfixes for identifiers that are not needed in every situation. Using uniprot_preprocess instead of uniprot_data cleans up some of this clutter.

[1]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_data(fields = ('family', 'keywords', 'transmembrane'))

executed in 28.47s, finished 03:24:10 2023-11-16

[1]:
{'family': {'A0A087X1C5': 'Cytochrome P450 family',
  'A0A0B4J2F2': 'Protein kinase superfamily, CAMK Ser/Thr protein kinase family, AMPK subfamily',
  'A0A0K2S4Q6': 'CD300 family',
  'A0A1B0GTW7': 'Peptidase M8 family',
  'A0AV02': 'SLC12A transporter family',
  'A0AV96': 'RRM RBM47 family',
  'A0AVF1': 'IFT56 family',
  'A0AVI4': 'TMEM129 family',
  'A0AVK6': 'E2F/DP family',
  'A0AVT1': 'Ubiquitin-activating E1 family',
  'A0FGR8': 'Extended synaptotagmin family',
  'A0FGR9': 'Extended synaptotagmin family',
  'A0JLT2': 'Mediator complex subunit 19 family',
  'A0JP26': 'POTE family',
  'A0MZ66': 'Shootin family',
  'A0PJK1': 'Sodium:solute symporter (SSF) (TC 2.A.21) family',
  'A0PJY2': 'Krueppel C2H2-type zinc-finger protein family',
  'A0PK00': 'TMEM120 family',
  'A0PK11': 'Clarin family',
  'A1A4Y4': 'TRAFAC class dynamin-like GTPase superfamily, IRG family',
  'A1A519': 'FAM170 family',
  'A1A5B4': 'Anoctamin family',
  'A1A5C7': 'Major facilitator (TC 2.A.1) superfamily, Orga
Output truncated: showing 1000 of 510530 characters
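The output is organized by field first; for many purposes one record per protein is handier. The sketch below rearranges a toy dict of the same shape. The by_protein helper is hypothetical, the family values are copied from the output above, and the keyword value is made up.

```python
from collections import defaultdict

# same layout as the output above: field -> UniProt ID -> value
by_field = {
    'family': {
        'A0AV02': 'SLC12A transporter family',
        'A0AVK6': 'E2F/DP family',
    },
    'keywords': {
        'A0AV02': 'made-up keyword value',
    },
}


def by_protein(by_field):
    """
    Rearrange to UniProt ID -> field -> value; proteins lacking
    a field simply have no key for it.
    """

    records = defaultdict(dict)

    for field, values in by_field.items():

        for uniprot_id, value in values.items():

            records[uniprot_id][field] = value

    return dict(records)


records = by_protein(by_field)
print(records['A0AV02'])
```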

The inputs.uniprot.query_builder function builds queries for the API.

[2]:
from pypath.inputs import uniprot
uniprot.query_builder('kinase', organism_id = 9606)

executed in 0ms, finished 03:30:18 2023-11-16

[2]:
'kinase AND organism_id:9606'
[3]:
uniprot.query_builder(organism = [9606, 10090, 10116])

executed in 0ms, finished 03:30:49 2023-11-16

[3]:
'(organism_id:9606 OR organism_id:10090 OR organism_id:10116)'
[4]:
uniprot.query_builder({'organism_id': 9606, 'reviewed': True})

executed in 0ms, finished 03:31:22 2023-11-16

[4]:
'(organism_id:9606 AND reviewed:true)'
[5]:
uniprot.query_builder({'length': (500,), 'mass': (50000,), 'op': 'OR'})

executed in 0ms, finished 03:31:41 2023-11-16

[5]:
'(length:[500 TO *] OR mass:[50000 TO *])'
[6]:
uniprot.query_builder(lit_author = ['Huang', 'Kovac', '_AND'])

executed in 0ms, finished 03:32:21 2023-11-16

[6]:
'(lit_author:Huang AND lit_author:Kovac)'
[7]:
uniprot.query_builder({'organism_id': [9606, 10090], 'reviewed': True})

executed in 0ms, finished 03:32:41 2023-11-16

[7]:
'((organism_id:9606 OR organism_id:10090) AND reviewed:true)'
[8]:
uniprot.query_builder({'length': (100, None), 'organism_id': 9606})

executed in 0ms, finished 03:33:04 2023-11-16

[8]:
'(length:[100 TO *] AND organism_id:9606)'

The query parameters can be passed the same way to uniprot_data and uniprot_query. For example, to query records in one proteome:

[10]:
from pypath.inputs import uniprot
uniprot.uniprot_query(proteome = 'UP000004102')[:10]

executed in 0ms, finished 03:36:16 2023-11-16

[10]:
['D1YM56',
 'D1YMJ2',
 'D1YN32',
 'D1YNB3',
 'D1YPZ1',
 'D1YR07',
 'D1YR15',
 'D1YR93',
 'D1YRB4',
 'D1YRB7']

All these functionalities are performed by the pypath.inputs.uniprot.UniprotQuery class.

Processed UniProt annotations§

For a few important fields we have dedicated processing functions that aim to make their format cleaner and easier to use. Sometimes even these do an imperfect job, and certain fields are badly truncated or contain residual fragments of the stripped labels.

Note: All the data presented below is part of the OmniPath annotations database; the recommended way to access it is through the database manager.

[136]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_taxonomy()

executed in 1ms, finished 16:40:33 2022-12-02

[136]:
{'P00521': {'Abelson murine leukemia virus'},
 'P03333': {'Abelson murine leukemia virus'},
 'H8ZM73': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'H8ZM71': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'Q9MV51': {'Abies firma', 'Momi fir'},
 'O81086': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24474': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24475': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64404': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64405': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q948Z0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D1': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O22340': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7C9': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q5K3V1': {'Abies homolepis', 'Nikko fir'},
 'P21715': {'Abrothrix jelskii', 'Akodon jelskii', "Jelski's altiplano mouse"},
 'P11140': {'Abru
Output truncated: showing 1000 of 56985 characters
[139]:
iuniprot.uniprot_ncbi_taxids_2()

executed in 0ms, finished 16:42:33 2022-12-02

[139]:
{648330: Taxon(ncbi_id=648330, latin='Aedes albopictus densovirus (isolate Boublik/1994)', english='AalDNV', latin_synonym=None),
 10804: Taxon(ncbi_id=10804, latin='Adeno-associated virus 2', english='AAV-2', latin_synonym=None),
 648242: Taxon(ncbi_id=648242, latin='Adeno-associated virus 2 (isolate Srivastava/1982)', english='AAV-2', latin_synonym=None),
 118452: Taxon(ncbi_id=118452, latin='Abacion magnum', english='Millipede', latin_synonym=None),
 72259: Taxon(ncbi_id=72259, latin='Abaeis nicippe', english='Sleepy orange butterfly', latin_synonym='Eurema nicippe'),
 102642: Taxon(ncbi_id=102642, latin='Abax parallelepipedus', english='Ground beetle', latin_synonym=None),
 392897: Taxon(ncbi_id=392897, latin='Abalistes stellaris', english='Starry triggerfish', latin_synonym='Balistes stellaris'),
 75332: Taxon(ncbi_id=75332, latin='Abbottina rivularis', english='Chinese false gudgeon', latin_synonym='Gobio rivularis'),
 515833: Taxon(ncbi_id=515833, latin='Abdopus aculeatus', engl
Output truncated: showing 1000 of 118050 characters
[140]:
iuniprot.uniprot_locations()

executed in 0ms, finished 16:42:50 2022-12-02

[140]:
{'Q96EC8': {UniprotLocation(location='Golgi apparatus membrane', features=('Multi-pass membrane protein',))},
 'Q6ZMS4': {UniprotLocation(location='Nucleus', features=None)},
 'Q8N8L2': {UniprotLocation(location='Nucleus', features=None)},
 'Q15916': {UniprotLocation(location='Nucleus', features=None)},
 'Q3MIS6': {UniprotLocation(location='Nucleus', features=None)},
 'Q6P280': {UniprotLocation(location='Nucleus', features=None)},
 'Q969W1': {UniprotLocation(location='Endoplasmic reticulum membrane', features=('Multi-pass membrane protein',))},
 'O14978': {UniprotLocation(location='Nucleus', features=None)},
 'Q66K41': {UniprotLocation(location='Nucleus', features=None)},
 'Q15937': {UniprotLocation(location='Nucleus', features=None)},
 'Q9P2J8': {UniprotLocation(location='Nucleus', features=None)},
 'Q8ND82': {UniprotLocation(location='Nucleus', features=None)},
 'Q9NP64': {UniprotLocation(location='Nucleolus', features=None),
  UniprotLocation(location='Nucleus', features=None)},
 'P
Output truncated: showing 1000 of 143466 characters
[141]:
iuniprot.uniprot_keywords()

executed in 0ms, finished 16:43:06 2022-12-02

[141]:
{'P63120': {UniprotKeyword(keyword='Aspartyl protease'),
  UniprotKeyword(keyword='Autocatalytic cleavage'),
  UniprotKeyword(keyword='ERV'),
  UniprotKeyword(keyword='Hydrolase'),
  UniprotKeyword(keyword='Protease'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Ribosomal frameshifting'),
  UniprotKeyword(keyword='Transposable element')},
 'Q96EC8': {UniprotKeyword(keyword='Acetylation'),
  UniprotKeyword(keyword='Alternative splicing'),
  UniprotKeyword(keyword='Golgi apparatus'),
  UniprotKeyword(keyword='Membrane'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Transmembrane'),
  UniprotKeyword(keyword='Transmembrane helix')},
 'Q6ZMS4': {UniprotKeyword(keyword='Metal-binding'),
  UniprotKeyword(keyword='Nucleus'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Repeat'),
  UniprotKeyword(keyword='Zinc'),
  Unipro
Output truncated: showing 1000 of 445111 characters
[142]:
iuniprot.uniprot_families()

executed in 0ms, finished 16:43:22 2022-12-02

[142]:
{'P63120': {UniprotFamily(family='Peptidase A2', subfamily='HERV class-II K(HML-2)')},
 'Q96EC8': {UniprotFamily(family='YIP1', subfamily=None)},
 'Q6ZMS4': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8N8L2': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q3MIS6': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q86UK7': {UniprotFamily(family='ZNF598', subfamily=None)},
 'Q6P280': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q969W1': {UniprotFamily(family='DHHC palmitoyltransferase', subfamily=None)},
 'O14978': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q15937': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q9P2J8': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8IUH4': {UniprotFamily(family='DHHC palmitoyltransferase',
Output truncated: showing 1000 of 77892 characters
[143]:
iuniprot.uniprot_tissues()

executed in 1.12s, finished 16:43:55 2022-12-02

[143]:
{'Q15916': {UniprotTissue(tissue='Brain', level='high'),
  UniprotTissue(tissue='Wide', level='high')},
 'Q969W1': {UniprotTissue(tissue='Wide', level='undefined')},
 'O14978': {UniprotTissue(tissue='Brain', level='undefined'),
  UniprotTissue(tissue='Colon', level='undefined'),
  UniprotTissue(tissue='Heart', level='undefined'),
  UniprotTissue(tissue='Kidney', level='undefined'),
  UniprotTissue(tissue='Leukocyte', level='undefined'),
  UniprotTissue(tissue='Liver', level='undefined'),
  UniprotTissue(tissue='Lung', level='undefined'),
  UniprotTissue(tissue='Ovary', level='undefined'),
  UniprotTissue(tissue='Pancreas', level='undefined'),
  UniprotTissue(tissue='Placenta', level='undefined'),
  UniprotTissue(tissue='Prostate', level='undefined'),
  UniprotTissue(tissue='Skeletal muscle', level='undefined'),
  UniprotTissue(tissue='Small intestine', level='undefined'),
  UniprotTissue(tissue='Spleen', level='undefined'),
  UniprotTissue(tissue='Testis', level='undefined'),
  Uniprot
Output truncated: showing 1000 of 318790 characters
[144]:
iuniprot.uniprot_topology()

executed in 0ms, finished 16:44:13 2022-12-02

[144]:
{'Q96EC8': {UniprotTopology(topology='Cytoplasmic', start=2, end=84),
  UniprotTopology(topology='Cytoplasmic', start=137, end=146),
  UniprotTopology(topology='Cytoplasmic', start=206, end=212),
  UniprotTopology(topology='Lumenal', start=106, end=115),
  UniprotTopology(topology='Lumenal', start=168, end=184),
  UniprotTopology(topology='Lumenal', start=234, end=236),
  UniprotTopology(topology='Transmembrane', start=85, end=105),
  UniprotTopology(topology='Transmembrane', start=116, end=136),
  UniprotTopology(topology='Transmembrane', start=147, end=167),
  UniprotTopology(topology='Transmembrane', start=185, end=205),
  UniprotTopology(topology='Transmembrane', start=213, end=233)},
 'Q969W1': {UniprotTopology(topology='Cytoplasmic', start=1, end=77),
  UniprotTopology(topology='Cytoplasmic', start=138, end=198),
  UniprotTopology(topology='Cytoplasmic', start=288, end=377),
  UniprotTopology(topology='Lumenal', start=99, end=116),
  UniprotTopology(topology='Lumenal', start=220,
Output truncated: showing 1000 of 544230 characters
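
The functions above all return plain dictionaries keyed by UniProt accession, with sets of named tuples as values, so they can be processed with ordinary Python. As a minimal sketch, the snippet below lists the transmembrane segments of one protein. The UniprotTopology tuple is re-declared locally and the sample records are copied from the output above, so the example runs without pypath or any download; these stand-ins are assumptions for illustration. With pypath installed you would pass the dictionary returned by iuniprot.uniprot_topology() instead.

```python
from collections import namedtuple

# Local stand-in mirroring the records printed above (an assumption:
# the real class is defined inside pypath and may carry more behaviour).
UniprotTopology = namedtuple('UniprotTopology', ['topology', 'start', 'end'])

# A small excerpt of the output above, for Q96EC8 only.
topology = {
    'Q96EC8': {
        UniprotTopology('Cytoplasmic', 2, 84),
        UniprotTopology('Lumenal', 106, 115),
        UniprotTopology('Transmembrane', 85, 105),
        UniprotTopology('Transmembrane', 116, 136),
        UniprotTopology('Transmembrane', 147, 167),
        UniprotTopology('Transmembrane', 185, 205),
        UniprotTopology('Transmembrane', 213, 233),
    },
}


def transmembrane_segments(records):
    """Return the (start, end) pairs of transmembrane regions, sorted by position."""
    return sorted(
        (rec.start, rec.end)
        for rec in records
        if rec.topology == 'Transmembrane'
    )


segments = transmembrane_segments(topology['Q96EC8'])
print(len(segments), segments)
```

The same pattern (iterate over the set, filter on a field of the named tuple) should apply to the location, keyword, family and tissue dictionaries shown above.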

The UniProt utils module§

Datasheets§

The pypath.utils.uniprot module is an API around UniProt protein datasheets. It is not suitable for bulk retrieval: it downloads the datasheets one by one, so calling its bulk methods with more than a few dozen or a few hundred proteins might take minutes. To retrieve the full datasheets of one or more proteins, use query:

[153]:
from pypath.utils import uniprot
uniprot.query('P00533', 'O75385', 'Q14457')

executed in 1ms, finished 17:57:18 2022-12-02

[153]:
[<UniProt datasheet P00533 (EGFR)>,
 <UniProt datasheet O75385 (ULK1)>,
 <UniProt datasheet Q14457 (BECN1)>]
[154]:
ulk1 = uniprot.query('O75385')
ulk1

executed in 0ms, finished 17:57:58 2022-12-02

[154]:
<UniProt datasheet O75385 (ULK1)>

Many attributes are available on the datasheet objects; here are just a few examples:

[156]:
ulk1.weight, ulk1.length, ulk1.subcellular_location, ulk1.sequence

executed in 0ms, finished 17:59:18 2022-12-02

[156]:
(112631,
 1050,
 'Cytoplasm, cytosol. Preautophagosomal structure. Note=Under starvation conditions, is localized to puncate structures primarily representing the isolation membrane that sequesters a portion of the cytoplasm resulting in the formation of an autophagosome.',
 'MEPGRGGTETVGKFEFSRKDLIGHGAFAVVFKGRHREKHDLEVAVKCINKKNLAKSQTLLGKEIKILKELKHENIVALYDFQEMANSVYLVMEYCNGGDLADYLHAMRTLSEDTIRLFLQQIAGAMRLLHSKGIIHRDLKPQNILLSNPAGRRANPNSIRVKIADFGFARYLQSNMMAATLCGSPMYMAPEVIMSQHYDGKADLWSIGTIVYQCLTGKAPFQASSPQDLRLFYEKNKTLVPTIPRETSAPLRQLLLALLQRNHKDRMDFDEFFHHPFLDASPSVRKSPPVPVPSYPSSGSGSSSSSSSTSHLASPPSLGEMQQLQKTLASPADTAGFLHSSRDSGGSKDSSCDTDDFVMVPAQFPGDLVAEAPSAKPPPDSLMCSGSSLVASAGLESHGRTPSPSPPCSSSPSPSGRAGPFSSSRCGASVPIPVPTQVQNYQRIERNLQSPTQFQTPRSSAIRRSGSTSPLGFARASPSPPAHAEHGGVLARKMSLGGGRPYTPSPQVGTIPERPGWSGTPSPQGAEMRGGRSPRPGSSAPEHSPRTSGLGCRLHSAPNLSDLHVVRPKLPKPPTDPLGAVFSPPQASPPQPSHGLQSCRNLRGSPKLPDFLQRNPLPPILGSPTKAVPSFDFPKTPSSQNLLALLARQGVVMTPPRNRTLPDLSEVGPFHGQPLGPGLRPGEDPKGPFGRSFSTSRLTDLLLKAAFGTQAPDPGSTESLQEK
Output truncated: showing 1000 of 1329 characters

The collect function collects certain features for a set of proteins.

Warning: This is a really inefficient way of retrieving data from UniProt. If you work with more than a handful of proteins, go for pypath.inputs.uniprot_data instead.

[158]:
uniprot.collect(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

executed in 0ms, finished 18:02:29 2022-12-02

[158]:
OrderedDict([('ac', ['P00533', 'O75385', 'Q14457']),
             ('weight', [134277, 112631, 51896]),
             ('length', [1210, 1050, 450])])
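
The result of collect is column-oriented: one list per feature, all in the same order as the accessions. As a small sketch, the snippet below transposes it into one dictionary per protein. The OrderedDict literal is copied from the output above, so the example runs offline; the derived mass_per_residue value is only an illustration of working with the rows.

```python
from collections import OrderedDict

# The value returned by `uniprot.collect` above, copied here as a literal
# so that the example runs without network access.
collected = OrderedDict([
    ('ac', ['P00533', 'O75385', 'Q14457']),
    ('weight', [134277, 112631, 51896]),
    ('length', [1210, 1050, 450]),
])

# Transpose the column-wise lists into one dict per protein.
rows = [
    dict(zip(collected.keys(), values))
    for values in zip(*collected.values())
]

# Example use: average mass per residue (Da) for each protein.
mass_per_residue = {
    row['ac']: round(row['weight'] / row['length'], 1)
    for row in rows
}
print(rows[0])
print(mass_per_residue)
```

Since the result is a mapping of equal-length lists, passing it directly to pandas.DataFrame should also give a table with one row per protein.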

Tables§

UniProt data can be printed to the console in a tabular format:

[159]:
uniprot.print_features(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

executed in 0ms, finished 18:07:18 2022-12-02

╒═══════╤════════╤══════════╤══════════╕
│   No. │ ac     │   weight │   length │
╞═══════╪════════╪══════════╪══════════╡
│     1 │ P00533 │   134277 │     1210 │
├───────┼────────┼──────────┼──────────┤
│     2 │ O75385 │   112631 │     1050 │
├───────┼────────┼──────────┼──────────┤
│     3 │ Q14457 │    51896 │      450 │
╘═══════╧════════╧══════════╧══════════╛

There is a shortcut to print the essential characteristics of proteins as such a table. The info function is useful when you arrive at a set of proteins at some point in your analysis and want to quickly check what kind of proteins they are. To iterate through multiple groups of proteins, use utils.uniprot.browse. The columns and format of these tables can be customized by keyword arguments.

[160]:
uniprot.info(['P00533', 'O75385', 'Q14457'])

executed in 0ms, finished 18:09:45 2022-12-02

=====> [3 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ P00533 │ EGFR         │     1210 │   134277 │ Epidermal   │ Receptor     │ 3D-        │ Cell         │
│       │        │              │          │          │ growth      │ tyrosine     │ structure, │ membrane;    │
│       │        │              │          │          │ factor      │ kinase       │ Alternativ │ Single-      │
│       │        │              │          │          │ receptor    │
Output truncated: showing 1000 of 20254 characters

Sanitizing UniProt IDs§

It is important to know that the ID translation module always performs a number of checks when translating to UniProt IDs, unless the uniprot_cleanup parameter is disabled. It translates secondary IDs to primary ones, attempts to map TrEMBL IDs to SwissProt IDs by gene symbols, and removes IDs that belong to other organisms or have an invalid format. To exploit this behaviour, it is enough to map from UniProt to UniProt:

[162]:
from pypath.utils import mapping
mapping.map_name('Q9UQ28', 'uniprot', 'uniprot')

executed in 0ms, finished 18:20:02 2022-12-02

[162]:
{'O75385'}
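
To clean a whole list of IDs, the same call can be applied in a loop. The sketch below is one possible way to do it, not part of pypath: the sanitize helper and the demo_translate stand-in are hypothetical names introduced here, and the stand-in table only reproduces the single translation shown above (plus the assumption that a primary ID maps to itself and that a rejected ID yields an empty set). With pypath, the translate argument would be a function wrapping mapping.map_name(i, 'uniprot', 'uniprot').

```python
def sanitize(ids, translate):
    """
    Map each input ID to the set of cleaned IDs returned by `translate`.
    IDs that yield an empty result are collected separately.
    """
    cleaned, dropped = {}, []
    for _id in ids:
        result = set(translate(_id))
        if result:
            cleaned[_id] = result
        else:
            dropped.append(_id)
    return cleaned, dropped


# Hypothetical stand-in for the real translation, so the example runs offline.
# With pypath, one would wrap the call shown above instead:
#     def translate(i): return mapping.map_name(i, 'uniprot', 'uniprot')
_demo_table = {'Q9UQ28': {'O75385'}, 'O75385': {'O75385'}}


def demo_translate(i):
    return _demo_table.get(i, set())


cleaned, dropped = sanitize(['Q9UQ28', 'O75385', 'NOT_AN_ID'], demo_translate)
print(cleaned)
print(dropped)
```

Keeping the dropped IDs in a separate list makes it easy to check afterwards which inputs did not survive the cleanup.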

Enzyme-substrate interactions§

The database is an instance of the pypath.core.enz_sub.EnzymeSubstrateAggregator class. It is built with the default or current configuration by the core.enz_sub.get_db function.

Warning: it is recommended to access databases through the database manager. The code below takes a long time to run (about 8 minutes in the run shown here), and it neither saves nor reloads the database: it builds a fresh copy each time.

[25]:
from pypath.core import enz_sub
es = enz_sub.get_db()

executed in 8m 1.81s, finished 14:26:37 2022-12-02

Instead, let’s acquire the database from the manager:

[6]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')

executed in 7.27s, finished 15:37:33 2022-12-03

The database itself is stored as a dictionary (EnzymeSubstrateAggregator.enz_sub) with pairs of proteins as keys and a list of special objects representing enzyme-substrate interactions as values. These can be accessed by pairs of labels, identifiers or Entity objects, e.g. mTOR phosphorylates AKT1:

[27]:
es[('MTOR', 'AKT1')]

executed in 0ms, finished 14:40:55 2022-12-02

[27]:
[<MTOR => Residue AKT1-1:S473:phosphorylation [Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)]>,
 <MTOR => Residue AKT1-1:T450:phosphorylation [Evidences: HPRD, MIMP, PhosphoSite, ProtMapper, phosphoELM (0 references)]>,
 <MTOR => Residue AKT1-1:T308:phosphorylation [Evidences: ProtMapper, Sparser (1 references)]>]

Enzyme-substrate objects§

Let’s take a closer look at one of the enzyme-PTM relationships, represented by pypath.internals.intera.DomainMotif objects. Below some of the attributes are shown:

[28]:
e_ptm = es[('MTOR', 'AKT1')][0]
e_ptm.ptm.protein, e_ptm.ptm.protein.identifier, e_ptm.ptm.isoform, e_ptm.ptm.residue, e_ptm.ptm.residue.name, e_ptm.ptm.residue.number, e_ptm.ptm.typ, e_ptm.domain.protein

executed in 0ms, finished 14:40:57 2022-12-02

[28]:
(<Entity: AKT1>,
 'P31749',
 1,
 <Residue AKT1-1:S473>,
 'S',
 473,
 'phosphorylation',
 <Entity: MTOR>)

The resources and references are available in Evidences objects:

[29]:
e_ptm.evidences

executed in 0ms, finished 14:41:00 2022-12-02

[29]:
<Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)>
[30]:
e_ptm.evidences.get_resource_names()

executed in 0ms, finished 14:41:03 2022-12-02

[30]:
{'KEA', 'MIMP', 'PhosphoSite', 'ProtMapper', 'SIGNOR', 'dbPTM'}
[31]:
e_ptm.evidences.get_references()

executed in 0ms, finished 14:41:04 2022-12-02

[31]:
{<Reference: 14761976>,
 <Reference: 15047712>,
 <Reference: 15364915>,
 <Reference: 15718470>,
 <Reference: 15899889>,
 <Reference: 16221682>,
 <Reference: 17013611>,
 <Reference: 19844585>,
 <Reference: 20333297>,
 <Reference: 20489726>,
 <Reference: 21157483>,
 <Reference: 21592956>,
 <Reference: 23006971>,
 <Reference: 8978681>,
 <Reference: 9736715>}

Enzyme-substrate data frame§

The database object is able to export its contents into a pandas.DataFrame:

[7]:
es.make_df()
es.df

executed in 1.03s, finished 15:37:39 2022-12-03

[7]:
enzyme enzyme_genesymbol substrate substrate_genesymbol isoforms residue_type residue_offset modification sources references curation_effort
0 P31749 AKT1 P63104 YWHAZ 1 S 58 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite;PhosphoSit... HPRD:11956222;KEA:11956222;KEA:12861023;KEA:16... 11
1 P31749 AKT1 P63104 YWHAZ 1 S 184 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite_MIMP;phosp... HPRD:11956222;KEA:11956222;KEA:15071501 3
2 P45983 MAPK8 P63104 YWHAZ 1 S 184 phosphorylation HPRD;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;Phosph... HPRD:15696159;KEA:11956222;KEA:15071501;KEA:15... 9
3 P06493 CDK1 P11171 EPB41 1 S 712 phosphorylation HPRD_MIMP;MIMP;PhosphoSite_MIMP;ProtMapper;REA... ProtMapper:15525677;dbPTM:15525677;dbPTM:18220... 5
4 P06493 CDK1 P11171 EPB41 1;2;5;7 T 60 phosphorylation MIMP;PhosphoSite;PhosphoSite_MIMP;ProtMapper;R... ProtMapper:15525677;dbPTM:15525677;dbPTM:2171679 3
... ... ... ... ... ... ... ... ... ... ... ...
41421 P29597 TYK2 P51692 STAT5B 1 Y 699 phosphorylation KEA KEA:10830280;KEA:11751923;KEA:12411494 3
41422 Q06418 TYRO3 P19174 PLCG1 1;2 Y 771 phosphorylation KEA KEA:12601080;KEA:15144186;KEA:15592455;KEA:160... 8
41423 Q9H4A3 WNK1 Q8TAX0 OSR1 1 T 185 phosphorylation KEA KEA:18270262 1
41424 Q9H4A3 WNK1 Q96J92 WNK4 1;3 S 335 phosphorylation KEA KEA:15883153 1
41425 Q9NYL2 MAP3K20 Q92903 CDS1 1 T 68 phosphorylation KEA KEA:10973490 1

41426 rows × 11 columns
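
The sources and references columns are semicolon-separated strings, and each reference is prefixed by the resource it comes from. As a minimal sketch, the snippet below splits one such value into a dictionary of PubMed IDs per resource. The parse_references helper is a hypothetical name introduced here, and the input string is the references value of row 1 in the table above.

```python
from collections import defaultdict


def parse_references(field):
    """Split a 'Resource:PMID;Resource:PMID' string into {resource: {PMIDs}}."""
    by_resource = defaultdict(set)
    for item in field.split(';'):
        resource, _, pmid = item.partition(':')
        by_resource[resource].add(pmid)
    return dict(by_resource)


# The `references` value of row 1 in the table above (AKT1 -> YWHAZ, S184).
refs = parse_references('HPRD:11956222;KEA:11956222;KEA:15071501')

# Number of distinct resource-reference pairs and of distinct PubMed IDs.
n_pairs = sum(len(pmids) for pmids in refs.values())
n_pmids = len(set().union(*refs.values()))
print(refs)
print(n_pairs, n_pmids)
```

For this row the number of resource-reference pairs equals the value in the curation_effort column. Values ending in "..." in the table are only shortened by the display; the data frame itself holds the full strings.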

Protein sequences§

The APIs for sequences are very basic because we have never really needed them, but the fundamentals are probably there to build a nice, powerful API. Still, I don’t believe pypath will ever be strong in sequences; it is just not our main topic.

[186]:
from pypath.utils import homology
seqc = homology.SequenceContainer(preload_seq = [9606])
akt1 = seqc.get_seq('P31749')
akt1.get_region(start = 10, end = 19, isoform = 2)

executed in 0ms, finished 19:40:09 2022-12-02

[186]:
(10, 19, 'TFIIRCLQWT')
[187]:
from pypath.utils import seq
human_proteome = seq.swissprot_seq()
human_proteome

executed in 0ms, finished 19:44:52 2022-12-02

[187]:
{'P63120': <pypath.utils.seq.Seq at 0x689900d45cc0>,
 'Q96EC8': <pypath.utils.seq.Seq at 0x689908ea8f70>,
 'Q6ZMS4': <pypath.utils.seq.Seq at 0x689908eaa4a0>,
 'Q8N8L2': <pypath.utils.seq.Seq at 0x6899223538b0>,
 'Q15916': <pypath.utils.seq.Seq at 0x689922353c70>,
 'O60384': <pypath.utils.seq.Seq at 0x689922350730>,
 'Q3MIS6': <pypath.utils.seq.Seq at 0x689922353310>,
 'Q86UK7': <pypath.utils.seq.Seq at 0x689922353760>,
 'Q6P280': <pypath.utils.seq.Seq at 0x689922353190>,
 'Q969W1': <pypath.utils.seq.Seq at 0x689922350d90>,
 'O14978': <pypath.utils.seq.Seq at 0x689922353220>,
 'P61129': <pypath.utils.seq.Seq at 0x689922353370>,
 'Q66K41': <pypath.utils.seq.Seq at 0x6899223534f0>,
 'Q15937': <pypath.utils.seq.Seq at 0x689922350c70>,
 'Q9P2J8': <pypath.utils.seq.Seq at 0x689922351450>,
 'Q8ND82': <pypath.utils.seq.Seq at 0x689922353910>,
 'Q9NP64': <pypath.utils.seq.Seq at 0x6899223502b0>,
 'P98182': <pypath.utils.seq.Seq at 0x689922350280>,
 'Q8IUH4': <pypath.utils.seq.Seq at 0x68992235
Output truncated: showing 1000 of 53045 characters
[191]:
list(human_proteome['P00533'].findall('YGCT'))

executed in 0ms, finished 19:48:41 2022-12-02

[191]:
[SeqLookup(isoform=1, offset=625)]

Annotations§

This database provides various annotations about the function, structure, localization and many other properties of proteins and genes. The database is an instance of the pypath.core.annot.AnnotationTable class. It is built with the default or current configuration by the core.annot.get_db function.

Warning: it is recommended to access databases through the database manager. Building the database with the code below takes a long time, and it neither saves nor reloads the database: it builds a fresh copy each time.

[38]:
from pypath.core import annot
an = annot.get_db()
an

executed in 1ms, finished 15:07:08 2022-12-02

[38]:
<Annotation database: 3788067 records about 51636 entities from 78 resources>

Load a single annotation resource§

The annotations database is huge: on disk it takes up 1-2 GB of space, and it consists of 60-70 resources. However, these resources are not integrated with each other, and each can be loaded individually by its dedicated class in the core.annot module. This practice is recommended and will be better supported in the future. Let’s load one resource:

[8]:
from pypath.core import annot
cpad = annot.Cpad()
cpad

executed in 48.26s, finished 15:38:57 2022-12-03

[8]:
<CPAD annotations: 2308 records about 1358 entities>

The resulting object is derived from the AnnotationBase class. Its data is stored under the annot attribute, in a dict where identifiers are the keys and sets of annotation records are the values. The field names of the records are shown by the get_names method:

[35]:
cpad.get_names()

executed in 0ms, finished 15:06:45 2022-12-02

[35]:
('regulator_type',
 'effect_on_pathway',
 'pathway',
 'effect_on_cancer',
 'effect_on_cancer_outcome',
 'cancer',
 'pathway_category')

For each name we can list the possible values:

[36]:
cpad.get_values('cancer')

executed in 0ms, finished 15:06:47 2022-12-02

[36]:
{'Acute lymphoblastic leukemia (ALL) (precursor T lymphoblastic leukemia)',
 'Acute myeloid leukemia (AML)',
 'Basal cell carcinoma',
 'Bladder cancer',
 'Breast cancer',
 'Cervical cancer',
 'Cholangiocarcinoma',
 'Choriocarcinoma',
 'Chronic lymphocytic leukemia (CLL)',
 'Chronic myeloid leukemia (CML)',
 'Colorectal cancer',
 'Endometrial cancer',
 'Esophageal cancer',
 "Ewing's sarcoma",
 'Gallbladder cancer',
 'Gastric cancer',
 'Glioma',
 'Hepatocellular carcinoma',
 'Hodgkin lymphoma',
 'Infantile hemangioma',
 'Laryngeal cancer',
 'Malignant melanoma',
 'Malignant pleural mesothelioma',
 'Mantle cell lymphoma',
 'Multiple myeloma',
 'Nasopharyngeal cancer',
 'Neuroblastoma',
 'Non-small cell lung cancer',
 'Oral cancer',
 'Osteosarcoma',
 'Ovarian cancer',
 'Pancreatic cancer',
 'Pituitary adenomas',
 'Prostate cancer',
 'Renal cell carcinoma',
 'Small cell lung cancer',
 'Squamous cell carcinoma',
 'Synovial sarcoma',
 'Testicular cancer',
 'Thyroid cancer'}

The select method filters the annotated molecules based on their annotations. For example, 78 complexes, miRNAs and proteins are annotated as inhibiting colorectal cancer:

[37]:
cpad.select(cancer = 'Colorectal cancer', effect_on_cancer = 'Inhibiting')

executed in 0ms, finished 15:06:50 2022-12-02

[37]:
{'A6NDV4',
 Complex: COMPLEX:O14745,
 Complex: COMPLEX:O14862,
 Complex: COMPLEX:O15169_P25054,
 Complex: COMPLEX:O94813,
 Complex: COMPLEX:O94953,
 Complex: COMPLEX:P00533,
 Complex: COMPLEX:P06733,
 Complex Glucose transporter complex 1: COMPLEX:P11166,
 Complex: COMPLEX:P25054,
 Complex: COMPLEX:P40261,
 Complex: COMPLEX:P49327,
 Complex: COMPLEX:P54687,
 Complex PTEN phosphatase complex: COMPLEX:P60484,
 Complex: COMPLEX:Q01973,
 Complex: COMPLEX:Q12888,
 Complex: COMPLEX:Q13620,
 Complex: COMPLEX:Q96CX2,
 Complex: COMPLEX:Q99558,
 'MIMAT0000069',
 'MIMAT0000089',
 'MIMAT0000093',
 'MIMAT0000262',
 'MIMAT0000274',
 'MIMAT0000422',
 'MIMAT0000427',
 'MIMAT0000437',
 'MIMAT0000449',
 'MIMAT0000455',
 'MIMAT0000460',
 'MIMAT0000461',
 'MIMAT0000617',
 'MIMAT0003266',
 'MIMAT0003320',
 'O14745',
 'O14862',
 'O15169',
 'O75473',
 'O75888',
 'O76041',
 'O94813',
 'O94953',
 'P00533',
 'P06733',
 'P06756',
 'P11166',
 'P13631',
 'P22676',
 'P25054',
 'P25791',
 'P40261',
 'P49327',
 'P546
Output truncated: showing 1000 of 1279 characters
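
To make the logic of such a query concrete, here is a simplified stand-in written in plain Python. It is not pypath's implementation: the CpadRecord tuple and the select_ids function are hypothetical names, and the two sample records are taken from the CPAD data shown in this section. It only illustrates the idea that an entity is selected if at least one of its records matches all the given field values.

```python
from collections import namedtuple

# Local stand-in for a CPAD record; field names follow `cpad.get_names()`.
CpadRecord = namedtuple('CpadRecord', [
    'regulator_type', 'effect_on_pathway', 'pathway', 'effect_on_cancer',
    'effect_on_cancer_outcome', 'cancer', 'pathway_category',
])

# Same layout as the `annot` attribute: identifier -> set of records.
annot = {
    'Q16181': {CpadRecord(
        'protein', 'Upregulation', 'Actin cytoskeleton pathway',
        'Inhibiting', 'inhibit glioma cell migration', 'Glioma',
        'Regulation of actin cytoskeleton',
    )},
    'Q92600': {CpadRecord(
        'protein', 'Downregulation', 'Akt signaling pathway',
        'Inhibiting', 'suppress cell proliferation', 'Breast cancer',
        'PI3K-Akt signaling pathway',
    )},
}


def select_ids(annot, **criteria):
    """Identifiers having at least one record that matches all criteria."""
    return {
        _id
        for _id, records in annot.items()
        if any(
            all(getattr(rec, name) == value for name, value in criteria.items())
            for rec in records
        )
    }


inhibitors = select_ids(annot, effect_on_cancer='Inhibiting')
glioma = select_ids(annot, cancer='Glioma', effect_on_cancer='Inhibiting')
print(sorted(inhibitors), sorted(glioma))
```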

Load the full annotations database by the database manager§

Alternatively, the full annotations database can be accessed in the usual way:

[215]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an
[215]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>

The AnnotationTable object contains the resource specific annotation objects under the annots attribute:

[40]:
an.annots

executed in 0ms, finished 15:07:39 2022-12-02

[40]:
{'CellTypist': <CellTypist annotations: 927 records about 473 entities>,
 'Integrins': <Integrins annotations: 62 records about 62 entities>,
 'CellCellInteractions': <CellCellInteractions annotations: 5544 records about 4960 entities>,
 'PanglaoDB': <PanglaoDB annotations: 8479 records about 4813 entities>,
 'Lambert2018': <Lambert2018 annotations: 3281 records about 3277 entities>,
 'CancerSEA': <CancerSEA annotations: 2515 records about 1992 entities>,
 'Phobius': <Phobius annotations: 35382 records about 35382 entities>,
 'GO_Intercell': <GO_Intercell annotations: 48799 records about 18377 entities>,
 'MatrixDB': <MatrixDB annotations: 18127 records about 15903 entities>,
 'Surfaceome': <Surfaceome annotations: 3558 records about 3558 entities>,
 'Matrisome': <Matrisome annotations: 1514 records about 1514 entities>,
 'HPA_secretome': <HPA_secretome annotations: 3568 records about 3568 entities>,
 'HPMR': <HPMR annotations: 1748 records about 1695 entities>,
 'CPAD': <CPAD annotati
Output truncated: showing 1000 of 5842 characters

For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values, just like for CPAD above. As another example, let’s take a look at the Matrisome database:

[41]:
matrisome = an.annots['Matrisome']

executed in 0ms, finished 15:07:45 2022-12-02

[42]:
matrisome.get_names()

executed in 0ms, finished 15:07:49 2022-12-02

[42]:
('mainclass', 'subclass', 'subsubclass')
[43]:
matrisome.get_values('subclass')

executed in 0ms, finished 15:07:53 2022-12-02

[43]:
{'Collagens',
 'ECM Glycoproteins',
 'ECM Regulators',
 'ECM-affiliated Proteins',
 'Proteoglycans',
 'Secreted Factors',
 'n/a'}
[44]:
matrisome.get_subset(subclass = 'Collagens')

executed in 0ms, finished 15:07:56 2022-12-02

[44]:
{'A6NMZ7',
 'A8TX70',
 'B4DZ39',
 Complex Collagen type I homotrimer: COMPLEX:P02452,
 Complex HT_DM_Cluster278: COMPLEX:P02452_P02462_P08572_P29400_P53420_Q01955_Q02388_Q14031_Q17RW2_Q8NFW1,
 Complex Collagen type I trimer: COMPLEX:P02452_P08123,
 Complex Collagen type II trimer: COMPLEX:P02458,
 Complex Collagen type XI trimer variant 1: COMPLEX:P02458_P12107_P13942,
 Complex: COMPLEX:P02458_P20908_P25067,
 Complex: COMPLEX:P02458_P20908_P25067_P29400,
 Complex: COMPLEX:P02458_P25067_P29400,
 Complex Collagen type III trimer: COMPLEX:P02461,
 Complex: COMPLEX:P02462,
 Complex Collagen type IV trimer variant 1: COMPLEX:P02462_P08572,
 Complex Collagen type XI trimer variant 2: COMPLEX:P05997_P12107,
 Complex Collagen type XI trimer variant 3: COMPLEX:P05997_P12107_P20908,
 Complex Collagen type V trimer variant 1: COMPLEX:P05997_P20908,
 Complex Collagen type V trimer variant 2: COMPLEX:P05997_P20908_P25940,
 Complex: COMPLEX:P08572,
 Complex: COMPLEX:P12109_P12110,
 Complex Collagen
Output truncated: showing 1000 of 3072 characters

Load only selected annotations§

Another option is to load only certain annotation resources into an AnnotationTable object. We refer to the resources by class names. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:

[47]:
pathways = annot.AnnotationTable(
    protein_sources = (
        'SignalinkPathways',
        'KeggPathways',
        'NetpathPathways',
        'SignorPathways',
    ),
    complex_sources = (),
)
pathways

executed in 12.07s, finished 15:09:48 2022-12-02

[47]:
<Annotation database: 28745 records about 6762 entities from 4 resources>

The AnnotationTable object provides methods to query all resources together, or build a boolean array out of them. To see all annotations of one protein:

[48]:
pathways.all_annotations('P00533')

executed in 0ms, finished 15:10:17 2022-12-02

[48]:
[SignalinkPathway(pathway='Receptor tyrosine kinase'),
 SignalinkPathway(pathway='JAK/STAT'),
 KeggPathway(pathway='Proteoglycans in cancer'),
 KeggPathway(pathway='Regulation of actin cytoskeleton'),
 KeggPathway(pathway='Oxytocin signaling pathway'),
 KeggPathway(pathway='Phospholipase D signaling pathway'),
 KeggPathway(pathway='Pathways in cancer'),
 KeggPathway(pathway='Hepatocellular carcinoma'),
 KeggPathway(pathway='Colorectal cancer'),
 KeggPathway(pathway='Melanoma'),
 KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
 KeggPathway(pathway='Human papillomavirus infection'),
 KeggPathway(pathway='Pancreatic cancer'),
 KeggPathway(pathway='Non-small cell lung cancer'),
 KeggPathway(pathway='Central carbon metabolism in cancer'),
 KeggPathway(pathway='Endocytosis'),
 KeggPathway(pathway='Endometrial cancer'),
 KeggPathway(pathway='Choline metabolism in cancer'),
 KeggPathway(pathway='Bladder cancer'),
 KeggPathway(pathway='Parathyroid hormone synthesis, secretion
Output truncated: showing 1000 of 2540 characters
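
The returned list mixes records from several resources, and the type of each record tells which resource it comes from. As a sketch under that assumption, the snippet below counts the records per record type and picks the pathways with "cancer" in their name. The two named tuples are local stand-ins declared here, and the sample list is an excerpt of the output above.

```python
from collections import Counter, namedtuple

# Local stand-ins for the record types printed above.
SignalinkPathway = namedtuple('SignalinkPathway', ['pathway'])
KeggPathway = namedtuple('KeggPathway', ['pathway'])

# Excerpt of the output of `pathways.all_annotations('P00533')`.
records = [
    SignalinkPathway(pathway='Receptor tyrosine kinase'),
    SignalinkPathway(pathway='JAK/STAT'),
    KeggPathway(pathway='Proteoglycans in cancer'),
    KeggPathway(pathway='Regulation of actin cytoskeleton'),
    KeggPathway(pathway='Oxytocin signaling pathway'),
]

# Count the records by the name of their type.
per_type = Counter(type(rec).__name__ for rec in records)

# Pathway names that mention cancer, regardless of the resource.
cancer_related = sorted(
    rec.pathway for rec in records if 'cancer' in rec.pathway.lower()
)
print(per_type)
print(cancer_related)
```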

Data frames of annotations§

Data from annotation objects can be exported to a pandas.DataFrame:

[9]:
cpad.make_df()
cpad.df

executed in 0ms, finished 15:40:14 2022-12-03

[9]:
       uniprot         genesymbol   entity_type  source  label                     value                          record_id
0      Q16181          SEPT7        protein      CPAD    regulator_type            protein                        0
1      Q16181          SEPT7        protein      CPAD    effect_on_pathway         Upregulation                   0
2      Q16181          SEPT7        protein      CPAD    pathway                   Actin cytoskeleton pathway     0
3      Q16181          SEPT7        protein      CPAD    effect_on_cancer          Inhibiting                     0
4      Q16181          SEPT7        protein      CPAD    effect_on_cancer_outcome  inhibit glioma cell migration  0
...    ...             ...          ...          ...     ...                       ...                            ...
14396  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    cancer                    Hepatocellular carcinoma       2306
14397  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    effect_on_pathway         Upregulation                   2307
14398  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    pathway                   ERK signaling pathway          2307
14399  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    effect_on_cancer          Activating                     2307
14400  COMPLEX:P30990  COMPLEX:NTS  complex      CPAD    cancer                    Gastric cancer                 2307

14401 rows × 7 columns

The data frame has a long format. It can be converted to the more conventional wide format using standard pandas procedures (well, in tidyverse you would simply call tidyr::pivot_wider, in pandas you have to do an unintuitive sequence of 6 calls):

[10]:
index_cols = ['record_id', 'uniprot', 'genesymbol', 'label', 'entity_type']

(
    cpad.df.drop('source', axis=1).
    set_index(index_cols).
    unstack('label').
    droplevel(axis=1, level=0).
    reset_index().
    drop('record_id', axis=1)
)

executed in 0ms, finished 15:40:19 2022-12-03

[10]:
label uniprot genesymbol entity_type cancer effect_on_cancer effect_on_cancer_outcome effect_on_pathway pathway pathway_category regulator_type
0 Q16181 SEPT7 protein Glioma Inhibiting inhibit glioma cell migration Upregulation Actin cytoskeleton pathway Regulation of actin cytoskeleton protein
1 MIMAT0000431 hsa-miR-140 mirna Squamous cell carcinoma Inhibiting suppress tumor cell migration and invasion Upregulation ADAM10 mediated Notch1 signaling pathway Notch signaling pathway mirna
2 MIMAT0005886 hsa-miR-1297 mirna Prostate cancer Inhibiting inhibit proliferation and invasion Upregulation AEG1/Wnt signaling pathway Wnt signaling pathway mirna
3 Q9UP65 PLA2G4C protein Breast cancer Inhibiting inhibit EGF-induced chemotaxis Downregulation Akt signaling pathway PI3K-Akt signaling pathway protein
4 Q92600 CNOT9 protein Breast cancer Inhibiting suppress cell proliferation Downregulation Akt signaling pathway PI3K-Akt signaling pathway protein
... ... ... ... ... ... ... ... ... ... ...
2303 COMPLEX:P16422 COMPLEX:EPCAM complex Prostate cancer Inhibiting NaN Downregulation PI3K-Akt-mTOR signaling pathway NaN NaN
2304 COMPLEX:Q9Y6Y0 COMPLEX:IVNS1ABP complex Prostate cancer Inhibiting NaN Upregulation Akt signaling pathway NaN NaN
2305 COMPLEX:Q96CX2 COMPLEX:KCTD12 complex Colorectal cancer Inhibiting NaN Upregulation ERK signaling pathway NaN NaN
2306 COMPLEX:P30990 COMPLEX:NTS complex Hepatocellular carcinoma Activating NaN Upregulation Wnt/beta-catenin signaling pathway NaN NaN
2307 COMPLEX:P30990 COMPLEX:NTS complex Gastric cancer Activating NaN Upregulation ERK signaling pathway NaN NaN

2308 rows × 10 columns
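
What this reshaping does can be shown in plain Python on a few rows: every (record_id, uniprot) pair becomes one row, and each label becomes a column. The snippet below is only an illustration of the long-to-wide idea, not a replacement for the pandas code above; the sample rows are copied from the long data frame shown earlier, with some columns left out for brevity.

```python
from collections import defaultdict

# Long-format rows: (record_id, uniprot, label, value),
# copied from the first and last records of the CPAD data frame.
long_rows = [
    (0, 'Q16181', 'regulator_type', 'protein'),
    (0, 'Q16181', 'effect_on_pathway', 'Upregulation'),
    (0, 'Q16181', 'pathway', 'Actin cytoskeleton pathway'),
    (0, 'Q16181', 'effect_on_cancer', 'Inhibiting'),
    (2307, 'COMPLEX:P30990', 'effect_on_pathway', 'Upregulation'),
    (2307, 'COMPLEX:P30990', 'pathway', 'ERK signaling pathway'),
    (2307, 'COMPLEX:P30990', 'effect_on_cancer', 'Activating'),
    (2307, 'COMPLEX:P30990', 'cancer', 'Gastric cancer'),
]

# Group the values by record; each label becomes a key of the record.
wide = defaultdict(dict)
for record_id, uniprot, label, value in long_rows:
    wide[(record_id, uniprot)][label] = value

# One dict per record, with None where a label is missing
# (the counterpart of NaN in the pandas output).
labels = sorted({label for _, _, label, _ in long_rows})
table = [
    {'uniprot': uniprot, **{lab: fields.get(lab) for lab in labels}}
    for (record_id, uniprot), fields in sorted(wide.items())
]
for row in table:
    print(row)
```

In pandas itself, DataFrame.pivot with a list of index columns may offer a shorter route than the chain above, depending on the pandas version.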

Inter-cellular signaling roles§

pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway annotations from SIGNOR and SignaLink won’t be merged automatically. However, with the pypath.core.annot.CustomAnnotation class anyone can do it. For inter-cellular communication categories, the pypath.core.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidences. The database is an instance of the IntercellAnnotation class, and the build is executed by the pypath.core.intercell.get_db function.

Warning: it is recommended to access databases through the database manager. Building the database with the code below takes a long time, and it neither saves nor reloads the database: it builds a fresh copy each time.

[53]:
from pypath.core import intercell
ic = intercell.get_db() # this takes quite some time
                       # unless you load annotations from a pickle cache
ic

executed in 0ms, finished 15:13:03 2022-12-02

[53]:
<Intercell annotations: 310033 records about 43617 entities>
[11]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

executed in 2m 55.47s, finished 15:43:27 2022-12-03

[11]:
<Intercell annotations: 301527 records about 48570 entities>

This object stores its data under the classes attribute. Classes are defined in pypath.core.intercell_annot.annot_combined_classes. In addition, we manually revised and excluded some proteins from the more generic classes, these are listed in pypath.core.intercell_annot.excludes. Each class has the following properties:

  • name: all lowercase, human understandable name, without repeating the parent class (e.g. WNT receptors will be simply wnt, and the parent class will be receptor)

  • parent: for a specific class the parent is the generic category it belongs to; for generic classes the name and parent are the same

  • resource: the resource the data comes from, or OmniPath for composite classes (combined from multiple resources)

  • scope: specific or generic; e.g. TGF ligand is specific, ligand is generic

  • aspect: locational (e.g. plasma membrane) or functional (e.g. transporter)

Read more about the design of the intercell database in our paper.

[55]:
ic.classes

executed in 0ms, finished 15:16:54 2022-12-02

[55]:
{AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_location'): <AnnotationGroup `transmembrane` from UniProt_location, 5150 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_topology'): <AnnotationGroup `transmembrane` from UniProt_topology, 5760 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_keyword'): <AnnotationGroup `transmembrane` from UniProt_keyword, 7041 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane_predicted', resource='Phobius'): <AnnotationGroup `transmembrane` from Phobius, 6444 elements>,
 AnnotDefKey(name='transmembrane_phobius', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_phobius` from Almen2009, 2072 elements>,
 AnnotDefKey(name='transmembrane_sosui', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_sosui` from Almen2009, 1663 elements>,
 AnnotDefKey(name='trans
Output truncated: showing 1000 of 143945 characters

An easy way to access the classes is the select method. The AnnotationGroup objects behave as plain Python sets, and besides that, they feature many further attributes and methods.

[56]:
gaba_receptors = ic.select('gaba', parent = 'receptor')
gaba_receptors

executed in 0ms, finished 15:17:00 2022-12-02

[56]:
<AnnotationGroup `gaba` from HGNC, 40 elements>
[57]:
gaba_receptors.members

executed in 0ms, finished 15:17:02 2022-12-02

[57]:
{'A8MPY1',
 Complex GABA-A receptor (GABRA1, GABRB2, GABRD): COMPLEX:O14764_P14867_P47870,
 Complex GABA-A receptor, alpha-4/beta-3/delta: COMPLEX:O14764_P28472_P48169,
 Complex GABA-A receptor, alpha-6/beta-3/delta: COMPLEX:O14764_P28472_Q16445,
 Complex GABA-A receptor, alpha-4/beta-2/delta: COMPLEX:O14764_P47870_P48169,
 Complex GABA-A receptor, alpha-6/beta-2/delta: COMPLEX:O14764_P47870_Q16445,
 Complex GABBR1-GABBR2 complex: COMPLEX:O75899_Q9UBS5,
 Complex: COMPLEX:P14867,
 Complex GABA-A receptor, alpha-1/beta-3/gamma-2: COMPLEX:P14867_P18507_P28472,
 Complex GABA-A receptor (GABRA1, GABRB2, GABRG2): COMPLEX:P14867_P18507_P47870,
 Complex GABA-A receptor, alpha-5/beta-3/gamma-2: COMPLEX:P18507_P28472_P31644,
 Complex GABA-A receptor, alpha-3/beta-3/gamma-2: COMPLEX:P18507_P28472_P34903,
 Complex GABA-A receptor, alpha-2/beta-3/gamma-2: COMPLEX:P18507_P28472_P47869,
 Complex GABA-A receptor, alpha-6/beta-3/gamma-2: COMPLEX:P18507_P28472_Q16445,
 Complex: COMPLEX:P18507_Q8N1C3,
 C
Output truncated: showing 1000 of 1368 characters

Build an intercellular communication network§

The intercell database can be connected to a Network object to create an intercellular communication network:

[58]:
cu = omnipath.db.get_db('curated')
ic.register_network(cu)

executed in 0ms, finished 15:17:08 2022-12-02
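
One way to think about such a network is as a filter over interactions: keep those that point from an entity with a transmitter role to an entity with a receiver role. The sketch below illustrates this idea only, with hypothetical names and data; it makes no claim about what register_network does internally.

```python
# Hypothetical interactions (source, target) and role annotations
interactions = [
    ('LIGAND_A', 'RECEPTOR_X'),
    ('LIGAND_A', 'KINASE_K'),
    ('RECEPTOR_X', 'KINASE_K'),
    ('LIGAND_B', 'RECEPTOR_Y'),
]
transmitters = {'LIGAND_A', 'LIGAND_B'}
receivers = {'RECEPTOR_X', 'RECEPTOR_Y'}


def intercell_edges(edges, senders, targets):
    """Keep the edges pointing from a transmitter to a receiver."""

    return [
        (src, tgt)
        for src, tgt in edges
        if src in senders and tgt in targets
    ]


print(intercell_edges(interactions, transmitters, receivers))
```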

Quantitative overview of intercell annotations§

A data frame with basic statistics is available:

[13]:
ic.counts_df()

executed in 0ms, finished 15:45:17 2022-12-03

[13]:
category parent database scope aspect source consensus_score transmitter receiver secreted plasma_membrane_transmembrane plasma_membrane_peripheral n_uniprot
0 transmembrane transmembrane UniProt_location generic locational resource_specific 6 False False False True False 5150
1 transmembrane transmembrane UniProt_topology generic locational resource_specific 6 False False False True False 5760
2 transmembrane transmembrane UniProt_keyword generic locational resource_specific 1 False False False False False 7041
3 transmembrane transmembrane_predicted Phobius generic locational resource_specific 1 False False False False False 6444
4 transmembrane_phobius transmembrane_predicted Almen2009 generic locational resource_specific 0 False False False True False 2072
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1120 parin_adhesion_regulator intracellular_intercellular_related HGNC specific functional resource_specific 0 True False False False False 5
1121 plakophilin_adhesion_regulator intracellular_intercellular_related HGNC specific functional resource_specific 0 True False False False False 3
1122 actin_regulation_adhesome intracellular_intercellular_related Adhesome specific functional resource_specific 0 True False False False False 22
1123 adhesion_cytoskeleton_adaptor intracellular_intercellular_related Adhesome specific functional resource_specific 0 True False False False False 118
1124 intracellular_intercellular_related intracellular_intercellular_related OmniPath generic functional composite 0 True False False False False 291

1125 rows × 13 columns

Intercell database as data frame§

Just like the other databases, the object can be exported into a pandas.DataFrame:

[14]:
ic.make_df()
ic.df[:10]

executed in 22.72s, finished 15:45:46 2022-12-03

[14]:
category parent database scope aspect source uniprot genesymbol entity_type consensus_score transmitter receiver secreted plasma_membrane_transmembrane plasma_membrane_peripheral
0 transmembrane transmembrane UniProt_location generic locational resource_specific Q96JP9 CDHR1 protein 6 False False False True False
1 transmembrane transmembrane UniProt_location generic locational resource_specific Q9P126 CLEC1B protein 8 False False False True False
2 transmembrane transmembrane UniProt_location generic locational resource_specific Q13585 GPR50 protein 6 False False False True False
3 transmembrane transmembrane UniProt_location generic locational resource_specific Q8N9I0 SYT2 protein 7 False False False False False
4 transmembrane transmembrane UniProt_location generic locational resource_specific O43614 HCRTR2 protein 6 False False False True False
5 transmembrane transmembrane UniProt_location generic locational resource_specific A6NJY1 SLC9B1P1 protein 4 False False False False False
6 transmembrane transmembrane UniProt_location generic locational resource_specific Q5RI15 COX20 protein 5 False False False False False
7 transmembrane transmembrane UniProt_location generic locational resource_specific Q13948 CUX1 protein 5 False False False False False
8 transmembrane transmembrane UniProt_location generic locational resource_specific Q8NGK4 OR52K1 protein 6 False False False False False
9 transmembrane transmembrane UniProt_location generic locational resource_specific Q8IYS2 KIAA2013 protein 7 False False False True False

Browse intercell categories§

Use the select method to access intercell classes:

[72]:
ic.select(definition = 'neurotensin', parent = 'receptor')

executed in 0ms, finished 15:27:15 2022-12-02

[72]:
<AnnotationGroup `neurotensin` from HGNC, 2 elements>

Proteins in each category can be listed with their descriptions from UniProt. Loading the UniProt datasheets for each protein is a slow process, so we don't recommend calling this method on more than a few dozen proteins.

[79]:
ic.show('neurotensin', parent = 'receptor')

executed in 1ms, finished 15:35:58 2022-12-02

=====> [2 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ O95665 │ NTSR2        │      410 │    45385 │ Neurotensi  │ Receptor     │ Cell       │ Cell         │
│       │        │              │          │          │ n receptor  │ for the tr   │ membrane,  │ membrane;    │
│       │        │              │          │          │ type 2      │ idecapepti   │ Disulfide  │ Multi-pass   │
│       │        │              │          │          │             │
Output truncated: showing 1000 of 7598 characters

Gene Ontology§

pypath.utils.go is an almost standalone module for management of the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. terms and their relationships, the latter their assignment to gene products. Both provide many versatile methods for querying.

[80]:
from pypath.utils import go
goa = go.GOAnnotation()

executed in 1.26s, finished 15:36:46 2022-12-02

[81]:
goa.ontology # the GeneOntology object

executed in 0ms, finished 15:36:48 2022-12-02

[81]:
<pypath.utils.go.GeneOntology at 0x689946b55570>
[82]:
goa # the GOAnnotation object

executed in 0ms, finished 15:36:50 2022-12-02

[82]:
<pypath.utils.go.GOAnnotation at 0x68991cdc9b40>

Among many others, the most versatile method is select which is able to select the annotated gene products by various expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.

[83]:
query = """(cell surface OR
        external side of plasma membrane OR
        extracellular region) AND
        (regulation of transmembrane transporter activity OR
        channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])

executed in 0ms, finished 15:36:55 2022-12-02

['P21333', 'P80108', 'P62258', 'Q9NRX4', 'P54710', 'Q8NER1', 'P01303']
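
To give an idea of how such expressions can be turned into set operations, below is a minimal, self-contained sketch of a boolean query evaluator. It is not the code behind GOAnnotation.select: the annotation data is hypothetical, and the operator precedence (NOT before AND before OR) is an assumption of this toy. It does no error handling for malformed queries.

```python
import re

# Hypothetical annotations: GO term name -> set of placeholder protein IDs
toy_annot = {
    'cell surface': {'P1', 'P2'},
    'extracellular region': {'P2', 'P3'},
    'channel regulator activity': {'P2', 'P3', 'P4'},
}

# Operators and parentheses are the separators, term names may contain spaces
_token = re.compile(r'(\(|\)|\bAND\b|\bOR\b|\bNOT\b)')


def tokenize(query):
    """Split a query into parentheses, operators and term names."""

    parts = (' '.join(p.split()) for p in _token.split(query))

    return [p for p in parts if p]


def evaluate(query, annot=toy_annot):
    """Evaluate a boolean expression of term names to a set of proteins."""

    tokens = tokenize(query)
    universe = set().union(*annot.values())  # needed for NOT
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():
        tok = take()
        if tok == 'NOT':
            return universe - factor()
        if tok == '(':
            result = expr()
            take()  # consume the closing parenthesis
            return result
        return set(annot.get(tok, ()))

    def term():
        result = factor()
        while peek() == 'AND':
            take()
            result &= factor()
        return result

    def expr():
        result = term()
        while peek() == 'OR':
            take()
            result |= term()
        return result

    return expr()


query = """(cell surface OR
        extracellular region) AND
        NOT channel regulator activity"""
print(sorted(evaluate(query)))
```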
[84]:
goa.ontology.get_all_descendants('GO:0005576')

executed in 0ms, finished 15:36:56 2022-12-02

[84]:
{'GO:0001507',
 'GO:0001527',
 'GO:0003351',
 'GO:0003355',
 'GO:0005201',
 'GO:0005576',
 'GO:0005577',
 'GO:0005582',
 'GO:0005583',
 'GO:0005584',
 'GO:0005585',
 'GO:0005586',
 'GO:0005587',
 'GO:0005588',
 'GO:0005590',
 'GO:0005591',
 'GO:0005592',
 'GO:0005595',
 'GO:0005596',
 'GO:0005599',
 'GO:0005601',
 'GO:0005602',
 'GO:0005604',
 'GO:0005606',
 'GO:0005607',
 'GO:0005608',
 'GO:0005609',
 'GO:0005610',
 'GO:0005611',
 'GO:0005612',
 'GO:0005614',
 'GO:0005615',
 'GO:0005616',
 'GO:0006858',
 'GO:0006859',
 'GO:0006860',
 'GO:0009519',
 'GO:0010367',
 'GO:0016914',
 'GO:0016942',
 'GO:0020003',
 'GO:0020004',
 'GO:0020005',
 'GO:0020006',
 'GO:0030020',
 'GO:0030021',
 'GO:0030023',
 'GO:0030197',
 'GO:0030345',
 'GO:0030934',
 'GO:0030935',
 'GO:0030938',
 'GO:0031012',
 'GO:0031395',
 'GO:0032311',
 'GO:0032579',
 'GO:0033165',
 'GO:0033166',
 'GO:0034358',
 'GO:0034359',
 'GO:0034360',
 'GO:0034361',
 'GO:0034362',
 'GO:0034363',
 'GO:0034364',
 'GO:0034365',
 'GO:00343
Output truncated: showing 1000 of 3104 characters
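
Note that the result above contains the queried term itself (GO:0005576). Collecting descendants is a plain graph traversal; the sketch below shows it on a hypothetical ontology fragment with made-up term IDs. It is a toy under these assumptions, not the GeneOntology implementation.

```python
# Hypothetical ontology fragment: term -> direct children
toy_children = {
    'T:root': {'T:a', 'T:b'},
    'T:a': {'T:c'},
    'T:b': {'T:c', 'T:d'},
}


def all_descendants(term, children=toy_children):
    """The term itself plus every term reachable through child links."""

    seen = set()
    stack = [term]

    while stack:

        current = stack.pop()

        if current in seen:
            # a term can be reached through more than one parent
            continue

        seen.add(current)
        stack.extend(children.get(current, ()))

    return seen


print(sorted(all_descendants('T:b')))
```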

Protein complexes§

The pypath.core.complex module builds a non-redundant list of complexes from about 12 original resources. Complexes are unique considering their set of components, and optionally carry stoichiometry information. Homomultimers are also included, hence some complexes consist only of a single kind of protein. The database is an instance of the pypath.core.complex.ComplexAggregator class and is built by the pypath.core.complex.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a long time and does not save or reload the database; it builds a fresh copy each time.

[90]:
from pypath.core import complex
co = complex.get_db()
co.update_index()
co

executed in 0ms, finished 15:39:31 2022-12-02

[90]:
<Complex database: 28173 complexes>

To retrieve all complexes containing a specific protein, here MTOR:

[91]:
co.proteins['P42345']

executed in 0ms, finished 15:39:42 2022-12-02

[91]:
{Complex: COMPLEX:O00141_O15530_O75879_P23443_P34931_P42345_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_O15530_P07900_P23443_P31749_P31751_P42345_P78527_Q05513_Q05655_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P0CG47_P0CG48_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q96J02_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O75879_P0CG48_P23443_P34931_P42345_P62753_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P36894_P42345_P62942_P68106_Q15427_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P46781_P62753_Q6R327_Q8N122_Q96KQ7_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_P62942_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q15172_Q6R327_Q8IW41_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q6R327_Q70Z35_Q8N122_Q8TCU6_Q9BPZ7
Output truncated: showing 1000 of 5348 characters

Note that some of the complexes have human readable names; these are preferred for printing if available from any of the databases. Otherwise the complexes are labelled by COMPLEX:list-of-components.
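
A lookup by protein like the one above is essentially an inverted index from components to complexes. The sketch below builds such an index from the COMPLEX:list-of-components labels alone. The three labels use component lists that appear in this chapter (CBP/p300, mTORC1, mTORC2); the function name is hypothetical and this is not how ComplexAggregator is implemented.

```python
from collections import defaultdict

complex_labels = [
    'COMPLEX:Q09472_Q92793',
    'COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4',
    'COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4',
]


def build_protein_index(labels):
    """Map each UniProt ID to the set of complex labels containing it."""

    index = defaultdict(set)

    for label in labels:

        # drop the `COMPLEX:` prefix, then split the components
        for uniprot in label.split(':', 1)[1].split('_'):

            index[uniprot].add(label)

    return index


index = build_protein_index(complex_labels)
print(sorted(index['P42345']))
print(len(index['Q9BVC4']))
```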

Protein complex objects§

Take a closer look at one complex object. The hash of the complex is equivalent to the string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:

[97]:
cplex = co.complexes['COMPLEX:Q09472_Q92793']
cplex

executed in 0ms, finished 15:41:36 2022-12-02

[97]:
Complex CBP/p300: COMPLEX:Q09472_Q92793
[98]:
cplex.components # stoichiometry

executed in 0ms, finished 15:41:38 2022-12-02

[98]:
{'Q92793': 1, 'Q09472': 1}
[99]:
cplex.sources # resources

executed in 0ms, finished 15:41:39 2022-12-02

[99]:
{'Signor'}
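
The string lookup works because an object and a string with the same hash that compare equal are interchangeable as dict keys. The toy class below demonstrates the mechanism; it is a sketch with a hypothetical class name, not the actual pypath.intera.Complex code.

```python
class ToyComplex:
    """Toy complex, identified by its unique, sorted components."""

    def __init__(self, components, name=None):

        self.components = tuple(sorted(set(components)))
        self.name = name

    def __str__(self):

        return 'COMPLEX:%s' % '_'.join(self.components)

    def __repr__(self):

        # the human readable name, if any, precedes the identifier
        label = ' %s' % self.name if self.name else ''

        return 'Complex%s: %s' % (label, self)

    def __hash__(self):

        # same hash as the string representation
        return hash(str(self))

    def __eq__(self, other):

        return str(self) == str(other)


cplex = ToyComplex(['Q92793', 'Q09472'], name='CBP/p300')
complexes = {cplex: cplex}

print(repr(complexes['COMPLEX:Q09472_Q92793']))
print(str(ToyComplex(['P14867', 'P14867'])))  # a homomultimer
```

In this toy the stoichiometry is dropped on purpose: only the set of components defines identity, which is why a homomultimer collapses to a single component in its label.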

Protein complex data frame§

The database can be exported into a pandas.DataFrame:

[18]:
co.make_df()
co.df

executed in 3.40s, finished 15:47:16 2022-12-03

[18]:
name components components_genesymbols stoichiometry sources references identifiers
0 NFY P23511_P25208_Q13952 NFYA_NFYB_NFYC 1:1:1 CORUM;Compleat;PDB;Signor;ComplexPortal;hu.MAP... 15243141;14755292;9372932 Signor:SIGNOR-C1;CORUM:4478;Compleat:HC1449;in...
1 mTORC2 P68104_P85299_Q6R327_Q8TB45_Q9BVC4 DEPTOR_EEF1A1_MLST8_PRR5_RICTOR 0:0:0:0:0 Signor Signor:SIGNOR-C2
2 mTORC1 P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4 AKT1S1_DEPTOR_MLST8_MTOR_RPTOR 0:0:0:0:0 Signor Signor:SIGNOR-C3
3 SCF-betaTRCP P63208_Q13616_Q9Y297 BTRC_CUL1_SKP1 1:1:1 CORUM;Compleat;Signor 9990852 Signor:SIGNOR-C5;CORUM:227;Compleat:HC757
4 CBP/p300 Q09472_Q92793 CREBBP_EP300 0:0 Signor Signor:SIGNOR-C6
... ... ... ... ... ... ... ...
28168 Npnt complex 2 Q5SZK8_Q6UXI9_Q86XX4 FRAS1_FREM2_NPNT 0:0:0 CellChatDB
28169 NRP1_NRP2 O14786_O60462_Q9Y4D7 NRP1_NRP2_PLXND1 0:0:0 CellChatDB
28170 NRP2_PLXNA2 O60462_O75051 NRP2_PLXNA2 0:0 CellChatDB
28171 NRP2_PLXNA4 O60462_Q9HCM2 NRP2_PLXNA4 0:0 CellChatDB
28172 PTCH2_SMO Q99835_Q9Y6C5 PTCH2_SMO 0:0 CellChatDB

28173 rows × 7 columns

Saving datasets as pickles§

The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than convenient, e.g. from a few minutes up to half an hour. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps. In fact, the database manager does this all the time, in a coordinated way – for this reason, the methods below should be used only with good reason, and relying on the database manager is preferred.

[ ]:
from pypath.core import annot, complex

# for `pypath.core.annot.AnnotationTable` objects
# (`a` is an already built annotation database):
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.core.complex.ComplexAggregator` objects
# (`complexdb` is an already built complex database):
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')
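
The general pattern behind this kind of caching can be sketched with the standard library alone: load the pickle if it exists, otherwise build the object and save it. The function names and the dummy build step below are hypothetical; the database manager in pypath does considerably more than this.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load an object from `path` if it exists, else build and save it."""

    if os.path.exists(path):

        with open(path, 'rb') as fp:

            return pickle.load(fp)

    obj = build()

    with open(path, 'wb') as fp:

        pickle.dump(obj, fp)

    return obj


def slow_build():
    # placeholder for an expensive database build
    return {'records': list(range(5))}


with tempfile.TemporaryDirectory() as tmpdir:

    pickle_path = os.path.join(tmpdir, 'mydata.pickle')
    first = load_or_build(pickle_path, slow_build)   # builds and saves
    second = load_or_build(pickle_path, slow_build)  # loads the pickle

print(first == second)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files you created yourself or obtained from a trusted source.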

Log messages and sessions§

In pypath all modules send messages to a log file, which is named by default after the session ID (a random string of 5 characters). The default path to the log file is ./pypath_log/pypath-xxxxx.log where xxxxx is the session ID.

Warning: The logger of pypath is really verbose, and the log files can grow huge: several tens of thousands of lines, a few MBs. It is recommended to empty the pypath_log directories from time to time.
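
Such a cleanup is easy to script. Below is a minimal sketch that deletes log files not modified for a given number of days. The function name and the 30 day default are arbitrary choices of ours, and the demonstration runs in a temporary directory with fake log files instead of a real pypath_log directory.

```python
import os
import tempfile
import time
from pathlib import Path


def prune_logs(log_dir, max_age_days=30):
    """Delete `pypath-*.log` files not modified for `max_age_days` days."""

    cutoff = time.time() - max_age_days * 86400
    removed = []

    for path in Path(log_dir).glob('pypath-*.log'):

        if path.stat().st_mtime < cutoff:

            path.unlink()
            removed.append(path.name)

    return sorted(removed)


# Demonstration in a temporary directory with two fake log files
with tempfile.TemporaryDirectory() as tmpdir:

    old = Path(tmpdir) / 'pypath-aaaaa.log'
    new = Path(tmpdir) / 'pypath-bbbbb.log'
    old.write_text('old session\n')
    new.write_text('recent session\n')

    # pretend the first file was last modified 90 days ago
    long_ago = time.time() - 90 * 86400
    os.utime(old, (long_ago, long_ago))

    removed = prune_logs(tmpdir, max_age_days=30)
    remaining = sorted(p.name for p in Path(tmpdir).iterdir())

print(removed, remaining)
```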

Basic info about the session§

The info function prints the most important information about the current session:

[100]:
import pypath
pypath.info()

executed in 0ms, finished 15:41:55 2022-12-02

[2022-12-02 16:41:55] [pypath]
        - session ID: `l0n17`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log`
        - pypath version: 0.14.31

Another function prints a disclaimer about licenses. Until recently this message was printed upon every import. It is still important, but we removed it from the import because in certain situations it can be annoying.

[101]:
pypath.disclaimer()

executed in 0ms, finished 15:41:59 2022-12-02


        === d i s c l a i m e r ===

        All data accessed through this module,
        either as redistributed copy or downloaded using the
        programmatic interfaces included in the present module,
        are free to use at least for academic research or
        education purposes.
        Please be aware of the licenses of all the datasets
        you use in your analysis, and please give appropriate
        credits for the original sources when you publish your
        results. To find out more about data sources please
        look at `pypath/resources/data/resources.json` or
        https://omnipathdb.org/info and
        `pypath.resources.urls.urls`.

Read the log file§

Calling pypath.log opens the log file in the default console application for paging text files (in GNU systems typically less):

[ ]:
pypath.log()

executed in 0ms, finished 15:42:08 2022-12-02

The logger and the log file are bound to the session (the 5 random characters are the session ID):

[104]:
pypath.session

executed in 0ms, finished 15:42:27 2022-12-02

[104]:
<Session l0n17>

The logger:

[105]:
pypath.session.log

executed in 0ms, finished 15:42:46 2022-12-02

[105]:
Logger [/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log]

The path to the log file:

[106]:
pypath.session.log.fname

executed in 0ms, finished 15:42:49 2022-12-02

[106]:
'/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log'

Logging to the console§

Each log message has a numeric priority level, and messages with lower level than a threshold are printed to the console. By default only important warnings are dispatched to the console. To log everything to the console, set the threshold to a large number:

[107]:
pypath.session.log.console_level = 10

from pypath.inputs import signor

si = signor.signor_interactions()
pypath.session.log.console_level = -1

executed in 0ms, finished 15:42:56 2022-12-02

[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https
Output truncated: showing 1000 of 1046 characters

Disable logging§

To avoid creation of a log file (and the directory pypath_log) set the environment variable PYPATH_LOG or the builtins.PYPATH_LOG attribute:

[ ]:
# shell:
export PYPATH_LOG="/dev/null"
# then, start Python and use pypath
[108]:
import os
import builtins
builtins.PYPATH_LOG=os.devnull
import pypath

executed in 0ms, finished 15:43:10 2022-12-02

Write to the log§

Sending a single message§

First we change the console level so we can see the log messages. The label is optional. The priority of the message is given by the level. Notice that the second message won't be printed to the console, as its level is higher than 10:

[109]:
pypath.session.log.console_level = 10
pypath.session.log.msg('Greetings from the pypath tutorial notebook! :)', label = 'book')
pypath.session.log.msg('Not important, not shown on console but printed to the logfile.', level = 11)

executed in 0ms, finished 15:43:13 2022-12-02

[2022-12-02 16:43:13] [book] Greetings from the pypath tutorial notebook! :)

Connect a module or class to the pypath logger§

The preferred way of connecting to the logger is to make a class inherit from the Logger class. Here the name will be the default label for all messages coming from the instances of this class:

[110]:
from pypath.share import session

class ChildOfLogger(session.Logger):

    def __init__(self):

        session.Logger.__init__(self, name = 'child')

    def say_something(self):

        self._log('Have a nice day! :D')


col = ChildOfLogger()
col.say_something()

executed in 0ms, finished 15:43:17 2022-12-02

[2022-12-02 16:43:17] [child] Have a nice day! :D

Alternatively, a logger can be created anywhere and used from any module or function:

[111]:
from pypath.share import session

_logger = session.Logger(name = 'mylogger')
_log = _logger._log

_log('Message from a stray logger')

executed in 0ms, finished 15:43:20 2022-12-02

[2022-12-02 16:43:20] [mylogger] Message from a stray logger

Finally we just set the console level to a lower value, to avoid flooding the rest of this book with log messages:

[112]:
pypath.session.log.console_level = -1

executed in 0ms, finished 15:43:23 2022-12-02

BEL export§

Warning: This section hasn't been thoroughly revised for a long time; some parts might be outdated or broken.

Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language to capture relationships between various biological entities, spanning a wide range of levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:

[ ]:
from pypath.legacy import main
from pypath.resources import data_formats
from pypath.omnipath import bel
[ ]:
pa = main.PyPath()
pa.init_network(data_formats.pathway)

You can provide one or more resources to the Bel class. Supported resources currently are pypath.main.PyPath and pypath.ptm.PtmAggregator.

[ ]:
b = bel.Bel(resource = pa)

From the resources we compile a BELGraph object which provides a Python interface for various operations and you can also export the data in BEL format:

[ ]:
b.main()
[ ]:
b.bel_graph
[ ]:
b.bel_graph.summarize()
[ ]:
b.export_relationships('omnipath_pathways.bel')
[ ]:
with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()
[ ]:
print(bel_str[:333])

CellPhoneDB export§

Warning: This section hasn't been thoroughly revised for a long time; some parts might be outdated or broken.

CellPhoneDB is a statistical method and a database for inferring inter-cellular communication pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a resource for interaction, protein complex and annotation data. Apart from this, pypath is able to export its data in the appropriate format to provide input for the CellPhoneDB Python module. For this you can use the pypath.omnipath.cellphonedb module:

[ ]:
from pypath.omnipath import cellphonedb
from pypath.share import settings

settings.setup(network_expand_complexes = False)

Here you can provide parameters for the network or provide an already built network. Also you can provide the datasets as pickles to make them load really fast. Otherwise this step will take quite long.

[ ]:
c = cellphonedb.CellPhoneDB()

You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they are also exported to csv files. For example, interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):

[ ]:
c.interaction_dataframe[:10]

The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data from the pypath.intercell module (identical to the http://omnipathdb.org/intercell query of the web service):

[ ]:
c.protein_dataframe[:10]

The legacy igraph-based network object§

Warning: This section hasn't been thoroughly revised for a long time; some parts might be outdated or broken.

Until about 2019 (before pypath version 0.9) pypath used an igraph.Graph object (igraph.org) to organize all data structures around. This legacy API is still present in pypath.legacy.main, however it is not maintained. This section of the book is still here, but will be removed soon, along with the legacy module.

[43]:
from pypath.legacy import main
No module `cairo` available.
Some plotting functionalities won't be accessible.
[ ]:
pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h
                    # to run it for the first time due to the vast
                    # amount of data download.
                    # Once you populated the cache it still takes
                    # approx. 30 min to build the entire OmniPath
                    # as the process consists of quite some data
                    # processing. If you dump it in a pickle, you
                    # can load the network in < 1 min

I just want a network quickly and play around with pypath§

You can find the predefined formats in the pypath.resources.network module. For example, to load one resource from there, let’s say SIGNOR:

[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
pa = main.PyPath()
pa.load_resources({'signor': netres.pathway['signor']})

Or to load all activity flow resources with literature references:

[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway)

Or to load all activity flow resources, including the ones without literature references:

[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway_all)

How do I build networks from any data with pypath?§

Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don't need to worry about merging redundant elements, nor about different formats and identifiers. Let's say you have two files with network data:

network1.csv

entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation

network2.sif

EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B

Note: you need to create these files in order to load them.
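
If you prefer to create the files from Python, the sketch below writes them with the contents shown above. The function name write_example_files is our own; nothing here depends on pypath.

```python
import os

NETWORK1_CSV = """entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
"""

NETWORK2_SIF = """EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
"""


def write_example_files(directory='.'):
    """Write both example files into `directory`, return their paths."""

    paths = {}

    for fname, content in (
        ('network1.csv', NETWORK1_CSV),
        ('network2.sif', NETWORK2_SIF),
    ):

        path = os.path.join(directory, fname)

        with open(path, 'w') as fp:

            fp.write(content)

        paths[fname] = path

    return paths
```

Calling write_example_files() without arguments creates both files in the current working directory, where the input definitions in the next step expect them.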

Defining input formats§

[ ]:
import pypath
# module path assumed; in older pypath versions it was `pypath.input_formats`
import pypath.internals.input_formats as input_formats

input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)

input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)

Creating PyPath object and loading the 2 test files§

[ ]:
inputs = {
    'egf1': input1,
    'egf2': input2
}

pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)

Structure of the legacy network object§

[ ]:
from pypath.legacy import main as legacy
pa = legacy.PyPath()
[ ]:
pa.graph

Number of edges and nodes:

[ ]:
pa.ecount, pa.vcount

You can access the edge and vertex sequences in the es and vs attributes; you can iterate these or index them by integers. The edge and vertex attributes can be accessed by string keys. E.g. get the sources of edge 81:

[ ]:
pa.graph.es[81]['sources']

Directions and signs§

By default the igraph object is undirected but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can create it manually:

[ ]:
pa.get_directed()

You find the directed network in the pa.dgraph attribute:

[ ]:
pa.dgraph

Now let’s take a look at the pypath.main.Direction objects which contain details about directions and signs. First, as an example, select a random edge:

[ ]:
edge = pa.graph.es[3241]

The Direction object is in the dirs edge attribute:

[ ]:
d = edge['dirs']

It has a method to print its content in a human readable way:

[ ]:
print(pa.graph.es[3241]['dirs'])

From this we see that the databases phosphoELM and Signor agree that protein P17252 has an effect on Q15139, and Signor in addition tells us that this effect is stimulatory. In your scripts, however, you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions either straight or reverse:

[ ]:
d.straight
[ ]:
d.reverse

It can tell you if one of these directions is supported by any of the network resources:

[ ]:
d.get_dir(d.straight)

Or it can return those resources:

[ ]:
d.get_dir(d.straight, sources = True)

The opposite direction is not supported by any resource:

[ ]:
d.get_dir(d.reverse, sources = True)

The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.

[ ]:
d.get_sign(d.straight)

Or you can ask whether it is inhibition:

[ ]:
d.is_inhibition(d.straight)

Or if the interaction is directed at all:

[ ]:
d.is_directed()

Sometimes resources don’t agree: for example, one tells that an interaction is inhibition while according to others it is stimulation; or one tells that A affects B and another resource the other way around. Here we preserve all this potentially contradicting information in the Direction object, and at the end you decide what to do with it, depending on your purpose. If you want to get rid of ambiguity, there is a method to get a consensus direction and sign, which returns the attributes most resources agree on:

[ ]:
d.consensus_edges()
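
The idea of a consensus can be illustrated by a simple majority vote over resources. The sketch below is a toy with hypothetical resource names and data; the exact rules applied by consensus_edges may differ, for example in how ties are handled (here the first seen option wins).

```python
from collections import Counter

# Hypothetical evidence: each resource reports (source, target, sign)
evidence = {
    'ResourceA': ('A', 'B', 'stimulation'),
    'ResourceB': ('A', 'B', 'stimulation'),
    'ResourceC': ('A', 'B', 'inhibition'),
    'ResourceD': ('B', 'A', 'stimulation'),
}


def consensus(evidence):
    """Most supported direction, then most supported sign within it."""

    directions = Counter(
        (src, tgt) for src, tgt, _ in evidence.values()
    )
    (src, tgt), _ = directions.most_common(1)[0]

    signs = Counter(
        sign
        for s, t, sign in evidence.values()
        if (s, t) == (src, tgt)
    )
    sign, _ = signs.most_common(1)[0]

    return src, tgt, sign


print(consensus(evidence))
```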

Accessing nodes in the network§

In igraph the vertices are numbered, but this numbering can change at certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.

[ ]:
pa.graph.vs['name'][:5]
[ ]:
pa.graph.vs['label'][:5]

The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:

[ ]:
type(pa.up('P00533'))

Similarly genesymbol or gs for Gene Symbols:

[ ]:
type(pa.gs('ESR1'))

Each of these has a “plural” version:

[ ]:
len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))

And a generic method where you can mix UniProts and Gene Symbols:

[ ]:
len(list(pa.proteins(['MTOR', 'P00533'])))

Querying relationships with or without causality§

Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by a given protein, and which others are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:

[ ]:
pa.gs_stimulated_by('PIK3CA')

It returns a so called _NamedVertexSeq object, which you can get a series of igraph.Vertex objects or Gene Symbols or UniProt IDs from:

[ ]:
list(pa.gs_stimulated_by('PIK3CA').gs())[:5]
[ ]:
list(pa.gs_stimulated_by('PIK3CA').up())[:5]

Note that the names of these methods are a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:

[ ]:
list(pa.gs_stimulates('PIK3CA').gs())[:5]
[ ]:
'PIK3CA' in set(pa.affected_by('AKT1').gs())
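
To make the difference between the two directions of the query explicit, here is a toy version that works on a plain list of signed, directed edges, copied from the network2.sif example earlier in this chapter. The function names mirror the methods above, but this is only a sketch and involves no pypath objects.

```python
# (source, sign, target), from the `network2.sif` example
edges = [
    ('EGF', '+', 'EGFR'),
    ('EGFR', '+', 'PIK3CA'),
    ('EGFR', '+', 'SOS1'),
    ('PIK3CA', '+', 'RAC1'),
    ('RAC1', '+', 'MAP3K1'),
    ('SOS1', '+', 'HRAS'),
    ('HRAS', '+', 'MAP3K1'),
    ('PIK3CA', '+', 'AKT1'),
    ('AKT1', '-', 'GSK3B'),
]


def stimulated_by(gene):
    """Upstream: the genes which stimulate `gene`."""

    return {src for src, sign, tgt in edges if tgt == gene and sign == '+'}


def stimulates(gene):
    """Downstream: the genes which `gene` stimulates."""

    return {tgt for src, sign, tgt in edges if src == gene and sign == '+'}


def inhibits(gene):
    """Downstream: the genes which `gene` inhibits."""

    return {tgt for src, sign, tgt in edges if src == gene and sign == '-'}


print(sorted(stimulated_by('PIK3CA')))
print(sorted(stimulates('PIK3CA')))
print(sorted(inhibits('AKT1')))
```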

There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to a list of Gene Symbols, .up() to UniProt IDs and .ids() to vertex IDs; by default it yields igraph.Vertex objects:

[ ]:
list(pa.neighbors('AKT1').ids())[:5]

Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):

[ ]:
print(list(pa.neighborhood('ATG3', 1).gs()))
[ ]:
print(list(pa.neighborhood('ATG3', 2).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 3).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 4).gs()))
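
How quickly a neighborhood grows is easy to see even on a tiny network. The sketch below runs a breadth-first search with a step limit on the undirected version of the network2.sif example from earlier in this chapter. It is a toy: it excludes the start node from the result, which is our choice and not necessarily what the neighborhood method does.

```python
from collections import deque

# Undirected pairs from the `network2.sif` example
pairs = [
    ('EGF', 'EGFR'), ('EGFR', 'PIK3CA'), ('EGFR', 'SOS1'),
    ('PIK3CA', 'RAC1'), ('RAC1', 'MAP3K1'), ('SOS1', 'HRAS'),
    ('HRAS', 'MAP3K1'), ('PIK3CA', 'AKT1'), ('AKT1', 'GSK3B'),
]


def neighborhood(start, steps):
    """Nodes reachable from `start` in at most `steps` steps."""

    adjacency = {}

    for a, b in pairs:

        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])

    while queue:

        node = queue.popleft()

        if distance[node] == steps:
            # do not expand beyond the step limit
            continue

        for neighbor in adjacency.get(node, ()):

            if neighbor not in distance:

                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    return set(distance) - {start}


for steps in range(1, 5):

    print(steps, len(neighborhood('EGF', steps)))
```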

Accessing edges by identifiers§

Just like nodes, edges can also be accessed by identifiers like Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.

[ ]:
type(pa.get_edge('EGF', 'EGFR'))
[ ]:
type(pa.get_edge('EGF', 'P00533'))
[ ]:
type(pa.get_edge('EGF', 'AKT1'))
[ ]:
print(pa.get_edge('EGF', 'EGFR')['dirs'])

Literature references§

Select a random edge and in the references attribute you find a list of references:

[ ]:
edge = pa.get_edge( 'MAP1LC3B', 'SQSTM1')
edge['references']

Each reference has a PubMed ID:

[ ]:
edge['references'][0].pmid
[ ]:
edge['references'][0].open()

These 3 references come from 3 different databases, but there must be 2 overlaps between them:

[ ]:
edge['refs_by_source']

Plotting the network with igraph§

Here we use the network created above (because it is of a reasonable size, unlike the networks we could get from most of the network databases). igraph has excellent plotting abilities built on top of the cairo library.

[ ]:
import igraph
plot = igraph.plot(pa.graph, target = 'egf_network.png',
            edge_width = 0.3, edge_color = '#777777',
            vertex_color = '#97BE73', vertex_frame_width = 0,
            vertex_size = 70.0, vertex_label_size = 15,
            vertex_label_color = '#FFFFFF',
            # due to a bug in either igraph or IPython,
            # vertex labels are not visible on inline plots:
            inline = False, margin = 120)
from IPython.display import Image
Image(filename='egf_network.png')