The pypath book§

Contents

1 Introduction
2 Build, load and save databases
- 2.1 The OmniPath app
- 2.2 Built-in database definitions
- 2.3 Networks
  - 2.3.1 Strictly literature curated network
  - 2.3.2 The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions
  - 2.3.3 Transcriptional regulation network from DoRothEA and other resources
  - 2.3.4 Literature curated miRNA post-transcriptional regulation network
  - 2.3.5 Transcriptional regulation of miRNA
  - 2.3.6 lncRNA-mRNA interactions
  - 2.3.7 Small molecule-protein interactions
- 2.4 Enzyme-substrate relationships
- 2.5 Protein complexes
- 2.6 Annotations
- 2.7 Inter-cellular communication roles
3 Data directly from the original resources
4 Interesting resources
- 4.1 RaMP
  - 4.1.1 TL;DR
- 4.2 HMDB (Human Metabolome Database)
  - 4.2.1 Direct access to HMDB data
  - 4.2.2 Higher level access to HMDB data
  - 4.2.3 ID translation with HMDB
- 4.3 NCBI E-Utils
5 Download management
- 5.1 Cache management and customization
- 5.2 Download failures
  - 5.2.1 Corrupted cache content
  - 5.2.2 Network communication issues: look into the curl debug log
  - 5.2.3 Timeouts
  - 5.2.4 Access and inspect the Curl object
  - 5.2.5 Is it failing only for you?
  - 5.2.6 Read the log
  - 5.2.7 TLS (SSL, HTTPS) errors
6 Resources
- 6.1 Licenses
  - 6.1.1 Example: build a network for commercial use
- 6.2 Resource information
- 6.3 Resource definitions for a certain database or dataset
7 Building networks
- 7.1 Which network datasets are pre-defined in pypath?
- 7.2 The Network object
- 7.3 Network in pandas.DataFrame
- 7.4 Self interactions (loop edges) in the network
- 7.5 Molecular complexes in the network
8 Translating identifiers
- 8.1 Pre-defined ID translation tables
- 8.2 Direct access to ID translation tables
9 Orthology translation
- 9.1 Orthology translation tables as dictionaries
- 9.2 Orthology translation data frames
10 Taxonomy
- 10.1 Translating to NCBI Taxonomy, scientific names and common names
- 10.2 Organism from UniProt ID
11 UniProt
- 11.1 The UniProt input module
  - 11.1.1 All UniProt IDs for one organism
  - 11.1.2 UniProt ID format validation
  - 11.1.3 UniProt ID validation
  - 11.1.4 Single UniProt protein datasheet
  - 11.1.5 History of UniProt records
  - 11.1.6 UniProt REST API
  - 11.1.7 Processed UniProt annotations
- 11.2 The UniProt utils module
  - 11.2.1 Datasheets
  - 11.2.2 Tables
- 11.3 Sanitizing UniProt IDs
12 Enzyme-substrate interactions
- 12.1 Enzyme-substrate objects
- 12.2 Enzyme-substrate data frame
13 Protein sequences
14 Annotations
- 14.1 Load a single annotation resource
- 14.2 Load the full annotations database by the database manager
- 14.3 Load only selected annotations
- 14.4 Data frames of annotations
15 Inter-cellular signaling roles
- 15.1 Build an intercellular communication network
- 15.2 Quantitative overview of intercell annotations
- 15.3 Intercell database as data frame
- 15.4 Browse intercell categories
16 Gene Ontology
17 Protein complexes
- 17.1 Protein complex objects
- 17.2 Protein complex data frame
18 Saving datasets as pickles
19 Log messages and sessions
- 19.1 Basic info about the session
- 19.2 Read the log file
- 19.3 Logging to the console
- 19.4 Disable logging
- 19.5 Write to the log
  - 19.5.1 Sending a single message
  - 19.5.2 Connect a module or class to the pypath logger
20 BEL export
21 CellPhoneDB export
22 The legacy igraph-based network object
- 22.1 I just want a network quickly and play around with pypath
- 22.2 How do I build networks from any data with pypath?
  - 22.2.1 Defining input formats
  - 22.2.2 Creating PyPath object and loading the 2 test files
- 22.3 Structure of the legacy network object
  - 22.3.1 Directions and signs
  - 22.3.2 Accessing nodes in the network
- 22.4 Querying relationships with or without causality
- 22.5 Accessing edges by identifiers
- 22.6 Literature references
- 22.7 Plotting the network with igraph

Introduction§

OmniPath consists of 5 main database segments: network (interactions), enzyme-substrate interactions (enz_sub or ptms), protein complexes (complexes), molecular entity annotations (annotations) and intercellular communication roles (intercell). You can access all of these through the web service at https://omnipathdb.org/ and the R/Bioconductor package OmnipathR; the network and some of the annotations are also available through the Cytoscape app. However, only pypath can build these databases directly from the original sources, with various options for customization, and provide a rich and versatile Python API for each database. This book attempts to be a guided tour of pypath, but almost all objects, modules and APIs presented here have many more methods, options and features than we have a chance to cover. If you feel there might be something useful for you, don’t hesitate to ask us on GitHub.

This document has been run with the following pypath version:

[1]:

import pypath
pypath.__version__

executed in 0ms, finished 16:49:47 2023-03-09

[1]:

'0.14.36'

Build, load and save databases§

We provide a high level interface in the module pypath.omnipath.app. This is the easiest way to build, manage and access the OmniPath databases, hence this is what we present in the Quick start section. In further sections we show the lower level modules in more detail.

The OmniPath app§

pypath.omnipath is an application which contains a database manager at omnipath.db. This manager is empty by default. It builds and loads the databases on demand.

[2]:

from pypath import omnipath

omnipath.db

executed in 1.34s, finished 14:11:27 2022-12-03

[2]:

<pypath.omnipath.app.DatabaseManager at 0x602fb851cd90>
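
The on-demand behaviour can be pictured with a toy manager. This is only a sketch of the lazy-loading pattern, not pypath's actual implementation; `LazyManager` and `build_curated` are hypothetical names introduced for illustration.

```python
class LazyManager:
    """Toy stand-in for a manager that builds datasets only when requested."""

    def __init__(self, builders):
        # builders: dataset name -> zero-argument function that builds it
        self._builders = builders
        self._loaded = {}

    def get_db(self, name):
        # build on the first request, serve the cached object afterwards
        if name not in self._loaded:
            self._loaded[name] = self._builders[name]()
        return self._loaded[name]


build_calls = []


def build_curated():
    # hypothetical builder; a real build would download and process data
    build_calls.append('curated')
    return {'name': 'curated'}


manager = LazyManager({'curated': build_curated})
first = manager.get_db('curated')
second = manager.get_db('curated')
```

The second call returns the already built object, so the expensive build step runs only once.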

Built-in database definitions§

The databases presented below are pre-defined in pypath. You can also list them by:

[3]:

from pypath import omnipath
omnipath.db.datasets

executed in 0ms, finished 14:11:32 2022-12-03

[3]:

['omnipath',
 'curated',
 'complex',
 'annotations',
 'intercell',
 'tf_target',
 'dorothea',
 'small_molecule',
 'tf_mirna',
 'mirna_mrna',
 'lncrna_mrna',
 'enz_sub']

Networks§

OmniPath offers multiple built-in network datasets: the OmniPath PPI network, the stricter literature curated PPI network, the special ligand-receptor PPI network and various other PPI datasets, the transcriptional regulation network from DoRothEA and other resources, the miRNA post-transcriptional regulation network, and the transcriptional regulation network of miRNAs.

Strictly literature curated network§

[4]:

from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu

                            

executed in 16.83s, finished 13:17:13 2022-12-02

[4]:

<Network: 7980 nodes, 35551 interactions>

The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions§

[5]:

from pypath import omnipath
op = omnipath.db.get_db('omnipath')
op

                            

executed in 1m, finished 13:18:55 2022-12-02

[5]:

<Network: 18558 nodes, 94358 interactions>

Transcriptional regulation network from DoRothEA and other resources§

Note: according to the default settings, DoRothEA confidence levels A-D and all original resources will be loaded. To load only DoRothEA, use the key "dorothea" instead of "tf_target".

[6]:

from pypath import omnipath
tft = omnipath.db.get_db('tf_target')
tft

                            

executed in 2m 12.72s, finished 13:21:54 2022-12-02

[6]:

<Network: 18986 nodes, 326708 interactions>

Literature curated miRNA post-transcriptional regulation network§

[1]:

from pypath import omnipath
mi = omnipath.db.get_db('mirna_mrna')
mi

                            

executed in 2.28s, finished 13:31:55 2022-12-02

[1]:

<Network: 1264 nodes, 3288 interactions>

Transcriptional regulation of miRNA§

[4]:

from pypath import omnipath
tmi = omnipath.db.get_db('tf_mirna')
tmi

                            

executed in 0ms, finished 13:32:41 2022-12-02

[4]:

<Network: 1032 nodes, 4960 interactions>

lncRNA-mRNA interactions§

[6]:

from pypath import omnipath
lnc = omnipath.db.get_db('lncrna_mrna')
lnc

                            

executed in 0ms, finished 13:33:03 2022-12-02

[6]:

<Network: 243 nodes, 217 interactions>

Small molecule-protein interactions§

These interactions are either ligand-receptor connections, enzyme inhibitions, allosteric regulations or enzyme-metabolite interactions. Currently it is a small, experimental dataset, but it will be substantially extended in the future.

[1]:

from pypath import omnipath
smol = omnipath.db.get_db('small_molecule')
smol

                            

executed in 7.94s, finished 13:57:17 2022-12-02

[1]:

<Network: 1980 nodes, 3147 interactions>

Enzyme-substrate relationships§

[7]:

from pypath import omnipath
es = omnipath.db.get_db('enz_sub')
es

                          

executed in 6.14s, finished 13:33:26 2022-12-02

[7]:

<Enzyme-substrate database: 41426 relationships>

Protein complexes§

[8]:

from pypath import omnipath
co = omnipath.db.get_db('complex')
co

                          

executed in 0ms, finished 13:33:31 2022-12-02

[8]:

<Complex database: 28173 complexes>

Annotations§

The annotations database is huge: building or even loading it takes a long time and requires a considerable amount of memory.

[9]:

from pypath import omnipath
an = omnipath.db.get_db('annotations')
an

                          

executed in 2m 43.60s, finished 13:36:28 2022-12-02

[9]:

<Annotation database: 5490653 records about 50872 entities from 68 resources>

Inter-cellular communication roles§

This database is quick to build, but it requires the annotations database, which is really heavy.

[10]:

from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

                          

executed in 23.34s, finished 13:37:12 2022-12-02

[10]:

<Intercell annotations: 301527 records about 48570 entities>

Data directly from the original resources§

The pypath.inputs module contains clients for more than 150 molecular biology and biomedical resources, with almost 500 functions in total that download data directly from these resources. Maintaining such a large number of clients is laborious, so at any given time some of them are broken; you can check which ones in our daily status report.

Each submodule of pypath.inputs is named after its corresponding resource, all lowercase, e.g. “depod” (DEPOD) or “cytosig” (CytoSig). Within these modules each function name starts with the name of the resource and ends with the kind of data it retrieves. For example, pypath.inputs.signor.signor_interactions downloads interactions from SIGNOR. Functions with the postfixes “_interactions”, “_enz_sub”, “_complexes” and “_annotations” retrieve records intended for the respective databases, although the records are not yet fully processed at this stage. Some functions have other postfixes, e.g. “_raw” means the data is close to the format provided by the resource itself, and “_mapping” means it is intended for an ID translation table.

The purpose of the input functions is to 1) handle the download; 2) read the raw data; 3) extract the relevant parts; 4) do the resource-specific part of the processing, i.e. bring the data into a state suitable for further processing by the generic database classes. The output of these functions is not standardized in any way, though you may observe recurring patterns. The input functions typically return lists or dictionaries, designed case by case to select the relevant fields and to give straightforward, accessible Python data structures for processing within or outside of pypath.
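
The naming convention can be made concrete with a small helper that splits a function name into the resource and the kind of data. This is a sketch, not part of pypath: `split_input_name` and `KNOWN_KINDS` are hypothetical, the list of postfixes contains only those named in the text, and the resource name is assumed to contain no underscore.

```python
# Postfixes named in the text; real modules may use others as well.
KNOWN_KINDS = (
    'interactions', 'enz_sub', 'complexes', 'annotations', 'raw', 'mapping',
)


def split_input_name(func_name):
    """Return (resource, kind) for a name such as 'signor_interactions'."""
    # assumption: the resource is everything before the first underscore
    resource = func_name.split('_', 1)[0]
    for kind in KNOWN_KINDS:
        if func_name.endswith('_' + kind):
            return resource, kind
    # the name does not end with any of the known postfixes
    return resource, None


parsed = [
    split_input_name('signor_interactions'),
    split_input_name('signor_pathway_annotations'),
    split_input_name('signor_enzyme_substrate'),
]
```

The last name shows that not every function follows the listed postfixes literally, so a helper like this is only good for a rough overview.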

We use SIGNOR as an example because this resource provides data for almost all OmniPath databases. The signor_complexes function returns a dict of pypath.internals.intera.Complex objects keyed by their string identifiers, ready to be added to the OmniPath complexes database (built by pypath.core.complex.ComplexAggregator).

[2]:

from pypath.inputs import signor
signor.signor_complexes()

executed in 0ms, finished 15:24:43 2022-12-03

[2]:

{'COMPLEX:P23511_P25208_Q13952': Complex NFY: COMPLEX:P23511_P25208_Q13952,
 'COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4': Complex mTORC2: COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4,
 'COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4': Complex mTORC1: COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4,
 'COMPLEX:P63208_Q13616_Q9Y297': Complex SCF-betaTRCP: COMPLEX:P63208_Q13616_Q9Y297,
 'COMPLEX:Q09472_Q92793': Complex CBP/p300: COMPLEX:Q09472_Q92793,
 'COMPLEX:Q09472_Q92793_Q92831': Complex P300/PCAF: COMPLEX:Q09472_Q92793_Q92831,
 'COMPLEX:Q13485_Q15796': Complex SMAD2/SMAD4: COMPLEX:Q13485_Q15796,
 'COMPLEX:P84022_Q13485': Complex SMAD3/SMAD4: COMPLEX:P84022_Q13485,
 'COMPLEX:P05412_Q13485': Complex SMAD4/JUN: COMPLEX:P05412_Q13485,
 'COMPLEX:Q15796_Q9HAU4': Complex SMAD2/SMURF2: COMPLEX:Q15796_Q9HAU4,
 'COMPLEX:O15105_Q01094_Q13547': Complex SMAD7/HDAC1/E2F-1: COMPLEX:O15105_Q01094_Q13547,
 'COMPLEX:P19838_Q04206': Complex NfKb-p65/p50: COMPLEX:P19838_Q04206,
 'COMPLEX:O14920_O15111': Complex IK

Output truncated: showing 1000 of 17699 characters

The signor_interactions function returns a list of named tuples that represent the most important properties of SIGNOR interaction records in a human-readable way, ready to be processed by the pypath.core.network.Network object.

[5]:

signor.signor_interactions()[:10]

                        

executed in 0ms, finished 14:11:52 2022-12-03

[5]:

[SignorInteraction(source='O15530', target='O15530', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='unknown', mechanism='phosphorylation', ncbi_tax_id='9606', pubmeds='10455013', direct=True, ptm_type='phosphorylation', ptm_residue='Ser396', ptm_motif='SSSSSSHsLSASDTG'),
 SignorInteraction(source='Q9NQ66', target='CHEBI:18035', source_isoform=None, target_isoform=None, source_type='protein', target_type='smallmolecule', effect='up-regulates quantity', mechanism='', ncbi_tax_id='-1', pubmeds='23880553', direct=True, ptm_type='', ptm_residue='Small molecule catalysis', ptm_motif=''),
 SignorInteraction(source='P62136', target='O15169', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='down-regulates activity', mechanism='dephosphorylation', ncbi_tax_id='9606', pubmeds='17318175', direct=True, ptm_type='dephosphorylation', ptm_residue='Ser77', ptm_motif='YEPEGSAsPTPPYLK'),
 SignorInteraction(sou

Output truncated: showing 1000 of 3285 characters
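
Because the records are named tuples, their fields can be addressed by name, which keeps filters readable. The sketch below uses a reduced, hypothetical record type `Interaction` with only four of the fields shown above and the three records visible in the output; in real use the list would come from signor.signor_interactions().

```python
from collections import namedtuple

# Reduced, hypothetical version of the record type shown above.
Interaction = namedtuple(
    'Interaction',
    ['source', 'target', 'effect', 'mechanism'],
)

records = [
    Interaction('O15530', 'O15530', 'unknown', 'phosphorylation'),
    Interaction('Q9NQ66', 'CHEBI:18035', 'up-regulates quantity', ''),
    Interaction(
        'P62136', 'O15169', 'down-regulates activity', 'dephosphorylation',
    ),
]

# select the dephosphorylation events and keep only the interacting partners
dephos = [r for r in records if r.mechanism == 'dephosphorylation']
pairs = [(r.source, r.target) for r in dephos]
```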

Note that the records above also contain enzyme-PTM data; the signor.signor_enzyme_substrate function only converts them to an intermediate format that is easier to process for pypath.core.enz_sub.EnzymeSubstrateAggregator.

[4]:

signor.signor_enzyme_substrate()[:2]

                        

executed in 0ms, finished 13:58:20 2022-12-02

[4]:

[{'typ': 'phosphorylation',
  'resnum': 396,
  'instance': 'SSSSSSHSLSASDTG',
  'substrate': 'O15530',
  'start': 389,
  'end': 403,
  'kinase': 'O15530',
  'resaa': 'S',
  'motif': 'SSSSSSHSLSASDTG',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'10455013'}},
 {'typ': 'dephosphorylation',
  'resnum': 77,
  'instance': 'YEPEGSASPTPPYLK',
  'substrate': 'O15169',
  'start': 70,
  'end': 84,
  'kinase': 'P62136',
  'resaa': 'S',
  'motif': 'YEPEGSASPTPPYLK',
  'enzyme_isoform': None,
  'substrate_isoform': None,
  'references': {'17318175'}}]
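
The coordinates in these records can be cross-checked against each other. The sketch below assumes that start and end are 1-based, inclusive positions of the motif window in the substrate sequence; this interpretation is inferred from the two records above, not taken from the pypath documentation. `residue_in_instance` is a hypothetical helper.

```python
# Two records copied from the output above, reduced to the relevant keys.
records = [
    {
        'resnum': 396, 'start': 389, 'end': 403,
        'resaa': 'S', 'instance': 'SSSSSSHSLSASDTG',
    },
    {
        'resnum': 77, 'start': 70, 'end': 84,
        'resaa': 'S', 'instance': 'YEPEGSASPTPPYLK',
    },
]


def residue_in_instance(rec):
    """Return the residue of the motif window located at position resnum."""
    offset = rec['resnum'] - rec['start']
    return rec['instance'][offset]


# window length implied by the coordinates, and the residue found at resnum
window_lengths = [rec['end'] - rec['start'] + 1 for rec in records]
modified = [residue_in_instance(rec) for rec in records]
```

Under this assumption the modified residue sits in the middle of a 15 residue window, and the residue found there agrees with resaa in both records.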

Finally, SIGNOR also assigns proteins to pathways. This information is intended for the OmniPath annotations database, and is retrieved by the signor.signor_pathway_annotations function. This function returns a dict of sets, which is typical for the “_annotations” functions. This format requires practically no further processing.

[5]:

signor.signor_pathway_annotations()['O14733']

                        

executed in 1.48s, finished 13:58:28 2022-12-02

[5]:

{SignorPathway(pathway='TNF alpha'),
 SignorPathway(pathway='Toll like receptors')}
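
A dict of sets keyed by protein is easy to turn around when the question is which proteins belong to a pathway. The sketch below uses a locally defined `SignorPathway` named tuple that mimics the record shown above, and only the single entry from that output; in real use the dict would come from signor.signor_pathway_annotations().

```python
from collections import defaultdict, namedtuple

# Local stand-in for the record type shown in the output above.
SignorPathway = namedtuple('SignorPathway', ['pathway'])

annotations = {
    'O14733': {
        SignorPathway(pathway='TNF alpha'),
        SignorPathway(pathway='Toll like receptors'),
    },
}

# invert the mapping: pathway name -> set of UniProt IDs
by_pathway = defaultdict(set)
for protein, pathway_records in annotations.items():
    for record in pathway_records:
        by_pathway[record.pathway].add(protein)

by_pathway = dict(by_pathway)
```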

We haven’t mentioned all functions in the inputs.signor module. The rest of the functions retrieve additional information needed by the four functions above, and are of limited direct use for users. For example, signor_protein_families returns a dict with the internal IDs and members of protein families; this data is used to process the interactions and complexes, but is not too interesting on its own.

[6]:

signor.signor_protein_families()['SIGNOR-PF2']

                        

executed in 0ms, finished 13:58:53 2022-12-02

[6]:

['Q9HBW0', 'Q92633', 'Q9UBY5']

Interesting resources§

Here we showcase a few potentially useful features in pypath.inputs.

RaMP§

RaMP is a human metabolite and metabolic network database providing ID translation, annotations and enzymatic reactions of metabolites. Let’s first take a closer look at the full database contents. It is available as a MySQL database; below we list the tables and their column names:

[6]:

from pypath.inputs import ramp
ramp.ramp_list_tables()

executed in 2.20s, finished 16:51:14 2023-03-09

[6]:

{'analyte': ['rampId', 'type'],
 'analytehasontology': ['rampCompoundId', 'rampOntologyId'],
 'analytehaspathway': ['rampId', 'pathwayRampId', 'pathwaySource'],
 'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
 'catalyzed': ['rampCompoundId', 'rampGeneId'],
 'chem_props': ['ramp_id',
  'chem_data_source',
  'chem_source_id',
  'iso_smiles',
  'inchi_key_prefix',
  'inchi_key',
  'inchi',
  'mw',
  'monoisotop_mass',
  'common_name',
  'mol_formula'],
 'db_version': ['ramp_version',
  'load_timestamp',
  'version_notes',
  'met_intersects_json',
  'gene_intersects_json',
  'met_intersects_json_pw_mapped',
  'gene_intersects_json_pw_mapped',
  'db_sql_url'],
 'entity_status_info': ['status_category',
  'entity_source_id',
  'entity_source_name',
  'entity_count'],
 'metabolite_class': ['ramp_id',
  'class_source_id',
  'class_level_name',
  'class_name',
  'source'],
 'ontology': ['rampOntologyId', 'commonName', 'HMDBOntologyType', 'metCount'],
 'pathway': ['pathwayR

Output truncated: showing 1000 of 1368 characters

Using the ramp_raw function, we can access these tables as Python dicts, as pandas.DataFrames, or loaded into an SQLite database. For further inspection, the data frames are the most convenient.

Note: The very first time, retrieving these tables takes quite some time, not only due to the large download but also because of a performance bottleneck in processing the MySQL dumps. Thanks to caching, subsequent loading of the tables is much faster.

Most of the ID translation data is contained in the source table:

[8]:

tables = ramp.ramp_raw(['analytesynonym', 'chem_props', 'source'])
tables['source']

executed in 4.25s, finished 16:54:17 2023-03-09

[8]:

	sourceId	rampId	IDtype	geneOrCompound	commonName	priorityHMDBStatus	dataSource	pathwayCount
0	hmdb:HMDB0000001	RAMP_C_000000001	hmdb	compound	1-Methylhistidine	quantified	hmdb	2
1	hmdb:HMDB0000479	RAMP_C_000000001	hmdb	compound	3-Methylhistidine	quantified	hmdb	2
2	chebi:50599	RAMP_C_000000001	chebi	compound	1-Methylhistidine	quantified	hmdb	2
3	chemspider:83153	RAMP_C_000000001	chemspider	compound	1-Methylhistidine	quantified	hmdb	2
4	kegg:C01152	RAMP_C_000000001	kegg	compound	1-Methylhistidine	quantified	hmdb_kegg	2
...	...	...	...	...	...	...	...	...
756552	uniprot:H0YDB7	RAMP_G_000009307	uniprot	gene	RAB38	NULL	wiki	10
756553	uniprot:A0A024R191	RAMP_G_000009307	uniprot	gene	RAB38	NULL	wiki	10
756554	uniprot:H0YEA4	RAMP_G_000009307	uniprot	gene	RAB38	NULL	wiki	10
756555	entrez:23682	RAMP_G_000009307	entrez	gene	RAB38	NULL	wiki	10
756556	gene_symbol:RAB38	RAMP_G_000009307	gene_symbol	gene	RAB38	NULL	wiki	10

756557 rows × 8 columns

Structural and physicochemical info is available in the chem_props table:

[10]:

tables['chem_props']

                          

executed in 0ms, finished 17:00:46 2023-03-09

[10]:

	ramp_id	chem_data_source	chem_source_id	iso_smiles	inchi_key_prefix	inchi_key	inchi	mw	monoisotop_mass	common_name	mol_formula
0	RAMP_C_000000001	hmdb	hmdb:HMDB0000001	[H]OC(=O)[C@@]([H])(N([H])[H])C([H])([H])C1=C(...	BRMWTNUJHUMWMS	BRMWTNUJHUMWMS-LURJTMIESA-N	InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11...	169.181	169.085	1-Methylhistidine	C7H11N3O2
1	RAMP_C_000000001	hmdb	hmdb:HMDB0000479	[H][C@](N)(CC1=CN=CN1C)C(O)=O	JDHILDINMRGULE	JDHILDINMRGULE-LURJTMIESA-N	InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11...	169.181	169.085	3-Methylhistidine	C7H11N3O2
2	RAMP_C_000000001	chebi	chebi:27596	Cn1cncc1C[C@H](N)C(O)=O	JDHILDINMRGULE	JDHILDINMRGULE-LURJTMIESA-N	InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11...	NULL	169.085	N(pros)-methyl-L-histidine	C7H11N3O2
3	RAMP_C_000000001	chebi	chebi:50599	Cn1cnc(C[C@H](N)C(O)=O)c1	BRMWTNUJHUMWMS	BRMWTNUJHUMWMS-LURJTMIESA-N	InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11...	NULL	169.085	N(tele)-methyl-L-histidine	C7H11N3O2
4	RAMP_C_000000002	hmdb	hmdb:HMDB0000002	NCCCN	XFNJVJPLKCPIBV	XFNJVJPLKCPIBV-UHFFFAOYSA-N	InChI=1S/C3H10N2/c4-2-1-3-5/h1-5H2	74.1249	74.0844	1,3-Diaminopropane	C3H10N2
...	...	...	...	...	...	...	...	...	...	...	...
275898	RAMP_C_000258279	lipidmaps	LIPIDMAPS:LMPK15050003	C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=C(O...	UXLMJHNFDRMGPW	UXLMJHNFDRMGPW-LJQANCHMSA-N	InChI=1S/C24H38O6/c1-4-5-6-7-8-9-10-11-12-13-1...	NULL	422.267	2-hydroxy-5-methoxy-3-(2R-acetoxy-pentadecyl)-...	C24H38O6
275899	RAMP_C_000258280	lipidmaps	LIPIDMAPS:LMPK15050004	C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=CC(...	CVZNKLNAHBTINT	CVZNKLNAHBTINT-JOCHJYFZSA-N	InChI=1S/C24H38O5/c1-4-5-6-7-8-9-10-11-12-13-1...	NULL	406.272	5-methoxy-3-(2R-acetoxy-pentadecyl)-1,4-benzoq...	C24H38O5
275900	RAMP_C_000226089	lipidmaps	LIPIDMAPS:LMPK15050005	C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCC)=CC(=O...	JIUGZSYPFREDLG	JIUGZSYPFREDLG-HXUWFJFHSA-N	InChI=1S/C22H34O5/c1-4-5-6-7-8-9-10-11-12-13-2...	NULL	378.241	5-methoxy-3-(2R-acetoxy-tridecyl)-1,4-benzoqui...	C22H34O5
275901	RAMP_C_000258283	lipidmaps	LIPIDMAPS:LMPK15050008	C1(O)C(=O)C(CCCCCCCCCCCCCCC)=C(O)C(=O)C=1	GXDURRGUXLDZKN	GXDURRGUXLDZKN-UHFFFAOYSA-N	InChI=1S/C21H34O4/c1-2-3-4-5-6-7-8-9-10-11-12-...	NULL	350.246	Suberonone	C21H34O4
275902	RAMP_C_000258284	lipidmaps	LIPIDMAPS:LMPK15050009	C1(O)C(=O)C(CCCCCCCCCCCCC)=C(O)C(=O)C=1	AMKNOBHCKRZHIO	AMKNOBHCKRZHIO-UHFFFAOYSA-N	InChI=1S/C19H30O4/c1-2-3-4-5-6-7-8-9-10-11-12-...	NULL	322.214	Rapanone	C19H30O4

275903 rows × 11 columns

Raw RaMP data can also be accessed as an SQLite database. The advantages are the high performance and flexibility of operations. Conversion to and from pandas is easy, so you can always get the result as a data frame. Below, con is a database connection ready to execute your queries. It is an in-memory database; using an on-disk database instead is also possible. We use pypath.formats.sqlite to look into the SQLite database.

[11]:

con = ramp.ramp_raw(['source', 'chem_props', 'analytesynonym'], sqlite = True)
con

executed in 10.56s, finished 17:07:00 2023-03-09

[11]:

<sqlite3.Connection at 0x6fa1e9e4e940>

Now that we have loaded these 3 big tables both as data frames and as SQLite tables, let’s see how much memory they use (normally half of this is enough, and the tables should stay in memory only for short periods):

[13]:

from pypath.share import common
common.format_bytes(common.python_memory_usage())

executed in 0ms, finished 17:07:44 2023-03-09

[13]:

'3.7 GB'

Looking into the database, we see the 3 tables loaded, and their column names:

[19]:

from pypath.formats import sqlite
sqlite.list_columns(con)

executed in 0ms, finished 17:13:01 2023-03-09

[19]:

{'source': ['sourceId',
  'rampId',
  'IDtype',
  'geneOrCompound',
  'commonName',
  'priorityHMDBStatus',
  'dataSource',
  'pathwayCount'],
 'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
 'chem_props': ['ramp_id',
  'chem_data_source',
  'chem_source_id',
  'iso_smiles',
  'inchi_key_prefix',
  'inchi_key',
  'inchi',
  'mw',
  'monoisotop_mass',
  'common_name',
  'mol_formula']}

Let’s see how to execute an SQL query and fetch the output into a data frame. This query takes the source table, selects the records with HMDB and ChEBI IDs in two subqueries, and joins the two by rampId, in order to obtain an HMDB ←→ ChEBI mapping table:

[22]:

import pandas as pd

# SQL string literals are single quoted; double quotes denote identifiers in standard SQL
query = (
    "SELECT DISTINCT a.sourceId AS hmdb, b.sourceId AS chebi "
    "FROM "
    "   (SELECT sourceId, rampId "
    "    FROM source "
    "    WHERE geneOrCompound = 'compound' AND IDtype = 'hmdb') a "
    "JOIN "
    "   (SELECT sourceId, rampId "
    "    FROM source "
    "    WHERE geneOrCompound = 'compound' AND IDtype = 'chebi') b "
    "ON a.rampId = b.rampId;"
)
df = pd.read_sql_query(query, con)
df

                          

executed in 1ms, finished 17:18:37 2023-03-09

[22]:

	hmdb	chebi
0	hmdb:HMDB0000001	chebi:27596
1	hmdb:HMDB0000001	chebi:50599
2	hmdb:HMDB0000479	chebi:27596
3	hmdb:HMDB0000479	chebi:50599
4	hmdb:HMDB00001	chebi:27596
...	...	...
104129	hmdb:HMDB0126033	chebi:25882
104130	hmdb:HMDB0141947	chebi:180150
104131	hmdb:HMDB0128505	chebi:7870
104132	hmdb:HMDB0130984	chebi:8227
104133	hmdb:HMDB0130987	chebi:8630

104134 rows × 2 columns
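
The same kind of join can be tried without downloading RaMP, on a toy in-memory table built with the standard library sqlite3 module. The rows below are copied from the head of the source table shown earlier and reduced to four columns; the query is written as a self-join, which is a simplified variant of the subquery form above rather than the exact same statement.

```python
import sqlite3

# in-memory database; pass a file path instead of ':memory:' to keep it on disk
toy_con = sqlite3.connect(':memory:')
toy_con.execute(
    'CREATE TABLE source '
    '(sourceId TEXT, rampId TEXT, IDtype TEXT, geneOrCompound TEXT)'
)
rows = [
    ('hmdb:HMDB0000001', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('hmdb:HMDB0000479', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('chebi:50599', 'RAMP_C_000000001', 'chebi', 'compound'),
    ('kegg:C01152', 'RAMP_C_000000001', 'kegg', 'compound'),
]
toy_con.executemany('INSERT INTO source VALUES (?, ?, ?, ?)', rows)

# pair every HMDB ID with every ChEBI ID that shares the same rampId
toy_query = (
    "SELECT DISTINCT a.sourceId AS hmdb, b.sourceId AS chebi "
    "FROM source a JOIN source b ON a.rampId = b.rampId "
    "WHERE a.IDtype = 'hmdb' AND b.IDtype = 'chebi' "
    "ORDER BY hmdb, chebi"
)
result = toy_con.execute(toy_query).fetchall()
toy_con.close()
```

Each HMDB ID is paired with every ChEBI ID that shares its rampId, which is why one RaMP compound can yield several rows in the mapping table.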

Such mapping tables can be easily accessed for any pairs of identifiers by the ramp_mapping function. Before that, let’s see the complete list of supported ID types:

[24]:

ramp.ramp_id_types()

                          

executed in 4.45s, finished 17:23:09 2023-03-09

[24]:

{'CAS',
 'EN',
 'LIPIDMAPS',
 'brenda',
 'chebi',
 'chemspider',
 'ensembl',
 'entrez',
 'gene_symbol',
 'hmdb',
 'kegg',
 'kegg_glycan',
 'lipidbank',
 'ncbiprotein',
 'plantfa',
 'pubchem',
 'swisslipids',
 'uniprot',
 'wikidata'}

[31]:

ramp.ramp_mapping('LIPIDMAPS', 'swisslipids')

                          

executed in 4.94s, finished 17:29:17 2023-03-09

[31]:

{'LMFA00000008': {'SLM:000390048'},
 'LMFA01010001': {'SLM:000000510'},
 'LMFA01010002': {'SLM:000000449'},
 'LMFA01010003': {'SLM:000001194'},
 'LMFA01010004': {'SLM:000001195'},
 'LMFA01010005': {'SLM:000389552'},
 'LMFA01010006': {'SLM:000001196'},
 'LMFA01010007': {'SLM:000389947'},
 'LMFA01010008': {'SLM:000000853'},
 'LMFA01010010': {'SLM:000000855'},
 'LMFA01010011': {'SLM:000389946'},
 'LMFA01010012': {'SLM:000000719'},
 'LMFA01010013': {'SLM:000001198'},
 'LMFA01010014': {'SLM:000000825'},
 'LMFA01010015': {'SLM:000001199'},
 'LMFA01010017': {'SLM:000001095'},
 'LMFA01010019': {'SLM:000001205'},
 'LMFA01010020': {'SLM:000000829'},
 'LMFA01010021': {'SLM:000001207'},
 'LMFA01010022': {'SLM:000000827'},
 'LMFA01010023': {'SLM:000001128'},
 'LMFA01010024': {'SLM:000000414'},
 'LMFA01010026': {'SLM:000000539'},
 'LMFA01010027': {'SLM:000000980'},
 'LMFA01010028': {'SLM:000000540'},
 'LMFA01010030': {'SLM:000000543'},
 'LMFA01010032': {'SLM:000000544'},
 'LMFA01010034': {'SLM:00000

Output truncated: showing 1000 of 44684 characters

Above we got a dict of sets; alternatively, data frames are available:

[32]:

ramp.ramp_mapping('LIPIDMAPS', 'swisslipids', return_df = True)

                          

executed in 4.63s, finished 17:30:27 2023-03-09

[32]:

	id_type_a	id_type_b
0	LMST02030086	SLM:000485328
1	LMST02030087	SLM:000485330
2	LMSP06020013	SLM:000000534
3	LMST02020001	SLM:000001055
4	LMST02020001	SLM:000485315
...	...	...
35218	LMPR0104010007	SLM:000389242
35219	LMPR0104030005	SLM:000390232
35220	LMPR0104030006	SLM:000390227
35221	LMPR01070626	SLM:000000432
35222	LMPR01090015	SLM:000389419

35223 rows × 2 columns
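
The two return formats carry the same information: every (key, member) pair of the dict of sets corresponds to one row of the two-column table. A minimal sketch of that relationship in plain Python, using two entries copied from the dict output above (this does not reproduce how ramp_mapping builds its data frame):

```python
# two entries copied from the dict of sets shown earlier
mapping_dict = {
    'LMFA00000008': {'SLM:000390048'},
    'LMFA01010001': {'SLM:000000510'},
}

# one (id_type_a, id_type_b) row per pair, sorted for a stable order
mapping_rows = sorted(
    (id_a, id_b)
    for id_a, ids_b in mapping_dict.items()
    for id_b in ids_b
)
```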

RaMP ID translation is also integrated into the higher level APIs in pypath.utils.mapping. Below, we first look into the available ID types and translation tables:

[34]:

from pypath.utils import mapping
m = mapping.get_mapper()
m.id_types()

                          

executed in 0ms, finished 17:38:25 2023-03-09

[34]:

{IdType(pypath='CAS', original='CAS'),
 IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
 IdType(pypath='MedChemExpress', original='MedChemExpress'),
 IdType(pypath='actor', original='actor'),
 IdType(pypath='affy', original='affy'),
 IdType(pypath='affymetrix', original='affymetrix'),
 IdType(pypath='agilent', original='agilent'),
 IdType(pypath='alzforum', original='Alzforum_mut'),
 IdType(pypath='araport', original='Araport'),
 IdType(pypath='atlas', original='atlas'),
 IdType(pypath='bindingdb', original='bindingdb'),
 IdType(pypath='brenda', original='brenda'),
 IdType(pypath='carotenoiddb', original='carotenoiddb'),
 IdType(pypath='cas', original='CAS'),
 IdType(pypath='cas_id', original='CAS'),
 IdType(pypath='cgnc', original='CGNC'),
 IdType(pypath='chebi', original='chebi'),
 IdType(pypath='chembl', original='chembl'),
 IdType(pypath='chemicalbook', original='chemicalbook'),
 IdType(pypath='chemspider', original='chemspider'),
 IdType(pypath='clinicaltrials', original='clinic

Output truncated: showing 1000 of 7422 characters

These are ID types not only from RaMP, but all the supported resources. In the mapping table definitions, as translation between any two ID types is supported, id_type_b is always None:

[35]:

[t for t in m.mapping_tables() if t.resource == 'ramp']

                          

executed in 0ms, finished 17:46:56 2023-03-09

[35]:

[MappingTableDefinition(id_type_a='kegg_glycan', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg_glycan', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='hmdb', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='hmdb', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='wikidata', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='wikidata', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='LIPIDMAPS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='LIPIDMAPS', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='kegg', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='CAS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='CAS', resource_id_type_b=None),
 MappingTableDefinition(id_type_a='chebi

Output truncated: showing 1000 of 3238 characters

TL;DR§

Up until this point this section is about extra insights, but what 99% of the users will do looks like this:

[36]:

from pypath.utils import mapping
mapping.map_name('131431', 'chebi', 'hmdb')

executed in 0ms, finished 17:53:38 2023-03-09

[36]:

{'HMDB0094709'}

HMDB (Human Metabolome Database)§

Direct access to HMDB data§

The inputs.hmdb module processes metabolite and protein data using lxml.etree and some minimal utilities from formats.xml. The metabolite or protein records are available as lxml.etree.Element objects, or custom fields can be extracted into dicts or data frames. To iterate through the XML elements, each representing a metabolite:

[1]:

from pypath.inputs import hmdb
next(hmdb.iter_metabolites())

executed in 1ms, finished 12:23:11 2023-04-24

[1]:

<Element {http://www.hmdb.ca}metabolite at 0x60b1846262c0>
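
Elements like the one above live in the http://www.hmdb.ca XML namespace, so lookups by tag name need a namespace mapping. The sketch below uses the standard library xml.etree.ElementTree as a stand-in, since its find and findtext calls have counterparts of the same name in lxml.etree. The toy record and the child tag names accession and name are assumptions for illustration; only the metabolite tag and the namespace are taken from the output above.

```python
import xml.etree.ElementTree as ET

# prefix -> namespace URI, used to resolve the 'hmdb:' prefix in lookups
NS = {'hmdb': 'http://www.hmdb.ca'}

# Toy record: the child tags are assumed, not copied from an HMDB file.
xml_text = (
    '<metabolite xmlns="http://www.hmdb.ca">'
    '<accession>HMDB0000001</accession>'
    '<name>1-Methylhistidine</name>'
    '</metabolite>'
)

elem = ET.fromstring(xml_text)

# the tag carries the namespace in Clark notation: {uri}localname
tag = elem.tag

# findtext returns the text of the first matching child element
accession = elem.findtext('hmdb:accession', namespaces=NS)
name = elem.findtext('hmdb:name', namespaces=NS)
```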

You can use lxml.etree’s methods directly on the Element objects to extract information. An easier and more flexible way to extract information from these XML records is to define a schema with instructions for lxml. A full schema for HMDB metabolites is available in hmdb.METABOLITES_SCHEMA:

[2]:

                              hmdb.METABOLITES_SCHEMA

                            

executed in 0ms, finished 12:24:03 2023-04-24

[2]:

{'taxonomy': ('taxonomy',
  {'description': ('description', None),
   'direct_parent': ('direct_parent', None),
   'kingdom': ('kingdom', None),
   'class': ('class', None),
   'sub_class': ('sub_class', None),
   'molecular_framework': ('molecular_framework', None),
   'alternative_parents': ('alternative_parents',
    ('alternative_parent', 'findall'),
    None),
   'substituents': ('substituents', ('substituent', 'findall'), None)}),
 'spectra': ('spectra', ('spectrum', 'findall'), {'spectrum_id', 'type'}),
 'biological_properties': ('biological_properties',
  {'cellular_locations': ('cellular_locations', ('cellular', 'findall'), None),
   'biospecimen_locations': ('biospecimen_locations',
    ('biospecimen', 'findall'),
    None),
   'tissue_locations': ('tissue_locations', ('tissue', 'findall'), None),
   'pathways': ('pathways',
    ('pathway', 'findall'),
    {'kegg_map_id', 'name', 'smpdb_id'})}),
 'experimental_properties': ('experimental_properties',
  ('property', 'findall')

Output truncated: showing 1000 of 4037 characters

The schema for proteins is different:

[3]:

                              hmdb.PROTEINS_SCHEMA

                            

executed in 0ms, finished 12:24:52 2023-04-24

[3]:

{'gene_properties': ('gene_properties',
  {'chromosome_location': ('chromosome_location', None),
   'locus': ('locus', None),
   'gene_sequence': ('gene_sequence', None)}),
 'protein_properties': ('protein_properties',
  {'residue_number': ('residue_number', None),
   'molecular_weight': ('molecular_weight', None),
   'theoretical_pi': ('theoretical_pi', None),
   'polypeptide_sequence': ('polypeptide_sequence', None),
   'transmembrane_regions': ('transmembrane_regions',
    ('region', 'findall'),
    None),
   'signal_regions': ('signal_regions', ('region', 'findall'), None)}),
 'pfams': ('pfams', ('pfam', 'findall'), {'name', 'pfam_id'}),
 'metabolite_associations': ('metabolite_associations',
  ('metabolite', 'findall'),
  {'accession', 'name'}),
 'go_classifications': ('go_classifications',
  ('go_class', 'findall'),
  {'category', 'description', 'go_id'}),
 'pathways': ('pathways',
  ('pathway', 'findall'),
  {'kegg_map_id', 'name', 'smpdb_id'}),
 'general_references': ('general_

Output truncated: showing 1000 of 2072 characters
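To make the shape of these schemas more concrete, here is a minimal sketch of how such a structure could be interpreted. It is a guess at the semantics based only on the printed schemas, not pypath's implementation: a plain string is taken as a `find` step, a `(tag, 'findall')` tuple as a step that yields a list, and the last item says what to collect (None for text, a set for a dict of child texts, a dict for a nested schema). The functions `extract`, `_walk` and `_collect`, the sample XML and its pathway values are all hypothetical; namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

def extract(elem, spec):
    """Follow the steps in `spec`; its last item says what to collect."""
    *steps, leaf = spec
    return _walk(elem, steps, leaf)

def _walk(elem, steps, leaf):
    if elem is None:  # the path does not exist in this record
        return None
    if not steps:
        return _collect(elem, leaf)
    step, rest = steps[0], steps[1:]
    if isinstance(step, tuple):  # (tag, 'findall'): one result per match
        tag, _method = step
        return [_walk(child, rest, leaf) for child in elem.findall(tag)]
    return _walk(elem.find(step), rest, leaf)

def _collect(elem, leaf):
    if leaf is None:  # plain text value
        return elem.text
    if isinstance(leaf, set):  # dict of the named child texts
        return {tag: elem.findtext(tag) for tag in sorted(leaf)}
    return {key: extract(elem, sub) for key, sub in leaf.items()}  # nested schema

# A small schema in the same style as the ones printed above.
SCHEMA = {
    'accession': ('accession', None),
    'taxonomy': ('taxonomy', {
        'kingdom': ('kingdom', None),
        'alternative_parents': (
            'alternative_parents', ('alternative_parent', 'findall'), None,
        ),
    }),
    'pathways': ('pathways', ('pathway', 'findall'), {'name', 'smpdb_id'}),
}

# Made-up record; the pathway name and ID are placeholders.
XML = """
<metabolite>
  <accession>HMDB0000001</accession>
  <taxonomy>
    <kingdom>Organic compounds</kingdom>
    <alternative_parents>
      <alternative_parent>Amino acids</alternative_parent>
      <alternative_parent>Aralkylamines</alternative_parent>
    </alternative_parents>
  </taxonomy>
  <pathways>
    <pathway><name>Example pathway</name><smpdb_id>SMP-EXAMPLE</smpdb_id></pathway>
  </pathways>
</metabolite>
""".strip()

elem = ET.fromstring(XML)
record = {key: extract(elem, spec) for key, spec in SCHEMA.items()}
print(record)
```

The point of the sketch is that a smaller schema means fewer lookups per record, which is presumably why passing only the fields of interest saves time.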

By default the full schema is used by hmdb.metabolites_raw and hmdb.proteins_raw, but you can pass a smaller dict with only your fields of interest, which greatly reduces the processing time. Using the head argument we peek into the first N records of the data:

[4]:

                              list(hmdb.metabolites_raw(head = 3))

                            

executed in 0ms, finished 12:25:31 2023-04-24

[4]:

[{'taxonomy': {'description': ' belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom.',
   'direct_parent': 'Histidine and derivatives',
   'kingdom': 'Organic compounds',
   'class': 'Carboxylic acids and derivatives',
   'sub_class': 'Amino acids, peptides, and analogues',
   'molecular_framework': 'Aromatic heteromonocyclic compounds',
   'alternative_parents': ['Amino acids',
    'Aralkylamines',
    'Azacyclic compounds',
    'Carbonyl compounds',
    'Carboxylic acids',
    'Heteroaromatic compounds',
    'Hydrocarbon derivatives',
    'Imidazolyl carboxylic acids and derivatives',
    'L-alpha-amino acids',
    'Monoalkylamines',
    'Monocarboxylic acids and derivatives',
    'N-substituted imidazoles',
    'Organic oxides',

Output truncated: showing 1000 of 132354 characters

The returned nested dict corresponds to the schema. Another example with a schema that extracts only the accession and name fields:

[6]:

list(hmdb.metabolites_raw(
    schema = {
        'accession': hmdb.METABOLITES_SCHEMA['accession'],
        'name': hmdb.METABOLITES_SCHEMA['name'],
    },
    head = 20,
))

                            

executed in 0ms, finished 12:25:55 2023-04-24

[6]:

[{'accession': 'HMDB0000001', 'name': '1-Methylhistidine'},
 {'accession': 'HMDB0000002', 'name': '1,3-Diaminopropane'},
 {'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
 {'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'},
 {'accession': 'HMDB0000010', 'name': '2-Methoxyestrone'},
 {'accession': 'HMDB0000011', 'name': '3-Hydroxybutyric acid'},
 {'accession': 'HMDB0000012', 'name': 'Deoxyuridine'},
 {'accession': 'HMDB0000014', 'name': 'Deoxycytidine'},
 {'accession': 'HMDB0000015', 'name': 'Cortexolone'},
 {'accession': 'HMDB0000016', 'name': 'Deoxycorticosterone'},
 {'accession': 'HMDB0000017', 'name': '4-Pyridoxic acid'},
 {'accession': 'HMDB0000019', 'name': 'alpha-Ketoisovaleric acid'},
 {'accession': 'HMDB0000020', 'name': 'p-Hydroxyphenylacetic acid'},
 {'accession': 'HMDB0000021', 'name': 'Iodotyrosine'},
 {'accession': 'HMDB0000022', 'name': '3-Methoxytyramine'},
 {'accession': 'HMDB0000023', 'name': '(S)-3-Hydroxyisobutyric acid'},
 {'accession': 'HMDB00

Output truncated: showing 1000 of 1291 characters

It works in a similar way for proteins:

[7]:

list(hmdb.proteins_raw(
    schema = {
        'name': hmdb.PROTEINS_SCHEMA['name'],
        'genesymbol': hmdb.PROTEINS_SCHEMA['gene_name'],
    },
    head = 20,
))

                            

executed in 0ms, finished 12:29:23 2023-04-24

[7]:

[{'name': "5'-nucleotidase", 'genesymbol': 'NT5E'},
 {'name': 'Deoxycytidylate deaminase', 'genesymbol': 'DCTD'},
 {'name': 'UMP-CMP kinase', 'genesymbol': 'CMPK1'},
 {'name': "Cytosolic 5'-nucleotidase 1B", 'genesymbol': 'NT5C1B'},
 {'name': "Cytosolic 5'-nucleotidase 1A", 'genesymbol': 'NT5C1A'},
 {'name': "5'(3')-deoxyribonucleotidase, cytosolic type",
  'genesymbol': 'NT5C'},
 {'name': 'Deoxycytidine kinase', 'genesymbol': 'DCK'},
 {'name': "5'(3')-deoxyribonucleotidase, mitochondrial", 'genesymbol': 'NT5M'},
 {'name': 'Hydroxymethylglutaryl-CoA lyase, mitochondrial',
  'genesymbol': 'HMGCL'},
 {'name': 'ATP-citrate synthase', 'genesymbol': 'ACLY'},
 {'name': 'Histone acetyltransferase p300', 'genesymbol': 'EP300'},
 {'name': 'Pyruvate dehydrogenase E1 component subunit beta, mitochondrial',
  'genesymbol': 'PDHB'},
 {'name': 'Acetyl-CoA acetyltransferase, cytosolic', 'genesymbol': 'ACAT2'},
 {'name': 'CREB-binding protein', 'genesymbol': 'CREBBP'},
 {'name': 'Diamine acetyltransfe

Output truncated: showing 1000 of 1478 characters

Higher level access to HMDB data§

The hmdb.metabolites_table and hmdb.proteins_table functions process the records into a pandas data frame. They accept a list of positional or named arguments in a simple notation (see their documentation). Instead of plain tuples or strings, hmdb.Field objects can also be used to define the fields; the arguments of Field follow the same format as the tuples or strings passed directly to hmdb.*_table. Let’s extract a data frame with SMILES, InChIKeys and HMDB accessions:

[8]:

                              hmdb.metabolites_table('accession', 'smiles', 'inchikey', head = 10)

                            

executed in 0ms, finished 12:32:01 2023-04-24

[8]:

	accession	smiles	inchikey
0	HMDB0000001	CN1C=NC(C[C@H](N)C(O)=O)=C1	BRMWTNUJHUMWMS-LURJTMIESA-N
1	HMDB0000002	NCCCN	XFNJVJPLKCPIBV-UHFFFAOYSA-N
2	HMDB0000005	CCC(=O)C(O)=O	TYEYBOSBBBHJIV-UHFFFAOYSA-N
3	HMDB0000008	CC[C@H](O)C(O)=O	AFENDNXGAFYKQO-VKHMYHEASA-N
4	HMDB0000010	[H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[...	WHEUWNKSCXYKBU-QPWUGHHJSA-N
5	HMDB0000011	C[C@@H](O)CC(O)=O	WHBMMWSBFZVSSR-GSVOUGTGSA-N
6	HMDB0000012	OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O	MXHRCPNRJAMMIM-SHYZEUOFSA-N
7	HMDB0000014	NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1	CKTSBUTUHBMZGZ-SHYZEUOFSA-N
8	HMDB0000015	[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(...	WHBHBVVOGNECLV-OBQKJFGGSA-N
9	HMDB0000016	[H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H...	ZESRJSPZRDMNHY-YFWFAHHUSA-N
10	HMDB0000017	CC1=NC=C(CO)C(C(O)=O)=C1O	HXACOUQIXZGNBF-UHFFFAOYSA-N

The above example is simple, as each field has a single string value. The synonyms field is an array within each record; below we first process it as an array column, i.e. each row contains an array:

[9]:

                              hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)

                            

executed in 0ms, finished 12:32:13 2023-04-24

[9]:

	accession	name	synonyms
0	HMDB0000001	1-Methylhistidine	[(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)pro...
1	HMDB0000002	1,3-Diaminopropane	[1,3-Propanediamine, 1,3-Propylenediamine, Pro...
2	HMDB0000005	2-Ketobutyric acid	[2-Ketobutanoic acid, 2-Oxobutyric acid, 3-Met...
3	HMDB0000008	2-Hydroxybutyric acid	[(S)-2-Hydroxybutanoic acid, 2-Hydroxybutyrate...
4	HMDB0000010	2-Methoxyestrone	[2-(8S,9S,13S,14S)-3-Hydroxy-2-methoxy-13-meth...
5	HMDB0000011	3-Hydroxybutyric acid	[(R)-(-)-beta-Hydroxybutyric acid, (R)-3-Hydro...
6	HMDB0000012	Deoxyuridine	[2-Deoxyuridine, dU, 2'-Deoxyuridine, 1-(2-Deo...
7	HMDB0000014	Deoxycytidine	[4-Amino-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymet...
8	HMDB0000015	Cortexolone	[11-Desoxy-17-hydroxycorticosterone, Cortodoxo...
9	HMDB0000016	Deoxycorticosterone	[21-Hydroxy-4-pregnene-3,20-dione, 21-Hydroxyp...
10	HMDB0000017	4-Pyridoxic acid	[2-Methyl-3-hydroxy-4-carboxy-5-hydroxymethylp...

Each element in the column is an array:

[10]:

hmdb_synonyms = hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)
hmdb_synonyms.synonyms[0]

executed in 0ms, finished 12:32:19 2023-04-24

[10]:

['(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
 'Pi-methylhistidine',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
 '1 Methylhistidine',
 '1-Methyl histidine',
 '1-Methyl-histidine',
 '1-Methyl-L-histidine',
 '1-MHis',
 '1-N-Methyl-L-histidine',
 'L-1-Methylhistidine',
 'N1-Methyl-L-histidine',
 '1-Methylhistidine dihydrochloride',
 '1-Methylhistidine']

Using the @ notation, the arrays can be expanded into multiple rows:

[11]:

                              hmdb.metabolites_table('accession', 'name', ('synonyms', '@'), head = 10)

                            

executed in 0ms, finished 12:32:25 2023-04-24

[11]:

	accession	name	synonyms
0	HMDB0000001	1-Methylhistidine	(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop...
1	HMDB0000001	1-Methylhistidine	Pi-methylhistidine
2	HMDB0000001	1-Methylhistidine	(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop...
3	HMDB0000001	1-Methylhistidine	1 Methylhistidine
4	HMDB0000001	1-Methylhistidine	1-Methyl histidine
...	...	...	...
291	HMDB0000017	4-Pyridoxic acid	3-Hydroxy-5-hydroxymethyl-2-methyl-isonicotins...
292	HMDB0000017	4-Pyridoxic acid	4 Pyridoxinic acid
293	HMDB0000017	4-Pyridoxic acid	Pyridoxinecarboxylic acid
294	HMDB0000017	4-Pyridoxic acid	4 Pyridoxylic acid
295	HMDB0000017	4-Pyridoxic acid	4 Pyridoxic acid

296 rows × 3 columns

This already resulted in almost 300 rows: be careful when using @ on multiple columns, as it yields rows in a combinatorial way, and the resulting data frames can easily grow huge. Another notation is *, which extracts all elements of a dict into multiple columns. Below we apply it to the taxonomy field, which is a dict of multiple fields:

[12]:

                              hmdb.metabolites_table('accession', 'name', ('taxonomy', '*'), head = 10)

                            

executed in 0ms, finished 12:32:30 2023-04-24

[12]:

	accession	name	taxonomy__alternative_parents	taxonomy__class	taxonomy__description	taxonomy__direct_parent	taxonomy__kingdom	taxonomy__molecular_framework	taxonomy__sub_class	taxonomy__substituents
0	HMDB0000001	1-Methylhistidine	[Amino acids, Aralkylamines, Azacyclic compoun...	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	[Alpha-amino acid, Amine, Amino acid, Aralkyla...
1	HMDB0000002	1,3-Diaminopropane	[Hydrocarbon derivatives, Organopnictogen comp...	Organonitrogen compounds	belongs to the class of organic compounds kno...	Monoalkylamines	Organic compounds	Aliphatic acyclic compounds	Amines	[Aliphatic acyclic compound, Hydrocarbon deriv...
2	HMDB0000005	2-Ketobutyric acid	[Alpha-hydroxy ketones, Alpha-keto acids and d...	Keto acids and derivatives	belongs to the class of organic compounds kno...	Short-chain keto acids and derivatives	Organic compounds	Aliphatic acyclic compounds	Short-chain keto acids and derivatives	[Aliphatic acyclic compound, Alpha-hydroxy ket...
3	HMDB0000008	2-Hydroxybutyric acid	[Carbonyl compounds, Carboxylic acids, Fatty a...	Hydroxy acids and derivatives	belongs to the class of organic compounds kno...	Alpha hydroxy acids and derivatives	Organic compounds	Aliphatic acyclic compounds	Alpha hydroxy acids and derivatives	[Alcohol, Aliphatic acyclic compound, Alpha-hy...
4	HMDB0000010	2-Methoxyestrone	[1-hydroxy-2-unsubstituted benzenoids, 17-oxos...	Steroids and steroid derivatives	belongs to the class of organic compounds kno...	Estrogens and derivatives	Organic compounds	Aromatic homopolycyclic compounds	Estrane steroids	[1-hydroxy-2-unsubstituted benzenoid, 17-oxost...
5	HMDB0000011	3-Hydroxybutyric acid	[Carbonyl compounds, Carboxylic acids, Fatty a...	Hydroxy acids and derivatives	belongs to the class of organic compounds kno...	Beta hydroxy acids and derivatives	Organic compounds	Aliphatic acyclic compounds	Beta hydroxy acids and derivatives	[Alcohol, Aliphatic acyclic compound, Beta-hyd...
6	HMDB0000012	Deoxyuridine	[Azacyclic compounds, Heteroaromatic compounds...	Pyrimidine nucleosides	belongs to the class of organic compounds kno...	Pyrimidine 2'-deoxyribonucleosides	Organic compounds	Aromatic heteromonocyclic compounds	Pyrimidine 2'-deoxyribonucleosides	[Alcohol, Aromatic heteromonocyclic compound, ...
7	HMDB0000014	Deoxycytidine	[Aminopyrimidines and derivatives, Azacyclic c...	Pyrimidine nucleosides	belongs to the class of organic compounds kno...	Pyrimidine 2'-deoxyribonucleosides	Organic compounds	Aromatic heteromonocyclic compounds	Pyrimidine 2'-deoxyribonucleosides	[Alcohol, Amine, Aminopyrimidine, Aromatic het...
8	HMDB0000015	Cortexolone	[17-hydroxysteroids, 20-oxosteroids, 3-oxo del...	Steroids and steroid derivatives	belongs to the class of organic compounds kno...	21-hydroxysteroids	Organic compounds	Aliphatic homopolycyclic compounds	Hydroxysteroids	[17-hydroxysteroid, 20-oxosteroid, 21-hydroxys...
9	HMDB0000016	Deoxycorticosterone	[20-oxosteroids, 3-oxo delta-4-steroids, Alpha...	Steroids and steroid derivatives	belongs to the class of organic compounds kno...	21-hydroxysteroids	Organic compounds	Aliphatic homopolycyclic compounds	Hydroxysteroids	[20-oxosteroid, 21-hydroxysteroid, 3-oxo-delta...
10	HMDB0000017	4-Pyridoxic acid	[Aromatic alcohols, Azacyclic compounds, Carbo...	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	[Alcohol, Aromatic alcohol, Aromatic heteromon...
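The column names in the table above suggest that each sub-field of the dict becomes its own column, prefixed with the parent field name and a double underscore. A minimal sketch of that idea on a plain dict follows; the `flatten` helper is hypothetical and only mimics the naming seen in the output, it is not how pypath does it internally.

```python
def flatten(record, field, sep = '__'):
    """Replace the dict under `field` with one prefixed key per sub-field."""
    flat = {key: value for key, value in record.items() if key != field}
    # Sorted sub-fields give the alphabetical column order seen above.
    for sub, value in sorted(record[field].items()):
        flat[f'{field}{sep}{sub}'] = value
    return flat

record = {
    'accession': 'HMDB0000001',
    'name': '1-Methylhistidine',
    'taxonomy': {
        'kingdom': 'Organic compounds',
        'class': 'Carboxylic acids and derivatives',
        'alternative_parents': ['Amino acids', 'Aralkylamines'],
    },
}

row = flatten(record, 'taxonomy')
print(list(row))
```

Array-valued sub-fields such as alternative_parents stay arrays after this step; they are only split into rows when they are also expanded with @.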

We see that taxonomy gave rise to 8 columns. If we also expand all the arrays in those columns, we get a data frame of more than 2,000 rows from only the first 10 records:

[13]:

                              hmdb.metabolites_table('accession', 'name', ('taxonomy', '*', '@'), head = 10)

                            

executed in 0ms, finished 12:32:37 2023-04-24

[13]:

	accession	name	taxonomy__alternative_parents	taxonomy__class	taxonomy__description	taxonomy__direct_parent	taxonomy__kingdom	taxonomy__molecular_framework	taxonomy__sub_class	taxonomy__substituents
0	HMDB0000001	1-Methylhistidine	Amino acids	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	Alpha-amino acid
1	HMDB0000001	1-Methylhistidine	Amino acids	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	Amine
2	HMDB0000001	1-Methylhistidine	Amino acids	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	Amino acid
3	HMDB0000001	1-Methylhistidine	Amino acids	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	Aralkylamine
4	HMDB0000001	1-Methylhistidine	Amino acids	Carboxylic acids and derivatives	belongs to the class of organic compounds kno...	Histidine and derivatives	Organic compounds	Aromatic heteromonocyclic compounds	Amino acids, peptides, and analogues	Aromatic heteromonocyclic compound
...	...	...	...	...	...	...	...	...	...	...
2235	HMDB0000017	4-Pyridoxic acid	Vinylogous acids	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	Organooxygen compound
2236	HMDB0000017	4-Pyridoxic acid	Vinylogous acids	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	Organopnictogen compound
2237	HMDB0000017	4-Pyridoxic acid	Vinylogous acids	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	Primary alcohol
2238	HMDB0000017	4-Pyridoxic acid	Vinylogous acids	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	Pyridine carboxylic acid
2239	HMDB0000017	4-Pyridoxic acid	Vinylogous acids	Pyridines and derivatives	belongs to the class of organic compounds kno...	Pyridinecarboxylic acids	Organic compounds	Aromatic heteromonocyclic compounds	Pyridinecarboxylic acids and derivatives	Vinylogous acid

2240 rows × 10 columns
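The row counts grow this fast because expanding several array columns at once amounts to a Cartesian product: each record contributes as many rows as the product of its array lengths. A minimal sketch with itertools.product illustrates this; the `explode` helper is hypothetical and the record is a shortened version of the first one above.

```python
from itertools import product

def explode(record, fields):
    """Yield one row per combination of elements of the list-valued `fields`."""
    pools = [record[field] for field in fields]
    for combo in product(*pools):
        # Copy the record and overwrite each array with a single element.
        yield {**record, **dict(zip(fields, combo))}

record = {
    'accession': 'HMDB0000001',
    'alternative_parents': ['Amino acids', 'Aralkylamines', 'Azacyclic compounds'],
    'substituents': ['Alpha-amino acid', 'Amine', 'Amino acid', 'Aralkylamine'],
}

rows = list(explode(record, ['alternative_parents', 'substituents']))
print(len(rows))  # 3 parents x 4 substituents
```

In this sketch a record with an empty array yields no rows at all, since a product with an empty pool is empty; whether pypath keeps such records is not covered here.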

The hmdb.metabolites_mapping and hmdb.proteins_mapping functions provide data frames or dicts for translation between a pair of identifier types. For example, to translate KEGG compound IDs to SMILES (the default output is a dict of sets):

[14]:

                              hmdb.metabolites_mapping('kegg', 'smiles', head = 10)

                            

executed in 0ms, finished 12:33:27 2023-04-24

[14]:

{'C00109': {'CCC(=O)C(O)=O'},
 'C00526': {'OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O'},
 'C00847': {'CC1=NC=C(CO)C(C(O)=O)=C1O'},
 'C00881': {'NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1'},
 'C00986': {'NCCCN'},
 'C01089': {'C[C@@H](O)CC(O)=O'},
 'C01152': {'CN1C=NC(C[C@H](N)C(O)=O)=C1'},
 'C03205': {'[H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
 'C05299': {'[H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[C@@]21[H])C=C(O)C(OC)=C3'},
 'C05488': {'[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
 'C05984': {'CC[C@H](O)C(O)=O'}}

The same data in a data frame:

[15]:

                              hmdb.metabolites_mapping('kegg', 'smiles', head = 10, return_df = True)

                            

executed in 0ms, finished 12:33:31 2023-04-24

[15]:

	id_a	id_b
0	C01152	CN1C=NC(C[C@H](N)C(O)=O)=C1
1	C00986	NCCCN
2	C00109	CCC(=O)C(O)=O
3	C05984	CC[C@H](O)C(O)=O
4	C05299	[H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[...
5	C01089	C[C@@H](O)CC(O)=O
6	C00526	OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O
7	C00881	NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1
8	C05488	[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(...
9	C03205	[H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H...
10	C00847	CC1=NC=C(CO)C(C(O)=O)=C1O
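The two output shapes carry the same information. As a rough illustration, the sketch below converts between a dict of sets and a list of (id_a, id_b) pairs in plain Python; the helper names are hypothetical, and the sample pairs are taken from the output above, with one pair repeated on purpose to show that sets remove duplicates.

```python
from collections import defaultdict

pairs = [
    ('C01152', 'CN1C=NC(C[C@H](N)C(O)=O)=C1'),
    ('C00986', 'NCCCN'),
    ('C00986', 'NCCCN'),  # repeated on purpose
    ('C00109', 'CCC(=O)C(O)=O'),
]

def to_dict_of_sets(pairs):
    """Group the targets by source ID; a set keeps each target once."""
    grouped = defaultdict(set)
    for id_a, id_b in pairs:
        grouped[id_a].add(id_b)
    return dict(grouped)

def to_rows(mapping):
    """Flatten a dict of sets back into sorted (id_a, id_b) rows."""
    return [
        (id_a, id_b)
        for id_a in sorted(mapping)
        for id_b in sorted(mapping[id_a])
    ]

mapping_dict = to_dict_of_sets(pairs)
rows = to_rows(mapping_dict)
print(rows)
```

A dict of sets suits lookups by a single ID, while the row form is easier to join with other tables.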

ID translation with HMDB§

HMDB is also integrated into the ID translation service. Thanks to multiple levels of caching, only the first call takes a long time; subsequent calls are fast:

[16]:

from pypath.utils import mapping
mapping.map_name('C01152', 'kegg', 'inchi')

executed in 0ms, finished 12:33:39 2023-04-24

[16]:

{'InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1',
 'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}

The two InChI strings correspond to the two constitutional isomers covered by the KEGG ID: 1- and 3-Methylhistidine. A useful feature of HMDB is that it has many synonyms and IUPAC names, making it possible to parse a large variety of metabolite names:

[17]:

                              mapping.map_name('C01152', 'kegg', 'hmdb_synonym')

                            

executed in 0ms, finished 12:33:41 2023-04-24

[17]:

{'(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoate',
 '(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoic acid',
 '1 Methylhistidine',
 '1-MHis',
 '1-Methyl histidine',
 '1-Methyl-L-histidine',
 '1-Methyl-histidine',
 '1-Methylhistidine',
 '1-Methylhistidine dihydrochloride',
 '1-N-Methyl-L-histidine',
 '3-Methyl-L-histidine',
 '3-Methylhistidine',
 '3-Methylhistidine dihydrochloride',
 '3-Methylhistidine hydride',
 '3-N-Methyl-L-histidine',
 'L-1-Methylhistidine',
 'L-3-Methylhistidine',
 'N Tau-methylhistidine',
 'N(Tau)-methylhistidine',
 'N(pros)-Methyl-L-histidine',
 'N-pros-Methyl-L-histidine',
 'N1-Methyl-L-histidine',
 'N3-Methyl-L-histidine',
 'Pi-methylhistidine',
 'Tau-methyl-L-histidine',
 'Tau-methylhistidine'}

[18]:

                              mapping.map_name('N(pros)-Methyl-L-histidine', 'hmdb_synonym', 'inchi')

                            

executed in 1.81s, finished 12:33:46 2023-04-24

[18]:

{'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}

The name provided by HMDB is typically the best human-readable name, so it can be used as a label in figures or tables:

[19]:

                              mapping.map_name('HMDB0000001', 'hmdb', 'hmdb_name')

                            

executed in 0ms, finished 12:33:47 2023-04-24

[19]:

{'1-Methylhistidine'}
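Since the translation returns a set, a labelling step has to pick one element and decide what to do when nothing is found. A minimal sketch of such a helper follows. The `label_for` function is hypothetical, and the translation function is passed in as an argument so that the example runs without pypath; `fake_translate` is a stand-in that only mimics the three positional arguments used in the calls above.

```python
def label_for(identifier, translate, id_type = 'hmdb', target = 'hmdb_name'):
    """Return one label for `identifier`, or the identifier itself if none found."""
    names = translate(identifier, id_type, target)
    # Sorting makes the choice deterministic if several names come back.
    return sorted(names)[0] if names else identifier

def fake_translate(name, id_type, target_id_type):
    """Stand-in for the real translation call; knows a single record."""
    table = {
        ('HMDB0000001', 'hmdb', 'hmdb_name'): {'1-Methylhistidine'},
    }
    return table.get((name, id_type, target_id_type), set())

labels = [
    label_for(hmdb_id, fake_translate)
    for hmdb_id in ('HMDB0000001', 'HMDB9999999')
]
print(labels)
```

With pypath installed, `mapping.map_name` could be passed in place of `fake_translate`, assuming it returns an empty result for unknown IDs.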

SwissLipids§

The pypath.inputs.swisslipids module provides access to the datasets that SwissLipids makes available for download. Each function returns a csv.DictReader, an iterator that yields rows as dicts:

[5]:

                            from pypath.inputs import swisslipids

                          

executed in 0ms, finished 19:38:03 2024-10-06

[3]:

tissues = swisslipids.swisslipids_tissues()
tissues

executed in 0ms, finished 19:37:12 2024-10-06

[3]:

<csv.DictReader at 0x6a4f241d0230>

[4]:

                            next(tissues)

                          

executed in 0ms, finished 19:37:38 2024-10-06

[4]:

{'Lipid ID': 'SLM:000056561',
 'Lipid name': 'Phosphatidylcholine (40:6)',
 'Tissue/Cell ID': 'UBERON:0001969',
 'Tissue/Cell name': 'blood plasma',
 'Taxon ID': '9606',
 'Taxon scientific name': 'Homo sapiens',
 'Evidence tag ID': '6814'}
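A csv.DictReader can be consumed only once, which is easy to overlook after a call to next. The sketch below shows this with the standard library alone, on a shortened in-memory table built from the values above; the tab delimiter is an assumption about the file format.

```python
import csv
import io

# Shortened sample; the tab delimiter is assumed for illustration.
SAMPLE = (
    'Lipid ID\tLipid name\tTissue/Cell name\n'
    'SLM:000056561\tPhosphatidylcholine (40:6)\tblood plasma\n'
    'SLM:000056510\tPhosphatidylcholine (34:3)\tblood plasma\n'
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter = '\t')

first = next(reader)     # consumes the first data row
rest = list(reader)      # consumes everything that is left
leftover = list(reader)  # the reader is exhausted by now

print(first['Lipid ID'], len(rest), leftover)
```

To go through the data more than once, either collect the rows into a list or call the function again to obtain a fresh reader.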

Alternatively, the datasets can be retrieved as data frames using the return_df argument. The “lipids” and “lipids2uniprot” datasets use a large amount of memory if loaded this way.

[6]:

                            swisslipids.swisslipids_tissues(return_df = True)

                          

executed in 0ms, finished 19:40:23 2024-10-06

[6]:

	Lipid ID	Lipid name	Tissue/Cell ID	Tissue/Cell name	Taxon ID	Taxon scientific name	Evidence tag ID
0	SLM:000056561	Phosphatidylcholine (40:6)	UBERON:0001969	blood plasma	9606	Homo sapiens	6814
1	SLM:000056510	Phosphatidylcholine (34:3)	UBERON:0001969	blood plasma	9606	Homo sapiens	6806
2	SLM:000056525	Phosphatidylcholine (36:4)	UBERON:0001969	blood plasma	9606	Homo sapiens	6809
3	SLM:000056524	Phosphatidylcholine (36:3)	UBERON:0001969	blood plasma	9606	Homo sapiens	6808
4	SLM:000056509	Phosphatidylcholine (34:2)	UBERON:0001969	blood plasma	9606	Homo sapiens	6805
...	...	...	...	...	...	...	...
934	SLM:000098542	Phosphatidylethanolamine (O-18:0/16:0)	UBERON:0000468	multi-cellular organism	6239	Caenorhabditis elegans	15918
935	SLM:000098543	Phosphatidylethanolamine (O-18:0/16:1)	UBERON:0000468	multi-cellular organism	6239	Caenorhabditis elegans	15917
936	SLM:000098546	Phosphatidylethanolamine (O-18:0/18:0)	UBERON:0000468	multi-cellular organism	6239	Caenorhabditis elegans	15916
937	SLM:000098549	Phosphatidylethanolamine (O-18:0/18:3)	UBERON:0000468	multi-cellular organism	6239	Caenorhabditis elegans	15913
938	SLM:000098557	Phosphatidylethanolamine (O-18:0/20:5)	UBERON:0000468	multi-cellular organism	6239	Caenorhabditis elegans	15910

939 rows × 7 columns

LIPID MAPS§

LIPID MAPS is an international non-profit consortium that develops and maintains standards and tools for lipid research. Currently pypath features a client for its Structure Database, called LMSD. Pypath uses the SDF format, which includes all fields available in the database.

[7]:

                            from pypath.inputs import lipidmaps

                          

executed in 0ms, finished 19:47:28 2024-10-06

When the function returns, the file is already downloaded and opened, but not parsed yet, hence the object reports 0 records:

[8]:

lmsd = lipidmaps.lmsd_sdf()
lmsd

executed in 1.29s, finished 19:47:47 2024-10-06

[8]:

<SDF file `structures.sdf`: 0 records>

One option to retrieve the records is to simply iterate over the object:

[12]:

for lipid in lmsd:
    break
lipid

                          

executed in 0ms, finished 19:51:42 2024-10-06

[12]:

{'id': 'LMFA00000001',
 'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
 'comment': '',
 'mol': '',
 'name': {'LM_ID': 'LMFA00000001',
  'SYSTEMATIC_NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
  'FORMULA': 'C40H66O5',
  'INCHI_KEY': 'VOGBKCAANIAXCI-UHFFFAOYSA-N',
  'INCHI': 'InChI=1S/C40H66O5/c1-7-9-11-23-29-35(3)31-25-19-15-13-17-21-27-33-37(43-5)39(41)45-40(42)38(44-6)34-28-22-18-14-16-20-26-32-36(4)30-24-12-10-8-2/h7-8,35-38H,1-2,9-16,19-20,23-34H2,3-6H3',
  'SMILES': 'C(C(OC)CCC#CCCCCCC(C)CCCCC=C)(=O)OC(C(OC)CCC#CCCCCCC(C)CCCCC=C)=O',
  'ABBREVIATION': 'FA 40:7;O3',
  'SYNONYMS': 'Acetylenic acids',
  'PUBCHEM_CID': '10930192',
  'CHEBI_ID': '178363'},
 'annot': {'NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
  'CATEGORY': 'Fatty Acyls [FA]',
  'MAIN_CLASS': 'Other Fatty Acyls [FA00]',
  'EXACT_MASS': '626.491025'}}

The same object is able to index the SDF file, and retrieve records on demand. The indexing covers all names, synonyms and identifiers used in the database.

[13]:

                            lmsd.index()

                          

executed in 24.31s, finished 19:54:26 2024-10-06

After indexing, the database shows its size:

[15]:

lmsd

executed in 0ms, finished 19:55:54 2024-10-06

[15]:

<SDF file `structures.sdf`: 48116 records>

[16]:

                            len(lmsd)

                          

executed in 0ms, finished 19:56:03 2024-10-06

[16]:

The records can be retrieved by any of their names or identifiers:

[14]:

                            lmsd['LMFA00000001']

                          

executed in 0ms, finished 19:54:52 2024-10-06

[14]:

[({'id': 'LMFA00000001',
   'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
   'comment': '',
   'mol': '',
   'name': {'LM_ID': 'LMFA00000001',
    'SYSTEMATIC_NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
    'FORMULA': 'C40H66O5',
    'INCHI_KEY': 'VOGBKCAANIAXCI-UHFFFAOYSA-N',
    'INCHI': 'InChI=1S/C40H66O5/c1-7-9-11-23-29-35(3)31-25-19-15-13-17-21-27-33-37(43-5)39(41)45-40(42)38(44-6)34-28-22-18-14-16-20-26-32-36(4)30-24-12-10-8-2/h7-8,35-38H,1-2,9-16,19-20,23-34H2,3-6H3',
    'SMILES': 'C(C(OC)CCC#CCCCCCC(C)CCCCC=C)(=O)OC(C(OC)CCC#CCCCCCC(C)CCCCC=C)=O',
    'ABBREVIATION': 'FA 40:7;O3',
    'SYNONYMS': 'Acetylenic acids',
    'PUBCHEM_CID': '10930192',
    'CHEBI_ID': '178363'},
   'annot': {'NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
    'CATEGORY': 'Fatty Acyls [FA]',
    'MAIN_CLASS': 'Other Fatty Acyls [FA00]',
    'EXACT_MASS': '626.491025'}},
  0),
 ({'id': 'LMFA00000001',
   'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
   'comment': '',
   'mol':

Output truncated: showing 1000 of 1803 characters

And it also supports the in operator:

[17]:

                            'PC(18:1/18:0)' in lmsd

                          

executed in 0ms, finished 19:57:02 2024-10-06

[17]:

True

[18]:

                            lmsd['PC(18:1/18:0)']

                          

executed in 1ms, finished 19:57:28 2024-10-06

[18]:

[({'id': 'LMGP01010888',
   'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
   'comment': '',
   'mol': '',
   'name': {'LM_ID': 'LMGP01010888',
    'SYSTEMATIC_NAME': '1-(9Z-octadecenoyl)-2-octadecanoyl-sn-glycero-3-phosphocholine',
    'FORMULA': 'C44H86NO8P',
    'INCHI_KEY': 'NMJCSTNQFYPVOR-VHONOUADSA-N',
    'INCHI': 'InChI=1S/C44H86NO8P/c1-6-8-10-12-14-16-18-20-22-24-26-28-30-32-34-36-43(46)50-40-42(41-52-54(48,49)51-39-38-45(3,4)5)53-44(47)37-35-33-31-29-27-25-23-21-19-17-15-13-11-9-7-2/h20,22,42H,6-19,21,23-41H2,1-5H3/b22-20-/t42-/m1/s1',
    'SMILES': '[C@](COP(=O)([O-])OCC[N+](C)(C)C)([H])(OC(CCCCCCCCCCCCCCCCC)=O)COC(CCCCCCC/C=C\\CCCCCCCC)=O',
    'ABBREVIATION': 'PC 36:1',
    'SYNONYMS': 'Choline phosphate, 3-ester with L-1-oleo-2-stearin; L-1-Oleoyl-2-stearoyl lecithin; L-1-Oleoyl-2-stearoyl-3-phosphatidylcholine; OSPC; PC(18:1/18:0); PC(36:1); PC(18:0_18:1)',
    'PUBCHEM_CID': '24778936',
    'HMDB_ID': 'HMDB0008102',
    'CHEBI_ID': '76073',
    'SWISSLIPIDS_ID': 'SLM:000012

Output truncated: showing 1000 of 2352 characters
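Indexing and on-demand retrieval of this kind can be pictured as a table of file offsets: one pass over the file records where each record starts and which names point to it, and a lookup then seeks to the offset and parses only that record. The sketch below is an illustration of the idea on an in-memory file, not pypath's implementation. All function names are hypothetical, the sample text is heavily abbreviated (the mol blocks are left out, so these are not valid structures), and only single-line data fields are handled.

```python
import io
from collections import defaultdict

# Abbreviated sample: records end with '$$$$', data fields start with '> <NAME>'.
SDF = (
    'LMFA00000001\n'
    'M  END\n'
    '> <LM_ID>\n'
    'LMFA00000001\n'
    '\n'
    '> <SYNONYMS>\n'
    'Acetylenic acids\n'
    '\n'
    '$$$$\n'
    'LMGP01010888\n'
    'M  END\n'
    '> <LM_ID>\n'
    'LMGP01010888\n'
    '\n'
    '> <SYNONYMS>\n'
    'OSPC; PC(18:1/18:0); PC(36:1)\n'
    '\n'
    '$$$$\n'
)

def record_offsets(fh):
    """Return the start offset of every record in the file."""
    fh.seek(0)
    offsets, start = [], 0
    for line in iter(fh.readline, ''):
        if line.rstrip('\n') == '$$$$':
            offsets.append(start)
            start = fh.tell()  # the next record begins right after '$$$$'
    return offsets

def parse_record(fh, offset):
    """Parse the data fields of the record that starts at `offset`."""
    fh.seek(offset)
    fields, key = {}, None
    for line in iter(fh.readline, ''):
        line = line.rstrip('\n')
        if line == '$$$$':
            break
        if line.startswith('> <'):
            key = line[3:line.index('>', 3)]
        elif not line:
            key = None  # a blank line closes the current data field
        elif key is not None:
            fields[key] = line
    return fields

def build_index(fh):
    """Map every identifier and synonym to the offsets of its records."""
    index = defaultdict(list)
    for offset in record_offsets(fh):
        for field, value in parse_record(fh, offset).items():
            names = value.split(';') if field == 'SYNONYMS' else [value]
            for name in names:
                index[name.strip()].append(offset)
    return dict(index)

def lookup(fh, index, key):
    """Retrieve the records for `key` by seeking to their offsets."""
    return [parse_record(fh, offset) for offset in index.get(key, [])]

fh = io.StringIO(SDF)
index = build_index(fh)
print('PC(18:1/18:0)' in index, lookup(fh, index, 'PC(18:1/18:0)'))
```

The index stays small because it holds only names and offsets, while the records themselves remain on disk until they are requested.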

Finally, the records can be loaded into memory, which makes their retrieval faster:

[21]:

                            lmsd.load()

                          

executed in 0ms, finished 20:02:08 2024-10-06

[23]:

                            lmsd['PC(18:1/18:0)']

                          

executed in 1ms, finished 20:02:51 2024-10-06

[23]:

[{'id': 'LMGP01010888',
  'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
  'comment': '',
  'mol': '',
  'name': {'LM_ID': 'LMGP01010888',
   'SYSTEMATIC_NAME': '1-(9Z-octadecenoyl)-2-octadecanoyl-sn-glycero-3-phosphocholine',
   'FORMULA': 'C44H86NO8P',
   'INCHI_KEY': 'NMJCSTNQFYPVOR-VHONOUADSA-N',
   'INCHI': 'InChI=1S/C44H86NO8P/c1-6-8-10-12-14-16-18-20-22-24-26-28-30-32-34-36-43(46)50-40-42(41-52-54(48,49)51-39-38-45(3,4)5)53-44(47)37-35-33-31-29-27-25-23-21-19-17-15-13-11-9-7-2/h20,22,42H,6-19,21,23-41H2,1-5H3/b22-20-/t42-/m1/s1',
   'SMILES': '[C@](COP(=O)([O-])OCC[N+](C)(C)C)([H])(OC(CCCCCCCCCCCCCCCCC)=O)COC(CCCCCCC/C=C\\CCCCCCCC)=O',
   'ABBREVIATION': 'PC 36:1',
   'SYNONYMS': 'L-1-Oleoyl-2-stearoyl-3-phosphatidylcholine;PC(36:1);PC(18:0_18:1);PC(18:1/18:0);Choline phosphate, 3-ester with L-1-oleo-2-stearin;OSPC;L-1-Oleoyl-2-stearoyl lecithin',
   'PUBCHEM_CID': '24778936',
   'HMDB_ID': 'HMDB0008102',
   'CHEBI_ID': '76073',
   'SWISSLIPIDS_ID': 'SLM:000012332'},
  'annot': {'NA

Output truncated: showing 1000 of 2290 characters

NCBI E-Utils§

The ESummary endpoint of the NCBI E-Utils API provides metadata about records in NCBI databases. A client to this API endpoint is available in the pypath.inputs.eutils module. The parameter ids can be one integer, or a list of integers or strings:

[3]:

from pypath.inputs import eutils

eutils.esummary(ids = 6063, db = 'geoprofiles')

executed in 0ms, finished 22:43:56 2023-11-14

[3]:

{'uids': ['6063'],
 '6063': {'uid': '6063',
  'gds': '5',
  'gpl': '13',
  'erank': '8eSiQ',
  'evalue': 'joAzE',
  'title': 'Diurnal and circadian-regulated genes (I)',
  'taxon': 'Arabidopsis thaliana',
  'gdstype': 'Expression profiling by array',
  'valtype': 'log ratio',
  'idref': '6063',
  'genename': '',
  'genedesc': '',
  'ugname': 'AT4G11560',
  'ugdesc': 'Bromo-adjacent homology (BAH) domain-containing protein',
  'nucdesc': '9366 Lambda-PRL2 Arabidopsis thaliana cDNA clone 135J10T7, mRNA sequence',
  'entrez_gene_id': '',
  'gbacc': 'T46103',
  'ptacc': '',
  'cloneid': '135J10T7',
  'orf': '',
  'spotid': '',
  'vmin': '-0.395000',
  'vmax': '0.201000',
  'groups': 'A1B3C1',
  'abscall': '',
  'aflag': 20,
  'aoutl': '',
  'rstd': 31,
  'rmean': 50}}
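
For orientation, the request behind such a call can be sketched with the standard library alone. The base URL and the db, id and retmode parameters are those of the public ESummary endpoint; the esummary_url helper is hypothetical and is not part of pypath, whose internal request building is not shown here:

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESummary endpoint.
ESUMMARY_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'


def esummary_url(ids, db):
    """Build an ESummary request URL (hypothetical helper, not pypath API)."""
    # Accept one integer or string, or a list of them.
    if isinstance(ids, (int, str)):
        ids = [ids]
    params = {
        'db': db,
        'id': ','.join(str(i) for i in ids),
        'retmode': 'json',
    }
    return f'{ESUMMARY_URL}?{urlencode(params)}'


url_single = esummary_url(6063, 'geoprofiles')
url_multi = esummary_url([1956, '7157'], 'gene')
print(url_single)
print(url_multi)
```

Multiple identifiers travel in one comma-separated id parameter, which is why a single integer and a list can be handled by the same function.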

A simple wrapper for PubMed is available in the pypath.inputs.pubmed module:

[2]:

from pypath.inputs import pubmed

pubmed.get_pubmeds('33209674')

executed in 0ms, finished 22:42:02 2023-11-14

[2]:

{'uids': ['33209674'],
 '33209674': {'uid': '33209674',
  'pubdate': '2020 Oct',
  'epubdate': '',
  'source': 'Transl Androl Urol',
  'authors': [{'name': 'Kim H', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Lee SH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Kim DH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Lee JY', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Hong SH', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Ha US', 'authtype': 'Author', 'clusterid': ''},
   {'name': 'Kim IH', 'authtype': 'Author', 'clusterid': ''}],
  'lastauthor': 'Kim IH',
  'title': 'Gemcitabine maintenance versus observation after first-line chemotherapy in patients with metastatic urothelial carcinoma: a retrospective study.',
  'sorttitle': 'gemcitabine maintenance versus observation after first line chemotherapy in patients with metastatic urothelial carcinoma a retrospective study',
  'volume': '9',
  'issue': '5',
  'pages': '2113-2121',
  'lang': ['eng']

Output truncated: showing 1000 of 2263 characters

One last example, querying the Entrez Gene database:

[4]:

from pypath.inputs import eutils

eutils.esummary(ids = 1956, db = 'gene')

executed in 0ms, finished 22:48:09 2023-11-14

[4]:

{'uids': ['1956'],
 '1956': {'uid': '1956',
  'name': 'EGFR',
  'description': 'epidermal growth factor receptor',
  'status': '',
  'currentid': '',
  'chromosome': '7',
  'geneticsource': 'genomic',
  'maplocation': '7p11.2',
  'otheraliases': 'ERBB, ERBB1, ERRP, HER1, NISBD2, PIG61, mENA',
  'otherdesignations': 'epidermal growth factor receptor|EGFR vIII|avian erythroblastic leukemia viral (v-erb-b) oncogene homolog|cell growth inhibiting protein 40|cell proliferation-inducing protein 61|epidermal growth factor receptor tyrosine kinase domain|erb-b2 receptor tyrosine kinase 1|proto-oncogene c-ErbB-1|receptor tyrosine-protein kinase erbB-1',
  'nomenclaturesymbol': 'EGFR',
  'nomenclaturename': 'epidermal growth factor receptor',
  'nomenclaturestatus': 'Official',
  'mim': ['131550'],
  'genomicinfo': [{'chrloc': '7',
    'chraccver': 'NC_000007.14',
    'chrstart': 55019016,
    'chrstop': 55211627,
    'exoncount': 32}],
  'geneweight': 580393,
  'summary': 'The protein encoded b

Output truncated: showing 1000 of 5417 characters

Download management§

Cache management and customization§

The pypath.omnipath.app saves the databases as pickle dumps, by default under the ~/.pypath/pickles/ directory, and after the first build it loads them from there. The very first build of each database might take quite a long time (90 min or longer in the case of the OmniPath network or annotations) because of the large number of downloads. Subsequent builds will be much faster because pypath stores all the downloaded data in a local cache and downloads again only upon request from the user. Loading the databases from pickle dumps takes only seconds. However, if you want to build a database with different settings, remember to set a different cache file name for it.
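
Why a different file name matters can be illustrated with a toy build-once, load-later cache. This is only a sketch of the general pattern, not the code of the OmniPath app: pickle_path, load_or_build and toy_build are hypothetical names, and deriving the file name from a hash of the settings is an assumption made for this example:

```python
import hashlib
import json
import os
import pickle
import tempfile


def pickle_path(name, build_settings, directory):
    """Hypothetical helper: one pickle file name per distinct settings."""
    # Serialize deterministically, so equal settings give equal names.
    encoded = json.dumps(build_settings, sort_keys=True).encode('utf-8')
    digest = hashlib.sha256(encoded).hexdigest()[:8]
    return os.path.join(directory, f'{name}-{digest}.pickle')


def load_or_build(name, build_settings, build, directory):
    """Load the database from its pickle, or build it and save it first."""
    path = pickle_path(name, build_settings, directory)
    if os.path.exists(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    database = build(**build_settings)
    with open(path, 'wb') as fp:
        pickle.dump(database, fp)
    return database


calls = []  # records every real (slow) build


def toy_build(organism):
    calls.append(organism)
    return {'organism': organism}


with tempfile.TemporaryDirectory() as tmp:
    human = load_or_build('network', {'organism': 9606}, toy_build, tmp)
    human_again = load_or_build('network', {'organism': 9606}, toy_build, tmp)
    mouse = load_or_build('network', {'organism': 10090}, toy_build, tmp)

print(calls)
```

The second request with identical settings is served from the pickle, while the changed settings get their own file and trigger a fresh build. With a single shared file name, the mouse request would have silently returned the human database.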

Download failures§

Issuing hundreds of requests to dozens of servers sooner or later comes with failures. These might happen just by accident, especially on slow networks; it is always recommended to try again.

Corrupted cache content§

Sometimes a truncated or corrupted file remains in the cache. In this case you can use the context managers in pypath.share.curl to control the cache. For example, if the download of the DEPOD database failed and keeps failing due to a corrupted file, use the cache_delete_on context:

[7]:

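from pypath.share import curl
from pypath.inputs import depod

# discard the cached copy used by this call; the result gets its own
# name, so the `depod` module is not overwritten
with curl.cache_delete_on():
    depod_data = depod.depod_enzyme_substrate()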

executed in 5.61s, finished 13:59:07 2022-12-02

The cache_off context forces download even if a cache item is available; the cache_print_on context prints paths to the accessed cache files to the terminal, though the paths can always be found in the log; the dry_run_on context sets up the pypath.share.curl.Curl object and stops just before the actual download.
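
The mechanism behind such contexts can be sketched in a few lines: a switch is flipped on entry and restored on exit, even if the download raises an error. The toy below only illustrates this pattern with the standard library; it is not the implementation in pypath.share.curl, and STATE, download and the example URL are made up for this sketch:

```python
import contextlib

# Toy switch standing in for a cache setting.
STATE = {'cache': True}


@contextlib.contextmanager
def cache_off():
    """Temporarily disable the toy cache, restoring the old value on exit."""
    previous = STATE['cache']
    STATE['cache'] = False
    try:
        yield
    finally:
        # Runs even if the body of the `with` block fails.
        STATE['cache'] = previous


def download(url, cache_store):
    """Return cached content unless the cache is switched off."""
    if STATE['cache'] and url in cache_store:
        return cache_store[url]
    content = f'fresh content of {url}'
    cache_store[url] = content
    return content


store = {'http://example.org/data': 'stale content'}
cached = download('http://example.org/data', store)

with cache_off():
    fresh = download('http://example.org/data', store)

print(cached, '|', fresh, '|', STATE['cache'])
```

Because the old value is restored in the finally clause, a failed download inside the block cannot leave the cache permanently disabled.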

Network communication issues: look into the curl debug log§

Downloads might also fail due to TLS or HTTP errors, wrong headers or parameters, and many other reasons. In this case the full debug output from curl might be very useful. The debug_on context writes the curl debug output into the logfile:

[8]:

from pypath.share import curl
from pypath.inputs import depod

# the curl debug output goes to the pypath logfile
with curl.debug_on():
    depod_data = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:12 2022-12-02

Timeouts§

From the log we can find out whether the download fails due to a timeout. In this case, the timeout parameters can be altered by a settings context. Apart from curl_timeout, the timeout for the completion of the download, there are curl_connect_timeout (the timeout for establishing a connection to the server) and curl_extended_timeout, which is used for servers that are known to be exceptionally slow. Another parameter, curl_retries, is the number of attempts before giving up. By default it’s 3, and that should be more than enough.

[9]:

from pypath.share import settings
from pypath.inputs import depod

# allow more time for the download to complete
with settings.context(curl_timeout = 360):
    depod_data = depod.depod_enzyme_substrate()

executed in 0ms, finished 13:59:17 2022-12-02
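
How a retry count interacts with timeouts can be shown with a generic retry loop. This is a sketch of the idea only, not pypath's download code: fetch_with_retries and flaky_fetch are hypothetical, and the retries argument merely plays the role that curl_retries has in the settings:

```python
import time


def fetch_with_retries(fetch, retries=3, wait=0.0):
    """Call fetch() up to `retries` times; re-raise the last timeout."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # Report the number of attempts along with the result.
            return fetch(), attempt
        except TimeoutError as error:
            last_error = error
            time.sleep(wait)
    raise last_error


attempts = {'n': 0}


def flaky_fetch():
    """Simulated download that times out twice before succeeding."""
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise TimeoutError('server too slow')
    return 'payload'


result, used_attempts = fetch_with_retries(flaky_fetch, retries=3)
print(result, used_attempts)
```

With three attempts the simulated download succeeds on the last one; with retries=2 the same call would end in a TimeoutError, which is the situation where a longer timeout is worth trying.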

Access and inspect the `Curl` object§

Often the Curl object is created in a function from the pypath.inputs module, deep in a call stack, hence accessing it for investigation is difficult. Using the preserve_on context, the last Curl instance is kept under the pypath.share.curl.LASTCURL attribute:

[10]:

from pypath.share import curl
from pypath.inputs import depod

# keep the last Curl instance for inspection
with curl.preserve_on():
    depod_data = depod.depod_enzyme_substrate()

depod_curl = curl.LASTCURL
depod_curl

                            

executed in 0ms, finished 13:59:24 2022-12-02

[10]:

<pypath.share.curl.Curl at 0x6947386dc8b0>

[11]:

                              depod_curl.url, depod_curl.req_headers, depod_curl.fileobj, depod_curl.status

                            

executed in 0ms, finished 13:59:28 2022-12-02

[11]:

('http://depod.bioss.uni-freiburg.de/download/DEPOD_201405_human_phosphatase-substrate.mitab',
 [],
 <_io.TextIOWrapper name='/home/denes/.pypath/cache/6a711369ecf9dcff8c5ed88996685b54-DEPOD_201405_human_phosphatase-substrate.mitab' mode='r' encoding='iso-8859-1'>,
 0)

Is it failing only for you?§

Okay, this is the one you should check first: we run almost all downloads in pypath daily, so you can always check in the report whether a particular function ran successfully last night on our server. If it also fails in our daily build, it can still be a transient error that disappears within a few days, or it can be a permanent error. In the latter case, we first try to fix the issue in pypath (maybe the behaviour or the address of the third party server has changed). If we have no way to fix it, we start hosting the data on our own server and make pypath download it from there.

Read the log§

We have mentioned the pypath log several times above. Here is how to access it; see more details in the section about logging:

[12]:

import pypath
pypath.log()

executed in 0ms, finished 13:59:34 2022-12-02

[2022-12-02 14:57:09] Welcome!
[2022-12-02 14:57:09] Logger started, logging into `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`.
[2022-12-02 14:57:09] Session `s3e92` started.
[2022-12-02 14:57:09] [pypath]
        - session ID: `s3e92`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`
        - pypath version: 0.14.30
[2022-12-02 14:57:09] [curl] Creating Curl object to retrieve data from `https://www.ensembl.org/info/about/species.html`
[2022-12-02 14:57:09] [curl] Cache file path: `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`
[2022-12-02 14:57:09] [curl] Cache file found, no need for download.
[2022-12-02 14:57:09] [curl] Opening plain text file `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`.
[2022-12-02 14:57:09] [curl] Contents of `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html` has been read and the file has been closed.
[2022-1

Output truncated: showing 1000 of 112963 characters

TLS (SSL, HTTPS) errors§

Failed certificate verification and invalid, expired, self-signed or missing certificates: these might be the most common reasons why people open issues for our software. TLS is a method for encrypted communication, typically over HTTP. The server has a certificate and uses it to sign and encrypt the data before sending it to the client. The client trusts the server certificate because it is signed by another certificate, which is signed by yet another one, and so on, until we reach a so-called root certificate that is known and trusted by the client. The number of root certificates used globally is so small that every single computer stores them locally and updates them from time to time from trusted sources, such as the provider of the operating system, web browser or programming language.

Having an up-to-date certificate store and correctly configured TLS clients on your computer is your (or your system admin’s) duty; here we can only give a generic procedure to address these issues. In 97% of the cases the issue is on your computer, but sometimes the server might be responsible. If you experience a TLS issue:

Check the status of the server: initiate a scan at a free TLS checking service, such as SSL Labs: look for any issue with the certificate chain, such as missing or expired certificates, old or too new ciphers not supported by your client, etc.
Identify the server that your client failed to establish a TLS connection to (in case of pypath, look into the log)
Identify your software that contains the TLS client: in case of pypath, it uses pycurl, a Python module built on libcurl
Identify the provider of the client software: it can be PyPI, Anaconda, your operating system, etc.
Find out which certificate store that software uses: most of them use the store from your operating system, but for example Java or Mozilla Firefox come with their own certificates
Check if the certificate store is up-to-date, update if necessary
Alternatively, identify the missing root certificate and add it manually to the store; you can also add a non-root certificate if the server has a serious issue and the chain cannot be followed up to a valid root certificate
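
As a starting point for the steps about the certificate store, the sketch below prints where Python's own ssl module looks for root certificates. Note the assumption: this describes the OpenSSL library linked to your Python interpreter, while pycurl relies on libcurl, which may have been built against a different TLS library or certificate bundle, so treat the output as a hint rather than a diagnosis:

```python
import ssl

# Where Python's TLS stack looks for CA certificates by default.
paths = ssl.get_default_verify_paths()
print('CA file:', paths.cafile or paths.openssl_cafile)
print('CA directory:', paths.capath or paths.openssl_capath)

# Certificates loaded up front by a default context; certificates in a
# CA directory are read on demand and are not counted here.
context = ssl.create_default_context()
stats = context.cert_store_stats()
print('Loaded certificates:', stats)

print('TLS library:', ssl.OPENSSL_VERSION)
```

If the reported CA file does not exist or is several years old, updating the package that provides it is usually the first thing to try.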

Please open TLS related issues for our software only if:

You experience a server side issue with omnipathdb.org
You have a strong reason to think that the cause is in the code written by us, or that it can be easily fixed within our code

Resources§

[2]:

from pypath import resources
rc = resources.get_controller()
rc

                        

executed in 0ms, finished 14:27:45 2022-12-03

[2]:

<pypath.resources.controller.ResourceController at 0x6cc25e25dcf0>

Licenses§

The license of SIGNOR is CC BY-SA, which allows commercial (for-profit) use:

[3]:

                            rc.license('SIGNOR'), rc.license('SIGNOR').commercial

                          

executed in 0ms, finished 14:27:47 2022-12-03

[3]:

(<License CC BY-SA 4.0>, True)

Example: build a network for commercial use§

For our users, the most important aspect of licenses is whether they allow for-profit use in companies. In the near future we intend to provide a more convenient interface for license options; until then, see the example below.

[4]:

from pypath.core import network
from pypath import resources

co = resources.get_controller()
pw_academic = co.collect_network('pathway')
pw_commercial = co.collect_network('pathway', license_purpose = 'commercial')

len(pw_academic), len(pw_commercial), set(pw_academic.values()) - set(pw_commercial.values())

                            

executed in 0ms, finished 18:45:22 2023-03-10

[4]:

(24,
 19,
 {<NetworkResource: Baccin2019 (post_translational, activity_flow)>,
  <NetworkResource: Cellinker (post_translational, activity_flow)>,
  <NetworkResource: HPMR (post_translational, activity_flow)>,
  <NetworkResource: PDZBase (post_translational, activity_flow)>,
  <NetworkResource: TRIP (post_translational, activity_flow)>})

Above we see that five resources have been disabled by applying the for-profit licensing restriction. The licenses of those five resources:

[5]:

                              [r.license for r in set(pw_academic.values()) - set(pw_commercial.values())]

                            

executed in 0ms, finished 18:48:02 2023-03-10

[5]:

[<License CC BY-NC-SA 3.0>,
 <License No license>,
 <License CC BY-NC 4.0>,
 <License CC BY-NC 4.0>,
 <License CC BY-NC 4.0>]

The licenses of the resources that allow for-profit use:

[6]:

                              [r.license for r in pw_commercial.values()]

                            

executed in 0ms, finished 18:50:35 2023-03-10

[6]:

[<License CC BY 4.0>,
 <License CC BY-SA 3.0>,
 <License CC BY-SA 3.0>,
 <License CC BY 4.0>,
 <License CC BY-SA 3.0>,
 <License CC BY-SA 3.0>,
 <License CC BY 4.0>,
 <License NAR Open Access>,
 <License CC BY-SA 4.0>,
 <License CC BY 4.0>,
 <License GPLv3>,
 <License GPLv3>,
 <License GPLv3>,
 <License MIT>,
 <License GPLv3>,
 <License MIT>,
 <License MIT>,
 <License CC BY 4.0>,
 <License GPLv3>]

Taking a closer look at a non-profit license:

[10]:

license = pw_academic['trip'].license
license.purpose, license.purpose.enables('for-profit')

executed in 0ms, finished 18:54:45 2023-03-10

[10]:

(<License purpose: academic>, False)

The collected resources can be used directly to build databases, in this case a network database:

[11]:

net_academic = network.Network(pw_academic)
net_commercial = network.Network(pw_commercial)
net_academic, net_commercial

                            

executed in 1m 2.79s, finished 18:57:02 2023-03-10

[11]:

(<Network: 6833 nodes, 25607 interactions>,
 <Network: 6429 nodes, 23288 interactions>)

As we see, the network usable for for-profit purposes is smaller by about 400 nodes and 2,300 edges, and it might miss even more of the fine-grained details, but it is likely still suitable for analysis. We are no legal experts, but here are some thoughts about licenses. Even if you work for a company, you may download and explore data under any license; the restrictions apply once you start to actually use the resource. Even if some resources restrict commercial use, you can always contact the copyright owners and ask them for permission, or ask your company to pay them a licensing fee, so that you can legally use their product.
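
The size difference follows directly from the two network summaries printed above. A short calculation, using only those four numbers, makes the loss explicit in absolute and relative terms:

```python
# Sizes reported for the two networks built above.
academic = {'nodes': 6833, 'interactions': 25607}
commercial = {'nodes': 6429, 'interactions': 23288}

# Absolute and relative loss caused by the for-profit restriction.
lost = {key: academic[key] - commercial[key] for key in academic}
lost_percent = {key: 100 * lost[key] / academic[key] for key in academic}

for key in academic:
    print(f'{key}: {lost[key]} fewer ({lost_percent[key]:.1f}%)')
```

The numbers depend on the state of the resources at build time, so a fresh build may give somewhat different values.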

Resource information§

[4]:

                            rc['MatrixDB']

                          

executed in 0ms, finished 14:27:49 2022-12-03

[4]:

{'yearUsedRelease': 2015,
 'releases': [2009, 2011, 2015],
 'urls': {'articles': ['http://bioinformatics.oxfordjournals.org/content/25/5/690.long',
   'http://nar.oxfordjournals.org/content/43/D1/D321.long',
   'http://nar.oxfordjournals.org/content/39/suppl_1/D235.long'],
  'webpages': ['http://matrixdb.univ-lyon1.fr/'],
  'omictools': ['http://omictools.com/matrixdb-tool']},
 'pubmeds': [19147664, 20852260, 25378329],
 'taxons': ['mammalia'],
 'annot': ['experiment'],
 'recommend': ['small, literature curated interaction resource; many interactions for',
  'receptors and extracellular proteins'],
 'descriptions': ['Protein data were imported from the UniProtKB/Swiss-Prot database (Bairoch et',
  'al., 2005) and identified by UniProtKB/SwissProt accession numbers. In order to',
  'list all the partners of a protein, interactions are associated by default to the',
  'accession number of the human protein. The actual source species used in experiments is',
  'indicated in the page repor

Output truncated: showing 1000 of 4479 characters

Resource definitions for a certain database or dataset§

Note: This does not work yet for all databases and datasets, but likely in the near future this will be the preferred method to access resource definitions.

[197]:

                            rc.collect_enzyme_substrate()

                          

executed in 0ms, finished 20:08:29 2022-12-02

[197]:

[<EnzymeSubstrateResource: phosphoELM>,
 <EnzymeSubstrateResource: dbPTM>,
 <EnzymeSubstrateResource: SIGNOR>,
 <EnzymeSubstrateResource: HPRD>,
 <EnzymeSubstrateResource: Li2012>,
 <EnzymeSubstrateResource: DEPOD>,
 <EnzymeSubstrateResource: PhosphoSite>,
 <EnzymeSubstrateResource: PhosphoNetworks>,
 <EnzymeSubstrateResource: MIMP>,
 <EnzymeSubstrateResource: ProtMapper>,
 <EnzymeSubstrateResource: KEA>]

The resource definitions carry all information necessary to load the resource, for example:

[202]:

phosphoelm = rc.collect_enzyme_substrate()[0]
phosphoelm.input_method, phosphoelm.id_type_enzyme

executed in 0ms, finished 20:09:51 2022-12-02

[202]:

('phosphoelm.phosphoelm_enzyme_substrate', 'uniprot')

Building networks§

To build networks you will need the Network class from the pypath.core.network module, which takes care of building and querying the network. You also need the pypath.resources.network module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc.). These input settings tell pypath how to download and process the data.

[13]:

from pypath.core import network
from pypath.resources import network as netres

executed in 0ms, finished 13:59:49 2022-12-02

For example the netres.pathway is a collection of databases which fit into the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:

[14]:

                          netres.pathway

                        

executed in 0ms, finished 13:59:52 2022-12-02

[14]:

{'trip': <NetworkResource: TRIP (post_translational, activity_flow)>,
 'spike': <NetworkResource: SPIKE (post_translational, activity_flow)>,
 'signalink3': <NetworkResource: SignaLink3 (post_translational, activity_flow)>,
 'guide2pharma': <NetworkResource: Guide2Pharma (post_translational, activity_flow)>,
 'ca1': <NetworkResource: CA1 (post_translational, activity_flow)>,
 'arn': <NetworkResource: ARN (post_translational, activity_flow)>,
 'nrf2': <NetworkResource: NRF2ome (post_translational, activity_flow)>,
 'macrophage': <NetworkResource: Macrophage (post_translational, activity_flow)>,
 'death': <NetworkResource: DeathDomain (post_translational, activity_flow)>,
 'pdz': <NetworkResource: PDZBase (post_translational, activity_flow)>,
 'signor': <NetworkResource: SIGNOR (post_translational, activity_flow)>,
 'adhesome': <NetworkResource: Adhesome (post_translational, activity_flow)>,
 'icellnet': <NetworkResource: ICELLNET (post_translational, activity_flow)>,
 'celltalkdb': <Net

Output truncated: showing 1000 of 1864 characters

You can pass such a dictionary to the load method of the network.Network object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your user’s home directory. For this reason, loading a resource for the first time might take long, but the next time it will be faster, as the data will be fetched from the cache. First create a pypath.core.network.Network object, then build the network:

[15]:

n = network.Network()
n.load(netres.pathway)

executed in 32.90s, finished 14:00:36 2022-12-02

[16]:

n

executed in 0ms, finished 14:02:23 2022-12-02

[16]:

<Network: 6833 nodes, 25607 interactions>

You can add more resource sets in a similar way:

[18]:

                          n.load(netres.enzyme_substrate)

                        

executed in 30.04s, finished 14:04:29 2022-12-02

[19]:

n

executed in 0ms, finished 14:05:38 2022-12-02

[19]:

<Network: 7979 nodes, 35550 interactions>

To load one single resource simply pass the NetworkResource directly:

[20]:

                          n.load(netres.interaction['matrixdb'])

                        

executed in 0ms, finished 14:05:42 2022-12-02

[21]:

n

executed in 0ms, finished 14:05:44 2022-12-02

[21]:

<Network: 8002 nodes, 35748 interactions>

Which network datasets are pre-defined in pypath?§

You can find all the pre-defined datasets in the pypath.resources.network module. Currently this module is a wrapper around an older module, pypath.resources.data_formats, where the actual definitions are written. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath, but since then we have added a great variety of other kinds of resource definitions. Here we give an overview of these.

pypath.resources.network.pathway: activity flow networks with literature references
pypath.resources.network.activity_flow: synonym for pathway
pypath.resources.network.pathway_noref: activity flow networks without literature references
pypath.resources.network.pathway_all: all activity flow data
pypath.resources.network.ptm: enzyme-substrate interaction networks with literature references
pypath.resources.network.enzyme_substrate: synonym for ptm
pypath.resources.network.ptm_noref: enzyme-substrate networks without literature references
pypath.resources.network.ptm_all: all enzyme-substrate data
pypath.resources.network.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)
pypath.resources.network.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)
pypath.resources.network.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath
pypath.resources.network.transcription: transcriptional regulation only from the DoRothEA data
pypath.resources.network.mirna_target: miRNA-mRNA interactions from literature curated resources
pypath.resources.network.tf_mirna: transcriptional regulation of miRNA from literature curated resources
pypath.resources.network.lncrna_protein: lncRNA-protein interactions from literature curated datasets
pypath.resources.network.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources
pypath.resources.network.pathwaycommons: the PathwayCommons database
pypath.resources.network.reaction: process description databases; not guaranteed to work at this moment
pypath.resources.network.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment
pypath.resources.network.small_molecule_protein: signaling interactions between small molecules and proteins

To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:

[22]:

                            netres.pathway.keys()

                          

executed in 0ms, finished 14:05:57 2022-12-02

[22]:

dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'icellnet', 'celltalkdb', 'cellchatdb', 'connectomedb', 'talklr', 'cellinker', 'scconnect', 'hpmr', 'cellphonedb', 'ramilowski2015', 'lrdb', 'baccin2019'])

[23]:

                            [resource.name for resource in netres.pathway.values()]

                          

executed in 0ms, finished 14:06:00 2022-12-02

[23]:

['TRIP',
 'SPIKE',
 'SignaLink3',
 'Guide2Pharma',
 'CA1',
 'ARN',
 'NRF2ome',
 'Macrophage',
 'DeathDomain',
 'PDZBase',
 'SIGNOR',
 'Adhesome',
 'ICELLNET',
 'CellTalkDB',
 'CellChatDB',
 'connectomeDB2020',
 'talklr',
 'Cellinker',
 'scConnect',
 'HPMR',
 'CellPhoneDB',
 'Ramilowski2015',
 'LRdb',
 'Baccin2019']

The resource definitions above carry all the information about how to load the resource: which function to call, how to process the identifiers, references, directions, and all other attributes from the input. E.g. which column from SPIKE corresponds to the source node? Which identifier type is used? It is the second column (index 1, as column indices are zero-based), and it contains gene symbols:

[24]:

                            netres.pathway['spike'].networkinput.id_col_a, netres.pathway['spike'].networkinput.id_type_a

                          

executed in 0ms, finished 14:06:07 2022-12-02

[24]:

(1, 'genesymbol')
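
The meaning of such a column index can be illustrated on a made-up tab-separated record. The line below is invented for this sketch and does not follow the real SPIKE format; id_col_b is likewise an assumed position for the target column:

```python
# A made-up tab-separated record; real SPIKE input looks different.
line = 'ROW-1\tEGF\tEGFR\tactivation'
fields = line.split('\t')

id_col_a = 1  # zero-based: the second column holds the source node
id_col_b = 2  # assumed position of the target node in this toy record

source, target = fields[id_col_a], fields[id_col_b]
print(source, '->', target)
```

Index 0 would select the first column, here the made-up row label.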

The `Network` object§

Once you have built a network, you can use it for various purposes and write your own scripts for further processing or analysis. Below we create a Network object and populate it with the pathway dataset.

Warning: it is recommended to access databases through the manager. Running the code below takes a long time, and it does not save or reload the database: it builds a fresh copy each time.

[2]:

from pypath.core import network
from pypath.resources import network as netres

n = network.Network()
n.load(netres.pathway)
n

                          

executed in 36.07s, finished 14:15:48 2022-12-02

[2]:

<Network: 6833 nodes, 25607 interactions>

Almost all data is stored in Network.interactions, a dict with pairs of nodes as keys and Interaction objects as values:

[3]:

                            n.interactions

                          

executed in 0ms, finished 14:17:02 2022-12-02

[3]:

{(<Entity: TRPC1>,
  <Entity: KCNMA1>): <Interaction: TRPC1 ============= KCNMA1 [Evidences: TRIP (2 references)]>,
 (<Entity: TRPC1>,
  <Entity: PPP3CA>): <Interaction: TRPC1 ============= PPP3CA [Evidences: TRIP (1 references)]>,
 (<Entity: CALM2>,
  <Entity: TRPC1>): <Interaction: CALM2 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM3>,
  <Entity: TRPC1>): <Interaction: CALM3 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CALM1>,
  <Entity: TRPC1>): <Interaction: CALM1 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
 (<Entity: CASP1>,
  <Entity: TRPC1>): <Interaction: CASP1 ============= TRPC1 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CASP4>): <Interaction: TRPC1 ============= CASP4 [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CACNA1C>): <Interaction: TRPC1 ============= CACNA1C [Evidences: TRIP (1 references)]>,
 (<Entity: TRPC1>,
  <Entity: CAV1>): <Interaction: TRPC1 <=(+)======== CAV1 [Ev

Output truncated: showing 1000 of 118492 characters

The dict under Network.nodes is kept in sync with the interactions, and facilitates node access. Keys are primary identifiers (for proteins UniProt IDs by default), values are Entity objects:

[26]:

                            n.nodes

                          

executed in 0ms, finished 14:06:21 2022-12-02

[26]:

{'P48995': <Entity: TRPC1>,
 'Q12791': <Entity: KCNMA1>,
 'Q08209': <Entity: PPP3CA>,
 'P0DP24': <Entity: CALM2>,
 'P0DP25': <Entity: CALM3>,
 'P0DP23': <Entity: CALM1>,
 'P29466': <Entity: CASP1>,
 'P49662': <Entity: CASP4>,
 'Q13936': <Entity: CACNA1C>,
 'Q03135': <Entity: CAV1>,
 'P56539': <Entity: CAV3>,
 'Q14247': <Entity: CTTN>,
 'P14416': <Entity: DRD2>,
 'P11532': <Entity: DMD>,
 'P11362': <Entity: FGFR1>,
 'Q02790': <Entity: FKBP4>,
 'Q86YM7': <Entity: HOMER1>,
 'Q9NSC5': <Entity: HOMER3>,
 'Q99750': <Entity: MDFI>,
 'Q14571': <Entity: ITPR2>,
 'Q14573': <Entity: ITPR3>,
 'P29966': <Entity: MARCKS>,
 'Q13255': <Entity: GRM1>,
 'P20591': <Entity: MX1>,
 'P62166': <Entity: NCS1>,
 'Q96D31': <Entity: ORAI1>,
 'Q96SN7': <Entity: ORAI2>,
 'Q9BRQ5': <Entity: ORAI3>,
 'P11171': <Entity: EPB41>,
 'P61586': <Entity: RHOA>,
 'Q9Y225': <Entity: RNF24>,
 'P21817': <Entity: RYR1>,
 'P16615': <Entity: ATP2A2>,
 'Q93084': <Entity: ATP2A3>,
 'P60880': <Entity: SNAP25>,
 'Q13586': <Entity: STI

Output truncated: showing 1000 of 30573 characters
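
The relationship between the two dicts can be mimicked with a toy container. This is a simplified sketch and not pypath's Network class: ToyNetwork is hypothetical, it stores plain strings instead of Entity and Interaction objects, and sorting the pair to form the key is an assumption of this example. The identifiers and labels are taken from the outputs above; ExampleDB is a made-up resource name:

```python
class ToyNetwork:
    """Toy container: interactions keyed by node pairs, nodes kept in sync."""

    def __init__(self):
        self.interactions = {}  # (id_a, id_b) -> set of resource names
        self.nodes = {}         # id -> label

    def add(self, a, b, resource, labels):
        # Sort the pair so that (a, b) and (b, a) share one record.
        key = tuple(sorted((a, b)))
        self.interactions.setdefault(key, set()).add(resource)
        # Every node of an interaction is registered among the nodes too.
        for node in key:
            self.nodes[node] = labels[node]

    def interaction(self, a, b):
        """Look up a pair regardless of the order of its members."""
        return self.interactions.get(tuple(sorted((a, b))))


labels = {'P48995': 'TRPC1', 'Q12791': 'KCNMA1'}

toy = ToyNetwork()
toy.add('P48995', 'Q12791', 'TRIP', labels)
toy.add('Q12791', 'P48995', 'ExampleDB', labels)

print(toy.interactions)
print(toy.nodes)
```

Both additions end up in the same record, and the node dict never contains an entity without an interaction, which is the sense in which the two structures are kept in sync.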

An interaction between a pair of entities can be accessed:

[27]:

                            n.interaction('EGF', 'EGFR')

                          

executed in 0ms, finished 14:06:27 2022-12-02

[27]:

<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>

Similarly, individual nodes can be looked up:

[28]:

                            n.entity('EGFR')

                          

executed in 0ms, finished 14:06:29 2022-12-02

[28]:

<Entity: EGFR>

Labels (gene symbols for proteins by default), identifiers (such as UniProt IDs) and Entity objects can be used to refer to nodes. Each node carries some basic information:

[29]:

egfr = n.entity('EGFR')
egfr.identifier, egfr.label, egfr.entity_type, egfr.id_type, egfr.taxon

executed in 0ms, finished 14:06:32 2022-12-02

[29]:

('P00533', 'EGFR', 'protein', 'uniprot', 9606)

Interactions feature a number of methods to access various information, such as their types, direction, effect, resources, references, etc. The very same methods are also available for the whole network. Below we only show a few examples of these methods.

[30]:

ia = n.interaction('EGF', 'EGFR')
ia

executed in 0ms, finished 14:06:34 2022-12-02

[30]:

<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>

[31]:

                            ia.get_resource_names()

                          

executed in 0ms, finished 14:06:47 2022-12-02

[31]:

{'Baccin2019',
 'CellTalkDB',
 'HPMR',
 'ICELLNET',
 'LRdb',
 'SIGNOR',
 'SPIKE',
 'SignaLink3',
 'connectomeDB2020'}

[32]:

                            ia.get_references()

                          

executed in 0ms, finished 14:06:50 2022-12-02

[32]:

{<Reference: 10085134>,
 <Reference: 10209155>,
 <Reference: 10788520>,
 <Reference: 12093292>,
 <Reference: 12297050>,
 <Reference: 12620237>,
 <Reference: 12648462>,
 <Reference: 15620700>,
 <Reference: 16274239>,
 <Reference: 17145710>,
 <Reference: 19531499>,
 <Reference: 20458382>,
 <Reference: 21071413>,
 <Reference: 23331499>,
 <Reference: 3494473>,
 <Reference: 6289330>,
 <Reference: 8639530>}

This is a valid direction for this interaction:

[33]:

                            ia.get_direction(('EGF', 'EGFR'))

                          

executed in 0ms, finished 14:06:53 2022-12-02

[33]:

True

The opposite direction is not supported by any of the resources:

[34]:

                            ia.get_direction(('EGFR', 'EGF'))

                          

executed in 0ms, finished 14:06:55 2022-12-02

[34]:

False

However, some resources provide no direction information; these are classified as “undirected”:

ia.get_direction('undirected')

We can check which resources are those exactly:

[35]:

                            ia.get_direction('undirected', sources = True)

                          

executed in 0ms, finished 14:07:23 2022-12-02

[35]:

{'HPMR', 'SPIKE'}

Effect signs (stimulation, inhibition) are available in a similar way. The first of the two Boolean values means stimulation (activation), the second inhibition.

[36]:

                            ia.get_sign(('EGF', 'EGFR'))

                          

executed in 0ms, finished 14:07:25 2022-12-02

[36]:

[True, False]
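The pair of Booleans can be turned into a readable label with a few lines of plain Python. This is only a sketch working on the list shown above; describe_sign is a hypothetical helper, not part of pypath.

```python
def describe_sign(sign):
    """Translate a [stimulation, inhibition] pair of Booleans to a label."""
    stimulation, inhibition = sign
    if stimulation and inhibition:
        return 'stimulation and inhibition'
    if stimulation:
        return 'stimulation'
    if inhibition:
        return 'inhibition'
    return 'unknown effect'


# The value returned above for EGF -> EGFR
label = describe_sign([True, False])
print(label)
```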

Which resources support the effect signs:

[37]:

                            ia.get_sign(('EGF', 'EGFR'), sources = True)

                          

executed in 0ms, finished 14:07:28 2022-12-02

[37]:

[{'SIGNOR', 'SPIKE', 'SignaLink3'}, set()]

Many methods start with get_..., such as:

[38]:

                            ia.get_interaction_types()

                          

executed in 0ms, finished 14:07:30 2022-12-02

[38]:

{'post_translational'}

Others are called ..._by_...; these combine two get_... methods:

[39]:

                            ia.references_by_resource()

                          

executed in 0ms, finished 14:07:32 2022-12-02

[39]:

{'ICELLNET': {<Reference: 8639530>},
 'SIGNOR': {<Reference: 12297050>, <Reference: 12648462>},
 'SignaLink3': {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 19531499>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 'Baccin2019': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'LRdb': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'SPIKE': {<Reference: 12297050>,
  <Reference: 17145710>,
  <Reference: 20458382>,
  <Reference: 3494473>},
 'CellTalkDB': {<Reference: 12093292>},
 'connectomeDB2020': {<Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 6289330>},
 'HPMR': {<Reference: 6289330>}}
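A dictionary like the one above is easy to post-process. The sketch below inverts a subset of it to find references shared by several resources. As a simplifying assumption, PubMed IDs are plain strings here instead of Reference objects.

```python
from collections import defaultdict

# A subset of the output above, with PubMed IDs as plain strings
refs_by_resource = {
    'ICELLNET': {'8639530'},
    'SIGNOR': {'12297050', '12648462'},
    'SPIKE': {'12297050', '17145710', '20458382', '3494473'},
    'CellTalkDB': {'12093292'},
    'HPMR': {'6289330'},
}

# Invert the mapping: reference -> set of resources citing it
resources_by_ref = defaultdict(set)
for resource, refs in refs_by_resource.items():
    for ref in refs:
        resources_by_ref[ref].add(resource)

# Keep only the references cited by more than one resource
shared = {ref: res for ref, res in resources_by_ref.items() if len(res) > 1}
print(shared)
```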

All these methods accept the same filtering parameters. For example, if you are interested only in certain resources, you can restrict the query to those. The two resources below provide no positive effect sign for this interaction:

[40]:

                            ia.get_interactions_positive(resources = {'ICELLNET', 'HPMR'})

                          

executed in 0ms, finished 14:07:39 2022-12-02

[40]:

()

While some other resources do:

[41]:

                            ia.get_interactions_positive(resources = {'SignaLink3'})

                          

executed in 0ms, finished 14:07:42 2022-12-02

[41]:

((<Entity: EGF>, <Entity: EGFR>),)

Or see the references that do or do not provide effect sign:

[42]:

                            ia.get_references(effect = True), ia.get_references(effect = False)

                          

executed in 0ms, finished 14:07:44 2022-12-02

[42]:

({<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 12297050>,
  <Reference: 12648462>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>},
 {<Reference: 10085134>,
  <Reference: 10209155>,
  <Reference: 10788520>,
  <Reference: 12093292>,
  <Reference: 12297050>,
  <Reference: 12620237>,
  <Reference: 12648462>,
  <Reference: 15620700>,
  <Reference: 16274239>,
  <Reference: 17145710>,
  <Reference: 19531499>,
  <Reference: 20458382>,
  <Reference: 21071413>,
  <Reference: 23331499>,
  <Reference: 3494473>,
  <Reference: 6289330>,
  <Reference: 8639530>})

Network in pandas.DataFrame§

Contents of a pypath.core.network.Network object can be exported to a pandas.DataFrame:

[1]:

from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu.make_df()
cu.df

                          

executed in 23.41s, finished 15:24:19 2022-12-03

[1]:

	id_a	id_b	type_a	type_b	directed	effect	type	dmodel	sources	references
0	P48995	Q12791	protein	protein	False	0	post_translational	{activity_flow}	{TRIP}	NaN
1	P48995	Q08209	protein	protein	False	0	post_translational	{activity_flow}	{TRIP}	NaN
2	P0DP23	P48995	protein	protein	True	-1	post_translational	{activity_flow}	{TRIP}	NaN
3	P0DP25	P48995	protein	protein	True	-1	post_translational	{activity_flow}	{TRIP}	NaN
4	P0DP24	P48995	protein	protein	True	-1	post_translational	{activity_flow}	{TRIP}	NaN
...	...	...	...	...	...	...	...	...	...	...
44033	Q14289	Q9ULZ3	protein	protein	True	0	post_translational	{enzyme_substrate}	{iPTMnet}	NaN
44034	P54646	Q9Y2I7	protein	protein	True	0	post_translational	{enzyme_substrate}	{iPTMnet}	NaN
44035	Q9BXM7	Q9Y2N7	protein	protein	True	0	post_translational	{enzyme_substrate}	{iPTMnet}	NaN
44036	P49137	Q9Y385	protein	protein	True	0	post_translational	{enzyme_substrate}	{iPTMnet}	NaN
44037	Q9UHC7	P04637	protein	protein	True	0	post_translational	{enzyme_substrate}	{iPTMnet}	NaN

44038 rows × 10 columns

The pypath.omnipath.export module offers independent and more flexible interfaces for building network data frames. These are also used to build the tables served by the web server.

[12]:

from pypath import omnipath
from pypath.omnipath import export

cu = omnipath.db.get_db('curated')
e = export.Export(cu)
e.make_df(unique_pairs = False)
e.df

                          

executed in 22.65s, finished 19:20:12 2023-03-10

[12]:

	source	target	source_genesymbol	target_genesymbol	is_directed	is_stimulation	is_inhibition	consensus_direction	consensus_stimulation	consensus_inhibition	sources	references
0	P48995	Q12791	TRPC1	KCNMA1	0	0	0	0	0	0	TRIP	TRIP:19168436;TRIP:25139746
1	P48995	Q08209	TRPC1	PPP3CA	0	0	0	0	0	0	TRIP	TRIP:23228564
2	P0DP23	P48995	CALM1	TRPC1	1	0	1	1	0	1	TRIP	TRIP:11290752;TRIP:11983166;TRIP:12601176
3	P0DP25	P48995	CALM3	TRPC1	1	0	1	1	0	1	TRIP	TRIP:11290752;TRIP:11983166;TRIP:12601176
4	P0DP24	P48995	CALM2	TRPC1	1	0	1	1	0	1	TRIP	TRIP:11290752;TRIP:11983166;TRIP:12601176
...	...	...	...	...	...	...	...	...	...	...	...	...
36729	Q14289	Q9ULZ3	PTK2B	PYCARD	1	0	0	0	0	0	iPTMnet	iPTMnet:27796369
36730	P54646	Q9Y2I7	PRKAA2	PIKFYVE	1	0	0	0	0	0	iPTMnet	iPTMnet:24070423
36731	Q9BXM7	Q9Y2N7	PINK1	HIF3A	1	0	0	0	0	0	iPTMnet	iPTMnet:27551449
36732	P49137	Q9Y385	MAPKAPK2	UBE2J1	1	0	0	0	0	0	iPTMnet	iPTMnet:24020373
36733	Q9UHC7	P04637	MKRN1	TP53	1	0	0	0	0	0	iPTMnet	iPTMnet:19536131

36734 rows × 12 columns
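In this data frame the references column packs resource names and PubMed IDs into a single string, in the form Resource:PMID separated by semicolons. A minimal sketch of unpacking such a cell follows. It assumes the cell is a string, and parse_references is a hypothetical helper, not a pypath function.

```python
from collections import defaultdict


def parse_references(cell):
    """Split 'Resource:PMID;Resource:PMID' into a dict of sets."""
    result = defaultdict(set)
    for item in cell.split(';'):
        if not item:
            # skip empty items, e.g. from an empty cell
            continue
        resource, _, pmid = item.partition(':')
        result[resource].add(pmid)
    return dict(result)


# The references cell of the first row above
parsed = parse_references('TRIP:19168436;TRIP:25139746')
print(parsed)
```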

The data frame built for the web service includes even more details. Using the extra_node_attrs and extra_edge_attrs arguments of the Export object, you can fully customise these data frames.

[13]:

e.webservice_interactions_df()
e.df

executed in 21.99s, finished 19:22:51 2023-03-10

[13]:

	source	target	source_genesymbol	target_genesymbol	is_directed	is_stimulation	is_inhibition	consensus_direction	consensus_stimulation	consensus_inhibition	...	dorothea_tfbs	dorothea_coexp	dorothea_level	type	curation_effort	extra_attrs	ncbi_tax_id_source	entity_type_source	ncbi_tax_id_target	entity_type_target
0	P48995	Q12791	TRPC1	KCNMA1	0	0	0	0	0	0	...	None	None		post_translational	2	{"TRIP_method":["Co-immunoprecipitation","Co-i...	9606	protein	9606	protein
1	P48995	Q08209	TRPC1	PPP3CA	0	0	0	0	0	0	...	None	None		post_translational	1	{"TRIP_method":["Co-immunoprecipitation"]}	9606	protein	9606	protein
2	P0DP23	P48995	CALM1	TRPC1	1	0	1	1	0	1	...	None	None		post_translational	3	{"TRIP_method":["Fluorescence probe labeling",...	9606	protein	9606	protein
3	P0DP25	P48995	CALM3	TRPC1	1	0	1	1	0	1	...	None	None		post_translational	3	{"TRIP_method":["Fluorescence probe labeling",...	9606	protein	9606	protein
4	P0DP24	P48995	CALM2	TRPC1	1	0	1	1	0	1	...	None	None		post_translational	3	{"TRIP_method":["Fluorescence probe labeling",...	9606	protein	9606	protein
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
36729	Q14289	Q9ULZ3	PTK2B	PYCARD	1	0	0	0	0	0	...	None	None		post_translational	1	{}	9606	protein	9606	protein
36730	P54646	Q9Y2I7	PRKAA2	PIKFYVE	1	0	0	0	0	0	...	None	None		post_translational	1	{}	9606	protein	9606	protein
36731	Q9BXM7	Q9Y2N7	PINK1	HIF3A	1	0	0	0	0	0	...	None	None		post_translational	1	{}	9606	protein	9606	protein
36732	P49137	Q9Y385	MAPKAPK2	UBE2J1	1	0	0	0	0	0	...	None	None		post_translational	1	{}	9606	protein	9606	protein
36733	Q9UHC7	P04637	MKRN1	TP53	1	0	0	0	0	0	...	None	None		post_translational	1	{}	9606	protein	9606	protein

36734 rows × 34 columns
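The extra_attrs cells shown above look like JSON strings. Assuming that is the case, they can be decoded with the standard json module. The sketch below uses one value copied from the table.

```python
import json

# Value of extra_attrs in the second row above
cell = '{"TRIP_method":["Co-immunoprecipitation"]}'

attrs = json.loads(cell)
# Fall back to an empty list if the key is absent
methods = attrs.get('TRIP_method', [])
print(methods)

# Rows without extra attributes carry an empty JSON object
empty = json.loads('{}')
```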

Self interactions (loop edges) in the network§

Depending on the downstream application, loops might be beneficial or undesired. By default loops are disabled, but are enabled for OmniPath and the GRN networks among the built-in network databases. The allow_loops parameter can be set at the module level or at the instance level. If set at the module level, it will be valid for all subsequently created instances:

[14]:

from pypath.share import settings
settings.setup(network_allow_loops = True)

executed in 0ms, finished 19:32:52 2023-03-10

If set at the instance level, it will be valid for the instance:

[15]:

from pypath.core import network
n = network.Network(allow_loops = True)

executed in 0ms, finished 19:33:44 2023-03-10

If you want to keep loops only for certain resources, first load the resources where loops should be removed, then remove the loops, and finally load the resources where you wish to keep the loops:

[30]:

from pypath.core import network
from pypath import resources

co = resources.get_controller()
pw = co.collect_network('pathway')
gr = co.collect_network('dorothea', interaction_types = 'transcriptional')

# load the pathway resources first, with loops removed
n = network.Network(pw, allow_loops = False)
# then add the transcriptional resources, keeping their loops
n.load(gr, allow_loops = True)
n.count_loops()

                          

executed in 2m 24.45s, finished 19:56:41 2023-03-10

[30]:

[32]:

                            n.count_interactions_by_interaction_type()

                          

executed in 16.50s, finished 19:59:10 2023-03-10

[32]:

{'post_translational': 33571, 'transcriptional': 281262}
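The effect of the loading order can be pictured with plain edge lists. This is a toy sketch of the principle only, not how pypath implements it. The node names and remove_loops are made up for illustration.

```python
def remove_loops(edges):
    """Drop edges whose source and target are the same node."""
    return [(a, b) for a, b in edges if a != b]


# Toy edge lists; the node names are only for illustration
pathway_edges = [('A', 'B'), ('B', 'B'), ('B', 'C')]
grn_edges = [('TF1', 'TF1'), ('TF1', 'G1')]

# Step 1: load the first set and remove its loops
edges = remove_loops(pathway_edges)
# Step 2: add the second set, keeping its loops
edges.extend(grn_edges)

n_loops = sum(1 for a, b in edges if a == b)
print(n_loops)
```

Only the loop from the second edge list survives, because the loops were removed before that list was added.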

Molecular complexes in the network§

Currently pypath supports protein complexes; other kinds of components, such as small molecules and nucleic acids, will be supported soon. Complexes are represented by pypath.internals.intera.Complex objects, and can be network nodes. These objects optionally carry information about the defining resources, references, stoichiometry and custom attributes. Apart from the components and resources, none of these is mandatory. For more information, see the Protein complexes section in this notebook. Here we only show how complexes are included in networks.

The Network object either represents each complex as a node (default behaviour), or expands the complex by creating a node for each of its components and applying all the interactions of the complex to each of its components. This latter method has adverse effects on network topology, and can be enabled by setting network_expand_complexes to True.

Only a few resources list interactions of protein complexes, for example SIGNOR, CollecTRI, Guide to Pharmacology and CellphoneDB. Let’s load such a resource:

[1]:

from pypath.core import network
from pypath.resources import network as netres

n = network.Network(netres.collectri)

executed in 38.12s, finished 20:35:23 2023-03-27
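To see why expanding complexes changes the network topology, consider a toy version of the expansion described above. This is a sketch of the idea, not pypath's implementation. expand_complexes is a hypothetical helper, and complexes are represented simply by their labels.

```python
def expand_complexes(edges, complexes):
    """
    Replace each complex by its components, copying every interaction
    of the complex to each of the components.
    """
    expanded = set()
    for source, target in edges:
        # nodes that are not complexes are kept as they are
        sources = complexes.get(source, (source,))
        targets = complexes.get(target, (target,))
        for s in sources:
            for t in targets:
                expanded.add((s, t))
    return expanded


complexes = {'FOSL1_JUND': ('FOSL1', 'JUND')}
edges = [('FOSL1_JUND', 'CAT')]

expanded = expand_complexes(edges, complexes)
print(sorted(expanded))
```

A single interaction of the complex becomes one interaction per component, so node degrees and edge counts grow with the size of the complexes.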

We can retrieve various information about the complexes in the network, e.g. count them:

[2]:

                            n.count_complexes()

                          

executed in 1.45s, finished 20:37:11 2023-03-27

[2]:

Or list them:

[3]:

                            n.get_complexes()

                          

executed in 1.50s, finished 20:37:34 2023-03-27

[3]:

{<Entity: FOS_JUN>,
 <Entity: FOS_JUNB>,
 <Entity: FOS_JUND>,
 <Entity: JUN>,
 <Entity: FOSL1_JUN>,
 <Entity: FOSL2_JUN>,
 <Entity: JUN_JUNB>,
 <Entity: JUN_JUND>,
 <Entity: FOSB_JUN>,
 <Entity: FOSL1_JUNB>,
 <Entity: FOSL1_JUND>,
 <Entity: FOSL2_JUNB>,
 <Entity: FOSL2_JUND>,
 <Entity: JUNB>,
 <Entity: JUNB_JUND>,
 <Entity: FOSB_JUNB>,
 <Entity: JUND>,
 <Entity: FOSB_JUND>,
 <Entity: NFKB1>,
 <Entity: NFKB1_NFKB2>,
 <Entity: NFKB1_RELB>,
 <Entity: NFKB1_RELA>,
 <Entity: NFKB1_REL>,
 <Entity: NFKB2>,
 <Entity: NFKB2_RELB>,
 <Entity: NFKB2_RELA>,
 <Entity: NFKB2_REL>,
 <Entity: RELB>,
 <Entity: RELA_RELB>,
 <Entity: REL_RELB>,
 <Entity: RELA>,
 <Entity: REL_RELA>,
 <Entity: REL>}

In the network, these are Entity objects, and their identifier attribute is the Complex object:

[4]:

cplex_entity = list(n.get_complexes())[0]
cplex_entity

executed in 1.40s, finished 20:39:53 2023-03-27

[4]:

<Entity: REL_RELA>

[6]:

cplex = cplex_entity.identifier
cplex

executed in 0ms, finished 20:40:32 2023-03-27

[6]:

Complex: COMPLEX:Q04206_Q04864

When creating a data frame, the complex objects are added to the identifier cells, where we used to have UniProt IDs for single proteins. The labels are the gene symbols of the components, separated by underscore by default.

[8]:

from pypath.omnipath import export
from pypath.internals import intera

e = export.Export(n)
e.make_df(unique_pairs = False)
e.df[[isinstance(s, intera.Complex) for s in e.df.source]]

                          

executed in 9.65s, finished 20:44:06 2023-03-27

[8]:

	source	target	source_genesymbol	target_genesymbol	is_directed	is_stimulation	is_inhibition	consensus_direction	consensus_stimulation	consensus_inhibition	sources	references
1	(P17535, P15407)	P04040	FOSL1_JUND	CAT	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:10022519;CollecTRI:10329043;CollecTR...
2	(P05412, P15408)	P04040	FOSL2_JUN	CAT	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:10022519;CollecTRI:10329043;CollecTR...
3	(P05412, P15407)	P04040	FOSL1_JUN	CAT	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:10022519;CollecTRI:10329043;CollecTR...
4	(P05412, P17275)	P04040	JUN_JUNB	CAT	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:10022519;CollecTRI:10329043;CollecTR...
5	(P17275, P17535)	P04040	JUNB_JUND	CAT	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:10022519;CollecTRI:10329043;CollecTR...
...	...	...	...	...	...	...	...	...	...	...	...	...
54980	(P17535, P01100)	P01270	FOS_JUND	PTH	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:9989817
54981	(P17275, P15408)	P01270	FOSL2_JUNB	PTH	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:9989817
54982	(P05412, P53539)	P01270	FOSB_JUN	PTH	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:9989817
54983	(P17275, P15407)	P01270	FOSL1_JUNB	PTH	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:9989817
54984	(P17275)	P01270	JUNB	PTH	1	1	0	1	1	0	CollecTRI;ExTRI_CollecTRI	CollecTRI:9989817

23235 rows × 12 columns

For some reason, pandas displays the Complex objects as tuples, but the cells still contain Complex objects:

[10]:

                            e.df[[isinstance(s, intera.Complex) for s in e.df.source]].source.iloc[0]

                          

executed in 0ms, finished 20:45:07 2023-03-27

[10]:

Complex: COMPLEX:P15407_P17535

[12]:

                            e.webservice_interactions_df()

                          

executed in 41.08s, finished 20:48:51 2023-03-27

[13]:

                            e.df

                          

executed in 0ms, finished 20:50:14 2023-03-27

[13]:

	source	target	source_genesymbol	target_genesymbol	is_directed	is_stimulation	is_inhibition	consensus_direction	consensus_stimulation	consensus_inhibition	...	dorothea_tfbs	dorothea_coexp	dorothea_level	type	curation_effort	extra_attrs	ncbi_tax_id_source	entity_type_source	ncbi_tax_id_target	entity_type_target
0	P01106	O14746	MYC	TERT	1	1	0	1	1	0	...	None	None		transcriptional	75	{}	9606	protein	9606	protein
1	(P17535, P15407)	P04040	FOSL1_JUND	CAT	1	1	0	1	1	0	...	None	None		transcriptional	14	{}	9606	complex	9606	protein
2	(P05412, P15408)	P04040	FOSL2_JUN	CAT	1	1	0	1	1	0	...	None	None		transcriptional	14	{}	9606	complex	9606	protein
3	(P05412, P15407)	P04040	FOSL1_JUN	CAT	1	1	0	1	1	0	...	None	None		transcriptional	14	{}	9606	complex	9606	protein
4	(P05412, P17275)	P04040	JUN_JUNB	CAT	1	1	0	1	1	0	...	None	None		transcriptional	14	{}	9606	complex	9606	protein
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
67945	Q01196	Q13094	RUNX1	LCP2	1	1	0	1	1	0	...	None	None		transcriptional	1	{}	9606	protein	9606	protein
67946	Q01196	Q6MZQ0	RUNX1	PRR5L	1	1	0	1	1	0	...	None	None		transcriptional	1	{}	9606	protein	9606	protein
67947	Q15672	P08151	TWIST1	GLI1	1	1	0	1	1	0	...	None	None		transcriptional	1	{}	9606	protein	9606	protein
67948	P22415	Q5SRE5	USF1	NUP188	1	1	0	1	1	0	...	None	None		transcriptional	1	{}	9606	protein	9606	protein
67949	Q9UQR1	Q5VYX0	ZNF148	RNLS	1	1	0	1	1	0	...	None	None		transcriptional	1	{}	9606	protein	9606	protein

67950 rows × 34 columns

When we export to CSV, the Complex objects are converted to the string notation familiar from the OmniPath web service. See for example COMPLEX:P15407_P17535 below, and its human readable label FOSL1_JUND in the gene symbols column:

[15]:

                            e.df[[ets == 'complex' for ets in e.df.entity_type_source]].to_csv(index = False)[:1000]

                          

executed in 0ms, finished 20:55:26 2023-03-27

[15]:

'source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,omnipath,kinaseextra,ligrecextra,pathwayextra,mirnatarget,dorothea,tf_target,lncrna_mrna,tf_mirna,small_molecule,dorothea_curated,dorothea_chipseq,dorothea_tfbs,dorothea_coexp,dorothea_level,type,curation_effort,extra_attrs,ncbi_tax_id_source,entity_type_source,ncbi_tax_id_target,entity_type_target\nCOMPLEX:P15407_P17535,P04040,FOSL1_JUND,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_CollecTRI,CollecTRI:10022519;CollecTRI:10329043;CollecTRI:12036993;CollecTRI:12538496;CollecTRI:17935786;CollecTRI:7489329;CollecTRI:7651432;CollecTRI:7818486;CollecTRI:8867782;CollecTRI:9030359;CollecTRI:9136992;CollecTRI:9142914;CollecTRI:9168892;CollecTRI:9687385,False,False,False,False,False,False,False,False,False,False,,,,,,transcriptional,14,{},9606,complex,9606,protein\nCOMPLEX:P05412_P15408,P04040,FOSL2_JUN,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_C

Output truncated: showing 1000 of 1004 characters
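Once the table is in CSV, the complex identifiers are ordinary strings, and their components can be recovered with simple string operations. A minimal sketch, assuming the COMPLEX: prefix and underscore separator seen above. parse_complex is a hypothetical helper.

```python
def parse_complex(identifier):
    """Return component IDs of 'COMPLEX:ID1_ID2', or None for other IDs."""
    prefix = 'COMPLEX:'
    if not identifier.startswith(prefix):
        return None
    return identifier[len(prefix):].split('_')


components = parse_complex('COMPLEX:P15407_P17535')
print(components)

# Identifiers of single proteins are not complexes
single = parse_complex('P04040')
```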

Translating identifiers§

The pypath.utils.mapping module handles ID translation. Most of the time you can simply call the map_name function:

[1]:

from pypath.utils import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')

executed in 1.38s, finished 12:31:45 2023-03-21

[1]:

{'EGFR'}

By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous, and you want to retrieve only one ID. The map_name0 function returns a string; in case of ambiguity, it returns a random element from the resulting set:

[5]:

                          mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')

                        

executed in 0ms, finished 14:17:31 2022-12-02

[5]:

'Q9BY60'
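When reproducibility matters, a random pick may be undesirable. One option is to take the set returned by map_name and choose from it deterministically. The helper below is a hypothetical sketch, not part of pypath, and the ambiguous example uses made-up identifiers.

```python
def pick_one(ids):
    """Return the alphabetically first element of a set, or None if empty."""
    return min(ids) if ids else None


unambiguous = pick_one({'EGFR'})
# Toy, made-up identifiers to illustrate an ambiguous result
ambiguous = pick_one({'ID_B', 'ID_A'})
nothing = pick_one(set())

print(unambiguous, ambiguous, nothing)
```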

Molecules have a large variety of identifiers, but in pypath two identifier types are special:

The primary identifier defines the molecule category, e.g. if UniProt is the primary identifier for proteins, then a protein is anything that has a UniProt ID.
The label is a human readable identifier; for proteins it’s the gene symbol.

The primary ID and label types are configured for each molecule type (protein, miRNA, drug, etc) in the module settings. The mapping module provides shortcuts to translate between these identifiers: label and id_from_label.

[6]:

                          mapping.label('O75385')

                        

executed in 0ms, finished 14:17:33 2022-12-02

[6]:

'ULK1'

[7]:

                          mapping.id_from_label('ULK1')

                        

executed in 0ms, finished 14:17:35 2022-12-02

[7]:

{'O75385'}

[8]:

                          mapping.id_from_label0('ULK1')

                        

executed in 0ms, finished 14:17:37 2022-12-02

[8]:

'O75385'

Multiple IDs can be translated in one call; however, it’s not possible to know with certainty which output corresponds to which input.

[9]:

                          mapping.map_names(['ULK1', 'EGFR', 'SMAD2'], 'genesymbol', 'uniprot')

                        

executed in 0ms, finished 14:17:40 2022-12-02

[9]:

{'O75385', 'P00533', 'Q15796'}
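If the correspondence between inputs and outputs matters, the names can be translated one by one and collected in a dictionary. The sketch below uses a toy stand-in (toy_map_name, backed by a small dict filled from the outputs in this section) so that it runs without pypath.

```python
# Toy stand-in for a translation function that returns a set
TOY_TABLE = {
    'ULK1': {'O75385'},
    'EGFR': {'P00533'},
    'SMAD2': {'Q15796'},
}


def toy_map_name(name):
    """Return the set of translations, empty if the name is unknown."""
    return TOY_TABLE.get(name, set())


names = ['ULK1', 'EGFR', 'SMAD2', 'NOT_A_GENE']

# One entry per input keeps track of which output belongs to which input
translated = {name: toy_map_name(name) for name in names}
print(translated)
```

With pypath, the body of the dict comprehension would presumably be mapping.map_name(name, 'genesymbol', 'uniprot'), as in the earlier examples.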

The default organism is defined in the module settings; it is human by default. Translating for other organisms requires the ncbi_tax_id argument. Most functions in pypath also accept common or Latin names, but map_name accepts only numeric taxon IDs for efficiency. Let’s translate a mouse identifier:

[10]:

                          mapping.map_name('Smad2', 'genesymbol', 'uniprot', ncbi_tax_id = 10090)

                        

executed in 0ms, finished 14:17:44 2022-12-02

[10]:

{'Q62432'}

If no direct translation table is available between two ID types, pypath will try to translate via an intermediate ID type.

[11]:

                          mapping.map_name('8408', 'entrez', 'genesymbol')

                        

executed in 0ms, finished 14:17:46 2022-12-02

[11]:

{'ULK1'}

Behind the scenes the chain_map function is called:

[12]:

m = mapping.get_mapper()
m.chain_map('8408', id_type = 'entrez', target_id_type = 'genesymbol', by_id_type = 'uniprot')

executed in 0ms, finished 14:17:47 2022-12-02

[12]:

{'ULK1'}

And the procedure corresponds to the following:

[13]:

mapping.map_names(
    mapping.map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

                        

executed in 0ms, finished 14:17:49 2022-12-02

[13]:

{'ULK1'}
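The mechanics of translating through an intermediate ID type can also be shown with two plain dictionaries of sets. This is a toy sketch of the idea, not the code of chain_map. chain_translate is a hypothetical helper, and the two tables contain only the entries needed for the example.

```python
def chain_translate(name, first, second):
    """Translate name by `first`, then each intermediate ID by `second`."""
    result = set()
    for intermediate in first.get(name, set()):
        # union of the translations of all intermediate IDs
        result |= second.get(intermediate, set())
    return result


entrez_to_uniprot = {'8408': {'O75385'}}
uniprot_to_genesymbol = {'O75385': {'ULK1'}}

symbols = chain_translate('8408', entrez_to_uniprot, uniprot_to_genesymbol)
print(symbols)
```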

Pre-defined ID translation tables§

A number of mapping tables are pre-defined; these load automatically on demand, and are removed from memory if not used for some time (5 minutes by default). New mapping tables are saved directly into pickle files in the cache for a quick reload. Tables are either organism specific (hence loaded for each organism one-by-one), or non-organism specific, such as drug IDs (pypath uses integer 0 in this case in place of the numeric NCBI Taxonomy ID). The identifier translation data is retrieved from the following sources:

UniProt legacy API (main UniProt API until autumn 2022): internals.input_formats.UniprotMapping
UniProt uploadlists API (also outdated, replaced by the new UniProt API): internals.input_formats.UniprotListMapping
Ensembl Biomart: internals.input_formats.BiomartMapping and internals.input_formats.ArrayMapping (for microarray probes)
Protein Ontology Consortium: internals.input_formats.ProMapping
UniChem: internals.input_formats.UnichemMapping
Arbitrary files: internals.input_formats.FileMapping (this class is used to process data from miRBase, some files from the UniProt FTP site, and also user defined, custom cases)
RaMP: internals.input_formats.RampMapping
HMDB: internals.input_formats.HmdbMapping

Some of the classes above are instantiated in internals.maps, but most of the instances are created on the fly when loading a mapping table in utils.mapping.MapReader. This latter class is responsible for taking a table definition and loading a utils.mapping.MappingTable instance. The whole process is managed by utils.mapping.Mapper; this is the object all the ID translation queries are dispatched to. It has a method to list the defined ID translation tables:

[3]:

                            mapping.mapping_tables()

                          

executed in 0ms, finished 12:32:06 2023-03-21

[3]:

[MappingTableDefinition(id_type_a='embl', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(embl)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(PREFERRED)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='genesymbol-syn', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(ALTERNATIVE)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='entrez', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(geneid)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='hgnc', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(HGNC)', resource_id_type_b='id'),
 MappingTableDefinition(id_type_a='refseqp', id_type_b='uniprot', resource='uniprot', input_cl

Output truncated: showing 1000 of 29850 characters

Pypath uses synonyms to refer to ID types: these are intended to be short, clear and lowercase for ease of use. Most of the synonyms are defined in internals.input_formats, in the AC_QUERY, AC_MAPPING, BIOMART_MAPPING, PRO_MAPPING and ARRAY_MAPPING dictionaries. UniChem ID types are used exactly as provided by UniChem. To list all available ID types (below, pypath is the synonym used here and original is the name in the original resource):

[4]:

                            mapping.id_types()

                          

executed in 0ms, finished 12:32:14 2023-03-21

[4]:

{IdType(pypath='CAS', original='CAS'),
 IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
 IdType(pypath='MedChemExpress', original='MedChemExpress'),
 IdType(pypath='actor', original='actor'),
 IdType(pypath='affy', original='affy'),
 IdType(pypath='affymetrix', original='affymetrix'),
 IdType(pypath='agilent', original='agilent'),
 IdType(pypath='alzforum', original='Alzforum_mut'),
 IdType(pypath='araport', original='Araport'),
 IdType(pypath='atlas', original='atlas'),
 IdType(pypath='bigg', original='bigg'),
 IdType(pypath='bindingdb', original='bindingdb'),
 IdType(pypath='biocyc', original='biocyc'),
 IdType(pypath='brenda', original='brenda'),
 IdType(pypath='carotenoiddb', original='carotenoiddb'),
 IdType(pypath='cas', original='CAS'),
 IdType(pypath='cas', original='cas_registry_number'),
 IdType(pypath='cas_id', original='CAS'),
 IdType(pypath='cgnc', original='CGNC'),
 IdType(pypath='chebi', original='chebi'),
 IdType(pypath='chembl', original='chembl'),
 IdType(pypath='ch

Output truncated: showing 1000 of 8561 characters

Direct access to ID translation tables§

The Mapper (or the mapping module) is able to return ID translation tables as dicts or data frames:

[5]:

tbl = mapping.translation_dict('uniprot', 'genesymbol')
tbl

executed in 0ms, finished 12:33:55 2023-03-21

[5]:

<MappingTable from=uniprot, to=genesymbol, taxon=9606 (20243 IDs)>

[7]:

                            'P00533' in tbl

                          

executed in 0ms, finished 12:34:16 2023-03-21

[7]:

True

[8]:

                            tbl['P00533']

                          

executed in 0ms, finished 12:34:25 2023-03-21

[8]:

{'EGFR'}

[9]:

                            'EGFR' in tbl

                          

executed in 0ms, finished 12:34:33 2023-03-21

[9]:

False

[10]:

                            list(tbl.items())[:10]

                          

executed in 0ms, finished 12:34:50 2023-03-21

[10]:

[('Q00604', {'NDP'}),
 ('Q9HB19', {'PLEKHA2'}),
 ('Q16718', {'NDUFA5'}),
 ('P55769', {'SNU13'}),
 ('Q92886', {'NEUROG1'}),
 ('Q6T4R5', {'NHS'}),
 ('P80188', {'LCN2'}),
 ('Q86XR2', {'FAM129C'}),
 ('Q5T2W1', {'PDZK1'}),
 ('Q9BSH3', {'NICN1'})]

The same table as data frame:

[12]:

                            mapping.translation_df('uniprot', 'genesymbol')

                          

executed in 0ms, finished 12:35:18 2023-03-21

[12]:

	uniprot	genesymbol
0	Q00604	NDP
1	Q9HB19	PLEKHA2
2	Q16718	NDUFA5
3	P55769	SNU13
4	Q92886	NEUROG1
...	...	...
20375	Q96L92	SNX27
20376	Q9UNH6	SNX7
20377	Q5VWJ9	SNX30
20378	Q9BZZ2	SIGLEC1
20379	Q96BD0	SLCO4A1

20380 rows × 2 columns
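As seen above, a translation table works in one direction only: 'EGFR' is not a key in the UniProt to gene symbol table. If such a table has been exported as a plain dict of sets, the reverse lookup can be built by inverting it. A minimal sketch using the first entries listed above:

```python
from collections import defaultdict

# First entries of the UniProt -> gene symbol table shown above
uniprot_to_genesymbol = {
    'Q00604': {'NDP'},
    'Q9HB19': {'PLEKHA2'},
    'Q16718': {'NDUFA5'},
}

# Invert: gene symbol -> set of UniProt IDs
genesymbol_to_uniprot = defaultdict(set)
for uniprot, symbols in uniprot_to_genesymbol.items():
    for symbol in symbols:
        genesymbol_to_uniprot[symbol].add(uniprot)

print(dict(genesymbol_to_uniprot))
```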

Orthology translation§

The utils.orthology module (formerly utils.homology) handles translation of data between organisms by orthologous gene pairs. Its most important function is translate. The source organism is human by default; the target must be provided. Below we use mouse (NCBI Taxonomy 10090):

[2]:

from pypath.utils import orthology
orthology.translate('P00533', target = 10090)

executed in 22.33s, finished 18:03:50 2023-09-28

[2]:

{'Q01279'}

ID translation and orthology translation are integrated, hence not only UniProt IDs can be translated:

[3]:

                          orthology.translate('EGFR', target = 10090, id_type = 'genesymbol')

                        

executed in 22.08s, finished 18:04:16 2023-09-28

[3]:

{'Egfr'}

This module uses data from the Orthologous Matrix (OMA), NCBI HomoloGene and Ensembl. The latter covers more organisms, and accepts some parameters (high confidence, one-to-one vs. one-to-many mapping). The default is to use only OMA, as it is the most comprehensive, up-to-date and easy-to-use resource. These parameters can be controlled by the settings module, or passed to the functions above and below, for example:

[8]:

                          orthology.translate('P00533', target = 10090, oma = False, homologene = False, ensembl = True, ensembl_hc = False, ensembl_types = 'one2one')

                        

executed in 24.52s, finished 18:07:43 2023-09-28

[8]:

{'Q01279'}

Orthology translation tables as dictionaries§

The translation tables are available as dicts of sets; these are convenient for use outside of pypath:

[9]:

human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol')
human_mouse_genesymbols['EGFR']

executed in 0ms, finished 18:08:26 2023-09-28

[9]:

{'Egfr'}

The relationship types and confidence levels can be included using the full_records argument:

[11]:

human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols['EGFR']

executed in 0ms, finished 18:10:13 2023-09-28

[11]:

{OmaOrtholog(id='Egfr', rel_type='1:1', score=12704.5703125)}
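Full records make it possible to filter by relationship type, for example to keep only one-to-one orthologs. The sketch below runs without pypath: Ortholog is a local stand-in with the same fields as the record printed above, and the values are copied from outputs in this section.

```python
from collections import namedtuple

# Local stand-in with the same fields as the record printed above
Ortholog = namedtuple('Ortholog', ['id', 'rel_type', 'score'])

records = {
    'EGFR': {Ortholog(id='Egfr', rel_type='1:1', score=12704.5703125)},
    'H4C3': {
        Ortholog(id='H4c1', rel_type='m:n', score=1262.050049),
        Ortholog(id='H4c3', rel_type='m:n', score=1262.050049),
    },
}

# Keep only the targets in a one-to-one relationship
one_to_one = {}
for source, orthologs in records.items():
    targets = {o.id for o in orthologs if o.rel_type == '1:1'}
    if targets:
        one_to_one[source] = targets

print(one_to_one)
```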

Orthology translation data frames§

Similarly, pandas.DataFrames are available:

[13]:

human_mouse_genesymbols = orthology.get_df(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols

executed in 0ms, finished 18:11:16 2023-09-28

[13]:

	source	target	rel_type	score
0	H4C3	H4c1	m:n	1262.050049
1	H4C3	H4c3	m:n	1262.050049
2	H4C3	H4c12	m:n	1262.050049
3	H4C3	H4c11	m:n	1262.050049
4	H4C3	H4c9	m:n	1262.050049
...	...	...	...	...
18446	GDAP2	Gdap2	1:1	5553.779785
18447	ITGA8	Itga8	1:1	10772.969727
18448	SEMA3F	Sema3f	1:1	9121.080078
18449	EEPD1	Eepd1	1:1	5874.350098
18450	DRG2	Drg2	1:1	4423.589844

18451 rows × 4 columns

Taxonomy§

Organisms matter everywhere: in the input, output and processing parts of pypath. For this reason we created a utility module to deal with translation of organism identifiers. We prefer NCBI Taxonomy IDs as the primary organism identifier. These are simple numbers: 9606 is human, 10090 is mouse, etc. Many databases use common English names or Latin (scientific) names. Some databases use custom codes, such as hsapiens in Ensembl (first letter of genus name + species name, without space, all lowercase) or hsa in miRBase and KEGG (first letter of genus name, first two letters of species name). The pypath.utils.taxonomy module features some convenient functions for handling all these names.
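The two code styles described above follow simple rules that can be written down in a few lines. This is only an illustration of the naming rules, with hypothetical helper names; it is not how pypath.utils.taxonomy derives the codes, and individual databases may have exceptions to these rules.

```python
def ensembl_style(latin_name):
    """First letter of the genus + species name, lowercase, no space."""
    genus, species = latin_name.lower().split()[:2]
    return genus[0] + species


def three_letter(latin_name):
    """First letter of the genus + first two letters of the species."""
    genus, species = latin_name.lower().split()[:2]
    return genus[0] + species[:2]


human_ensembl = ensembl_style('Homo sapiens')
human_short = three_letter('Homo sapiens')
mouse_short = three_letter('Mus musculus')

print(human_ensembl, human_short, mouse_short)
```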

Translating to NCBI Taxonomy, scientific names and common names§

The most often used function is ensure_ncbi_tax_id, which returns the NCBI Taxonomy ID for any comprehensible input:

[21]:

from pypath.utils import taxonomy
taxonomy.ensure_ncbi_tax_id('human'), taxonomy.ensure_ncbi_tax_id('H sapiens'), taxonomy.ensure_ncbi_tax_id('hsapiens'), taxonomy.ensure_ncbi_tax_id(9606), taxonomy.ensure_ncbi_tax_id('Homo sapiens')

executed in 0ms, finished 14:18:22 2022-12-02

[21]:

(9606, 9606, 9606, 9606, 9606)

To access scientific names or common names:

[22]:

                            taxonomy.ensure_latin_name('cow')

                          

executed in 0ms, finished 14:18:25 2022-12-02

[22]:

'Bos taurus'

[23]:

                            taxonomy.ensure_common_name('Erithacus rubecula')

                          

executed in 0ms, finished 14:18:27 2022-12-02

[23]:

'European robin'

Organism from UniProt ID§

The uniprot_taxid function returns the taxonomy ID for a SwissProt ID. Unfortunately it does not work for TrEMBL IDs, as that would require keeping too much data in memory.

[24]:

                            taxonomy.ensure_latin_name(taxonomy.uniprot_taxid('P53104'))

                          

executed in 1.19s, finished 14:18:30 2022-12-02

[24]:

'Saccharomyces cerevisiae'

UniProt§

UniProt is a huge, diverse resource that is essential for pypath as we use it as a reference set for proteomes and it provides ID translation data. Its input module pypath.inputs.uniprot is already more complex than an average input module. It harbors a little database manager that loads and unloads tables on demand, ensuring fast and convenient operation. Further services are available in the pypath.utils.uniprot module.

The UniProt input module§

All UniProt IDs for one organism§

The complete set of UniProt IDs for an organism is considered to be the proteome of the organism, and it is used in many procedures across pypath. All SwissProt IDs, all TrEMBL IDs or both together can be retrieved:

[119]:

from pypath.inputs import uniprot as iuniprot
(
    len(iuniprot.all_uniprots(organism = 10090)),
    len(iuniprot.all_swissprots(organism = 10090)),
    len(iuniprot.all_trembls(organism = 10090)),
)

                            

executed in 3m 33.99s, finished 16:07:43 2022-12-02

[119]:

(86440, 17131, 69300)

UniProt ID format validation§

UniProt defines a format for its accessions. Any string can be checked against this template to tell whether it is possibly a valid ID:

[124]:

from pypath.inputs import uniprot as iuniprot
iuniprot.valid_uniprot('A0A8D0H0C2')

executed in 0ms, finished 16:17:41 2022-12-02

[124]:

True
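
For illustration, such a format check can be sketched with the regular expression that UniProt documents for its accession numbers. Whether valid_uniprot uses exactly this expression is an assumption; looks_like_uniprot is a hypothetical name. Note that a match only means the string is well formed, not that the record exists.

```python
import re

# Accession number pattern as documented by UniProt:
# 6 character classic accessions, or the newer 6 or 10 character ones.
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|'
    r'[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot(name):
    """Tell if the whole string follows the accession number format."""

    return bool(UNIPROT_AC.fullmatch(name))


print(looks_like_uniprot('A0A8D0H0C2'), looks_like_uniprot('EGFR'))
```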

UniProt ID validation§

Other functions check whether an ID actually exists in UniProt. These functions require loading the list of all UniProt IDs for the organism, hence the first call might take a few minutes (in case a new download is necessary). Subsequent calls will be much faster.

[125]:

from pypath.inputs import uniprot as iuniprot
iuniprot.is_uniprot('P00533')

executed in 0ms, finished 16:17:44 2022-12-02

[125]:

True

[122]:

                              iuniprot.is_swissprot('P00533')

                            

executed in 0ms, finished 16:14:14 2022-12-02

[122]:

True

If the organism doesn’t match:

[123]:

                              iuniprot.is_uniprot('P00533', organism = 10090)

                            

executed in 0ms, finished 16:15:07 2022-12-02

[123]:

False

Single UniProt protein datasheet§

Raw contents of protein datasheets can be retrieved. The structure is a Python list of two-element tuples: the first element is the tag of the line, the second is the line content.

[126]:

from pypath.inputs import uniprot as iuniprot
iuniprot.protein_datasheet('P00533')

executed in 0ms, finished 16:18:06 2022-12-02

[126]:

[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM

Output truncated: showing 1000 of 58080 characters
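
A list of (tag, content) tuples is easy to process with plain Python. The sketch below uses a few lines copied from the output above and extracts the accessions and the NCBI Taxonomy ID; lines_by_tag is a hypothetical helper, not a pypath function.

```python
# A short excerpt of the datasheet shown above.
datasheet = [
    ('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
    ('AC', 'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
    ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
    ('OS', 'Homo sapiens (Human).'),
    ('OX', 'NCBI_TaxID=9606;'),
]


def lines_by_tag(datasheet, tag):
    """Collect the contents of all lines carrying a certain tag."""

    return [content for line_tag, content in datasheet if line_tag == tag]


# The AC lines hold semicolon separated accessions, the primary one first.
accessions = [
    ac.strip()
    for line in lines_by_tag(datasheet, 'AC')
    for ac in line.split(';')
    if ac.strip()
]

# The OX line looks like `NCBI_TaxID=9606;`.
taxid = int(lines_by_tag(datasheet, 'OX')[0].split('=')[1].rstrip(';'))

print(accessions[0], len(accessions), taxid)
```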

History of UniProt records§

[131]:

from pypath.inputs import uniprot as iuniprot
egfr_history = list(iuniprot.uniprot_history('P00533'))
egfr_history

                            

executed in 0ms, finished 16:21:15 2022-12-02

[131]:

[UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='282', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_03', date='2022-08-03', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='281', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_02', date='2022-05-25', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='280', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_01', date='2022-02-23', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='279', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2021_04', date='2021-09-29', replaces='', replaced_by=''),
 UniprotRecordHistory(entry_version='278', sequence_version='2', entry_name='EGFR_HUMAN', database='

Output truncated: showing 1000 of 50933 characters

[132]:

                              iuniprot.uniprot_recent_version('P00533')

                            

executed in 0ms, finished 16:21:57 2022-12-02

[132]:

UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by='')

[133]:

                              iuniprot.uniprot_history_recent_datasheet('P00533')

                            

executed in 1ms, finished 16:22:33 2022-12-02

[133]:

[('ID', 'EGFR_HUMAN              Reviewed;        1210 AA.'),
 ('AC',
  'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
 ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
 ('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
 ('DT', '01-NOV-1997, sequence version 2.'),
 ('DT', '12-OCT-2022, entry version 283.'),
 ('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
 ('DE', 'EC=2.7.10.1;'),
 ('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
 ('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
 ('DE', 'Flags: Precursor;'),
 ('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
 ('OS', 'Homo sapiens (Human).'),
 ('OC',
  'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
 ('OC',
  'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
 ('OC', 'Homo.'),
 ('OX', 'NCBI_TaxID=9606;'),
 ('RN', '[1]'),
 ('RP',
  'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM

Output truncated: showing 1000 of 58080 characters

The functions above are able to retrieve the latest datasheet of deleted UniProt records. However, they are slow as several queries are performed to process a single protein.

UniProt REST API§

UniProt deployed its new API in the autumn of 2022, and pypath has since fully transitioned to it. The API is accessed by the inputs.uniprot.uniprot_data and inputs.uniprot.uniprot_query functions, though for some purposes higher level functions are more convenient. A list of fields can be passed to these functions. By default only SwissProt is queried. The output of uniprot_data is a dict of dicts with fields as top level keys and UniProt IDs as second level keys. The results often contain notes, additional info in parentheses, and prefixes and postfixes for identifiers that are not needed in every situation. Using uniprot_preprocess instead of uniprot_data cleans up some of this clutter.

[1]:

from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_data(fields = ('family', 'keywords', 'transmembrane'))

executed in 28.47s, finished 03:24:10 2023-11-16

[1]:

{'family': {'A0A087X1C5': 'Cytochrome P450 family',
  'A0A0B4J2F2': 'Protein kinase superfamily, CAMK Ser/Thr protein kinase family, AMPK subfamily',
  'A0A0K2S4Q6': 'CD300 family',
  'A0A1B0GTW7': 'Peptidase M8 family',
  'A0AV02': 'SLC12A transporter family',
  'A0AV96': 'RRM RBM47 family',
  'A0AVF1': 'IFT56 family',
  'A0AVI4': 'TMEM129 family',
  'A0AVK6': 'E2F/DP family',
  'A0AVT1': 'Ubiquitin-activating E1 family',
  'A0FGR8': 'Extended synaptotagmin family',
  'A0FGR9': 'Extended synaptotagmin family',
  'A0JLT2': 'Mediator complex subunit 19 family',
  'A0JP26': 'POTE family',
  'A0MZ66': 'Shootin family',
  'A0PJK1': 'Sodium:solute symporter (SSF) (TC 2.A.21) family',
  'A0PJY2': 'Krueppel C2H2-type zinc-finger protein family',
  'A0PK00': 'TMEM120 family',
  'A0PK11': 'Clarin family',
  'A1A4Y4': 'TRAFAC class dynamin-like GTPase superfamily, IRG family',
  'A1A519': 'FAM170 family',
  'A1A5B4': 'Anoctamin family',
  'A1A5C7': 'Major facilitator (TC 2.A.1) superfamily, Orga

Output truncated: showing 1000 of 510530 characters
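
With fields at the top level, looking up everything about one protein requires visiting each field. A minimal sketch of pivoting such a dict of dicts so that UniProt IDs become the top level keys follows. The family values are taken from the output above; the keywords values are made up for illustration.

```python
from collections import defaultdict

# Toy data shaped like the output of uniprot_data: field -> UniProt ID -> value.
by_field = {
    'family': {
        'A0AV02': 'SLC12A transporter family',
        'A0AVK6': 'E2F/DP family',
    },
    # Hypothetical values, only to have a second field in the example.
    'keywords': {
        'A0AV02': 'Membrane',
        'A0AVK6': 'Nucleus',
    },
}

# Pivot to UniProt ID -> field -> value.
by_protein = defaultdict(dict)

for field, values in by_field.items():

    for uniprot_id, value in values.items():

        by_protein[uniprot_id][field] = value

by_protein = dict(by_protein)

print(by_protein['A0AVK6'])
```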

The inputs.uniprot.query_builder function builds queries for the API.

[2]:

from pypath.inputs import uniprot
uniprot.query_builder('kinase', organism_id = 9606)

executed in 0ms, finished 03:30:18 2023-11-16

[2]:

'kinase AND organism_id:9606'

[3]:

                              uniprot.query_builder(organism = [9606, 10090, 10116])

                            

executed in 0ms, finished 03:30:49 2023-11-16

[3]:

'(organism_id:9606 OR organism_id:10090 OR organism_id:10116)'

[4]:

                              uniprot.query_builder({'organism_id': 9606, 'reviewed': True})

                            

executed in 0ms, finished 03:31:22 2023-11-16

[4]:

'(organism_id:9606 AND reviewed:true)'

[5]:

                              uniprot.query_builder({'length': (500,), 'mass': (50000,), 'op': 'OR'})

                            

executed in 0ms, finished 03:31:41 2023-11-16

[5]:

'(length:[500 TO *] OR mass:[50000 TO *])'

[6]:

                              uniprot.query_builder(lit_author = ['Huang', 'Kovac', '_AND'])

                            

executed in 0ms, finished 03:32:21 2023-11-16

[6]:

'(lit_author:Huang AND lit_author:Kovac)'

[7]:

                              uniprot.query_builder({'organism_id': [9606, 10090], 'reviewed': True})

                            

executed in 0ms, finished 03:32:41 2023-11-16

[7]:

'((organism_id:9606 OR organism_id:10090) AND reviewed:true)'

[8]:

                              uniprot.query_builder({'length': (100, None), 'organism_id': 9606})

                            

executed in 0ms, finished 03:33:04 2023-11-16

[8]:

'(length:[100 TO *] AND organism_id:9606)'
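
To make the rules visible in the examples above explicit: lists become OR groups, tuples become ranges with an open upper end if the second value is missing or None, booleans are written in lowercase, and the fields are joined by an operator. The sketch below is a simplified re-implementation for illustration only. It is not pypath's code, it covers only the dict based examples, and field_query and simple_query are hypothetical names.

```python
def field_query(field, value):
    """Render the query for one field."""

    # bool must be tested first, as it is a subclass of int
    if isinstance(value, bool):

        return f'{field}:{str(value).lower()}'

    # tuples are ranges, the upper end is open if missing or None
    if isinstance(value, tuple):

        lower = value[0]
        upper = value[1] if len(value) > 1 and value[1] is not None else '*'

        return f'{field}:[{lower} TO {upper}]'

    # lists are alternatives, joined by OR within parentheses
    if isinstance(value, list):

        return '(' + ' OR '.join(field_query(field, v) for v in value) + ')'

    return f'{field}:{value}'


def simple_query(fields, op = 'AND'):
    """Join the queries of all fields by the operator."""

    parts = [field_query(field, value) for field, value in fields.items()]

    return '(' + f' {op} '.join(parts) + ')'


print(simple_query({'organism_id': [9606, 10090], 'reviewed': True}))
print(simple_query({'length': (500,), 'mass': (50000,)}, op = 'OR'))
```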

The query parameters can be passed the same way to uniprot_data and uniprot_query. For example, to query records in one proteome:

[10]:

from pypath.inputs import uniprot
uniprot.uniprot_query(proteome = 'UP000004102')[:10]

executed in 0ms, finished 03:36:16 2023-11-16

[10]:

['D1YM56',
 'D1YMJ2',
 'D1YN32',
 'D1YNB3',
 'D1YPZ1',
 'D1YR07',
 'D1YR15',
 'D1YR93',
 'D1YRB4',
 'D1YRB7']

All these functionalities are performed by the pypath.inputs.uniprot.UniprotQuery class.

Processed UniProt annotations§

For a few important fields we have dedicated processing functions that aim to make their format cleaner and more usable. Sometimes even these do an imperfect job, and certain fields are badly truncated or contain residual fragments of the stripped labels.

Note: All the data presented below is part of the OmniPath annotations database. The recommended way to access it is through the database manager.

[136]:

from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_taxonomy()

executed in 1ms, finished 16:40:33 2022-12-02

[136]:

{'P00521': {'Abelson murine leukemia virus'},
 'P03333': {'Abelson murine leukemia virus'},
 'H8ZM73': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'H8ZM71': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
 'Q9MV51': {'Abies firma', 'Momi fir'},
 'O81086': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24474': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O24475': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64404': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O64405': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q948Z0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D1': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7D0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'O22340': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q9M7C9': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
 'Q5K3V1': {'Abies homolepis', 'Nikko fir'},
 'P21715': {'Abrothrix jelskii', 'Akodon jelskii', "Jelski's altiplano mouse"},
 'P11140': {'Abru

Output truncated: showing 1000 of 56985 characters

[139]:

                              iuniprot.uniprot_ncbi_taxids_2()

                            

executed in 0ms, finished 16:42:33 2022-12-02

[139]:

{648330: Taxon(ncbi_id=648330, latin='Aedes albopictus densovirus (isolate Boublik/1994)', english='AalDNV', latin_synonym=None),
 10804: Taxon(ncbi_id=10804, latin='Adeno-associated virus 2', english='AAV-2', latin_synonym=None),
 648242: Taxon(ncbi_id=648242, latin='Adeno-associated virus 2 (isolate Srivastava/1982)', english='AAV-2', latin_synonym=None),
 118452: Taxon(ncbi_id=118452, latin='Abacion magnum', english='Millipede', latin_synonym=None),
 72259: Taxon(ncbi_id=72259, latin='Abaeis nicippe', english='Sleepy orange butterfly', latin_synonym='Eurema nicippe'),
 102642: Taxon(ncbi_id=102642, latin='Abax parallelepipedus', english='Ground beetle', latin_synonym=None),
 392897: Taxon(ncbi_id=392897, latin='Abalistes stellaris', english='Starry triggerfish', latin_synonym='Balistes stellaris'),
 75332: Taxon(ncbi_id=75332, latin='Abbottina rivularis', english='Chinese false gudgeon', latin_synonym='Gobio rivularis'),
 515833: Taxon(ncbi_id=515833, latin='Abdopus aculeatus', engl

Output truncated: showing 1000 of 118050 characters

[140]:

                              iuniprot.uniprot_locations()

                            

executed in 0ms, finished 16:42:50 2022-12-02

[140]:

{'Q96EC8': {UniprotLocation(location='Golgi apparatus membrane', features=('Multi-pass membrane protein',))},
 'Q6ZMS4': {UniprotLocation(location='Nucleus', features=None)},
 'Q8N8L2': {UniprotLocation(location='Nucleus', features=None)},
 'Q15916': {UniprotLocation(location='Nucleus', features=None)},
 'Q3MIS6': {UniprotLocation(location='Nucleus', features=None)},
 'Q6P280': {UniprotLocation(location='Nucleus', features=None)},
 'Q969W1': {UniprotLocation(location='Endoplasmic reticulum membrane', features=('Multi-pass membrane protein',))},
 'O14978': {UniprotLocation(location='Nucleus', features=None)},
 'Q66K41': {UniprotLocation(location='Nucleus', features=None)},
 'Q15937': {UniprotLocation(location='Nucleus', features=None)},
 'Q9P2J8': {UniprotLocation(location='Nucleus', features=None)},
 'Q8ND82': {UniprotLocation(location='Nucleus', features=None)},
 'Q9NP64': {UniprotLocation(location='Nucleolus', features=None),
  UniprotLocation(location='Nucleus', features=None)},
 'P

Output truncated: showing 1000 of 143466 characters

[141]:

                              iuniprot.uniprot_keywords()

                            

executed in 0ms, finished 16:43:06 2022-12-02

[141]:

{'P63120': {UniprotKeyword(keyword='Aspartyl protease'),
  UniprotKeyword(keyword='Autocatalytic cleavage'),
  UniprotKeyword(keyword='ERV'),
  UniprotKeyword(keyword='Hydrolase'),
  UniprotKeyword(keyword='Protease'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Ribosomal frameshifting'),
  UniprotKeyword(keyword='Transposable element')},
 'Q96EC8': {UniprotKeyword(keyword='Acetylation'),
  UniprotKeyword(keyword='Alternative splicing'),
  UniprotKeyword(keyword='Golgi apparatus'),
  UniprotKeyword(keyword='Membrane'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Transmembrane'),
  UniprotKeyword(keyword='Transmembrane helix')},
 'Q6ZMS4': {UniprotKeyword(keyword='Metal-binding'),
  UniprotKeyword(keyword='Nucleus'),
  UniprotKeyword(keyword='Phosphoprotein'),
  UniprotKeyword(keyword='Reference proteome'),
  UniprotKeyword(keyword='Repeat'),
  UniprotKeyword(keyword='Zinc'),
  Unipro

Output truncated: showing 1000 of 445111 characters

[142]:

                              iuniprot.uniprot_families()

                            

executed in 0ms, finished 16:43:22 2022-12-02

[142]:

{'P63120': {UniprotFamily(family='Peptidase A2', subfamily='HERV class-II K(HML-2)')},
 'Q96EC8': {UniprotFamily(family='YIP1', subfamily=None)},
 'Q6ZMS4': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8N8L2': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q3MIS6': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q86UK7': {UniprotFamily(family='ZNF598', subfamily=None)},
 'Q6P280': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q969W1': {UniprotFamily(family='DHHC palmitoyltransferase', subfamily=None)},
 'O14978': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q15937': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q9P2J8': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
 'Q8IUH4': {UniprotFamily(family='DHHC palmitoyltransferase',

Output truncated: showing 1000 of 77892 characters

[143]:

                              iuniprot.uniprot_tissues()

                            

executed in 1.12s, finished 16:43:55 2022-12-02

[143]:

{'Q15916': {UniprotTissue(tissue='Brain', level='high'),
  UniprotTissue(tissue='Wide', level='high')},
 'Q969W1': {UniprotTissue(tissue='Wide', level='undefined')},
 'O14978': {UniprotTissue(tissue='Brain', level='undefined'),
  UniprotTissue(tissue='Colon', level='undefined'),
  UniprotTissue(tissue='Heart', level='undefined'),
  UniprotTissue(tissue='Kidney', level='undefined'),
  UniprotTissue(tissue='Leukocyte', level='undefined'),
  UniprotTissue(tissue='Liver', level='undefined'),
  UniprotTissue(tissue='Lung', level='undefined'),
  UniprotTissue(tissue='Ovary', level='undefined'),
  UniprotTissue(tissue='Pancreas', level='undefined'),
  UniprotTissue(tissue='Placenta', level='undefined'),
  UniprotTissue(tissue='Prostate', level='undefined'),
  UniprotTissue(tissue='Skeletal muscle', level='undefined'),
  UniprotTissue(tissue='Small intestine', level='undefined'),
  UniprotTissue(tissue='Spleen', level='undefined'),
  UniprotTissue(tissue='Testis', level='undefined'),
  Uniprot

Output truncated: showing 1000 of 318790 characters

[144]:

                              iuniprot.uniprot_topology()

                            

executed in 0ms, finished 16:44:13 2022-12-02

[144]:

{'Q96EC8': {UniprotTopology(topology='Cytoplasmic', start=2, end=84),
  UniprotTopology(topology='Cytoplasmic', start=137, end=146),
  UniprotTopology(topology='Cytoplasmic', start=206, end=212),
  UniprotTopology(topology='Lumenal', start=106, end=115),
  UniprotTopology(topology='Lumenal', start=168, end=184),
  UniprotTopology(topology='Lumenal', start=234, end=236),
  UniprotTopology(topology='Transmembrane', start=85, end=105),
  UniprotTopology(topology='Transmembrane', start=116, end=136),
  UniprotTopology(topology='Transmembrane', start=147, end=167),
  UniprotTopology(topology='Transmembrane', start=185, end=205),
  UniprotTopology(topology='Transmembrane', start=213, end=233)},
 'Q969W1': {UniprotTopology(topology='Cytoplasmic', start=1, end=77),
  UniprotTopology(topology='Cytoplasmic', start=138, end=198),
  UniprotTopology(topology='Cytoplasmic', start=288, end=377),
  UniprotTopology(topology='Lumenal', start=99, end=116),
  UniprotTopology(topology='Lumenal', start=220,

Output truncated: showing 1000 of 544230 characters
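
The topology records are convenient for simple summaries, for example the number of residues in each kind of region. The sketch below uses the Q96EC8 records copied from the output above. The UniprotTopology tuple is re-declared only for illustration, and the coordinates are assumed to be inclusive, as is usual for UniProt features.

```python
from collections import Counter, namedtuple

# Stand-in for the record type shown above, re-declared for illustration only.
UniprotTopology = namedtuple('UniprotTopology', ['topology', 'start', 'end'])

q96ec8 = {
    UniprotTopology('Cytoplasmic', 2, 84),
    UniprotTopology('Cytoplasmic', 137, 146),
    UniprotTopology('Cytoplasmic', 206, 212),
    UniprotTopology('Lumenal', 106, 115),
    UniprotTopology('Lumenal', 168, 184),
    UniprotTopology('Lumenal', 234, 236),
    UniprotTopology('Transmembrane', 85, 105),
    UniprotTopology('Transmembrane', 116, 136),
    UniprotTopology('Transmembrane', 147, 167),
    UniprotTopology('Transmembrane', 185, 205),
    UniprotTopology('Transmembrane', 213, 233),
}

# Sum the lengths of the regions by topology; both ends are included.
residues = Counter()

for region in q96ec8:

    residues[region.topology] += region.end - region.start + 1

print(dict(residues))
```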

The UniProt utils module§

Datasheets§

The pypath.utils.uniprot module is an API around UniProt protein datasheets. It is not suitable for bulk retrieval: that would work, but it would take a really long time. Calling its bulk methods with more than a few dozen or a few hundred proteins might take minutes, as it downloads protein datasheets one by one. To retrieve the full datasheets of one or more proteins, use query:

[153]:

from pypath.utils import uniprot
uniprot.query('P00533', 'O75385', 'Q14457')

executed in 1ms, finished 17:57:18 2022-12-02

[153]:

[<UniProt datasheet P00533 (EGFR)>,
 <UniProt datasheet O75385 (ULK1)>,
 <UniProt datasheet Q14457 (BECN1)>]

[154]:

ulk1 = uniprot.query('O75385')
ulk1

executed in 0ms, finished 17:57:58 2022-12-02

[154]:

<UniProt datasheet O75385 (ULK1)>

Many attributes are available from the datasheet objects, just a few examples:

[156]:

                              ulk1.weight, ulk1.length, ulk1.subcellular_location, ulk1.sequence

                            

executed in 0ms, finished 17:59:18 2022-12-02

[156]:

(112631,
 1050,
 'Cytoplasm, cytosol. Preautophagosomal structure. Note=Under starvation conditions, is localized to puncate structures primarily representing the isolation membrane that sequesters a portion of the cytoplasm resulting in the formation of an autophagosome.',
 'MEPGRGGTETVGKFEFSRKDLIGHGAFAVVFKGRHREKHDLEVAVKCINKKNLAKSQTLLGKEIKILKELKHENIVALYDFQEMANSVYLVMEYCNGGDLADYLHAMRTLSEDTIRLFLQQIAGAMRLLHSKGIIHRDLKPQNILLSNPAGRRANPNSIRVKIADFGFARYLQSNMMAATLCGSPMYMAPEVIMSQHYDGKADLWSIGTIVYQCLTGKAPFQASSPQDLRLFYEKNKTLVPTIPRETSAPLRQLLLALLQRNHKDRMDFDEFFHHPFLDASPSVRKSPPVPVPSYPSSGSGSSSSSSSTSHLASPPSLGEMQQLQKTLASPADTAGFLHSSRDSGGSKDSSCDTDDFVMVPAQFPGDLVAEAPSAKPPPDSLMCSGSSLVASAGLESHGRTPSPSPPCSSSPSPSGRAGPFSSSRCGASVPIPVPTQVQNYQRIERNLQSPTQFQTPRSSAIRRSGSTSPLGFARASPSPPAHAEHGGVLARKMSLGGGRPYTPSPQVGTIPERPGWSGTPSPQGAEMRGGRSPRPGSSAPEHSPRTSGLGCRLHSAPNLSDLHVVRPKLPKPPTDPLGAVFSPPQASPPQPSHGLQSCRNLRGSPKLPDFLQRNPLPPILGSPTKAVPSFDFPKTPSSQNLLALLARQGVVMTPPRNRTLPDLSEVGPFHGQPLGPGLRPGEDPKGPFGRSFSTSRLTDLLLKAAFGTQAPDPGSTESLQEK

Output truncated: showing 1000 of 1329 characters

The collect function collects certain features for a set of proteins.

Warning: This is a really inefficient way of retrieving data from UniProt. If you work with more than a handful of proteins, go for pypath.inputs.uniprot.uniprot_data instead.

[158]:

                              uniprot.collect(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

                            

executed in 0ms, finished 18:02:29 2022-12-02

[158]:

OrderedDict([('ac', ['P00533', 'O75385', 'Q14457']),
             ('weight', [134277, 112631, 51896]),
             ('length', [1210, 1050, 450])])
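
The result is column oriented: one list per feature, all in the same order as the ac list. A minimal sketch of turning it into one record per protein, using the values from the output above:

```python
from collections import OrderedDict

# The output of the collect call above.
collected = OrderedDict([
    ('ac', ['P00533', 'O75385', 'Q14457']),
    ('weight', [134277, 112631, 51896]),
    ('length', [1210, 1050, 450]),
])

# Transpose the columns into rows, one dict per protein.
rows = [
    dict(zip(collected.keys(), values))
    for values in zip(*collected.values())
]

# The records can then be sorted or filtered by any feature.
heaviest = max(rows, key = lambda row: row['weight'])

print(rows[1], heaviest['ac'])
```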

Tables§

UniProt data can be printed to the console in a tabular format:

[159]:

                              uniprot.print_features(['P00533', 'O75385', 'Q14457'], 'weight', 'length')

                            

executed in 0ms, finished 18:07:18 2022-12-02

╒═══════╤════════╤══════════╤══════════╕
│   No. │ ac     │   weight │   length │
╞═══════╪════════╪══════════╪══════════╡
│     1 │ P00533 │   134277 │     1210 │
├───────┼────────┼──────────┼──────────┤
│     2 │ O75385 │   112631 │     1050 │
├───────┼────────┼──────────┼──────────┤
│     3 │ Q14457 │    51896 │      450 │
╘═══════╧════════╧══════════╧══════════╛

There is a shortcut to print the essential characteristics of proteins as such a table. The info function is really useful if you arrive at a set of proteins at some point of your analysis and want to quickly check what kind of proteins they are. To iterate through multiple groups of proteins, use utils.uniprot.browse. The columns and format of these tables can be customized by kwargs.

[160]:

                              uniprot.info(['P00533', 'O75385', 'Q14457'])

                            

executed in 0ms, finished 18:09:45 2022-12-02

=====> [3 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ P00533 │ EGFR         │     1210 │   134277 │ Epidermal   │ Receptor     │ 3D-        │ Cell         │
│       │        │              │          │          │ growth      │ tyrosine     │ structure, │ membrane;    │
│       │        │              │          │          │ factor      │ kinase       │ Alternativ │ Single-      │
│       │        │              │          │          │ receptor    │

Output truncated: showing 1000 of 20254 characters

Sanitizing UniProt IDs§

It is important to know that the ID translation module always does a number of checks when translating to UniProt IDs, unless the uniprot_cleanup parameter is disabled. It translates secondary IDs to primary, attempts to map TrEMBL IDs to SwissProts by gene symbols, and removes IDs of other organisms or of invalid format. To exploit this behaviour it’s enough to map from UniProt to UniProt:

[162]:

from pypath.utils import mapping
mapping.map_name('Q9UQ28', 'uniprot', 'uniprot')

executed in 0ms, finished 18:20:02 2022-12-02

[162]:

{'O75385'}

Enzyme-substrate interactions§

The database is an instance of the pypath.core.enz_sub.EnzymeSubstrateAggregator class. The database is built with the default or current configuration by the core.enz_sub.get_db function.

Warning: it is recommended to access databases through the manager. Running the code below takes a really long time and does not save or reload the database; it builds a fresh copy each time.

[25]:

from pypath.core import enz_sub
es = enz_sub.get_db()

executed in 8m 1.81s, finished 14:26:37 2022-12-02

Instead, let’s acquire the database from the manager:

[6]:

from pypath import omnipath
es = omnipath.db.get_db('enz_sub')

executed in 7.27s, finished 15:37:33 2022-12-03

The database itself is stored as a dictionary (EnzymeSubstrateAggregator.enz_sub) with pairs of proteins as keys and a list of special objects representing enzyme-substrate interactions as values. These can be accessed by pairs of labels, identifiers or Entity objects, e.g. mTOR phosphorylates AKT1:

[27]:

                          es[('MTOR', 'AKT1')]

                        

executed in 0ms, finished 14:40:55 2022-12-02

[27]:

[<MTOR => Residue AKT1-1:S473:phosphorylation [Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)]>,
 <MTOR => Residue AKT1-1:T450:phosphorylation [Evidences: HPRD, MIMP, PhosphoSite, ProtMapper, phosphoELM (0 references)]>,
 <MTOR => Residue AKT1-1:T308:phosphorylation [Evidences: ProtMapper, Sparser (1 references)]>]

Enzyme-substrate objects§

Let’s take a closer look at one of the enzyme-PTM relationships, represented by pypath.internals.intera.DomainMotif objects. Below some of the attributes are shown:

[28]:

e_ptm = es[('MTOR', 'AKT1')][0]
e_ptm.ptm.protein, e_ptm.ptm.protein.identifier, e_ptm.ptm.isoform, e_ptm.ptm.residue, e_ptm.ptm.residue.name, e_ptm.ptm.residue.number, e_ptm.ptm.typ, e_ptm.domain.protein

executed in 0ms, finished 14:40:57 2022-12-02

[28]:

(<Entity: AKT1>,
 'P31749',
 1,
 <Residue AKT1-1:S473>,
 'S',
 473,
 'phosphorylation',
 <Entity: MTOR>)

The resources and references are available in Evidences objects:

[29]:

                            e_ptm.evidences

                          

executed in 0ms, finished 14:41:00 2022-12-02

[29]:

<Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)>

[30]:

                            e_ptm.evidences.get_resource_names()

                          

executed in 0ms, finished 14:41:03 2022-12-02

[30]:

{'KEA', 'MIMP', 'PhosphoSite', 'ProtMapper', 'SIGNOR', 'dbPTM'}

[31]:

                            e_ptm.evidences.get_references()

                          

executed in 0ms, finished 14:41:04 2022-12-02

[31]:

{<Reference: 14761976>,
 <Reference: 15047712>,
 <Reference: 15364915>,
 <Reference: 15718470>,
 <Reference: 15899889>,
 <Reference: 16221682>,
 <Reference: 17013611>,
 <Reference: 19844585>,
 <Reference: 20333297>,
 <Reference: 20489726>,
 <Reference: 21157483>,
 <Reference: 21592956>,
 <Reference: 23006971>,
 <Reference: 8978681>,
 <Reference: 9736715>}

Enzyme-substrate data frame§

The database object is able to export its contents into a pandas.DataFrame:

[7]:

es.make_df()
es.df

executed in 1.03s, finished 15:37:39 2022-12-03

[7]:

	enzyme	enzyme_genesymbol	substrate	substrate_genesymbol	isoforms	residue_type	residue_offset	modification	sources	references	curation_effort
0	P31749	AKT1	P63104	YWHAZ	1	S	58	phosphorylation	HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite;PhosphoSit...	HPRD:11956222;KEA:11956222;KEA:12861023;KEA:16...	11
1	P31749	AKT1	P63104	YWHAZ	1	S	184	phosphorylation	HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite_MIMP;phosp...	HPRD:11956222;KEA:11956222;KEA:15071501	3
2	P45983	MAPK8	P63104	YWHAZ	1	S	184	phosphorylation	HPRD;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;Phosph...	HPRD:15696159;KEA:11956222;KEA:15071501;KEA:15...	9
3	P06493	CDK1	P11171	EPB41	1	S	712	phosphorylation	HPRD_MIMP;MIMP;PhosphoSite_MIMP;ProtMapper;REA...	ProtMapper:15525677;dbPTM:15525677;dbPTM:18220...	5
4	P06493	CDK1	P11171	EPB41	1;2;5;7	T	60	phosphorylation	MIMP;PhosphoSite;PhosphoSite_MIMP;ProtMapper;R...	ProtMapper:15525677;dbPTM:15525677;dbPTM:2171679	3
...	...	...	...	...	...	...	...	...	...	...	...
41421	P29597	TYK2	P51692	STAT5B	1	Y	699	phosphorylation	KEA	KEA:10830280;KEA:11751923;KEA:12411494	3
41422	Q06418	TYRO3	P19174	PLCG1	1;2	Y	771	phosphorylation	KEA	KEA:12601080;KEA:15144186;KEA:15592455;KEA:160...	8
41423	Q9H4A3	WNK1	Q8TAX0	OSR1	1	T	185	phosphorylation	KEA	KEA:18270262	1
41424	Q9H4A3	WNK1	Q96J92	WNK4	1;3	S	335	phosphorylation	KEA	KEA:15883153	1
41425	Q9NYL2	MAP3K20	Q92903	CDS1	1	T	68	phosphorylation	KEA	KEA:10973490	1

41426 rows × 11 columns
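
In the references column each item is a resource name and a literature identifier joined by a colon, and the items are separated by semicolons. A minimal sketch of splitting such a value by resource follows. It uses the untruncated value from the second row above; split_references is a hypothetical helper, not a pypath function.

```python
from collections import defaultdict


def split_references(references):
    """Split a `resource:reference;...` string into sets by resource."""

    by_resource = defaultdict(set)

    for item in references.split(';'):

        resource, _, reference = item.partition(':')
        by_resource[resource].add(reference)

    return dict(by_resource)


refs = split_references('HPRD:11956222;KEA:11956222;KEA:15071501')

# The same paper may be cited by more than one resource.
unique_references = set.union(*refs.values())

print(refs, len(unique_references))
```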

Protein sequences§

The APIs for sequences are very basic, because we’ve never really needed them; but the fundamentals are probably there to make a nice, powerful API. Still, I don’t believe pypath will ever be strong in sequences, it’s just not our main topic.

[186]:

from pypath.utils import homology
seqc = homology.SequenceContainer(preload_seq = [9606])
akt1 = seqc.get_seq('P31749')
akt1.get_region(start = 10, end = 19, isoform = 2)

                        

executed in 0ms, finished 19:40:09 2022-12-02

[186]:

(10, 19, 'TFIIRCLQWT')

[187]:

from pypath.utils import seq
human_proteome = seq.swissprot_seq()
human_proteome

                        

executed in 0ms, finished 19:44:52 2022-12-02

[187]:

{'P63120': <pypath.utils.seq.Seq at 0x689900d45cc0>,
 'Q96EC8': <pypath.utils.seq.Seq at 0x689908ea8f70>,
 'Q6ZMS4': <pypath.utils.seq.Seq at 0x689908eaa4a0>,
 'Q8N8L2': <pypath.utils.seq.Seq at 0x6899223538b0>,
 'Q15916': <pypath.utils.seq.Seq at 0x689922353c70>,
 'O60384': <pypath.utils.seq.Seq at 0x689922350730>,
 'Q3MIS6': <pypath.utils.seq.Seq at 0x689922353310>,
 'Q86UK7': <pypath.utils.seq.Seq at 0x689922353760>,
 'Q6P280': <pypath.utils.seq.Seq at 0x689922353190>,
 'Q969W1': <pypath.utils.seq.Seq at 0x689922350d90>,
 'O14978': <pypath.utils.seq.Seq at 0x689922353220>,
 'P61129': <pypath.utils.seq.Seq at 0x689922353370>,
 'Q66K41': <pypath.utils.seq.Seq at 0x6899223534f0>,
 'Q15937': <pypath.utils.seq.Seq at 0x689922350c70>,
 'Q9P2J8': <pypath.utils.seq.Seq at 0x689922351450>,
 'Q8ND82': <pypath.utils.seq.Seq at 0x689922353910>,
 'Q9NP64': <pypath.utils.seq.Seq at 0x6899223502b0>,
 'P98182': <pypath.utils.seq.Seq at 0x689922350280>,
 'Q8IUH4': <pypath.utils.seq.Seq at 0x68992235

Output truncated: showing 1000 of 53045 characters

[191]:

list(human_proteome['P00533'].findall('YGCT'))

                        

executed in 0ms, finished 19:48:41 2022-12-02

[191]:

[SeqLookup(isoform=1, offset=625)]
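To illustrate what lookups like the ones above do, here is a minimal standalone sketch; it is not pypath's implementation. The helpers `find_motif` and `get_region`, the toy sequences and the 1-based, end-inclusive coordinates are all assumptions of this example:

```python
from collections import namedtuple

# Hypothetical record type, mirroring the fields printed above.
SeqLookup = namedtuple('SeqLookup', ['isoform', 'offset'])


def find_motif(isoforms, motif):
    """Yield one record per occurrence of `motif` in any isoform.

    `isoforms` maps isoform numbers to sequences; offsets are
    1-based here, which is an assumption of this sketch.
    """

    for isoform, sequence in sorted(isoforms.items()):

        start = sequence.find(motif)

        while start != -1:

            yield SeqLookup(isoform = isoform, offset = start + 1)
            # continue one residue after the previous hit, so
            # overlapping occurrences are reported as well
            start = sequence.find(motif, start + 1)


def get_region(sequence, start, end):
    """Return the residues from `start` to `end`, both inclusive, 1-based."""

    return start, end, sequence[start - 1:end]


# toy sequences, not real protein isoforms
toy = {1: 'MKTAYGCTLLAYGCT', 2: 'MKTALL'}
hits = list(find_motif(toy, 'YGCT'))
region = get_region(toy[1], 5, 8)
```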

Annotations§

This database provides various annotations about the function, structure, localization and many other properties of proteins and genes. The database is an instance of the pypath.core.annot.AnnotationTable class. It is built with the default or current configuration by the pypath.core.annot.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a long time and neither saves nor reloads the database; it builds a fresh copy each time.

[38]:

from pypath.core import annot
an = annot.get_db()
an

                        

executed in 1ms, finished 15:07:08 2022-12-02

[38]:

<Annotation database: 3788067 records about 51636 entities from 78 resources>

Load a single annotation resource§

The annotations database is huge: on disk it takes up 1-2 GB of space and it consists of 60-70 resources. However, these resources are not integrated with each other; each can be loaded individually by its dedicated class in the core.annot module. This practice is recommended and will be better supported in the future. Let’s load one resource:

[8]:

from pypath.core import annot
cpad = annot.Cpad()
cpad

                          

executed in 48.26s, finished 15:38:57 2022-12-03

[8]:

<CPAD annotations: 2308 records about 1358 entities>

The resulting object is derived from the AnnotationBase class. Its data is stored under the annot attribute, in a dict where identifiers are the keys and sets of annotation records are the values. The field names of the records are shown by the get_names method:

[35]:

cpad.get_names()

                          

executed in 0ms, finished 15:06:45 2022-12-02

[35]:

('regulator_type',
 'effect_on_pathway',
 'pathway',
 'effect_on_cancer',
 'effect_on_cancer_outcome',
 'cancer',
 'pathway_category')

For each name we can list the possible values:

[36]:

cpad.get_values('cancer')

                          

executed in 0ms, finished 15:06:47 2022-12-02

[36]:

{'Acute lymphoblastic leukemia (ALL) (precursor T lymphoblastic leukemia)',
 'Acute myeloid leukemia (AML)',
 'Basal cell carcinoma',
 'Bladder cancer',
 'Breast cancer',
 'Cervical cancer',
 'Cholangiocarcinoma',
 'Choriocarcinoma',
 'Chronic lymphocytic leukemia (CLL)',
 'Chronic myeloid leukemia (CML)',
 'Colorectal cancer',
 'Endometrial cancer',
 'Esophageal cancer',
 "Ewing's sarcoma",
 'Gallbladder cancer',
 'Gastric cancer',
 'Glioma',
 'Hepatocellular carcinoma',
 'Hodgkin lymphoma',
 'Infantile hemangioma',
 'Laryngeal cancer',
 'Malignant melanoma',
 'Malignant pleural mesothelioma',
 'Mantle cell lymphoma',
 'Multiple myeloma',
 'Nasopharyngeal cancer',
 'Neuroblastoma',
 'Non-small cell lung cancer',
 'Oral cancer',
 'Osteosarcoma',
 'Ovarian cancer',
 'Pancreatic cancer',
 'Pituitary adenomas',
 'Prostate cancer',
 'Renal cell carcinoma',
 'Small cell lung cancer',
 'Squamous cell carcinoma',
 'Synovial sarcoma',
 'Testicular cancer',
 'Thyroid cancer'}

The select method filters the annotated molecules based on their annotations. For example, 78 complexes, miRNAs and proteins are annotated as inhibiting colorectal cancer:

[37]:

cpad.select(cancer = 'Colorectal cancer', effect_on_cancer = 'Inhibiting')

                          

executed in 0ms, finished 15:06:50 2022-12-02

[37]:

{'A6NDV4',
 Complex: COMPLEX:O14745,
 Complex: COMPLEX:O14862,
 Complex: COMPLEX:O15169_P25054,
 Complex: COMPLEX:O94813,
 Complex: COMPLEX:O94953,
 Complex: COMPLEX:P00533,
 Complex: COMPLEX:P06733,
 Complex Glucose transporter complex 1: COMPLEX:P11166,
 Complex: COMPLEX:P25054,
 Complex: COMPLEX:P40261,
 Complex: COMPLEX:P49327,
 Complex: COMPLEX:P54687,
 Complex PTEN phosphatase complex: COMPLEX:P60484,
 Complex: COMPLEX:Q01973,
 Complex: COMPLEX:Q12888,
 Complex: COMPLEX:Q13620,
 Complex: COMPLEX:Q96CX2,
 Complex: COMPLEX:Q99558,
 'MIMAT0000069',
 'MIMAT0000089',
 'MIMAT0000093',
 'MIMAT0000262',
 'MIMAT0000274',
 'MIMAT0000422',
 'MIMAT0000427',
 'MIMAT0000437',
 'MIMAT0000449',
 'MIMAT0000455',
 'MIMAT0000460',
 'MIMAT0000461',
 'MIMAT0000617',
 'MIMAT0003266',
 'MIMAT0003320',
 'O14745',
 'O14862',
 'O15169',
 'O75473',
 'O75888',
 'O76041',
 'O94813',
 'O94953',
 'P00533',
 'P06733',
 'P06756',
 'P11166',
 'P13631',
 'P22676',
 'P25054',
 'P25791',
 'P40261',
 'P49327',
 'P546

Output truncated: showing 1000 of 1279 characters
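The data layout described above, a dict of identifiers pointing to sets of records, can be sketched in plain Python. This is only an illustration under stated assumptions: the record type `CpadRecord`, the toy data and the three helper functions are hypothetical. In particular, requiring all criteria to match within a single record is a choice made for this sketch and may differ from pypath's behaviour:

```python
from collections import namedtuple

# Hypothetical record type with a subset of the CPAD field names.
CpadRecord = namedtuple('CpadRecord', ['cancer', 'effect_on_cancer'])

# identifier -> set of annotation records (toy data, not real CPAD content)
toy_annot = {
    'ID1': {CpadRecord('Colorectal cancer', 'Inhibiting')},
    'ID2': {
        CpadRecord('Colorectal cancer', 'Activating'),
        CpadRecord('Glioma', 'Inhibiting'),
    },
    'ID3': {CpadRecord('Glioma', 'Inhibiting')},
}


def get_names(record_type):
    """Field names of the records."""

    return record_type._fields


def get_values(annot, name):
    """All values of one field across all records."""

    return {getattr(rec, name) for recs in annot.values() for rec in recs}


def select(annot, **criteria):
    """Identifiers with at least one record matching all criteria."""

    return {
        identifier
        for identifier, records in annot.items()
        if any(
            all(getattr(rec, key) == val for key, val in criteria.items())
            for rec in records
        )
    }


names = get_names(CpadRecord)
cancers = get_values(toy_annot, 'cancer')
inhibitors = select(
    toy_annot,
    cancer = 'Colorectal cancer',
    effect_on_cancer = 'Inhibiting',
)
```

Note that `ID2` is not selected: it has a colorectal cancer record and an inhibiting record, but no single record carries both values.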

Load the full annotations database by the database manager§

Alternatively, the full annotations database can be accessed in the usual way:

[215]:

from pypath import omnipath
an = omnipath.db.get_db('annotations')
an

                          

[215]:

<Annotation database: 5490653 records about 50872 entities from 68 resources>

The AnnotationTable object contains the resource specific annotation objects under the annots attribute:

[40]:

an.annots

                          

executed in 0ms, finished 15:07:39 2022-12-02

[40]:

{'CellTypist': <CellTypist annotations: 927 records about 473 entities>,
 'Integrins': <Integrins annotations: 62 records about 62 entities>,
 'CellCellInteractions': <CellCellInteractions annotations: 5544 records about 4960 entities>,
 'PanglaoDB': <PanglaoDB annotations: 8479 records about 4813 entities>,
 'Lambert2018': <Lambert2018 annotations: 3281 records about 3277 entities>,
 'CancerSEA': <CancerSEA annotations: 2515 records about 1992 entities>,
 'Phobius': <Phobius annotations: 35382 records about 35382 entities>,
 'GO_Intercell': <GO_Intercell annotations: 48799 records about 18377 entities>,
 'MatrixDB': <MatrixDB annotations: 18127 records about 15903 entities>,
 'Surfaceome': <Surfaceome annotations: 3558 records about 3558 entities>,
 'Matrisome': <Matrisome annotations: 1514 records about 1514 entities>,
 'HPA_secretome': <HPA_secretome annotations: 3568 records about 3568 entities>,
 'HPMR': <HPMR annotations: 1748 records about 1695 entities>,
 'CPAD': <CPAD annotati

Output truncated: showing 1000 of 5842 characters

For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values, just like for CPAD above. As another example, let’s take a look at the Matrisome database:

[41]:

matrisome = an.annots['Matrisome']

                          

executed in 0ms, finished 15:07:45 2022-12-02

[42]:

matrisome.get_names()

                          

executed in 0ms, finished 15:07:49 2022-12-02

[42]:

('mainclass', 'subclass', 'subsubclass')

[43]:

matrisome.get_values('subclass')

                          

executed in 0ms, finished 15:07:53 2022-12-02

[43]:

{'Collagens',
 'ECM Glycoproteins',
 'ECM Regulators',
 'ECM-affiliated Proteins',
 'Proteoglycans',
 'Secreted Factors',
 'n/a'}

[44]:

matrisome.get_subset(subclass = 'Collagens')

                          

executed in 0ms, finished 15:07:56 2022-12-02

[44]:

{'A6NMZ7',
 'A8TX70',
 'B4DZ39',
 Complex Collagen type I homotrimer: COMPLEX:P02452,
 Complex HT_DM_Cluster278: COMPLEX:P02452_P02462_P08572_P29400_P53420_Q01955_Q02388_Q14031_Q17RW2_Q8NFW1,
 Complex Collagen type I trimer: COMPLEX:P02452_P08123,
 Complex Collagen type II trimer: COMPLEX:P02458,
 Complex Collagen type XI trimer variant 1: COMPLEX:P02458_P12107_P13942,
 Complex: COMPLEX:P02458_P20908_P25067,
 Complex: COMPLEX:P02458_P20908_P25067_P29400,
 Complex: COMPLEX:P02458_P25067_P29400,
 Complex Collagen type III trimer: COMPLEX:P02461,
 Complex: COMPLEX:P02462,
 Complex Collagen type IV trimer variant 1: COMPLEX:P02462_P08572,
 Complex Collagen type XI trimer variant 2: COMPLEX:P05997_P12107,
 Complex Collagen type XI trimer variant 3: COMPLEX:P05997_P12107_P20908,
 Complex Collagen type V trimer variant 1: COMPLEX:P05997_P20908,
 Complex Collagen type V trimer variant 2: COMPLEX:P05997_P20908_P25940,
 Complex: COMPLEX:P08572,
 Complex: COMPLEX:P12109_P12110,
 Complex Collagen

Output truncated: showing 1000 of 3072 characters

Load only selected annotations§

Another option is to load only certain annotation resources into an AnnotationTable object. We refer to the resources by class names. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:

[47]:

pathways = annot.AnnotationTable(
    protein_sources = (
        'SignalinkPathways',
        'KeggPathways',
        'NetpathPathways',
        'SignorPathways',
    ),
    complex_sources = (),
)
pathways

                          

executed in 12.07s, finished 15:09:48 2022-12-02

[47]:

<Annotation database: 28745 records about 6762 entities from 4 resources>

The AnnotationTable object provides methods to query all resources together, or build a boolean array out of them. To see all annotations of one protein:

[48]:

pathways.all_annotations('P00533')

                          

executed in 0ms, finished 15:10:17 2022-12-02

[48]:

[SignalinkPathway(pathway='Receptor tyrosine kinase'),
 SignalinkPathway(pathway='JAK/STAT'),
 KeggPathway(pathway='Proteoglycans in cancer'),
 KeggPathway(pathway='Regulation of actin cytoskeleton'),
 KeggPathway(pathway='Oxytocin signaling pathway'),
 KeggPathway(pathway='Phospholipase D signaling pathway'),
 KeggPathway(pathway='Pathways in cancer'),
 KeggPathway(pathway='Hepatocellular carcinoma'),
 KeggPathway(pathway='Colorectal cancer'),
 KeggPathway(pathway='Melanoma'),
 KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
 KeggPathway(pathway='Human papillomavirus infection'),
 KeggPathway(pathway='Pancreatic cancer'),
 KeggPathway(pathway='Non-small cell lung cancer'),
 KeggPathway(pathway='Central carbon metabolism in cancer'),
 KeggPathway(pathway='Endocytosis'),
 KeggPathway(pathway='Endometrial cancer'),
 KeggPathway(pathway='Choline metabolism in cancer'),
 KeggPathway(pathway='Bladder cancer'),
 KeggPathway(pathway='Parathyroid hormone synthesis, secretion

Output truncated: showing 1000 of 2540 characters
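The two ideas mentioned above, querying all resources together and building a boolean array, can be sketched with plain Python containers. This is a rough illustration, not the AnnotationTable implementation; the identifiers, the pathway names and the helper `boolean_table` are made up for the example:

```python
from collections import namedtuple

KeggPathway = namedtuple('KeggPathway', ['pathway'])
SignorPathway = namedtuple('SignorPathway', ['pathway'])

# resource name -> {identifier -> set of records}; toy data only
resources = {
    'KEGG': {
        'ID1': {KeggPathway('Pathway B')},
        'ID2': {KeggPathway('Pathway A')},
    },
    'SIGNOR': {
        'ID1': {SignorPathway('Pathway C')},
    },
}


def all_annotations(resources, identifier):
    """Records of one identifier, collected from every resource."""

    return [
        rec
        for name in sorted(resources)
        for rec in sorted(resources[name].get(identifier, ()))
    ]


def boolean_table(resources, identifiers):
    """One row per identifier, one column per (resource, record) pair."""

    columns = sorted({
        (name, rec)
        for name, annot in resources.items()
        for recs in annot.values()
        for rec in recs
    })
    rows = [
        [rec in resources[name].get(identifier, ()) for name, rec in columns]
        for identifier in identifiers
    ]

    return columns, rows


id1_annotations = all_annotations(resources, 'ID1')
columns, rows = boolean_table(resources, ['ID1', 'ID2'])
```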

Data frames of annotations§

Data from annotation objects can be exported to a pandas.DataFrame:

[9]:

cpad.make_df()
cpad.df

executed in 0ms, finished 15:40:14 2022-12-03

[9]:

	uniprot	genesymbol	entity_type	source	label	value	record_id
0	Q16181	SEPT7	protein	CPAD	regulator_type	protein	0
1	Q16181	SEPT7	protein	CPAD	effect_on_pathway	Upregulation	0
2	Q16181	SEPT7	protein	CPAD	pathway	Actin cytoskeleton pathway	0
3	Q16181	SEPT7	protein	CPAD	effect_on_cancer	Inhibiting	0
4	Q16181	SEPT7	protein	CPAD	effect_on_cancer_outcome	inhibit glioma cell migration	0
...	...	...	...	...	...	...	...
14396	COMPLEX:P30990	COMPLEX:NTS	complex	CPAD	cancer	Hepatocellular carcinoma	2306
14397	COMPLEX:P30990	COMPLEX:NTS	complex	CPAD	effect_on_pathway	Upregulation	2307
14398	COMPLEX:P30990	COMPLEX:NTS	complex	CPAD	pathway	ERK signaling pathway	2307
14399	COMPLEX:P30990	COMPLEX:NTS	complex	CPAD	effect_on_cancer	Activating	2307
14400	COMPLEX:P30990	COMPLEX:NTS	complex	CPAD	cancer	Gastric cancer	2307

14401 rows × 7 columns

The data frame has a long format. It can be converted to the more conventional wide format using standard pandas procedures (well, in tidyverse you would simply call tidyr::pivot_wider; in pandas you have to do an unintuitive sequence of 6 calls):

[10]:

index_cols = ['record_id', 'uniprot', 'genesymbol', 'label', 'entity_type']

(
    cpad.df.drop('source', axis=1).
    set_index(index_cols).
    unstack('label').
    droplevel(axis=1, level=0).
    reset_index().
    drop('record_id', axis=1)
)

                          

executed in 0ms, finished 15:40:19 2022-12-03

[10]:

label	uniprot	genesymbol	entity_type	cancer	effect_on_cancer	effect_on_cancer_outcome	effect_on_pathway	pathway	pathway_category	regulator_type
0	Q16181	SEPT7	protein	Glioma	Inhibiting	inhibit glioma cell migration	Upregulation	Actin cytoskeleton pathway	Regulation of actin cytoskeleton	protein
1	MIMAT0000431	hsa-miR-140	mirna	Squamous cell carcinoma	Inhibiting	suppress tumor cell migration and invasion	Upregulation	ADAM10 mediated Notch1 signaling pathway	Notch signaling pathway	mirna
2	MIMAT0005886	hsa-miR-1297	mirna	Prostate cancer	Inhibiting	inhibit proliferation and invasion	Upregulation	AEG1/Wnt signaling pathway	Wnt signaling pathway	mirna
3	Q9UP65	PLA2G4C	protein	Breast cancer	Inhibiting	inhibit EGF-induced chemotaxis	Downregulation	Akt signaling pathway	PI3K-Akt signaling pathway	protein
4	Q92600	CNOT9	protein	Breast cancer	Inhibiting	suppress cell proliferation	Downregulation	Akt signaling pathway	PI3K-Akt signaling pathway	protein
...	...	...	...	...	...	...	...	...	...	...
2303	COMPLEX:P16422	COMPLEX:EPCAM	complex	Prostate cancer	Inhibiting	NaN	Downregulation	PI3K-Akt-mTOR signaling pathway	NaN	NaN
2304	COMPLEX:Q9Y6Y0	COMPLEX:IVNS1ABP	complex	Prostate cancer	Inhibiting	NaN	Upregulation	Akt signaling pathway	NaN	NaN
2305	COMPLEX:Q96CX2	COMPLEX:KCTD12	complex	Colorectal cancer	Inhibiting	NaN	Upregulation	ERK signaling pathway	NaN	NaN
2306	COMPLEX:P30990	COMPLEX:NTS	complex	Hepatocellular carcinoma	Activating	NaN	Upregulation	Wnt/beta-catenin signaling pathway	NaN	NaN
2307	COMPLEX:P30990	COMPLEX:NTS	complex	Gastric cancer	Activating	NaN	Upregulation	ERK signaling pathway	NaN	NaN

2308 rows × 10 columns
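As a possible alternative for the long to wide conversion, pandas also offers DataFrame.pivot. The sketch below uses a toy table with the same column layout as the long data frame above; the values are made up, and a list-valued `index` argument requires pandas 1.1 or newer. It assumes that each combination of index columns and label occurs only once, otherwise pivot raises an error:

```python
import pandas as pd

# toy long-format table with the same columns as the data frame above
long_df = pd.DataFrame(
    [
        ('ID1', 'GENE1', 'protein', 'cancer', 'Glioma', 0),
        ('ID1', 'GENE1', 'protein', 'pathway', 'Pathway A', 0),
        ('ID2', 'GENE2', 'mirna', 'cancer', 'Breast cancer', 1),
    ],
    columns = [
        'uniprot', 'genesymbol', 'entity_type',
        'label', 'value', 'record_id',
    ],
)

wide_df = (
    long_df.
    pivot(
        # one row per record; `record_id` keeps the records apart
        index = ['record_id', 'uniprot', 'genesymbol', 'entity_type'],
        columns = 'label',
        values = 'value',
    ).
    reset_index().
    drop('record_id', axis = 1)
)
# remove the leftover name of the column index
wide_df.columns.name = None
```

Labels missing from a record, such as `pathway` for the second record here, become NaN in the wide table.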

Inter-cellular signaling roles§

pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway from Signor and SignaLink won’t be merged automatically. However, anyone can do this with the pypath.core.annot.CustomAnnotation class. For inter-cellular communication categories, the pypath.core.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidence. The database is an instance of the IntercellAnnotation class, and the build is executed by the pypath.core.intercell.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a long time and neither saves nor reloads the database; it builds a fresh copy each time.

[53]:

from pypath.core import intercell
ic = intercell.get_db() # this takes quite some time
                        # unless you load annotations from a pickle cache
ic

                        

executed in 0ms, finished 15:13:03 2022-12-02

[53]:

<Intercell annotations: 310033 records about 43617 entities>

[11]:

from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic

                        

executed in 2m 55.47s, finished 15:43:27 2022-12-03

[11]:

<Intercell annotations: 301527 records about 48570 entities>

This object stores its data under the classes attribute. Classes are defined in pypath.core.intercell_annot.annot_combined_classes. In addition, we manually revised the more generic classes and excluded some proteins from them; these are listed in pypath.core.intercell_annot.excludes. Each class has the following properties:

name: all lowercase, human understandable name, without repeating the parent class (e.g. WNT receptors will be simply wnt, and the parent class will be receptor)
parent: for a specific class the parent is the generic category it belongs to; for generic classes the name and parent are the same
resource: the resource the data comes from, or OmniPath for composite classes (combined from multiple resources)
scope: specific or generic; e.g. TGF ligand is specific, ligand is generic
aspect: locational (e.g. plasma membrane) or functional (e.g. transporter)

Read more about the design of the intercell database in our paper.
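The properties listed above can be pictured with a small sketch. It is a simplification under stated assumptions: the key type mirrors the `AnnotDefKey` tuples printed in this section, while the members, the helper functions and the rule of thumb that generic classes have identical name and parent are taken from the list above or made up for illustration:

```python
from collections import namedtuple

AnnotDefKey = namedtuple('AnnotDefKey', ['name', 'parent', 'resource'])

# key -> set of members; the member identifiers are made up
classes = {
    AnnotDefKey('gaba', 'receptor', 'HGNC'): {'R1', 'R2'},
    AnnotDefKey('receptor', 'receptor', 'OmniPath'): {'R1', 'R2', 'R3'},
    AnnotDefKey('ligand', 'ligand', 'OmniPath'): {'L1'},
}


def is_generic(key):
    """Rule of thumb from the list above: name equals parent."""

    return key.name == key.parent


def is_composite(key):
    """Composite classes carry `OmniPath` as their resource."""

    return key.resource == 'OmniPath'


def select(classes, name, parent = None, resource = None):
    """Return the members of the single class matching the arguments."""

    hits = [
        key
        for key in classes
        if key.name == name
        and (parent is None or key.parent == parent)
        and (resource is None or key.resource == resource)
    ]

    if len(hits) != 1:
        raise ValueError('Expected exactly one class, found %u' % len(hits))

    return classes[hits[0]]


gaba_members = select(classes, 'gaba', parent = 'receptor')
generic_keys = sorted(key.name for key in classes if is_generic(key))
```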

[55]:

ic.classes

                        

executed in 0ms, finished 15:16:54 2022-12-02

[55]:

{AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_location'): <AnnotationGroup `transmembrane` from UniProt_location, 5150 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_topology'): <AnnotationGroup `transmembrane` from UniProt_topology, 5760 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_keyword'): <AnnotationGroup `transmembrane` from UniProt_keyword, 7041 elements>,
 AnnotDefKey(name='transmembrane', parent='transmembrane_predicted', resource='Phobius'): <AnnotationGroup `transmembrane` from Phobius, 6444 elements>,
 AnnotDefKey(name='transmembrane_phobius', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_phobius` from Almen2009, 2072 elements>,
 AnnotDefKey(name='transmembrane_sosui', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_sosui` from Almen2009, 1663 elements>,
 AnnotDefKey(name='trans

Output truncated: showing 1000 of 143945 characters

An easy way to access the classes is the select method. The AnnotationGroup objects behave as plain Python sets, and besides that, they feature many further attributes and methods.

[56]:

gaba_receptors = ic.select('gaba', parent = 'receptor')
gaba_receptors

executed in 0ms, finished 15:17:00 2022-12-02

[56]:

<AnnotationGroup `gaba` from HGNC, 40 elements>

[57]:

gaba_receptors.members

                        

executed in 0ms, finished 15:17:02 2022-12-02

[57]:

{'A8MPY1',
 Complex GABA-A receptor (GABRA1, GABRB2, GABRD): COMPLEX:O14764_P14867_P47870,
 Complex GABA-A receptor, alpha-4/beta-3/delta: COMPLEX:O14764_P28472_P48169,
 Complex GABA-A receptor, alpha-6/beta-3/delta: COMPLEX:O14764_P28472_Q16445,
 Complex GABA-A receptor, alpha-4/beta-2/delta: COMPLEX:O14764_P47870_P48169,
 Complex GABA-A receptor, alpha-6/beta-2/delta: COMPLEX:O14764_P47870_Q16445,
 Complex GABBR1-GABBR2 complex: COMPLEX:O75899_Q9UBS5,
 Complex: COMPLEX:P14867,
 Complex GABA-A receptor, alpha-1/beta-3/gamma-2: COMPLEX:P14867_P18507_P28472,
 Complex GABA-A receptor (GABRA1, GABRB2, GABRG2): COMPLEX:P14867_P18507_P47870,
 Complex GABA-A receptor, alpha-5/beta-3/gamma-2: COMPLEX:P18507_P28472_P31644,
 Complex GABA-A receptor, alpha-3/beta-3/gamma-2: COMPLEX:P18507_P28472_P34903,
 Complex GABA-A receptor, alpha-2/beta-3/gamma-2: COMPLEX:P18507_P28472_P47869,
 Complex GABA-A receptor, alpha-6/beta-3/gamma-2: COMPLEX:P18507_P28472_Q16445,
 Complex: COMPLEX:P18507_Q8N1C3,
 C

Output truncated: showing 1000 of 1368 characters

Build an intercellular communication network§

The intercell database can be connected to a Network object to create an intercellular communication network:

[58]:

cu = omnipath.db.get_db('curated')
ic.register_network(cu)

executed in 0ms, finished 15:17:08 2022-12-02

Quantitative overview of intercell annotations§

A data frame with basic statistics is available:

[13]:

ic.counts_df()

                          

executed in 0ms, finished 15:45:17 2022-12-03

[13]:

	category	parent	database	scope	aspect	source	consensus_score	transmitter	receiver	secreted	plasma_membrane_transmembrane	plasma_membrane_peripheral	n_uniprot
0	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	6	False	False	False	True	False	5150
1	transmembrane	transmembrane	UniProt_topology	generic	locational	resource_specific	6	False	False	False	True	False	5760
2	transmembrane	transmembrane	UniProt_keyword	generic	locational	resource_specific	1	False	False	False	False	False	7041
3	transmembrane	transmembrane_predicted	Phobius	generic	locational	resource_specific	1	False	False	False	False	False	6444
4	transmembrane_phobius	transmembrane_predicted	Almen2009	generic	locational	resource_specific	0	False	False	False	True	False	2072
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1120	parin_adhesion_regulator	intracellular_intercellular_related	HGNC	specific	functional	resource_specific	0	True	False	False	False	False	5
1121	plakophilin_adhesion_regulator	intracellular_intercellular_related	HGNC	specific	functional	resource_specific	0	True	False	False	False	False	3
1122	actin_regulation_adhesome	intracellular_intercellular_related	Adhesome	specific	functional	resource_specific	0	True	False	False	False	False	22
1123	adhesion_cytoskeleton_adaptor	intracellular_intercellular_related	Adhesome	specific	functional	resource_specific	0	True	False	False	False	False	118
1124	intracellular_intercellular_related	intracellular_intercellular_related	OmniPath	generic	functional	composite	0	True	False	False	False	False	291

1125 rows × 13 columns

Intercell database as data frame§

Just like the other databases, the object can be exported into a pandas.DataFrame:

[14]:

ic.make_df()
ic.df[:10]

executed in 22.72s, finished 15:45:46 2022-12-03

[14]:

	category	parent	database	scope	aspect	source	uniprot	genesymbol	entity_type	consensus_score	transmitter	receiver	secreted	plasma_membrane_transmembrane	plasma_membrane_peripheral
0	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q96JP9	CDHR1	protein	6	False	False	False	True	False
1	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q9P126	CLEC1B	protein	8	False	False	False	True	False
2	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q13585	GPR50	protein	6	False	False	False	True	False
3	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q8N9I0	SYT2	protein	7	False	False	False	False	False
4	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	O43614	HCRTR2	protein	6	False	False	False	True	False
5	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	A6NJY1	SLC9B1P1	protein	4	False	False	False	False	False
6	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q5RI15	COX20	protein	5	False	False	False	False	False
7	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q13948	CUX1	protein	5	False	False	False	False	False
8	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q8NGK4	OR52K1	protein	6	False	False	False	False	False
9	transmembrane	transmembrane	UniProt_location	generic	locational	resource_specific	Q8IYS2	KIAA2013	protein	7	False	False	False	True	False

Browse intercell categories§

Use the select method to access intercell classes:

[72]:

ic.select(definition = 'neurotensin', parent = 'receptor')

                          

executed in 0ms, finished 15:27:15 2022-12-02

[72]:

<AnnotationGroup `neurotensin` from HGNC, 2 elements>

Proteins in each category can be listed with their descriptions from UniProt. Loading the UniProt datasheets for each protein is a slow process; we don’t recommend calling this method on more than a few dozen proteins.

[79]:

ic.show('neurotensin', parent = 'receptor')

                          

executed in 1ms, finished 15:35:58 2022-12-02

=====> [2 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name   │ function_o   │ keywords   │ subcellula   │
│       │        │              │          │          │             │ r_genecard   │            │ r_location   │
│       │        │              │          │          │             │ s            │            │              │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│     1 │ O95665 │ NTSR2        │      410 │    45385 │ Neurotensi  │ Receptor     │ Cell       │ Cell         │
│       │        │              │          │          │ n receptor  │ for the tr   │ membrane,  │ membrane;    │
│       │        │              │          │          │ type 2      │ idecapepti   │ Disulfide  │ Multi-pass   │
│       │        │              │          │          │             │

Output truncated: showing 1000 of 7598 characters

Gene Ontology§

pypath.utils.go is an almost standalone module for management of the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. terms and their relationships, the latter their assignment to gene products. Both provide many versatile methods for querying.

[80]:

from pypath.utils import go
goa = go.GOAnnotation()

executed in 1.26s, finished 15:36:46 2022-12-02

[81]:

goa.ontology # the GeneOntology object

                        

executed in 0ms, finished 15:36:48 2022-12-02

[81]:

<pypath.utils.go.GeneOntology at 0x689946b55570>

[82]:

goa # the GOAnnotation object

executed in 0ms, finished 15:36:50 2022-12-02

[82]:

<pypath.utils.go.GOAnnotation at 0x68991cdc9b40>

Among many others, the most versatile method is select, which selects the annotated gene products by expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.

[83]:

query = """(cell surface OR
        external side of plasma membrane OR
        extracellular region) AND
        (regulation of transmembrane transporter activity OR
        channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])

                        

executed in 0ms, finished 15:36:55 2022-12-02

['P21333', 'P80108', 'P62258', 'Q9NRX4', 'P54710', 'Q8NER1', 'P01303']
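How such an expression can be turned into set operations is sketched below. This is a minimal standalone evaluator, not the code behind GOAnnotation.select: the function names, the toy annotations and the explicit `universe` needed for NOT are assumptions, and unlike a real GO query it does not consider descendant terms. The operators must be upper case, so a lower case "and" inside a term name is not mistaken for one:

```python
import re


def tokenize(query):
    """Split into parentheses, operators and multi-word term names."""

    parts = re.split(r'(\(|\)|\bAND\b|\bOR\b|\bNOT\b)', query)
    # normalize whitespace and line breaks within term names
    return [' '.join(part.split()) for part in parts if part.strip()]


def evaluate(query, annotations, universe):
    """Evaluate a query; precedence is NOT, then AND, then OR."""

    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError('unexpected end of query')
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == 'OR':
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == 'AND':
            take()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == 'NOT':
            take()
            return universe - parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == '(':
            result = parse_or()
            if take() != ')':
                raise ValueError('missing closing parenthesis')
            return result
        return set(annotations.get(token, ()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError('unexpected token: %s' % tokens[pos])
    return result


# toy annotations: term name -> set of made-up gene product IDs
toy_go = {
    'cell surface': {'A', 'B'},
    'extracellular region': {'B', 'C'},
    'channel regulator activity': {'B', 'C', 'D'},
}
everything = {'A', 'B', 'C', 'D'}

both = evaluate(
    '(cell surface OR\n extracellular region) AND channel regulator activity',
    toy_go, everything,
)
surface_only = evaluate(
    'cell surface AND NOT extracellular region', toy_go, everything,
)
```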

[84]:

goa.ontology.get_all_descendants('GO:0005576')

                        

executed in 0ms, finished 15:36:56 2022-12-02

[84]:

{'GO:0001507',
 'GO:0001527',
 'GO:0003351',
 'GO:0003355',
 'GO:0005201',
 'GO:0005576',
 'GO:0005577',
 'GO:0005582',
 'GO:0005583',
 'GO:0005584',
 'GO:0005585',
 'GO:0005586',
 'GO:0005587',
 'GO:0005588',
 'GO:0005590',
 'GO:0005591',
 'GO:0005592',
 'GO:0005595',
 'GO:0005596',
 'GO:0005599',
 'GO:0005601',
 'GO:0005602',
 'GO:0005604',
 'GO:0005606',
 'GO:0005607',
 'GO:0005608',
 'GO:0005609',
 'GO:0005610',
 'GO:0005611',
 'GO:0005612',
 'GO:0005614',
 'GO:0005615',
 'GO:0005616',
 'GO:0006858',
 'GO:0006859',
 'GO:0006860',
 'GO:0009519',
 'GO:0010367',
 'GO:0016914',
 'GO:0016942',
 'GO:0020003',
 'GO:0020004',
 'GO:0020005',
 'GO:0020006',
 'GO:0030020',
 'GO:0030021',
 'GO:0030023',
 'GO:0030197',
 'GO:0030345',
 'GO:0030934',
 'GO:0030935',
 'GO:0030938',
 'GO:0031012',
 'GO:0031395',
 'GO:0032311',
 'GO:0032579',
 'GO:0033165',
 'GO:0033166',
 'GO:0034358',
 'GO:0034359',
 'GO:0034360',
 'GO:0034361',
 'GO:0034362',
 'GO:0034363',
 'GO:0034364',
 'GO:0034365',
 'GO:00343

Output truncated: showing 1000 of 3104 characters
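Collecting all descendants of a term is a graph traversal. The sketch below shows one way to do it on a toy ontology with made-up term IDs; it is not the GeneOntology implementation. As in the output above, where GO:0005576 itself is part of the result, the query term is included in the returned set:

```python
from collections import deque

# child -> parents; a toy ontology with made-up term IDs
parents = {
    'T:2': {'T:1'},
    'T:3': {'T:1'},
    'T:4': {'T:2', 'T:3'},
    'T:5': {'T:4'},
    'T:6': set(),
}


def invert(parents):
    """Turn the child -> parents mapping into parent -> children."""

    children = {}

    for child, its_parents in parents.items():
        for parent in its_parents:
            children.setdefault(parent, set()).add(child)

    return children


def get_all_descendants(term, children):
    """Breadth-first traversal; the term itself is part of the result."""

    seen = {term}
    queue = deque([term])

    while queue:

        current = queue.popleft()

        for child in children.get(current, ()):

            # a term with several parents is visited only once
            if child not in seen:
                seen.add(child)
                queue.append(child)

    return seen


children = invert(parents)
below_root = get_all_descendants('T:1', children)
```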

Protein complexes§

The pypath.core.complex module builds a non-redundant list of complexes from about 12 original resources. Complexes are unique considering their set of components, and optionally carry stoichiometry information. Homomultimers are also included, hence some complexes consist only of a single kind of protein. The database is an instance of the pypath.core.complex.ComplexAggregator class and is built by the pypath.core.complex.get_db function.

Warning: it is recommended to access databases through the database manager. Running the code below takes a long time and neither saves nor reloads the database; it builds a fresh copy each time.

[90]:

from pypath.core import complex
co = complex.get_db()
co.update_index()
co

                        

executed in 0ms, finished 15:39:31 2022-12-02

[90]:

<Complex database: 28173 complexes>

To retrieve all complexes containing a specific protein, here MTOR:

[91]:

co.proteins['P42345']

                        

executed in 0ms, finished 15:39:42 2022-12-02

[91]:

{Complex: COMPLEX:O00141_O15530_O75879_P23443_P34931_P42345_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_O15530_P07900_P23443_P31749_P31751_P42345_P78527_Q05513_Q05655_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P0CG47_P0CG48_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O15530_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q96J02_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_O75879_P0CG48_P23443_P34931_P42345_P62753_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P36894_P42345_P62942_P68106_Q15427_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P46781_P62753_Q6R327_Q8N122_Q96KQ7_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_P62942_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q15172_Q6R327_Q8IW41_Q9BPZ7_Q9BVC4_Q9H672,
 Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q6R327_Q70Z35_Q8N122_Q8TCU6_Q9BPZ7

Output truncated: showing 1000 of 5348 characters

Note that some of the complexes have human readable names; these are preferred for printing if available from any of the databases. Otherwise the complexes are labelled by COMPLEX:list-of-components.

Protein complex objects§

Take a closer look at one complex object. The hash of the object is equivalent to that of the string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.core.intera.Complex objects:

[97]:

cplex = co.complexes['COMPLEX:Q09472_Q92793']
cplex

executed in 0ms, finished 15:41:36 2022-12-02

[97]:

Complex CBP/p300: COMPLEX:Q09472_Q92793
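Why a string works as a key for such a dict can be shown with a simplified stand-in class. This is a sketch of the general Python mechanism (matching `__hash__` and `__eq__`), not the actual Complex class; `ToyComplex` and its methods are hypothetical:

```python
class ToyComplex:
    """Hypothetical, simplified stand-in for a complex object."""

    def __init__(self, components, name = None):

        # identifier -> stoichiometry
        self.components = dict(components)
        self.name = name

    def __str__(self):

        # unique, alphabetically sorted components
        return 'COMPLEX:%s' % '_'.join(sorted(self.components))

    def __repr__(self):

        label = ' %s' % self.name if self.name else ''

        return 'Complex%s: %s' % (label, self)

    def __hash__(self):

        # same hash as the plain string, so both find the same dict slot
        return hash(str(self))

    def __eq__(self, other):

        return str(self) == str(other)


cbp_p300 = ToyComplex({'Q92793': 1, 'Q09472': 1}, name = 'CBP/p300')
complexes = {cbp_p300: cbp_p300}

# lookup by string, although the key is a ToyComplex object
found = complexes['COMPLEX:Q09472_Q92793']
```

The dict first compares hashes and then falls back to `__eq__`, so the string and the object are treated as the same key.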

[98]:

cplex.components # stoichiometry

                          

executed in 0ms, finished 15:41:38 2022-12-02

[98]:

{'Q92793': 1, 'Q09472': 1}

[99]:

cplex.sources # resources

                          

executed in 0ms, finished 15:41:39 2022-12-02

[99]:

{'Signor'}

Protein complex data frame§

The database can be exported into a pandas.DataFrame:

[18]:

co.make_df()
co.df

executed in 3.40s, finished 15:47:16 2022-12-03

[18]:

	name	components	components_genesymbols	stoichiometry	sources	references	identifiers
0	NFY	P23511_P25208_Q13952	NFYA_NFYB_NFYC	1:1:1	CORUM;Compleat;PDB;Signor;ComplexPortal;hu.MAP...	15243141;14755292;9372932	Signor:SIGNOR-C1;CORUM:4478;Compleat:HC1449;in...
1	mTORC2	P68104_P85299_Q6R327_Q8TB45_Q9BVC4	DEPTOR_EEF1A1_MLST8_PRR5_RICTOR	0:0:0:0:0	Signor		Signor:SIGNOR-C2
2	mTORC1	P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4	AKT1S1_DEPTOR_MLST8_MTOR_RPTOR	0:0:0:0:0	Signor		Signor:SIGNOR-C3
3	SCF-betaTRCP	P63208_Q13616_Q9Y297	BTRC_CUL1_SKP1	1:1:1	CORUM;Compleat;Signor	9990852	Signor:SIGNOR-C5;CORUM:227;Compleat:HC757
4	CBP/p300	Q09472_Q92793	CREBBP_EP300	0:0	Signor		Signor:SIGNOR-C6
...	...	...	...	...	...	...	...
28168	Npnt complex 2	Q5SZK8_Q6UXI9_Q86XX4	FRAS1_FREM2_NPNT	0:0:0	CellChatDB
28169	NRP1_NRP2	O14786_O60462_Q9Y4D7	NRP1_NRP2_PLXND1	0:0:0	CellChatDB
28170	NRP2_PLXNA2	O60462_O75051	NRP2_PLXNA2	0:0	CellChatDB
28171	NRP2_PLXNA4	O60462_Q9HCM2	NRP2_PLXNA4	0:0	CellChatDB
28172	PTCH2_SMO	Q99835_Q9Y6C5	PTCH2_SMO	0:0	CellChatDB

28173 rows × 7 columns

Saving datasets as pickles§

The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than convenient, e.g. from a few minutes up to half an hour. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps. In fact, the database manager does this all the time, in a coordinated way – for this reason, the methods below should be used only with good reason, and relying on the database manager is preferred.

[ ]:

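# assumes `a` (an AnnotationTable) and `complexdb` (a ComplexAggregator)
# have already been built earlier in the session
from pypath.core import annot, complex

# for `pypath.core.annot.AnnotationTable` objects:
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.core.complex.ComplexAggregator` objects:
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')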
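
The save and load pattern above can be sketched with the standard `pickle` module. The `MiniDatabase` class and its contents are hypothetical and only mirror the method and argument names used above; the pypath classes may store and restore more than a single attribute.

```python
import os
import pickle
import tempfile


class MiniDatabase:
    """Hypothetical database object with pickle save and load."""

    def __init__(self, data=None, pickle_file=None):
        self.data = data or {}
        if pickle_file:
            # Skip the expensive build step, restore from the dump instead
            self.load_from_pickle(pickle_file)

    def save_to_pickle(self, pickle_file):
        with open(pickle_file, 'wb') as fp:
            pickle.dump(self.data, fp)

    def load_from_pickle(self, pickle_file):
        with open(pickle_file, 'rb') as fp:
            self.data = pickle.load(fp)


path = os.path.join(tempfile.mkdtemp(), 'mydb.pickle')
db = MiniDatabase(data={'COMPLEX:Q09472_Q92793': 'CBP/p300'})
db.save_to_pickle(path)

restored = MiniDatabase(pickle_file=path)
print(restored.data == db.data)
```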

                        

Log messages and sessions§

In pypath all modules send messages to a log file which is named, by default, after the session ID (a random string of 5 characters). The default path to the log file is ./pypath_log/pypath-xxxxx.log where xxxxx is the session ID.

Warning: The logger of pypath is really verbose and the log files can grow huge: several tens of thousands of lines, a few MBs. It is recommended to empty the pypath_log directories from time to time.
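
One possible way to do this cleanup from Python is sketched below. The helper `remove_old_logs` is hypothetical (it is not part of pypath); it only assumes the default `pypath_log/pypath-xxxxx.log` naming described above, and the 30 day limit is an arbitrary choice.

```python
import time
from pathlib import Path


def remove_old_logs(log_dir='pypath_log', max_age_days=30):
    """Delete pypath-*.log files older than `max_age_days`; return their names."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    cutoff = time.time() - max_age_days * 24 * 3600
    removed = []
    for logfile in sorted(log_dir.glob('pypath-*.log')):
        # Compare the last modification time to the cutoff
        if logfile.stat().st_mtime < cutoff:
            logfile.unlink()
            removed.append(logfile.name)
    return removed


# Usage (deletes files, so call it deliberately):
# remove_old_logs('pypath_log', max_age_days=30)
```

Other files in the directory are left alone, as only names matching `pypath-*.log` are considered.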

Basic info about the session§

The info function prints the most important information about the current session:

[100]:

                            import pypath
pypath.info()

executed in 0ms, finished 15:41:55 2022-12-02

[2022-12-02 16:41:55] [pypath]
        - session ID: `l0n17`
        - working directory: `/home/denes/pypath/notebooks`
        - logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log`
        - pypath version: 0.14.31

Another function prints a disclaimer about licenses. Until recently this message was printed upon every import. It is still important, but we removed it because it can be annoying in certain situations.

[101]:

                            pypath.disclaimer()

                          

executed in 0ms, finished 15:41:59 2022-12-02

        === d i s c l a i m e r ===

        All data accessed through this module,
        either as redistributed copy or downloaded using the
        programmatic interfaces included in the present module,
        are free to use at least for academic research or
        education purposes.
        Please be aware of the licenses of all the datasets
        you use in your analysis, and please give appropriate
        credits for the original sources when you publish your
        results. To find out more about data sources please
        look at `pypath/resources/data/resources.json` or
        https://omnipathdb.org/info and
        `pypath.resources.urls.urls`.

Read the log file§

Calling pypath.log() opens the log file in the default console application for paginating text files (on GNU systems typically less):

[ ]:

                            pypath.log()

                          

executed in 0ms, finished 15:42:08 2022-12-02

The logger and the log file are bound to the session (the 5 random characters are the session ID):

[104]:

                            pypath.session

                          

executed in 0ms, finished 15:42:27 2022-12-02

[104]:

<Session l0n17>

The logger:

[105]:

                            pypath.session.log

                          

executed in 0ms, finished 15:42:46 2022-12-02

[105]:

Logger [/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log]

The path to the log file:

[106]:

                            pypath.session.log.fname

                          

executed in 0ms, finished 15:42:49 2022-12-02

[106]:

'/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log'

Logging to the console§

Each log message has a numeric priority level (lower numbers mean more important messages), and messages with a level lower than a threshold are printed to the console. By default only important warnings are dispatched to the console. To log everything to the console, set the threshold to a large number:

[107]:

                            pypath.session.log.console_level = 10

from pypath.inputs import signor

si = signor.signor_interactions()
pypath.session.log.console_level = -1

executed in 0ms, finished 15:42:56 2022-12-02

[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https

Output truncated: showing 1000 of 1046 characters

Disable logging§

To avoid the creation of a log file (and the pypath_log directory), set the environment variable PYPATH_LOG or the builtins.PYPATH_LOG attribute before importing pypath:

[ ]:

                            # shell:
export PYPATH_LOG="/dev/null"
# then, start Python and use pypath

                          

[108]:

                            import os
import builtins
builtins.PYPATH_LOG=os.devnull
import pypath

                          

executed in 0ms, finished 15:43:10 2022-12-02

Write to the log§

Sending a single message§

First we change the console level so we can see the log messages. The label is optional. The priority of the message is given by the level. Notice that the second message won’t be printed to the console as its level is higher than 10:

[109]:

                              pypath.session.log.console_level = 10
pypath.session.log.msg('Greetings from the pypath tutorial notebook! :)', label = 'book')
pypath.session.log.msg('Not important, not shown on console but printed to the logfile.', level = 11)

                            

executed in 0ms, finished 15:43:13 2022-12-02

[2022-12-02 16:43:13] [book] Greetings from the pypath tutorial notebook! :)

Connect a module or class to the pypath logger§

The preferred way of connecting to the logger is to make a class inherit from the Logger class. Here the name will be the default label for all messages coming from the instances of this class:

[110]:

                              from pypath.share import session

class ChildOfLogger(session.Logger):

    def __init__(self):

        session.Logger.__init__(self, name = 'child')

    def say_something(self):

        self._log('Have a nice day! :D')

col = ChildOfLogger()
col.say_something()

executed in 0ms, finished 15:43:17 2022-12-02

[2022-12-02 16:43:17] [child] Have a nice day! :D

Alternatively, a logger can be created anywhere and used from any module or function:

[111]:

                              from pypath.share import session

_logger = session.Logger(name = 'mylogger')
_log = _logger._log

_log('Message from a stray logger')

executed in 0ms, finished 15:43:20 2022-12-02

[2022-12-02 16:43:20] [mylogger] Message from a stray logger

Finally we just set the console level to a lower value, to avoid flooding the rest of this book with log messages:

[112]:

pypath.session.log.console_level = -1

                            

executed in 0ms, finished 15:43:23 2022-12-02

BEL export§

Warning: This section hasn’t been thoroughly revised for a long time, some parts might be outdated or broken.

Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language for capturing relationships between various biological entities, spanning a wide range of levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:

[ ]:

                          from pypath.legacy import main
from pypath.resources import data_formats
from pypath.omnipath import bel

                        

[ ]:

                          pa = main.PyPath()
pa.init_network(data_formats.pathway)

You can provide one or more resources to the Bel class. Supported resources currently are pypath.main.PyPath and pypath.ptm.PtmAggregator.

[ ]:

                          b = bel.Bel(resource = pa)

                        

From the resources we compile a BELGraph object which provides a Python interface for various operations and you can also export the data in BEL format:

[ ]:

                          b.main()

                        

[ ]:

                          b.bel_graph

                        

[ ]:

                          b.bel_graph.summarize()

                        

[ ]:

                          b.export_relationships('omnipath_pathways.bel')

                        

[ ]:

                          with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()

[ ]:

                          print(bel_str[:333])

                        

CellPhoneDB export§

Warning: This section hasn’t been thoroughly revised for a long time, some parts might be outdated or broken.

CellPhoneDB is a statistical method and a database for inferring inter-cellular communication pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a resource for interaction, protein complex and annotation data. Apart from this, pypath is able to export its data in the appropriate format to provide input for the CellPhoneDB Python module. For this you can use the pypath.cellphonedb module:

[ ]:

                          from pypath.omnipath import cellphonedb
from pypath.share import settings

settings.setup(network_expand_complexes = False)

Here you can provide parameters for the network or provide an already built network. Also you can provide the datasets as pickles to make them load really fast. Otherwise this step will take quite long.

[ ]:

                          c = cellphonedb.CellPhoneDB()

                        

You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they are also exported to csv files. For example, interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):

[ ]:

                          c.interaction_dataframe[:10]

                        

The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data from the pypath.intercell module (identical to the http://omnipathdb.org/intercell query of the web service):

[ ]:

                          c.protein_dataframe[:10]

                        


The legacy igraph-based network object§

Warning: This section hasn’t been thoroughly revised for a long time, some parts might be outdated or broken.

Until about 2019 (before pypath version 0.9) pypath used an igraph.Graph object (igraph.org) as the central data structure around which everything else was organized. This legacy API is still present in pypath.legacy.main, however it is not maintained. This section of the book is still here, but will be removed soon, along with the legacy module.

[43]:

                          from pypath.legacy import main

                        

No module `cairo` available.
Some plotting functionalities won't be accessible.

[ ]:

                          pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h
                    # to run it for the first time due to the vast
                    # amount of data download.
                    # Once you populated the cache it still takes
                    # approx. 30 min to build the entire OmniPath
                    # as the process consists of quite some data
                    # processing. If you dump it in a pickle, you
                    # can load the network in < 1 min

                        

I just want a network quickly and play around with pypath§

You can find the predefined formats in the pypath.resources.network module. For example, to load one resource from there, let’s say SIGNOR:

[ ]:

                            from pypath.legacy import main
from pypath.resources import network as netres
pa = main.PyPath()
pa.load_resources({'signor': netres.pathway['signor']})

                          

Or to load all activity flow resources with literature references:

[ ]:

                            from pypath.legacy import main
from pypath.resources import network as netres

[ ]:

                            pa = main.PyPath()
pa.init_network(netres.pathway)

Or to load all activity flow resources, including the ones without literature references:

[ ]:

                            pa = main.PyPath()
pa.init_network(netres.pathway_all)

How do I build networks from any data with pypath?§

Here we show how to build a network from your own files. The advantage of building networks with pypath is that you don’t need to worry about merging redundant elements, nor about different formats and identifiers. Let’s say you have two files with network data:

network1.csv

entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation

network2.sif

EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B

Note: you need to create these files in order to load them.
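
A minimal sketch for creating the two files in the current working directory, with exactly the contents listed above:

```python
from pathlib import Path

network1_csv = """\
entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
"""

network2_sif = """\
EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
"""

# Write both example files next to the notebook or script
Path('network1.csv').write_text(network1_csv)
Path('network2.sif').write_text(network2_sif)
```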

Defining input formats§

[ ]:

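import pypath
# assumption: in recent pypath versions the input format definitions live in
# `pypath.internals.input_formats`, where the class may be called
# `NetworkInput` instead of `ReadSettings`
import pypath.internals.input_formats as input_formats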

input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)

input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)
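
To make the meaning of these settings more concrete, here is a sketch of a plain Python reader driven by the same kind of parameters. The function `read_edges` is hypothetical and is not how pypath reads files. It reflects our reading of the definitions above: `id_col_a` and `id_col_b` are column indices, and `sign` is a tuple of the column index, the value meaning stimulation and the value meaning inhibition. ID translation and merging of redundant elements, which pypath takes care of, are left out.

```python
import csv
import io


def read_edges(lines, separator, id_col_a, id_col_b, sign, header=False):
    """
    Yield (id_a, id_b, effect) tuples from delimited text lines.

    effect is 1 for stimulation, -1 for inhibition, 0 if neither matches.
    """
    sign_col, positive, negative = sign
    reader = csv.reader(lines, delimiter=separator)
    if header:
        next(reader, None)  # skip the column names
    for row in reader:
        if not row:
            continue  # ignore empty lines
        effect = {positive: 1, negative: -1}.get(row[sign_col], 0)
        yield row[id_col_a], row[id_col_b], effect


csv_text = 'entrezA,entrezB,effect\n1950,1956,inhibition\n5290,207,stimulation\n'
sif_text = 'EGF + EGFR\nAKT1 - GSK3B\n'

edges1 = list(read_edges(
    io.StringIO(csv_text), ',', 0, 1,
    sign=(2, 'stimulation', 'inhibition'), header=True,
))
edges2 = list(read_edges(io.StringIO(sif_text), ' ', 0, 2, sign=(1, '+', '-')))

print(edges1)
print(edges2)
```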

                            

Creating PyPath object and loading the 2 test files§

[ ]:

                              inputs = {
    'egf1': input1,
    'egf2': input2
}

pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)

                            

Structure of the legacy network object§

[ ]:

                            from pypath.legacy import main as legacy
pa = legacy.PyPath()

[ ]:

                            pa.graph

                          

Number of edges and nodes:

[ ]:

                            pa.ecount, pa.vcount

                          

You can access the edge and vertex sequences in the es and vs attributes; you can iterate over these or index them by integers. Edge and vertex attributes are accessed by string keys. E.g. get the sources of edge 81:

[ ]:

                            pa.graph.es[81]['sources']

                          

Directions and signs§

By default the igraph object is undirected but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can create it manually:

[ ]:

                              pa.get_directed()

                            

You find the directed network in the pa.dgraph attribute:

[ ]:

                              pa.dgraph

                            

Now let’s take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select an arbitrary edge:

[ ]:

                              edge = pa.graph.es[3241]

                            

The Direction object is in the dirs edge attribute:

[ ]:

                              d = edge['dirs']

                            

It has a method to print its content in a human-readable way:

[ ]:

                              print(pa.graph.es[3241]['dirs'])

                            

From this we see that the databases phosphoELM and Signor agree that protein P17252 has an effect on Q15139, and Signor in addition tells us this effect is stimulatory. In your scripts you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions either straight or reverse:

[ ]:

                              d.straight

                            

[ ]:

                              d.reverse

                            

It can tell you if one of these directions is supported by any of the network resources:

[ ]:

                              d.get_dir(d.straight)

                            

Or it can return those resources:

[ ]:

                              d.get_dir(d.straight, sources = True)

                            

The opposite direction is not supported by any resource:

[ ]:

                              d.get_dir(d.reverse, sources = True)

                            

The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.

[ ]:

                              d.get_sign(d.straight)

                            

Or you can ask whether it is inhibition:

[ ]:

                              d.is_inhibition(d.straight)

                            

Or if the interaction is directed at all:

[ ]:

                              d.is_directed()

                            

Sometimes resources don’t agree: for example, one says an interaction is an inhibition while according to others it is a stimulation; or one says A affects B and another resource says the other way around. Here we preserve all this potentially contradictory information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method to get a consensus direction and sign, which returns the attributes most resources agree on:

[ ]:

                              d.consensus_edges()
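
The idea of a consensus can be sketched as a simple majority vote. The function and data layout below are hypothetical and do not reproduce the logic or the return format of `consensus_edges`; the example data only encodes what is stated above about the edge between P17252 and Q15139. Ties are resolved arbitrarily in this sketch (first direction wins, stimulation wins over inhibition).

```python
def consensus(evidence):
    """
    Pick the direction and sign supported by the most resources.

    `evidence` maps (source, target) to a dict with the keys 'directed',
    'positive' and 'negative', each a set of resource names.
    Returns (source, target, sign); sign is None if no resource gives one.
    """
    # Direction with the largest number of supporting resources
    direction = max(evidence, key=lambda d: len(evidence[d]['directed']))
    n_positive = len(evidence[direction]['positive'])
    n_negative = len(evidence[direction]['negative'])
    if n_positive == 0 and n_negative == 0:
        sign = None
    else:
        sign = 'positive' if n_positive >= n_negative else 'negative'
    return direction + (sign,)


evidence = {
    ('P17252', 'Q15139'): {
        'directed': {'phosphoELM', 'Signor'},
        'positive': {'Signor'},
        'negative': set(),
    },
    ('Q15139', 'P17252'): {
        'directed': set(),
        'positive': set(),
        'negative': set(),
    },
}

print(consensus(evidence))
```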

                            

Accessing nodes in the network§

In igraph the vertices are numbered, but this numbering can change during certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.

[ ]:

                              pa.graph.vs['name'][:5]

                            

[ ]:

                              pa.graph.vs['label'][:5]

                            

The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:

[ ]:

                              type(pa.up('P00533'))

                            

Similarly genesymbol or gs for Gene Symbols:

[ ]:

                              type(pa.gs('ESR1'))

                            

Each of these has a “plural” version:

[ ]:

                              len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))

                            

And a generic method where you can mix UniProts and Gene Symbols:

[ ]:

                              len(list(pa.proteins(['MTOR', 'P00533'])))

                            

Querying relationships with or without causality§

Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which ones are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:

[ ]:

                            pa.gs_stimulated_by('PIK3CA')

                          

It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:

[ ]:

                            list(pa.gs_stimulated_by('PIK3CA').gs())[:5]

                          

[ ]:

                            list(pa.gs_stimulated_by('PIK3CA').up())[:5]

                          

Note that the names of these methods are a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:

[ ]:

                            list(pa.gs_stimulates('PIK3CA').gs())[:5]

                          

[ ]:

                            'PIK3CA' in set(pa.affected_by('AKT1').gs())

                          

There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to Gene Symbols, .up() to UniProt IDs and .ids() to vertex IDs; by default igraph.Vertex objects are yielded:

[ ]:

                            list(pa.neighbors('AKT1').ids())[:5]
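
The naming of these queries may be easier to follow on a toy example. The sketch below uses the signed edges of the network2.sif file from above and plain Python sets; the function names imitate the methods discussed here, but the implementation is hypothetical and independent of pypath.

```python
from collections import defaultdict

# (source, sign, target), copied from the network2.sif example
edges = [
    ('EGF', '+', 'EGFR'),
    ('EGFR', '+', 'PIK3CA'),
    ('EGFR', '+', 'SOS1'),
    ('PIK3CA', '+', 'RAC1'),
    ('RAC1', '+', 'MAP3K1'),
    ('SOS1', '+', 'HRAS'),
    ('HRAS', '+', 'MAP3K1'),
    ('PIK3CA', '+', 'AKT1'),
    ('AKT1', '-', 'GSK3B'),
]

targets = defaultdict(set)     # (source, sign) -> targets
regulators = defaultdict(set)  # (target, sign) -> sources

for source, sign, target in edges:
    targets[(source, sign)].add(target)
    regulators[(target, sign)].add(source)


def stimulates(name):
    """Nodes that `name` stimulates (its positive targets)."""
    return targets[(name, '+')]


def stimulated_by(name):
    """Nodes that stimulate `name` (its positive regulators)."""
    return regulators[(name, '+')]


def inhibited_by(name):
    """Nodes that inhibit `name` (its negative regulators)."""
    return regulators[(name, '-')]


def affected_by(name):
    """Regulators of `name`, regardless of sign."""
    return regulators[(name, '+')] | regulators[(name, '-')]


print(sorted(stimulates('PIK3CA')))
print(sorted(stimulated_by('PIK3CA')))
print(sorted(inhibited_by('GSK3B')))
print('PIK3CA' in affected_by('AKT1'))
```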

                          

Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):

[ ]:

                            print(list(pa.neighborhood('ATG3', 1).gs()))

                          

[ ]:

                            print(list(pa.neighborhood('ATG3', 2).gs()))

                          

[ ]:

                            len(list(pa.neighborhood('ATG3', 3).gs()))

                          

[ ]:

                            len(list(pa.neighborhood('ATG3', 4).gs()))
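
How the neighborhood grows with the number of steps can be illustrated by a breadth-first search on the small network2.sif example from above, ignoring directions and signs. This is a generic sketch, not pypath's or igraph's implementation; here the start node itself is excluded from the result, which is a choice made for this example.

```python
from collections import deque

# Undirected edges, copied from the network2.sif example
edges = [
    ('EGF', 'EGFR'), ('EGFR', 'PIK3CA'), ('EGFR', 'SOS1'),
    ('PIK3CA', 'RAC1'), ('RAC1', 'MAP3K1'), ('SOS1', 'HRAS'),
    ('HRAS', 'MAP3K1'), ('PIK3CA', 'AKT1'), ('AKT1', 'GSK3B'),
]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def neighborhood(start, steps):
    """Nodes reachable from `start` in at most `steps` steps, without `start`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == steps:
            continue  # do not expand beyond the step limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance) - {start}


for steps in range(1, 5):
    print(steps, len(neighborhood('EGF', steps)))
```

In this tiny network the whole graph is reached in 4 steps from EGF; in a network of realistic size the same happens within a few steps, with thousands of nodes.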

                          

Accessing edges by identifiers§

Just like nodes, edges can also be accessed by identifiers such as Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.

[ ]:

                            type(pa.get_edge('EGF', 'EGFR'))

                          

[ ]:

                            type(pa.get_edge('EGF', 'P00533'))

                          

[ ]:

                            type(pa.get_edge('EGF', 'AKT1'))

                          

[ ]:

                            print(pa.get_edge('EGF', 'EGFR')['dirs'])

                          

Literature references§

Select an edge; in its references attribute you find a list of references:

[ ]:

                            edge = pa.get_edge( 'MAP1LC3B', 'SQSTM1')
edge['references']

Each reference has a PubMed ID:

[ ]:

                            edge['references'][0].pmid

                          

[ ]:

                            edge['references'][0].open()

                          

These 3 references come from 3 different databases, but there must be 2 overlaps between them:

[ ]:

                            edge['refs_by_source']

                          

Plotting the network with igraph§

Here we use the network created above (because it is of a reasonable size, unlike the networks we could get from most of the network databases). Igraph has excellent plotting abilities built on top of the cairo library.

[ ]:

                            import igraph
plot = igraph.plot(pa.graph, target = 'egf_network.png',
            edge_width = 0.3, edge_color = '#777777',
            vertex_color = '#97BE73', vertex_frame_width = 0,
            vertex_size = 70.0, vertex_label_size = 15,
            vertex_label_color = '#FFFFFF',
            # due to a bug in either igraph or IPython,
            # vertex labels are not visible on inline plots:
            inline = False, margin = 120)
from IPython.display import Image
Image(filename='egf_network.png')

                          
