Introduction

This notebook shows how to locate transcription factors (TFs) in a pypath network, first through GO annotations and then through pypath's own TF annotation.

Analysis

In [1]:
# Show all the plots inside the notebook
%matplotlib inline
In [2]:
# load packages
import pypath
import igraph  # import igraph to use the plot function

import numpy as np
import pandas as pd
import seaborn as sns
In [3]:
pa = pypath.PyPath()

	=== d i s c l a i m e r ===

	All data coming with this module
	either as redistributed copy or downloaded using the
	programmatic interfaces included in the present module
	are available under public domain, are free to use at
	least for academic research or education purposes.
	Please be aware of the licences of all the datasets
	you use in your analysis, and please give appropriate
	credits for the original sources when you publish your
	results. To find out more about data sources please
	look at `pypath.descriptions` and
	`pypath.data_formats.urls`.

	» New session started,
	session ID: 'v46q0'
	logfile:'./log/v46q0.log'.
In [4]:
pa.init_network(pfile = 'cache/default_network.pickle')
	:: Network loaded from `cache/default_network.pickle`. 6710 nodes, 24833 edges.
	:: Loading 'genesymbol' to 'uniprot' mapping table

We will use GO annotations to locate TFs.

In [5]:
# load go annotations:
pa.load_go()
	:: Loading gene_association.goa_human.gz from cache, previously downloaded from ftp.ebi.ac.uk
	:: Loading GO annotations: finished, 100.0%
In [6]:
# get the GO annotation:
pa.go_dict()
	:: Loading GAnnotation from cache, previously downloaded from www.ebi.ac.uk
In [7]:
# get also the directed network
pa.get_directed()
#pa.ugraph = pa.graph
#pa.graph = pa.dgraph
	:: Setting directions: finished, 100.0%
In [8]:
# list names instead of IDs:
# (9606 is an NCBI taxonomy ID)
map(pa.go[9606].get_name, set(pa.gs('GATA1')['go']['C']))
Out[8]:
['nucleoplasm',
 'nucleus',
 'transcriptional repressor complex',
 'transcription factor complex']

Some GO terms that may be useful (C = cellular component, P = biological process):

(C) transcription factor complex
(C) transcriptional repressor complex
(P) cell surface receptor signaling pathway
( ) plasma membrane receptor complex
(C) plasma membrane
(C) cell surface

In [9]:
tf = pa.dgraph.vs.select(lambda vertex: pa.go[9606].get_term('transcription factor complex') in vertex['go']['C'])
tfr = pa.dgraph.vs.select(lambda vertex: pa.go[9606].get_term('transcriptional repressor complex') in vertex['go']['C'])
print('Number of nodes annotated as \'transcription factor complex\': {}'.format(len(tf)))
print('Number of nodes annotated as \'transcriptional repressor complex\': {}'.format(len(tfr)))
# Note: some nodes may be annotated with both GO terms
print('Number of nodes annotated with any of the two terms above: {}'.format(len(set(tf['label']+tfr['label']))))
Number of nodes annotated as 'transcription factor complex': 131
Number of nodes annotated as 'transcriptional repressor complex': 35
Number of nodes annotated with any of the two terms above: 160
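The three counts above are related by inclusion-exclusion, which gives the number of nodes carrying both annotations. The sketch below uses toy label sets (hypothetical data, not taken from the network) and assumes each node has a unique label.

```python
# Toy label sets standing in for tf['label'] and tfr['label'] (hypothetical data).
tf_labels = {"N1", "N2", "N3", "N4"}
tfr_labels = {"N1", "N4", "N5"}

either = tf_labels | tfr_labels  # annotated with at least one of the two terms
both = tf_labels & tfr_labels    # annotated with both terms

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert len(either) == len(tf_labels) + len(tfr_labels) - len(both)

# Applied to the counts printed above (131, 35 and 160):
overlap_in_notebook = 131 + 35 - 160
print(overlap_in_notebook)  # 6
```

Under the unique-label assumption, 6 nodes are annotated with both terms.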

We can also look for nodes annotated with several GO terms at once. For example, we can select the plasma membrane proteins that are also annotated as located at the cell surface.

In [10]:
filter_func = lambda vertex: pa.go[9606].get_term('cell surface') in vertex['go']['C'] and pa.go[9606].get_term('plasma membrane') in vertex['go']['C']
pm = pa.dgraph.vs.select(filter_func)
print('Number of nodes annotated with \'cell surface\' and \'plasma membrane\': {}'.format(len(pm['label'])))
Number of nodes annotated with 'cell surface' and 'plasma membrane': 211
map(pa.go[9606].get_name, set(pm[0]['go']['F']))
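Requiring several terms at once amounts to a subset test on each node's annotation set. The following is a minimal sketch on plain dictionaries; the helper `has_all_terms` and the node names are hypothetical and not part of pypath.

```python
def has_all_terms(annotations, required):
    """Return True if every term in `required` is present in `annotations`."""
    return set(required).issubset(annotations)

# Hypothetical nodes: label -> set of cellular-component term names.
nodes = {
    "P1": {"plasma membrane", "cell surface", "nucleus"},
    "P2": {"nucleus", "nucleoplasm"},
    "P3": {"plasma membrane"},
}

required = ["cell surface", "plasma membrane"]
selected = sorted(
    label for label, terms in nodes.items() if has_all_terms(terms, required)
)
print(selected)  # ['P1']
```

Passing the required terms as a list makes it easy to extend the filter to three or more terms without writing a longer `and` chain.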

Locate nodes with no inputs or no outputs. Also, check that there are no isolated nodes.

In [11]:
only_in = pa.dgraph.vs.select(lambda vertex: vertex.outdegree()==0)
only_out = pa.dgraph.vs.select(lambda vertex: vertex.indegree()==0)
isolated = pa.graph.vs.select(lambda vertex: vertex.degree()==0)
print('Number of nodes with no output arcs: {}'.format(len(only_in)))
print('Number of nodes with no input arcs: {}'.format(len(only_out)))
print('Number of nodes with no arcs: {}'.format(len(isolated)))
Number of nodes with no output arcs: 1725
Number of nodes with no input arcs: 854
Number of nodes with no arcs: 0
map(pa.go[9606].get_name, set([i for sublist in only_in for i in sublist['go']['C']]))
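To make explicit what the three selections compute, here is a small pure-Python sketch on a toy edge list (hypothetical data, no igraph required). As in the cell above, a node with no arcs at all also counts as having no output and no input.

```python
from collections import Counter

def degree_classes(vertices, edges):
    """Classify vertices of a directed graph given as (source, target) pairs."""
    outdeg = Counter(src for src, _ in edges)
    indeg = Counter(dst for _, dst in edges)
    no_output = sorted(v for v in vertices if outdeg[v] == 0)
    no_input = sorted(v for v in vertices if indeg[v] == 0)
    isolated = sorted(v for v in vertices if outdeg[v] == 0 and indeg[v] == 0)
    return no_output, no_input, isolated

# Toy graph: A -> B -> C, plus an unconnected vertex D.
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]

no_output, no_input, isolated = degree_classes(vertices, edges)
print(no_output)  # ['C', 'D']
print(no_input)   # ['A', 'D']
print(isolated)   # ['D']
```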

Paths between surface proteins and TFs

In [12]:
dnode_list = set()
rows = pm['label']
cols = tf['label'] + tfr['label']
ddistance = pd.DataFrame(np.nan, index=rows, columns=cols)
for igene1 in rows:
    for igene2 in cols:
        path = pa.dgraph.get_shortest_paths(pa.dgenesymbol(igene1)['name'], to=pa.dgenesymbol(igene2)['name'])[0]
        dnode_list.update(path)
        ddistance.loc[igene1, igene2] = len(path)-1 if len(path)>0 else np.nan
/usr/lib/python2.7/site-packages/ipykernel/__main__.py:7: RuntimeWarning: Couldn't reach some vertices at structural_properties.c:740
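The distance stored in `ddistance` is the number of edges on a shortest directed path, with NaN for unreachable pairs (the warning above is raised for those). The sketch below reproduces that convention with a plain breadth-first search on a toy adjacency dictionary; the function names and the graph are hypothetical and only illustrate the idea, they are not how igraph implements it.

```python
from collections import deque

def shortest_path(adj, source, target):
    """Breadth-first search on a directed graph.

    Returns the vertices of one shortest path, or [] if target is unreachable.
    """
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adj.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return []

def path_length(path):
    """Number of edges on the path; NaN when no path exists."""
    return len(path) - 1 if path else float("nan")

# Toy directed graph: X -> R -> K -> TF
adj = {"X": ["R"], "R": ["K"], "K": ["TF"], "TF": []}

forward = shortest_path(adj, "R", "TF")
backward = shortest_path(adj, "TF", "R")
print(forward, path_length(forward))    # ['R', 'K', 'TF'] 2
print(backward, path_length(backward))  # [] nan
```

Because the graph is directed, a path from a surface protein to a TF does not imply a path in the opposite direction.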
In [13]:
interconnection_dgraph = pa.dgraph.induced_subgraph(dnode_list)
In [14]:
# for directed graphs with many edges, plotting the network may be prohibitive
#igraph.plot(interconnection_dgraph, layout=interconnection_dgraph.layout_auto(), vertex_label=None)
In [15]:
sns.plt.hist(interconnection_dgraph.degree(), bins=100)
Out[15]:
(histogram return value, abbreviated: 100 bin counts; 101 bin edges from 1.0 to 139.0 in steps of 1.38; <a list of 100 Patch objects>. Counts are highest in the lowest-degree bins, 76, 51 and 87 in the first three, and the maximum degree is 139.)
In [16]:
# .values works on old and recent pandas; as_matrix() was removed in pandas 1.0
sns.plt.plot(ddistance.values.ravel(), '.')
Out[16]:
[<matplotlib.lines.Line2D at 0x1252f6bd0>]
In [17]:
# .values works on old and recent pandas; as_matrix() was removed in pandas 1.0
tmp = ddistance.values.ravel()
sns.plt.hist(tmp[~np.isnan(tmp)], bins=100)
Out[17]:
(histogram return value, condensed: 100 bins of width 0.12 between 0 and 12; only the bins containing an integer path length are non-empty)

Path length    Number of pairs
 0                   1
 1                  26
 2                 580
 3                3445
 4                5631
 5                3622
 6                1197
 7                 346
 8                 288
 9                 129
10                  19
11                   9
12                   2
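Path lengths are integers, so a histogram with 100 bins leaves most bins empty. Counting each distinct value directly is an alternative; the sketch below does so with `collections.Counter` on a toy list (hypothetical values), skipping NaN entries for unreachable pairs.

```python
import math
from collections import Counter

def distance_counts(distances):
    """Count how often each finite integer distance occurs, ignoring NaN."""
    finite = [int(d) for d in distances if not math.isnan(d)]
    return sorted(Counter(finite).items())

# Toy flattened distance matrix; NaN marks an unreachable pair.
toy = [2.0, 3.0, float("nan"), 3.0, 4.0, float("nan"), 3.0]

counts = distance_counts(toy)
print(counts)  # [(2, 1), (3, 3), (4, 1)]
```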

Using pypath to retrieve TFs

In [9]:
pa.set_transcription_factors()
pa_tf = pa.transcription_factors()
	:: Loading nrg2538-s3.txt from cache, previously downloaded from www.nature.com
	:: Loading HUMAN_9606_idmapping.dat.gz from cache, previously downloaded from ftp.uniprot.org
	:: Processing ID conversion list: finished, 100.0%
	:: Loading 'uniprot-sec' to 'uniprot-pri' mapping table
	:: Loading 'genesymbol' to 'trembl' mapping table
	:: Loading 'genesymbol' to 'swissprot' mapping table
	:: Loading 'genesymbol-syn' to 'swissprot' mapping table
	:: Loading 'hgnc' to 'uniprot' mapping table
In [10]:
pa_tf = pa.graph.vs.select(lambda vertex: vertex['tf'] is True)
len(pa_tf)
Out[10]:
530
In [11]:
with pypath.dataio.cache_off():
    pa.set_receptors()
	:: Downloading data from receptome.stanford.edu. Waiting for reply...                    Success.
In [12]:
pa_rec = pa.graph.vs.select(lambda vertex: vertex['rec'] is True)
len(pa_rec)
Out[12]:
0
In [13]:
pypath.dataio.get_hpmr()
	:: Loading findGenes.asp from cache, previously downloaded from receptome.stanford.edu
Out[13]:
[]
In [14]:
pypath.data_formats.urls['hpmr']['url']
Out[14]:
'http://receptome.stanford.edu/hpmr/SearchDB/findGenes.asp?textName=*'
In [15]:
html = pypath.dataio.curl(pypath.data_formats.urls['hpmr']['url'], silent = False)
	:: Loading findGenes.asp from cache, previously downloaded from receptome.stanford.edu
In [16]:
pa.graph.es[0]
Out[16]:
igraph.Edge(<igraph.Graph object at 0x1199af050>, 0, {'dirs': <pypath.pypath.Direction object at 0x11d8ac1d0>, 'signor_mechanism': [], 'ca1_type': [], 'macrophage_location': [], 'refs_by_source': {'SignaLink3': [<pypath.pypath.Reference object at 0x11d8ac210>], 'DIP': [<pypath.pypath.Reference object at 0x13257b890>], 'Guide2Pharma': [<pypath.pypath.Reference object at 0x126ff6890>]}, 'sources': ['SignaLink3', 'DIP', 'Guide2Pharma'], 'references': [<pypath.pypath.Reference object at 0x11d8ac210>], 'spike_effect': [], 'macrophage_type': [], 'trip_methods': [], 'netbiol_is_direct': ['unknown'], 'sources_by_type': {'PPI': ['SignaLink3', 'DIP', 'Guide2Pharma']}, 'hprd_mechanism': [], 'dd_methods': [], 'negative': None, 'ca1_effect': [], 'negative_refs': None, 'type': ['PPI'], 'psite_evidences': [], 'netpath_methods': [], 'dip_methods': [u'biochemical'], 'netbiol_is_directed': 'directed', 'netpath_type': [], 'domino_methods': [], 'netpath_pathways': [], 'refs_by_type': {'PPI': [<pypath.pypath.Reference object at 0x11d8ac210>]}, 'matrixdb_methods': [], 'spike_mechanism': [], 'is_direct': u'', 'netbiol_effect': ['unknown'], 'netbiol_mechanism': ['physical association'], 'is_directed': u'', 'dip_type': [u'physical interaction'], 'mppi_evidences': []})
In [17]:
pa.graph.es[0]['sources_by_type']
Out[17]:
{'PPI': ['SignaLink3', 'DIP', 'Guide2Pharma']}
In [18]:
pa.graph.es[0]['type']
Out[18]:
['PPI']
In [19]:
# `tf` and `tfr` are the GO-based selections; they must exist in the current
# session before they can be compared with `pa_tf`. This assumes that
# pa.load_go(), pa.go_dict() and pa.get_directed() have already been run here.
tf = pa.dgraph.vs.select(lambda vertex: pa.go[9606].get_term('transcription factor complex') in vertex['go']['C'])
tfr = pa.dgraph.vs.select(lambda vertex: pa.go[9606].get_term('transcriptional repressor complex') in vertex['go']['C'])
tmp1 = list(tf['label'] + tfr['label'])
In [20]:
tmp2 = list(pa_tf['label'])
In [21]:
# labels found both by the GO-based selection and by pypath's TF annotation
tmp3 = set(tmp2).intersection(tmp1)
len(tmp3)
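Comparing the GO-based selection with pypath's own TF annotation comes down to set operations on node labels. Besides the intersection, the two differences show what each method finds on its own. The sketch below uses toy labels (hypothetical data); `compare_labels` is an illustrative helper, not a pypath function.

```python
def compare_labels(labels_a, labels_b):
    """Return (shared, only_a, only_b) as sorted lists of labels."""
    a, b = set(labels_a), set(labels_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)

# Toy label lists standing in for the GO-based and the pypath-based selections.
go_based = ["N1", "N2", "N3", "N3"]  # duplicates are collapsed by set()
pypath_based = ["N2", "N3", "N4"]

shared, only_go, only_pypath = compare_labels(go_based, pypath_based)
print(shared)       # ['N2', 'N3']
print(only_go)      # ['N1']
print(only_pypath)  # ['N4']
```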