[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt

Emily Dimmer edimmer at
Fri Sep 12 03:36:07 PDT 2008

Having just spoken to Ensembl they do generally take annotations from 
MOD files on the GO Consortium site and then supplement these 
annotations with those that GOA provides. They also appear to take 
annotations for all evidence codes. However for the Ensembl Compara IEA 
method, which makes use of the 1:1 and apparent 1:1 orthology 
information, annotations are projected using the same kinds of criteria 
that we use to project annotations via ISS - i.e. only IDA, IMP, IEP, 
IGI and IPI annotations are transferred. Further information is located 
However! in the case of rat, it does appear that Ensembl have not been 
taking the RGD association file, only the GOA rat file. This is probably 
because Ensembl relies on UniProtKB to RGD id mappings, and currently 
UniProtKB does not have an entry for Tacc3. Therefore the only 
annotations that Ensembl is displaying are those generated from the 
Ensembl Compara projection method - so these annotations will have 
originated from the human or mouse orthologs. Please also note that 
there can be quite a long gap between GO cross-reference updates at 
Ensembl - they are not able to update on a monthly basis, so the 
annotation sets you are seeing could be a number of months old.
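
As a rough illustration of the projection criteria described above, here is a minimal Python sketch. It is not Ensembl's actual pipeline: the function name, the tuple layout, the gene identifiers and the GO IDs in the usage lines are all hypothetical, and the only detail taken from this message is the list of evidence codes that are transferred (IDA, IMP, IEP, IGI, IPI) across 1:1 orthologs, with the result recorded as IEA.

```python
# Evidence codes eligible for projection, as listed in the message above.
PROJECTABLE = {"IDA", "IMP", "IEP", "IGI", "IPI"}


def project_annotations(source_annotations, one_to_one_orthologs):
    """Project experimental annotations onto 1:1 orthologs.

    source_annotations: iterable of (gene_id, go_id, evidence_code) tuples.
    one_to_one_orthologs: dict mapping a source gene ID to one target gene ID.
    Returns a sorted list of (target_gene_id, go_id, "IEA", source_gene_id).
    """
    projected = set()
    for gene_id, go_id, evidence in source_annotations:
        if evidence not in PROJECTABLE:
            continue  # e.g. IEA, ISS, ND, IC are not transferred
        target = one_to_one_orthologs.get(gene_id)
        if target is None:
            continue  # no 1:1 ortholog known for this gene
        projected.add((target, go_id, "IEA", gene_id))
    return sorted(projected)


# Hypothetical identifiers, for illustration only.
source = [
    ("HUMAN_GENE_A", "GO:0000001", "IDA"),
    ("HUMAN_GENE_A", "GO:0000002", "ISS"),
    ("HUMAN_GENE_B", "GO:0000003", "IMP"),
]
orthologs = {"HUMAN_GENE_A": "RAT_GENE_A"}
print(project_annotations(source, orthologs))
```

Only the IDA annotation is projected here: the ISS annotation fails the evidence-code test, and HUMAN_GENE_B has no 1:1 ortholog in the (made-up) mapping.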

On the GOA front - we take all MOD annotations which map to UniProtKB 
accessions, and which have an evidence code other than IEA or ISS (so we 
do take ND and IC coded annotations). The ISS exclusion is a decision 
we are revisiting. Historically it was decided to exclude these to 
avoid any potential circular ISS annotations; however, I think that there 
are ISS annotation sets we should now be taking in and with which we 
shouldn't have any problems.
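
The selection rule in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GOA's real import code: the function name, the input layout and every identifier in the usage lines are hypothetical. What comes from this message is the rule itself: keep a MOD annotation only if its identifier maps to a UniProtKB accession and its evidence code is neither IEA nor ISS.

```python
# Evidence codes not imported from MOD files, per the message above.
EXCLUDED_EVIDENCE = {"IEA", "ISS"}


def select_mod_annotations(rows, uniprot_by_mod_id):
    """Keep MOD annotations that pass the rule described above.

    rows: iterable of (mod_id, go_id, evidence_code) tuples.
    uniprot_by_mod_id: dict mapping a MOD identifier to a UniProtKB accession.
    Returns a list of (uniprot_accession, go_id, evidence_code) tuples.
    """
    kept = []
    for mod_id, go_id, evidence in rows:
        accession = uniprot_by_mod_id.get(mod_id)
        if accession is None:
            continue  # no UniProtKB mapping, so the annotation is skipped
        if evidence in EXCLUDED_EVIDENCE:
            continue  # IEA and ISS are excluded; ND and IC pass through
        kept.append((accession, go_id, evidence))
    return kept


# Hypothetical identifiers, for illustration only.
rows = [
    ("MOD:1", "GO:0000001", "IDA"),
    ("MOD:1", "GO:0000002", "ISS"),
    ("MOD:1", "GO:0000003", "ND"),
    ("MOD:2", "GO:0000004", "IMP"),
]
mapping = {"MOD:1": "ACC_0001"}
print(select_mod_annotations(rows, mapping))
```

Under this rule a gene with no UniProtKB entry contributes nothing, however well curated its MOD annotations are.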

I do agree that Ensembl should be displaying additional information in 
their GO cross-references, (including references, sources etc). They are 
intending to revise their cross-references shortly, and will look 
into this further.


Valerie Wood wrote:
> All,
> Some other points maybe worth considering  here,
> 1.  Ensembl appear to derive their primary GO data from Uniprot; 
> Uniprot only include a subset of evidence codes which excludes some of 
> the curator assigned annotations from the MODs (including ND, ISS, 
> IC). Wouldn't it be preferable for Ensembl to use the MOD derived 
> curated data removing the need to create many of the IEA mappings?
> 2. Could UniProt import all of the curated data for the MODs, rather 
> than just a subset, especially for the reference genomes?
> 3. The Ensembl  entry has IEA to DNA binding but Tacc3 does not appear 
> to have DNA binding domains. What is the source of the Ensembl IEA 
> data for Tacc3 (it isn't recorded, the source of this would be useful)?
> Val
> Mike Cherry wrote:
>> Gabriel,
>> I wouldn't say this is a bug.  The 1302948 ID is used by RGD when the 
>> annotations have been created by the RGD project.  Those annotations 
>> that have the ENSEMBL ID ENSRNOP00000034933 have been created by 
>> ENSEMBL.  RGD is just passing the ENSEMBL annotations through in 
>> their file.
>> The gene association file is created by RGD.  While some groups do 
>> map all the external IDs to internal IDs this is not done by all.
>> One suggestion for your example is to filter out the IEA 
>> annotations.  That would remove the ENSEMBL associations for this 
>> example.  You would likely want to do that anyway, or at least 
>> compare your statistics with and without the computationally defined 
>> annotations.
>> -Mike
>> On Sep 9, 2008, at 10:22 AM, Gabriel Berriz wrote:
>>> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
>>>> Gabriel,
>>>> The gene association files are non-redundant.  Primary model organisms
>>>> have responsibility for integrating annotations from multiple sources
>>>> and submitting a non-redundant file to the GOdb.  QC checks on the 
>>>> files
>>>> also remove redundancies.
>>> Hi, Judy.  My word choice was not a very good one when I wrote of 
>>> "redundancies", so let me give an example of what I meant.  It comes 
>>> from the latest gene_association.rgd.gz file.  (This example is the 
>>> first one I followed up on of the 1000 or so that I mentioned in my 
>>> previous email.)
>>> The latest gene_association.rgd.gz file contains 15 associations for 
>>> RGD ID 1302948, and 4 associations for ENSEMBL ID 
>>> ENSRNOP00000034933.  In fact, according to both Ensembl and RGD 
>>> ( these two 
>>> identifiers both refer to the same entity (transforming acidic 
>>> coiled-coil containing protein 3, aka Tacc3).  Hence, the file uses 
>>> two names for the same thing.  Why?
>>> The reason why I bring this problem up is that, in our work, we 
>>> compute statistics that are very sensitive to how many genes have a 
>>> particular GO attribute, therefore it is crucial for us to count the 
>>> associations in this example as being 19 belonging to the same 
>>> protein, rather than 15 belonging to one and 4 belonging to 
>>> another.  This accounting task is made significantly more difficult 
>>> by the fact that the association file uses two different names for 
>>> the same thing.
>>> Maybe I'm wrong here, but this looks to me like a bug rather than a 
>>> feature:  I can't see that any good could come of using multiple 
>>> names for the same thing in a document like this.
>>> If it is indeed a bug, would it be too difficult to fix?  I.e. would 
>>> it be too difficult for GO and the purveyors of associations files 
>>> to use a consistent nomenclature whenever possible?
>>> If it's of any help with this, we have a tool, called Synergizer, 
>>> for bulk mapping of identifiers from one namespace to another, and 
>>> it is a simple matter to set up a pipeline to do it automatically 
>>> (see  We'd be happy to 
>>> help with this in any way we can.  (Although I imagine that the 
>>> organizations that generate such associations files are the ultimate 
>>> experts for resolving such nomenclature issues.)
>>> Also, as I said earlier, the example above is not isolated.  For R. 
>>> norvegicus alone there are about 1000, and that's only focusing on 
>>> RGD vs. ENSEMBL IDs.  And the problem is not limited to R. 
>>> norvegicus.  Among the organisms that I have analyzed, I found 
>>> similar nomenclature inconsistencies with several others, including 
>>> B. taurus, G. gallus, C. elegans, and H. sapiens.
>>> Thanks for your comments!
>>> Gabriel Berriz
>>> =============================================================
>>> Gabriel F. Berriz, PhD
>>> Bioinformatics Developer
>>> Roth Lab
>>> Biological Chemistry and Molecular Pharmacology -- Harvard Medical 
>>> School
>>> Seeley G. Mudd Building 322B
>>> Boston, MA 02115-5701
>>> Telephone: 617.432.3555
>>> Fax: 617.432.3557
>>> _______________________________________________
>>> Gofriends mailing list
>>> Gofriends at
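
For readers facing the counting problem raised in the quoted thread, the following Python sketch shows one way to merge identifiers before counting, and to compare counts with and without IEA annotations as Mike suggests. It assumes the tab-separated gene association layout (object ID in column 2, evidence code in column 7, comment lines starting with "!"). The function names are hypothetical, and the input lines are synthetic stand-ins: only the two identifiers and the 15 and 4 association counts come from the thread, while the GO IDs, the reference field and the IDA code on the RGD lines are placeholders.

```python
from collections import Counter


def count_associations(gaf_lines, canonical_id, skip_evidence=frozenset()):
    """Count associations per entity after mapping IDs to a canonical form.

    gaf_lines: iterable of tab-separated gene association lines.
    canonical_id: dict mapping an alternative ID to the preferred ID.
    skip_evidence: evidence codes to leave out of the count.
    """
    counts = Counter()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        object_id, evidence = cols[1], cols[6]
        if evidence in skip_evidence:
            continue
        counts[canonical_id.get(object_id, object_id)] += 1
    return counts


def fake_line(db, object_id, n, evidence):
    """Build a synthetic association line (placeholder GO ID and reference)."""
    return "\t".join(
        [db, object_id, "Tacc3", "", f"GO:{n:07d}", "REF:placeholder", evidence, "", "F"]
    )


# Synthetic stand-in for the Tacc3 example: 15 RGD lines and 4 ENSEMBL lines.
lines = [fake_line("RGD", "1302948", n, "IDA") for n in range(1, 16)]
lines += [fake_line("ENSEMBL", "ENSRNOP00000034933", n, "IEA") for n in range(16, 20)]

same_entity = {"ENSRNOP00000034933": "1302948"}
print(count_associations(lines, {}))                    # two separate entries
print(count_associations(lines, same_entity))           # one merged entry
print(count_associations(lines, same_entity, {"IEA"}))  # merged, IEA left out
```

The mapping dictionary is where a bulk identifier-mapping service would plug in; here it is filled by hand for the single pair discussed in the thread.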


Do you need any additional GO annotation resources?
Which proteins would you like annotated with GO?

Let us know in the GOA User Survey, available at:


    Emily Dimmer Ph.D.
    GOA Coordinator
    Wellcome Trust Genome Campus
    Cambridge CB10 1SD, U.K.
    Tel:     +44 1223 494654
    Fax:    +44 1223 494468
    email:  edimmer at
