[Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]

Petri, Victoria vpetri at
Tue Sep 9 13:27:31 PDT 2008



Hi Gabriel,


The gene association files are non-redundant. 


The RGD GO annotations come from two sources: manual annotation of genes
and annotations that are brought in electronically from MGI and GOA via
QC-based pipelines.


Data from GOA is appended at the end of the gene association file 'as
is' in two cases: when no match is found in RGD, or when a match is
found but the annotation is already in the database for that gene. It
is important to keep in mind that GOA annotates proteins rather than
genes (which we and other MODs do). If multiple protein transcripts
get the same annotation - which is not a redundancy - one could/would be
loaded into the database and the others would be appended at the end of
the file.

As Mike has already suggested, I would filter out IEAs which would 1)
remove the Ensembl IDs in question and 2) keep in annotations that have
been experimentally determined either for rat or for an orthologous
gene. If possible I would also compare protein IDs associated with one
gene versus Ensembl IDs at the end of the gene association file because
of the one-to-many gene-to-protein relationship.
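
The IEA filter suggested above could be sketched roughly as follows. This is only an illustration, assuming a tab-separated gene association file in which the evidence code is the 7th column and comment lines start with '!'. The function name drop_iea and the sample rows (including their GO IDs and references) are made up for the example and are not taken from the actual RGD file.

```python
# Minimal sketch: drop electronically inferred (IEA) annotations from
# gene association lines. Assumes tab-separated columns with the
# evidence code in column 7 (0-based index 6).
EVIDENCE_COL = 6


def drop_iea(lines):
    """Yield every line except annotation rows whose evidence code is IEA."""
    for line in lines:
        if line.startswith("!"):
            # Comment/header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COL and fields[EVIDENCE_COL] == "IEA":
            continue
        yield line


# Toy input; GO IDs and references are placeholders, not real annotations.
SAMPLE = [
    "! toy example header\n",
    "RGD\t1302948\tTacc3\t\tGO:0000001\tRGD:0000000\tISS\t\tP\n",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000001\tGOA:0000000\tIEA\t\tP\n",
]

KEPT = list(drop_iea(SAMPLE))
```

With the toy input, only the header and the RGD row remain, so the Ensembl-keyed row disappears along with its IEA annotation.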




Victoria Petri, Ph.D.

Research Scientist

Rat Genome Database 


Bioinformatics Program

Human and Molecular Genetics Center

Medical College of Wisconsin 

8701 Watertown Plank Road, Milwaukee, WI 53226

(414) 456-8871

Fax (414) 456-6595 

vpetri at




-----Original Message-----
From: Judith Blake [mailto:jblake at] 
Sent: Tuesday, September 09, 2008 1:14 PM
To: Shimoyama, Mary
Cc: Petri, Victoria
Subject: [Fwd: Re: [Gofriends] Redundancy in


Hi Mary,


Can you respond here?  Is this a curation issue for these organisms?  

Is mouse not on this list because of the substantial resources we can 

bring to this project?




-------- Original Message --------

Subject:    Re: [Gofriends] Redundancy in

Date:       Tue, 9 Sep 2008 13:22:49 -0400

From:       Gabriel Berriz <gberriz at>

To:   Judith Blake <jblake at>

CC:   <gofriends at>

References:       <31552965-46E2-46A9-9C76-92C7EE3D179F at>

<48C5A292.9030005 at>




On 2008.09.08 Mon, at 18:09, Judith Blake wrote:

> Gabriel,


> The gene association files are non-redundant.  Primary model organisms

> have responsibility for integrating annotations from multiple sources

> and submitting a non-redundant file to the GOdb.  QC checks on the

> also remove redundancies.



Hi, Judy.  My word choice was not a very good one when I wrote of 

"redundancies", so let me give an example of what I meant.  It comes 

from the latest gene_association.rgd.gz file.  (This example is the 

first one I followed up on of the 1000 or so that I mentioned in my 

previous email.)


The latest gene_association.rgd.gz file contains 15 associations for RGD

ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.  In 

fact, according to both Ensembl and RGD, these 

two identifiers both refer to the same entity (transforming acidic 

coiled-coil containing protein 3, aka Tacc3).  Hence, the file uses two 

names for the same thing.  Why?


The reason why I bring this problem up is that, in our work, we compute 

statistics that are very sensitive to how many genes have a particular 

GO attribute, therefore it is crucial for us to count the associations 

in this example as being 19 belonging to the same protein, rather than 

15 belonging to one and 4 belonging to another.  This accounting task is

made significantly more difficult by the fact that the association file 

uses two different names for the same thing.
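
To make the accounting problem concrete, here is a rough sketch of counting distinct entities per GO term after collapsing synonymous identifiers. It assumes an alias table is already available (for instance from an ID-mapping tool). The names ALIASES, canonical and entities_per_term are hypothetical, and the GO IDs in the toy rows are placeholders rather than real Tacc3 annotations.

```python
from collections import defaultdict

# Hypothetical alias table: maps a secondary identifier to the identifier
# chosen as canonical. The single entry mirrors the Tacc3 example above.
ALIASES = {"ENSEMBL:ENSRNOP00000034933": "RGD:1302948"}


def canonical(db, obj_id, aliases=ALIASES):
    """Return the canonical 'DB:ID' key for an annotated object."""
    key = f"{db}:{obj_id}"
    return aliases.get(key, key)


def entities_per_term(rows, aliases=ALIASES):
    """Map each GO ID to the set of distinct canonical entities annotated to it."""
    by_term = defaultdict(set)
    for db, obj_id, go_id in rows:
        by_term[go_id].add(canonical(db, obj_id, aliases))
    return by_term


# Toy (db, object id, GO id) triples; the GO IDs are placeholders.
ROWS = [
    ("RGD", "1302948", "GO:0000001"),
    ("ENSEMBL", "ENSRNOP00000034933", "GO:0000001"),
    ("ENSEMBL", "ENSRNOP00000034933", "GO:0000002"),
]

COUNTS = {term: len(ents) for term, ents in entities_per_term(ROWS).items()}
NAIVE = {term: len(ents) for term, ents in entities_per_term(ROWS, aliases={}).items()}
```

Without the alias table, the first placeholder term appears to annotate two entities. With it, both rows collapse onto RGD:1302948 and the term is counted once.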


Maybe I'm wrong here, but this looks to me like a bug rather than a 

feature:  I can't see that any good could come of using multiple names 

for the same thing in a document like this.


If it is indeed a bug, would it be too difficult to fix?  I.e. would it 

be too difficult for GO and the purveyors of associations files to use a

consistent nomenclature whenever possible?


If it's of any help with this, we have a tool, called Synergizer, for 

bulk mapping of identifiers from one namespace to another, and it is a 

simple matter to set up a pipeline to do it automatically.  We'd be happy to help 

with this in any way we can.  (Although I imagine that the organizations

that generate such associations files are the ultimate experts for 

resolving such nomenclature issues.)


Also, as I said earlier, the example above is not isolated.  For R. 

norvegicus alone there are about 1000, and that's only focusing on RGD 

vs. ENSEMBL IDs.  And the problem is not limited to R. norvegicus. 

Among the organisms that I have analyzed, I found similar 

nomenclature inconsistencies with several others, including B. taurus, 

G. gallus, C. elegans, and H. sapiens.


Thanks for your comments!


Gabriel Berriz


Gabriel F. Berriz, PhD

Bioinformatics Developer

Roth Lab

Biological Chemistry and Molecular Pharmacology -- Harvard Medical

Seeley G. Mudd Building 322B

Boston, MA 02115-5701

Telephone: 617.432.3555

Fax: 617.432.3557




