[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt

Gabriel Berriz gberriz at hms.harvard.edu
Tue Sep 9 10:22:49 PDT 2008


On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
> Gabriel,
>
> The gene association files are non-redundant.  Primary model organisms
> have responsibility for integrating annotations from multiple sources
> and submitting a non-redundant file to the GOdb.  QC checks on the files
> also remove redundancies.



Hi, Judy.  My word choice was not a very good one when I wrote of  
"redundancies", so let me give an example of what I meant.  It comes  
from the latest gene_association.rgd.gz file.  (This example is the  
first one I followed up on of the 1000 or so that I mentioned in my  
previous email.)

The latest gene_association.rgd.gz file contains 15 associations for  
RGD ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.   
In fact, according to both Ensembl and RGD
(http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948),
these two identifiers both refer to the same entity (transforming  
acidic coiled-coil containing protein 3, aka Tacc3).  Hence, the file  
uses two names for the same thing.  Why?

I bring this problem up because, in our work, we compute statistics  
that are very sensitive to how many genes have a particular GO  
attribute.  It is therefore crucial for us to count the associations in  
this example as 19 belonging to a single protein, rather than 15  
belonging to one and 4 belonging to another.  This accounting task is  
made significantly more difficult by the fact that the association file  
uses two different names for the same thing.
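
A minimal sketch of the accounting described above, assuming the standard tab-separated gene association layout (database in column 1, object ID in column 2, comment lines starting with "!").  The ALIASES table and the function names are hypothetical; in practice the alias table would have to be built from an ID-mapping resource rather than written by hand.

```python
from collections import Counter

# Hypothetical alias table: maps an alternative (db, id) pair to the
# (db, id) pair chosen as canonical.  The single entry is the example
# from this message; a real table would come from an ID-mapping resource.
ALIASES = {
    ("ENSEMBL", "ENSRNOP00000034933"): ("RGD", "1302948"),
}


def canonical(db, object_id, aliases=ALIASES):
    """Return the canonical (db, id) pair, or the input pair if unmapped."""
    key = (db, object_id)
    return aliases.get(key, key)


def count_associations(lines, aliases=ALIASES):
    """Count association rows per canonical identifier.

    `lines` is any iterable of text lines from a gene association file.
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # malformed row: no object ID column
        counts[canonical(fields[0], fields[1], aliases)] += 1
    return counts
```

With this alias table, 15 rows for RGD 1302948 and 4 rows for ENSEMBL ENSRNOP00000034933 are tallied as 19 under the single key ("RGD", "1302948").  Which identifier is treated as canonical is an arbitrary choice here.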

Maybe I'm wrong here, but this looks to me like a bug rather than a  
feature:  I can't see that any good could come of using multiple names  
for the same thing in a document like this.

If it is indeed a bug, would it be too difficult to fix?  I.e., would  
it be too difficult for GO and the purveyors of association files to  
use a consistent nomenclature whenever possible?

If it's of any help with this, we have a tool, called Synergizer, for  
bulk mapping of identifiers from one namespace to another, and it is a  
simple matter to set up a pipeline to do it automatically (see  
http://llama.med.harvard.edu/synergizer/doc).  We'd be happy to help  
with this in any way we can.  (Although I imagine that the organizations  
that generate such association files are the ultimate experts for  
resolving such nomenclature issues.)

Also, as I said earlier, the example above is not isolated.  For R.  
norvegicus alone there are about 1000, and that's only focusing on RGD  
vs. ENSEMBL IDs.  And the problem is not limited to R. norvegicus.   
Among the organisms that I have analyzed, I found similar  
nomenclature inconsistencies with several others, including B. taurus,  
G. gallus, C. elegans, and H. sapiens.

Thanks for your comments!

Gabriel Berriz
=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical  
School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557


