[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
Gabriel Berriz
gberriz at hms.harvard.edu
Tue Sep 9 10:22:49 PDT 2008
On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
> Gabriel,
>
> The gene association files are non-redundant. Primary model organisms
> have responsibility for integrating annotations from multiple sources
> and submitting a non-redundant file to the GOdb. QC checks on the files
> also remove redundancies.
Hi, Judy. My word choice was not a very good one when I wrote of
"redundancies", so let me give an example of what I meant. It comes
from the latest gene_association.rgd.gz file. (This example is the
first one I followed up on of the 1000 or so that I mentioned in my
previous email.)
The latest gene_association.rgd.gz file contains 15 associations for
RGD ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.
In fact, according to both Ensembl and RGD
(http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948), these two
identifiers both refer to the same entity (transforming acidic
coiled-coil containing protein 3, aka Tacc3). Hence, the file uses two
names for the same thing. Why?
I bring this problem up because, in our work, we compute statistics
that are very sensitive to how many genes have a particular GO
attribute. It is therefore crucial for us to count the associations in
this example as 19 belonging to one protein, rather than 15 belonging
to one protein and 4 belonging to another.
This accounting task is made significantly more difficult by the fact
that the association file uses two different names for the same thing.
Maybe I'm wrong here, but this looks to me like a bug rather than a
feature: I can't see that any good could come of using multiple names
for the same thing in a document like this.
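To make the accounting concrete, here is a minimal Python sketch of the merge described above. It assumes the tab-separated gene association layout in which column 1 is the source database and column 2 is the object ID, with comment lines starting with "!". The alias table, the function names, and the "ENSEMBL" database label are hypothetical and chosen for illustration; a real alias table would come from an identifier-mapping resource.

```python
from collections import Counter

# Hypothetical alias table: maps an alternate (db, id) pair to the pair
# chosen as canonical. Only the one pair from the example is listed.
ALIASES = {
    ("ENSEMBL", "ENSRNOP00000034933"): ("RGD", "1302948"),
}


def canonical(db, object_id, aliases=ALIASES):
    """Return the canonical (db, id) pair, or the pair itself if unmapped."""
    return aliases.get((db, object_id), (db, object_id))


def count_associations(lines, aliases=ALIASES):
    """Count association rows per canonical identifier.

    Assumes column 1 is the database and column 2 is the object ID;
    blank lines and lines starting with '!' are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # malformed row: not enough columns
        counts[canonical(fields[0], fields[1], aliases)] += 1
    return counts


if __name__ == "__main__":
    # Synthetic rows standing in for the 15 + 4 associations in the example.
    rows = ["!comment line"]
    rows += ["RGD\t1302948\tTacc3"] * 15
    rows += ["ENSEMBL\tENSRNOP00000034933\tTacc3"] * 4
    print(count_associations(rows))
```

Without the alias table the same rows yield two separate entries, which is exactly the miscount that worries us.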
If it is indeed a bug, would it be too difficult to fix? I.e., would
it be too difficult for GO and the purveyors of association files to
use a consistent nomenclature whenever possible?
If it's of any help with this, we have a tool, called Synergizer, for
bulk mapping of identifiers from one namespace to another, and it is a
simple matter to set up a pipeline to do it automatically (see
http://llama.med.harvard.edu/synergizer/doc). We'd be happy to help
with this in any way we can. (Although I imagine that the organizations
that generate such association files are the ultimate experts for
resolving such nomenclature issues.)
Also, as I said earlier, the example above is not isolated. For R.
norvegicus alone there are about 1000 such cases, and that's only
focusing on RGD vs. ENSEMBL IDs. And the problem is not limited to R.
norvegicus. Among the organisms that I have analyzed, I found similar
nomenclature inconsistencies in several others, including B. taurus,
G. gallus, C. elegans, and H. sapiens.
Thanks for your comments!
Gabriel Berriz
=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical
School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557