[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
cherry at stanford.edu
Tue Sep 9 12:44:26 PDT 2008
I wouldn't say this is a bug. The 1302948 ID is used by RGD when the
annotations have been created by the RGD project. Those annotations
that have the ENSEMBL ID ENSRNOP00000034933 have been created by
ENSEMBL. RGD is just passing the ENSEMBL annotations through in their file.
The gene association file is created by RGD. While some groups do map
all the external IDs to internal IDs, this is not done by all groups.
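For readers who need a single identifier per entity, here is a minimal Python sketch of doing that mapping on the consumer side. It assumes you already have a table of equivalent IDs; the only pair shown is the one from this thread. The names `ID_MAP`, `canonical`, and `go_terms_per_object` are hypothetical, and the GO IDs in the usage lines are placeholders rather than real Tacc3 annotations.

```python
from collections import defaultdict

# Equivalence table keyed by (database, object ID). The single entry is
# the pair discussed in this thread; a real table would come from a
# mapping resource of your choice.
ID_MAP = {("ENSEMBL", "ENSRNOP00000034933"): ("RGD", "1302948")}


def canonical(db, object_id, id_map=ID_MAP):
    """Return the preferred (db, object_id), or the input if unmapped."""
    return id_map.get((db, object_id), (db, object_id))


def go_terms_per_object(records, id_map=ID_MAP):
    """Collect the distinct GO IDs attached to each canonical object."""
    terms = defaultdict(set)
    for db, object_id, go_id in records:
        terms[canonical(db, object_id, id_map)].add(go_id)
    return dict(terms)


# Placeholder records: (database, object ID, GO ID).
records = [
    ("RGD", "1302948", "GO:0000001"),
    ("ENSEMBL", "ENSRNOP00000034933", "GO:0000001"),
    ("ENSEMBL", "ENSRNOP00000034933", "GO:0000002"),
]
merged = go_terms_per_object(records)
```

Using a set per object means a GO term reported under both identifiers is counted once after merging, which is usually what a per-gene statistic wants.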
One suggestion for your example is to filter out the IEA annotations.
That would remove the ENSEMBL associations for this example. You
would likely want to do that anyway, or at least compare your
statistics with and without the computationally defined annotations.
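A rough Python sketch of that comparison follows. It assumes the standard tab-delimited gene association layout, where comment lines start with `!`, the object ID is column 2, the GO ID is column 5, and the evidence code is column 7. The function names and the sample rows are illustrative only; the GO IDs and reference field are placeholders, not real annotations.

```python
import csv
import io
from collections import Counter

EVIDENCE_COL = 6  # 0-based index of the evidence code (column 7)


def read_associations(handle):
    """Yield (db, object_id, go_id, evidence) for each annotation row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip comment lines, blank lines, and rows that are too short.
        if len(row) <= EVIDENCE_COL or row[0].startswith("!"):
            continue
        yield row[0], row[1], row[4], row[EVIDENCE_COL]


def counts_per_object(records, skip_iea=False):
    """Count annotations per (db, object_id), optionally dropping IEA."""
    counts = Counter()
    for db, object_id, _go_id, evidence in records:
        if skip_iea and evidence == "IEA":
            continue
        counts[(db, object_id)] += 1
    return counts


def make_row(db, object_id, go_id, evidence):
    """Build one placeholder row with only the first seven columns."""
    return "\t".join([db, object_id, "Tacc3", "", go_id, "REF:0", evidence])


SAMPLE = "\n".join([
    "!placeholder header line",
    make_row("RGD", "1302948", "GO:0000001", "IDA"),
    make_row("ENSEMBL", "ENSRNOP00000034933", "GO:0000001", "IEA"),
]) + "\n"

records = list(read_associations(io.StringIO(SAMPLE)))
with_iea = counts_per_object(records)
without_iea = counts_per_object(records, skip_iea=True)
```

For a real file you would pass an open handle (for example from `gzip.open(path, "rt")`) instead of the `io.StringIO` sample, then compare the two counters.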
On Sep 9, 2008, at 10:22 AM, Gabriel Berriz wrote:
> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
>> The gene association files are non-redundant. Primary model
>> have responsibility for integrating annotations from multiple sources
>> and submitting a non-redundant file to the GOdb. QC checks on the
>> also remove redundancies.
> Hi, Judy. My word choice was not a very good one when I wrote of
> "redundancies", so let me give an example of what I meant. It comes
> from the latest gene_association.rgd.gz file. (This example is the
> first one I followed up on of the 1000 or so that I mentioned in my
> previous email.)
> The latest gene_association.rgd.gz file contains 15 associations for
> RGD ID 1302948, and 4 associations for ENSEMBL ID
> ENSRNOP00000034933. In fact, according to both Ensembl and RGD
> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948), these
> two identifiers both refer to the same entity (transforming
> acidic coiled-coil containing protein 3, aka Tacc3). Hence, the
> file uses two names for the same thing. Why?
> The reason why I bring this problem up is that, in our work, we
> compute statistics that are very sensitive to how many genes have a
> particular GO attribute, therefore it is crucial for us to count the
> associations in this example as being 19 belonging to the same
> protein, rather than 15 belonging to one and 4 belonging to
> another. This accounting task is made significantly more difficult
> by the fact that the association file uses two different names for
> the same thing.
> Maybe I'm wrong here, but this looks to me like a bug rather than a
> feature: I can't see that any good could come of using multiple
> names for the same thing in a document like this.
> If it is indeed a bug, would it be too difficult to fix? I.e. would
> it be too difficult for GO and the purveyors of associations files
> to use a consistent nomenclature whenever possible?
> If it's of any help with this, we have a tool, called Synergizer,
> for bulk mapping of identifiers from one namespace to another, and
> it is a simple matter to set up a pipeline to do it automatically
> (see http://llama.med.harvard.edu/synergizer/doc). We'd be happy to
> help with this in any way we can. (Although I imagine that the
> organizations that generate such associations files are the ultimate
> experts for resolving such nomenclature issues.)
> Also, as I said earlier, the example above is not isolated. For R.
> norvegicus alone there are about 1000, and that's only focusing on
> RGD vs. ENSEMBL IDs. And the problem is not limited to R.
> norvegicus. Among the organisms that I have analyzed, I found
> similar nomenclature inconsistencies with several others, including
> B. taurus, G. gallus, C. elegans, and H. sapiens.
> Thanks for your comments!
> Gabriel Berriz
> Gabriel F. Berriz, PhD
> Bioinformatics Developer
> Roth Lab
> Biological Chemistry and Molecular Pharmacology -- Harvard Medical
> Seeley G. Mudd Building 322B
> Boston, MA 02115-5701
> Telephone: 617.432.3555
> Fax: 617.432.3557