[go-helpdesk] Fwd: [go-software] Fwd: Term enrichment tool

Seth Carbon sjcarbon at lbl.gov
Wed Jun 6 13:53:12 PDT 2012


TE People,

There are caveats and details that I don't normally want to bore people
with when answering user questions, but in the case of these term
enrichment tool questions there are a few parallel threads that run
along the same topics and not everybody is getting all of the different
emails, so I'd like to go into a little more detail about term
enrichment and search in AmiGO 1.8 and try to cover the gaps for people
on all the threads.

In AmiGO, when you enter a gene product identifier into the search box,
it goes through all of the different places in the database that you
might find the identifier, like:

 gene_product.full_name
 gene_product.symbol
 gene_product_synonym
 dbxref.xref_key
 dbxref.xref_dbname/dbxref.xref_key

It then (a little hazy on this bit) sorts and ranks them, and if it only
finds one it will take you straight to that page. In the case that the
search mechanism finds something likely in the dbxref table, it still
does not know exactly what it has, so it starts doing different database
joins to try and figure it out--a gene_product acc, a sequence dbxref,
evidence for an association, etc. If it links back to a gene_product,
AmiGO would then count that as a hit.
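
The lookup order described above can be sketched roughly as follows. This is
only an illustration, not AmiGO's actual code: the dictionaries stand in for
database tables, every row and identifier in them is invented, and resolve()
is a hypothetical name. It counts how many lookups an identifier needs before
it links back to a gene product.

```python
# Toy stand-ins for database tables; all names and rows are invented.
GENE_PRODUCTS = {
    1: {"symbol": "abc1", "full_name": "example protein 1", "acc": "DB:0001"},
    2: {"symbol": "xyz2", "full_name": "example protein 2", "acc": "DB:0002"},
}
SYNONYMS = {"abc-one": 1}
# A sequence dbxref points at a sequence, which belongs to a gene product.
SEQ_DBXREFS = {"OTHERDB:S42": "seq42"}
SEQ_TO_GENE_PRODUCT = {"seq42": 2}


def resolve(query):
    """Return (gene_product_id or None, number_of_lookups_used)."""
    lookups = 0
    # Fields stored directly with the gene product: one lookup each.
    for field in ("symbol", "full_name", "acc"):
        lookups += 1
        for gp_id, row in GENE_PRODUCTS.items():
            if row[field] == query:
                return gp_id, lookups
    lookups += 1
    if query in SYNONYMS:
        return SYNONYMS[query], lookups
    # A dbxref of unknown kind: extra hops to walk back to a gene product.
    lookups += 1
    seq = SEQ_DBXREFS.get(query)
    if seq is not None:
        lookups += 1
        gp_id = SEQ_TO_GENE_PRODUCT.get(seq)
        if gp_id is not None:
            return gp_id, lookups
    return None, lookups
```

In this toy data a symbol resolves in one lookup, while the sequence dbxref
"OTHERDB:S42" needs six before it reaches a gene product.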

For every additional step above, or additional ambiguity in the query,
more time is necessary to resolve the identifier. Taking a little time
isn't really a problem when trying to resolve a single identifier, but
when you are trying to do things in bulk, additional steps start causing
problems.

Some people will remember that the term enrichment tool used to time out
on people all the time. It turned out that one of the biggest hits to
its performance was trying to resolve very large numbers of possibly
ambiguous identifiers. Unfortunately, the way around the problem with
the hardware and database we had was to limit how much time the TE tool
was willing to spend on an identifier before giving up (it's actually
more complicated with handling of duplicates, keeping track of what did
and didn't work, etc.). In the end, the TE tool checks only for
gene_product symbol, synonym, and acc (gene_product dbxref), with
everything else being ignored for the sake of time.
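
As a rough sketch of that trade-off (again, not the TE tool's real code), the
function below checks only symbol, synonym, and acc, skips duplicates, and
keeps track of what it could not resolve. The tables, their contents, and the
name resolve_bulk() are all made up for illustration, and the per-identifier
time limit is not modeled; only the narrowed set of fields is.

```python
# Toy lookup tables; every identifier here is invented for illustration.
SYMBOLS = {"abc1": 1, "xyz2": 2}
SYNONYMS = {"abc-one": 1}
ACCS = {"DB:0001": 1, "DB:0002": 2}
FAST_TABLES = (SYMBOLS, SYNONYMS, ACCS)


def resolve_bulk(identifiers):
    """Resolve identifiers against the fast tables only.

    Returns (found, missed): found maps identifier -> gene product id,
    missed lists the unresolved identifiers in input order.
    """
    found, missed, seen = {}, [], set()
    for ident in identifiers:
        if ident in seen:
            continue  # duplicate input, already handled
        seen.add(ident)
        for table in FAST_TABLES:
            if ident in table:
                found[ident] = table[ident]
                break
        else:
            # Not a symbol, synonym, or acc: no slower joins are attempted.
            missed.append(ident)
    return found, missed
```

An identifier that only exists as some other kind of dbxref ends up in the
missed list here, even though a slower, more thorough search could find it.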

This should answer the question: if you're able to get to a page in
AmiGO using an identifier, should the TE tool be able to find it too? No,
not necessarily. However, if users are willing to take the step of
"pre-resolving" the identifiers for AmiGO using one of the mapping tools
or files (http://wiki.geneontology.org/index.php/Tools_ID_Mapping), the
TE tool may be able to process the input with what it is able to access
quickly.
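
A pre-resolving step could look something like the sketch below. It assumes a
simple two-column, tab-separated mapping (external identifier, then the
identifier to submit); that layout is an assumption for illustration and not
the documented format of any particular mapping file, and load_mapping() and
pre_resolve() are hypothetical helper names.

```python
def load_mapping(lines):
    """Build {external_id: mapped_id} from tab-separated lines.

    Blank lines, lines starting with '#', and lines with fewer than
    two columns are skipped. The two-column layout is an assumption.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        mapping[parts[0]] = parts[1]
    return mapping


def pre_resolve(identifiers, mapping):
    """Swap in the mapped identifier where one exists; keep the rest as-is."""
    return [mapping.get(ident, ident) for ident in identifiers]
```

Usage would be along the lines of reading the mapping file with open(), passing
the file object to load_mapping(), and submitting the output of pre_resolve()
to the tool instead of the original list.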

In the specific case of Christian, 11 of the 14 identifiers are in the
database, and they only exist as GermOnline dbxrefs (i.e. they are
not symbols, full_names, or synonyms). Probing a bit more, they are not
gene_product acc dbxrefs (which means that they will not be found by the
TE tool), but look to be sequence dbxrefs; and a couple more steps would
bring us to the gene_product.

For people who emailed me with specific questions on this topic and
weren't answered with the above, let me know and we can finish in the
originating thread.

Cheers,

-Seth

