text mining

andrew brian clegg a.clegg at
Thu May 16 07:37:06 PDT 2002

On Thu, 16 May 2002, Ramneek Gupta wrote:

> Perhaps one of the things you're aiming to build is a
> substitution matrix for ontology terms (using e.g.
> evolutionarily related or sequence related proteins).
> Of course, assuming that this isn't the way GO terms
> are associated to the proteins in the first place..

Substitution matrix in the sense of PAM or BLOSUM? Not what I had in mind,
although one of the first steps in indexing the document set will be to
create a term matrix -- it'll be more of a simple distance matrix though,
where term t and term s are a distance of d units apart, with d
calculated from the number of edges between t and s in the GO graph.
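
As a rough illustration of the edge-count distance described above, here is
a minimal sketch using a breadth-first search. The toy graph, the term names
and the helper names (PARENTS, build_undirected, edge_distance) are all
assumptions made for the example; they are not real GO identifiers or part
of any existing tool.

```python
from collections import deque

# Hypothetical toy fragment of an ontology: child -> list of parents.
# The term names are illustrative, not real GO identifiers.
PARENTS = {
    "ribosomal chaperone": ["chaperone"],
    "chaperone": ["molecular function"],
    "kinase": ["molecular function"],
    "molecular function": [],
}

def build_undirected(parents):
    """Turn the child -> parents map into an undirected adjacency map."""
    adj = {term: set() for term in parents}
    for child, ps in parents.items():
        for p in ps:
            adj[child].add(p)
            adj.setdefault(p, set()).add(child)
    return adj

def edge_distance(adj, t, s):
    """Number of edges on the shortest path between t and s, or None."""
    if t == s:
        return 0
    seen = {t}
    queue = deque([(t, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in adj[node]:
            if nxt == s:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # t and s are not connected

ADJ = build_undirected(PARENTS)
```

Calling edge_distance(ADJ, t, s) for every pair of terms would fill in the
distance matrix, although for the full ontology it would probably be cheaper
to compute distances lazily or only up to some maximum depth.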

I'd weight d according to the nature of the relationship -- allowing me to
weight is-a and part-of relations separately -- and according to the
direction of the traversal... e.g. going from a node to a parent node will
impose a fairly low penalty, because every document about 'ribosomal
chaperones' is about 'chaperones', but going the other way will cost more,
since not every document about 'chaperones' is about 'ribosomal chaperones'.
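
One way to sketch the relation- and direction-dependent weighting is a
shortest-path search over a directed graph in which each ontology edge
appears twice, once per direction, with its own penalty. The penalty values
in COST, the edge list and the function names below are assumptions picked
purely for illustration, and Dijkstra's algorithm is just one reasonable
choice for finding the cheapest path.

```python
import heapq

# Hypothetical edges: (child, parent, relation). Names are illustrative only.
EDGES = [
    ("ribosomal chaperone", "chaperone", "is-a"),
    ("chaperone", "molecular function", "is-a"),
    ("ribosome subunit", "ribosome", "part-of"),
]

# Assumed penalties, chosen only for illustration: moving up to a parent is
# cheap, moving down to a child costs more, and each relation has its own pair.
COST = {
    ("is-a", "up"): 1.0,
    ("is-a", "down"): 3.0,
    ("part-of", "up"): 2.0,
    ("part-of", "down"): 4.0,
}

def build_graph(edges, cost):
    """Directed adjacency list with one weighted arc per traversal direction."""
    graph = {}
    for child, parent, rel in edges:
        graph.setdefault(child, []).append((parent, cost[(rel, "up")]))
        graph.setdefault(parent, []).append((child, cost[(rel, "down")]))
    return graph

def weighted_distance(graph, t, s):
    """Dijkstra's algorithm: cheapest total penalty from t to s, or None."""
    best = {t: 0.0}
    heap = [(0.0, t)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == s:
            return d
        if d > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None  # s is unreachable from t

GRAPH = build_graph(EDGES, COST)
```

Here t is the term found in the document and s is the search term, so the
distance from 'ribosomal chaperone' to 'chaperone' comes out lower than the
distance in the opposite direction.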

It might be possible to adjust this weighting dynamically based on the
number and arrangement of terms in the ontology (e.g. number of sibling
terms for the term being indexed) and the relative frequency of terms in
the document set. Or at least re-adjust them at each re-indexing.
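
To make the sibling-count idea concrete, here is one possible rule, sketched
under heavy assumptions: the parent-to-child penalty is scaled by
1 + log2(1 + number of siblings), so a term with many siblings is costlier
to reach from its parent. The formula, the base value and the toy terms are
all invented for this example; term frequencies in the document set could
be folded in along similar lines.

```python
import math

# Hypothetical child -> parent map; term names are illustrative only.
PARENT_OF = {
    "ribosomal chaperone": "chaperone",
    "mitochondrial chaperone": "chaperone",
    "histone chaperone": "chaperone",
    "nuclear chaperone": "chaperone",
    "chaperone": "molecular function",
}

def sibling_count(parent_of, term):
    """Number of other terms sharing the same parent as `term`."""
    parent = parent_of.get(term)
    if parent is None:
        return 0
    return sum(1 for t, p in parent_of.items() if p == parent and t != term)

def down_penalty(parent_of, child, base=3.0):
    """Assumed rule: base penalty scaled by 1 + log2(1 + sibling count)."""
    return base * (1.0 + math.log2(1 + sibling_count(parent_of, child)))
```

Recomputing these penalties at each re-indexing would only need the current
ontology structure, so it should be cheap compared with indexing itself.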

Guess you'd want to put a cutoff for d in at some point, so that a
document only shows up in searches for s if it contains terms that are
closer than an arbitrary distance d' from s.
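
The cutoff could be applied at search time roughly as follows. This is only
a sketch: DIST stands in for whatever precomputed distance matrix the
indexer produces, DOCS for the document index, and the numbers and names
are made up for the example.

```python
# Hypothetical precomputed distances d(t, s), keyed as (document term, query term).
DIST = {
    ("ribosomal chaperone", "chaperone"): 1.0,
    ("chaperone", "chaperone"): 0.0,
    ("kinase", "chaperone"): 5.0,
}

# Hypothetical document index: document id -> terms it contains.
DOCS = {
    "doc1": {"ribosomal chaperone"},
    "doc2": {"kinase"},
    "doc3": {"chaperone", "kinase"},
}

def search(docs, dist, s, cutoff):
    """Return (doc id, closest distance) pairs for documents within the cutoff."""
    hits = []
    for doc_id, terms in docs.items():
        # Missing pairs are treated as infinitely far away.
        closest = min((dist.get((t, s), float("inf")) for t in terms),
                      default=float("inf"))
        if closest < cutoff:  # strictly closer than d'
            hits.append((doc_id, closest))
    # Nearest documents first, ties broken by document id.
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))
```

Sorting by the closest distance also gives a crude relevance ranking for
free, with exact matches (distance 0) at the top.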

A possible way to do it would be to model some multidimensional 'term
space', where each GO term has a location in term space, and so does each
document -- where the document's location would depend on the terms it
contains. Haven't figured out how the maths would work for this though, or
if it would be possible to do it without having to specify 12,000
co-ordinates for each document (one for each term in the ontology). Then
also you'd have the problem of documents covering disjoint, distant
topics; do you give them an 'average' location in term space or several?

etc. etc. etc.
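
On the question of avoiding 12,000 co-ordinates per document: one standard
approach, swapped in here as an assumption rather than anything worked out
above, is a sparse vector-space model where each document stores only the
terms it actually contains. The sketch below uses raw term counts and
cosine similarity, and it ignores the GO graph distances entirely; the
function names are hypothetical.

```python
import math
from collections import Counter

def doc_vector(terms):
    """Sparse location in term space: only terms that occur get a coordinate."""
    return Counter(terms)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A document covering two distant topics would still get a single vector
here, which amounts to the 'average' location option; giving it several
locations would mean storing one such vector per topic cluster instead.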



This message is from the GOFriends moderated mailing list.  A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list?           E-mail: owner-gofriends at
Subscribing   send   "subscribe"   to   gofriends-request at
Unsubscribing send   "unsubscribe"  to  gofriends-request at
